Data-Centric Visual Development for Self-Driving Labs

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autonomous laboratory systems automate critical biological operations such as liquid handling, but high-precision bubble detection remains hindered by the scarcity of real negative samples and prohibitive annotation costs. To address this, we propose a human-in-the-loop, dual-track data-generation framework: (1) real-image acquisition via automated pipetting workflows with selective human verification, and (2) synthesis of high-fidelity virtual negative samples guided by reference conditions and textual prompts, followed by rigorous filtering and reliability validation to ensure dataset balance. This approach effectively mitigates the rare-event data-deficiency problem. A model trained on real-world data alone achieves 99.6% accuracy; when training is augmented with generated samples, accuracy remains at 99.4%. The framework substantially reduces data-collection and annotation overhead while improving model robustness and generalization across diverse experimental conditions.

📝 Abstract
Self-driving laboratories (SDLs) offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows of the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data, which are difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision-sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble-detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual-feedback data to SDL workflows and a practical solution to data scarcity in rare-event detection and broader vision tasks.
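The class-balancing step the abstract describes, topping up the scarce negative class with filtered synthetic samples before training, could be sketched as follows. This is a minimal illustration; every function and variable name here is an assumption for the sketch, not taken from the paper.

```python
import random

def build_balanced_dataset(real_pos, real_neg, synth_neg, seed=0):
    """Combine verified real images with filtered synthetic negatives
    so both classes are equally represented (illustrative sketch)."""
    rng = random.Random(seed)
    # Top up the scarce negative class with synthetic samples.
    deficit = max(0, len(real_pos) - len(real_neg))
    negatives = list(real_neg) + rng.sample(synth_neg, min(deficit, len(synth_neg)))
    # Label positives 1, negatives 0, then shuffle for training.
    dataset = [(x, 1) for x in real_pos] + [(x, 0) for x in negatives]
    rng.shuffle(dataset)
    return dataset
```

In this sketch, synthetic negatives are drawn only to fill the gap between classes, so real data always dominates wherever it is available.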
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity for training robust models in self-driving labs
Focuses on detecting rare events like pipetting bubbles with limited negative samples
Proposes a hybrid real-virtual data pipeline to reduce manual annotation effort
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid pipeline fuses real and virtual data generation
Human-in-the-loop scheme couples automated acquisition with verification
Reference-conditioned prompt-guided image generation augments real data
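The "selective human verification" idea above, auto-accepting confident model judgments and routing only ambiguous images to a human reviewer, could look roughly like this. The thresholds and names are illustrative assumptions, not values from the paper.

```python
def route_for_review(samples, score_fn, low=0.2, high=0.8):
    """Split automatically acquired samples into auto-labeled and
    human-review queues based on model confidence (illustrative sketch)."""
    auto_labeled, needs_review = [], []
    for sample in samples:
        p_bubble = score_fn(sample)  # model's probability that a bubble is present
        if p_bubble <= low:
            auto_labeled.append((sample, 0))   # confidently bubble-free
        elif p_bubble >= high:
            auto_labeled.append((sample, 1))   # confidently contains a bubble
        else:
            needs_review.append(sample)        # ambiguous: ask a human
    return auto_labeled, needs_review
```

Only the middle band of confidence scores reaches a human, which is what keeps the review load low while preserving label accuracy.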
Anbang Liu
The University of Hong Kong
Integer Linear Programming · Operations Research · Machine Learning · Manufacturing Systems
Guanzhong Hu
Department of Mechanical Engineering, Northwestern University
Jiayi Wang
Department of Computer Science, Northwestern University
Ping Guo
Department of Mechanical Engineering, Northwestern University
Han Liu
Department of Computer Science, Northwestern University