🤖 AI Summary
This work addresses the challenge of evaluating and generalizing high-precision, multimodal robotic manipulation in simulated biology laboratories. Methodologically, we introduce the first vision-language-action (VLA) benchmark tailored to professional scientific settings, built on a simulation environment that supports language-guided fine-grained manipulation. The environment integrates instrument digital twins, extended physics engines, physically based rendering (PBR), and dynamic GUI rendering, and features a diverse task suite spanning transparent-object manipulation, precision mechatronic control, and real-world experimental protocols. Key contributions include: (1) the first scientific-domain-oriented VLA evaluation framework; (2) an open-source, reproducible simulation platform and benchmark suite; and (3) an empirical analysis revealing critical limitations of current state-of-the-art VLA models in fine-grained action execution, visual reasoning, and precise instruction following, establishing a standardized assessment foundation for biologically grounded robotic automation research.
📝 Abstract
Vision-language-action (VLA) models have shown promise as generalist robotic policies by jointly leveraging visual, linguistic, and proprioceptive modalities to generate action trajectories. While recent benchmarks have advanced VLA research in domestic tasks, professional science-oriented domains remain underexplored. We introduce AutoBio, a simulation framework and benchmark designed to evaluate robotic automation in biology laboratory environments, an application domain that combines structured protocols with demanding precision and multimodal interaction. AutoBio extends existing simulation capabilities through a pipeline for digitizing real-world laboratory instruments, specialized physics plugins for mechanisms ubiquitous in laboratory workflows, and a rendering stack that supports dynamic instrument interfaces and transparent materials through physically based rendering. Our benchmark comprises biologically grounded tasks spanning three difficulty levels, enabling standardized evaluation of language-guided robotic manipulation in experimental protocols. We provide infrastructure for demonstration generation and seamless integration with VLA models. Baseline evaluations of two state-of-the-art VLA models reveal significant gaps in precision manipulation, visual reasoning, and instruction following in scientific workflows. By releasing AutoBio, we aim to catalyze research on generalist robotic systems for complex, high-precision, and multimodal professional environments. The simulator and benchmark are publicly available to facilitate reproducible research.
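To make the evaluation setting concrete, the sketch below shows the general shape of the language-conditioned rollout loop that benchmarks of this kind typically score: at each step the policy receives a camera image, proprioceptive state, and a natural-language instruction, and emits an action until the episode terminates. All names here (`Observation`, `Policy`, `evaluate`, the `env` interface) are illustrative assumptions, not the released AutoBio API; consult the published code for the actual interfaces.

```python
# Illustrative sketch of a language-conditioned VLA evaluation loop.
# NOTE: every class, method, and field name below is hypothetical --
# the released AutoBio code defines its own environment/policy interfaces.
from dataclasses import dataclass
from typing import Protocol

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray       # camera image, e.g. (H, W, 3) uint8
    proprio: np.ndarray   # joint positions / gripper state
    instruction: str      # natural-language task description


class Policy(Protocol):
    def act(self, obs: Observation) -> np.ndarray:
        """Map vision + language + proprioception to a robot action."""
        ...


def evaluate(env, policy: Policy, num_episodes: int = 50,
             max_steps: int = 500) -> float:
    """Return the success rate of `policy` over `num_episodes` rollouts."""
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()  # new task instance with a fresh instruction
        for _ in range(max_steps):
            action = policy.act(obs)
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / num_episodes
```

Under this (assumed) interface, reporting per-task success rate over a fixed number of episodes is what allows standardized comparison of different VLA models across the benchmark's difficulty levels.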