🤖 AI Summary
This work addresses a limitation of existing search-and-rescue benchmarks: they evaluate exploration, interaction, and memory in isolation, and so fail to capture how failures compound across multi-stage tasks. The study formulates search-and-rescue as a compositional four-stage pipeline (multimodal exploration, target rescue, memory-guided return, and final handoff) and presents a photorealistic diagnostic benchmark with five difficulty levels and an automatic episode generation and annotation pipeline. Stage-level evaluation allows exploration and spatial-memory bottlenecks to be analyzed separately. In the experiments, none of the evaluated baselines, including topological vision-language navigation and map-based methods, completes the full pipeline at the highest difficulty level. Autonomous exploration is the dominant failure mode, and spatial memory is a second, independent bottleneck.
📝 Abstract
Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memories over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving it unclear how failures compound when the capabilities must be composed in realistic workflows. We introduce RescueBench, a photorealistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, and find that no baseline completes the full task at the highest difficulty level. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological vision-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench