AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives

📅 2025-06-04
🤖 AI Summary
Problem: Existing open-vocabulary semantic segmentation (OVSS) methods generalize poorly across viewpoints and sensor modalities, and the field lacks a unified, realistic evaluation benchmark. Method: We introduce the first open-vocabulary segmentation benchmark that combines aerial and ground views with RGB and infrared imagery, enabling evaluation of zero-shot generalization across views and sensors. We propose a systematic robustness evaluation framework that quantitatively disentangles the effects of viewpoint disparity, modality shift, and text-vision alignment on zero-shot transfer. Built around foundation models (e.g., CLIP), the protocol integrates registered multi-source imagery and fine-grained annotations to support reproducibility and extensibility. Contribution/Results: Comprehensive experiments expose critical performance bottlenecks in state-of-the-art OVSS models and establish the first deployment-oriented benchmark for open-vocabulary segmentation in real-world scenarios, designed to advance perception for embodied intelligence.

📝 Abstract
Open-vocabulary semantic segmentation (OVSS) assigns a label to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, OVSS models encounter significant challenges in cross-domain generalization, which hinders their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities. In this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.
Problem

Research questions and friction points this paper is trying to address.

Challenges in cross-domain generalization for open-vocabulary semantic segmentation
Need for multi-angle segmentation across aerial and ground perspectives
Evaluating zero-shot transfer model performance in diverse conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-vocabulary segmentation using CLIP models
Multi-angle aerial-ground RGB-infrared benchmark
Robustness evaluation for zero-shot transfer
Aniruddh Sikdar
Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science
Machine learning, Deep learning, Computer Vision
Aditya Gandhamal
Predoctoral Researcher, Indian Institute of Science
Suresh Sundaram
Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, India; Department of Aerospace Engineering, Indian Institute of Science, Bengaluru, India