SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor generalization of deep learning vision models under real-world image acquisition perturbations, particularly the implicit degradation induced by camera parameters (shutter speed, ISO, aperture) and illumination conditions. To this end, the authors introduce SNAP, the first controllable imaging benchmark spanning image classification, object detection, and visual question answering (VQA); propose a capture-condition sensitivity analysis framework; establish a human performance baseline for VQA; and design a unified cross-task evaluation protocol. Experiments reveal: (1) prevalent capture bias in mainstream datasets; (2) substantial performance drops under even minor parameter variations; and (3) state-of-the-art models that still significantly underperform humans on VQA even for well-exposed images. This study is the first to systematically quantify how the imaging pipeline constrains the robustness of vision models, providing both a physics-aware benchmark and an analytical paradigm for developing imaging-robust visual models.

📝 Abstract
Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks -- image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for **S**hutter speed, ISO se**N**sitivity, and **AP**erture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, the models trained on this data do not reach human accuracy even on the well-exposed images, and are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at https://github.com/ykotseruba/SNAP
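The three camera parameters SNAP varies (shutter speed, ISO, aperture) jointly determine image exposure. As a rough illustration of how they combine, here is the standard photographic exposure-value relation, EV₁₀₀ = log₂(N²/t) − log₂(ISO/100); this formula is general photography background, not an equation taken from the paper:

```python
import math

def exposure_value(aperture_f, shutter_s, iso=100):
    """Exposure value referred to ISO 100.

    aperture_f: f-number N (e.g. 16 for f/16)
    shutter_s:  exposure time t in seconds (e.g. 1/125)
    iso:        sensor sensitivity
    """
    return math.log2(aperture_f ** 2 / shutter_s) - math.log2(iso / 100)

# "Sunny 16" rule: f/16 at 1/125 s, ISO 100 -> EV ~ 15
print(round(exposure_value(16, 1 / 125), 2))  # -> 14.97
```

Doubling the ISO at fixed aperture and shutter lowers the ISO-100-referred EV by exactly one stop, which is why small coordinated changes to these settings can leave overall exposure unchanged while still altering noise, blur, and depth of field.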
Problem

Research questions and friction points this paper is trying to address.

Analyzing the impact of capture conditions on DL vision model performance
Assessing capture bias in common vision datasets
Evaluating model susceptibility to variations in camera settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the impact of capture conditions on three vision tasks
Creates the SNAP benchmark with controlled lighting and densely sampled camera settings
Evaluates DL models' susceptibility to capture variations
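One way to see why "minute variations of camera settings" are a meaningful robustness axis: many different (aperture, shutter, ISO) triples produce nearly the same overall exposure, so a capture-robust model should behave similarly across them even though noise, motion blur, and depth of field differ. The sketch below enumerates such near-equivalent triples over an illustrative grid; the grids, target EV, and tolerance are hypothetical choices, not SNAP's actual sampling scheme:

```python
import itertools
import math

def ev100(aperture_f, shutter_s, iso):
    # Standard exposure value referred to ISO 100
    return math.log2(aperture_f ** 2 / shutter_s) - math.log2(iso / 100)

# Illustrative setting grids (SNAP samples settings far more densely)
shutters = [1 / 15, 1 / 30, 1 / 60, 1 / 125, 1 / 250, 1 / 500]
isos = [100, 200, 400, 800]
apertures = [2.8, 4, 5.6, 8, 11, 16]

target_ev = 12.0  # an arbitrary "well-exposed" target for this sketch
equivalent = [
    (f, t, s)
    for f, t, s in itertools.product(apertures, shutters, isos)
    if abs(ev100(f, t, s) - target_ev) < 0.1
]

# Each triple exposes the scene almost identically, yet the resulting pixel
# statistics differ: higher ISO adds noise, slower shutter adds blur,
# wider aperture narrows depth of field.
for f, t, s in equivalent:
    print(f"f/{f}, {t:.4f} s, ISO {s}")
```

Sweeping a model's accuracy across such an exposure-equivalent set is one simple way to separate sensitivity to the settings themselves from sensitivity to overall brightness.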