AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two limitations of existing image-based air quality estimation methods: weak cross-city generalization and limited fusion of multi-view information. The authors propose a Transformer-based multi-view fusion architecture that jointly uses front- and rear-view traffic images together with meteorological parameters. A weather-aware attention mechanism and multi-task learning make the model's cross-domain air quality classification more robust. The approach integrates dual-view imagery and supports few-shot adaptation to new cities. On a dataset of 26,678 image pairs, the method reaches 89.96% accuracy, 14.96% above the previous state of the art. On an independent dataset from Nagpur, India, it reaches 81.67% accuracy after few-shot adaptation with a limited number of samples, a modest drop of 8.29 percentage points.
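The summary mentions multi-task learning but does not name the tasks or say how their losses are combined. One common pattern is a weighted sum of per-task losses; the following is a minimal sketch under that assumption, not the authors' training objective. The task names (`aqi_class`, `aux_task`), the number of classes, the logits, and the weights are all hypothetical placeholders.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed via log-sum-exp."""
    peak = max(logits)  # subtract the max logit for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(
    task_logits: Dict[str, Sequence[float]],
    targets: Dict[str, int],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task cross-entropy losses (hypothetical weighting scheme)."""
    return sum(
        task_weights[name] * cross_entropy(logits, targets[name])
        for name, logits in task_logits.items()
    )


# Made-up example: a primary AQI-class head plus one hypothetical auxiliary head.
logits = {
    "aqi_class": [2.0, 0.5, -1.0],  # scores for three illustrative AQI categories
    "aux_task": [0.3, 0.1],         # placeholder auxiliary classification head
}
targets = {"aqi_class": 0, "aux_task": 1}
weights = {"aqi_class": 1.0, "aux_task": 0.3}

print("total loss:", multitask_loss(logits, targets, weights))
```

Giving the auxiliary task a smaller weight keeps the primary AQI classification loss dominant; how the actual model balances its tasks is not stated in the summary.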

📝 Abstract

Air pollution represents one of the most critical environmental and public health challenges globally, and traditional sensor-based monitoring systems face significant scalability and cost constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods generalize poorly across cities and make little use of multiple viewpoints. We present AQIFormer, a transformer-based ensemble architecture that addresses these limitations through dual-view integration, weather-aware attention mechanisms, and multi-task learning. Our approach combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Evaluation on a dataset of 26,678 synchronized front-rear image pairs yields 89.96% accuracy, a 14.96% improvement over state-of-the-art methods. Most importantly, the model generalizes across cities: with few-shot adaptation on a small number of training samples, it achieves 81.67% accuracy on an independent dataset collected in Nagpur, India, a drop of only 8.29 percentage points.
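The reported figures can be related with simple arithmetic. The 8.29 drop equals the difference between the two accuracies, so it is an absolute (percentage-point) decrease. The text does not say whether the 14.96% improvement is absolute or relative, so both readings of the implied baseline accuracy are shown as an illustration only.

```latex
\begin{aligned}
\Delta_{\text{abs}} &= 89.96 - 81.67 = 8.29 \text{ percentage points},\\
\Delta_{\text{rel}} &= \frac{8.29}{89.96} \approx 0.092 \quad (\approx 9.2\%\ \text{relative drop}),\\
\text{baseline (absolute reading)} &= 89.96 - 14.96 = 75.00\%,\\
\text{baseline (relative reading)} &= \frac{89.96}{1.1496} \approx 78.25\%.
\end{aligned}
```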

Problem

Research questions and friction points this paper is trying to address.

cross-city generalization

air quality classification

multi-view perspectives

image-based air quality estimation

atmospheric pollutants

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based architecture

multi-view integration

weather-aware attention
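The weather-aware attention listed above is not specified in detail here. The following is a minimal sketch of one way such a mechanism could work, assuming each view (front, rear) has already been encoded into a single embedding vector and that the meteorological parameters form a small numeric vector used as the attention query. The function names, projection matrices, feature meanings, and toy values are hypothetical illustrations, not the authors' implementation; a real Transformer would use learned, multi-head projections over many tokens.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    # Multiply a matrix (given as a list of rows) by a vector.
    return [dot(row, v) for row in m]


def softmax(scores: Sequence[float]) -> Vector:
    peak = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weather_aware_fusion(
    view_embeddings: Sequence[Sequence[float]],
    weather: Sequence[float],
    w_query: Matrix,
    w_key: Matrix,
) -> Tuple[Vector, Vector]:
    """Hypothetical: pool per-view embeddings with weather-conditioned attention."""
    # The weather vector, projected into embedding space, acts as the query.
    query = matvec(w_query, weather)
    # Each view embedding is projected into a key.
    keys = [matvec(w_key, emb) for emb in view_embeddings]
    # Scaled dot-product scores, one per view, normalized with softmax.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    # Fused representation: attention-weighted sum of the view embeddings.
    dim = len(view_embeddings[0])
    fused = [sum(w * emb[i] for w, emb in zip(weights, view_embeddings)) for i in range(dim)]
    return fused, weights


# Toy example with made-up values: 2-d view embeddings, 3 weather features.
front = [1.0, 0.0]
rear = [0.0, 1.0]
weather = [0.5, 0.2, 0.9]  # e.g. normalized humidity, wind speed, temperature (hypothetical)
w_query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_key = [[1.0, 0.0], [0.0, 1.0]]

fused, weights = weather_aware_fusion([front, rear], weather, w_query, w_key)
print("view weights:", weights)
print("fused embedding:", fused)
```

In this toy case the projected weather query aligns more with the front-view key, so the front view receives the larger attention weight; different weather conditions would shift the balance between the two views.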