🤖 AI Summary
This work addresses zero-shot, cross-country, fine-grained traffic sign recognition, which is complicated by substantial inter-country design variation and scarce labeled data. To tackle this, the authors propose a “think twice before recognizing” multi-stage reasoning strategy built on large multimodal models (LMMs). The method combines center-coordinate prompt optimization, a prior traffic sign hypothesis for filtering irrelevant answers, few-shot in-context learning on template signs, and differential descriptions of similar signs—requiring no fine-tuning or training data, and relying only on simple, uniform instructions and minimal template examples for cross-domain generalization. Evaluated on three public benchmarks and two real-world cross-country datasets, the approach achieves state-of-the-art performance, improving robustness in complex road scenes and cross-country transferability for fine-grained sign recognition.
📝 Abstract
We propose a new strategy called “think twice before recognizing” to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to complex road conditions, and existing approaches particularly struggle with cross-country TSR when training data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions, with center-coordinate prompt optimization, help the LMM locate the target traffic sign in original road images containing multiple signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning with template traffic signs, which reduces the cross-domain difference and enhances the LMM's fine-grained recognition capability. The differential descriptions of similar traffic signs further strengthen the LMM's multimodal thinking capability. The proposed method is independent of training data and requires only simple, uniform instructions. In extensive experiments on three benchmark datasets and two real-world datasets from different countries, the proposed method achieves state-of-the-art TSR results on all five datasets.
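The multi-stage prompting pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names, prompt wordings, and the `lmm` callable interface are hypothetical assumptions introduced here to show how the three thinking stages (context, characteristic, and differential descriptions) could chain together.

```python
# Hypothetical sketch of the "think twice before recognizing" strategy.
# Prompt texts and function signatures are illustrative assumptions,
# not the authors' actual instructions.

def context_prompt(center_xy):
    """Stage 1: locate the target sign via a center-coordinate prompt."""
    x, y = center_xy
    return (
        f"Focus on the traffic sign centered at pixel ({x}, {y}). "
        "Describe its surrounding road context. If the region is not a "
        "traffic sign, answer 'not a sign'."  # prior traffic sign hypothesis
    )

def characteristic_prompt(template_examples):
    """Stage 2: few-shot in-context learning from template traffic signs."""
    shots = "\n".join(
        f"Template: {desc} -> Class: {cls}" for desc, cls in template_examples
    )
    return (
        "Using the template examples below, describe the sign's shape, "
        f"color, and symbol, then name its class.\n{shots}"
    )

def differential_prompt(candidate_classes):
    """Stage 3: contrast visually similar candidates before deciding."""
    cands = ", ".join(candidate_classes)
    return (
        f"The sign resembles: {cands}. State the visual differences "
        "between these candidates and pick the best match."
    )

def recognize(image, center_xy, templates, candidates, lmm):
    """Chain the three stages; `lmm` is any callable(image, prompt) -> str."""
    answer = None
    for prompt in (
        context_prompt(center_xy),
        characteristic_prompt(templates),
        differential_prompt(candidates),
    ):
        answer = lmm(image, prompt)
        if answer == "not a sign":  # filter irrelevant regions early
            return None
    return answer  # final fine-grained class decision
```

In this sketch, `lmm` stands in for any multimodal model call; because the stages are plain text instructions, no fine-tuning or training data is needed, matching the training-free property claimed in the abstract.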