ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

📅 2025-11-13
🤖 AI Summary
Supervised learning is held back by the cost and time required to obtain high-quality labeled data, and labels generated automatically by large language models (LLMs) still fall short of human quality. This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which multimodal LLMs act both as annotators and as judges that critically flag potential errors. Human reviewers then check only the most suspicious samples, which yields an efficient human-in-the-loop annotation workflow. Key contributions include: (1) applicability across NLP, computer vision, and multimodal understanding; (2) seven empirically derived insights on improving annotation quality while reducing human cost, translated into user-friendly guidelines; and (3) a theoretical analysis of how to modify the loss function so that models trained on ACT data perform similarly to models trained on fully human-annotated data. Experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human annotation cost.

📝 Abstract
Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal LLMs (MLLMs). (2) Through empirical studies, we derive seven insights on how to enhance annotation quality while efficiently reducing human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
Problem

Research questions and friction points this paper is trying to address.

Improving annotation quality of LLM-generated labels
Reducing human annotation costs through critical error identification
Maintaining model performance with efficient multimodal data labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs act as annotators and error judges
Human effort focuses on suspicious cases only
Modifies loss function for near-human performance
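
The annotate, judge, and review loop described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the names `act_style_triage`, `annotate`, `judge`, `human_review`, and `budget` are hypothetical, the judge is assumed to return a suspicion score in [0, 1], and the paper's loss-function modification is not shown because its details are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Annotated:
    item: str
    label: str               # current label (machine label unless a human corrected it)
    suspicion: float = 0.0   # hypothetical judge score: higher means more likely wrong
    human_reviewed: bool = False


def act_style_triage(
    items: Sequence[str],
    annotate: Callable[[str], str],           # assumed: an (M)LLM annotator
    judge: Callable[[str, str], float],       # assumed: an (M)LLM judge returning a score in [0, 1]
    human_review: Callable[[str, str], str],  # assumed: a human returning the corrected label
    budget: float,                            # fraction of samples humans may review, e.g. 0.1
) -> List[Annotated]:
    # Step 1: the annotator labels every sample.
    records = [Annotated(item=x, label=annotate(x)) for x in items]

    # Step 2: the judge scores how suspicious each machine label looks.
    for r in records:
        r.suspicion = judge(r.item, r.label)

    # Step 3: humans review only the most suspicious samples, up to the budget.
    n_review = int(len(records) * budget)
    ranked = sorted(records, key=lambda r: r.suspicion, reverse=True)
    for r in ranked[:n_review]:
        r.label = human_review(r.item, r.label)
        r.human_reviewed = True

    # Records are returned in their original order.
    return records
```

With a `budget` of 0.1, humans would see only the top 10% most suspicious labels, which corresponds to the "up to 90%" saving in human cost quoted in the abstract. How the judge produces its score and how reviewed and unreviewed labels are weighted during training are left open in this sketch.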
Lequan Lin
PhD Candidate, University of Sydney
machine learning, large language models, generative models, graph neural networks

Dai Shi
University of Sydney, Australia; University of Cambridge, United Kingdom

Andi Han
University of Sydney, RIKEN AIP
Generative models, LLM, Optimization

Feng Chen
University of Adelaide, Australia

Qiuzheng Chen
ByteDance, Australia

Jiawen Li
ByteDance, Australia

Zhaoyang Li
Ph.D student, University of Science and Technology of China
Computer Vision

Jiyuan Li
ByteDance, Australia

Zhenbang Sun
ByteDance, Australia

Junbin Gao
University of Sydney, Australia