AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

📅 2025-09-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Scientific literature question answering lacks high-quality evaluation benchmarks and interactive training data. Method: This paper introduces AirQA, the first multi-task, multimodal scientific paper QA dataset tailored for AI research, comprising 13,948 papers and 1,246 questions and enabling instance-level, fine-grained evaluation. The authors also propose ExTrActor, an automated instruction synthesis framework that generates high-quality, multi-turn interactive trajectories without human intervention, integrating multi-agent collaboration, tool invocation, and interactive retrieval. Contribution/Results: Experiments show that state-of-the-art open- and closed-source models achieve limited performance on AirQA, confirming its difficulty. ExTrActor significantly enhances small language models' capability in multi-turn tool usage, approaching the performance of large models. The code, dataset, and interaction trajectories are fully open-sourced.

📝 Abstract
The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While agents based on large language models (LLMs) can automate question answering (QA) workflows for scientific papers, a comprehensive and realistic benchmark for evaluating their capabilities is still lacking. Moreover, training an interactive agent for this task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated, comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,948 papers and 1,246 questions, that encompasses multi-task, multi-modal, and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive benchmark for evaluating LLM-based QA on scientific papers
Shortage of high-quality interaction trajectories for training interactive agents
Difficulty extracting key information from growing volume of academic papers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated dataset for AI paper QA
Automated framework for instruction data synthesis
LLM-based agents for trajectory collection
Tiancheng Huang
Nanyang Technological University
Deep Learning, Graph Neural Network, LiDAR, 3D Point Cloud
Ruisheng Cao
Shanghai Jiao Tong University
LLM Agent, text-to-SQL, code generation, semantic parsing, dialogue systems
Yuxin Zhang
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Zhangyi Kang
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Zijian Wang
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Chenrun Wang
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Yijie Luo
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Hang Zheng
Zhejiang University
array signal processing, DOA estimation, beamforming, tensor signal processing, machine learning
Lirong Qian
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China
Lu Chen
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China
Kai Yu
MoE Key Lab of Artificial Intelligence, Shanghai, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China