Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost and low inter-annotator agreement of manual relevance annotation in search evaluation, this paper proposes a modular, multi-stage large language model (LLM) pipeline. It decomposes relevance judgment into fine-grained subtasks, assigns each subtask to an appropriately sized LLM with stage-specific prompting, and jointly optimizes accuracy and cost. The authors introduce an automated evaluation framework grounded in Krippendorff's α reliability metric and, on the TREC-DL benchmark, show that the pipeline surpasses GPT-4o: it improves α by 18.4% over GPT-4o mini at roughly $0.20 per million input tokens, and still achieves a 9.7% α gain over the flagship GPT-4o. Key contributions: (1) a modular, multi-stage LLM classification architecture; (2) heterogeneous LLMs coordinated via task-aware inference; and (3) α-driven prompt engineering and scheduling optimization.
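The staged design described above can be illustrated with a minimal sketch. The stage functions, model names, costs, and the confidence-threshold escalation rule here are all hypothetical stand-ins, not the paper's actual prompts or models; the point is only the control flow: cheap stages judge first and escalate to larger models when unsure.

```python
# Minimal sketch of a multi-stage relevance pipeline (hypothetical stages;
# the paper's actual prompts and model choices are not reproduced here).
# Cheap stages run first and escalate only when confidence is low.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Stage:
    name: str                                        # stand-in model name
    cost_per_mtok: float                             # USD per million input tokens
    judge: Callable[[str, str], Tuple[int, float]]   # (query, doc) -> (label, confidence)

def assess(query: str, doc: str, stages: List[Stage],
           threshold: float = 0.8) -> Tuple[int, str]:
    """Return (relevance label, name of the stage that decided)."""
    label = 0
    for stage in stages:
        label, conf = stage.judge(query, doc)
        if conf >= threshold:              # confident enough: stop early, save cost
            return label, stage.name
    return label, stages[-1].name          # fall through to the last stage's answer

# Stub judges standing in for real LLM calls (illustration only).
def small_model(q: str, d: str) -> Tuple[int, float]:
    hit = q.lower() in d.lower()           # cheap heuristic: exact phrase match
    return (1 if hit else 0, 0.95 if hit else 0.5)

def large_model(q: str, d: str) -> Tuple[int, float]:
    overlap = set(q.lower().split()) & set(d.lower().split())
    return (1 if overlap else 0, 0.99)     # expensive model: always decides

pipeline = [Stage("mini", 0.15, small_model), Stage("flagship", 5.0, large_model)]
print(assess("neural ranking", "A survey of neural ranking models", pipeline))
print(assess("neural ranking", "Cooking pasta at home", pipeline))
```

In this toy run the first pair is settled by the cheap stage, while the second escalates to the expensive one, mirroring the cost-accuracy trade-off the summary describes.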

📝 Abstract
The effectiveness of search systems is evaluated using relevance labels that indicate the usefulness of documents for specific queries and users. While obtaining these relevance labels from real users is ideal, scaling such data collection is challenging. Consequently, third-party annotators are employed, but their inconsistent accuracy demands costly auditing, training, and monitoring. We propose an LLM-based modular classification pipeline that divides the relevance assessment task into multiple stages, each utilising different prompts and models of varying sizes and capabilities. Applied to TREC Deep Learning (TREC-DL), one of our approaches showed an 18.4% Krippendorff's $\alpha$ accuracy increase over OpenAI's GPT-4o mini while maintaining a cost of about 0.2 USD per million input tokens, offering a more efficient and scalable solution for relevance assessment. This approach beats the baseline performance of GPT-4o (5 USD). With a pipeline approach, even the accuracy of the GPT-4o flagship model, measured in $\alpha$, could be improved by 9.7%.
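The paper's headline numbers are stated in Krippendorff's $\alpha$, a chance-corrected agreement coefficient defined as $\alpha = 1 - D_o/D_e$ (observed over expected disagreement). The following is a small self-contained implementation for the nominal case, written for illustration; the paper itself does not show this code, and libraries such as `krippendorff` on PyPI provide production-grade versions.

```python
# Krippendorff's alpha for nominal data: alpha = 1 - D_o / D_e,
# computed via the standard coincidence-matrix construction.
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """`units` is a list of per-unit label lists (one label per annotator;
    annotators may be missing). Units with fewer than two labels carry no
    agreement information and are skipped."""
    o = Counter()                          # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):   # all ordered label pairs in the unit
            o[(c, k)] += 1.0 / (m - 1)
    n = sum(o.values())
    n_c = Counter()                        # category marginals
    for (c, _), v in o.items():
        n_c[c] += v
    D_o = sum(v for (c, k), v in o.items() if c != k)                  # observed
    D_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)  # expected
    return 1.0 - D_o / D_e

# Two annotators labelling four query-document pairs (0 = not relevant, 1 = relevant);
# three units agree, one disagrees.
print(krippendorff_alpha_nominal([[0, 0], [0, 0], [1, 1], [0, 1]]))  # 8/15 ≈ 0.533
```

Perfect agreement yields $\alpha = 1$, and values near 0 indicate agreement no better than chance, which is why the paper reports relative $\alpha$ gains rather than raw label-match rates.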
Problem

Research questions and friction points this paper is trying to address.

Information Relevance
Cost-effectiveness
User Feedback Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage Approach
Cost-effective Processing
Enhanced Accuracy