Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how persona specification and temperature settings affect qualitative coding consensus and accuracy in large language model (LLM)-based multi-agent systems (MAS). Method: We develop a reproducible MAS framework using six open-source LLMs (3B–32B), simulating structured coding discussions with an explicit consensus arbitration mechanism. Contribution/Results: (1) For most models, single-agent performance matches or exceeds that of multi-agent consensus in coding accuracy; (2) only OpenHermesV2-7B shows marginal accuracy gains under low temperature and assertive persona configurations; (3) the primary value of MAS lies not in universally improving accuracy, but in identifying ambiguous coding boundaries and supporting codebook refinement. The findings challenge the implicit assumption that MAS inherently outperforms single-agent approaches, offering empirical grounding and design guidance for LLM-augmented qualitative research methodologies.

📝 Abstract
Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy on dialog segments, using a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with mixed personas (neutral, assertive, or empathetic) significantly delayed consensus in four of the six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of mixed personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2-7B) and code category showed above-chance gains from MAS deliberation, and only when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration in these configurations suggests that MAS may nonetheless help narrow ambiguous code applications, which could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.
Problem

Research questions and friction points this paper is trying to address.

How temperature affects consensus in multi-agent LLM systems
Impact of diverse personas on agent consensus and accuracy
Evaluating multi-agent vs single-agent coding performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system mirrors human coding workflows
Temperature and persona impact consensus-building
Open-source LLMs analyze coding decisions
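The workflow the paper describes (persona-configured agents discussing a segment until they agree, with an arbitration fallback) can be sketched as a minimal simulation. This is a hedged illustration, not the paper's released code: the `Agent.propose` stub stands in for an actual LLM call, the persona labels mirror those named in the abstract, and the majority-vote arbitration rule is an assumption.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    persona: str        # e.g. "neutral", "assertive", "empathetic" (labels from the abstract)
    temperature: float  # sampling temperature; 0.0 means fully deterministic here

    def propose(self, segment: str, codebook: list[str], rng: random.Random) -> str:
        # Stub for an LLM call: with probability equal to the temperature,
        # pick a random code; otherwise return a deterministic "best guess".
        if rng.random() < self.temperature:
            return rng.choice(codebook)
        return codebook[hash(segment) % len(codebook)]

def reach_consensus(agents, segment, codebook, max_rounds=5, seed=0):
    """Run discussion rounds until all agents propose the same code;
    fall back to majority-vote arbitration after max_rounds (an assumption)."""
    rng = random.Random(seed)
    for round_num in range(1, max_rounds + 1):
        votes = [a.propose(segment, codebook, rng) for a in agents]
        if len(set(votes)) == 1:
            return votes[0], round_num  # unanimous consensus reached
    # Arbitration fallback: most common vote from the final round
    return max(set(votes), key=votes.count), max_rounds
```

At temperature 0.0 every agent returns the same deterministic guess, so consensus arrives in round one; raising the temperature injects disagreement and delays (or forces arbitration of) consensus, which is the dynamic the paper measures across its 18 configurations.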