🤖 AI Summary
This study addresses the challenge of automatically evaluating large language models in political question-answering scenarios, where assessments must jointly consider factual correctness, response clarity, and evasion detection. The impact of prompt design on such high-level semantic tasks remains underexplored. Leveraging the CLARITY dataset from the SemEval 2026 shared task, this work presents the first systematic evaluation of three prompting strategies—simple (zero-shot) prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples—comparing a GPT-3.5 baseline against GPT-5.2 on clarity scoring and topic detection. Results show that GPT-5.2 achieves a clarity prediction accuracy of 63% under few-shot chain-of-thought prompting, up from the GPT-3.5 baseline's 56%, and reaches 74% accuracy in topic identification. Evasion detection remains challenging, with peak accuracy at 34% and substantial difficulty in discriminating fine-grained evasion categories. The findings highlight both the efficacy and the limits of structured prompting in complex semantic evaluation tasks.
📝 Abstract
Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought prompting with few-shot examples. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though gains are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent, measured against human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
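The abstract's evaluation setup (accuracy against human annotations, plus hierarchical exact match over multi-level labels) can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: the label names and the (clarity, evasion) two-level hierarchy are assumptions made for the example, not the CLARITY dataset's real schema.

```python
def accuracy(gold, pred):
    """Fraction of predictions exactly matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def hierarchical_exact_match(gold, pred):
    """Credit a prediction only when every level of the label
    hierarchy (here: clarity AND fine-grained evasion category)
    agrees with the human annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Illustrative (clarity, evasion) label pairs per model response.
gold = [("clear", "none"), ("ambiguous", "dodging"), ("clear", "none")]
pred = [("clear", "none"), ("ambiguous", "deflection"), ("clear", "none")]

# Flat accuracy on the clarity level alone: all three match.
clarity_acc = accuracy([g[0] for g in gold], [p[0] for p in pred])

# Hierarchical exact match: the second pair fails on the evasion level,
# so only two of three responses count.
hem = hierarchical_exact_match(gold, pred)
```

Under this scoring, a model can do well on coarse clarity while losing most of its hierarchical credit to fine-grained evasion errors, which matches the gap the paper reports between 63 percent clarity accuracy and 34 percent evasion accuracy.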