🤖 AI Summary
This study addresses the challenge of automatically evaluating large language models in political question-answering scenarios, where assessments must jointly consider factual correctness, response clarity, and evasion detection. The impact of prompt design on such high-level semantic tasks remains underexplored. Leveraging the CLARITY dataset from the SemEval 2026 shared task, this work presents the first systematic evaluation of three prompting strategies—simple (zero-shot) prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples—comparing a GPT-3.5 baseline against GPT-5.2 on clarity scoring and topic detection. Results show that GPT-5.2 achieves a clarity prediction accuracy of 63% under few-shot chain-of-thought prompting, up from the GPT-3.5 baseline's 56%, and reaches 74% accuracy in topic identification. Evasion detection remains challenging, with peak accuracy at 34% and substantial difficulty in discriminating fine-grained evasion categories. The findings highlight both the efficacy and the limits of structured prompting in complex semantic evaluation tasks.
📝 Abstract
Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought prompting with few-shot examples. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though gains are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent, measured against human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
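The abstract's evaluation setup (accuracy against human annotations, plus hierarchical exact match over multi-level labels) can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: the label names and the (clarity, evasion) two-level hierarchy are assumptions made for the example, not the CLARITY dataset's real schema.

```python
def accuracy(gold, pred):
    """Fraction of predictions exactly matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def hierarchical_exact_match(gold, pred):
    """Credit a prediction only when every level of the label
    hierarchy (here: clarity AND fine-grained evasion category)
    agrees with the human annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Illustrative (clarity, evasion) label pairs per model response.
gold = [("clear", "none"), ("ambiguous", "dodging"), ("clear", "none")]
pred = [("clear", "none"), ("ambiguous", "deflection"), ("clear", "none")]

# Flat accuracy on the clarity level alone: all three match.
clarity_acc = accuracy([g[0] for g in gold], [p[0] for p in pred])

# Hierarchical exact match: the second pair fails on the evasion level,
# so only two of three responses count.
hem = hierarchical_exact_match(gold, pred)
```

Under this scoring, a model can do well on coarse clarity while losing most of its hierarchical credit to fine-grained evasion errors, which matches the gap the paper reports between 63 percent clarity accuracy and 34 percent evasion accuracy.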