🤖 AI Summary
Dialogue summarization faces challenges including high supervision costs and weak task relevance, which are particularly limiting in high-stakes domains such as healthcare. To address this, we propose an unsupervised, question-answering (QA)-driven framework that eliminates the need for human-annotated summaries. First, a large language model generates dialogue summaries and corresponding task-oriented QA pairs in a zero-shot manner. Second, a QA consistency scoring mechanism automatically evaluates and filters high-quality summaries. Finally, the summarization model is fine-tuned on the selected high-scoring instances. Our approach significantly improves both information completeness and task relevance. Empirical evaluation across multiple benchmarks demonstrates performance on par with fully supervised state-of-the-art methods, while substantially outperforming existing zero-shot baselines. These results validate the method's effectiveness, generalizability, and practical deployability in real-world settings.
📝 Abstract
Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications, such as medical tasks. In this paper, we propose app, a framework for task-oriented utility-based dialogue summarization. app starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, app demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
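The selection step above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `answer_fn` is a hypothetical stand-in for an LLM call that answers a question given only the summary as context, and exact-match agreement with reference answers is used as a simple proxy for the QA consistency score.

```python
# Hypothetical sketch of QA-consistency scoring over candidate summaries.
# answer_fn(context, question) stands in for an LLM call; here any callable
# with that signature works, so the logic can be tested with a toy function.

def qa_consistency_score(summary, qa_pairs, answer_fn):
    """Fraction of task questions answerable correctly from the summary alone."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_fn(summary, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)

def select_best_summary(candidates, qa_pairs, answer_fn):
    """Return the candidate summary with the highest QA-consistency score."""
    return max(candidates, key=lambda s: qa_consistency_score(s, qa_pairs, answer_fn))

# Toy answer function for demonstration: "answers" by string lookup in the context.
def toy_answer_fn(context, question):
    return "aspirin" if "aspirin" in context else "unknown"

qa = [("What medication was prescribed?", "aspirin")]
cands = [
    "Patient discussed symptoms with the doctor.",
    "Doctor prescribed aspirin for recurring headaches.",
]
best = select_best_summary(cands, qa, toy_answer_fn)
```

The summaries selected this way would then serve as fine-tuning targets for the summarization model, as described in the abstract.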