🤖 AI Summary
This study addresses the challenge of efficient fine-tuning and quality evaluation of large language models (LLMs) for report summarization in sensitive domains—such as government archives, news, and intelligence reports—under low-resource (single/dual A100 GPUs) and zero-shot (unlabeled data) settings. We propose a lightweight, localized fine-tuning framework integrating supervised fine-tuning (SFT) with an unsupervised contrastive learning variant of Direct Preference Optimization (DPO). To rigorously assess output quality, we design a multidimensional evaluation suite measuring semantic fidelity (ROUGE, BERTScore) and robustness (invalid-summary detection). To our knowledge, this is the first systematic validation of feasible, high-quality LLM fine-tuning for report summarization under strict low-resource constraints. Experimental results demonstrate substantial improvements in summary relevance and readability, with over 60% reduction in invalid outputs across diverse report types; all training and deployment are achievable on a single A100 GPU.
📝 Abstract
We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of report (government archives, news, intelligence reports) summarization. While this topic is being very actively researched - our specific application set-up faces two challenges: (i) ground-truth summaries maybe unavailable (e.g., for government archives), and (ii) availability of limited compute power - the sensitive nature of the application requires that computation is performed on-premise and for most of our experiments we use one or two A100 GPU cards. Under this set-up we conduct experiments to answer the following questions. First, given that fine-tuning the LLMs can be resource intensive, is it feasible to fine-tune them for improved report summarization capabilities on-premise? Second, what are the metrics we could leverage to assess the quality of these summaries? We conduct experiments on two different fine-tuning approaches in parallel and our findings reveal interesting trends regarding the utility of fine-tuning LLMs. Specifically, we find that in many cases, fine-tuning helps improve summary quality and in other cases it helps by reducing the number of invalid or garbage summaries.