Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data

📅 2025-03-10

🤖 AI Summary
This study addresses the challenge of efficiently fine-tuning large language models (LLMs) for report summarization, and of evaluating the resulting summaries, in sensitive domains such as government archives, news, and intelligence reports. The setting is low-resource (one or two A100 GPUs) and includes unlabeled data for which ground-truth summaries may be unavailable. We propose a lightweight, localized fine-tuning framework integrating supervised fine-tuning (SFT) with an unsupervised contrastive learning variant of Direct Preference Optimization (DPO). To rigorously assess output quality, we design a multidimensional evaluation suite measuring semantic fidelity (ROUGE, BERTScore) and robustness (invalid-summary detection). To our knowledge, this is the first systematic validation of feasible, high-quality LLM fine-tuning for report summarization under strict low-resource constraints. Experimental results demonstrate substantial improvements in summary relevance and readability, with over 60% reduction in invalid outputs across diverse report types; all training and deployment are achievable on a single A100 GPU.
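To make the evaluation dimensions above concrete, here is a minimal sketch of a unigram-overlap ROUGE-1 F-measure and an invalid-summary check. It is not the paper's evaluation suite: the tokenizer, the function names, and the thresholds `min_tokens` and `max_repeat_ratio` are assumptions chosen for illustration, and the invalid-summary rule is a hypothetical heuristic. BERTScore is omitted because it requires a pretrained model.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, i.e. the ROUGE-1 F-measure without stemming."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def is_invalid_summary(summary: str, min_tokens: int = 5,
                       max_repeat_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: flag empty, very short, or highly repetitive output."""
    tokens = tokenize(summary)
    if len(tokens) < min_tokens:
        return True
    # Share of the output taken up by its single most frequent token.
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens) > max_repeat_ratio
```

A reference-based score such as `rouge1_f1` only applies when ground-truth summaries exist; a reference-free check such as `is_invalid_summary` can still be run on unlabeled reports.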

📝 Abstract
We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of summarizing reports (government archives, news, intelligence reports). While this topic is being actively researched, our application set-up faces two challenges: (i) ground-truth summaries may be unavailable (e.g., for government archives), and (ii) compute power is limited: the sensitive nature of the application requires that computation be performed on-premise, and for most of our experiments we use one or two A100 GPU cards. Under this set-up, we conduct experiments to answer the following questions. First, given that fine-tuning LLMs can be resource intensive, is it feasible to fine-tune them on-premise for improved report summarization? Second, what metrics could we leverage to assess the quality of these summaries? We conduct experiments on two different fine-tuning approaches in parallel, and our findings reveal interesting trends regarding the utility of fine-tuning LLMs. Specifically, we find that in many cases fine-tuning improves summary quality, and in other cases it helps by reducing the number of invalid or garbage summaries.
Problem

Research questions and friction points this paper is trying to address.

Evaluate feasibility of on-premise fine-tuning for LLMs in report summarization.
Identify metrics to assess quality of summaries generated by fine-tuned LLMs.
Compare supervised and unsupervised fine-tuning approaches for report summarization.
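The feasibility question in the list above can be framed with a back-of-the-envelope memory estimate. The sketch below is an illustration, not a calculation from the paper: it assumes full fine-tuning with mixed-precision Adam at roughly 16 bytes per parameter, ignores activation memory, and assumes the 80 GB A100 variant. The source does not state model sizes, so the parameter counts in the usage note are hypothetical.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients, and optimizer state (activations excluded).

    Assumption: mixed-precision Adam, i.e. 2 bytes for fp16 weights,
    2 bytes for fp16 gradients, and 12 bytes for fp32 master weights
    plus two optimizer moments, giving 16 bytes per parameter.
    """
    return num_params * bytes_per_param / 1e9


def fits_on_gpus(required_gb: float, num_gpus: int, gb_per_gpu: float = 80.0) -> bool:
    """True if the requirement fits in pooled GPU memory, ignoring sharding overhead."""
    return required_gb <= num_gpus * gb_per_gpu
```

Under these assumptions, a hypothetical 7-billion-parameter model needs about 112 GB for `full_finetune_memory_gb(7e9)`, which exceeds one 80 GB card but fits in the pooled memory of two. That gap is one reason on-premise feasibility is a genuine question rather than a given.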
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs for report summarization
Experiments with limited on-premise compute power
Evaluation metrics for summary quality improvement
Swati Rallapalli
Researcher, IBM Research
Protocol design for mobile networks; localization and mobility prediction; network economics and social networks.
Shannon K. Gallagher
Software Engineering Institute, Carnegie Mellon University
Andrew O. Mellinger
Software Engineering Institute, Carnegie Mellon University
Jasmine Ratchford
Software Engineering Institute, Carnegie Mellon University
Anusha Sinha
Software Engineering Institute, Carnegie Mellon University
Tyler Brooks
Software Engineering Institute, Carnegie Mellon University
William R. Nichols
Software Engineering Institute, Carnegie Mellon University
Nick Winski
Software Engineering Institute, Carnegie Mellon University
Bryan Brown
Stanford University
Science education; race; language; cognition