Approaches to Analysing Historical Newspapers Using LLMs

📅 2026-03-26
🤖 AI Summary
This study examines how the conservative-Catholic and liberal-progressive Slovene press of the late 19th and early 20th centuries represented collective identity, political stance, and national belonging, working from OCR-degraded historical newspaper text. It combines BERTopic topic modelling, aspect-level sentiment analysis with the Slovene-adapted large language model GaMS3-12B-Instruct, named entity recognition, entity-relation graph construction, network analysis, and critical discourse analysis. The work evaluates instruction-following LLMs for sentiment classification in historical Slovene, identifies shared thematic concerns and ideological divides between the two newspapers, notes that GaMS3-12B-Instruct performs better on neutral sentiment than on positive or negative sentiment, and shows that some groups appear mainly in descriptive contexts while others appear more often in conflict-related ones.

📝 Abstract
This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then build graphs from named entity recognition (NER) output to explore the relationships between collective identities and places, and analyse them with a mixed-methods approach that combines quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socioeconomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
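As a rough illustration of the entity-graph step described in the abstract, the sketch below builds a co-occurrence graph from per-passage entity sets and computes a weighted degree for each node. It is a minimal sketch under stated assumptions, not the authors' pipeline: the passage data, the function names `cooccurrence_edges` and `weighted_degree`, and the choice of passage-level co-occurrence as the edge criterion are all hypothetical, and the entity names are invented examples rather than results from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: for each passage, the set of entities (collective
# identities and places) that an upstream NER step is assumed to return.
passages = [
    {"Slovenci", "Ljubljana"},
    {"Slovenci", "Nemci", "Ljubljana"},
    {"Nemci", "Dunaj"},
    {"Slovenci", "Dunaj"},
]


def cooccurrence_edges(passages):
    """Count how often each unordered entity pair shares a passage."""
    edges = Counter()
    for entities in passages:
        # Sorting makes (a, b) and (b, a) map to the same edge key.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


edges = cooccurrence_edges(passages)
degree = weighted_degree(edges)

# List entities from most to least connected.
for entity, score in degree.most_common():
    print(entity, score)
```

Weighted degree is only one of several possible centrality measures; it is used here because it needs no external library. A ranking of this kind could serve as a starting point for choosing which entities to read closely in the qualitative analysis.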
Problem

Research questions and friction points this paper is trying to address.

historical newspapers
collective identity
sentiment analysis
OCR-degraded text
public discourse
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based sentiment analysis
historical newspaper analysis
entity-graph visualization
mixed methods digital humanities
OCR-degraded text processing
Filip Dobranić
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Tina Munda
Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana
Oliver Pejić
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Vojko Gorjanc
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Uroš Šmajdek
Faculty of Computer Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana
David Bordon
Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana
Jakob Lenardič
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Tjaša Konovšek
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Kristina Pahor de Maiti Tekavčič
Institute of Contemporary History, Privoz 11, SI-1000 Ljubljana
Ciril Bohak
Assist. prof. at University of Ljubljana, Faculty of Computer and Information Science
Computer Graphics, Visualization
Darja Fišer
Assistant Professor, University of Ljubljana
Computational linguistics, Corpus linguistics, Computer-Mediated Communication, Wordnet