Taggus: An Automated Pipeline for the Extraction of Characters' Social Networks from Portuguese Fiction Literature

📅 2025-08-05
🤖 AI Summary
To address the poor performance of character identification and social relation extraction in Portuguese fiction under low-resource conditions, this paper proposes Taggus, a fully automated, end-to-end NLP pipeline. Taggus integrates part-of-speech tagging, lightweight named entity recognition, and multi-stage heuristic rules, requiring neither large-scale annotated corpora nor large language models. Its key innovations include a coreference resolution mechanism specific to Portuguese literary text and an interaction detection strategy, both tailored to the linguistic and narrative conventions of the genre. Evaluated on manually annotated Portuguese fiction samples, Taggus achieves an average F1 of 94.1% for character identification with coreference resolution and 75.9% for interaction detection, which the authors report as increases of 50.7% and 22.3%, respectively, over readily available state-of-the-art tools (off-the-shelf NER tools and ChatGPT). By offering a reusable and dependency-light pipeline, Taggus supports structural analysis of literary texts in under-resourced languages.

📝 Abstract
Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. However, these methods are not optimized for the task of constructing Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that, compared to readily available State-of-the-Art tools (off-the-shelf NER tools and Large Language Models such as ChatGPT), the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfying results, with an average F1-Score of $94.1\%$ in the task of identifying characters and resolving co-reference and $75.9\%$ in interaction detection. These represent, respectively, increases of $50.7\%$ and $22.3\%$ over the results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.
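
To make the idea of a character social network concrete, the following is a minimal sketch and not the Taggus implementation. It assumes a deliberately naive heuristic (capitalised tokens that are not in a small, hypothetical exclusion list count as character mentions) and treats two characters as interacting when they appear in the same sentence. The names `NON_NAMES`, `candidate_characters`, and `cooccurrence_network` are illustrative assumptions; the actual pipeline relies on POS tagging and more elaborate heuristics that the abstract does not detail.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical exclusion list: capitalised Portuguese words that are
# usually sentence starters rather than character names.
NON_NAMES = {"O", "A", "Os", "As", "Um", "Uma", "E", "Mas", "Depois", "Ele", "Ela"}


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough approximation)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_characters(sentence):
    """Return capitalised tokens that are not in the exclusion list."""
    tokens = re.findall(r"\w+", sentence)
    return {t for t in tokens if t[0].isupper() and t not in NON_NAMES}


def cooccurrence_network(text):
    """Count, for each pair of candidate characters, the sentences they share."""
    edges = Counter()
    for sentence in split_sentences(text):
        names = sorted(candidate_characters(sentence))
        for a, b in combinations(names, 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    sample = (
        "O Carlos encontrou a Maria no jardim. "
        "Maria falou com o João. "
        "Depois, Carlos e Maria saíram."
    )
    for (a, b), weight in cooccurrence_network(sample).items():
        print(f"{a} -- {b}: {weight}")
```

A heuristic this simple would also pick up place names and miss pronouns or aliases, which is the kind of gap that POS tagging and co-reference resolution are meant to close.
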
Problem

Research questions and friction points this paper is trying to address.

Extracting social networks from Portuguese literature automatically
Improving NLP methods for character and interaction identification
Addressing lack of annotated data for less-represented languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

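Combines POS tagging with heuristics instead of relying on off-the-shelf NER tools or Large Language Models
Reports F1-Score increases of 50.7% for character identification and 22.3% for interaction detection over readily available tools
Pipeline is publicly available to encourage development for the Portuguese language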
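
For readers unfamiliar with the metric behind these figures, here is a small sketch of how an F1-Score can be computed for character identification when predicted and gold-standard characters are compared as sets. This is an assumption about the general shape of the evaluation, not the paper's exact protocol (for example, how scores are averaged across books is not stated here), and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for two collections of character names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only: "Lisboa" is a place wrongly tagged as a character,
    # and "João" is a character the system missed.
    predicted = {"Carlos", "Maria", "Lisboa"}
    gold = {"Carlos", "Maria", "João"}
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```
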
Authors
Tiago G. Canário
NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, Lisboa, 1070-312, Portugal.
Catarina Duarte
NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, Lisboa, 1070-312, Portugal.
Flávio L. Pinheiro
NOVA IMS, Universidade Nova de Lisboa
Computational Social Science, Data Science, Network Analysis, Economic Complexity
João L. M. Pereira
Centro Algoritmi/LASI, University of Évora, 7000-671, Évora, Portugal.