🤖 AI Summary
To address the poor performance of character identification and character interaction extraction in Portuguese literary fiction under low-resource conditions, this paper proposes Taggus, a fully automated, end-to-end NLP pipeline. Taggus integrates part-of-speech tagging, lightweight named entity recognition, and multi-stage heuristic rules, requiring neither large-scale annotated corpora nor large language models. Its key innovations are a coreference resolution mechanism and an interaction detection strategy, both tailored to the linguistic and narrative conventions of Portuguese literary text. Evaluated on a manually annotated Portuguese fiction corpus, Taggus achieves 94.1% F1 for character identification and 75.9% F1 for interaction detection, surpassing readily available state-of-the-art tools (off-the-shelf NER tools and ChatGPT) by 50.7 and 22.3 percentage points, respectively. By offering a reusable, high-accuracy, and dependency-light processing paradigm, Taggus advances deep structural analysis of literary texts in under-resourced languages.
📝 Abstract
Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. However, these methods are not optimized for the task of constructing Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that, compared to readily available State-of-the-Art tools -- off-the-shelf NER tools and Large Language Models (ChatGPT) -- the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfactory results, with an average F1-Score of $94.1\%$ in the task of identifying characters and resolving co-reference and $75.9\%$ in interaction detection. These represent, respectively, an increase of $50.7\%$ and $22.3\%$ over the results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.
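For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-Score for character identification can be computed from a set of predicted characters and a set of manually annotated ones. It uses the standard definition of precision, recall, and F1; the function name, the set-based exact matching, and the example character names are illustrative assumptions and are not taken from the Taggus pipeline, whose matching criteria and averaging procedure are not described in this abstract.

```python
def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall, and F1.

    `predicted` and `gold` are sets of character names. Exact string
    matching is an assumption of this sketch, not the paper's protocol.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four annotated characters, three predictions,
# one of which ("Lisboa") is a place wrongly tagged as a character.
gold_characters = {"Carlos", "Maria", "João", "Ega"}
predicted_characters = {"Carlos", "Maria", "Lisboa"}

precision, recall, f1 = precision_recall_f1(predicted_characters, gold_characters)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

An "average" F1-Score, as reported above, would then presumably be the mean of such per-sample scores, although the exact averaging scheme is an assumption here.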