Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak generalization of multilingual author representations and the poor cross-lingual transfer of monolingual models, this paper proposes the first end-to-end multilingual author identification framework. The method introduces probabilistic content masking and language-aware batching to explicitly decouple writing style from semantic content, mitigating cross-lingual interference and stabilizing contrastive learning. Built on a multilingual pre-trained architecture, the model is jointly trained on a large-scale dataset of over 4.5 million authors spanning 36 languages and 13 domains. Experiments show substantial improvements: the approach outperforms monolingual baselines in 21 of 22 non-English languages, achieving an average Recall@8 gain of 4.85% (up to 15.91% in a single language), confirming the benefit of multilingual joint modeling for author style representation.
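
Neither technique's implementation is spelled out on this page. The sketch below is one plausible reading of probabilistic content masking: each token is masked with a probability tied to how content-bearing it looks, so the encoder is nudged toward style-indicative function words. The frequency-based `content_scores` heuristic, the `MASK_TOKEN` symbol, and the `max_mask_prob` cap are illustrative assumptions, not the authors' code.

```python
import random
from collections import Counter

MASK_TOKEN = "<mask>"  # placeholder symbol; the real model's mask token may differ

def content_scores(corpus_tokens):
    """Hypothetical scoring: rarer tokens count as more content-bearing.

    Frequent tokens (function words like 'the', 'and') tend to signal style,
    while rare tokens tend to carry topic/content. The most frequent token
    scores 0.0 (never masked); the rarest scores close to 1.0.
    """
    counts = Counter(corpus_tokens)
    max_count = max(counts.values())
    return {tok: 1.0 - counts[tok] / max_count for tok in counts}

def probabilistic_content_mask(tokens, scores, max_mask_prob=0.8):
    """Mask each token with probability proportional to its content score,
    pushing the representation toward stylistically indicative words."""
    out = []
    for tok in tokens:
        p = max_mask_prob * scores.get(tok, 1.0)  # unseen tokens: most content-like
        out.append(MASK_TOKEN if random.random() < p else tok)
    return out
```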

📝 Abstract
Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings, mostly in English, leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model's improved performance.
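
Language-aware batching, as described in the abstract, constrains each contrastive batch to a single language so that in-batch negatives never mix languages. A minimal sketch, assuming examples arrive as (text, author_id, language) triples; the grouping key and shuffling details are assumptions, not the paper's pipeline:

```python
import random
from collections import defaultdict

def language_aware_batches(examples, batch_size):
    """Yield batches whose examples all share one language, so in-batch
    negatives for contrastive learning are same-language texts.

    `examples` is an iterable of (text, author_id, language) triples.
    """
    by_lang = defaultdict(list)
    for ex in examples:
        by_lang[ex[2]].append(ex)  # group by the language field

    batches = []
    for lang_examples in by_lang.values():
        random.shuffle(lang_examples)
        for i in range(0, len(lang_examples), batch_size):
            batch = lang_examples[i:i + batch_size]
            if len(batch) == batch_size:  # drop ragged tails for a stable loss
                batches.append(batch)

    random.shuffle(batches)  # interleave languages across training steps
    return batches
```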
Problem

Research questions and friction points this paper is trying to address.

Enhancing authorship representation generalization across languages and domains
Overcoming limitations of monolingual models focused primarily on English
Improving cross-lingual and cross-domain generalization in authorship attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic content masking to focus on stylistic words
Language-aware batching to reduce cross-lingual interference
Multilingual training across 36 languages and 13 domains
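
The headline results above are reported as Recall@8: for each query text, a hit is counted when at least one of the 8 most similar candidate texts shares the query's author. A minimal NumPy sketch, assuming cosine similarity over precomputed embeddings (function and argument names are illustrative):

```python
import numpy as np

def recall_at_k(query_embs, query_authors, cand_embs, cand_authors, k=8):
    """Fraction of queries whose top-k most similar candidates (by cosine
    similarity) include at least one document by the same author."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = q @ c.T                            # (n_queries, n_candidates)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    cand_authors = np.asarray(cand_authors)
    hits = [
        query_authors[i] in cand_authors[topk[i]]
        for i in range(len(query_authors))
    ]
    return float(np.mean(hits))
```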