AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

📅 2025-09-09
🤖 AI Summary
This study addresses the automatic identification of fine-grained, positive supportive language—termed “candy speech”—in social media. We propose a span-level fine-tuning framework for multilingual modeling, trained on 46k German YouTube comments. Our approach integrates representations from XLM-RoBERTa-Large, GBERT, and Qwen3 Embedding, and employs an emoji-aware tokenizer to enhance affective and pragmatic modeling. Unlike conventional sentence-level classification, span-level training improves precise localization of supportive linguistic segments. Evaluated on the GermEval 2025 shared task, our system achieved first place: a binary positive F1-score of 0.8906 and a strict span-level F1-score of 0.6307 for categorized supportive spans. These results demonstrate the effectiveness of cross-lingual representation learning combined with fine-grained span annotation for analyzing civil online discourse.
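Span-level training means the model labels which tokens belong to a supportive span rather than classifying whole comments. The paper's actual pipeline is not shown here; the following is a minimal sketch of one common way to do this: projecting character-level span annotations onto token-level BIO tags. The whitespace tokenizer, helper names, and toy sentence are illustrative assumptions, not the authors' preprocessing.

```python
# Sketch: project character-level span annotations onto token-level
# BIO labels for span-level sequence tagging. Whitespace tokenization
# and label names are illustrative assumptions.

def tokenize_with_offsets(text):
    """Whitespace tokenization that records (start, end) character offsets."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end
    return tokens, offsets

def spans_to_bio(offsets, spans):
    """Map labeled character spans [(start, end, label), ...] to BIO tags."""
    tags = ["O"] * len(offsets)
    for s_start, s_end, label in spans:
        inside = False
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start < s_end and t_end > s_start:  # token overlaps span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

text = "Du bist super das Video ist toll"
spans = [(0, 13, "compliment"), (18, 32, "positive_feedback")]
tokens, offsets = tokenize_with_offsets(text)
print(list(zip(tokens, spans_to_bio(offsets, spans))))
```

With subword tokenizers, the same projection is typically done via the tokenizer's offset mapping instead of a hand-rolled splitter.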

📝 Abstract
Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both the binary (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
Problem

Research questions and friction points this paper is trying to address.

Detecting positive supportive online communication automatically
Improving candy speech detection using span-level training
Evaluating multilingual models on German YouTube comment corpus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Span-level training for candy speech detection
Multilingual XLM-RoBERTa-Large model implementation
Emoji-aware tokenizers enhancing detection performance
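The idea behind an emoji-aware tokenizer is to keep emoji as standalone tokens so their affective signal is not merged into neighboring subwords or mapped to an unknown token. A minimal pre-tokenization sketch of that idea follows; the codepoint ranges and helper names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of emoji-aware pre-tokenization: split on whitespace, then
# peel emoji off as separate tokens. Codepoint ranges are a simplified
# illustrative subset of the Unicode emoji blocks.

EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, supplemental emoji
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
]

def is_emoji(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)

def pretokenize(text):
    """Whitespace-split, emitting each emoji as its own token."""
    tokens = []
    for word in text.split():
        buf = ""
        for ch in word:
            if is_emoji(ch):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

print(pretokenize("Tolles Video👍🔥"))  # → ['Tolles', 'Video', '👍', '🔥']
```

In practice one would also handle variation selectors and ZWJ sequences, and register frequent emoji as added tokens in the subword vocabulary.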
Christian Rene Thelen
Department of Medical Engineering and Technomathematics, FH Aachen University of Applied Sciences, Jülich, Germany; Academic and Research Department Engineering Hydrology
Patrick Gustav Blaneck
Department of Medical Engineering and Technomathematics, FH Aachen University of Applied Sciences, Jülich, Germany; IT Center, RWTH Aachen University, Aachen, Germany
Tobias Bornheim
Department for Data Science and AI, ORDIX AG, Paderborn, Germany
Niklas Grieger
Department of Medical Engineering and Technomathematics, FH Aachen University of Applied Sciences, Jülich, Germany; Institute for Data-Driven Technologies, FH Aachen University of Applied Sciences, Jülich, Germany; Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
Stephan Bialonski
FH Aachen University of Applied Sciences
machine learning · data science · time series analysis · natural language processing · complex systems