🤖 AI Summary
This work addresses cross-lingual subject classification for English–German academic literature. We propose a lightweight sentence-embedding framework that treats embedding dimensions as tokens for self-attention, sharply compressing internal dimensionality, and combines negative sampling with a margin-based retrieval loss under a bilingual joint-training regime, substantially reducing GPU memory and compute requirements. The approach achieves an average recall of 32.24% across all subjects in the quantitative evaluation and scores of 43.16% and 31.53% in the two qualitative evaluations, competitive with existing lightweight models. Our core contribution is the first coupling of dimension-as-token self-attention with a margin-based loss for cross-lingual academic text embedding, achieving strong retrieval effectiveness at very low computational cost and offering a scalable approach to multilingual subject recommendation in resource-constrained settings.
📝 Abstract
We present our system submission for SemEval 2025 Task 5, which focuses on cross-lingual subject classification in the English and German academic domains. Our approach leverages bilingual data during training, employing negative sampling and a margin-based retrieval objective. We demonstrate that a dimension-as-token self-attention mechanism designed with significantly reduced internal dimensions can effectively encode sentence embeddings for subject retrieval. In the quantitative evaluation, our system achieved an average recall of 32.24% in the general setting (all subjects), along with scores of 43.16% and 31.53% under the two general qualitative evaluation methods, all with minimal GPU usage, highlighting its competitive performance. Our results demonstrate that our approach effectively captures relevant subject information under resource constraints, although there is still room for improvement.
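The margin-based retrieval objective with negative sampling can be sketched as follows. This is a minimal illustrative reimplementation, not the authors' code: the cosine similarity measure and the margin value of 0.3 are assumptions, and in the actual system the embeddings would come from the bilingual encoder rather than being passed in as plain lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def margin_retrieval_loss(anchor, positive, negatives, margin=0.3):
    """Hinge-style retrieval loss: the correct subject embedding should
    score at least `margin` higher than each sampled negative subject.
    `margin=0.3` is an assumed hyperparameter for illustration."""
    pos_sim = cosine(anchor, positive)
    return sum(
        max(0.0, margin - pos_sim + cosine(anchor, neg))
        for neg in negatives
    ) / len(negatives)
```

When the anchor already matches its positive subject far better than the sampled negatives, the hinge saturates and the loss is zero; otherwise the loss grows with how much each negative encroaches on the margin.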