🤖 AI Summary
This paper addresses the multilingual and cross-lingual claim retrieval subtask of SemEval-2025 Task 7 on fact-checked claim retrieval. We propose a lightweight, efficient TF-IDF-based baseline that employs word-level tokenization, vocabulary truncation, and systematic hyperparameter optimization over vocabulary size, token granularity, and vector dimensionality, demonstrating for the first time the effectiveness of a 15,000-dimensional word-level TF-IDF configuration for this task. Experimental results show that our unsupervised cross-lingual retrieval model achieves success@10 scores of 0.78 on the development set and 0.69 on the test set across ten languages, substantially outperforming most conventional baselines. Our key contributions are: (i) establishing the practical performance ceiling of lightweight traditional methods in low-resource multilingual fact-checking; and (ii) providing a strong, interpretable, and reproducible baseline for future research.
📝 Abstract
This paper presents the Duluth approach to SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system and experimented with vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system performed better on higher-resource languages but still lagged significantly behind the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that although advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially in scenarios with limited compute resources.
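The retrieval setup described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the authors' actual pipeline: the toy claims and posts are invented, and only the core settings from the paper (word-level tokenization and a 15,000-feature vocabulary cap) are reflected. Posts are ranked against fact-checked claims by cosine similarity, and success@10 would check whether the gold claim appears among the top 10 candidates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data only; the real task uses a multilingual corpus of
# fact-checked claims and social-media posts across ten languages.
fact_checked_claims = [
    "COVID-19 vaccines do not alter human DNA.",
    "The Eiffel Tower is located in Paris, France.",
    "Drinking bleach does not cure viral infections.",
]
posts = ["Is it true that vaccines change your DNA?"]

# Word-level TF-IDF with a capped vocabulary, mirroring the paper's
# best configuration (15,000 features; the cap is inert on a toy corpus).
vectorizer = TfidfVectorizer(analyzer="word", max_features=15000)
claim_vecs = vectorizer.fit_transform(fact_checked_claims)
post_vecs = vectorizer.transform(posts)

# For each post, rank all claims by cosine similarity and keep the
# top 10 indices; success@10 asks whether the gold claim is among them.
scores = cosine_similarity(post_vecs, claim_vecs)
top10 = scores.argsort(axis=1)[:, ::-1][:, :10]
print(top10[0])
```

Because the model is unsupervised, the whole system reduces to fitting one vectorizer on the claim collection and a single matrix product per query batch, which is what makes this baseline attractive when compute is limited.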