🤖 AI Summary
Progress in natural language processing (NLP) is hindered by inefficient use of similar-text resources and a strong reliance on cross-lingual parallel corpora with explicit alignments. Method: This work proposes a corpus-driven paradigm for generalizable NLP, presenting the first comprehensive landscape of comparable corpora research and breaking free from the traditional constraints of parallel data and explicit alignment. It systematically advances three core directions (cross-lingual representation learning, low-resource adaptive modeling, and dynamic contrastive learning), integrating corpus-linguistic analysis, unsupervised contrastive modeling, and domain-adaptive evaluation. Contribution/Results: The work synthesizes six frontier research directions and twelve open problems. Its findings have catalyzed new research consensus at BUCC-2025 and affiliated workshops, establishing a methodological foundation for unlocking the value of non-parallel corpora and advancing universal language understanding.
📝 Abstract
Most conference papers present new results, but this paper focuses instead on opportunities for the audience to make their own contributions. It is intended to challenge the community to think more broadly about what we can do with comparable corpora. We start with a review of the history and then suggest new directions for future research. This paper accompanies a keynote at BUCC-2025, a workshop associated with Coling-2025.