Incorporating Domain Knowledge into Materials Tokenization

📅 2025-06-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Generic tokenization methods cause excessive fragmentation, semantic distortion, and loss of structural integrity for material entities in materials science text. To address this, we propose MATTER, a domain-adaptive tokenization framework. Its core contributions are: (1) MatDetector, a model pretrained on a materials knowledge base that accurately identifies material entity boundaries; and (2) a knowledge-driven token re-ranking and merging mechanism that explicitly preserves the semantic fidelity and structural consistency of material concepts. MATTER integrates domain-specific knowledge injection with customized tokenization strategies. Evaluated on generation and classification tasks, MATTER achieves average performance gains of 4.0% and 2.1%, respectively, significantly outperforming mainstream tokenization approaches. It establishes an interpretable, high-fidelity tokenization paradigm for materials NLP.

๐Ÿ“ Abstract
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of $4%$ and $2%$ in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER
Problem

Research questions and friction points this paper is trying to address.

Improves tokenization for materials science language models
Reduces fragmentation and semantic loss in material concepts
Integrates domain knowledge to maintain structural integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates material knowledge into tokenization
Uses MatDetector trained on knowledge base
Re-ranks tokens to prioritize material concepts
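The mechanism the bullets above describe, detecting material entities and then keeping them intact while a generic subword tokenizer handles the rest, can be sketched as follows. This is a hedged illustration, not the authors' implementation: `detect_materials` is a hypothetical regex stand-in for MatDetector, and `toy_generic` is a deliberately crude tokenizer that mimics how frequency-based subword merges fragment chemical formulas.

```python
import re

def detect_materials(text):
    """Hypothetical stand-in for MatDetector: flags chemical-formula-like
    spans (e.g. 'LiFePO4') via a crude element-symbol regex."""
    pattern = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

def tokenize_with_protection(text, generic_tokenize):
    """Tokenize `text`, keeping each detected material span as a single
    token and applying the generic tokenizer everywhere else."""
    tokens, cursor = [], 0
    for start, end in detect_materials(text):
        tokens.extend(generic_tokenize(text[cursor:start]))
        tokens.append(text[start:end])  # material entity kept whole
        cursor = end
    tokens.extend(generic_tokenize(text[cursor:]))
    return tokens

def toy_generic(s):
    """Toy generic tokenizer: splits into character bigrams, mimicking
    how a frequency-centric vocabulary can shred a formula."""
    s = s.strip()
    return [s[i:i + 2] for i in range(0, len(s), 2)] if s else []

print(tokenize_with_protection("The cathode LiFePO4 shows stability", toy_generic))
```

Without protection, `toy_generic` would split `LiFePO4` into meaningless bigrams; with the detector in the loop, the formula survives as one token, which is the structural-integrity property MATTER's re-ranking is designed to preserve at the vocabulary-merging stage.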