ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization

📅 2025-01-13
🤖 AI Summary
Vietnamese social media text is full of highly irregular internet slang, which creates semantic ambiguity and, together with the scarcity of annotated data, makes NLP processing difficult. To address this, we propose the first lightweight lexical normalization framework for Vietnamese that combines the PhoBERT pretrained language model with weak supervision, rule-enhanced dictionary matching, and a modular microservice architecture. The system supports non-standard token identification, interactive querying, and end-to-end text normalization, serving both researchers and non-technical users. We fully open-source the codebase and toolchain. Experiments on multiple real-world social media corpora show improvements in downstream tasks, including tokenization and named entity recognition, with high accuracy and strong generalization across domains. The framework provides reusable normalization infrastructure for low-resource Vietnamese NLP.

📝 Abstract
ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.
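
The two services described in the abstract can be pictured with a minimal sketch. This is not the ViSoLex implementation: the real system relies on pre-trained language models and weak supervision, whereas the sketch below swaps in a plain dictionary lookup purely to illustrate what "NSW Lookup" and "Lexical Normalization" might look like as functions. The dictionary entries and the names `NSW_DICT`, `lookup_nsw`, and `normalize` are all hypothetical and do not come from the repository.

```python
import re

# Hypothetical toy dictionary of non-standard words (NSWs) and their
# standard forms; illustrative only, not taken from the ViSoLex repository.
NSW_DICT = {
    "ko": ["không"],
    "dc": ["được"],
    "j": ["gì"],
    "mik": ["mình"],
    "vs": ["với"],
}


def lookup_nsw(token):
    """NSW Lookup (sketch): return candidate standard forms, or an empty list."""
    return NSW_DICT.get(token.lower(), [])


def normalize(text):
    """Lexical Normalization (sketch): replace each known NSW with its first
    candidate and leave every other token unchanged."""
    def replace_word(match):
        word = match.group(0)
        candidates = lookup_nsw(word)
        return candidates[0] if candidates else word

    # \w+ is Unicode-aware for str patterns in Python 3, so precomposed
    # Vietnamese words are matched as whole tokens.
    return re.sub(r"\w+", replace_word, text)


if __name__ == "__main__":
    print(lookup_nsw("ko"))
    print(normalize("mik ko biết j"))
```

A pure dictionary cannot resolve NSWs whose standard form depends on context, which is the kind of case a model-based normalizer is meant to handle.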

Problem

Research questions and friction points this paper is trying to address.

Vietnamese online slang
Communication barriers
Data analysis difficulties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vietnamese Social Media Lexicon
Text Standardization
Open-source NLP Tool