Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the accuracy degradation in NLP embedding binarization caused by global thresholding. We propose a feature-level adaptive threshold optimization method that departs from conventional fixed-threshold strategies. Specifically, we design an efficient coordinate-search-based optimization framework to learn an optimal binarization threshold independently for each embedding dimension, augmented with statistical significance testing to ensure robustness. Extensive evaluation across multiple NLP tasks—including semantic similarity, retrieval, and classification—and diverse benchmark datasets (STS, SQuAD, and GLUE subsets) demonstrates that our method preserves the ultra-low storage cost and high computational efficiency of binary representations while achieving average accuracy improvements of 2.1–4.7 percentage points over state-of-the-art binarization baselines. It significantly outperforms both global-threshold and hashing-based approaches, establishing a scalable, high-fidelity paradigm for embedding compression in efficient NLP model deployment.
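The summary above describes learning one binarization threshold per embedding dimension via coordinate search. The paper does not publish code here, so the following is a minimal illustrative sketch of that idea, not the authors' implementation: thresholds start at the per-feature medians, and a greedy coordinate search sweeps over dimensions, keeping a candidate threshold only if it improves a user-supplied quality score. The quality score used below (correlation between real-valued cosine similarities and binary agreement rates) is an assumed proxy objective; the paper optimizes task-specific accuracy instead.

```python
import numpy as np

def binarize(X, thresholds):
    # Binarize each column of X with its own per-feature threshold.
    return (X > thresholds).astype(np.uint8)

def similarity_preservation(B, X):
    # Assumed proxy objective: correlation between cosine similarities of the
    # real-valued embeddings and per-bit agreement rates of the binary codes.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = (Xn @ Xn.T).ravel()
    agree = (B[:, None, :] == B[None, :, :]).mean(axis=-1).ravel()
    return np.corrcoef(cos, agree)[0, 1]

def coordinate_search_thresholds(X, score_fn, candidates_per_dim=16, sweeps=2):
    """Greedy coordinate search: optimize one dimension's threshold at a time
    while holding all other thresholds fixed, cycling over the dimensions."""
    d = X.shape[1]
    thresholds = np.median(X, axis=0).copy()  # start from per-feature medians
    best = score_fn(binarize(X, thresholds))
    for _ in range(sweeps):
        for j in range(d):
            # Candidate thresholds for feature j: its empirical quantiles.
            qs = np.quantile(X[:, j], np.linspace(0.05, 0.95, candidates_per_dim))
            for t in qs:
                old = thresholds[j]
                thresholds[j] = t
                s = score_fn(binarize(X, thresholds))
                if s > best:
                    best = s          # keep the improving threshold
                else:
                    thresholds[j] = old  # revert
    return thresholds, best
```

Because only improving moves are accepted, the final score is never worse than that of the median-threshold baseline; the paper's reported gains over global thresholding come from exactly this per-dimension freedom, evaluated on task metrics with statistical significance testing.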

📝 Abstract
Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how binary representations (barcodes) can replace real-valued features in NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, typically using a single fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds yield better binary encodings. The resulting binary representations remain compact and efficient while better preserving the information in each feature. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using the optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just NLP embeddings, making it useful for a wide range of domains in machine learning applications.
Problem

Research questions and friction points this paper is trying to address.

Optimizing feature-specific thresholds for binary NLP embeddings
Improving accuracy and efficiency in binary text representation
Enhancing performance across diverse NLP tasks and datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature-specific thresholding for binary encoding
Coordinate Search-based optimization framework
Versatile binary representation for various features
Soumen Sinha
Department of EEMCS, TU Delft, Mekelweg, 2628 CD Delft, Netherlands
Shahryar Rahnamayan
Department of Engineering, Brock University, St. Catharines, ON L2S 3A1, Canada
Azam Asilian Bidgoli
Assistant Professor, Wilfrid Laurier University, Canada
Machine Learning · Multi-objective Optimization · Evolutionary Computation · Feature Selection