🤖 AI Summary
This study addresses the high false-positive rates of existing credential leakage detection tools, which stem from rigid pattern matching and binary classification that cannot distinguish genuine secrets from placeholder or weak credentials. To overcome this limitation, the authors propose a ternary (three-class) classification framework that explicitly models placeholder or weak credentials as a distinct class. They design a hybrid deep learning architecture combining CodeBERT for semantic understanding with CNN-based character-level features. Evaluated on 9,426 samples spanning ten programming languages, the model achieves a macro F1-score of 0.90 and a Matthews correlation coefficient (MCC) of 0.86, with 93% recall and 89% precision for genuine credentials. Compared to prior character-level approaches, F1 for placeholder or weak credential detection improves from 54% to 81%, and nine of the ten languages reach F1 above 0.80 under leave-one-language-out evaluation. High-severity alerts fall by 33.0% (from 373 to 250) without sacrificing security coverage.
📝 Abstract
Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, combining CodeBERT-based semantic understanding with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, with 93% recall and 89% precision for genuine credential leaks, while reducing high-severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross-language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.
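For readers unfamiliar with the reported metrics, the following is a minimal sketch of how macro F1, multiclass MCC, and the alert-reduction percentage can be computed from a confusion matrix. It is not the authors' evaluation code. The function names are hypothetical, the confusion-matrix layout (rows are true classes, columns are predicted classes) is an assumption, and the MCC uses the standard multiclass generalization, which the abstract does not explicitly specify.

```python
from math import sqrt


def macro_f1(cm):
    """Unweighted mean of per-class F1 scores.

    cm[i][j] is the number of samples of true class i predicted as class j
    (layout is an assumption for this sketch).
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                          # total samples
    c = sum(cm[k][k] for k in range(n))                      # correctly classified
    t = [sum(cm[k]) for k in range(n)]                       # true count per class
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = sqrt((s * s - sum(pk * pk for pk in p)) *
               (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0


def alert_reduction_pct(before, after):
    """Percentage drop in alert count, e.g. high-severity alerts."""
    return 100.0 * (before - after) / before


# The abstract's alert figures: 373 alerts reduced to 250.
print(round(alert_reduction_pct(373, 250), 1))  # 33.0
```

Macro F1 weights each of the three classes equally regardless of size, and MCC accounts for every cell of the confusion matrix, which is presumably why both are reported alongside per-class precision and recall.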