Text-Independent Speaker Identification Using Audio Looping With Margin Based Loss Functions

📅 2025-09-26
🤖 AI Summary
This study addresses text-independent speaker identification under variable-length speech inputs, aiming to improve accuracy and robustness. We propose a modified VGG16-based CNN architecture that takes mel spectrogram inputs and incorporates an audio looping strategy to enhance temporal modeling. We systematically compare the discriminative power of CosFace Loss, ArcFace Loss, and Softmax Loss, and investigate the impact of mel spectrogram resolution and frame duration on identification performance. Experimental results demonstrate that margin-based losses significantly outperform the Softmax baseline (achieving a 3.2% reduction in Equal Error Rate, EER), with ArcFace exhibiting superior robustness for short utterances. Audio looping further improves few-shot generalization. Our work establishes a reproducible, lightweight, and duration-adaptive optimization framework for speaker identification.
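The audio looping named in the title can be illustrated with a short sketch. The page does not describe the exact procedure, so the version below assumes the simplest form: a clip shorter than the target length is repeated end to end and then truncated. The function name `loop_audio` and the sample values are hypothetical.

```python
def loop_audio(samples, target_len):
    """Repeat `samples` end to end and cut the result to `target_len` items.

    Assumption: plain tile-and-truncate looping; the paper may use a
    different scheme (e.g. looping the spectrogram instead of the waveform).
    """
    if not samples:
        raise ValueError("cannot loop an empty signal")
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


# A 3-sample clip looped to 8 samples; a longer clip is simply truncated.
short_clip = [0.1, -0.2, 0.3]
looped = loop_audio(short_clip, 8)
truncated = loop_audio([1, 2, 3, 4, 5], 3)
```

With this approach every utterance reaches the same length before the mel spectrogram is computed, at the cost of introducing a discontinuity at each loop boundary.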

📝 Abstract
Speaker identification has become a crucial component in various applications, including security systems, virtual assistants, and personalized user experiences. In this paper, we investigate the effectiveness of CosFace Loss and ArcFace Loss for text-independent speaker identification using a Convolutional Neural Network architecture based on the VGG16 model, modified to accommodate mel spectrogram inputs of variable sizes generated from the VoxCeleb1 dataset. We implement both loss functions and analyze their effects on model accuracy and robustness, using the Softmax loss function as a comparative baseline. Additionally, we examine how the size of the mel spectrograms and their time length influence model performance. The experimental results show that the margin-based losses achieve higher identification accuracy than the traditional Softmax loss. Furthermore, we discuss the implications of these findings for future research.
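To make the comparison between the three losses concrete, the sketch below shows how CosFace and ArcFace alter the target-class logit before the usual cross-entropy is applied. It follows the standard published definitions of the two losses (CosFace subtracts a margin from the cosine, ArcFace adds a margin to the angle). The scale `s=30.0`, the margin `m=0.35`, and the example cosines are illustrative assumptions, not values taken from the paper.

```python
import math


def margin_logits(cosines, target, loss="arcface", s=30.0, m=0.35):
    """Return scaled logits from class cosine similarities.

    cosines[j] is the cosine between the embedding and class j's weight.
    Only the target class is penalised:
      softmax: s * cos(theta)
      cosface: s * (cos(theta) - m)
      arcface: s * cos(theta + m)
    Simplification: no special handling when theta + m exceeds pi.
    """
    if loss not in ("softmax", "cosface", "arcface"):
        raise ValueError(f"unknown loss: {loss}")
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target and loss == "cosface":
            c = c - m
        elif j == target and loss == "arcface":
            c = math.cos(math.acos(c) + m)
        logits.append(s * c)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Hypothetical cosines for three speakers; speaker 0 is the true class.
cosines = [0.8, 0.1, -0.3]
losses = {
    name: cross_entropy(margin_logits(cosines, 0, loss=name), 0)
    for name in ("softmax", "cosface", "arcface")
}
```

For the same embedding, both margin variants yield a larger loss than plain Softmax, which is what pushes the network toward tighter, better separated speaker clusters during training.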
Problem

Research questions and friction points this paper is trying to address.

Investigating CosFace and ArcFace loss functions for speaker identification
Analyzing the impact of mel spectrogram size on CNN model performance
Developing text-independent speaker recognition using VGG16 architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses CosFace and ArcFace loss functions
Modifies VGG16 for mel spectrogram inputs
Analyzes the impact of variable spectrogram sizes
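The page does not say how VGG16 is modified to accept spectrograms of different sizes. One common way to achieve this, shown here purely as an assumption, is to replace the fixed-size flatten step with global average pooling, so that feature maps of any width collapse to one value per channel. The function name and the toy feature maps are hypothetical.

```python
def global_average_pool(feature_map):
    """Average each channel of a [channels][freq][time] feature map.

    The output length equals the number of channels, regardless of how
    many time frames the input spectrogram produced.
    """
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


# Two toy feature maps with 2 channels each but different time widths.
short_map = [[[1.0, 3.0]], [[2.0, 2.0]]]
long_map = [[[1.0, 2.0, 3.0, 6.0]], [[0.0, 0.0, 0.0, 4.0]]]
short_vec = global_average_pool(short_map)
long_vec = global_average_pool(long_map)
```

Both inputs yield a vector of length 2, so the same classification layer can follow the convolutional stack whatever the utterance duration.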
Elliot Q. C. Garcia
Universidade Federal Rural de Pernambuco
Nicéias Silva Vilela
Universidade Federal Rural de Pernambuco
Kátia Pires Nascimento do Sacramento
Universidade Regional do Cariri
Tiago A. E. Ferreira
Full Professor, Department of Statistics and Informatics, Federal Rural University of Pernambuco
Intelligent Computation, Time Series Analysis and Forecasting, Quantum Computation, Computational