A Preliminary Exploration with GPT-4o Voice Mode

📅 2025-02-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work lacks systematic evaluation of GPT-4o’s speech mode on multimodal audio understanding tasks, particularly regarding capability boundaries and safety-aware behavior. Method: We conduct the first end-to-end audio reasoning evaluation across standardized benchmarks—including intent classification, multilingual ASR, semantic/syntactic reasoning, and singing analysis—and manually verify outputs. Contribution/Results: GPT-4o achieves state-of-the-art performance on intent recognition and multilingual ASR with significantly lower hallucination rates than existing audio LMs. However, it underperforms on instrument classification, audio duration prediction, MOS estimation, and deepfake detection. Critically, its built-in safety mechanisms induce high and inconsistent refusal rates on sensitive tasks (e.g., speaker identification) across datasets, and its outputs exhibit strong sensitivity to instruction phrasing and audio quality. This work establishes a new benchmark and reproducible methodology for evaluating large language models’ audio capabilities.

Technology Category

Application Category

📝 Abstract
With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning., multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a significantly different refusal rate when responding to speaker verification tasks on different datasets. This is likely due to variations in the accompanying instructions or the quality of the input audio, suggesting the sensitivity of its built-in safeguards. Finally, we acknowledge that model performance varies with evaluation protocols. This report only serves as a preliminary exploration of the current state of LALMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluates GPT-4o's audio processing capabilities
Assesses GPT-4o's reasoning abilities across tasks
Explores GPT-4o's safety mechanisms and limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4o audio processing
Multimodal language model
Robustness against hallucinations
🔎 Similar Papers
No similar papers found.