Investigating Polyglot Speech Foundation Models for Learning Collective Emotion from Crowds

📅 2025-09-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of low emotion recognition accuracy in noisy, short-duration (as low as 250 ms) crowd acoustic environments. We conduct the first systematic evaluation of multilingual speech foundation models (SFMs) for crowd emotion recognition (CER), benchmarking them against monolingual and speaker-identification SFMs under controlled conditions. Experiments span three segment durations (1 s, 500 ms, and 250 ms) on a unified evaluation benchmark. Results demonstrate that multilingual SFMs consistently outperform all baselines across all durations, owing to their robust representations of multilingual content, accent variability, and acoustic noise; gains are especially pronounced for 250 ms segments, where they significantly enhance both robustness and classification accuracy. This work establishes multilingual SFMs as a new paradigm for CER and introduces the first reproducible strong baseline and standardized evaluation benchmark for short-duration crowd emotion analysis.
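The evaluation described above slices crowd audio into fixed-length windows at three durations. A minimal sketch of that segmentation step, assuming 16 kHz mono input (a common SFM sampling rate; the exact rate is not stated in this summary):

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; assumed SFM input rate, not specified in the paper summary

def segment_audio(waveform: np.ndarray, duration_s: float) -> list[np.ndarray]:
    """Split a mono waveform into non-overlapping fixed-length segments,
    dropping any trailing remainder shorter than the window."""
    win = int(duration_s * SAMPLE_RATE)
    n_segments = len(waveform) // win
    return [waveform[i * win:(i + 1) * win] for i in range(n_segments)]

# Two seconds of synthetic audio yields 2 / 4 / 8 segments
# at the paper's three durations: 1 s, 500 ms, 250 ms.
audio = np.random.default_rng(0).standard_normal(2 * SAMPLE_RATE).astype(np.float32)
counts = {d: len(segment_audio(audio, d)) for d in (1.0, 0.5, 0.25)}
print(counts)  # {1.0: 2, 0.5: 4, 0.25: 8}
```

Non-overlapping windows keep the three duration conditions directly comparable, since every condition sees the same underlying audio.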

๐Ÿ“ Abstract
This paper investigates polyglot (multilingual) speech foundation models (SFMs) for Crowd Emotion Recognition (CER). We hypothesize that polyglot SFMs, pre-trained on diverse languages, accents, and speech patterns, are particularly adept at navigating the noisy and complex acoustic environments characteristic of crowd settings, thereby offering a significant advantage for CER. To substantiate this, we perform a comprehensive analysis, comparing polyglot, monolingual, and speaker recognition SFMs through extensive experiments on a benchmark CER dataset across varying audio durations (1 s, 500 ms, and 250 ms). The results consistently demonstrate the superiority of polyglot SFMs, which outperform their counterparts across all audio lengths and excel even with extremely short-duration inputs. These findings pave the way for the adoption of SFMs in establishing new benchmarks for CER.
Problem

Research questions and friction points this paper is trying to address.

Investigating polyglot speech models for crowd emotion recognition
Comparing model performance across different audio durations
Evaluating robustness with extremely short-duration speech inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polyglot speech models handle multilingual crowd emotions
Pre-trained on diverse languages and speech patterns
Outperform monolingual models across short audio durations
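A typical way to use a pre-trained SFM for a downstream task like CER is to mean-pool its frame-level features into one vector per segment and fit a lightweight classifier on top. The summary does not specify the paper's downstream head, so the sketch below uses a nearest-centroid probe on synthetic pooled embeddings purely as an illustration of the pipeline shape:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(frame_embeddings: np.ndarray) -> np.ndarray:
    """Collapse (frames, dim) SFM features into a single segment vector."""
    return frame_embeddings.mean(axis=0)

def fit_centroids(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-class mean embedding: a minimal stand-in for a trained probe."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(centroids: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Assign each segment embedding to its nearest class centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy data: 3 hypothetical emotion classes, 16-dim pooled embeddings.
dim, per_class = 16, 40
class_means = rng.normal(size=(3, dim)) * 3.0
X = np.concatenate(
    [m + rng.normal(scale=0.5, size=(per_class, dim)) for m in class_means]
)
y = np.repeat(np.arange(3), per_class)

centroids = fit_centroids(X, y)
acc = (predict(centroids, X) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

The paper's claimed advantage of polyglot SFMs would show up here as better-separated segment embeddings, which make even a simple probe like this accurate on very short inputs.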