JOOCI: a Framework for Learning Comprehensive Speech Representations

📅 2024-10-14

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-supervised speech representation learning tends to separate content information (e.g., linguistic features) from non-content information (e.g., speaker identity, paralinguistic cues) across network depth: non-content attributes concentrate in the earlier layers and linguistic features in the later ones, so neither can use the full hierarchical capacity of the network. To address this, the authors propose JOOCI, an SSL framework that avoids this layer-wise split. JOOCI jointly optimizes both information types so that each can build on the full representational depth, via multi-task training and feature disentanglement. Evaluated on two speaker recognition and two language tasks from the SUPERB benchmark, JOOCI outperforms WavLM by 26.5% and also outperforms other models of similar size (100M parameters).

📝 Abstract

Information in speech can be categorized into two groups: Content (what is being said, such as linguistics) and Other (how it is expressed, such as information about the speaker and paralinguistic features). Current self-supervised learning (SSL) methods have been shown to divide the model's representational depth (its layers) in two, with earlier layers specializing in Other and later layers in Content-related tasks. This layer-wise division is inherently sub-optimal, as neither information type can use all layers to build hierarchical representations. To address this, we propose JOOCI, a novel speech representation learning method that does not compromise on representational depth for either information type. JOOCI outperforms WavLM by 26.5%, as well as other models of similar size (100M parameters), when evaluated on two speaker recognition and two language tasks from the SUPERB benchmark, demonstrating its effectiveness in Jointly Optimizing Other and Content Information (JOOCI).

Problem

Research questions and friction points this paper is trying to address.

Optimizing speech representation layers

Balancing content and speaker features

Letting both information types use the full network depth in self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

JOOCI framework

Speech representation learning

Joint optimization
