From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

📅 2025-06-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how abstract semantic concepts emerge in foundation models for the speech and text modalities, asking: (1) whether unimodal (speech-only or text-only) models can independently develop structured conceptual representations, and (2) whether joint multimodal training enhances semantic robustness and cross-modal transferability. To this end, the authors propose an unsupervised Latent Concept Analysis (LCA) framework, integrating neural representation decoding, cross-modal alignment modeling, and an interpretability toolchain, to systematically compare conceptual hierarchy, structural organization, and generalization capacity across three model types. Results show that speech models spontaneously acquire hierarchical semantic concepts, albeit with weaker structural coherence than text models, and that joint multimodal training markedly improves concept robustness and induces a shared, more structured semantic space across modalities. The study provides empirical evidence of co-evolutionary dynamics between modality-specific and modality-invariant semantic structures.
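The cross-modal alignment step mentioned above can be illustrated by matching concept centroids from two modalities via cosine similarity. The sketch below is an assumption about what such alignment could look like, not the paper's actual procedure; the centroids, dimensions, and noise level are synthetic placeholders.

```python
import numpy as np

# Hypothetical concept centroids from a text model and a speech model,
# assumed to live in a shared embedding space after joint training.
rng = np.random.default_rng(1)
text_centroids = rng.normal(size=(4, 8))
# Speech centroids modeled as noisy copies of the text centroids.
speech_centroids = text_centroids + rng.normal(scale=0.05, size=(4, 8))

def cosine_align(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Match each row of `a` to its most similar row of `b` by cosine similarity."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a_n @ b_n.T  # pairwise cosine similarities
    return sims.argmax(axis=1)

matches = cosine_align(speech_centroids, text_centroids)
print(matches)  # each speech concept should recover its text counterpart
```

With centroids this well separated relative to the noise, each speech concept maps back to its own text counterpart, which is the intuition behind measuring cross-modal transferability of concepts.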

📝 Abstract
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties associated with general intelligence. This raises an intriguing question: do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we make our scripts and other resources available to the community.
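Latent Concept Analysis, as described in the abstract, discovers latent concepts without supervision by grouping a model's internal representations; clusters are then interpreted via their members. A minimal sketch of this clustering step is shown below, using synthetic stand-in embeddings; the cluster count, dimensions, and clustering settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for token-level hidden states from a speech or text model:
# three well-separated groups of 50 vectors, mimicking three latent "concepts".
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(50, 16))
    for center in (-1.0, 0.0, 1.0)
])

# Cluster the representations; each resulting cluster is treated as a
# candidate latent concept, to be interpreted from its member tokens.
clustering = AgglomerativeClustering(n_clusters=3).fit(embeddings)
labels = clustering.labels_
print(sorted(np.bincount(labels).tolist()))  # cluster sizes -> [50, 50, 50]
```

In an actual analysis, the input would be hidden states extracted from a trained model rather than synthetic vectors, and the recovered clusters would be inspected for semantic coherence.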
Problem

Research questions and friction points this paper is trying to address.

Do speech-trained models develop semantic concepts like text-based LLMs?
Does multi-modal training enhance semantic understanding in models?
How do semantic abstractions form across speech and text modalities?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing concept formation in speech and text models
Using Latent Concept Analysis for neural networks
Joint training for richer semantic understanding