A Survey on Spoken Italian Datasets and Corpora

📅 2025-01-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Italian spoken-language resources suffer from scarcity, limited representativeness, and poor accessibility. Method: We systematically survey and evaluate 66 existing spoken Italian datasets and classify them along three dimensions: speech type, source and context, and demographic and linguistic features. The full inventory is publicly accessible on GitHub and archived on Zenodo. Contribution/Results: This work presents a comprehensive inventory of spoken Italian corpora and identifies key bottlenecks, including coverage bias, domain imbalance, and widespread metadata deficiencies. The inventory serves as a reference for downstream tasks such as automatic speech recognition, emotion detection, and language education. By making dataset discovery and comparison more transparent, it aims to support fairness and reproducibility in Italian speech technology development.

📝 Abstract

Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
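The three categorization dimensions named in the abstract (speech type, source and context, demographic and linguistic features) can be pictured as fields of a metadata record. The following is a minimal sketch under stated assumptions: the field names, the `filter_entries` helper, and the placeholder entries are all hypothetical and are not taken from the survey's actual inventory or its GitHub repository.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    name: str
    speech_type: str                     # e.g. "read" or "spontaneous"
    source_context: str                  # e.g. "broadcast", "classroom"
    demographics: Tuple[str, ...] = ()   # demographic and linguistic features
    applications: Tuple[str, ...] = ()   # e.g. "ASR", "emotion detection"


def filter_entries(
    entries: List[DatasetEntry],
    *,
    speech_type: Optional[str] = None,
    application: Optional[str] = None,
) -> List[DatasetEntry]:
    """Return the entries that match every criterion that was supplied."""
    result = []
    for entry in entries:
        if speech_type is not None and entry.speech_type != speech_type:
            continue
        if application is not None and application not in entry.applications:
            continue
        result.append(entry)
    return result


# Placeholder entries only; these are NOT real datasets from the survey.
inventory = [
    DatasetEntry("placeholder-A", "read", "studio recording",
                 ("adult",), ("ASR",)),
    DatasetEntry("placeholder-B", "spontaneous", "broadcast",
                 ("adult", "regional variety"), ("ASR", "emotion detection")),
    DatasetEntry("placeholder-C", "spontaneous", "classroom",
                 ("child",), ("education",)),
]

matches = filter_entries(inventory, speech_type="spontaneous", application="ASR")
print([entry.name for entry in matches])  # prints ['placeholder-B']
```

Keeping each dimension as a separate field means a query can combine criteria across dimensions, which is the kind of comparison the survey's categorization is meant to enable.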

Problem

Research questions and friction points this paper is trying to address.

Spoken Italian Datasets

Language Research

Speech Technology Development

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spoken Italian Datasets

Speech Technology Advancement

Community Contribution
