Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines the implicit gender bias of the "male default," tracing its manifestation in podcast discourse and large language models (LLMs). The authors propose the Gendered Discourse Correlation Framework (GDCF), a discourse-level analytical framework for quantifying gendered linguistic patterns. Applying LDA and BERTopic to 15,117 podcast episodes, GDCF identifies salient male-dominated discourse patterns in the business, technology/politics, and video game domains. The paper further introduces the Discourse Word-Embedding Association Test (D-WEAT) to measure gendered asymmetries in LLM embedding spaces, finding that male-associated discourse terms have markedly more stable and robust representations: a form of representational harm that systematically biases downstream task performance in favor of male-related inputs. The work provides both a methodological contribution and empirical evidence for diagnosing and mitigating structural gender bias in language technologies.

📝 Abstract
Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.
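The D-WEAT described above builds on the standard WEAT effect-size measure over embedding cosine similarities. A minimal sketch of that underlying computation is below; the toy 2-D vectors stand in for real OpenAI embeddings, and the word-set names are illustrative only (the paper's exact D-WEAT formulation may differ from plain WEAT):

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    # s(w, A, B): mean cosine to attribute set A minus mean cosine to set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Caliskan-style WEAT effect size:
    # d = (mean_x s(x,A,B) - mean_y s(y,A,B)) / std_{w in X ∪ Y} s(w,A,B)
    sx = [assoc(x, A, B) for x in X]
    sy = [assoc(y, A, B) for y in Y]
    return float((np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1))

# Toy vectors standing in for real embeddings: X leans toward A, Y toward B.
X = [[1.0, 0.1], [1.0, 0.2]]   # e.g. masculine discourse words
Y = [[0.1, 1.0], [0.2, 1.0]]   # e.g. feminine discourse words
A = [[1.0, 0.0]]               # male attribute terms
B = [[0.0, 1.0]]               # female attribute terms

d = weat_effect_size(X, Y, A, B)  # large positive d: X aligns with A
```

A positive effect size indicates the target set X sits closer to attribute set A than Y does; swapping X and Y flips the sign.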
Problem

Research questions and friction points this paper is trying to address.

Identifies masculine defaults in podcasts and LLMs as hidden gender bias.
Proposes frameworks to discover and measure gendered discourse words in media.
Reveals that masculine discourse words have more stable LLM representations than feminine ones.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gendered Discourse Correlation Framework for podcast analysis
Discourse Word-Embedding Association Test for LLMs
LDA and BERTopic for gendered discourse word discovery
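The step after topic discovery, correlating per-episode topic weights with host gender to form gendered word lists, can be sketched in a few lines. This is an illustrative point-biserial correlation pass over synthetic data with a made-up threshold; the authors' exact GDCF procedure and cutoffs are not given here:

```python
import numpy as np

def gendered_topics(topic_weights, is_male, threshold=0.3):
    # Point-biserial correlation (Pearson with a binary label) between each
    # topic's per-episode weight and host gender; topics whose |r| clears
    # the threshold are flagged as gendered, signed by direction.
    g = np.asarray(is_male, dtype=float)
    flagged = {}
    for k in range(topic_weights.shape[1]):
        r = np.corrcoef(topic_weights[:, k], g)[0, 1]
        if abs(r) >= threshold:
            flagged[k] = "masculine" if r > 0 else "feminine"
    return flagged

# Synthetic per-episode topic weights (rows: episodes, columns: topics),
# e.g. as produced by LDA or BERTopic on episode transcripts.
weights = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
is_male = [1, 1, 1, 0, 0, 0]
labels = gendered_topics(weights, is_male)
```

Each flagged topic's top words would then seed the corresponding gendered discourse word list.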