🤖 AI Summary
This study investigates implicit sociodemographic encoding, particularly along gender and race dimensions, in large language models (LLMs) deployed in clinical settings, and the causal role of that encoding in clinical bias. We propose the first mechanistic interpretability framework tailored to bias analysis in medical LLMs, integrating activation patching, intermediate-layer neuron probing, causal mediation analysis, and distributionally robust evaluation. Our method localizes gender representations to middle MLP layers and enables targeted debiasing interventions for both clinical note generation and depression risk prediction; we further demonstrate that although racial representations are distributed across the network, they remain amenable to intervention. These findings establish a new paradigm for understanding and mitigating structural biases in healthcare AI, offering reproducible, mechanism-grounded technical pathways for equitable model development and deployment.
📝 Abstract
We know from prior work that LLMs encode social biases and that these biases manifest in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via activation patching. Such interventions can surgically alter generated clinical vignettes for specific conditions and also influence downstream clinical predictions that correlate with gender, e.g., a patient's risk of depression. Representations of patient race are somewhat more distributed but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
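The core intervention described above, activation patching, can be illustrated on a toy model: cache a hidden activation from a "donor" forward pass and splice it into a second pass at inference time, so that downstream computation inherits the donor's representation. The two-layer network, random weights, and inputs below are illustrative stand-ins and not the paper's actual model or clinical data; this is a minimal sketch of the mechanism, not of the full framework.

```python
import numpy as np

# Toy 2-layer MLP standing in for one transformer block's MLP sublayer.
# Weights are random and purely illustrative (hypothetical setup).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 2))

def forward(x, patch_hidden=None):
    """Forward pass; optionally overwrite the hidden ("middle MLP")
    activation with a cached donor activation (activation patching)."""
    h = np.tanh(x @ W1)
    if patch_hidden is not None:
        h = patch_hidden  # causal intervention: swap in the donor activation
    return h @ W2, h

x_src = rng.standard_normal(4)  # stand-in for a prompt with one gender cue
x_tgt = rng.standard_normal(4)  # stand-in for the counterfactual prompt

_, h_src = forward(x_src)                            # cache donor activation
y_src, _ = forward(x_src)                            # donor ("source") output
y_tgt, _ = forward(x_tgt)                            # clean target output
y_patched, _ = forward(x_tgt, patch_hidden=h_src)    # patched target run

# In this toy case all downstream computation flows through h, so patching
# the hidden layer makes the target output match the source output exactly.
assert np.allclose(y_patched, y_src)
assert not np.allclose(y_tgt, y_src)
```

In a real LLM the same swap is typically done with forward hooks on a chosen layer, and the effect is measured on the model's next-token distribution rather than matched exactly, since the patched activation is only one of many paths to the output.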