🤖 AI Summary
Existing attribution methods struggle to faithfully explain Transformers because the architecture is neither linear nor additive, while conventional surrogate models assume exactly those properties. This work formally proves that transformers cannot exactly represent linear or additive surrogate models, undermining the theoretical grounding of conventional attribution methods. To close this gap, the authors propose the Softmax-Linked Additive Log Odds Model (SLALOM), a transformer-aligned surrogate: it models token contributions in log-odds space, explicitly couples them through the softmax output layer, and supports differentiable fitting as well as gradient- and perturbation-based estimation of its scores. SLALOM achieves a strong fidelity-efficiency trade-off, improving explanation fidelity by over 30% on both synthetic and real-world benchmarks while reducing computational cost to roughly one-fifth of mainstream methods. The implementation is publicly available.
📝 Abstract
We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods in explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of representing linear or additive surrogate models used for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations on both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models, or provide explanations of comparable quality at a fraction of their computational costs. We release code for SLALOM as an open-source project online at https://github.com/tleemann/slalom_explanations.
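To make the "softmax-linked additive log odds" idea concrete, the following is a minimal sketch of the functional form suggested by the name and description above: each token carries an importance score and a value score, and the surrogate's log-odds output is the softmax(importance)-weighted sum of the value scores. This is an illustrative assumption based on the summary and abstract, not the authors' reference implementation; see the linked repository for the actual model.

```python
import numpy as np

def slalom_log_odds(importances, values):
    """Hypothetical sketch of a softmax-linked additive log-odds surrogate.

    importances: per-token importance scores s_t (how much a token influences
                 the mixture, via softmax weighting).
    values:      per-token value scores v_t (each token's contribution in
                 log-odds space).
    Returns the softmax(importance)-weighted sum of value scores.
    """
    # numerically stable softmax over the importance scores
    weights = np.exp(importances - np.max(importances))
    weights /= weights.sum()
    # additive combination of per-token log-odds contributions
    return float(weights @ values)

# Toy example: two tokens; the second is far more important, so the
# output log-odds are pulled toward its value score of 1.5.
score = slalom_log_odds(np.array([0.0, 2.0]), np.array([-1.0, 1.5]))
print(round(score, 3))  # ≈ 1.202
```

Note how, unlike a plain additive model, a token's effect on the output depends on the importance of every other token through the shared softmax normalization, which is the kind of coupling a purely additive surrogate cannot express.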