Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the attention mechanisms of fine-tuned Vision Transformers (ViTs) on distorted 2D spectrograms contaminated with distractors (e.g., axes, titles, colorbars), aiming to enhance model interpretability and robustness. Methodologically, we conduct head-wise ablation studies, attention map visualization, and quantitative semantic specificity analysis. Our study is the first to systematically uncover a functional hierarchy among ViT attention heads: early layers (1–5) predominantly host task-irrelevant, single-semantic detectors (e.g., text or edge detectors); middle layers (6–11) exhibit strong task-relevant single-semantic selectivity, precisely localizing chirp signal regions; and deeper heads exert the greatest impact on performance. Quantitatively, ablating Layer 6 heads increases MSE by 0.34%, substantially exceeding the effect of early-layer ablation (+0.11%). We successfully identify critical task-specific heads alongside redundant or fragile ones, establishing a transferable mechanistic interpretability framework for model diagnosis, safety enhancement, and transparent design.
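The head-wise ablation study described above can be sketched in a few lines. This is a minimal NumPy illustration of the general technique (zero out one head's output and measure how much the block's output shifts, here via MSE against the unablated baseline), not the paper's actual implementation; all shapes, weight names, and the toy random inputs are hypothetical.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, ablate_head=None):
    """One multi-head self-attention block with optional head ablation.

    x: (seq_len, d_model); Wq/Wk/Wv: (n_heads, d_model, d_head).
    Passing ablate_head=h zeroes head h's output, simulating ablation.
    (Output projection omitted for brevity.)
    """
    n_heads, d_model, d_head = Wq.shape
    outputs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_head)
        # numerically stable softmax over key positions
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        head_out = attn @ v
        if h == ablate_head:
            head_out = np.zeros_like(head_out)  # ablate this head
        outputs.append(head_out)
    return np.concatenate(outputs, axis=-1)

# Rank heads by the MSE shift their ablation causes (toy data).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 8, 16, 4
d_head = d_model // n_heads
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) * 0.1
              for _ in range(3))
baseline = multi_head_attention(x, Wq, Wk, Wv)
importance = [
    float(((multi_head_attention(x, Wq, Wk, Wv, ablate_head=h)
            - baseline) ** 2).mean())
    for h in range(n_heads)
]
```

In the paper's setting this per-head MSE shift is what separates critical task-specific heads (large shift, e.g. layer 6) from redundant early-layer ones (small shift).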

📝 Abstract
Mechanistic interpretability improves the safety, reliability, and robustness of large AI models. This study examined individual attention heads in vision transformers (ViTs) fine-tuned on distorted 2D spectrogram images containing non-relevant content (axis labels, titles, color bars). By introducing extraneous features, the study analyzed how transformer components processed unrelated information, using mechanistic interpretability to debug issues and reveal insights into transformer architectures. Attention maps were used to assess head contributions across layers. Heads in early layers (1 to 3) showed minimal task impact: ablating them increased MSE loss only slightly (μ=0.11%, σ=0.09%), indicating a focus on less critical low-level features. In contrast, ablating deeper heads (e.g., layer 6) caused a threefold higher loss increase (μ=0.34%, σ=0.02%), demonstrating greater task importance. Intermediate layers (6 to 11) exhibited monosemantic behavior, attending exclusively to chirp regions. Some early heads (layers 1 to 4) were monosemantic but not task relevant (e.g., text detectors, edge or corner detectors). Attention maps distinguished monosemantic heads (precise chirp localization) from polysemantic heads (attending to multiple irrelevant regions). These findings revealed functional specialization in ViTs, showing how heads processed relevant vs. extraneous information. By decomposing transformers into interpretable components, this work enhanced model understanding, identified vulnerabilities, and advanced safer, more transparent AI.
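The monosemantic-vs-polysemantic distinction drawn from the attention maps can be illustrated with a simple region-selectivity score: the fraction of a head's attention mass that falls inside the task-relevant region. The `region_selectivity` function and the rectangular "chirp" mask below are hypothetical stand-ins for whatever quantitative specificity metric the paper actually uses; this is only a sketch of the idea.

```python
import numpy as np

def region_selectivity(attn_map, region_mask):
    """Fraction of a head's attention mass inside a target region.

    attn_map: (H, W) non-negative attention weights for one head;
    region_mask: (H, W) boolean mask of the task-relevant region
    (e.g., the chirp area of a spectrogram). Values near 1 suggest
    monosemantic, task-relevant attention; low values suggest the
    head is attending to other (possibly irrelevant) regions.
    """
    total = attn_map.sum()
    return float(attn_map[region_mask].sum() / total) if total > 0 else 0.0

# Toy comparison: a head focused on a "chirp" band vs a diffuse head.
H, W = 8, 8
mask = np.zeros((H, W), dtype=bool)
mask[3:5, :] = True                  # hypothetical chirp region
focused = np.where(mask, 1.0, 0.01)  # attention mostly inside the region
diffuse = np.ones((H, W))            # attention spread everywhere
```

Here `region_selectivity(focused, mask)` is close to 1 (monosemantic, chirp-localized), while `region_selectivity(diffuse, mask)` equals the region's area fraction (0.25), the baseline for an unselective head.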
Problem

Research questions and friction points this paper is trying to address.

Analyze attention heads in ViTs for distorted image processing
Understand functional specialization in transformer components
Enhance AI transparency via mechanistic interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed attention heads in fine-tuned ViTs
Used mechanistic interpretability for debugging
Identified functional specialization across layers