A Unifying Framework for Concept-Based Representational Similarity

📅 2026-06-08

🤖 AI Summary

Concept alignment lacks a unified definition: existing methods optimize divergent objectives under the same terminology, which obscures what is actually being aligned. This work formalizes the multidimensional structure of concept alignment by decomposing it along two axes, “alignment targets” (representations vs. concepts) and “alignment levels” (instance-wise vs. distributional). The decomposition yields four distinct alignment properties and shows that current approaches satisfy only subsets of them. To address this limitation, the authors propose Coupled Sparse Autoencoders (CoSAE), a framework that jointly optimizes multiple alignment objectives, together with InterVenchA, an intervention-based evaluation benchmark. Experiments show that optimizing a single objective does not reliably recover the other alignment properties, whereas CoSAE achieves strong instance-level concept consistency with as little as 0.1% paired data.

📝 Abstract

Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.
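
The gap between instance-wise and distributional alignment can be illustrated with a toy example. The sketch below is not the paper's metric or benchmark. The function names `instance_consistency` and `distributional_gap`, the one-hot concept codes, and the choice of cosine similarity and per-concept means are all assumptions made for illustration. It shuffles the pairing between two sets of concept codes: the distributional statistic stays identical while the instance-wise score collapses.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_consistency(codes_a, codes_b):
    """Hypothetical instance-wise score: mean cosine similarity of row i of A
    with row i of B. It depends on which samples are paired."""
    return sum(cosine(u, v) for u, v in zip(codes_a, codes_b)) / len(codes_a)


def distributional_gap(codes_a, codes_b):
    """Hypothetical distributional score: largest absolute difference between
    per-concept mean activations. It ignores the pairing entirely."""
    dim = len(codes_a[0])
    mean_a = [sum(row[j] for row in codes_a) / len(codes_a) for j in range(dim)]
    mean_b = [sum(row[j] for row in codes_b) / len(codes_b) for j in range(dim)]
    return max(abs(x - y) for x, y in zip(mean_a, mean_b))


rng = random.Random(0)
n, dim = 200, 8

# Toy sparse concept codes for model A: each sample activates exactly one concept.
codes_a = []
for _ in range(n):
    row = [0.0] * dim
    row[rng.randrange(dim)] = 1.0
    codes_a.append(row)

# Model B with the pairing intact: the same code for every sample.
codes_b_paired = [row[:] for row in codes_a]

# Model B with the pairing destroyed: the same rows in shuffled order.
codes_b_shuffled = [row[:] for row in codes_a]
rng.shuffle(codes_b_shuffled)

paired_score = instance_consistency(codes_a, codes_b_paired)
shuffled_score = instance_consistency(codes_a, codes_b_shuffled)
paired_gap = distributional_gap(codes_a, codes_b_paired)
shuffled_gap = distributional_gap(codes_a, codes_b_shuffled)

print(f"instance-wise: paired={paired_score:.3f} shuffled={shuffled_score:.3f}")
print(f"distributional gap: paired={paired_gap:.3f} shuffled={shuffled_gap:.3f}")
```

Because shuffling rows leaves every per-concept mean unchanged, an objective that only sees such statistics cannot tell the two cases apart. This is consistent with the abstract's claim that purely distributional objectives need some paired data as an anchor to recover instance-level alignment.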

Problem

Research questions and friction points this paper is trying to address.

concept alignment

representational similarity

multi-objective alignment

instance-level alignment

distributional alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

concept alignment

representational similarity

multi-objective optimization