🤖 AI Summary
Traditional multimodal approaches rely on costly image-text alignment pretraining to project visual features into a discrete textual token space. This work challenges two core assumptions of that paradigm: it eliminates alignment pretraining altogether and inverts the mapping direction, projecting text embeddings into a continuous visual representation space. To this end, we propose Inverse-LLaVA, which enables dynamic cross-modal fusion via selective additive attention at intermediate Transformer layers, obviating the need for large-scale aligned data. Our study provides the first empirical evidence that effective multimodal learning is achievable without alignment pretraining. Evaluated across nine benchmarks, Inverse-LLaVA achieves substantial gains on reasoning tasks (+27.2% on cognitive reasoning), with expected drops on perception tasks that depend on memorized visual-text associations, while remaining competitive overall. It reduces computational overhead by 45% and better preserves modality-specific characteristics.
📝 Abstract
Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within intermediate transformer layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.
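Below is a minimal sketch of the mechanism the abstract describes: text embeddings are projected into the continuous visual feature space, and fusion happens additively inside an intermediate transformer layer. This is an illustration under assumptions, not the released implementation; the class names (`TextToVisualProjector`, `SelectiveAdditiveFusion`), the dimensions, the MLP projector, and the tanh-gated residual are all hypothetical choices standing in for the paper's "selective additive components in attention mechanisms".

```python
import torch
import torch.nn as nn


class TextToVisualProjector(nn.Module):
    """Hypothetical: maps text token embeddings into the continuous
    visual feature space (the inverse of LLaVA's visual-to-text
    projection). Dimensions are illustrative assumptions."""

    def __init__(self, text_dim: int = 4096, visual_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, visual_dim),
            nn.GELU(),
            nn.Linear(visual_dim, visual_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # (batch, n_text, text_dim) -> (batch, n_text, visual_dim)
        return self.proj(text_emb)


class SelectiveAdditiveFusion(nn.Module):
    """Hypothetical: cross-attend from projected text tokens to visual
    tokens inside an intermediate transformer layer, then add the
    attended visual context through a learned scalar gate."""

    def __init__(self, visual_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            visual_dim, n_heads, batch_first=True
        )
        self.gate = nn.Parameter(torch.zeros(1))  # start with no fusion

    def forward(
        self, text_in_visual: torch.Tensor, visual_tokens: torch.Tensor
    ) -> torch.Tensor:
        attended, _ = self.cross_attn(
            query=text_in_visual, key=visual_tokens, value=visual_tokens
        )
        # Additive, gated fusion: the gate controls how much visual
        # context enters this layer, with no alignment pre-training.
        return text_in_visual + torch.tanh(self.gate) * attended


# Toy usage with hypothetical shapes (e.g., ViT patch features).
projector = TextToVisualProjector()
fusion = SelectiveAdditiveFusion()
text_emb = torch.randn(2, 16, 4096)        # text token embeddings
visual_tokens = torch.randn(2, 256, 1024)  # visual patch features
fused = fusion(projector(text_emb), visual_tokens)
print(fused.shape)  # torch.Size([2, 16, 1024])
```

In this sketch the gate is initialized to zero so visual context is mixed in gradually during training, one plausible way to realize "dynamic integration" without alignment pre-training; the actual gating used by Inverse-LLaVA may differ.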