🤖 AI Summary
This work addresses the high inference latency and limited edge deployability of vision-language models (VLMs) on tasks such as image understanding, reasoning, OCR, grounding, and counting. It introduces Zamba2-VL, a suite of VLMs built on Zamba2, a hybrid language-model architecture that combines Mamba2 state-space layers with a small number of shared Transformer blocks. The backbone's near-linear prefill compute and small, near-constant recurrent state give Zamba2-VL roughly an order of magnitude lower time-to-first-token (TTFT) than Transformer-based VLMs at matched parameter scale, with the gap most pronounced at the 1.2B and 2.7B scales. Across multiple vision-language benchmarks, the models are competitive with leading Transformer-based open-weight VLMs of comparable scale and substantially outperform prior SSM-based and hybrid VLMs, which makes them practical candidates for resource-constrained edge platforms.
📝 Abstract
We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared Transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale. The efficiency gap is most pronounced at the smaller 1.2B and 2.7B scales, which are the most relevant to on-device and edge deployment. We release three models (1.2B, 2.7B, and 7B) together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
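To make the scaling argument concrete, the following is a toy cost model, not Zamba2-VL's implementation. It contrasts how prefill work and cached state grow with prompt length for a pure self-attention layer versus a recurrent state-space layer. All function names and the `D_MODEL` and `D_STATE` values are illustrative assumptions; constant factors, MLP cost, and the hybrid's shared Transformer blocks are ignored, which is why the real model is described as near-linear and near-constant rather than exactly so.

```python
# Toy scaling model (assumption: idealized per-layer operation counts only).

def attention_prefill_cost(n_tokens: int, d_model: int) -> int:
    """Self-attention score computation: grows as n^2 * d."""
    return n_tokens * n_tokens * d_model


def ssm_prefill_cost(n_tokens: int, d_model: int, d_state: int) -> int:
    """Recurrent state-space scan: grows as n * d * d_state."""
    return n_tokens * d_model * d_state


def kv_cache_entries(n_tokens: int, d_model: int) -> int:
    """Keys and values kept per attention layer: grows with n."""
    return 2 * n_tokens * d_model


def ssm_state_entries(d_model: int, d_state: int) -> int:
    """Recurrent state kept per SSM layer: independent of n."""
    return d_model * d_state


# Hypothetical dimensions chosen for illustration, not taken from the paper.
D_MODEL, D_STATE = 2048, 64

for n in (1_000, 4_000, 16_000):
    ratio = attention_prefill_cost(n, D_MODEL) / ssm_prefill_cost(n, D_MODEL, D_STATE)
    print(
        f"n={n}: attention/SSM prefill ratio={ratio:.1f}, "
        f"KV cache entries={kv_cache_entries(n, D_MODEL)}, "
        f"SSM state entries={ssm_state_entries(D_MODEL, D_STATE)}"
    )
```

In this simplified model the prefill ratio is `n_tokens / d_state`, so the gap widens as prompts get longer, which matters for image inputs that expand into many visual tokens. Measured TTFT depends on kernels, hardware, and the full architecture, so these counts indicate trends only.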