ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a unified autoregressive framework for a large multimodal model that covers image understanding, generation, and editing while aligning outputs with human preferences. A discrete semantic visual tokenizer, trained with multi-objective supervision, converts visual content into a unified discrete representation, so that all tasks are modeled under a single next-token prediction paradigm. Reinforcement learning is then applied as preference optimization for text-to-image generation and instruction-guided editing, improving both generation quality and instruction following. With RL, WISE overall rises from 0.50 to 0.56 and GEdit-Bench-EN G_O from 5.75 to 6.68, and the authors also report cross-task synergy between text-to-image generation and editing.

📝 Abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts. First, we train a discrete semantic visual tokenizer that maps images into compact token sequences. The tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment, and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. Second, using this tokenizer, we train a 7B autoregressive model over large-scale text and image token sequences, developing both vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56 and GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
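
The abstract does not say how the tokenizer quantizes features or how visual tokens share a vocabulary with text. As a rough illustration only, the sketch below assumes nearest-neighbor codebook lookup (plain vector quantization) and an id offset that places visual tokens after the text vocabulary. The function names, the toy codebook, and the begin/end-of-image ids are all hypothetical and are not taken from ARM.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector (N, D) to the index of its nearest code (K, D).

    Assumption: nearest-neighbor lookup under squared Euclidean distance.
    """
    # Pairwise squared distances between features and codes, shape (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def build_sequence(text_ids, visual_ids, text_vocab_size, boi_id, eoi_id):
    """Join text and visual tokens into one id sequence.

    Visual ids are shifted past the text vocabulary so both modalities
    live in a single shared vocabulary (hypothetical layout).
    """
    shifted = [int(v) + text_vocab_size for v in visual_ids]
    return list(text_ids) + [boi_id] + shifted + [eoi_id]


def next_token_pairs(seq):
    """Return (context_token, target_token) pairs for next-token prediction."""
    return list(zip(seq[:-1], seq[1:]))


# Toy example: 3 codes in a 2-D feature space, 3 image patch features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
features = np.array([[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]])

visual_ids = quantize(features, codebook)
sequence = build_sequence(
    text_ids=[3, 5],      # hypothetical prompt token ids
    visual_ids=visual_ids,
    text_vocab_size=10,   # hypothetical text vocabulary size
    boi_id=8,             # hypothetical begin-of-image marker
    eoi_id=9,             # hypothetical end-of-image marker
)
pairs = next_token_pairs(sequence)
```

Once images are expressed this way, understanding, generation, and editing differ only in which tokens serve as context and which as targets, which is what makes a single next-token objective sufficient in principle.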

Problem

Research questions and friction points this paper is trying to address.

multimodal modeling

image generation

image editing

discrete representation

autoregressive model

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete representation

autoregressive multimodal model

semantic visual tokenizer