🤖 AI Summary
This work addresses the efficiency and scalability bottlenecks that multimodal large language models face when processing long videos and streaming visual inputs, which stem primarily from excessively long visual token sequences. Inspired by the human visual system, the authors propose POINTS-Long, a dual-mode architecture that natively supports dynamic visual token scaling and streaming comprehension. The design combines a focus/standby dual-mode perception mechanism, which adaptively balances accuracy and efficiency, with a dynamically detachable key-value (KV) cache. Experiments show that the model maintains state-of-the-art performance on fine-grained tasks and, on long video understanding, retains 97.7%-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens, all while enabling efficient streaming visual memory.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, focus mode retains optimal performance, while on long-form general visual understanding, standby mode retains 97.7%-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
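The abstract does not give implementation details for the detachable KV cache. As a rough, purely illustrative sketch of what such a mechanism might look like (all class and method names here are hypothetical, not from the paper), a cache could track visual segments that can be detached from the active attention context to bound memory, and reattached later when a segment needs focused re-inspection:

```python
from collections import OrderedDict

class DetachableKVCache:
    """Illustrative sketch only (not the paper's implementation): a KV cache
    whose per-segment key/value entries can be detached from the active
    context and later reattached, mimicking streaming visual memory."""

    def __init__(self):
        self.segments = OrderedDict()  # segment_id -> list of (key, value) pairs
        self.detached = {}             # segments moved out of the active cache

    def append(self, segment_id, kv_pairs):
        # Add a visual segment's key/value entries to the active cache.
        self.segments.setdefault(segment_id, []).extend(kv_pairs)

    def detach(self, segment_id):
        # Remove a segment from the active context (e.g. stale frames),
        # keeping it around so it can be restored on demand.
        self.detached[segment_id] = self.segments.pop(segment_id)

    def reattach(self, segment_id):
        # Restore a previously detached segment for focused re-inspection.
        self.segments[segment_id] = self.detached.pop(segment_id)

    def active_length(self):
        # Number of KV entries the model currently attends over.
        return sum(len(kv) for kv in self.segments.values())
```

Under this toy model, detaching old frame segments keeps the active context short during streaming, while reattachment supports revisiting earlier visual memory without recomputing it.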