🤖 AI Summary
This work addresses the high computational cost of multimodal large language models for audio-visual captioning, which stems largely from redundant input tokens. Existing pruning methods select tokens with a hard threshold and therefore often fail to preserve critical information carried by ambiguous tokens near the decision boundary. To overcome this, the authors propose AVEX-Prune, a reinforcement learning–based dynamic token pruning approach built around an audio-visual token exchange mechanism. By measuring how substituting tokens, within or across modalities, changes the generated caption, the method identifies and retains the tokens that contribute most to semantic fidelity. With only 40% of tokens retained, AVEX-Prune matches full-token performance, scoring 54.5 on VILA 1.5-8B (54.6 with all tokens) and 57.0 on VideoLLaMA 2 (56.8 with all tokens).
📝 Abstract
Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, and the cost of prefill self-attention scales quadratically with the number of input tokens. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet their hard-threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly ambiguous tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy to select truly valuable tokens: it replaces low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measures the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).
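The exchange idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the reward, or the RL policy, so everything below is an assumption made for illustration. The function `exchange_tokens`, the parameter `num_swaps`, the token names, and the toy `caption_quality` callback (standing in for a comparison of generated captions) are all hypothetical, and a greedy accept-if-better rule is swapped in where the paper uses reinforcement learning.

```python
from typing import Callable, Dict, List, Set, Tuple


def exchange_tokens(
    scores: Dict[str, float],
    keep_ratio: float,
    caption_quality: Callable[[Set[str]], float],
    num_swaps: int = 2,
) -> Tuple[Set[str], List[Tuple[str, str, float]]]:
    """Greedy illustration of token exchange (hypothetical, not the paper's RL method)."""
    # Hard-threshold baseline: keep the top-k tokens by confidence score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(keep_ratio * len(ranked)))
    kept, pruned = ranked[:k], ranked[k:]

    # Tokens near the decision boundary on each side of the threshold.
    low_conf_kept = list(reversed(kept[-num_swaps:]))  # lowest-scoring retained first
    high_conf_pruned = pruned[:num_swaps]              # highest-scoring candidates first

    selected = set(kept)
    log: List[Tuple[str, str, float]] = []
    for out_tok, in_tok in zip(low_conf_kept, high_conf_pruned):
        # Swap one retained token for one candidate; the budget stays at k tokens.
        trial = (selected - {out_tok}) | {in_tok}
        delta = caption_quality(trial) - caption_quality(selected)
        if delta > 0:  # accept the swap only if the caption proxy improves
            selected = trial
        log.append((out_tok, in_tok, delta))
    return selected, log


# Toy data (assumed): "v*" are visual tokens, "a*" are audio tokens.
scores = {
    "v0": 0.90, "v1": 0.80, "v2": 0.55, "v3": 0.20, "v4": 0.10, "v5": 0.05,
    "a0": 0.70, "a1": 0.50, "a2": 0.30, "a3": 0.15,
}

# Assumed ground truth for the toy proxy: tokens that actually matter for the caption.
USEFUL = {"v0", "v1", "a0", "a1"}


def toy_caption_quality(tokens: Set[str]) -> float:
    # Stand-in for scoring a generated caption; a real system would run the LLM.
    return len(tokens & USEFUL) / len(USEFUL)


selected, log = exchange_tokens(scores, keep_ratio=0.4,
                                caption_quality=toy_caption_quality)
print(sorted(selected))
print(log)
```

In this toy run the hard threshold keeps the visual token `v2` and drops the audio token `a1`, even though `a1` matters more to the proxy. The cross-modal swap is accepted because it raises the proxy score, while the second swap (`a0` for `a2`) is rejected because it lowers it. The retained set stays at 40% of the tokens throughout.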