Teaching Metric Distance to Autoregressive Multimodal Foundational Models

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete tokens in autoregressive multimodal foundation models lack explicit metric structure, hindering their ability to preserve meaningful geometric, semantic, or spatial relationships during generation. Method: This paper introduces DIST2Loss, a distance-aware training framework that converts pre-defined continuous metric distances between tokens—e.g., Euclidean, semantic, or spatial distances—into discrete classification targets. Leveraging exponential-family distribution modeling and vector-quantized feature interfaces, DIST2Loss implicitly enforces metric consistency in autoregressive token prediction without modifying the model architecture. Contribution/Results: DIST2Loss significantly enhances multimodal distance awareness, yielding substantial performance gains across diverse tasks—including visual grounding, robotic manipulation, generative reward modeling, and VQ-based image generation—with particularly pronounced improvements in few-shot settings. Empirical results demonstrate consistent and robust enhancements in both alignment fidelity and downstream generalization, validating the efficacy of explicitly grounding discrete token prediction in continuous metric geometry.

📝 Abstract
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss's effectiveness in resource-constrained settings.
Problem

Research questions and friction points this paper is trying to address.

Teaching autoregressive models to understand metric distances.
Enhancing multimodal applications with distance-aware token generation.
Improving model performance in resource-constrained training scenarios.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DIST2Loss trains models with distance-aware token relationships.
Converts continuous metrics to discrete optimization targets.
Enhances multimodal tasks with limited training data.
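The second bullet — converting continuous metric distances into discrete optimization targets — can be sketched as follows. This is a minimal illustration assuming a Gaussian (exponential-family) kernel over 1-D discretized coordinate tokens; the function names and parameters are hypothetical, not the authors' released code:

```python
import numpy as np

def dist2loss_targets(target_value, token_values, sigma=1.0):
    """Soft categorical targets derived from a continuous metric.

    Each discrete token gets probability proportional to a Gaussian
    kernel of its distance to the ground-truth value, so tokens near
    the target share probability mass instead of the usual one-hot.
    """
    d2 = (token_values - target_value) ** 2       # squared Euclidean distance
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def distance_aware_loss(model_logits, soft_targets):
    """Cross-entropy against the distance-derived soft targets."""
    log_probs = model_logits - np.log(np.sum(np.exp(model_logits)))
    return -np.sum(soft_targets * log_probs)

# Tokens represent discretized 1-D coordinates 0..9; ground truth is 3.2.
token_values = np.arange(10, dtype=float)
targets = dist2loss_targets(3.2, token_values, sigma=0.5)
# Probability mass concentrates on token 3 and spreads to its neighbors,
# so predicting token 4 is penalized less than predicting token 9.
```

Unlike standard cross-entropy with one-hot targets, this loss grades errors by metric distance, which is the property the framework enforces during autoregressive token prediction.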
Jiwan Chung
Yonsei University
Computer Vision · NLP · Multimodal Learning
Saejin Kim
Yonsei University
Artificial Intelligence
Yongrae Jo
LG AI Research
Jaewoo Park
Yonsei University
Dongjun Min
Yonsei University
Youngjae Yu
Yonsei University