Exploring Cross-Modal Flows for Few-Shot Learning

📅 2025-10-16
🤖 AI Summary
Cross-modal few-shot learning suffers from highly entangled cross-modal features and from the limited alignment precision of single-step fine-tuning. To address both, this paper proposes what it describes as the first parameter-efficient fine-tuning framework that supports multi-step adjustment. The method models a cross-modal flow field and aligns visual and textual features progressively via flow matching. It further introduces a fixed class-level coupling constraint, noise-augmented training, and an early-stopping differential equation solver to improve robustness and generalization. The framework is model-agnostic and plug-and-play. Experiments across multiple benchmarks and backbone architectures show consistent, significant improvements over existing single-step fine-tuning approaches, particularly under high feature entanglement and on challenging samples.

📝 Abstract
Aligning features from different modalities is one of the most fundamental challenges in cross-modal tasks. Although pre-trained vision-language models achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based methods) selectively fine-tune a subset of parameters, which slightly adjusts either the visual or the textual features while avoiding overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform a one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach, which learns a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first use a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process early, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has a multi-step rectification ability that achieves more precise and robust alignment. Extensive results demonstrate that FMA consistently yields significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
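The abstract does not state the training objective, so the block below is only a sketch of the standard linear-path flow matching formulation that a method like FMA could build on. All symbols are assumptions introduced here for illustration: $x_0$ is an image feature, $x_1$ is the text feature of the same class (the fixed coupling), $\epsilon$ is the augmentation noise with an assumed scale $\sigma$, $v_\theta$ is the learned velocity field, $N$ is the number of solver steps, and $K \le N$ is the early-stopping step. The transport direction (image to text) is also an assumption.

```latex
% Noise-augmented source feature and straight-line interpolation
\tilde{x}_0 = x_0 + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
x_t = (1 - t)\,\tilde{x}_0 + t\,x_1, \qquad t \in [0, 1]

% Flow matching loss: regress the velocity of the straight path
\mathcal{L}(\theta) = \mathbb{E}_{t,\,(x_0, x_1),\,\epsilon}
  \left\| v_\theta(x_t, t) - (x_1 - \tilde{x}_0) \right\|^2

% Euler solver with early stopping after K of N steps
x^{(k+1)} = x^{(k)} + \frac{1}{N}\, v_\theta\!\left(x^{(k)}, \tfrac{k}{N}\right),
  \qquad k = 0, \dots, K - 1, \quad K \le N
```

With $K = N$ the solver runs the full transformation; choosing $K < N$ stops before the end of the path, which is the role the abstract assigns to the early-stopping solver.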
Problem

Research questions and friction points this paper is trying to address.

Addressing insufficient feature alignment in complex cross-modal datasets
Proposing multi-step adjustment method for precise vision-language alignment
Solving data scarcity and efficiency issues in few-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step cross-modal alignment using flow matching
Fixed coupling strategy for category correspondence
Early-stopping solver for efficiency and accuracy
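The three ideas listed above can be sketched together on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the velocity field is a hypothetical affine model fitted by least squares instead of a neural network, the features are synthetic instead of coming from a vision-language backbone, and every function name and default value (`noise_std`, `n_steps`, `stop_at`) is invented for this example.

```python
import numpy as np

def make_training_batch(img_feats, labels, text_feats, rng, noise_std=0.05):
    """Build (x_t, t, target velocity) samples for flow matching."""
    # Fixed coupling: pair every image feature with the text feature of its own class.
    x1 = text_feats[labels]
    # Noise augmentation: perturb the few available image features (noise_std is assumed).
    x0 = img_feats + noise_std * rng.normal(size=img_feats.shape)
    t = rng.uniform(size=(len(x0), 1))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path from image to text feature
    return x_t, t, x1 - x0          # the velocity of that path is constant

def design(x, t):
    """Inputs of the toy affine velocity model: [x, t, 1]."""
    return np.hstack([x, t, np.ones_like(t)])

def fit_velocity_field(img_feats, labels, text_feats, rng, n_batches=50):
    """Fit v(x, t) = [x, t, 1] @ W by least squares (stand-in for a trained network)."""
    batches = [make_training_batch(img_feats, labels, text_feats, rng)
               for _ in range(n_batches)]
    x_t, t, v = (np.vstack(parts) for parts in zip(*batches))
    W, *_ = np.linalg.lstsq(design(x_t, t), v, rcond=None)
    return W

def transport(x, W, n_steps=10, stop_at=None):
    """Euler integration of dx/dt = v(x, t); stop_at < n_steps ends the flow early."""
    stop_at = n_steps if stop_at is None else min(stop_at, n_steps)
    dt = 1.0 / n_steps
    x = np.array(x, dtype=float)    # work on a copy
    for k in range(stop_at):
        t = np.full((len(x), 1), k * dt)
        x = x + dt * (design(x, t) @ W)
    return x

def toy_demo(seed=0):
    """Return mean image-to-text distance before, after 8/10 steps, and after 10/10 steps."""
    rng = np.random.default_rng(seed)
    n_classes, dim, shots = 3, 4, 8
    text_feats = rng.normal(size=(n_classes, dim))
    labels = np.repeat(np.arange(n_classes), shots)
    # Synthetic image features: class text feature plus a shared offset and small noise.
    img_feats = text_feats[labels] + 2.0 + 0.1 * rng.normal(size=(len(labels), dim))
    W = fit_velocity_field(img_feats, labels, text_feats, rng)

    def mean_dist(x):
        return float(np.linalg.norm(x - text_feats[labels], axis=1).mean())

    return (mean_dist(img_feats),
            mean_dist(transport(img_feats, W, stop_at=8)),
            mean_dist(transport(img_feats, W)))
```

Calling `toy_demo()` returns three mean distances between image features and their paired text features: before transport, after an early stop at 8 of 10 Euler steps, and after all 10 steps. On this toy data the transported features end up closer to their class text features than the originals. Whether stopping early helps accuracy, as the abstract reports for FMA, depends on the real data and model and is not something this sketch demonstrates.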
Ziqi Jiang
Department of CSE, The Hong Kong University of Science and Technology
Yanghao Wang
Peking University
neuromorphic computing, memristor, nonlinear dynamics
Long Chen
Department of CSE, The Hong Kong University of Science and Technology