TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

📅 2025-06-01
🤖 AI Summary
This work addresses two key challenges in tabular–image multimodal learning: (1) the absence of standardized, pretrained tabular representations, and (2) insufficient robustness to missing values in tabular inputs. To this end, we propose the first multimodal fusion framework leveraging a frozen TabPFN encoder—a lightweight, inherently missing-value-aware tabular model—integrated into a cross-modal architecture. Our method combines a pretrained vision backbone (ViT or ResNet) with multiple fusion strategies: feature concatenation, attention-based weighting, and cross-modal gating. Evaluated on medical and natural-scene benchmark datasets, our approach consistently outperforms state-of-the-art methods, achieving average improvements of 3.2–7.8% in AUROC (medical) or accuracy (natural scenes) under both complete and missing-tabular-input settings. The framework demonstrates strong robustness to missing data, broad generalizability across domains, and practical deployability due to its parameter efficiency and frozen tabular encoder design.
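The three fusion strategies named above (feature concatenation, attention-based weighting, and cross-modal gating) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the random linear maps, and all variable names are illustrative assumptions; in the actual framework these weights would be learned and the embeddings would come from a frozen TabPFN encoder and a pretrained ViT/ResNet backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy embeddings projected to a shared dimension d (illustrative only).
d = 8
t = rng.standard_normal(d)   # tabular embedding (e.g. from a frozen tabular encoder)
v = rng.standard_normal(d)   # image embedding (e.g. from a vision backbone)

# 1) Feature concatenation: stack the two modalities into one vector.
fused_concat = np.concatenate([t, v])          # shape (2d,)

# 2) Attention-based weighting: softmax over per-modality scores
#    (here scored against a random probe w; learned in practice).
w = rng.standard_normal(d)
scores = np.array([t @ w, v @ w])
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax weights, sum to 1
fused_attn = alpha[0] * t + alpha[1] * v       # shape (d,)

# 3) Cross-modal gating: a sigmoid gate computed from both modalities
#    decides elementwise how much of each modality to pass through.
W_g = rng.standard_normal((d, 2 * d))          # hypothetical gating weights
g = sigmoid(W_g @ np.concatenate([t, v]))      # gate values in (0, 1)
fused_gate = g * t + (1.0 - g) * v             # shape (d,)
```

Concatenation preserves both modalities untouched, while the attention and gating variants produce a fused vector of the shared dimension, letting the model down-weight an unreliable (e.g. partially missing) tabular input.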

📝 Abstract
Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. We explore a range of fusion strategies and tabular encoders, and evaluate our approach on both natural and medical datasets. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world multimodal learning scenarios.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized pretrained representation for tabular data
Difficulty handling missing values in tabular modality
Need robust multimodal learning for tabular-image integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

TabPFN as frozen tabular encoder
Combines tabular and image features
Resilient to missing tabular data
Jiaqi Luo
School of Mathematical Sciences, Soochow University, No.1 Shizi Street, Suzhou, 215006, Jiangsu Province, China

Yuan Yuan
Zu Chongzhi Center, Duke Kunshan University, No.8 Duke Avenue, Kunshan, 215000, Jiangsu Province, China

Shixin Xu
Duke Kunshan University
machine learning, math biology, electrodynamics, moving contact lines