🤖 AI Summary
To address the significant degradation in generalization performance of foundation models under distribution shift, weak supervision, or adversarial attacks in open-world settings, this paper proposes the Object-Concept-Relation Triplet (OCRT) framework. OCRT jointly models sparse high-level semantic concepts and their higher-order relational structures via unsupervised object disentanglement, projection into a semantic concept space, construction of an importance-weighted concept graph, and iterative refinement. It is the first method to achieve *co-disentanglement* of objects, concepts, and relations, coupled with dynamic graph-based reasoning—enabling model-agnostic and task-agnostic generalization enhancement. Evaluated on SAM and CLIP, OCRT substantially improves robustness under out-of-distribution data, weak labeling, and adversarial conditions, yielding an average 12.7% performance gain across multiple downstream tasks while supporting interpretable higher-order relational reasoning.
📝 Abstract
Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.