AI Summary
To address In-Context Operator Networks' low computational efficiency, poor scalability, and inability to model two-dimensional functions on high-dimensional dense data, this work pioneers the integration of Vision Transformers into the in-context operator learning framework. We propose a multi-physics fluid modeling approach based on patch-wise function representation, enabling dynamic context construction, flexible handling of variable time steps and sparse-frame inputs, and enhanced generalization via a multi-physics pretraining paradigm. Evaluated on two benchmark datasets for compressible flow, our method reduces the normalized $L^2$ error by $40\%$ and $61.6\%$, respectively, and achieves inference three times faster than the state-of-the-art MPP model. Moreover, it significantly improves long-horizon rollout prediction efficiency.
Abstract
In-Context Operator Networks (ICONs) are models that learn operators across different types of PDEs using a few-shot, in-context approach. Although they generalize successfully to various PDEs, existing methods treat each data point as a single token and suffer from computational inefficiency when processing dense data, limiting their application in higher spatial dimensions. In this work, we propose \textit{Vision In-Context Operator Networks} (VICON), incorporating a vision transformer architecture that efficiently processes 2D functions through patch-wise operations. We evaluate our method on three fluid dynamics datasets, demonstrating both superior performance (reducing the rescaled $L^2$ error by $40\%$ and $61.6\%$ on two benchmark datasets for compressible flows, respectively) and computational efficiency (requiring only one-third of the inference time per frame) in long-term rollout predictions, compared to the current state-of-the-art sequence-to-sequence model with fixed-timestep prediction: Multiple Physics Pretraining (MPP). Compared to MPP, our method preserves the benefits of in-context operator learning, enabling flexible context formation when dealing with insufficient frame counts or varying timestep values.
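To make the "patch-wise operations" concrete: rather than treating every grid point as one token (the ICON approach the abstract criticizes as inefficient for dense 2D data), a vision-transformer front end splits each sampled 2D field into non-overlapping patches and flattens each patch into a single token. The sketch below is an illustrative NumPy patchification, not the authors' VICON code; the function name `patchify` and the shapes are assumptions for demonstration.

```python
import numpy as np

def patchify(field: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a 2D field of shape (H, W, C) into non-overlapping
    patch tokens of shape (num_patches, patch_size * patch_size * C).

    Illustrative sketch of ViT-style tokenization, not VICON's actual code.
    """
    H, W, C = field.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "grid must tile evenly into patches"
    # (H//p, p, W//p, p, C): group rows and columns into patch blocks
    blocks = field.reshape(H // p, p, W // p, p, C)
    # (H//p, W//p, p, p, C): bring the two patch-grid axes to the front,
    # then flatten each p x p x C block into one token vector
    tokens = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return tokens

# A 256 x 256 single-channel field with 16 x 16 patches yields
# 256 tokens instead of 65,536 point-tokens -- the source of the
# efficiency gain on dense 2D data.
field = np.random.rand(256, 256, 1)
tokens = patchify(field, 16)
```

Token count drops quadratically in the patch size, which is what makes transformer attention (quadratic in sequence length) tractable on dense grids.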