🤖 AI Summary
Training neural PDE solvers is hampered by the prohibitive computational cost of generating high-difficulty training samples and by poor generalization. Method: We propose a multi-difficulty pre-generated data strategy based on the 2D incompressible Navier–Stokes equations, systematically constructing a dataset that spans gradients of geometric complexity and Reynolds number; training prioritizes low- and medium-difficulty samples to reduce reliance on high-difficulty ones. Contribution/Results: Our key insight is uncovering and leveraging the regulatory role of the data difficulty distribution on model generalization. Experiments demonstrate that incorporating only a small fraction of high-difficulty samples achieves accuracy comparable to training entirely on high-difficulty data, while reducing total pre-generation computational cost by 8.9×. This approach simultaneously improves data efficiency and generalization, particularly in few-shot settings, without compromising solution fidelity.
📝 Abstract
A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than from training the model itself. Another is that there are clear axes of difficulty (e.g., more complex geometries and higher Reynolds numbers) along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Toward addressing this chicken-and-egg challenge, we study difficulty transfer on the 2D incompressible Navier–Stokes equations, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Just as compute spent pre-training foundation models improves their performance on downstream tasks, we find that classically solving (analogously, pre-generating) many low- and medium-difficulty examples and including them in the training set makes it possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low- and high-difficulty data, we can spend 8.9× less compute on pre-generating a dataset while achieving the same error as using only high-difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers. Our code is available at https://github.com/Naman-Choudhary-AI-ML/pregenerating-pde
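The core budgeting idea in the abstract can be made concrete with a small sketch: split a fixed classical-solver compute budget across difficulty tiers, where harder samples cost more to solve, and compare the resulting sample counts against an all-high-difficulty baseline. The function name, tier labels, and per-sample costs below are illustrative assumptions, not values from the paper.

```python
# Hypothetical illustration of compute allocation across difficulty tiers.
# The costs and fractions are assumptions for the sketch, not reported numbers.

def samples_per_tier(budget, fractions, cost_per_sample):
    """Split `budget` by `fractions` and convert each share into a sample count."""
    return {tier: int(budget * frac / cost_per_sample[tier])
            for tier, frac in fractions.items()}

# Assumed per-sample solver costs (arbitrary units): harder cases cost more.
cost = {"low": 1.0, "medium": 4.0, "high": 20.0}

# Mixed allocation: mostly low/medium compute, a small share for high difficulty.
mixed = samples_per_tier(1000.0, {"low": 0.5, "medium": 0.3, "high": 0.2}, cost)

# Baseline: spend the entire budget on high-difficulty samples only.
high_only = samples_per_tier(1000.0, {"high": 1.0}, cost)

print(mixed)      # {'low': 500, 'medium': 75, 'high': 10}
print(high_only)  # {'high': 50}
```

Under these assumed costs, the mixed allocation yields 585 training samples (including 10 high-difficulty ones) for the same budget that buys only 50 high-difficulty samples; the paper's finding is that such mixes can match all-high-difficulty accuracy at a fraction of the pre-generation cost.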