🤖 AI Summary
To address the dual challenges of I/O bottlenecks and catastrophic forgetting in large-scale plasma simulations, this work introduces the Streaming AI Scientist framework, which pioneers an *in-transit* machine-learning paradigm with tight simulation–learning coupling. It bypasses filesystem I/O via zero-copy in-memory data streaming, enabling real-time co-execution of simulation and ML training. An asynchronous feature-transformation pipeline and an experience-replay-based continual-learning mechanism mitigate catastrophic forgetting on non-stationary physical processes. The framework supports cross-language, zero-modification integration with existing simulation codes. Technically, it combines GPU acceleration (PIConGPU), streaming pipelines, asynchronous memory transfers, and optimizations for the Frontier exascale supercomputer. Evaluated on a thousand-GPU Kelvin–Helmholtz instability workflow on Frontier, it reduces I/O overhead by 90%, achieves storage-free, sub-second model updates, and enables online recognition of physical patterns.
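The in-transit pattern described above can be sketched as a producer–transform–consumer pipeline over bounded in-memory queues: the simulation emits raw field chunks, a transform stage converts them to features off the simulation's critical path, and the trainer consumes feature batches, with no filesystem in the loop. This is an illustrative sketch only; all names (`simulation`, `transform`, queue sizes, feature choice) are hypothetical and do not reflect PIConGPU's or the framework's actual API.

```python
# Hypothetical sketch of the in-transit pipeline: simulation -> transform -> trainer,
# coupled via bounded in-memory queues (no filesystem I/O). All names are illustrative.
import queue
import threading
import numpy as np

raw_q = queue.Queue(maxsize=8)    # simulation -> transform (bounded: provides backpressure)
feat_q = queue.Queue(maxsize=8)   # transform  -> trainer

def simulation(n_steps=4):
    """Stand-in for the simulation: emits one 2-D field slice per step."""
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        raw_q.put(rng.normal(size=(64, 64)))
    raw_q.put(None)  # end-of-stream sentinel

def transform():
    """Feature extraction runs asynchronously to the simulation."""
    while (chunk := raw_q.get()) is not None:
        feat = np.stack([chunk.mean(axis=0), chunk.std(axis=0)], axis=1)
        feat_q.put(feat)
    feat_q.put(None)

threads = [threading.Thread(target=simulation), threading.Thread(target=transform)]
for t in threads:
    t.start()

batches = []
while (feat := feat_q.get()) is not None:
    batches.append(feat)  # a real workflow would run a training step here
for t in threads:
    t.join()
print(len(batches), batches[0].shape)
```

Bounded queues are the key design choice: if training falls behind, the simulation blocks on `put` instead of accumulating unbounded in-memory data.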
📝 Abstract
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive I/O and storage challenges for analysis. Deep-learning-based techniques in particular make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file-system bottleneck. Data is transformed in transit, asynchronously to both the simulation and the training of the model. With the presented workflow, data operations can be performed in common, easy-to-use programming languages, freeing the application user from adapting the application's output routines. As a proof of concept, we consider a GPU-accelerated particle-in-cell (PIConGPU) simulation of the Kelvin–Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting when learning from this non-steady process in a continual manner. We detail the challenges addressed while porting and scaling to the Frontier exascale system.
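The experience-replay idea mentioned in the abstract can be sketched as follows: each training batch mixes freshly streamed samples with samples replayed from a bounded buffer, so that earlier regimes of the non-stationary process keep contributing gradients. The sketch below uses reservoir sampling to keep the buffer an unbiased sample of everything seen; the class, buffer size, and batch mix are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative experience-replay buffer for continual learning on a
# non-stationary stream. Reservoir sampling keeps the bounded buffer an
# unbiased sample of the entire stream seen so far.
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Reservoir sampling: every sample ever seen has equal
        probability capacity/seen of being in the buffer."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for x in np.linspace(0.0, 1.0, 1000):  # stand-in for a drifting data stream
    buf.add(x)
    batch = [x] + buf.sample(7)  # 1 fresh sample + up to 7 replayed ones
    # model.train_step(batch) would go here

# The buffer still holds samples from the early part of the stream (x < 0.5),
# which is what counteracts catastrophic forgetting.
old = sum(1 for s in buf.buffer if s < 0.5)
print(len(buf.buffer), old)
```

Without replay, training only on the most recent stream samples would let gradients from the current regime overwrite what the model learned from earlier ones.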