🤖 AI Summary
To address the high policy-learning variance and unstable convergence that heterogeneous error distributions across fidelity levels cause in multi-fidelity reinforcement learning (MF-RL) for engineering design optimization, this paper proposes a non-hierarchical, adaptive MF-RL framework. Instead of relying on manual fidelity scheduling, it introduces a novel low-fidelity policy alignment mechanism that combines policy alignment evaluation, experience transfer reweighting, adaptive sampling control, and co-training, so that heterogeneous low-fidelity models are exploited dynamically alongside the high-fidelity model. Evaluated on an octocopter design optimization task, the framework reduces policy-learning variance by 42% and accelerates convergence by 3.1× relative to conventional hierarchical MF-RL methods, while improving the consistency of solution quality and removing the overhead of manual scheduling.
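The core idea, per the summary above, is to score each low-fidelity policy by how well it agrees with the current high-fidelity policy and to use those scores to control sampling. The following is a minimal illustrative sketch of one way such alignment scoring and adaptive sampling weights could look; the function names, the KL-based score, and the softmax weighting are assumptions for illustration, not the paper's actual method.

```python
# Illustrative sketch (not the paper's implementation): each low-fidelity (LF)
# policy is scored by how closely its action distribution matches the current
# high-fidelity (HF) policy on a probe set of design states, and the scores
# are turned into per-model sampling weights.
import numpy as np

def alignment_score(hf_policy, lf_policy, probe_states):
    """Mean negative KL divergence between HF and LF action distributions."""
    kls = []
    for s in probe_states:
        p = hf_policy(s)  # HF action probabilities, shape (n_actions,)
        q = lf_policy(s)  # LF action probabilities, shape (n_actions,)
        kls.append(np.sum(p * np.log((p + 1e-8) / (q + 1e-8))))
    return -float(np.mean(kls))  # higher = better aligned

def sampling_weights(hf_policy, lf_policies, probe_states, temperature=1.0):
    """Softmax over alignment scores -> per-model sampling probabilities."""
    scores = np.array([alignment_score(hf_policy, lf, probe_states)
                       for lf in lf_policies])
    z = np.exp((scores - scores.max()) / temperature)
    return z / z.sum()
```

In this sketch, a lower temperature concentrates sampling on the best-aligned low-fidelity model, while a higher temperature keeps all low-fidelity models in play.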
📝 Abstract
Multi-fidelity Reinforcement Learning (RL) frameworks use computational resources efficiently by integrating analysis models of varying accuracy and cost. The prevailing methodologies, characterized by transfer learning, human-inspired strategies, control variate techniques, and adaptive sampling, predominantly depend on a structured hierarchy of models. However, this reliance on a model hierarchy can exacerbate variance in policy learning when the underlying models exhibit heterogeneous error distributions across the design space. To address this challenge, this work proposes a novel adaptive multi-fidelity RL framework in which multiple heterogeneous, non-hierarchical low-fidelity models are dynamically leveraged alongside a high-fidelity model to efficiently learn a high-fidelity policy. Specifically, low-fidelity policies and their experience data are used adaptively for efficient targeted learning, guided by their alignment with the high-fidelity policy. The effectiveness of the approach is demonstrated on an octocopter design optimization problem, using two low-fidelity models alongside a high-fidelity simulator. The results show that the proposed approach substantially reduces variance in policy learning, leading to improved convergence and consistently high-quality solutions relative to traditional hierarchical multi-fidelity RL methods. Moreover, the framework eliminates the need to manually tune model usage schedules, which can otherwise introduce significant computational overhead. This positions the framework as an effective variance-reduction strategy for multi-fidelity RL, while also mitigating the computational and operational burden of manual fidelity scheduling.
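The abstract describes alignment-guided, adaptive use of low-fidelity experience during co-training. As a rough sketch of how such a co-training step might be wired together, the snippet below mixes alignment-weighted low-fidelity transitions with a small high-fidelity budget; it assumes the hypothetical `sampling_weights` helper from the sketch above, and `collect` and `update_policy` are placeholder callables, not an API from the paper.

```python
# Hypothetical co-training step: low-fidelity transitions are gathered in
# proportion to each model's alignment weight, their contribution to the
# policy update is scaled by the same weight, and a small budget of
# high-fidelity transitions anchors the update.
def cotrain_step(hf_env, lf_envs, hf_policy, lf_policies, probe_states,
                 collect, update_policy, lf_budget=256, hf_budget=32):
    w = sampling_weights(hf_policy, lf_policies, probe_states)
    batch, importance = [], []
    # Alignment-weighted experience transfer from the low-fidelity models.
    for weight, env, policy in zip(w, lf_envs, lf_policies):
        n = int(round(weight * lf_budget))
        transitions = collect(env, policy, n)  # e.g. list of (s, a, r, s')
        batch.extend(transitions)
        importance.extend([weight] * len(transitions))
    # A small amount of high-fidelity experience keeps the policy grounded.
    hf_transitions = collect(hf_env, hf_policy, hf_budget)
    batch.extend(hf_transitions)
    importance.extend([1.0] * len(hf_transitions))
    return update_policy(hf_policy, batch, importance)
```

Because the weights are recomputed from alignment at every step, the schedule of model usage emerges from the data rather than from a hand-tuned fidelity schedule, which is the variance-reduction and overhead argument made in the abstract.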