🤖 AI Summary
Existing DRL-based wireless video adaptive streaming methods suffer from limited QoE due to reliance solely on high-level, delayed state representations. To address this, we propose a cross-layer awareness framework that integrates real-time physical- and link-layer states. We formulate an infinite-horizon discounted MDP model incorporating information-cost trade-offs, design the first eA3C offline training algorithm leveraging low-layer network states, and introduce two user-personalized online continual learning mechanisms for runtime policy refinement. Experiments show that the offline policy improves QoE by 6.8%–14.4% over state-of-the-art baselines; subsequent online optimization further boosts QoE by 6%–28%, significantly enhancing real-time responsiveness and personalization capability.
📝 Abstract
Deep reinforcement learning (DRL) has demonstrated promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming use only application (APP) layer information, adopt heuristic training methods, and train generalized neural networks with pre-collected data. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using lower-layer information, deriving a rigorous training method, and adopting online tuning with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite-horizon discounted Markov decision process (MDP) by additionally incorporating past and lower-layer information, allowing a flexible trade-off between QoE and the costs of obtaining system information and solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method that jointly optimizes the parameters of the parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information, and we jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user, offering different trade-offs between QoE and training time. Finally, experimental results show that the proposed offline policy improves QoE by 6.8%–14.4% over state-of-the-art methods in the offline scenario, and the proposed online policies achieve a further 6%–28% QoE gain over the proposed offline policy in the online scenario.
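To make the joint policy/value optimization concrete, below is a minimal advantage actor-critic sketch: a linear-softmax policy over bitrate actions and a linear value function are updated together from a single transition, with the TD error serving as the advantage estimate. This is an illustrative toy, not the paper's eA3C architecture; the feature layout (throughput, SNR, buffer, last bitrate), action count, and hyperparameters are all assumptions for the example.

```python
import numpy as np

# Assumed toy dimensions: 4 cross-layer state features (e.g. throughput,
# SNR, buffer level, last bitrate) and 3 candidate bitrate levels.
N_FEATURES = 4
N_ACTIONS = 3
GAMMA = 0.99          # discount factor of the infinite-horizon MDP
LR_PI, LR_V = 1e-2, 1e-1

W_pi = np.zeros((N_ACTIONS, N_FEATURES))  # linear-softmax policy parameters
w_v = np.zeros(N_FEATURES)                # linear value-function parameters

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(s, a, r, s_next, terminal=False):
    """Jointly update policy and value parameters on one transition.

    The TD error r + GAMMA * V(s') - V(s) is used as the advantage
    estimate; it drives both the policy-gradient (actor) step and the
    value-regression (critic) step.
    """
    global W_pi, w_v
    v_next = 0.0 if terminal else w_v @ s_next
    advantage = r + GAMMA * v_next - w_v @ s
    probs = softmax(W_pi @ s)
    # gradient of log pi(a|s) for a linear-softmax policy:
    # d/dW[k] log pi(a|s) = (1{k == a} - probs[k]) * s
    grad_log_pi = -np.outer(probs, s)
    grad_log_pi[a] += s
    W_pi = W_pi + LR_PI * advantage * grad_log_pi  # actor step
    w_v = w_v + LR_V * advantage * s               # critic step
    return advantage

# Repeated updates on one synthetic terminal transition: the advantage
# (TD error) shrinks as the value estimate absorbs the observed reward.
s = np.array([1.0, 0.5, 0.25, 0.1])
for _ in range(100):
    adv = actor_critic_step(s, a=1, r=1.0, s_next=np.zeros(N_FEATURES),
                            terminal=True)
print(round(adv, 4))
```

In A3C-style training, many worker threads run this update asynchronously against shared parameters; the sketch above shows only the per-transition joint update that both the offline eA3C training and the online tuning build on.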