🤖 AI Summary
This study investigates whether large language models (LLMs) can improve commodity ETF portfolio performance when the information set and execution protocol are held fixed. The authors propose the first multi-agent LLM framework tailored to commodity ETFs, in which hawkish, dovish, and debate (deliberative) LLM agents and a deterministic rule-based agent each generate allocation signals from the same macroeconomic z-score input and feed them into a shared rebalancing engine. Over 124 weekly rebalancing dates, the hawkish and debate agents improve on the rule-based agent's Sharpe ratio by 0.044 and 0.040, respectively (both p < 0.10, unadjusted for multiple comparisons), and retain a net-of-cost advantage over the passive inverse-volatility benchmark at one-way transaction costs of up to 30 basis points. The outperformance is concentrated in the soft-landing sub-period, and the debate mechanism serves mainly as a bias-correction device rather than a source of additional return. Within a sample that covers a single rate cycle, the results suggest that LLMs acting as constrained macroeconomic interpretation functions add modest incremental value over a transparent rule.
📝 Abstract
We test whether large language models (LLMs) add value in commodity portfolio construction when the information set and implementation rules are held fixed across strategies. A Hawkish Agent (inflation-tightening prior), a Dovish Agent (growth-easing prior), a Debate Agent, and a deterministic z-score Rule Agent each receive identical FRED macro z-scores and route their tilt signals through the same portfolio engine. Across 124 weekly rebalancing dates spanning the 2023 U.S. rate peak and the 2024-2025 soft landing, all three LLM strategies outperform the Rule Agent in Sharpe terms; the Hawkish and Debate Agents record the largest gains (ΔSharpe = +0.044 and +0.040, both p < 0.10 under a block bootstrap) and preserve a net-of-cost advantage over the passive inverse-volatility benchmark at one-way trading costs up to 30 basis points, while the Rule Agent's thin margin over passive disappears at approximately 5 basis points. The Debate Agent does not outperform the best single agent (ΔSharpe = -0.004, p = 0.769); its contribution is bias correction (averaging out the Dovish Agent's miscalibrated prior) rather than deliberation-generated return. The performance advantage is concentrated in the soft-landing sub-period, the evaluation window spans a single rate cycle, and the reported p-values are unadjusted for multiple comparisons. Within these limits, the results suggest that an LLM acting as a constrained macro-interpretation function can add modest but economically meaningful value over a transparent rule layer, though the margin is small and its persistence beyond this sample is unknown.
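To make the ΔSharpe comparison concrete, the following is a minimal sketch of how a block-bootstrap p-value for a Sharpe difference between two weekly return series could be computed. It is not the authors' implementation: the abstract does not state the bootstrap variant, block length, number of resamples, or annualization convention, so the circular blocks, `block_len=4`, `n_boot=2000`, the 52-week annualization, and the function names `sharpe` and `block_bootstrap_delta_sharpe` are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev

WEEKS_PER_YEAR = 52  # assumed annualization factor for weekly returns


def sharpe(returns, periods_per_year=WEEKS_PER_YEAR):
    """Annualized Sharpe ratio; returns are assumed to be excess returns already."""
    sd = stdev(returns)
    if sd == 0:
        return 0.0  # avoid division by zero on a constant series
    return mean(returns) / sd * math.sqrt(periods_per_year)


def block_bootstrap_delta_sharpe(strategy, baseline, block_len=4, n_boot=2000, seed=0):
    """Circular block bootstrap for Sharpe(strategy) - Sharpe(baseline).

    Both series are resampled with the same block indices, so serial
    dependence within blocks and the pairing between the two strategies
    are preserved. Returns (observed_delta, two_sided_p_value).
    """
    if len(strategy) != len(baseline):
        raise ValueError("return series must have equal length")
    n = len(strategy)
    rng = random.Random(seed)
    observed = sharpe(strategy) - sharpe(baseline)
    n_blocks = math.ceil(n / block_len)

    deltas = []
    for _ in range(n_boot):
        idx = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            # Wrap around the end of the sample (circular blocks).
            idx.extend((start + k) % n for k in range(block_len))
        idx = idx[:n]  # trim to the original sample length
        s = [strategy[i] for i in idx]
        b = [baseline[i] for i in idx]
        deltas.append(sharpe(s) - sharpe(b))

    # Center the bootstrap distribution on the observed delta to approximate
    # the null of no difference, then count draws at least as extreme.
    extreme = sum(1 for d in deltas if abs(d - observed) >= abs(observed))
    p_value = (extreme + 1) / (n_boot + 1)
    return observed, p_value
```

With 124 weekly observations per strategy, a call such as `block_bootstrap_delta_sharpe(hawkish_returns, rule_returns)` (both names hypothetical) would return the observed Sharpe difference and a two-sided p-value. As the abstract notes for its own p-values, a value produced this way is not adjusted for the number of strategy pairs compared.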