AI Summary
This study addresses a challenging sequential resource allocation problem in public health, characterized by complex objectives and sparse preference data. We propose DPO-PRO, a novel algorithm that integrates Direct Preference Optimization (DPO) with a lightweight Distributionally Robust Optimization (DRO) formulation, enabling efficient and robust reward modeling without self-reflection mechanisms. Our method fine-tunes large language models using human preferences expressed in natural language, significantly improving robustness to noisy preference signals. Compared to existing DRO-based approaches, DPO-PRO is markedly less conservative while balancing modeling accuracy and inference efficiency. Experiments on a real-world maternal mobile health deployment and on standard alignment benchmarks demonstrate performance competitive with self-reflection baselines at substantially reduced inference cost. DPO-PRO thus offers a scalable paradigm for value alignment in low-resource settings.
Abstract
We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO) that accounts for uncertainty in the preference distribution via a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design, while incurring significantly lower inference-time cost.
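For context, the standard DPO objective that DPO-PRO builds on is shown below, together with a generic DRO wrapper over the preference distribution. The wrapper is an illustrative sketch of how distributional robustness is typically imposed; the paper's specific lightweight formulation (and its divergence choice and radius \(\rho\)) may differ.

```latex
% Standard DPO objective over preference triples (x, y_w, y_l):
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
  \left[\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]

% Illustrative DRO formulation (hypothetical): minimize the worst-case
% expected DPO loss over distributions Q within an f-divergence ball
% of radius \rho around the empirical preference distribution \mathcal{D}:
\min_{\theta}\; \sup_{Q:\; D_f(Q \,\|\, \mathcal{D}) \le \rho}\;
  \mathbb{E}_{(x,\, y_w,\, y_l)\sim Q}
  \left[\ell_{\mathrm{DPO}}(\theta;\, x,\, y_w,\, y_l)\right]
```

Here \(\pi_\theta\) is the fine-tuned policy, \(\pi_{\mathrm{ref}}\) the reference model, \(\sigma\) the logistic function, and \(\beta\) the DPO temperature; the inner supremum reweights hard or noisy preference pairs, which is what yields robustness to label noise at the cost of some conservatism.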