🤖 AI Summary
This work addresses the lack of lightweight, incentive-compatible mechanisms for multidimensional output quality evaluation in decentralized large language model inference networks. The authors propose a systematic quality scoring framework that decomposes output quality into multiple dimensions—namely, model and cost priors, structural quality, semantic quality, query-output alignment, and consistency/uncertainty. A calibrated, dimensionally filtered, and weighted fusion of these components yields a composite quality signal, which is integrated into a Proof-of-Quality (PoQ) incentive mechanism. Experimental results demonstrate that the calibrated composite score matches or exceeds the performance of the best individual evaluators and consensus-based baselines on question-answering and summarization tasks, while significantly enhancing system robustness under adversarial attacks. The study further reveals, for the first time, the task-dependent nature and potential negative correlations among multidimensional quality metrics.
📝 Abstract
Decentralized large language model (LLM) inference networks can pool heterogeneous compute to scale serving, but they require lightweight and incentive-compatible mechanisms to assess output quality. Prior work introduced cost-aware Proof of Quality (PoQ) and adaptive robust PoQ to allocate rewards under evaluator heterogeneity and adversarial behavior. In this paper, we focus on the quality signal itself and propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions, including model and cost priors, structural quality, semantic quality, query-output alignment, and agreement/uncertainty. Using logged outputs from QA and summarization tasks, we systematically audit dimension reliability and show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration. While the default composite underperforms a strong single semantic evaluator, ablations reveal that removing unreliable dimensions and re-normalizing weights yields a calibrated composite that matches or exceeds the best single-evaluator and consensus baselines. Finally, we integrate the composite score as a drop-in quality signal in PoQ and demonstrate complementary benefits with robust aggregation and adaptive trust weighting under adversarial evaluator attacks.
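The calibration step described above, dropping unreliable dimensions and re-normalizing the remaining weights before fusing, can be sketched as follows. This is a minimal illustration under assumed names: the dimension labels, weights, and the `composite_score` helper are hypothetical, not the paper's implementation.

```python
def composite_score(scores, weights, reliable):
    """Weighted fusion of per-dimension quality scores after
    reliability filtering: unreliable dimensions are removed and
    the surviving weights are re-normalized to sum to 1."""
    kept = {d: w for d, w in weights.items() if d in reliable}
    total = sum(kept.values())
    norm = {d: w / total for d, w in kept.items()}  # re-normalized weights
    return sum(norm[d] * scores[d] for d in norm)

# Illustrative per-dimension scores in [0, 1] (values are made up).
scores = {
    "model_cost_prior": 0.70,
    "structural": 0.80,
    "semantic": 0.90,
    "query_alignment": 0.85,
    "agreement": 0.60,
}
weights = {d: 0.2 for d in scores}  # default uniform weights

# Ablation: suppose the agreement dimension proved unreliable
# for this task, so it is filtered out before fusion.
reliable = {"model_cost_prior", "structural", "semantic", "query_alignment"}
q = composite_score(scores, weights, reliable)
```

The key design point is that filtering happens before normalization, so the composite remains a convex combination of the retained dimensions and stays comparable across tasks with different reliable subsets.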