🤖 AI Summary
To address the performance degradation that large language models (LLMs) suffer in domain-specific applications such as finance, healthcare, and agriculture, caused by train–test distribution shift, this paper proposes SyTTA, a label-free test-time adaptation framework. SyTTA refines the model's generation strategy during inference by jointly leveraging input-side perplexity and output-side predictive entropy as complementary uncertainty signals. It achieves efficient online adaptation with only four additional tokens per query, eliminating any reliance on labeled data or task-specific fine-tuning, and it is architecture-agnostic, supporting both open- and closed-weight LLMs. On agricultural question answering, SyTTA improves ROUGE-Lsum by over 120% on Qwen-2.5-7B, underscoring the effectiveness and practicality of unsupervised test-time adaptation for specialized domains.
📝 Abstract
Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves ROUGE-Lsum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.
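The two uncertainty signals the abstract describes have standard definitions that can be sketched directly. The snippet below is illustrative only — the function names are mine, and how SyTTA actually combines these signals to adapt generation is not specified in the abstract:

```python
import math

def perplexity(token_logprobs):
    """Input-side signal: perplexity of the prompt under the model.

    Takes the model's per-token log-probabilities for the input text;
    high perplexity indicates mismatch with domain-specific terminology.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def predictive_entropy(next_token_probs):
    """Output-side signal: Shannon entropy of the next-token distribution.

    High entropy indicates diffuse, unstable token probabilities
    during generation.
    """
    return -sum(p * math.log(p) for p in next_token_probs if p > 0)

# A prompt whose tokens each get probability 0.5 has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)

# A uniform next-token distribution over 4 candidates has entropy ln(4),
# the maximum for 4 outcomes; a peaked distribution scores much lower.
h_uniform = predictive_entropy([0.25, 0.25, 0.25, 0.25])
h_peaked = predictive_entropy([0.97, 0.01, 0.01, 0.01])
```

Both quantities are computable from model logits alone, which is what makes a fully label-free adaptation signal possible.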