EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that small language models have in using dispersed evidence in long-context question answering. Existing within-context retrieval methods expose candidate evidence at the input level but do not adapt query-side attention, while lightweight test-time adaptation methods such as query-only test-time training (qTTT) do not identify which context positions support the answer. The authors propose a framework that combines within-context retrieval with test-time training: evidence chunks are first selected, then converted into a soft attention supervision target over their token positions, and that target guides the adaptation of query-side attention parameters. The adapted model still generates its answer from the original full context rather than from the retrieved chunks alone. Across six LongBench QA tasks and three small decoder-only language models, the approach achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT.

📝 Abstract

Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
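The abstract does not spell out how the soft attention target is built or which loss drives the adaptation, so the following is only a minimal sketch under stated assumptions. It assumes the target places near-uniform mass on evidence token positions with a small smoothing term elsewhere, that the alignment loss is a cross-entropy between that target and a single softmax attention distribution, and that "query-side adaptation" can be illustrated by gradient steps on a single query vector with keys held fixed. The names `soft_attention_target`, `alignment_loss`, `adapt_query`, and the `smoothing` parameter are hypothetical and do not come from the paper.

```python
import math

def soft_attention_target(context_len, evidence_spans, smoothing=0.05):
    """Assumed target: uniform mass on evidence tokens plus a small uniform floor."""
    mask = [0.0] * context_len
    for start, end in evidence_spans:  # half-open token spans [start, end)
        for i in range(max(0, start), min(context_len, end)):
            mask[i] = 1.0
    n_evidence = sum(mask)
    if n_evidence == 0:  # no evidence selected: fall back to a uniform target
        return [1.0 / context_len] * context_len
    return [(1.0 - smoothing) * m / n_evidence + smoothing / context_len for m in mask]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys):
    """Dot-product attention of one query over all context keys."""
    return softmax([sum(k_j * q_j for k_j, q_j in zip(k, query)) for k in keys])

def alignment_loss(query, keys, target):
    """Cross-entropy between the target and the query's attention distribution."""
    probs = attention(query, keys)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def adapt_query(query, keys, target, lr=0.5, steps=20):
    """Gradient descent on the query only; keys (the context) stay untouched."""
    q = list(query)
    for _ in range(steps):
        probs = attention(q, keys)
        # d(loss)/d(score_i) = probs_i - target_i, and score_i = keys_i . q
        grad = [sum((p - t) * k[j] for p, t, k in zip(probs, target, keys))
                for j in range(len(q))]
        q = [q_j - lr * g for q_j, g in zip(q, grad)]
    return q

# Toy context of four token positions; position 1 is the selected evidence.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
target = soft_attention_target(len(keys), [(1, 2)])
query = [0.0, 0.0]
adapted = adapt_query(query, keys, target)
print(alignment_loss(query, keys, target), alignment_loss(adapted, keys, target))
print(attention(adapted, keys))
```

In this toy setup the full set of keys remains available after adaptation, which mirrors the abstract's point that the evidence shapes where attention goes without replacing the original context.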

Problem

Research questions and friction points this paper is trying to address.

long-context question answering

evidence localization

test-time adaptation

attention alignment

small language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time training

long-context QA

evidence alignment