🤖 AI Summary
This work addresses the lack of efficient, end-to-end, reference-free quality estimation (QE) methods for speech translation by proposing HydraQE, a system built on the Qwen3-ASR backbone. The model takes the source audio and the translation hypothesis as joint input, fuses hidden states from all backbone layers with a learnable sparsemax-based scalar mix, and re-encodes the result with a lightweight bidirectional Transformer to enable cross-modal interaction. To mitigate the scarcity of human-annotated data, it trains three prediction heads (human direct assessment, MetricX-24 pseudo-labels, and xCOMET pseudo-labels) with a curriculum that starts on synthetic and pseudo-labeled data and gradually shifts toward human-annotated examples. Submitted to the IWSLT 2026 Speech Translation Metrics shared task, the system outperforms cascaded text-based baselines and prior direct speech QE systems.
📝 Abstract
We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
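The sparsemax scalar mix mentioned in the abstract can be illustrated with a minimal sketch. This is not HydraQE's implementation: the function names `sparsemax` and `scalar_mix`, the `gamma` scale, and the use of plain Python lists in place of tensors are all assumptions made for illustration. The sketch only shows the general idea that sparsemax, unlike softmax, can assign exactly zero weight to some layers.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex.

    Standard sort-based algorithm: find the largest k such that
    1 + k * z_(k) > sum of the k largest scores, then subtract the
    resulting threshold tau and clip at zero.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:
            # k is still inside the support; update the threshold.
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def scalar_mix(layer_states, layer_scores, gamma=1.0):
    """Weighted sum of per-layer vectors with sparsemax weights.

    layer_states: one vector (list of floats) per backbone layer.
    layer_scores: one score per layer (learnable in a real model).
    gamma: hypothetical global scale, as in common scalar-mix layers.
    """
    weights = sparsemax(layer_scores)
    mixed = [0.0] * len(layer_states[0])
    for w, state in zip(weights, layer_states):
        if w == 0.0:
            continue  # layers outside the support contribute nothing
        for i, value in enumerate(state):
            mixed[i] += w * value
    return [gamma * m for m in mixed], weights


# Toy example: three "layers" with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mixed, weights = scalar_mix(states, layer_scores=[1.0, 0.8, -1.0])
print(weights, mixed)
```

In the toy example the third layer has a low score and receives exactly zero weight, so its hidden state does not affect the mixed representation. In a trained model the scores would be learned parameters and the mixing would be applied per token to tensors.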