
arxiv: 2606.08748 · v1 · pith:L2NB3I2Lnew · submitted 2026-06-07 · 💻 cs.CL

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

Pith reviewed 2026-06-27 18:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords speech translation · quality estimation · end-to-end models · reference-free evaluation · IWSLT shared task · machine translation metrics · curriculum learning · pseudo-labeling

The pith

HydraQE shows end-to-end speech translation quality estimation can compete with cascaded text-based systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation that accepts source audio and a translation hypothesis as joint input. It mixes hidden states from all layers of a Qwen3-ASR backbone using a learnable sparsemax scalar mix, re-encodes them with a lightweight bidirectional Transformer for cross-modal interaction, and applies three prediction heads trained on human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To cope with limited human data, the system trains first on synthetically corrupted examples and silver pseudo-labeled machine translation outputs, then gradually shifts, via a curriculum, toward human-annotated examples. The abstract reports that the resulting system outperforms both cascaded text-based baselines and earlier direct speech QE systems. A sympathetic reader would care because the work indicates that direct audio-based evaluation can match or exceed pipelines that first convert speech to text before estimating quality.

Core claim

HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.

What carries the argument

Joint audio-plus-hypothesis input to a Qwen3-ASR backbone whose layers are mixed by learnable sparsemax scalars, re-encoded by a bidirectional Transformer, and passed to three independent heads trained on complementary supervision signals.
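
To make the layer-mixing step concrete, here is a minimal sketch of a sparsemax scalar mix in plain Python. It is not the authors' implementation: the function names `sparsemax` and `scalar_mix` and the nested-list layout (layers × positions × features) are assumptions chosen for illustration, and a real system would use tensors with the logits as trainable parameters. Sparsemax projects the per-layer logits onto the probability simplex, so some layers can receive exactly zero weight.

```python
from typing import List


def sparsemax(logits: List[float]) -> List[float]:
    """Euclidean projection of the logits onto the probability simplex."""
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # Position k stays in the support while 1 + k * z_(k) exceeds
        # the sum of the k largest logits; keep the last threshold found.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(v - tau, 0.0) for v in logits]


def scalar_mix(layer_states: List[List[List[float]]],
               logits: List[float]) -> List[List[float]]:
    """Weighted sum of per-layer hidden states (layers x positions x features)."""
    weights = sparsemax(logits)
    n_positions = len(layer_states[0])
    dim = len(layer_states[0][0])
    mixed = [[0.0] * dim for _ in range(n_positions)]
    for w, layer in zip(weights, layer_states):
        if w == 0.0:
            continue  # layers pruned by sparsemax contribute nothing
        for t in range(n_positions):
            for d in range(dim):
                mixed[t][d] += w * layer[t][d]
    return mixed


# One dominant logit drives every other layer weight to exactly zero.
print(sparsemax([2.0, 1.0, 0.1]))  # [1.0, 0.0, 0.0]
```

Unlike softmax, which always assigns every layer a positive weight, sparsemax can switch layers off entirely, which is presumably what makes the learned weights easy to inspect layer by layer.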

If this is right

  • End-to-end models can achieve higher performance than cascaded text-based systems for speech translation quality estimation.
  • Multi-head training on human assessments plus automatic metrics allows effective use of scarce labeled data.
  • Curriculum learning from synthetic and silver data to human annotations supports generalization despite limited annotations.
  • Direct audio input removes the need for an intermediate transcription step in the evaluation pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-input architecture could be tested on quality estimation for other speech-related tasks such as summarization or captioning.
  • Removing the cascaded transcription step may reduce error accumulation in live monitoring of speech translation quality.
  • The curriculum approach might transfer to other low-resource multimodal estimation problems where human labels are expensive.

Load-bearing premise

Training on synthetically corrupted examples, silver pseudo-labeled outputs, and a curriculum that shifts toward human annotations will produce a model that generalizes to the unseen IWSLT 2026 test distribution without overfitting to the pseudo-label sources.
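
The curriculum in this premise can be pictured as a sampling schedule. The sketch below is one hypothetical way to do it, not the schedule used in the paper: the linear shape, the 0.0 and 1.0 endpoints, and the names `human_fraction` and `sample_batch` are all assumptions. The page itself notes that the exact schedule and mixing ratios are not given in the abstract.

```python
import random
from typing import List, Sequence


def human_fraction(step: int, total_steps: int,
                   start: float = 0.0, end: float = 1.0) -> float:
    """Probability of drawing a human-annotated example at a given step.

    Assumed linear ramp from `start` to `end` over training.
    """
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(step: int, total_steps: int,
                 human_pool: Sequence[str], silver_pool: Sequence[str],
                 batch_size: int, rng: random.Random) -> List[str]:
    """Draw a batch whose human/silver mix follows the schedule."""
    p_human = human_fraction(step, total_steps)
    return [
        rng.choice(human_pool) if rng.random() < p_human
        else rng.choice(silver_pool)
        for _ in range(batch_size)
    ]


rng = random.Random(0)
early = sample_batch(0, 11, ["human"], ["silver"], 4, rng)   # all silver
late = sample_batch(10, 11, ["human"], ["silver"], 4, rng)   # all human
```

Whether training should end on purely human data, as this sketch does, or keep some silver examples to the end is exactly the kind of detail that bears on the overfitting concern above.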

What would settle it

If HydraQE scores below the cascaded text-based baselines on the official IWSLT 2026 test set, the claim that end-to-end speech translation QE is competitive would be falsified.
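
A comparison of this kind is usually made by correlating each system's scores with human judgments. The sketch below computes Pearson and Spearman correlations in plain Python. It assumes these are the relevant measures; the official shared-task metric is not stated on this page. The score lists are invented toy values, not results from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only.
human_scores = [10.0, 20.0, 30.0, 40.0]
system_a = [1.0, 4.0, 9.0, 16.0]    # same ordering as the human scores
system_b = [3.0, 1.0, 4.0, 2.0]     # scrambled ordering
a_beats_b = spearman(human_scores, system_a) > spearman(human_scores, system_b)
```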

Figures

Figures reproduced from arXiv: 2606.08748 by Eric Fosler-Lussier, Kevin Krahn.

Figure 1: HydraQE architecture: Source audio and trans…
Figure 2: Learned scalar-mix weights of layer-wise mixing module, shown as deviation from uniform weighting
Figure 3: Distributions of human scores and predicted scores from each HydraQE head on the IWSLT 2026 dev set.
Original abstract

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
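
One way to train three heads when most examples carry only some of the labels is a masked multi-task loss. The sketch below is illustrative only: the abstract does not state the loss function, the head weighting, or which examples carry which labels. Mean squared error, equal head weights, and the names `HEADS` and `multi_head_mse` are all assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical head names for the three supervision signals.
HEADS = ("da", "metricx", "xcomet")


def multi_head_mse(predictions: List[Dict[str, float]],
                   targets: List[Dict[str, Optional[float]]]) -> float:
    """Mean squared error per head, skipping examples without that label.

    Heads with no labeled example in the batch are left out of the average.
    """
    head_losses = []
    for head in HEADS:
        errors = [
            (pred[head] - tgt[head]) ** 2
            for pred, tgt in zip(predictions, targets)
            if tgt.get(head) is not None
        ]
        if errors:
            head_losses.append(sum(errors) / len(errors))
    return sum(head_losses) / len(head_losses) if head_losses else 0.0


# A human-annotated example (DA only) next to a silver example (pseudo-labels only).
preds = [
    {"da": 0.5, "metricx": 0.2, "xcomet": 0.9},
    {"da": 0.1, "metricx": 0.4, "xcomet": 0.6},
]
labels = [
    {"da": 1.0, "metricx": None, "xcomet": None},
    {"da": None, "metricx": 0.4, "xcomet": 0.8},
]
loss = multi_head_mse(preds, labels)
```

Masking by label availability lets human-annotated and pseudo-labeled examples share a batch, so every head sees gradient signal only from the data it is meant to fit.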

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents HydraQE, an end-to-end reference-free quality estimation system for speech translation built on a Qwen3-ASR backbone. It combines hidden states from all layers via a learnable sparsemax scalar mix, re-encodes them with a lightweight bidirectional Transformer for cross-modal interaction, and uses three independent prediction heads trained on human direct assessment annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. Training uses synthetically corrupted examples and silver pseudo-labeled MT outputs with a curriculum that starts on synthetic/silver data and shifts toward human annotations. The central claim is that HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems.

Significance. If the outperformance claim holds with proper validation, the work would indicate that end-to-end speech translation QE can be competitive with cascaded approaches, addressing data scarcity via multi-source supervision and curriculum learning. The multi-head design on complementary signals is a reasonable response to limited human data. However, the heavy reliance on pseudo-labels from existing systems (MetricX-24 and xCOMET) limits the potential impact unless the model is shown to learn independent cross-modal features rather than artifacts of the labelers. No machine-checked proofs, reproducible code, or parameter-free derivations are described.

major comments (2)
  1. [Abstract] The claim that HydraQE 'outperforms cascaded text-based baselines and prior direct speech QE systems' is asserted without any numerical results, correlation scores, error bars, ablation tables, or dataset statistics, so the central empirical claim cannot be evaluated.
  2. [Abstract] Two of the three supervision signals are pseudo-labels from MetricX-24 and xCOMET; without ablations on the contribution of each head, details on the curriculum schedule/mixing ratios, or results on a held-out human-only validation set, the generalization claim to the unseen IWSLT 2026 test distribution rests on an untested assumption that the model does not overfit to silver-labeler biases.
minor comments (2)
  1. [Abstract] The architecture of the 'lightweight bidirectional Transformer' (number of layers, hidden dimension, attention heads) is not specified.
  2. [Abstract] No information is given on initialization, regularization, or training details for the learnable sparsemax scalar mix weights.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will make the indicated revisions to strengthen the presentation of results and analysis.

Point-by-point responses
  1. Referee: [Abstract] The claim that HydraQE 'outperforms cascaded text-based baselines and prior direct speech QE systems' is asserted without any numerical results, correlation scores, error bars, ablation tables, or dataset statistics, so the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from quantitative support for the performance claim. The full experimental results, including correlation scores on the IWSLT 2026 test set, are reported in the body of the paper. In the revised manuscript we will update the abstract to include the primary correlation metrics (e.g., Pearson and Spearman) comparing HydraQE against the cascaded baselines.
    Revision: yes

  2. Referee: [Abstract] Two of the three supervision signals are pseudo-labels from MetricX-24 and xCOMET; without ablations on the contribution of each head, details on the curriculum schedule/mixing ratios, or results on a held-out human-only validation set, the generalization claim to the unseen IWSLT 2026 test distribution rests on an untested assumption that the model does not overfit to silver-labeler biases.

    Authors: We acknowledge the need for explicit evidence that the model benefits from the complementary supervision signals rather than simply memorizing pseudo-label artifacts. The revised manuscript will include (1) ablations isolating the contribution of each prediction head, (2) the exact curriculum schedule and data-mixing ratios, and (3) performance numbers on a held-out human-annotated validation set.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system description with external supervision sources

Full rationale

The paper describes an empirical end-to-end QE model trained on human DA annotations plus external pseudo-labels from MetricX-24 and xCOMET, with a curriculum mixing synthetic corruptions. No mathematical derivation, equations, or self-referential definitions are present that reduce any claimed result to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear. The outperformance claim rests on shared-task evaluation rather than any fitted parameter being renamed as a prediction or any load-bearing premise collapsing to self-generated quantities. This is a standard semi-supervised training setup without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond the learnable scalar mix weights and the assumption that pseudo-labels provide useful supervision.

free parameters (1)
  • learnable sparsemax scalar mix weights
    The weights that combine hidden states from all backbone layers are learned during training and are required for the cross-modal interaction step.


