How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Luisa Bentivogli; Marco Gaido; Matteo Negri; Mauro Cettolo; Sara Papi

arxiv: 2511.03295 · v3 · submitted 2025-11-05 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 01:32 UTC · model grok-4.3

keywords: speech translation evaluation · source-aware metrics · neural MT metrics · ASR transcripts · back-translation · cross-lingual re-segmentation · human correlation

The pith

ASR transcripts serve as a more reliable proxy than back-translations for source-aware neural metrics in speech translation evaluation when word error rate stays below 20 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to extend source-aware neural MT metrics to speech translation, where the input is audio and direct transcripts are typically unavailable. It tests two ways to create usable text versions of the source: automatic speech recognition outputs and back-translations of the reference translations. A two-step cross-lingual re-segmentation procedure is developed to correct alignment problems between these text proxies and the references. Tests on two large benchmarks spanning 79 language pairs and six different ST systems show that ASR transcripts yield stronger correlation with human judgments than back-translations when ASR quality is good, while back-translations remain a fast and useful fallback. The re-segmentation step makes the whole approach practical even when original alignments are missing.
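To make the setup concrete, here is a minimal sketch of how the two textual proxies could be slotted into the source field of a source-aware metric. The `Segment` class, its field names, and `build_inputs` are hypothetical and not taken from the paper; the `src`/`mt`/`ref` keys mirror the record layout used by COMET-style scorers, and the scorer itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One evaluation unit; all field names are illustrative assumptions."""
    hypothesis: str        # output of the ST system under evaluation
    reference: str         # human reference translation
    asr_transcript: str    # ASR output for the source audio (proxy 1)
    back_translation: str  # reference translated back into the source language (proxy 2)


def build_inputs(segments: List[Segment], proxy: str) -> List[Dict[str, str]]:
    """Return src/mt/ref records that use the chosen synthetic source."""
    if proxy not in ("asr", "back_translation"):
        raise ValueError(f"unknown proxy: {proxy!r}")
    records = []
    for seg in segments:
        # The only difference between the two conditions is what fills "src".
        src = seg.asr_transcript if proxy == "asr" else seg.back_translation
        records.append({"src": src, "mt": seg.hypothesis, "ref": seg.reference})
    return records
```

The resulting list can be handed to whichever source-aware metric is in use. Swapping `proxy` changes only the source side, so the two conditions stay directly comparable.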

Core claim

ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20 percent, while back-translations always represent a computationally cheaper but still effective alternative; the cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation even without original transcripts or alignments.

What carries the argument

The two-step cross-lingual re-segmentation algorithm that aligns ASR transcripts or back-translations with reference translations so source-aware metrics can be computed without distortion from segmentation mismatches.
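The details of the paper's two-step algorithm are not reproduced on this page, so the following is only an illustration of the underlying problem: splitting an unsegmented proxy text into as many chunks as there are reference segments. It is a single-step dynamic program with a length-only cost, which is a stand-in for a real cross-lingual similarity score and is not the authors' method. The function name `resegment` is hypothetical.

```python
from typing import List


def resegment(proxy_words: List[str], ref_segments: List[str]) -> List[str]:
    """Split proxy_words into len(ref_segments) contiguous chunks.

    Cost of a chunk = |chunk length - expected length|, where the expected
    length is the reference length scaled by the global length ratio.
    This length-only cost is a placeholder assumption.
    """
    n, m = len(proxy_words), len(ref_segments)
    ref_lens = [len(r.split()) for r in ref_segments]
    ratio = n / max(1, sum(ref_lens))
    expected = [length * ratio for length in ref_lens]

    inf = float("inf")
    # best[j][i]: minimal cost of covering the first i words with j chunks
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(n + 1):
            for k in range(i + 1):  # chunk j spans words k..i-1 (may be empty)
                if best[j - 1][k] == inf:
                    continue
                cost = best[j - 1][k] + abs((i - k) - expected[j - 1])
                if cost < best[j][i]:
                    best[j][i] = cost
                    back[j][i] = k

    # Recover the chunks by following back-pointers from the last word.
    chunks, i = [], n
    for j in range(m, 0, -1):
        k = back[j][i]
        chunks.append(" ".join(proxy_words[k:i]))
        i = k
    return chunks[::-1]
```

The search is cubic in the worst case, which is fine for a demonstration but would need pruning on long documents. Replacing the `abs(...)` term with a similarity-based cost is the natural extension point.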

If this is right

Speech translation evaluation can move beyond pure reference matching to incorporate source information without requiring perfect transcripts.
ASR transcripts become the preferred proxy whenever their word error rate is measured below 20 percent.
Back-translation remains a practical low-cost option for generating synthetic sources in any resource setting.
The re-segmentation algorithm supports reliable application across diverse system architectures and language pairs.
Direct comparison against human quality judgments on low-resource pairs such as Bemba-English confirms the same pattern.
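The 20 percent rule in the list above can be sketched as a small decision helper. This assumes a sample with gold transcripts is available for estimating WER, which the paper's target scenario does not guarantee. The function names are hypothetical, and the behaviour at exactly 20 percent is an arbitrary choice here since the source only says "below".

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


def choose_proxy(wer: float, threshold: float = 0.20) -> str:
    """Prefer ASR transcripts below the threshold, back-translation otherwise."""
    return "asr" if wer < threshold else "back_translation"
```

For a whole test set, WER is normally computed by summing edits and reference words over all utterances rather than averaging per-utterance rates; the helper above covers a single pair only.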

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Evaluation pipelines could routinely compute both ASR-based and back-translation-based scores and average them for added stability.
Training objectives for speech translation models might incorporate source-aware metric signals directly when suitable proxies are available.
The same proxy-and-resegmentation pattern could be tested in other settings where one input modality lacks an immediate textual form.
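The first suggestion in the list above, averaging the two proxy-based scores, could look like the sketch below. This is an editorial idea rather than something the paper evaluates, and both function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def average_proxy_scores(
    asr_scores: Sequence[float], bt_scores: Sequence[float]
) -> List[float]:
    """Average segment-level scores obtained with the two synthetic sources."""
    if len(asr_scores) != len(bt_scores):
        raise ValueError("score lists must cover the same segments")
    return [(a + b) / 2 for a, b in zip(asr_scores, bt_scores)]


def system_score(segment_scores: Sequence[float]) -> float:
    """Collapse segment-level scores into one system-level number."""
    return mean(segment_scores)
```

An unweighted mean treats both proxies as equally trustworthy. A weighting driven by estimated ASR quality would be an obvious variation, but nothing in the source supports a specific scheme.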

Load-bearing premise

The textual proxies created by ASR or back-translation preserve enough of the original audio's meaning for the source-aware metric to produce valid scores.

What would settle it

A new speech translation test set where source-aware metrics using ASR transcripts or back-translations show lower correlation with human judgments than reference-only metrics when ASR word error rate exceeds 25 percent.
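The check described above comes down to comparing two correlations against the same human scores. A minimal sketch, assuming segment-level or system-level scores are already available as plain lists; the function names are hypothetical and Pearson correlation is used only as one common choice.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def source_aware_is_worse(
    human: Sequence[float],
    source_aware: Sequence[float],
    reference_only: Sequence[float],
) -> bool:
    """True if the source-aware metric tracks human judgments less closely."""
    return pearson(human, source_aware) < pearson(human, reference_only)
```

A raw comparison like this ignores sampling noise, so a real test would pair it with a significance test for dependent correlations before drawing conclusions.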

Original abstract

Automatic evaluation of ST systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In MT, recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio, ASR transcripts, and back-translations of the reference translation, and introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. The robustness of these findings is further confirmed by experiments on a low-resource language pair (Bemba-English) and by a direct validation against human quality judgments. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ASR transcripts beat back-translations as proxies for source-aware metrics in ST below 20% WER, and the new re-segmentation algorithm makes them usable without transcripts.

Colleague,

The one or two things to take away from this paper are that source-aware neural metrics from MT can be made to work for speech translation by using either ASR transcripts or back-translations as proxies for the audio source, with ASR coming out ahead when word error rates are low, and that their new two-step cross-lingual re-segmentation algorithm helps fix the alignment issues that otherwise get in the way.

What the paper does well is the systematic coverage. They test on two benchmarks that together span 79 language pairs and include six ST systems of different types and quality levels. The low-resource Bemba-English experiment and the direct validation with human quality judgments give the results more grounding than a lot of similar work. The algorithm itself looks like a solid addition that wasn't just pulled from prior MT papers.

Soft spots are limited but worth noting. The central assumption that the textual proxies preserve enough semantics for the metrics to be meaningful seems supported by their outcomes, but the paper would benefit from more detail on how sensitive the results are to the specific ASR or translation models used for the proxies. Also, since the full statistical analysis isn't in the abstract, checking the exact correlation improvements and any variance across pairs would be important.

This paper is for researchers and practitioners focused on automatic evaluation of speech translation systems, particularly those interested in moving past reference-only approaches in settings where source transcripts aren't available. Readers dealing with low-resource languages or real-world deployment conditions would get the most out of it.

The breadth of the experiments and the practical nature of the contribution mean it deserves a serious referee to go over the methods and confirm the robustness. I'd recommend sending this one for peer review.

Referee Report

2 major / 2 minor

Summary. The paper conducts the first systematic study of source-aware neural MT metrics for speech translation (ST) evaluation under real-world conditions where source transcripts are unavailable. It explores two textual proxies for the audio source—ASR transcripts and back-translations of the reference—and introduces a novel two-step cross-lingual re-segmentation algorithm to resolve alignment mismatches. Experiments across two ST benchmarks (79 language pairs, six diverse ST systems), a low-resource Bemba-English case, and direct human judgment validation show that ASR transcripts are more reliable than back-translations when WER is below 20%, while back-translations remain a computationally cheaper effective option; the re-segmentation algorithm enables robust application of these metrics.

Significance. If the empirical findings hold, the work offers a practical advance over reference-only ST evaluation by incorporating source information, consistent with gains observed in MT. The scale (79 language pairs), diversity of systems, low-resource validation, and human correlation provide solid grounding. The re-segmentation algorithm is a concrete technical contribution that addresses a key practical barrier.

major comments (2)

[§4, Experiments] The headline claim that ASR transcripts are more reliable below 20% WER is load-bearing; the manuscript should report the precise correlation deltas (e.g., with COMET or similar) and statistical significance tests between ASR and back-translation conditions at the WER threshold to allow readers to assess the sharpness of the cutoff.
[§3.2, Re-segmentation algorithm] The two-step cross-lingual re-segmentation is presented as enabling robust use, yet the description leaves open how segmentation boundaries are chosen across languages and whether error propagation from the first step affects downstream metric scores; a small ablation on alignment quality metrics would strengthen the claim.
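For the significance testing requested in the first major comment, one widely used option for comparing two correlations that share the same human scores is the Williams test. The sketch below computes only the t statistic; it is not a claim about what the authors ran, the function name is hypothetical, and the p-value lookup (Student's t with n - 3 degrees of freedom) is left to a statistics library.

```python
import math


def williams_t(r12: float, r13: float, r23: float, n: int) -> float:
    """Williams' t statistic for two dependent correlations sharing variable 1.

    r12: correlation of human scores with metric A (e.g. ASR-based source)
    r13: correlation of human scores with metric B (e.g. back-translated source)
    r23: correlation between metric A and metric B
    n:   number of scored items; the statistic has n - 3 degrees of freedom
    """
    if n <= 3:
        raise ValueError("need more than 3 items")
    # Determinant of the 3x3 correlation matrix of the three variables.
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator
```

A positive value favours metric A. The closer the two metrics correlate with each other (`r23` near 1), the smaller the difference in human correlation needed to reach significance.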

minor comments (2)

[Abstract] The two benchmarks are not named; adding their identities (e.g., CoVoST or MuST-C) would improve immediate readability.
[Table captions, §4] Ensure all reported correlations include the number of systems and language pairs per cell so that the 79-pair aggregate is traceable to the per-pair results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and constructive feedback on our manuscript. We address each of the major comments in detail below and indicate the revisions we plan to make.

Referee: [§4, Experiments] The headline claim that ASR transcripts are more reliable below 20% WER is load-bearing; the manuscript should report the precise correlation deltas (e.g., with COMET or similar) and statistical significance tests between ASR and back-translation conditions at the WER threshold to allow readers to assess the sharpness of the cutoff.

Authors: We agree with this observation and will strengthen the manuscript by including the requested details. Specifically, we will add precise correlation values (Pearson and Spearman) for key metrics like COMET under ASR and back-translation conditions at the 20% WER threshold, along with results from statistical significance tests such as the Williams test to compare the correlations. This will be presented in a new table or expanded section in §4 to better substantiate the claim.

Revision: yes

Referee: [§3.2, Re-segmentation algorithm] The two-step cross-lingual re-segmentation is presented as enabling robust use, yet the description leaves open how segmentation boundaries are chosen across languages and whether error propagation from the first step affects downstream metric scores; a small ablation on alignment quality metrics would strengthen the claim.

Authors: We appreciate this suggestion for improving the clarity of our technical contribution. In the revised manuscript, we will expand the description in §3.2 to explicitly detail how segmentation boundaries are determined (using a combination of cross-lingual similarity scores and length constraints) and discuss potential error propagation. Additionally, we will include a small ablation study reporting alignment quality metrics (e.g., precision, recall, and F1 on manually annotated subsets) and their correlation with downstream ST metric performance. This should address the concerns about robustness.

Revision: yes
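The alignment-quality ablation mentioned in the second response could be scored along the following lines. This is a sketch under the assumption that a segmentation is represented as a set of word offsets at which segments end and that only exact boundary matches count; the simulated rebuttal does not say how the metrics would actually be defined, and `boundary_prf` is a hypothetical name.

```python
from typing import Dict, Iterable


def boundary_prf(predicted: Iterable[int], gold: Iterable[int]) -> Dict[str, float]:
    """Precision, recall and F1 of predicted boundaries against gold ones."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)  # boundaries placed at exactly the gold offset
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact matching is strict: a boundary that is off by one word counts as both a false positive and a false negative. A tolerance window would be a gentler alternative.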

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical study of source-aware neural MT metrics for speech translation evaluation, relying on systematic experiments across two benchmarks (79 language pairs, six ST systems), a low-resource Bemba-English case, and direct validation against human quality judgments. The central findings—that ASR transcripts are more reliable than back-translations below 20% WER, with back-translations as a cheaper alternative, and that the introduced cross-lingual re-segmentation algorithm enables robust metric use—are derived from comparative performance measurements on held-out data rather than from any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. The textual proxies and re-segmentation method are treated as practical engineering solutions whose effectiveness is externally tested, keeping the derivation chain self-contained and independent of internal circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Claims rest on the transferability of MT source-aware metrics to ST via proxies and on the correctness of the alignment algorithm; evidence is provided by comparative experiments rather than theoretical derivation.

axioms (1)

Domain assumption: Neural metrics that incorporate source text achieve stronger correlation with human judgments than reference-only metrics in machine translation.
This MT result is taken as given and extended to the ST setting.

invented entities (1)

Two-step cross-lingual re-segmentation algorithm (no independent evidence)
Purpose: To resolve alignment mismatch between synthetic textual sources and reference translations.
Newly proposed component required to apply source-aware metrics reliably; no independent evidence outside this work is cited.

pith-pipeline@v0.9.0 · 5822 in / 1323 out tokens · 51287 ms · 2026-05-18T01:32:41.559563+00:00 · methodology
