PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

arxiv: 2605.16881 · v1 · pith:TXAUUGXTnew · submitted 2026-05-16 · 💻 cs.CL

Máté Metzger, Nadnapang Phophichit

Pith reviewed 2026-05-19 21:00 UTC · model grok-4.3

keywords Pali translation · multi-reference benchmark · classical language evaluation · machine translation · digital humanities · LLM assessment · Buddhist texts · interpretive variation

The pith

PaliBench shows how to build multi-reference benchmarks for classical language translation from existing scholarly translations without treating any one as definitive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PaliBench as both a concrete benchmark for Pali-to-English translation and a general workflow for turning multiple independent human translations of classical texts into evaluation resources. Classical traditions often allow several faithful renderings that differ in wording and emphasis, so single-reference tests can misjudge systems that produce valid alternatives. The method aligns passages from three well-known translators using large language models, verifies the alignments against source files, applies passage-level quality filters, removes repeated formulaic segments, and evaluates models against all references at once. When applied to ten contemporary language models the benchmark produces consistent system rankings across metrics while exposing differences in how reliably each model stays close to the human versions. The central methodological point is that existing scholarly work can supply the necessary references without forcing a single canonical translation.

Core claim

PaliBench supplies 1,700 aligned passages drawn from the Sutta Pitaka together with independent English renderings by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. These passages were assembled through LLM-assisted alignment of independently segmented texts, automated verification against the original sources, quality filtering at the passage level, and deduplication of formulaic repetitions, resulting in a collection of 8,389 segments and roughly 345,000 tokens. Evaluation of ten modern large language models on this multi-reference set reveals strong concordance in relative rankings across complementary metrics together with noticeable variation in reliability and rates of semantic outliers.

What carries the argument

The multi-reference construction workflow that performs LLM-assisted alignment of independently segmented translations, followed by automated verification, passage-level quality filtering, and deduplication to produce an evaluation set that registers interpretive differences across human renderings.
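The deduplication step in this workflow can be sketched as follows. This is only an illustration under stated assumptions: it treats two passages as the same formula when their source text matches exactly after normalization, whereas the summary above does not say how the paper detects formulaic repetitions. The record layout and field names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def deduplicate(passages: list[dict]) -> list[dict]:
    """Keep the first passage for each distinct normalized source text."""
    seen: set[str] = set()
    kept = []
    for passage in passages:
        key = normalize(passage["pali"])
        if key not in seen:
            seen.add(key)
            kept.append(passage)
    return kept


# Hypothetical records; "id" and "pali" are illustrative field names.
passages = [
    {"id": "a", "pali": "Evaṁ me sutaṁ."},
    {"id": "b", "pali": "evaṁ  me sutaṁ"},
    {"id": "c", "pali": "Ekaṁ samayaṁ bhagavā"},
]
print([p["id"] for p in deduplicate(passages)])
```

Exact matching after normalization is the most conservative choice; a fuzzier criterion (for example n-gram overlap above a threshold) would remove more repetitions at the risk of dropping passages that differ in substance.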

If this is right

Ten contemporary large language models receive consistent relative rankings when scored against the multi-reference set using several different metrics.
The resulting collection contains 1,700 passages, 8,389 segments, and approximately 345,000 tokens after filtering and deduplication.
The same construction steps can be reused on other classical corpora that already possess several independent scholarly translations.
Model outputs show substantial differences in reliability and frequency of semantic outliers even when overall rankings agree.
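One way to quantify how consistently two metrics rank the same systems is a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, which ignores ties) from scratch; the choice of statistic is an assumption, and the system names and scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall's tau-a between the system orderings induced by two metrics."""
    systems = sorted(scores_a)
    if systems != sorted(scores_b):
        raise ValueError("both metrics must score the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs


# Invented scores for four hypothetical systems under two hypothetical metrics.
metric_a = {"m1": 0.61, "m2": 0.55, "m3": 0.40, "m4": 0.38}
metric_b = {"m1": 0.80, "m2": 0.82, "m3": 0.60, "m4": 0.50}
print(round(kendall_tau(metric_a, metric_b), 3))
```

A value of 1 means the two metrics order every pair of systems the same way; here the metrics disagree on one pair out of six.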

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other interpretive traditions such as classical Chinese or Sanskrit texts that also exist in multiple scholarly translations.
Future benchmarks might measure how much adding a fourth or fifth reference changes model rankings and outlier detection.
Systems trained or prompted to produce translations that deliberately vary in register could be evaluated more fairly with this kind of multi-reference set.

Load-bearing premise

Automated LLM alignment combined with verification and filtering yields passages that faithfully preserve the interpretive differences present in the three original human translations without introducing systematic bias.

What would settle it

A side-by-side comparison by independent Pali scholars of a random sample of the aligned passages against the three source translations to check whether the retained segments accurately reflect the content and variation in each human version.
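Drawing such a sample reproducibly is straightforward. The sketch below assumes passages carry string identifiers and uses a fixed seed so that independent reviewers audit the same items; the identifier format, sample size, and seed are placeholders rather than values from the paper.

```python
import random


def draw_audit_sample(passage_ids: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Return a reproducible random sample of passage IDs, sorted for readability."""
    if sample_size > len(passage_ids):
        raise ValueError("sample size exceeds the number of passages")
    rng = random.Random(seed)  # local generator; leaves global random state untouched
    return sorted(rng.sample(passage_ids, sample_size))


# Hypothetical IDs for a 1,700-passage benchmark.
passage_ids = [f"passage-{i:04d}" for i in range(1, 1701)]
audit_sample = draw_audit_sample(passage_ids, sample_size=200, seed=13)
print(len(audit_sample), audit_sample[:3])
```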

Original abstract

Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.
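The abstract does not spell out what automated verification against source files involves. A minimal sketch of one plausible check, assuming the goal is to confirm that every aligned segment occurs verbatim in the translation's source text (up to whitespace), might look like this; the function names and sample text are hypothetical.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


def find_unverified_segments(segments: list[str], source_text: str) -> list[int]:
    """Return indices of segments that are empty or absent from the source text."""
    haystack = collapse_whitespace(source_text)
    failures = []
    for index, segment in enumerate(segments):
        needle = collapse_whitespace(segment)
        if not needle or needle not in haystack:
            failures.append(index)
    return failures


# Invented source text and segments for illustration.
source_text = "Thus have I heard.\nAt one time the Blessed One was staying at Savatthi."
segments = [
    "Thus have I heard.",
    "At one time  the Blessed One",   # extra space is tolerated
    "was living at Savatthi.",        # altered wording is flagged
]
print(find_unverified_segments(segments, source_text))
```

A check of this kind catches text that an LLM aligner paraphrased or hallucinated, but it says nothing about whether segments from different translators were matched to the right Pali passage.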

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PaliBench gives a usable template for turning existing scholarly translations into multi-reference benchmarks, but the LLM alignment step lacks the checks needed to confirm it preserves real interpretive differences.

PaliBench puts forward a dataset of 1,700 passages from the Sutta Pitaka, each tied to three independent English translations, plus a step-by-step method for building similar resources for other classical languages. The main advance is practical: it shows how to take pre-existing scholarly work, align the segments with LLM help, verify against source files, filter for quality, drop formulaic repeats, and end up with a test set that supports multiple references instead of forcing a single gold standard. That workflow is described clearly enough to try on another corpus, and the evaluation of ten LLMs with several metrics produces consistent rankings, which is at least a useful data point for people working on low-resource translation. The paper earns credit for shipping an actual dataset and for focusing on the real problem that classical texts often have several defensible renderings.

The soft spot sits in the alignment stage. The method relies on LLM-assisted matching of independently segmented translations, followed by automated checks that mostly confirm file-level or token-level consistency. Those checks do not directly test whether the chosen alignments keep the distinctions in terminology, register, or doctrinal emphasis that the three human translators deliberately made. If the LLM is quietly preferring its own segmentations or smoothing over differences, the final multi-reference set stops being a neutral mirror of existing scholarship and starts reflecting model priors instead. The abstract and description give no sample error rates, no human review of alignment quality, and no comparison of how much variation survives the process. That gap is real but not fatal; it is the sort of thing a revision could fix with a small validation set.

The work is aimed at digital humanities researchers and machine-translation people who deal with historical or religious texts. Anyone trying to build evaluation resources for under-translated languages will find the construction steps and the released size numbers directly useful. It is coherent on its own terms and shows honest engagement with the single-reference limitation in current benchmarks. I would send it to peer review rather than desk-reject it. The dataset and method are concrete enough that referees can assess the alignment concern with the full paper in hand and ask for the missing validation numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PaliBench as both a benchmark dataset and a reusable methodological blueprint for multi-reference translation evaluation in classical languages. It constructs a 1,700-passage Pali-to-English test set by aligning independent translations from Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi via LLM-assisted segmentation, followed by automated verification, passage-level quality filtering, and deduplication of formulaic repetitions. The resulting resource (8,389 segments, ~345k tokens) is used to evaluate ten contemporary LLMs under complementary metrics, with reported strong cross-metric concordance in system rankings alongside variation in reliability and semantic outlier rates. The central claim is that existing scholarly translations can be transformed into unbiased evaluation infrastructure that captures interpretive variation without privileging any single reference.

Significance. If the alignment and filtering steps are shown to be reliable, the work offers a concrete advance for MT evaluation in interpretive textual traditions where single-reference gold standards are inappropriate. It supplies both an immediately usable Pali benchmark and a portable workflow that could be applied to other classical corpora with multiple independent scholarly translations, addressing a recognized gap between digital-humanities practice and standard NLP benchmarks.

major comments (2)

[Methodology / construction workflow] The methodological workflow (alignment, verification, filtering, and deduplication) is presented without any quantitative validation of alignment accuracy, filtering effects, or residual error rates. No sample-based manual audit, inter-alignment agreement metric, or comparison against purely human alignments is reported. This directly bears on the central claim that the final 1,700-passage set faithfully encodes interpretive variation across the three human translations without LLM-induced bias in segmentation or terminology choices.
[Evaluation and results] The evaluation reports cross-metric concordance but does not include an ablation or baseline comparison that isolates the contribution of the multi-reference design (e.g., single-reference vs. multi-reference scores on the same model outputs). Without this, it remains unclear how much the observed ranking stability and outlier rates are attributable to the multi-reference structure itself.

minor comments (2)

[Evaluation section] Clarify the exact definition and implementation of each complementary metric used in the LLM evaluation; if any are novel, provide the formulas or pseudocode.
[Dataset statistics] The token count (~345,000) and segment count should be broken down by source translation or by passage to allow readers to assess balance across the three references.
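The breakdown requested in the last comment is simple to compute once the benchmark is loaded. The sketch below assumes each passage record maps a translator key to a list of segment strings and counts tokens by whitespace splitting; the record layout, translator keys, and sample sentences are all hypothetical, and the paper's own tokenization may differ.

```python
from collections import defaultdict


def breakdown_by_translator(passages: list[dict]) -> dict[str, dict[str, int]]:
    """Sum segment and whitespace-token counts per translator across passages."""
    stats: dict[str, dict[str, int]] = defaultdict(lambda: {"segments": 0, "tokens": 0})
    for passage in passages:
        for translator, segments in passage["translations"].items():
            stats[translator]["segments"] += len(segments)
            stats[translator]["tokens"] += sum(len(s.split()) for s in segments)
    return dict(stats)


# Invented records for illustration; not actual benchmark content.
passages = [
    {"id": "p1", "translations": {
        "translator_a": ["So I have heard.", "At one time the Buddha was staying near Savatthi."],
        "translator_b": ["Thus have I heard."],
    }},
    {"id": "p2", "translations": {
        "translator_a": ["That is what the Buddha said."],
        "translator_b": ["This is what the Blessed One said.", "The monks were satisfied."],
    }},
]
for translator, counts in breakdown_by_translator(passages).items():
    print(translator, counts)
```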

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and limitations of our methodological and evaluation sections. We respond to each major comment below and indicate the revisions we plan to incorporate.

Referee: [Methodology / construction workflow] The methodological workflow (alignment, verification, filtering, and deduplication) is presented without any quantitative validation of alignment accuracy, filtering effects, or residual error rates. No sample-based manual audit, inter-alignment agreement metric, or comparison against purely human alignments is reported. This directly bears on the central claim that the final 1,700-passage set faithfully encodes interpretive variation across the three human translations without LLM-induced bias in segmentation or terminology choices.

Authors: We agree that the manuscript would be strengthened by quantitative validation of the alignment and filtering pipeline. In the revised version we will add a manual audit of a random sample of 200 passages, reporting alignment accuracy against human judgment, an inter-alignment agreement metric across the three source translations, and residual error rates after filtering. These results will be presented in a new subsection of the methodology to directly support the claim that LLM-assisted segmentation introduces minimal bias relative to the interpretive variation already present in the scholarly references. revision: yes

Referee: [Evaluation and results] The evaluation reports cross-metric concordance but does not include an ablation or baseline comparison that isolates the contribution of the multi-reference design (e.g., single-reference vs. multi-reference scores on the same model outputs). Without this, it remains unclear how much the observed ranking stability and outlier rates are attributable to the multi-reference structure itself.

Authors: We acknowledge that an explicit ablation would better isolate the effect of the multi-reference design. We will add this analysis in the revised evaluation section by recomputing all metrics and rankings under single-reference conditions (using each translator in turn as the sole reference) and comparing the resulting stability and outlier rates to the multi-reference results. This will quantify the incremental benefit of incorporating multiple scholarly translations. revision: yes
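The ablation described in this response can be sketched in a few lines. The example below is an illustration under assumptions, not the authors' procedure: it uses word-set Jaccard overlap as a stand-in metric, takes the best match across references in the multi-reference condition, and ranks systems by mean passage score. All names and sentences are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard overlap (a stand-in metric, not one named in the paper)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_systems(outputs: dict[str, list[str]], references: list[list[str]]) -> list[str]:
    """Rank systems by mean best-match score; references[i] holds passage i's references."""
    means = {}
    for system, hypotheses in outputs.items():
        scores = [max(jaccard(h, r) for r in refs) for h, refs in zip(hypotheses, references)]
        means[system] = sum(scores) / len(scores)
    return sorted(means, key=means.get, reverse=True)


def ablation(outputs: dict[str, list[str]],
             references_by_translator: dict[str, list[str]]) -> dict[str, list[str]]:
    """Rank systems with each translator as sole reference, then with all combined."""
    rankings = {
        name: rank_systems(outputs, [[ref] for ref in refs])
        for name, refs in references_by_translator.items()
    }
    combined = [list(group) for group in zip(*references_by_translator.values())]
    rankings["multi"] = rank_systems(outputs, combined)
    return rankings


# Invented one-passage example with two hypothetical translators and systems.
references_by_translator = {
    "translator_a": ["all conditioned things are impermanent"],
    "translator_b": ["all fabrications are inconstant"],
}
outputs = {
    "sys1": ["all fabrications are inconstant"],
    "sys2": ["all conditioned things are unstable"],
}
for condition, ranking in ablation(outputs, references_by_translator).items():
    print(condition, ranking)
```

In this toy case the ranking flips depending on which translator serves as the sole reference, which is the kind of instability the ablation is meant to measure.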

Circularity Check

0 steps flagged

No significant circularity in methodological benchmark construction

The paper presents a workflow for assembling PaliBench from pre-existing independent scholarly translations (Bhikkhu Sujato, Thanissaro, Bodhi) via LLM-assisted alignment, automated verification, quality filtering, and deduplication, followed by evaluation of separate LLMs using multiple metrics. No equations, fitted parameters, predictions, or uniqueness claims reduce by construction to inputs from the same dataset; the central methodological claim relies on external source material and an externally verifiable process rather than self-referential definitions or self-citation chains. The derivation is self-contained against the independent translations and LLM test set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality and independence of the three source translations plus the effectiveness of the automated alignment and filtering steps; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi constitute independent, faithful renderings of the same Pali passages that differ meaningfully in terminology and interpretation.
This premise is required for the multi-reference evaluation to be meaningful and is invoked when aligning and scoring against all three versions.

pith-pipeline@v0.9.0 · 5796 in / 1301 out tokens · 41296 ms · 2026-05-19T21:00:33.716397+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
