
arxiv: 2606.01136 · v1 · pith:5AUELDVKnew · submitted 2026-05-31 · 💻 cs.CL

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

Pith reviewed 2026-06-28 17:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM translation auditing · Pali-to-English · embedding drift · multi-reference evaluation · translation error detection · classical language translation · adjudication panel

The pith

Drift from a multi-translator reference centroid predicts major-error rates in LLM Pali-to-English translations, so outliers can be prioritized for review rather than all treated as mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a triage method for auditing LLM translations of the Pali Canon that uses embedding drift from the centroid of three established human translations as a signal to prioritize review instead of a single gold standard. It shows this drift correlates with error severity: rates of major errors such as omission or doctrinal mistakes rise from 7.9 percent in the 1.5-2.0 drift band to 51.6 percent above 3.0, while roughly 80 percent of moderate outliers count as valid variations. Model differences appear most clearly in the high-drift tail, with one model exhibiting the highest error volume and rate there. This matters because single-score metrics conflate legitimate variation with error in classical languages that admit multiple defensible renderings. The design supplies a reusable audit workflow that flags the tail for adjudication rather than labeling every outlier as failure.

Core claim

By defining a local reference envelope from three human translations and measuring normalized embedding drift, the authors demonstrate that drift serves as a severity predictor rather than an error label: the major-error rate among adjudicated high-drift candidates rises monotonically across bands, approximately 80 percent of 1.5-2.0 outliers are valid variations, and model differences concentrate in the tail where one model records the highest rates (27.6 percent overall, 74.4 percent above drift 3.0).

What carries the argument

Normalized embedding drift from the multi-translator reference centroid, used as a triage signal before blinded three-model LLM judge panel adjudication calibrated on a 300-instance validation set.
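The triage signal can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the source does not define the normalization (the referee report raises exactly this point), so the sketch assumes drift is the candidate's Euclidean distance from the reference centroid divided by the mean distance of the references from that centroid. The function names and the toy 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_drift(candidate, references):
    """Candidate-to-centroid distance divided by the mean
    reference-to-centroid distance.

    ASSUMPTION: this normalization is illustrative only; the source
    does not state how drift is normalized.
    """
    c = centroid(references)
    spread = sum(euclidean(r, c) for r in references) / len(references)
    if spread == 0:
        raise ValueError("references are identical; drift is undefined")
    return euclidean(candidate, c) / spread


# Toy 2-D vectors standing in for embeddings of three human translations.
refs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
close_candidate = [0.5, 0.5]
far_candidate = [4.0, 3.0]

drift_close = normalized_drift(close_candidate, refs)
drift_far = normalized_drift(far_candidate, refs)
print(f"close: {drift_close:.3f}  far: {drift_far:.3f}")
```

Under this assumed definition, a drift near 1 means the candidate sits about as far from the centroid as the human translators do, so a threshold of 1.5 flags candidates that fall outside the envelope the references span.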

If this is right

  • Major-error rates increase monotonically across successively higher drift bands.
  • Most candidates in the 1.5-2.0 drift band represent acceptable translation variations.
  • One model produces both more high-drift outliers and a higher proportion of major errors in the tail than the others.
  • The dominant error categories are omission, truncation, and doctrinal term mistakes.
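The band-level consequences listed above can be checked with a short tabulation. The sketch below is an assumption-laden illustration: the edges 1.5, 2.0 and 3.0 appear in the source, but treating 2.0-3.0 as a single middle band, including the lower edge in each band, and every record in `toy` are choices made here for demonstration and are not the paper's data.

```python
from collections import defaultdict

# Edges 1.5, 2.0 and 3.0 come from the source; the single 2.0-3.0 middle
# band is an assumption made for this sketch.
BANDS = [(1.5, 2.0), (2.0, 3.0), (3.0, float("inf"))]


def band_of(drift):
    """Return the (low, high) band containing drift, or None if the
    value is below the triage threshold."""
    for low, high in BANDS:
        if low <= drift < high:
            return (low, high)
    return None


def major_error_rates(records):
    """records: iterable of (drift, is_major_error) pairs.
    Returns {band: major-error rate} for bands with at least one record."""
    totals = defaultdict(int)
    majors = defaultdict(int)
    for drift, is_major in records:
        band = band_of(drift)
        if band is None:
            continue
        totals[band] += 1
        majors[band] += int(is_major)
    return {b: majors[b] / totals[b] for b in BANDS if totals[b]}


def is_monotonic_increasing(rates):
    """True if the rate strictly rises from each band to the next."""
    values = [rates[b] for b in BANDS if b in rates]
    return all(a < b for a, b in zip(values, values[1:]))


# Fabricated toy adjudications, not the paper's data.
toy = [(1.6, False), (1.7, False), (1.8, False), (1.9, True),
       (2.2, False), (2.6, True),
       (3.1, True), (3.5, True), (4.0, False)]

rates = major_error_rates(toy)
for band, rate in rates.items():
    print(band, f"{rate:.2f}")
print("monotonic:", is_monotonic_increasing(rates))
```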

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The triage workflow could extend to other classical languages that have multiple authoritative human translations to prioritize review effort.
  • Embedding-based drift may offer a more stable signal than single-reference metrics when variation is expected.
  • Threshold tuning on new corpora could balance review volume against capture of severe errors.
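The threshold-tuning idea can be made concrete as a sweep that reports, for each candidate threshold, how many items would need review and what share of known major errors would be caught. This is a hypothetical sketch: it assumes a labelled pilot sample that spans the full drift range (the paper adjudicates only candidates above 1.5), and `sweep_thresholds` and the toy records are invented for illustration.

```python
def sweep_thresholds(records, thresholds):
    """records: list of (drift, is_major_error) pairs with labels across
    the whole drift range (an assumption of this sketch).

    Returns a list of (threshold, review_volume, capture), where capture
    is the share of all major errors with drift at or above the threshold.
    """
    total_major = sum(1 for _, is_major in records if is_major)
    rows = []
    for t in thresholds:
        flagged = [(d, m) for d, m in records if d >= t]
        caught = sum(1 for _, m in flagged if m)
        capture = caught / total_major if total_major else 0.0
        rows.append((t, len(flagged), capture))
    return rows


# Fabricated pilot labels, not the paper's data.
pilot = [(0.8, False), (1.2, True), (1.6, False), (1.9, True),
         (2.4, False), (2.8, True), (3.3, True)]

rows = sweep_thresholds(pilot, [1.0, 1.5, 2.0, 3.0])
for threshold, volume, capture in rows:
    print(f"threshold={threshold:.1f}  review={volume}  capture={capture:.2f}")
```

Raising the threshold shrinks the review queue but lets more severe errors slip below it, which is the trade-off a new corpus would need to calibrate.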

Load-bearing premise

The blinded three-model LLM judge panel, after calibration against author-adjudicated examples, reliably distinguishes legitimate translation variations from major errors.
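One way to probe this premise is to aggregate the three judges' votes and measure agreement with human labels. The sketch below assumes majority voting and a two-label scheme ("valid" versus "major"); the source does not specify how the panel combines votes or which labels it uses, and all votes and human labels shown are fabricated for illustration. Cohen's kappa is the agreement statistic named in the simulated rebuttal.

```python
from collections import Counter


def majority_label(votes):
    """Label chosen by most judges; a tie for first place returns
    'no_consensus'. Majority voting is an assumption of this sketch."""
    (top, top_n), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_n:
        return "no_consensus"
    return top


def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Fabricated votes from three hypothetical judges on six items.
panel_votes = [
    ("valid", "valid", "major"),
    ("major", "major", "major"),
    ("valid", "valid", "valid"),
    ("major", "valid", "major"),
    ("valid", "valid", "valid"),
    ("major", "major", "valid"),
]
panel = [majority_label(v) for v in panel_votes]

# Fabricated human labels for the same six items.
human = ["valid", "major", "valid", "valid", "valid", "major"]

kappa = cohens_kappa(panel, human)
print(panel)
print(f"kappa = {kappa:.3f}")
```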

What would settle it

Large-scale human re-adjudication of the high-drift candidates that shows no monotonic rise in major-error rate with drift bands or no difference in tail error rates across the four models.

Original abstract

Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper audits Pali-to-English translations from four LLMs (GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.3) on 1,700 Pali Canon passages. It defines a local reference envelope from three human translations, uses normalized embedding drift above a 1.5 threshold to triage 1,203 outliers, and adjudicates them via a blinded three-model LLM judge panel calibrated on a 300-instance author-adjudicated validation set. Central claims are that drift predicts error severity (major-error rate rises monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, with ~80% of low-band outliers judged valid variations) and that model differences appear in the high-drift tail (GPT-5.5 lowest major-error rate; Grok 4.3 highest at 27.6% overall and 74.4% above drift 3.0).

Significance. If the adjudication holds, the work supplies a practical, reusable audit design for classical-language translation that avoids single-gold-standard assumptions and focuses review effort on the error-prone tail. The monotonic severity prediction and tail-specific model comparisons would be directly useful for doctrinal text evaluation.

major comments (3)
  1. [adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.
  2. [results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.
  3. [drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.
minor comments (2)
  1. [abstract] The abstract refers to 'normalized embedding drift' without an equation or definition of the normalization; add a short formal definition or reference in the methods.
  2. [results] No table or figure caption clarifies the exact band boundaries or sample sizes per band; a small summary table would improve readability of the 7.9% / 51.6% figures.
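The summary table requested in the minor comments could pair each band's sample size with an interval estimate. The sketch below uses the Wilson score interval as one reasonable choice; this page does not state which interval the paper uses, and the band counts are placeholders chosen for illustration, not the paper's figures.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z=1.96 gives an approximate 95% interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Placeholder counts per band (label, major errors, sample size).
# These are NOT the paper's numbers.
bands = [("1.5-2.0", 8, 100), ("2.0-3.0", 15, 60), (">3.0", 16, 30)]

print(f"{'band':<10}{'n':>5}{'rate':>8}  95% CI")
for label, majors, n in bands:
    low, high = wilson_interval(majors, n)
    print(f"{label:<10}{n:>5}{majors / n:>8.3f}  [{low:.3f}, {high:.3f}]")

lo, hi = wilson_interval(8, 100)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for the small per-band samples that a high-drift tail tends to produce.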

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The three major comments identify important gaps in reproducibility and statistical rigor. We address each below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [adjudication pipeline] Adjudication pipeline (abstract and corresponding methods description): the load-bearing claim that the blinded three-model LLM judge panel reliably distinguishes major errors (omission, truncation, doctrinal term mistakes) from valid variations rests on calibration against the 300-instance validation set, yet no details are supplied on panel composition, exact judge prompts, or agreement metrics (e.g., kappa) between the panel and independent human experts on held-out cases. Without these, the reported major-error rates and model rankings cannot be assessed for bias.

    Authors: We agree that the manuscript currently lacks these details. In the revised version we will add a dedicated methods subsection specifying the exact three models used in the judge panel, the full adjudication prompts, and agreement statistics (including Cohen's kappa) between the panel and the author-adjudicated validation set on held-out instances. revision: yes

  2. Referee: [results on model differences] Results on model differences (high-drift tail paragraph): the claim that GPT-5.5 has the lowest and Grok 4.3 the highest adjudicated major-error rates lacks any statistical test (chi-square, bootstrap CI overlap assessment, or similar) for the observed differences; overlapping CIs are mentioned for GPT-5.5 vs. others but no test is reported for the Grok 4.3 tail result of 74.4%.

    Authors: We acknowledge the omission of formal tests. The revised manuscript will include chi-square tests for differences in major-error proportions across models together with bootstrap confidence-interval comparisons, with explicit reporting for the Grok 4.3 high-drift tail result. revision: yes

  3. Referee: [drift calculation] Drift calculation (methods on embedding drift): the embedding model used to compute normalized drift from the reference centroid is not identified, nor is any sensitivity analysis provided; because the triage threshold of 1.5 and all downstream rates depend on this choice, the monotonic trend cannot be reproduced or stress-tested.

    Authors: The comment is correct; the embedding model is not named and no sensitivity analysis appears. We will identify the model in the methods and add a sensitivity analysis that varies both the embedding model and the drift threshold to verify robustness of the monotonic severity trend. revision: yes
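The chi-square comparison promised in the second response could take roughly the following form for one model against the others pooled. This is a sketch under stated assumptions: it uses the Pearson statistic for a 2x2 table without continuity correction, obtains the one-degree-of-freedom p-value from the complementary error function, and runs on invented counts rather than the paper's data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

        [[a, b],     e.g. model X:      major, not major
         [c, d]]          other models: major, not major

    Returns (statistic, p_value). With one degree of freedom the upper
    tail probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: one model with 30 major errors out of 100 adjudicated
# candidates versus 20 out of 200 for the remaining models pooled.
stat, p_value = chi_square_2x2(30, 70, 20, 180)
print(f"chi-square = {stat:.2f}, p = {p_value:.2e}")

# Identical proportions give a statistic of 0 and a p-value of 1.
null_stat, null_p = chi_square_2x2(10, 90, 20, 180)
```

A four-model comparison would need a 4x2 table with three degrees of freedom, for which a statistics library is the more practical route.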

Circularity Check

0 steps flagged

No circularity: results derive from independent human references and separate adjudication

Full rationale

The paper uses multi-human reference translations to compute embedding drift as a triage signal only, then adjudicates flagged candidates via an LLM panel calibrated on a distinct 300-instance author-adjudicated validation set. Major-error rates by drift band and model comparisons are outputs of this external adjudication process, not reductions of fitted parameters, self-definitions, or self-citation chains. No equations or steps equate the reported rates to the drift metric by construction; the methodology is self-contained against the provided human references and validation set.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central method depends on a hand-chosen drift threshold and two domain assumptions about embeddings and LLM judges; no new entities are postulated.

free parameters (2)
  • drift threshold = 1.5
    Selected to flag the 1,203 candidates for adjudication; appears chosen by inspection rather than derived from first principles.
  • reporting bands (1.5-2.0, above 3.0)
    Used to demonstrate monotonic rise in error rate; post-hoc selection for presentation.
axioms (2)
  • domain assumption: Normalized embedding drift from the multi-reference centroid is a valid proxy for translation deviation severity
    Invoked to justify the triage signal before adjudication.
  • domain assumption: The three-model LLM judge panel, after calibration on the author-adjudicated set, produces reliable labels for valid variation versus major error
    Underpins all reported major-error rates and model comparisons.



Reference graph

Works this paper leans on

68 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1]

    Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

    Artetxe, Mikel and Schwenk, Holger. (2019) ’Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond’, Transactions of the Association for Computational Linguistics , 7, pp. 597–610. Available at: https://doi.org/10.1162/tacl_a_00288

  2. [2]

    (2022) ’Restoring and Attributing Ancient Texts Using Deep Neural Networks’, Nature, 603(7900), pp

    Assael, Y annis et al. (2022) ’Restoring and Attributing Ancient Texts Using Deep Neural Networks’, Nature, 603(7900), pp. 280–283. Available at: https://doi.org/10.1038/s41586-022-04448-z

  3. [3]

    (2020) ’Latin BERT: A Contextual Language Model for Classical Philology’

    Bamman, David and Burns, Patrick J. (2020) ’Latin BERT: A Contextual Language Model for Classical Philology’. Available at: https://arxiv.org/abs/2009.10053 (Accessed: 12 May 2026)

  4. [4]

    (2024) ’Machine Translation Hallucination Detection for Low and High Resource Languages Using Large Language Models’

    Benkirane, Kenza et al. (2024) ’Machine Translation Hallucination Detection for Low and High Resource Languages Using Large Language Models’. Available at: https://doi.org/10.48550/arXiv.2407. 16470 (Accessed: 12 May 2026)

  5. [5]

    (2025) ’Experiments in Distant Reading: Using Topic Modeling on Chinese Buddhist Texts from 500-800 CE’, Digital Humanities Quarterly , 19(1)

    Bingenheimer, Marcus, Brody, Justin, and Nichols, Ryan. (2025) ’Experiments in Distant Reading: Using Topic Modeling on Chinese Buddhist Texts from 500-800 CE’, Digital Humanities Quarterly , 19(1). Avail- able at: https://www.digitalhumanities.org/dhq/vol/19/1/000771/000771.html (Accessed: 12 May 2026)

  6. [6]

    (2000) The Connected Discourses of the Buddha: A Translation of the Saṃyutta Nikāya

    Bodhi, Bhikkhu. (2000) The Connected Discourses of the Buddha: A Translation of the Saṃyutta Nikāya . Boston: Wisdom Publications

  7. [7]

    (2012) The Numerical Discourses of the Buddha: A Translation of the Aṅguttara Nikāya

    Bodhi, Bhikkhu. (2012) The Numerical Discourses of the Buddha: A Translation of the Aṅguttara Nikāya . Boston: Wisdom Publications

  8. [8]

    and Lopez, Donald S., Jr

    Buswell, Robert E., Jr. and Lopez, Donald S., Jr. (eds) (2014) The Princeton Dictionary of Buddhism. Prince- ton: Princeton University Press

  9. [9]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Chiang, Wei-Lin et al. (2024) ’Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference’ in Proceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research 235, pp. 8359–8388. Available at: https://arxiv.org/abs/2403.04132 (Accessed: 12 May 2026)

  10. [10]

    (1960) ’ A Coefficient of Agreement for Nominal Scales’, Educational and Psychological Measurement, 20(1), pp

    Cohen, Jacob. (1960) ’ A Coefficient of Agreement for Nominal Scales’, Educational and Psychological Measurement, 20(1), pp. 37–46. Available at: https://doi.org/10.1177/001316446002000104 19

  11. [11]

    Costa-jussà, Marta R. et al. (2022) ’No Language Left Behind: Scaling Human-Centered Machine Transla- tion’. Available at: https://arxiv.org/abs/2207.04672 (Accessed: 12 May 2026)

  12. [12]

    Costa-jussà, Marta R. et al. (2024) ’Scaling Neural Machine Translation to 200 Languages’, Nature, 630(8018), pp. 841–846. Available at: https://doi.org/10.1038/s41586-024-07335-x

  13. [13]

    Dale, David et al. (2023) ’Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better’ in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics . Toronto: Association for Computational Linguistics, pp. 36–50. Available at: https://doi.org/10.18653...

  14. [14]

    Language-agnostic BERT Sentence Embedding

    Feng, Fangxiaoyu et al. (2022) ’Language-Agnostic BERT Sentence Embedding’ in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics . Dublin: Association for Computational Linguistics, pp. 878–891. Available at: https://doi.org/10.18653/v1/2022.acl-long.62

  15. [15]

    (2020) ’Multi-Hypothesis Machine Translation Evaluation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

    Fomicheva, Marina, Specia, Lucia, and Guzmán, Francisco. (2020) ’Multi-Hypothesis Machine Translation Evaluation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, pp. 1218–1232. Available at: https://doi.org/10 .18653/v1/2020.acl-main.113

  16. [16]

    (2021) ’Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation’, Transactions of the Association for Computational Linguistics , 9, pp

    Freitag, Markus et al. (2021) ’Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation’, Transactions of the Association for Computational Linguistics , 9, pp. 1460–1474. Available at: https://doi.org/10.1162/tacl_a_00437

  17. [17]

    Li, Y ., Fan, H., Hu, R., Feichtenhofer, C., and He, K

    Freitag, Markus et al. (2022) ’Results of WMT22 Metrics Shared Task: Stop Using BLEU–Neural Metrics Are Better and More Robust’ in Proceedings of the Seventh Conference on Machine Translation. Abu Dhabi: Association for Computational Linguistics, pp. 46–68. Available at: https://doi.org/10.18653/v1/ 2022.wmt-1.2

  18. [18]

    (2022) ’The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation’,Transactions of the Association for Computational Linguistics, 10, pp

    Goyal, Naman et al. (2022) ’The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation’,Transactions of the Association for Computational Linguistics, 10, pp. 522–538. Avail- able at: https://doi.org/10.1162/tacl_a_00474

  19. [19]

    Guerreiro, Nuno M. et al. (2023a) ’Hallucinations in Large Multilingual Translation Models’, Transactions of the Association for Computational Linguistics , 11, pp. 1500–1517. Available at: https://doi.org/10 .1162/tacl_a_00615

  20. [20]

    Guerreiro, Nuno M., Voita, Elena, and Martins, André F. T. (2023b) ’Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation’ in Proceedings of the 17th Confer- ence of the European Chapter of the Association for Computational Linguistics . Dubrovnik: Association for Computational Linguistics, pp. 1059–1075...

  21. [21]

    Guerreiro, Nuno M. et al. (2024) ’xCOMET: Transparent Machine Translation Evaluation through Fine- Grained Error Detection’, Transactions of the Association for Computational Linguistics , 12, pp. 979–995. Available at: https://doi.org/10.1162/tacl_a_00683

  22. [22]

    (2015) ’Morphological Disambiguation of Classical Sanskrit’ in Mahlow, Cerstin and Pi- otrowski, Michael (eds) Systems and Frameworks for Computational Morphology

    Hellwig, Oliver. (2015) ’Morphological Disambiguation of Classical Sanskrit’ in Mahlow, Cerstin and Pi- otrowski, Michael (eds) Systems and Frameworks for Computational Morphology. Cham: Springer, pp. 41–

  23. [23]

    Available at: https://doi.org/10.1007/978-3-319-23980-4_3

  24. [24]

    (2023) ’How Good Are GPT Models at Machine Translation? A Comprehensive Evalua- tion’

    Hendy, Amr et al. (2023) ’How Good Are GPT Models at Machine Translation? A Comprehensive Evalua- tion’. Available at: https://arxiv.org/abs/2302.09210 (Accessed: 12 May 2026)

  25. [25]

    (2023) ’Is ChatGPT a Good Translator? Y es with GPT-4 as the Engine’

    Jiao, Wenxiang et al. (2023) ’Is ChatGPT a Good Translator? Y es with GPT-4 as the Engine’. Available at: https://doi.org/10.48550/arXiv.2301.08745 (Accessed: 12 May 2026)

  26. [26]

    (2023a) ’GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4’ in Proceedings of the Eighth Conference on Machine Translation

    Kocmi, Tom and Federmann, Christian. (2023a) ’GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4’ in Proceedings of the Eighth Conference on Machine Translation . Singapore: Association for Computational Linguistics, pp. 768–775. Available at: https://doi.org/10.18653/v1/2023.wmt-1 .64 20

  27. [27]

    Kocmi, Tom and Federmann, Christian. (2023b) ’Large Language Models Are State-of-the-Art Evaluators of Translation Quality’ in Proceedings of the 24th Annual Conference of the European Association for Machine Translation. Tampere: European Association for Machine Translation, pp. 193–203. Available at: https: //arxiv.org/abs/2302.14520 (Accessed: 12 May 2026)

  28. [28]

    (2009) ’The METEOR Metric for Automatic Evaluation of Machine Translation’, Machine Translation, 23(2–3), pp

    Lavie, Alon and Denkowski, Michael J. (2009) ’The METEOR Metric for Automatic Evaluation of Machine Translation’, Machine Translation, 23(2–3), pp. 105–115. Available at: https://doi.org/10.1007/s1 0590-009-9059-4

  29. [29]

    (2023) ’G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment’ in Proceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Liu, Y ang et al. (2023) ’G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment’ in Proceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, pp. 2511–2522. Available at: https://doi.org/10.18653/v1/2023.e mnlp-main.153

  30. [30]

    (2018) ’Metrics for Translation Quality Assessment: A Case for Standardizing Error T ypolo- gies’ in Moorkens, Joss et al

    Lommel, Arle. (2018) ’Metrics for Translation Quality Assessment: A Case for Standardizing Error T ypolo- gies’ in Moorkens, Joss et al. (eds) Translation Quality Assessment: From Principles to Practice . Cham: Springer, pp. 109–127. Available at: https://doi.org/10.1007/978-3-319-91241-7_6

  31. [31]

    Mathur, Nitika, Baldwin, Timothy, and Cohn, Trevor. (2020) ’Tangled up in BLEU: Reevaluating the Eval- uation of Automatic Machine Translation Evaluation Metrics’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, pp. 4984–4997. Available at: https://doi.org/10.18...

  32. [32]

    (n.d.) MQM Error Typology

    MQM Council. (n.d.) MQM Error Typology. Available at: https://themqm.org/error-types-2/typ ology/ (Accessed: 12 May 2026)

  33. [33]

    (2023) ’MTEB: Massive Text Embedding Benchmark’ in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

    Muennighoff, Niklas et al. (2023) ’MTEB: Massive Text Embedding Benchmark’ in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics . Dubrovnik: Associ- ation for Computational Linguistics, pp. 2014–2037. Available at: https://doi.org/10.18653/v1/20 23.eacl-main.148

  34. [34]

    Nehrdich, Sebastian and Keutzer, Kurt. (2026) ’MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan’. Available at: https://arxiv.org/abs/2601.06400 (Accessed: 12 May 2026)

  35. [35]

    (2024) ’One Model Is All Y ou Need: ByT5- Sanskrit, a Unified Model for Sanskrit NLP Tasks’ in Findings of the Association for Computational Lin- guistics: EMNLP 2024

    Nehrdich, Sebastian, Hellwig, Oliver, and Keutzer, Kurt. (2024) ’One Model Is All Y ou Need: ByT5- Sanskrit, a Unified Model for Sanskrit NLP Tasks’ in Findings of the Association for Computational Lin- guistics: EMNLP 2024 . Miami: Association for Computational Linguistics, pp. 13742–13751. Available at: https://doi.org/10.18653/v1/2024.findings-emnlp.805

  36. [36]

    (1995) The Middle Length Discourses of the Buddha: A Translation of the Majjhima Nikāya

    Ñāṇamoli, Bhikkhu and Bodhi, Bhikkhu. (1995) The Middle Length Discourses of the Buddha: A Translation of the Majjhima Nikāya . Boston: Wisdom Publications

  37. [37]

    (2002) ’BLEU: A Method for Automatic Evaluation of Machine Translation’ in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

    Papineni, Kishore et al. (2002) ’BLEU: A Method for Automatic Evaluation of Machine Translation’ in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics . Philadelphia: Association for Computational Linguistics, pp. 311–318. Available at: https://doi.org/10.3115/10 73083.1073135

  38. [38]

    chr F : character n-gram F -score for automatic MT evaluation

    Popović, Maja. (2015) ’chrF: Character N-Gram F-Score for Automatic MT Evaluation’ in Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon: Association for Computational Linguistics, pp. 392–395. Available at: https://doi.org/10.18653/v1/W15-3049

  39. [39]

    (2017) ’chrF++: Words Helping Character N-Grams’ in Proceedings of the Second Confer- ence on Machine Translation

    Popović, Maja. (2017) ’chrF++: Words Helping Character N-Grams’ in Proceedings of the Second Confer- ence on Machine Translation. Copenhagen: Association for Computational Linguistics, pp. 612–618. Avail- able at: https://doi.org/10.18653/v1/W17-4770

  40. [40]

    (2018) ’ A Call for Clarity in Reporting BLEU Scores’ in Proceedings of the Third Conference on Machine Translation

    Post, Matt. (2018) ’ A Call for Clarity in Reporting BLEU Scores’ in Proceedings of the Third Conference on Machine Translation . Brussels: Association for Computational Linguistics, pp. 186–191. Available at: https://doi.org/10.18653/v1/W18-6319 21

  41. [41]

    Raunak, Vikas, Menezes, Arul, and Junczys-Dowmunt, Marcin. (2021) ’The Curious Case of Hallucinations in Neural Machine Translation’ in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Online: Association for Computational Linguistics, pp. 1172–1183. Available ...

  42. [42]

    COMET : A Neural Framework for MT Evaluation

    Rei, Ricardo et al. (2020) ’COMET: A Neural Framework for MT Evaluation’ in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics, pp. 2685–2702. Available at: https://doi.org/10.18653/v1/2020.emnlp-main.213

  43. [43]

    (2022) ’COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task’ in Proceedings of the Seventh Conference on Machine Translation

    Rei, Ricardo et al. (2022) ’COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task’ in Proceedings of the Seventh Conference on Machine Translation. Abu Dhabi: Association for Computational Linguistics, pp. 578–585. Available at: https://doi.org/10.18653/v1/2022.wmt-1.52

  44. [44]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Reimers, Nils and Gurevych, Iryna. (2019) ’Sentence-BERT: Sentence Embeddings Using Siamese BERT- Networks’ in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing . Hong Kong: Association for Computational Linguistics, pp. 3982–3992. Available at: https://doi.org/ 10.18653/v1/D19-1410

  45. [45]

    (2023) ’Exploring Large Language Models for Classical Philology’ in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics

    Riemenschneider, Frederick and Frank, Anette. (2023) ’Exploring Large Language Models for Classical Philology’ in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics . Toronto: Association for Computational Linguistics, pp. 15181–15199. Available at: https://doi.or g/10.18653/v1/2023.acl-long.846

  46. [46]

    (2020) ’BLEURT: Learning Robust Metrics for Text Generation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

    Sellam, Thibault, Das, Dipanjan, and Parikh, Ankur. (2020) ’BLEURT: Learning Robust Metrics for Text Generation’ in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, pp. 7881–7892. Available at: https://doi.org/10 .18653/v1/2020.acl-main.704

  47. [47]

    (2023) ’Machine Learning for Ancient Languages: A Survey’, Computational Linguistics, 49(3), pp

    Sommerschield, Thea et al. (2023) ’Machine Learning for Ancient Languages: A Survey’, Computational Linguistics, 49(3), pp. 703–747. Available at: https://doi.org/10.1162/coli_a_00481

  48. [48]

    (n.d.) SuttaCentral

    SuttaCentral. (n.d.) SuttaCentral. Available at: https://suttacentral.net (Accessed: 12 May 2026)

  49. [49]

    (n.d.) Suttas

    Thanissaro Bhikkhu. (n.d.) Suttas. Available at: https://www.dhammatalks.org/suttas/ (Accessed: 12 May 2026)

  50. [50]

    (2014) Enlarging Translation, Empowering Translators

    T ymoczko, Maria. (2014) Enlarging Translation, Empowering Translators. London: Routledge

  51. [51]

    (2018) The Translator’s Invisibility: A History of Translation

    Venuti, Lawrence. (2018) The Translator’s Invisibility: A History of Translation. London: Routledge

  52. [52]

    (2006) ’Error Analysis of Statistical Machine Translation Output’ in Proceedings of the Fifth International Conference on Language Resources and Evaluation

    Vilar, David et al. (2006) ’Error Analysis of Statistical Machine Translation Output’ in Proceedings of the Fifth International Conference on Language Resources and Evaluation . Genoa: European Language Re- sources Association. Available at: https://aclanthology.org/L06-1244/ (Accessed: 12 May 2026)

  53. [53]

    0.5-1.5 m height, 30-60 cm spread

    Wilson, Edwin B. (1927) ’Probable Inference, the Law of Succession, and Statistical Inference’, Journal of the American Statistical Association , 22(158), pp. 209–212. Available at: https://doi.org/10.1080/ 01621459.1927.10502953

  54. [54]

    (2024) ’Multiple References with Meaningful Variations Improve Literary Machine Translation’

    Wu, Si, Wieting, John, and Smith, David A. (2024) ’Multiple References with Meaningful Variations Improve Literary Machine Translation’. Available at: https://arxiv.org/abs/2412.18707 (Accessed: 12 May 2026)

  55. [55]

    (2024) ’A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models’ in The Twelfth International Conference on Learning Representations

    Xu, Haoran et al. (2024) ’A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models’ in The Twelfth International Conference on Learning Representations. Available at: https://arxiv.org/abs/2309.11674 (Accessed: 12 May 2026)

  56. [56]

    Zainaldin, James L. et al. (2026) ’Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen’. Available at: https://doi.org/10.48550/arXiv.2602.24119 (Accessed: 12 May 2026)

  57. [57]

    BERTScore: Evaluating Text Generation with BERT

    Zhang, Tianyi et al. (2020) ’BERTScore: Evaluating Text Generation with BERT’ in International Conference on Learning Representations. Available at: https://arxiv.org/abs/1904.09675 (Accessed: 12 May 2026)

  58. [58]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Zhang, Yanzhao et al. (2025) ’Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models’. Available at: https://arxiv.org/abs/2506.05176 (Accessed: 12 May 2026)

  59. [59]

    (2023) ’Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE’ in Proceedings of the Ancient Language Processing Workshop

    Zhang, Yixuan and Li, Haonan. (2023) ’Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE’ in Proceedings of the Ancient Language Processing Workshop. Varna: INCOMA Ltd., pp. 80–87. Available at: https://aclanthology.org/2023.alp-1.9/ (Accessed: 12 May 2026)

  60. [60]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, Lianmin et al. (2023) ’Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena’ in Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track. Available at: https://arxiv.org/abs/2306.05685 (Accessed: 12 May 2026)

  61. [61]

    (2024) ’Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis’

    Zhu, Wenhao et al. (2024) ’Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis’ in Findings of the Association for Computational Linguistics: NAACL 2024. Mexico City: Association for Computational Linguistics, pp. 2765–2781. Available at: https://doi.org/10.18653/v1/2024.findings-naacl.176

    A Translation Prompt A....

  62. [62]

    Use the Pāli text to decide what content belongs to each segment

    Pāli is the authority. Use the Pāli text to decide what content belongs to each segment

  63. [63]

    His segmentation helps you understand boundaries, but the target translator may split/merge differently.

    Sujato is only a guide. His segmentation helps you understand boundaries, but the target translator may split/merge differently.

  64. [64]

    …" or is abbreviated: → Extract ONLY the corresponding term(s) from the translator → Do NOT expand to the full sentence → Example: Pali

    MATCH THE PALI STRUCTURE. The OUTPUT must mirror the structure of the PALI segment: a) If the Pali contains "…" or is abbreviated: → Extract ONLY the corresponding term(s) from the translator → Do NOT expand to the full sentence → Example: Pali "viññātaṁ …" + Sujato "the known …" → Output just "the cognized" (not the full paragraph) b) If the Pali is ful...

  65. [65]

    Extract the smallest text that expresses the Pāli meaning

    Minimal faithful extraction. Extract the smallest text that expresses the Pāli meaning. Prefer contiguous substrings from the original.

  66. [66]

    Process the translation in order

    Respect text order. Process the translation in order. Don't reuse non-repetitive text

  67. [67]

    Discard footnotes, section headers (unless part of translation), editor notes, bracketed references.

    Filter noise. Discard footnotes, section headers (unless part of translation), editor notes, bracketed references.

  68. [68]

    Output null ONLY if you genuinely cannot find matching content

    Null policy. Output null ONLY if you genuinely cannot find matching content. Remember: abbreviated Pali → short output (just the term). OUTPUT FORMAT (STRICT JSON) Return valid JSON with exactly the same keys as the input, in the same order. Each key maps to either a string (extracted text) or null. Example: { "mn1:3.1": "Here, monks, an untaught ordinary...
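
    The output contract quoted above (valid JSON, exactly the same keys as the input in the same order, each value a string or null) can be checked mechanically. The following is a minimal sketch of such a check, not the authors' code: the function name `validate_extraction`, the second key `mn1:3.2`, and the sample values are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import json


def validate_extraction(input_keys: list[str], raw_output: str) -> dict[str, str | None]:
    """Check a model reply against the strict-JSON contract quoted in the prompt.

    Hypothetical helper: raises ValueError when the reply is not valid JSON,
    when its keys differ from the input keys (or are reordered), or when a
    value is neither a string nor null.
    """
    # json.JSONDecodeError is a subclass of ValueError, so invalid JSON
    # surfaces as the same exception type as the other checks.
    parsed = json.loads(raw_output)
    if not isinstance(parsed, dict):
        raise ValueError("top-level JSON value must be an object")
    # json.loads builds a dict in document order, so key order is comparable.
    if list(parsed.keys()) != list(input_keys):
        raise ValueError("keys differ from the input keys or are out of order")
    for key, value in parsed.items():
        if value is not None and not isinstance(value, str):
            raise ValueError(f"value for {key!r} must be a string or null")
    return parsed


# Illustrative reply: one extracted string and one null (placeholder values).
reply = '{"mn1:3.1": "placeholder text", "mn1:3.2": null}'
checked = validate_extraction(["mn1:3.1", "mn1:3.2"], reply)
```

    A check of this kind only enforces the format; whether a null is justified under the null policy, or whether an extracted string is the minimal faithful span, still needs review against the Pāli.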