arxiv: 2606.06420 · v1 · pith:HQJUNWT6new · submitted 2026-06-04 · 💻 cs.CL

A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

Pith reviewed 2026-06-28 01:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords Komi-Yazva · parallel corpus · low-resource machine translation · few-shot prompting · LLM evaluation · endangered languages · zero-shot translation

The pith

LLMs produce non-trivial Komi-Yazva to Russian translations, with retrieval-based few-shot prompting improving over zero-shot but showing limited further gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs drawn from 74 narrative texts, together with documented provenance, story identifiers, and a leakage-aware evaluation protocol using story-level cross-validation and deterministic retrieval. The corpus is used to benchmark large language models on Komi-Yazva-to-Russian translation under extreme parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. Results indicate that LLMs generate non-trivial output across prompting conditions, yet performance differs substantially by model family and regime. Retrieval-based few-shot prompting yields consistent gains over zero-shot, but the benefits saturate at small context sizes, and the final conclusions hinge on metric selection and failure-handling rules.

Core claim

We present the first Komi-Yazva–Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime.

What carries the argument

The Komi-Yazva–Russian parallel corpus of 457 sentence pairs with story identifiers that supports story-level cross-validation and deterministic retrieval for leakage-aware zero- and few-shot evaluation.
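
A minimal sketch of story-level fold assignment, assuming each sentence pair carries a story identifier. The function name `story_level_folds`, the round-robin assignment, and the toy identifiers are illustrative assumptions; the paper's actual fold construction is not described on this page.

```python
from collections import defaultdict

def story_level_folds(story_ids, k=5):
    """Assign whole stories to folds so that no story is split across folds.

    story_ids: one identifier per sentence pair, in corpus order.
    Returns a list of k lists of sentence-pair indices.
    """
    # Sorting the unique identifiers makes the assignment deterministic.
    stories = sorted(set(story_ids))
    fold_of_story = {story: i % k for i, story in enumerate(stories)}

    folds = defaultdict(list)
    for index, story in enumerate(story_ids):
        folds[fold_of_story[story]].append(index)
    return [folds[i] for i in range(k)]

# Toy identifiers (hypothetical, not taken from the corpus).
story_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]
folds = story_level_folds(story_ids, k=2)
```

Because assignment happens per story, every sentence of a held-out story is unseen when few-shot examples are drawn from the remaining folds.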

If this is right

  • Retrieval-based few-shot prompting produces higher-quality translations than zero-shot prompting for Komi-Yazva to Russian.
  • Gains from additional retrieved examples plateau after a small context size.
  • Evaluative rankings of models change depending on the choice of reference-based versus judge-based metrics and on failure-handling rules.
  • The corpus and protocol together constitute a reproducible testbed for translation systems in other endangered-language settings.
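
The dependence of rankings on failure handling can be illustrated with a small sketch. The scores are invented for illustration and are not results from the paper; `mean_score` and the two policies (`"zero"` and `"exclude"`) are hypothetical names for two common conventions.

```python
def mean_score(scores, failures="zero"):
    """Average per-sentence scores, where None marks a failed or invalid output.

    failures="zero":    a failed output counts as a score of 0.
    failures="exclude": failed outputs are dropped before averaging.
    """
    if failures == "zero":
        values = [0.0 if s is None else s for s in scores]
    elif failures == "exclude":
        values = [s for s in scores if s is not None]
    else:
        raise ValueError(f"unknown failure policy: {failures}")
    return sum(values) / len(values) if values else 0.0

# Invented scores: model A is reliable but moderate,
# model B is stronger when it answers but fails half the time.
model_a = [40.0, 40.0, 40.0, 40.0]
model_b = [60.0, 60.0, None, None]

ranking_zero = mean_score(model_a, "zero") > mean_score(model_b, "zero")
ranking_exclude = mean_score(model_a, "exclude") > mean_score(model_b, "exclude")
```

Under the "zero" policy model A ranks first; under "exclude" model B does, which is why a protocol needs to state its failure rule explicitly.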

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same story-level protocol could be replicated for other Uralic or similarly documented endangered languages to enable comparable LLM benchmarks.
  • The observed sensitivity to metric choice and failure handling points to a need for standardized failure taxonomies in low-resource machine translation evaluation.
  • The corpus allows direct testing of whether newer model families or alternative retrieval strategies close the performance gap beyond the small-context regime reported here.

Load-bearing premise

The 457 sentence pairs from 74 texts with their documented provenance and story identifiers are sufficient to support leakage-aware evaluation without material alignment errors or unrepresentative sampling that would invalidate comparisons across prompting regimes.

What would settle it

A demonstration that many sentence pairs contain alignment errors or that story identifiers do not prevent leakage during cross-validation would invalidate the reported differences between zero-shot and few-shot regimes.
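
A leakage check of this kind can be sketched as follows, assuming one story identifier per sentence pair. The helper `leaked_stories` and the toy splits are illustrative assumptions, not the paper's tooling.

```python
def leaked_stories(story_ids, train_idx, test_idx):
    """Return the story identifiers that appear on both sides of a split."""
    train_stories = {story_ids[i] for i in train_idx}
    test_stories = {story_ids[i] for i in test_idx}
    return train_stories & test_stories

# Toy identifiers, one per sentence pair (hypothetical, not from the corpus).
story_ids = ["s1", "s1", "s2", "s2", "s3"]

# Sentence-level split: stories "s1" and "s2" straddle train and test.
sentence_split = leaked_stories(story_ids, train_idx=[0, 2, 4], test_idx=[1, 3])

# Story-level split: every story sits wholly on one side.
story_split = leaked_stories(story_ids, train_idx=[0, 1, 4], test_idx=[2, 3])
```

An empty result for every fold is a necessary condition for leakage-aware evaluation; it does not by itself rule out near-duplicate sentences shared across different stories.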

Figures

Figures reproduced from arXiv: 2606.06420 by Petr Parshakov.

Figure 1: LLM-as-a-judge score as a function of the
Figure 2: Story-level chrF scores for the top-performing models.
Figure 3: Average translation quality versus reliability across models.
Original abstract

We present the first Komi-Yazva–Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.
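
One way deterministic retrieval for few-shot prompting might look is sketched below. The similarity function (Jaccard overlap of character trigrams), the tie-break on pair identifier, and the names `char_ngrams` and `retrieve` are assumptions made for illustration; this page does not specify the retriever the paper uses. The strings are placeholders, not corpus sentences.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def retrieve(query, pool, k=2):
    """Pick the k most similar examples from the pool.

    pool: list of (pair_id, source, target) tuples taken from training
    stories only, so held-out stories never supply examples.
    Ties are broken by pair_id, so the same query always yields the
    same examples in the same order.
    """
    query_grams = char_ngrams(query)

    def similarity(item):
        grams = char_ngrams(item[1])
        union = query_grams | grams
        return len(query_grams & grams) / len(union) if union else 0.0

    ranked = sorted(pool, key=lambda item: (-similarity(item), item[0]))
    return ranked[:k]

# Placeholder strings standing in for source and target sentences.
pool = [(2, "abcd", "T2"), (1, "abcd", "T1"), (3, "xyz", "T3")]
examples = retrieve("abcd", pool, k=2)
```

Sorting on an explicit `(score, id)` key, rather than relying on input order or a randomized index, is what makes the few-shot context reproducible across runs.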

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the first Komi-Yazva–Russian parallel corpus, consisting of 457 aligned sentence pairs drawn from 74 narrative texts with documented provenance and story identifiers. It defines a leakage-aware evaluation protocol using story-level cross-validation and deterministic retrieval, then applies this protocol to compare LLMs on Komi-Yazva-to-Russian translation in zero-shot and retrieval-based few-shot regimes. The central empirical observations are that LLMs produce non-trivial translations whose performance varies strongly by model family and prompting regime, that few-shot retrieval improves over zero-shot with limited further gains beyond small contexts, and that evaluative conclusions depend materially on metric choice and failure handling.

Significance. If the alignment quality and sampling assumptions hold, the work supplies a reproducible testbed and new resource for extremely low-resource endangered-language MT. The explicit leakage-aware design, complementary reference- and judge-based metrics, and story-level uncertainty estimates are concrete strengths that address common pitfalls in few-shot LLM evaluation. The framing as both dataset and protocol contribution, together with the observation that metric choice affects conclusions, provides a useful methodological reference point for future low-resource translation studies.
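
A story-level uncertainty estimate could be computed with a percentile bootstrap that resamples stories rather than sentences. This is a sketch under that assumption; the function name `story_bootstrap_ci`, the resampling scheme, and the toy scores are hypothetical, and the paper's own estimator may differ.

```python
import random

def story_bootstrap_ci(story_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-story scores.

    story_scores: dict mapping story identifier -> score for that story.
    Stories are resampled with replacement; a fixed seed keeps the
    interval reproducible.
    """
    rng = random.Random(seed)
    stories = sorted(story_scores)  # fixed order, independent of dict order
    means = []
    for _ in range(n_boot):
        sample = [story_scores[rng.choice(stories)] for _ in stories]
        means.append(sum(sample) / len(sample))
    means.sort()
    low_index = int((alpha / 2) * n_boot)
    high_index = max(low_index, int((1 - alpha / 2) * n_boot) - 1)
    return means[low_index], means[high_index]

# Invented per-story scores, for illustration only.
toy_scores = {"s1": 10.0, "s2": 20.0, "s3": 30.0}
low, high = story_bootstrap_ci(toy_scores)
```

Resampling at the story level keeps sentences from the same narrative together, which matches the unit used for cross-validation.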

major comments (2)
  1. [Dataset construction and evaluation protocol] The evaluation protocol (described in the abstract and §4) rests on accurate sentence alignments and representative sampling from the 74 narratives, yet no quantitative alignment-error rate, inter-annotator agreement, or diversity statistics are reported. Without these, it is impossible to rule out that observed differences between zero-shot and few-shot regimes are artifacts of alignment noise or story-specific leakage rather than genuine prompting effects.
  2. [Results and discussion] The claim that 'evaluative conclusions depend materially on metric choice and failure handling' is load-bearing for the paper's framing as a testbed, but the manuscript supplies no error analysis or breakdown of failure modes (e.g., by story or by model) that would allow readers to assess how sensitive the reported trends are to specific metric implementations.
minor comments (1)
  1. [Evaluation protocol] The abstract states that the protocol includes 'strict validation of generated outputs' but does not define the exact validation criteria; a short explicit list in the protocol section would improve reproducibility.
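
To make the request concrete, an explicit validation list might take a form like the sketch below. Every criterion here (non-empty output, presence of Cyrillic characters, a single line, a length cap relative to the source) is a hypothetical example chosen for illustration; none of them is confirmed as the paper's actual rule set.

```python
import re

# Basic Cyrillic Unicode block; Russian output is expected to use it.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def validate_output(source, output, max_ratio=3.0):
    """Return (is_valid, reason) for one generated translation.

    All checks are illustrative assumptions, not the paper's criteria.
    """
    text = output.strip()
    if not text:
        return False, "empty"
    if not CYRILLIC.search(text):
        return False, "no_cyrillic"
    if "\n" in text:
        return False, "multi_line"
    if len(text) > max_ratio * max(len(source), 1):
        return False, "too_long"
    return True, "ok"

# Placeholder strings, not corpus sentences.
results = [
    validate_output("абв", "где"),
    validate_output("абв", "   "),
    validate_output("абв", "hello"),
]
```

Returning a reason code alongside the verdict would also support the per-model breakdown of failure modes requested in the second major comment.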

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of dataset quality and analysis depth. We address each major comment below and indicate planned revisions.

  1. Referee: The evaluation protocol (described in the abstract and §4) rests on accurate sentence alignments and representative sampling from the 74 narratives, yet no quantitative alignment-error rate, inter-annotator agreement, or diversity statistics are reported. Without these, it is impossible to rule out that observed differences between zero-shot and few-shot regimes are artifacts of alignment noise or story-specific leakage rather than genuine prompting effects.

    Authors: We agree that explicit quantitative measures of alignment quality and sampling diversity would strengthen confidence in the protocol. The alignments were performed manually with documented provenance from the source narratives; however, we did not compute formal inter-annotator agreement or alignment-error rates in the original submission. In the revised manuscript we will add a dedicated subsection on corpus construction that reports (i) the alignment procedure, (ii) any available spot-check error estimates, and (iii) basic diversity statistics (sentence-length distribution, narrative-topic coverage, and story-size histogram). These additions will allow readers to assess potential noise or leakage more directly.

    Revision: yes

  2. Referee: The claim that 'evaluative conclusions depend materially on metric choice and failure handling' is load-bearing for the paper's framing as a testbed, but the manuscript supplies no error analysis or breakdown of failure modes (e.g., by story or by model) that would allow readers to assess how sensitive the reported trends are to specific metric implementations.

    Authors: We accept that the current manuscript lacks a systematic error analysis to substantiate the metric-sensitivity claim. In the revision we will insert a new subsection (likely in §5 or an appendix) that provides (i) qualitative examples of common failure modes, (ii) quantitative breakdowns of error types by model and by story, and (iii) a sensitivity table showing how key trends change under alternative failure-handling rules. This will make the methodological point more concrete and reproducible.

    Revision: yes

Circularity Check

0 steps flagged

Empirical corpus creation and model evaluation shows no circularity

The paper constructs a new parallel corpus of 457 sentence pairs from 74 texts and applies it to direct empirical comparisons of LLM translation performance under zero-shot and few-shot regimes. No derivations, equations, fitted parameters, or predictions are present; claims rest on measured outputs from external models using documented alignments and story-level splits. No self-citations are load-bearing for any result, and the evaluation protocol is self-contained against the collected data without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical dataset and protocol contribution; it introduces no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5723 in / 1193 out tokens · 46586 ms · 2026-06-28T01:20:36.957282+00:00 · methodology
