arxiv: 2605.15976 · v1 · pith:RCQJTJEOnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

Pith reviewed 2026-05-20 17:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords machine translation · reinforcement learning · reference-free · GRPO · NLLB-200 · Seq2Seq · low-resource MT · policy optimization

The pith

Reinforcement learning with reference-free rewards improves Seq2Seq machine translation across 13 languages without parallel data at fine-tuning time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that Group Relative Policy Optimization can be used to fine-tune encoder-decoder models such as NLLB-200 for machine translation. It relies on a hybrid reward combining LaBSE and COMET-Kiwi scores that needs no parallel data or references during fine-tuning. This yields consistent quality gains on all 13 languages examined, with the largest gain reaching +5.03 chrF++ on Traditional Chinese. The improvements are especially notable on morphologically complex languages, where the method competes with three epochs of supervised fine-tuning even without target-language data. A reader would care because this offers a practical route to better translation systems in settings where parallel data is hard to obtain.

Core claim

We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.

What carries the argument

Group Relative Policy Optimization (GRPO) driven by a hybrid LaBSE and COMET-Kiwi reward signal. It enables relative comparisons among candidate translations to update the policy of the encoder-decoder model without reference translations.
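
The relative comparison among candidates can be illustrated with a minimal sketch. It assumes that both reward components are already available as numbers on a comparable scale and that they are blended with an equal weight; the function names, the weight, and the example scores are hypothetical and are not taken from the paper. The group-relative step follows the usual GRPO formulation, in which each candidate's reward is standardized against the other candidates sampled for the same source sentence.

```python
from statistics import mean, pstdev


def hybrid_reward(labse_sim, kiwi_score, weight=0.5):
    """Blend two reference-free scores.

    The equal weighting is an assumption for illustration only.
    """
    return weight * labse_sim + (1.0 - weight) * kiwi_score


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidate translations."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all candidates score the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four sampled translations of one source sentence.
labse = [0.82, 0.90, 0.75, 0.88]
kiwi = [0.70, 0.85, 0.60, 0.80]

rewards = [hybrid_reward(l, k) for l, k in zip(labse, kiwi)]
advantages = group_relative_advantages(rewards)
```

Candidates with a positive advantage would be reinforced and those with a negative advantage discouraged. When every candidate in a group receives the same reward, all advantages are zero and the group contributes no learning signal.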

If this is right

  • Consistent quality improvements occur on all 13 tested languages.
  • Gains are largest on languages with the weakest baseline performance.
  • The method competes with supervised fine-tuning for morphologically complex languages without target data.
  • The same pattern appears when translating from English and from Spanish.
  • Both the 600M and 1.3B parameter NLLB models benefit from the approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reference-free RL could be tried on other encoder-decoder architectures for translation or related generation tasks.
  • It suggests RL fine-tuning is viable for mid-sized Seq2Seq models rather than only very large decoder-only systems.
  • Developers working on low-resource languages might adopt this to improve systems where collecting parallel data is costly.
  • Testing the approach with even smaller models or additional language pairs would clarify its scalability.

Load-bearing premise

The hybrid reference-free reward from LaBSE and COMET-Kiwi provides an accurate and unbiased signal of translation quality to guide effective policy optimization across typologically diverse languages.

What would settle it

Running human evaluations or an independent quality metric on the outputs of the GRPO-tuned models to check whether the reported chrF++ gains correspond to actual improvements in translation quality.

Original abstract

Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to $+$5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
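
The pattern described in the abstract, with gains largest where reward discriminability is highest, could be checked along the following lines. This is a sketch under stated assumptions: discriminability is taken here to be the mean within-group standard deviation of rewards over sampled candidates, which may differ from the paper's definition, and the function names are hypothetical. The correlation is a plain Pearson coefficient computed over per-language values.

```python
from math import sqrt
from statistics import mean, pstdev


def discriminability(reward_groups):
    """Mean within-group standard deviation of rewards for one language.

    Each inner list holds the rewards of the candidates sampled for one
    source sentence. This definition is an assumption for illustration.
    """
    return mean(pstdev(group) for group in reward_groups)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

With one discriminability value and one chrF++ gain per language, `pearson(discriminabilities, gains)` would give a single summary number for the claimed relationship. With only 13 languages such a coefficient is noisy and would need to be read with caution.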

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper applies Group Relative Policy Optimization (GRPO) to fine-tune NLLB-200 encoder-decoder Seq2Seq models (600M and 1.3B) for machine translation using a hybrid reference-free reward (LaBSE + COMET-Kiwi) that requires no parallel data during fine-tuning. It evaluates across 13 typologically diverse languages from English and Spanish sources, claiming consistent chrF++ gains on all languages (up to +5.03 for Traditional Chinese) that are largest where baselines are weakest, and competitive performance with 3-epoch supervised fine-tuning on morphologically complex languages without target-language data.

Significance. If the empirical results hold under rigorous validation, the work would be significant for low-resource MT by showing that RL fine-tuning with reference-free rewards can improve production-style encoder-decoder models without target data, extending RL techniques beyond decoder-only LLMs. The reported pattern tying gains to baseline weakness and reward discriminability, if reproducible, offers a practical insight for prioritizing such methods where parallel data is scarcest.

major comments (3)
  1. [Abstract] Abstract: the headline claims of consistent improvements across all 13 languages and competition with supervised fine-tuning rest on reported chrF++ deltas (e.g., +5.03 for Traditional Chinese), yet no statistical significance tests, run-to-run variance, or exact data splits are described, leaving the robustness of these gains unassessable.
  2. [Evaluation] Evaluation and reward design: the central assumption that the hybrid LaBSE + COMET-Kiwi reward supplies an accurate, unbiased optimization signal across typologically diverse and morphologically complex languages is load-bearing for the claim of genuine quality gains rather than reward artifacts; no per-language reward-human correlations, ablation removing one component, or bias analysis is provided despite known language-pair and morphological biases in these metrics.
  3. [Results] Results: the pattern that gains are largest where baseline performance is weakest is presented as an empirical finding, but without explicit tables or controls showing that this is not an artifact of the reward's discriminability correlating with the evaluation metric, the interpretation that the method is 'most effective precisely where parallel data is scarcest' remains under-supported.
minor comments (1)
  1. [Abstract] The abstract and results would benefit from a brief statement of the exact number of languages, source-target pairs, and whether the 13 languages include both high- and low-resource cases to clarify the scope.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript's robustness and clarity.

  1. Referee: [Abstract] Abstract: the headline claims of consistent improvements across all 13 languages and competition with supervised fine-tuning rest on reported chrF++ deltas (e.g., +5.03 for Traditional Chinese), yet no statistical significance tests, run-to-run variance, or exact data splits are described, leaving the robustness of these gains unassessable.

    Authors: We agree that providing statistical significance and variance estimates would enhance the reliability of our reported improvements. In the revised manuscript, we will rerun experiments with multiple random seeds (e.g., 3-5 runs) to report average chrF++ scores with standard deviations. We will also include statistical significance tests, such as paired bootstrap resampling or Wilcoxon signed-rank tests, to assess the significance of the gains over baselines. For data splits, we utilized the publicly available FLORES-200 development and test sets for all languages, which we will explicitly document in the experimental setup section.

    revision: yes

  2. Referee: [Evaluation] Evaluation and reward design: the central assumption that the hybrid LaBSE + COMET-Kiwi reward supplies an accurate, unbiased optimization signal across typologically diverse and morphologically complex languages is load-bearing for the claim of genuine quality gains rather than reward artifacts; no per-language reward-human correlations, ablation removing one component, or bias analysis is provided despite known language-pair and morphological biases in these metrics.

    Authors: This is a valid concern regarding the potential for reward hacking or metric biases. We will incorporate an ablation study in the revised paper that evaluates the contribution of each reward component (LaBSE alone, COMET-Kiwi alone, and the hybrid) across the language pairs. Additionally, we will add a discussion of known biases in LaBSE and COMET-Kiwi, particularly for morphologically complex languages, and how our hybrid approach aims to mitigate them. However, performing new per-language human correlation studies would require substantial additional resources and human annotations not available in the current experimental setup; we will acknowledge this as a limitation and suggest it for future work.

    revision: partial

  3. Referee: [Results] Results: the pattern that gains are largest where baseline performance is weakest is presented as an empirical finding, but without explicit tables or controls showing that this is not an artifact of the reward's discriminability correlating with the evaluation metric, the interpretation that the method is 'most effective precisely where parallel data is scarcest' remains under-supported.

    Authors: To better support this interpretation, we will add new figures and tables in the results section of the revised manuscript. Specifically, we will compute and report the discriminability of the reward (measured as the standard deviation of reward scores on sampled outputs or the margin between high and low reward translations) for each language and correlate it with both baseline performance and observed gains. We will also include a control analysis showing the correlation between the reward model and chrF++ on held-out data to demonstrate that the pattern is not merely an artifact. This will provide stronger evidence for the claim.

    revision: yes
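
The paired bootstrap resampling mentioned in the first response can be sketched as follows. This is an illustration, not the authors' procedure: it assumes per-sentence scores aligned by index for a baseline and a fine-tuned system, and it compares summed sentence-level scores in each resample, whereas a full implementation would recompute corpus-level chrF++ from the resampled sentences. The function name and parameters are hypothetical.

```python
import random


def paired_bootstrap(baseline_scores, system_scores, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which the system beats the baseline.

    Both lists hold per-sentence scores for the same test sentences,
    aligned by index. The same resampled indices are applied to both
    systems, which is what makes the test paired.
    """
    if len(baseline_scores) != len(system_scores):
        raise ValueError("score lists must have the same length")
    rng = random.Random(seed)
    n = len(baseline_scores)
    wins = 0
    for _ in range(n_resamples):
        # Sample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        base_total = sum(baseline_scores[i] for i in idx)
        sys_total = sum(system_scores[i] for i in idx)
        if sys_total > base_total:
            wins += 1
    return wins / n_resamples


# Hypothetical per-sentence scores for five test sentences.
baseline = [41.0, 38.5, 52.3, 47.1, 35.9]
tuned = [43.2, 39.0, 51.8, 49.5, 37.4]
win_rate = paired_bootstrap(baseline, tuned)
```

A win rate close to 1.0 would suggest the improvement is stable under resampling of the test set; it does not capture run-to-run variance across training seeds, which would need separate runs.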

Circularity Check

0 steps flagged

Empirical evaluation on held-out test sets with no reduction of gains to fitted parameters

The paper reports empirical improvements from applying GRPO to NLLB-200 models using a hybrid LaBSE+COMET-Kiwi reward, measured via chrF++ on standard held-out test sets across 13 languages. No equations or derivation steps are presented that reduce the observed gains to quantities defined by parameters fitted within the paper itself. The central claims rest on independent test-set comparisons rather than any self-definitional or fitted-input construction, rendering the results self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach depends on the pre-trained quality of LaBSE and COMET-Kiwi as reward models and on the assumption that RL optimization will improve downstream chrF++ without introducing new biases; no free parameters are explicitly listed in the abstract.

axioms (1)
  • domain assumption: LaBSE and COMET-Kiwi together form a reliable reference-free proxy for human-judged translation quality across typologically diverse languages.
    This assumption is required for the reward signal to guide useful policy updates.

pith-pipeline@v0.9.0 · 5714 in / 1235 out tokens · 78645 ms · 2026-05-20T17:53:50.554706+00:00 · methodology

