Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective
Pith reviewed 2026-05-20 17:53 UTC · model grok-4.3
The pith
Reinforcement learning with reference-free rewards improves Seq2Seq machine translation across 13 languages without parallel data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
What carries the argument
Group Relative Policy Optimization (GRPO) driven by a hybrid LaBSE and COMET-Kiwi reward signal. It enables relative comparisons among candidate translations to update the policy of the encoder-decoder model without reference translations.
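To make the "relative comparisons among candidate translations" concrete, the sketch below computes group-relative advantages in the way GRPO is commonly described: each candidate's reward is centred on the mean of its group and scaled by the group's standard deviation. The function name, the choice of population standard deviation, and the eps constant are assumptions for illustration; the paper's exact formulation is not reproduced on this page.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Turn the rewards of one group of sampled translations into advantages.

    `rewards` holds one scalar reward per candidate translation of the same
    source sentence. `eps` (an assumed constant) avoids division by zero when
    every candidate receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates that score above the group mean get a positive advantage and are reinforced; those below get a negative one. When all candidates score the same, every advantage is zero and the group contributes no learning signal, which is one reason reward discriminability matters for this method.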
If this is right
- Consistent quality improvements occur on all 13 tested languages.
- Gains are largest on languages with the weakest baseline performance.
- The method competes with 3-epoch supervised fine-tuning on morphologically complex languages without any target-language data.
- The same pattern appears when translating from English and from Spanish.
- Both the 600M and 1.3B parameter NLLB models benefit from the approach.
Where Pith is reading between the lines
- Reference-free RL could be tried on other encoder-decoder architectures for translation or related generation tasks.
- The results suggest RL fine-tuning is viable for mid-sized Seq2Seq models, not only for very large decoder-only systems.
- Developers working on low-resource languages might adopt this to improve systems where collecting parallel data is costly.
- Testing the approach with even smaller models or additional language pairs would clarify its scalability.
Load-bearing premise
The hybrid reference-free reward from LaBSE and COMET-Kiwi provides an accurate and unbiased signal of translation quality to guide effective policy optimization across typologically diverse languages.
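This page does not state how the two reward components are combined. As a minimal sketch, assuming a convex combination of a LaBSE cosine similarity and a COMET-Kiwi score that have already been computed and scaled to comparable ranges, a hybrid reward might look like the following. The function name and the labse_weight parameter are hypothetical and are not taken from the paper.

```python
def hybrid_reward(labse_similarity, kiwi_score, labse_weight=0.5):
    """Blend two precomputed reference-free scores into one scalar reward.

    `labse_similarity`: source/hypothesis embedding similarity (assumed input).
    `kiwi_score`: quality-estimation score for the same pair (assumed input).
    `labse_weight`: hypothetical mixing weight in [0, 1].
    """
    if not 0.0 <= labse_weight <= 1.0:
        raise ValueError("labse_weight must lie in [0, 1]")
    return labse_weight * labse_similarity + (1.0 - labse_weight) * kiwi_score
```

Under this assumption, the premise above amounts to saying that no setting of the weight lets the policy raise the blended score without also improving translation quality.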
What would settle it
Running human evaluations or an independent quality metric on the outputs of the GRPO-tuned models to check whether the reported chrF++ gains correspond to actual improvements in translation quality.
Original abstract
Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies Group Relative Policy Optimization (GRPO) to fine-tune NLLB-200 encoder-decoder Seq2Seq models (600M and 1.3B) for machine translation using a hybrid reference-free reward (LaBSE + COMET-Kiwi) that requires no parallel data during fine-tuning. It evaluates across 13 typologically diverse languages from English and Spanish sources, claiming consistent chrF++ gains on all languages (up to +5.03 for Traditional Chinese) that are largest where baselines are weakest, and competitive performance with 3-epoch supervised fine-tuning on morphologically complex languages without target-language data.
Significance. If the empirical results hold under rigorous validation, the work would be significant for low-resource MT by showing that RL fine-tuning with reference-free rewards can improve production-style encoder-decoder models without target data, extending RL techniques beyond decoder-only LLMs. The reported pattern tying gains to baseline weakness and reward discriminability, if reproducible, offers a practical insight for prioritizing such methods where parallel data is scarcest.
major comments (3)
- [Abstract] Abstract: the headline claims of consistent improvements across all 13 languages and competition with supervised fine-tuning rest on reported chrF++ deltas (e.g., +5.03 for Traditional Chinese), yet no statistical significance tests, run-to-run variance, or exact data splits are described, leaving the robustness of these gains unassessable.
- [Evaluation] Evaluation and reward design: the central assumption that the hybrid LaBSE + COMET-Kiwi reward supplies an accurate, unbiased optimization signal across typologically diverse and morphologically complex languages is load-bearing for the claim of genuine quality gains rather than reward artifacts; no per-language reward-human correlations, ablation removing one component, or bias analysis is provided despite known language-pair and morphological biases in these metrics.
- [Results] Results: the pattern that gains are largest where baseline performance is weakest is presented as an empirical finding, but without explicit tables or controls showing that this is not an artifact of the reward's discriminability correlating with the evaluation metric, the interpretation that the method is 'most effective precisely where parallel data is scarcest' remains under-supported.
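The significance testing requested in the first major comment could be approached with paired bootstrap resampling. The sketch below is a simplification: it assumes per-segment scores are available for both systems and compares their means, whereas chrF++ is normally reported at corpus level and would need to be recomputed on each resample. The function name and defaults are illustrative.

```python
import random


def paired_bootstrap_win_rate(baseline_scores, tuned_scores,
                              n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which the tuned system wins.

    Both lists hold one score per test segment, aligned by index. Each
    resample draws segment indices with replacement and compares the mean
    score difference on that sample.
    """
    if len(baseline_scores) != len(tuned_scores):
        raise ValueError("score lists must be aligned and of equal length")
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    n = len(baseline_scores)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(tuned_scores[i] - baseline_scores[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / n_resamples
```

A win rate close to 1.0 indicates the gain is stable under resampling of the test set; values near 0.5 indicate the difference could easily be noise. This addresses test-set sampling variance only, not run-to-run variance across training seeds.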
minor comments (1)
- [Abstract] The abstract and results would benefit from a brief statement of the exact number of languages, source-target pairs, and whether the 13 languages include both high- and low-resource cases to clarify the scope.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript's robustness and clarity.
- Referee: [Abstract] Abstract: the headline claims of consistent improvements across all 13 languages and competition with supervised fine-tuning rest on reported chrF++ deltas (e.g., +5.03 for Traditional Chinese), yet no statistical significance tests, run-to-run variance, or exact data splits are described, leaving the robustness of these gains unassessable.
  Authors: We agree that providing statistical significance and variance estimates would enhance the reliability of our reported improvements. In the revised manuscript, we will rerun experiments with multiple random seeds (e.g., 3-5 runs) to report average chrF++ scores with standard deviations. We will also include statistical significance tests, such as paired bootstrap resampling or Wilcoxon signed-rank tests, to assess the significance of the gains over baselines. For data splits, we utilized the publicly available FLORES-200 development and test sets for all languages, which we will explicitly document in the experimental setup section.
  revision: yes
- Referee: [Evaluation] Evaluation and reward design: the central assumption that the hybrid LaBSE + COMET-Kiwi reward supplies an accurate, unbiased optimization signal across typologically diverse and morphologically complex languages is load-bearing for the claim of genuine quality gains rather than reward artifacts; no per-language reward-human correlations, ablation removing one component, or bias analysis is provided despite known language-pair and morphological biases in these metrics.
  Authors: This is a valid concern regarding the potential for reward hacking or metric biases. We will incorporate an ablation study in the revised paper that evaluates the contribution of each reward component (LaBSE alone, COMET-Kiwi alone, and the hybrid) across the language pairs. Additionally, we will add a discussion of known biases in LaBSE and COMET-Kiwi, particularly for morphologically complex languages, and how our hybrid approach aims to mitigate them. However, performing new per-language human correlation studies would require substantial additional resources and human annotations not available in the current experimental setup; we will acknowledge this as a limitation and suggest it for future work.
  revision: partial
- Referee: [Results] Results: the pattern that gains are largest where baseline performance is weakest is presented as an empirical finding, but without explicit tables or controls showing that this is not an artifact of the reward's discriminability correlating with the evaluation metric, the interpretation that the method is 'most effective precisely where parallel data is scarcest' remains under-supported.
  Authors: To better support this interpretation, we will add new figures and tables in the results section of the revised manuscript. Specifically, we will compute and report the discriminability of the reward (measured as the standard deviation of reward scores on sampled outputs or the margin between high and low reward translations) for each language and correlate it with both baseline performance and observed gains. We will also include a control analysis showing the correlation between the reward model and chrF++ on held-out data to demonstrate that the pattern is not merely an artifact. This will provide stronger evidence for the claim.
  revision: yes
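The analysis promised in the last response can be sketched as follows, assuming discriminability is taken to be the standard deviation of reward scores over sampled outputs (one of the two definitions the authors mention) and that the correlation is a plain Pearson coefficient. Both function names are illustrative, and no values from the paper are used.

```python
import math
import statistics


def reward_discriminability(sampled_rewards):
    """Spread of reward scores over sampled translations for one language."""
    return statistics.pstdev(sampled_rewards)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two aligned sequences of length >= 2")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)
```

In use, one would compute reward_discriminability once per language, then pass the 13 resulting values to pearson together with the per-language chrF++ gains, and again with the per-language baseline scores. With only 13 points, a rank correlation and a confidence interval would be sensible companions to the raw coefficient.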
Circularity Check
Empirical evaluation on held-out test sets with no reduction of gains to fitted parameters
full rationale
The paper reports empirical improvements from applying GRPO to NLLB-200 models using a hybrid LaBSE+COMET-Kiwi reward, measured via chrF++ on standard held-out test sets across 13 languages. No equations or derivation steps are presented that reduce the observed gains to quantities defined by parameters fitted within the paper itself. The central claims rest on independent test-set comparisons rather than any self-definitional or fitted-input construction, rendering the results self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LaBSE and COMET-Kiwi together form a reliable reference-free proxy for human-judged translation quality across typologically diverse languages.