Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation
Pith reviewed 2026-06-27 19:56 UTC · model grok-4.3
The pith
Reinforcement learning on translation-quality rewards trains 4B models to rewrite source text more effectively than prompting does.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rewriting source text with large language models before translation has been shown to improve machine translation quality. However, prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs such as 4B-parameter models are used. This limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone. RLSR addresses the issue by training the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite.
What carries the argument
RLSR, a reinforcement learning framework that trains a source-rewriting model using a reward signal equal to the measured improvement in downstream machine translation quality.
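The reward described above can be sketched as a difference of two quality scores: one for the translation of the rewrite and one for the translation of the original source. This is a minimal sketch, not the paper's implementation. The names `rewrite_reward`, `translate`, and `score` are hypothetical, and scoring against a reference translation is an assumption (the reviewed text does not say whether the reward metric uses references). The toy MT system and toy metric exist only to make the example runnable.

```python
from typing import Callable


def rewrite_reward(
    source: str,
    rewrite: str,
    reference: str,
    translate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Reward = quality of MT(rewrite) minus quality of MT(source).

    `translate` and `score` are hypothetical stand-ins for an MT system
    and a quality metric; neither is taken from the paper.
    """
    baseline_quality = score(translate(source), reference)
    rewritten_quality = score(translate(rewrite), reference)
    # Positive when the rewrite helps translation, negative when it hurts.
    return rewritten_quality - baseline_quality


def toy_translate(text: str) -> str:
    # Toy "MT system": upper-casing stands in for translation.
    return text.upper()


def toy_score(hypothesis: str, reference: str) -> float:
    # Toy metric: fraction of reference tokens that appear in the hypothesis.
    ref_tokens = set(reference.split())
    return len(set(hypothesis.split()) & ref_tokens) / len(ref_tokens)


helpful = rewrite_reward("a b", "a b c", "A B C", toy_translate, toy_score)
neutral = rewrite_reward("a b", "a b", "A B C", toy_translate, toy_score)
```

A rewrite identical to the source earns a reward of zero under this definition, so the no-rewriting baseline is the natural reference point for the policy.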
If this is right
- 4B RLSR-trained rewriting models significantly outperform the no-rewriting baseline.
- They also outperform prompt-based rewriting baselines that use models of the same size.
- Their performance remains competitive with rewriting baselines that rely on a 235B LLM.
- The gains appear across six different MT systems and 16 language pairs.
Where Pith is reading between the lines
- The same reward-from-downstream-task pattern could be tested on other pre-processing steps whose value is judged only after a later model runs.
- Explicit RL optimization may prove more reliable than prompting when the desired behavior is hard to describe in natural language.
- Specialized smaller models trained this way might reduce reliance on very large general-purpose models inside translation pipelines.
Load-bearing premise
The measured improvement in downstream translation quality must provide a stable reward signal that resists reward hacking and is sufficient to train the rewriter without introducing artifacts that degrade other aspects of the output or the MT system itself.
What would settle it
An experiment on held-out data in which translations produced after RLSR rewriting show no statistically significant gain over the no-rewriting or prompt-based baselines on standard quality metrics.
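One way to run such a check is paired bootstrap resampling, the test named in the simulated rebuttal further down. The sketch below assumes a segment-level metric whose corpus score is the sum or mean of per-segment scores. A corpus-level metric such as BLEU would have to be recomputed on each resample instead. The function name and its parameters are illustrative, not taken from the paper.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(
    system_scores: Sequence[float],
    baseline_scores: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which the system fails to beat the baseline."""
    if len(system_scores) != len(baseline_scores) or not system_scores:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(system_scores)
    not_better = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement; both systems share the draw,
        # which is what makes the test paired.
        indices = [rng.randrange(n) for _ in range(n)]
        delta = sum(system_scores[i] - baseline_scores[i] for i in indices)
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples
```

Under this sketch, a value below 0.05 would correspond to the significance threshold the rebuttal quotes. A value near or above that threshold on held-out data would be the negative outcome described above.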
Original abstract
Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RLSR, a reinforcement learning framework for training source rewriting models in machine translation. The reward signal is the improvement in downstream translation quality after rewriting. Experiments across six MT systems and 16 language pairs claim that 4B-parameter RLSR models significantly outperform no-rewriting and prompt-based rewriting baselines at the same scale, while remaining competitive with 235B LLM baselines.
Significance. If the results hold, the work shows that RL can make smaller rewriting models effective by directly optimizing for translation improvement rather than relying on prompts, addressing a limitation of prompt-based methods for 4B-scale models. This could reduce dependence on much larger LLMs for preprocessing in MT pipelines.
Major comments (2)
- [Abstract] The claim of significant outperformance across six MT systems and 16 language pairs supplies no details on the MT metrics used for the reward, statistical significance tests, error bars, data splits, or controls for confounds. This information is load-bearing for the central empirical claim.
- [Experiments (results description)] The central claim requires that downstream MT quality provides a stable, non-exploitable reward for RL training of the 4B rewriter. No auxiliary checks (human evaluation, side-effect metrics on fluency/adequacy, or ablations on reward variance) are described to rule out metric hacking via superficial changes favored by the MT system or metric.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments focus on strengthening the presentation of empirical results and verifying the reliability of the RL reward. We respond point-by-point below, indicating planned revisions where appropriate.
- Referee: [Abstract] The claim of significant outperformance across six MT systems and 16 language pairs supplies no details on the MT metrics used for the reward, statistical significance tests, error bars, data splits, or controls for confounds. This information is load-bearing for the central empirical claim.
  Authors: We agree that the abstract would be strengthened by including a few key qualifiers. The manuscript body (Sections 4.1 and 5.1) specifies COMET as the primary reward metric, paired bootstrap tests (p < 0.05) for significance, standard deviations across three random seeds for error bars, WMT 2022/2023 test splits, and controls via six distinct MT systems. In the revision we will add a concise clause to the abstract noting the primary metric and significance testing, while keeping the abstract within length limits.
  Revision: partial
- Referee: [Experiments (results description)] The central claim requires that downstream MT quality provides a stable, non-exploitable reward for RL training of the 4B rewriter. No auxiliary checks (human evaluation, side-effect metrics on fluency/adequacy, or ablations on reward variance) are described to rule out metric hacking via superficial changes favored by the MT system or metric.
  Authors: We share the concern about reward stability. The current experiments already use six different MT systems as reward providers and report consistent gains under both COMET and BLEU, which provides some protection against single-metric exploitation. However, explicit ablations on reward variance across training steps and side-effect metrics (e.g., source perplexity for fluency) are not presented. We will add a short subsection and table in the revision to include these analyses. Human evaluation was not conducted owing to scale; we can note this limitation and offer to perform a small-scale study if the referee considers it essential.
  Revision: partial
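The cross-system consistency argument in the second response can be illustrated with a small summary over per-system rewards for a single rewrite. This is a hedged sketch only: the function `summarize_rewards`, the system names, and the reward values are hypothetical, and the reviewed text does not describe how the authors aggregate rewards across MT systems.

```python
from statistics import mean, pstdev
from typing import Dict, Mapping


def summarize_rewards(rewards_by_system: Mapping[str, float]) -> Dict[str, float]:
    """Summarize one rewrite's reward across several MT systems."""
    values = list(rewards_by_system.values())
    if not values:
        raise ValueError("need at least one MT system reward")
    return {
        "mean": mean(values),
        # A large spread suggests the gain depends on one system's quirks.
        "std": pstdev(values),
        # Number of systems for which the rewrite made translation worse.
        "n_negative": float(sum(v < 0 for v in values)),
    }


# Hypothetical per-system rewards for one rewrite (illustrative values only).
summary = summarize_rewards({"mt_a": 1.0, "mt_b": 3.0})
```

A rewrite with a high mean but a large spread, or with negative rewards on some systems, would be a candidate for the system-specific exploitation the referee is worried about.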
Circularity Check
No circularity: empirical RL method with external MT rewards
The paper proposes RLSR as an RL framework that trains a rewriter using downstream translation quality as the reward signal. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experiments across six MT systems and 16 language pairs that compare against baselines, which are independent external evaluations rather than reductions to the method's own inputs by construction. This is a standard empirical setup with no load-bearing self-definitional or uniqueness steps.