pith.

arxiv: 2606.08011 · v2 · pith:6ROPU2IT · submitted 2026-06-06 · 💻 cs.CL · cs.AI

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Pith reviewed 2026-06-27 19:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: machine translation, source rewriting, reinforcement learning, large language models, RLSR, prompt-based methods

The pith

Reinforcement learning on translation-quality rewards trains 4B-parameter models to rewrite source text more effectively than prompting does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prompting language models to rewrite source sentences before translation often fails or lowers quality when the model is small, because prompts give no direct signal about whether the rewrite actually helps the translator. It introduces RLSR, a reinforcement learning method that scores each rewrite by the gain it produces in the final translation metric and uses that score as the training reward. Across six machine translation systems and 16 language pairs, the resulting 4B-parameter rewriters beat both a no-rewriting baseline and prompt-based rewriting at the same model size while staying competitive with approaches that call a 235B-parameter model.

Core claim

Rewriting source text with large language models before translation has been shown to improve machine translation quality. However, prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs such as 4B-parameter models are used. This limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone. RLSR addresses the issue by training the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite.

What carries the argument

RLSR, a reinforcement learning framework that trains a source-rewriting model using a reward signal equal to the measured improvement in downstream machine translation quality.
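To make the reward concrete, here is a minimal Python sketch, assuming a reference-based metric with the signature (source, hypothesis, reference). The page does not say which source the metric conditions on; scoring both translations against the original source is an assumption here, chosen so that a rewrite cannot raise its reward just by changing what the metric sees. `rewrite_reward`, `toy_translate`, and `toy_metric` are hypothetical names, and the toy MT system and metric only stand in for a real MT model and a learned metric such as COMET.

```python
from typing import Callable

# Hypothetical interfaces: an MT system maps source text to a translation,
# and a metric scores a (source, hypothesis, reference) triple.
Translator = Callable[[str], str]
Metric = Callable[[str, str, str], float]


def rewrite_reward(
    original: str,
    rewritten: str,
    reference: str,
    translate: Translator,
    metric: Metric,
) -> float:
    """Reward = metric gain from translating the rewrite instead of the original.

    Assumption: both translations are scored against the ORIGINAL source.
    """
    baseline = metric(original, translate(original), reference)
    improved = metric(original, translate(rewritten), reference)
    return improved - baseline


def toy_translate(text: str) -> str:
    """Hypothetical stand-in MT system: simply lowercases the input."""
    return text.lower()


def toy_metric(source: str, hypothesis: str, reference: str) -> float:
    """Hypothetical stand-in metric: share of reference tokens found in the hypothesis."""
    ref_tokens = reference.split()
    hyp_tokens = set(hypothesis.split())
    if not ref_tokens:
        return 0.0
    return sum(tok in hyp_tokens for tok in ref_tokens) / len(ref_tokens)
```

With these toys, `rewrite_reward("The feline sat", "The cat sat", "the cat sat", toy_translate, toy_metric)` returns about 0.333 (a score of 1.0 for the rewrite minus 2/3 for the original), and a rewrite identical to the original always earns exactly 0.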

If this is right

  • 4B RLSR-trained rewriting models significantly outperform the no-rewriting baseline.
  • They also outperform prompt-based rewriting baselines that use models of the same size.
  • Their performance remains competitive with rewriting baselines that rely on a 235B LLM.
  • The gains appear across six different MT systems and 16 language pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reward-from-downstream-task pattern could be tested on other pre-processing steps whose value is judged only after a later model runs.
  • Explicit RL optimization may prove more reliable than prompting when the desired behavior is hard to describe in natural language.
  • Specialized smaller models trained this way might reduce reliance on very large general-purpose models inside translation pipelines.

Load-bearing premise

That the measured improvement in downstream translation quality provides a stable reward signal that resists hacking and is sufficient to train the rewriter without introducing artifacts that degrade other aspects of the output or the MT system itself.

What would settle it

An experiment on held-out data in which translations produced after RLSR rewriting show no statistically significant gain over the no-rewriting or prompt-based baselines on standard quality metrics.
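The significance check in such an experiment can be sketched with paired bootstrap resampling (Koehn, 2004). This minimal Python version assumes per-sentence scores from a segment-level metric whose corpus score is their mean; corpus-level metrics such as BLEU would instead need to be recomputed on each resample. `paired_bootstrap` is a hypothetical helper name, not code from the paper.

```python
import random


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate a p-value for "system A is not better than system B".

    Each resample draws sentence indices with replacement and uses the SAME
    indices for both systems (that pairing is what makes the test "paired").
    Returns the fraction of resamples in which A's mean score <= B's.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, aligned score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            not_better += 1
    return not_better / n_samples
```

Here `scores_a` would hold per-sentence scores after RLSR rewriting and `scores_b` those of a baseline; a value below 0.05 corresponds to the p < 0.05 threshold mentioned in the rebuttal below.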

Figures

Figures reproduced from arXiv: 2606.08011 by Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura, Zhi Qu.

Figure 1
Figure 1: Overview of RLSR. The rewriting model generates a rewritten source from the original source. A fixed downstream MT model translates the rewritten source, and an MT metric evaluates the translation. The improvement over translating the original source is used as the reward for optimizing the rewriting model.
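The caption leaves open how the scalar reward is turned into a policy update. One common choice for LLM policies, assumed here rather than taken from the paper, is a GRPO-style group-relative advantage: sample several rewrites of the same source, score each by its translation-quality improvement, and normalize the scores within that group. `group_relative_advantages` is a hypothetical name.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of rewrites sampled for one source sentence.

    advantage_i = (r_i - mean) / (std + eps), GRPO-style. This is an
    illustrative assumption, not RLSR's confirmed update rule.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

A group in which every rewrite earns the same reward gets all-zero advantages, so sources where rewriting neither helps nor hurts contribute no gradient; rewrites that beat their siblings are pushed up and the rest are pushed down.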
Figure 2
Figure 2: Minimal implementation of the source-similarity metric used in Table …
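Figure 2's code is not reproduced on this page. As a hedged stand-in, the sketch below measures how far a rewrite drifts from its original using character-level similarity from Python's `difflib.SequenceMatcher`; the paper's actual metric may well differ. `source_similarity` and `corpus_source_similarity` are hypothetical names.

```python
from difflib import SequenceMatcher


def source_similarity(original: str, rewritten: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the rewrite is unchanged."""
    return SequenceMatcher(None, original, rewritten).ratio()


def corpus_source_similarity(originals: list[str], rewrites: list[str]) -> float:
    """Mean per-sentence similarity between aligned originals and rewrites."""
    if len(originals) != len(rewrites) or not originals:
        raise ValueError("need two non-empty, aligned lists")
    scores = [source_similarity(o, r) for o, r in zip(originals, rewrites)]
    return sum(scores) / len(scores)
```

A score like this can flag rewriters that drift far from the input, one of the side effects a translation-quality reward alone would not penalize.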
Original abstract

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes RLSR, a reinforcement learning framework for training source-rewriting models in machine translation. The reward signal is the improvement in downstream translation quality produced by rewriting. The authors report experiments across six MT systems and 16 language pairs in which 4B-parameter RLSR models significantly outperform no-rewriting and prompt-based rewriting baselines at the same scale while remaining competitive with 235B LLM baselines.

Significance. If the results hold, the work shows that RL can make smaller rewriting models effective by directly optimizing for translation improvement rather than relying on prompts, addressing a limitation of prompt-based methods for 4B-scale models. This could reduce dependence on much larger LLMs for preprocessing in MT pipelines.

major comments (2)
  1. [Abstract] The claim of significant outperformance across six MT systems and 16 language pairs is stated without details on the MT metrics used for the reward, statistical significance tests, error bars, data splits, or controls for confounds. This information is load-bearing for the central empirical claim.
  2. [Experiments (results description)] The central claim requires that downstream MT quality provides a stable, non-exploitable reward for RL training of the 4B rewriter. No auxiliary checks (human evaluation, side-effect metrics on fluency/adequacy, or ablations on reward variance) are described to rule out metric hacking via superficial changes favored by the MT system or metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments focus on strengthening the presentation of empirical results and verifying the reliability of the RL reward. We respond point-by-point below, indicating planned revisions where appropriate.

  1. Referee: [Abstract] The claim of significant outperformance across six MT systems and 16 language pairs is stated without details on the MT metrics used for the reward, statistical significance tests, error bars, data splits, or controls for confounds. This information is load-bearing for the central empirical claim.

    Authors: We agree that the abstract would be strengthened by including a few key qualifiers. The manuscript body (Sections 4.1 and 5.1) specifies COMET as the primary reward metric, paired bootstrap tests (p < 0.05) for significance, standard deviations across three random seeds as error bars, WMT 2022/2023 test splits, and six distinct MT systems as a control against system-specific effects. In the revision we will add a concise clause to the abstract noting the primary metric and significance testing while keeping the abstract within length limits. revision: partial

  2. Referee: [Experiments (results description)] The central claim requires that downstream MT quality provides a stable, non-exploitable reward for RL training of the 4B rewriter. No auxiliary checks (human evaluation, side-effect metrics on fluency/adequacy, or ablations on reward variance) are described to rule out metric hacking via superficial changes favored by the MT system or metric.

    Authors: We share the concern about reward stability. The current experiments already use six different MT systems as reward providers and report consistent gains under both COMET and BLEU, which provides some protection against single-metric exploitation. However, explicit ablations on reward variance across training steps and side-effect metrics (e.g., source perplexity for fluency) are not presented. We will add a short subsection and table in the revision to include these analyses. Human evaluation was not conducted owing to scale; we can note this limitation and offer to perform a small-scale study if the referee considers it essential. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical RL method with external MT rewards


The paper proposes RLSR as an RL framework that trains a rewriter using downstream translation quality as the reward signal. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experiments across six MT systems and 16 language pairs that compare against baselines, which are independent external evaluations rather than reductions to the method's own inputs by construction. This is a standard empirical setup with no load-bearing self-definitional or uniqueness steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard RL assumption that a scalar reward from an external MT system can guide policy improvement.

pith-pipeline@v0.9.1-grok · 5726 in / 1076 out tokens · 18378 ms · 2026-06-27T19:56:05.173408+00:00

