Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
Pith reviewed 2026-06-28 23:04 UTC · model grok-4.3
The pith
Detector-evasive LLM paraphrasing is cast as a constrained Markov decision process that treats semantic preservation as a hard constraint while maximizing evasion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DEPO formulates detector-evasive paraphrasing as a CMDP with evasion as the primary objective and semantic preservation as a hard constraint, and solves it with a Lagrangian primal-dual RL method that uses GRPO-style updates. The claim is that the resulting paraphrases reliably evade detectors while precisely satisfying the semantic constraint, whereas scalarized-reward approaches offer only indirect, weight-sensitive control over the trade-off.
What carries the argument
Constrained Markov Decision Process formulation solved by Lagrangian primal-dual reinforcement learning with group-based policy updates, where semantic preservation acts as the explicit constraint.
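For orientation, a constrained problem of this kind is usually written as below. This is a generic sketch, not the paper's notation: the symbols are assumptions introduced here, with pi_theta the paraphrasing policy, r(x, y) an evasion reward derived from the detector, s(x, y) a semantic-similarity score between source x and paraphrase y, tau the preservation threshold, and lambda the Lagrange multiplier.

```latex
\[
\begin{aligned}
\max_{\theta}\quad & J_r(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[\, r(x, y) \,\bigr] \\
\text{s.t.}\quad & J_s(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[\, s(x, y) \,\bigr] \;\ge\; \tau \\[6pt]
\min_{\lambda \ge 0}\; \max_{\theta}\quad & \mathcal{L}(\theta, \lambda) = J_r(\theta) + \lambda \bigl( J_s(\theta) - \tau \bigr)
\end{aligned}
\]
```

A constraint written this way holds in expectation, so individual paraphrases could still fall below tau. Whether the paper enforces the constraint per sample or on average is not stated in the material summarized here.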
If this is right
- DEPO improves attack success rates while remaining inside the allowed semantic-preservation region.
- The method exhibits robustness to changes in domain, detector, and prompt.
- Cross-dataset results on MAGE, M4, RAID, and peer-review texts hold against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors.
- The adaptive balancing during training allows the policy to increase evasion without violating the constraint.
- Prompt-level consistency indicates the approach does not require per-prompt retuning.
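The "adaptive balancing" mentioned above is, in a standard primal-dual scheme, a projected update of the Lagrange multiplier. The sketch below shows that generic step only. The function name, learning rate, threshold, and similarity values are hypothetical and chosen for illustration; the paper's actual update rule may differ.

```python
def dual_update(lam, mean_similarity, threshold, lr=0.05):
    """One projected dual step on the Lagrange multiplier (generic sketch).

    The multiplier grows while the batch-mean similarity is below the
    threshold (constraint violated) and shrinks once it is above it.
    Projection onto [0, inf) keeps the multiplier non-negative.
    """
    violation = threshold - mean_similarity
    return max(0.0, lam + lr * violation)


# Illustrative trajectory with made-up batch-mean similarities.
lam = 0.0
history = []
for mean_sim in [0.80, 0.84, 0.88, 0.91, 0.93]:
    lam = dual_update(lam, mean_sim, threshold=0.90)
    history.append(lam)
```

In this toy run the multiplier rises over the first three steps, where similarity is below 0.90, and falls over the last two, which is the behavior one would look for in a multiplier trajectory plot.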
Where Pith is reading between the lines
- The CMDP framing could extend to other constrained text-generation tasks such as style transfer under readability constraints.
- Detector designers might need to account for policies trained under explicit semantic constraints rather than scalar rewards.
- If the constraint enforcement proves stable, similar primal-dual methods could apply to safety constraints in LLM fine-tuning.
Load-bearing premise
Semantic preservation can be encoded and enforced as an explicit hard constraint inside the CMDP without creating new trade-offs or instabilities that reduce the evasion gains.
What would settle it
A direct comparison experiment in which the CMDP-constrained policy produces lower evasion success rates or higher semantic drift than an unconstrained baseline on the same datasets and detectors.
Original abstract
AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
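One plausible reading of a "GRPO-style group-based policy update" combined with a Lagrangian is sketched below: several paraphrases are sampled for the same source text, each receives a multiplier-weighted reward, and advantages are computed relative to the group. This is an assumption made for illustration, not the paper's algorithm. DEPO's update is described as novel and may, for instance, normalize the evasion and similarity terms separately. All names and numbers are hypothetical.

```python
import statistics


def group_advantages(evasion, similarity, lam, threshold, eps=1e-8):
    """Group-relative advantages for one prompt's sampled paraphrases.

    evasion[i]    : evasion reward of sample i (higher = less detectable)
    similarity[i] : semantic similarity of sample i to the source
    lam           : current Lagrange multiplier (assumed given)
    """
    # Lagrangian-weighted reward: evasion plus a penalty or bonus for
    # falling below or above the similarity threshold.
    combined = [e + lam * (s - threshold) for e, s in zip(evasion, similarity)]
    mean = statistics.fmean(combined)
    std = statistics.pstdev(combined)
    # GRPO-style normalization: centre and scale within the group.
    return [(c - mean) / (std + eps) for c in combined]


# Made-up scores for a group of four paraphrases of one source text.
evasion = [0.9, 0.6, 0.3, 0.8]
similarity = [0.70, 0.92, 0.95, 0.91]

adv_unconstrained = group_advantages(evasion, similarity, lam=0.0, threshold=0.90)
adv_constrained = group_advantages(evasion, similarity, lam=5.0, threshold=0.90)
```

With the multiplier at zero the most evasive sample (index 0) has the largest advantage even though its similarity is low. With a large multiplier the preference shifts to a sample that evades well and also clears the threshold (index 3).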
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates detector-evasive LLM paraphrasing as a Constrained Markov Decision Process (CMDP) with detector evasion as the objective and semantic preservation as an explicit constraint. It introduces Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual RL algorithm using a novel GRPO-style group-based policy update. The authors report experiments on the MAGE, M4, RAID, and peer-review datasets against the MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, and claim strong evasion performance with precise satisfaction of the semantic constraint, plus cross-domain, cross-detector, and prompt-level robustness.
Significance. If verified, the CMDP formulation with an explicit hard constraint offers a principled alternative to scalarized reward designs for controlling the evasion-semantics trade-off, which is a clear methodological strength. The adaptive balancing during training and the reported robustness across datasets and detectors would be notable contributions to understanding AI-text detector vulnerabilities, provided the optimization dynamics support the feasibility claims.
major comments (2)
- [Abstract] Abstract: the central claim that DEPO achieves evasion 'while precisely satisfying the semantic preservation constraint' is load-bearing for the contribution; the manuscript provides no constraint-violation statistics, multiplier convergence plots, or feasibility analysis to substantiate this against known primal-dual instabilities in Lagrangian methods.
- [Method] Method section (CMDP and Lagrangian setup): the assumption that semantic preservation can be enforced as a hard constraint inside the CMDP without new trade-offs or instabilities requires explicit empirical support; the GRPO-style group update is presented as novel but lacks reported evidence that it preserves feasibility or improves upon standard primal-dual convergence.
minor comments (2)
- [Experiments] Experiments: specify the exact semantic-preservation threshold value, the similarity metric used to enforce it, and the fraction of outputs meeting it for each detector/dataset combination.
- [Abstract] Abstract and §4: clarify baseline paraphrasing methods and whether they include recent detector-guided attacks for direct comparison.
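The reporting requested in the first minor comment (the fraction of outputs meeting the threshold for each detector/dataset combination) could be tabulated along the following lines. This is a minimal sketch: the record layout, function name, threshold, and similarity values are hypothetical placeholders and are not results from the paper.

```python
from collections import defaultdict


def satisfaction_rates(records, threshold):
    """Fraction of paraphrases meeting the threshold per (dataset, detector).

    records: iterable of (dataset, detector, similarity) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [met, total]
    for dataset, detector, similarity in records:
        entry = counts[(dataset, detector)]
        entry[0] += int(similarity >= threshold)
        entry[1] += 1
    return {key: met / total for key, (met, total) in counts.items()}


# Placeholder records for illustration only.
records = [
    ("MAGE", "RoBERTa", 0.95),
    ("MAGE", "RoBERTa", 0.85),
    ("RAID", "RADAR", 0.92),
]
rates = satisfaction_rates(records, threshold=0.90)
```

Reporting this per combination, rather than as a single pooled average, would show whether the constraint holds uniformly or only in aggregate.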
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify that the manuscript lacks explicit empirical validation of constraint feasibility and optimization stability for the Lagrangian approach. We agree these additions are necessary to support the central claims and will incorporate them in the revision.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that DEPO achieves evasion 'while precisely satisfying the semantic preservation constraint' is load-bearing for the contribution; the manuscript provides no constraint-violation statistics, multiplier convergence plots, or feasibility analysis to substantiate this against known primal-dual instabilities in Lagrangian methods.
  Authors: We acknowledge that the current manuscript does not report constraint-violation statistics, Lagrange multiplier convergence plots, or a dedicated feasibility analysis. While the experimental results show semantic similarity scores consistently meeting the prescribed threshold alongside high evasion rates, this indirect evidence does not directly address potential primal-dual instabilities. In the revised version we will add: (i) per-epoch average and maximum constraint violation rates on all datasets, (ii) plots of the dual multiplier trajectory during training, and (iii) a short feasibility study comparing observed violations against a standard Lagrangian baseline. These additions will directly substantiate the 'precisely satisfying' claim.
  revision: yes
- Referee: [Method] Method section (CMDP and Lagrangian setup): the assumption that semantic preservation can be enforced as a hard constraint inside the CMDP without new trade-offs or instabilities requires explicit empirical support; the GRPO-style group update is presented as novel but lacks reported evidence that it preserves feasibility or improves upon standard primal-dual convergence.
  Authors: We agree that explicit empirical support for the CMDP constraint enforcement and for the GRPO-style update's effect on feasibility is missing. The group-based update is intended to reduce variance in policy gradients while respecting the constraint, yet no ablation or convergence comparison is provided. In revision we will include: (i) training curves of constraint violation for DEPO versus a standard primal-dual RL implementation without the group update, and (ii) quantitative comparison of final feasibility gap and number of epochs to stable multiplier convergence. This will supply the requested evidence on whether the GRPO modification improves stability.
  revision: yes
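The diagnostics promised in the rebuttal (per-epoch violation statistics and the number of epochs to stable multiplier convergence) are not defined precisely in the text. One possible way to compute them is sketched below. The function names, the tolerance-and-window stability criterion, and the sample values are all assumptions made for illustration.

```python
def violation_stats(similarities, threshold):
    """Violation rate, mean shortfall, and max shortfall for one epoch."""
    shortfalls = [max(0.0, threshold - s) for s in similarities]
    n = len(shortfalls)
    return {
        "rate": sum(v > 0 for v in shortfalls) / n,
        "mean": sum(shortfalls) / n,
        "max": max(shortfalls),
    }


def epochs_to_stable(multipliers, tol=1e-3, window=3):
    """First epoch index from which `window` consecutive multiplier
    changes stay below `tol`; None if that never happens."""
    deltas = [abs(b - a) for a, b in zip(multipliers, multipliers[1:])]
    for start in range(len(deltas) - window + 1):
        if all(d < tol for d in deltas[start:start + window]):
            return start
    return None


# Made-up values for illustration only.
stats = violation_stats([0.95, 0.85, 0.90, 0.80], threshold=0.90)
stable_at = epochs_to_stable(
    [0.0, 0.5, 0.8, 0.9, 0.9004, 0.9001, 0.9003, 0.9002]
)
```

Applying the same two functions to DEPO and to a primal-dual baseline without the group update would yield the side-by-side comparison the authors describe.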
Circularity Check
No significant circularity; new algorithm with external validation
The paper formulates detector-evasive paraphrasing as a CMDP and introduces the DEPO algorithm with a novel GRPO-style update. All performance claims are evaluated against external datasets (MAGE, M4, RAID, peer-review) and detectors (MAGE, RoBERTa, RADAR, Binoculars, Fast-DetectGPT). No equations, self-citations, or fitted parameters are shown to reduce the claimed results to inputs by construction. The derivation chain is self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- semantic-preservation threshold
axioms (1)
- [domain assumption] Paraphrasing can be modeled as a Markov Decision Process, with detector output and semantic similarity serving as the reward and constraint signals, respectively.