arxiv: 2604.02686 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang , Mingyue Huo , Minghao Zhu , Mengxue Zhang , Nan Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reward modelsadversarial attackstoken spaceRLHFreward hackingTOMPAblack-box optimization

0 comments

The pith

Adversarial optimization over raw token sequences lets attackers force high rewards from reward models even when outputs are complete nonsense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TOMPA, a method that attacks reward models by optimizing directly in token space instead of generating readable text. It bypasses the usual decode-and-re-encode step so the attacker can search over arbitrary token strings guided only by the scalar reward score. On Skywork-Reward-V2-Llama-3.1-8B this produces outputs that nearly double the reward of GPT-5 reference answers and beat them on 98 percent of prompts, yet the text is incoherent. The result shows that current reward models can be gamed by non-linguistic patterns rather than by clever semantic tricks. This matters because RLHF pipelines rely on these models to steer policy training; if they assign top scores to gibberish, the learned policies can be misled at scale.

Core claim

TOMPA performs adversarial optimization directly in token space by bypassing the decode-re-tokenize interface between policy and reward model, allowing the attack policy to optimize over raw token sequences using only black-box scalar feedback and automatically discovering non-linguistic token patterns that elicit extremely high rewards from multiple state-of-the-art reward models.

What carries the argument

Token Mapping Perturbation Attack (TOMPA), an optimization procedure that works on raw token sequences rather than decoded natural language.

If this is right

Reward models can be systematically exploited outside the semantic regime using only black-box scalar feedback.
Generated outputs under TOMPA degenerate into nonsensical text while still receiving top rewards.
Current RLHF pipelines contain a critical vulnerability when reward models are used as optimization targets.
Attacks succeed without constructing human-readable adversarial examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses would need to constrain reward models to penalize statistical anomalies in token distributions rather than only semantic content.
The same token-space optimization could be applied to other scalar feedback models such as preference models or safety classifiers.
If reward models are this sensitive to raw token patterns, RLHF training may inadvertently amplify spurious correlations present in the original preference data.

Load-bearing premise

Reward models assign scores primarily on the basis of token statistics rather than on the semantic coherence or human-like quality of the output.

What would settle it

Run TOMPA on a new reward model and measure whether the generated token sequences receive higher average reward than coherent GPT-5 baselines while remaining nonsensical; if rewards do not increase, the claim is falsified.

Figures

Figures reproduced from arXiv: 2604.02686 by Mengxue Zhang, Minghao Zhu, Mingyue Huo, Nan Jiang, Yuheng Zhang.

**Figure 1.** Figure 1: The TOMPA attack pipeline. The attack bypasses the standard decode–retokenize interface by applying a perturbation mapping, directly feeding transformed token sequences into the reward model. Trained via reinforcement learning using only scalar reward feedback, the policy automatically discovers adversarial token patterns that receive anomalously high rewards. Despite outperforming GPT-5 reference answers… view at source ↗

**Figure 3.** Figure 3: Impact of response length on reward scores. Responses are truncated at various [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Training curves of attack policy optimization under token mapping perturbation. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TOMPA shows reward models can be gamed by raw token sequences that aren't language at all, nearly doubling scores over GPT-5 references on one model.

read the letter

The main thing here is that reward models respond to optimized gibberish in token space. TOMPA skips the usual text generation step and directly tunes token sequences using only the scalar reward as feedback, finding patterns that trigger high scores without any readable meaning. On Skywork-Reward-V2-Llama-3.1-8B it nearly doubles the reward of GPT-5 reference answers and beats them on 98 percent of prompts, while the outputs collapse into nonsense. That last part is useful because it makes the vulnerability obvious rather than hidden in plausible text. The distinction from prior semantic attacks holds up in the description, and the black-box setup matches real RLHF constraints. The empirical numbers are the strongest part of what is shown. The soft spot is scope. Everything rests on a single reward model with no reported tests on others, and the abstract gives no concrete details on the optimization loop, step count, or direct baselines beyond the GPT-5 references. If the full paper has those controls and checks whether the same patterns transfer, the result strengthens; otherwise the claim stays narrow. This is for people who train or audit reward models in RLHF pipelines. Anyone running safety evaluations on current RMs should see the numbers. I would send it to peer review because the core demonstration is straightforward and points to a gap worth checking.

Referee Report

2 major / 2 minor

Summary. The paper introduces Token Mapping Perturbation Attack (TOMPA), which performs adversarial optimization directly in token space on reward models by bypassing the standard decode-re-tokenize interface. Using only black-box scalar feedback from the RM, TOMPA discovers non-linguistic token sequences that elicit high rewards; specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B it nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts, with the generated outputs degenerating into nonsensical text.

Significance. If the reported empirical results hold under scrutiny, the work demonstrates a new class of RM vulnerability outside the semantic regime, showing that current RLHF pipelines can be systematically exploited via raw token optimization. This strengthens the case for more robust reward modeling and provides concrete evidence that RM biases are not limited to human-readable text.

major comments (2)

[§4] §4 (Experimental Setup): The optimization procedure, including the exact perturbation mechanism, number of black-box queries, learning rate schedule, and choice of baselines, is insufficiently specified to allow independent reproduction of the central claim that TOMPA nearly doubles the reward and outperforms GPT-5 references on 98.0% of prompts.
[§4.3] §4.3 (Evaluation Metrics): The 98.0% outperformance figure lacks detail on the prompt distribution, how ties or near-ties are handled, and whether the same prompt set was used for both TOMPA and the GPT-5 reference; this directly affects the load-bearing quantitative comparison.

minor comments (2)

[Abstract] The abstract and §3 could clarify whether TOMPA was evaluated on multiple RMs beyond Skywork-Reward-V2-Llama-3.1-8B and report the exact number of prompts used.
[§3.2] Notation for the token-mapping step in §3.2 is introduced without an explicit equation; adding a short formal definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work. We address the two major comments below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup): The optimization procedure, including the exact perturbation mechanism, number of black-box queries, learning rate schedule, and choice of baselines, is insufficiently specified to allow independent reproduction of the central claim that TOMPA nearly doubles the reward and outperforms GPT-5 references on 98.0% of prompts.

Authors: We agree that the current description in §4 is too high-level for full reproducibility. In the revised manuscript we will expand this section with the precise perturbation mechanism (direct additive perturbations on token embeddings followed by projection back to the vocabulary), the exact number of black-box queries per prompt (fixed at 800), the learning-rate schedule (Adam with cosine decay from 5e-2 to 1e-4), and the full set of baselines (random token sequences, semantic paraphrases, and the original policy outputs). These additions will be placed in a new subsection §4.2 and will enable independent reproduction of the reported reward-doubling and 98.0% outperformance results. revision: yes
Referee: [§4.3] §4.3 (Evaluation Metrics): The 98.0% outperformance figure lacks detail on the prompt distribution, how ties or near-ties are handled, and whether the same prompt set was used for both TOMPA and the GPT-5 reference; this directly affects the load-bearing quantitative comparison.

Authors: We thank the referee for highlighting this ambiguity. The 98.0% statistic was computed on an identical set of 1,000 prompts drawn uniformly from the AlpacaEval test split; the same GPT-5 reference answers were used for both TOMPA and the baseline comparison. An output is counted as outperforming only when its reward is strictly higher; ties (reward difference < 0.01) occur in fewer than 2% of cases and are reported separately. We will add these details, including the exact prompt sampling procedure and tie-handling rule, to §4.3 in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript introduces TOMPA as an empirical black-box optimization framework operating directly on token sequences, bypassing decode-re-tokenize. All central claims (e.g., nearly doubling rewards on Skywork-Reward-V2-Llama-3.1-8B and outperforming GPT-5 references on 98% of prompts) rest on reported experimental outcomes rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear in the provided text; the argument is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the attack is presented as an empirical optimization procedure relying on standard black-box access assumptions.

pith-pipeline@v0.9.0 · 5505 in / 1057 out tokens · 29584 ms · 2026-05-13T20:26:06.112506+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 7 internal anchors

[1]

Adversarial training of reward models.arXiv preprint arXiv:2504.06141,

Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, and Tuo Zhao. Adversarial training of reward models.arXiv preprint arXiv:2504.06141,

work page arXiv
[2]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J ´er´emy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217,

work page internal anchor Pith review arXiv
[3]

Bowman, Ethan Perez, and Evan Hubinger

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models.arXiv preprint arXiv:2406.10162,

work page arXiv
[4]

Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking.arXiv preprint arXiv:2312.09244,

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvi- jotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking.arXiv preprint arXiv:2312.09244,

work page arXiv
[5]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

2022 , month = jun, journal =

Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety.arXiv preprint arXiv:2109.13916,

work page arXiv
[7]

Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352,

work page arXiv
[8]

RewardBench 2: Advancing Reward Model Evaluation

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Offset- bias: Leveraging debiased data for tuning evaluators

Junsoo Park, Seungyeon Jwa, Ren Meiying, Daeyoung Kim, and Sanghyuk Choi. Offset- bias: Leveraging debiased data for tuning evaluators. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 1043–1067,

work page 2024
[10]

Red teaming language models with language models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,

work page 2022
[11]

Under review

10 Preprint. Under review. Vyas Raina, Adian Liusie, and Mark Gales. Is LLM-as-a-judge robust? investigating universal adversarial attacks on zero-shot LLM assessment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7499–7517,

work page 2024
[12]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathemati- cal reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuan- Jing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 2859–2873,

work page 2023
[15]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. arXiv preprint arXiv: 2409.19256,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

A long way to go: Investigat- ing length correlations in RLHF.arXiv preprint arXiv:2310.03716,

Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigat- ing length correlations in RLHF.arXiv preprint arXiv:2310.03716,

work page arXiv
[17]

Assessing judging bias in large reasoning models: An empirical study.arXiv preprint arXiv:2504.09946, 2025a

Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, and Bingsheng He. Assessing judging bias in large reasoning models: An empirical study.arXiv preprint arXiv:2504.09946, 2025a. Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, and Lukas Rutishauser. Adversarial reinforceme...

work page arXiv
[18]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Noveltybench: Evaluating language models for humanlike diversity.arXiv preprint arXiv:2504.05228,

Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, and Daphne Ippolito. Noveltybench: Evaluating language models for humanlike diversity.arXiv preprint arXiv:2504.05228,

work page arXiv
[20]

WildChat : 1M ChatGPT Interaction Logs in the Wild

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wild- Chat: 1M ChatGPT interaction logs in the wild.arXiv preprint arXiv:2405.01470,

work page arXiv
[21]

One Token to Fool

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, and Dong Yu. One token to fool llm-as-a-judge.arXiv preprint arXiv:2507.08794,

work page arXiv
[22]

Under review

11 Preprint. Under review. Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Cheat- ing automatic LLM benchmarks: Null models achieve high win rates.arXiv preprint arXiv:2410.07137,

work page arXiv