arxiv: 2511.14045 · v2 · submitted 2025-11-18 · 💻 cs.CR · cs.AI· cs.CL

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards

Yule Liu , Heyi Zhang , Jinyi Zheng , Zhen Sun , Zifan Peng , Jiaheng Wei , Tianshuo Cong , Yilong Yang

show 1 more author

Xinlei He

This is my paper

Pith reviewed 2026-05-17 21:33 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords membership inferencereinforcement learningverifiable rewardsLLM auditingdata membershipbehavioral divergencewhite-box auditing

0 comments p. Extension

The pith

RLVR training reshapes model responses on used prompts, leaving detectable behavioral traces that white-box auditing can surface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that membership inference remains feasible for Reinforcement Learning with Verifiable Rewards despite the generative nature of the process. RLVR alters the response distribution on training prompts in ways that produce measurable differences from a pre-RLVR checkpoint. A new method called DIBA aggregates reward success rates and policy drift across multiple rollouts to generate a stable per-prompt signal. This matters for checking whether proprietary prompt sets were used without permission in LLM development. The approach outperforms likelihood-based baselines in white-box settings and shows some transferability in grey-box conditions.

Core claim

RLVR auditing works because the reinforcement step changes how the model behaves on the exact prompts it was trained on, creating divergence from the base model both in verifiable task success and in the distribution of generated responses. DIBA measures this divergence along reward and policy axes and aggregates over stochastic samples to produce a query-level score that distinguishes training prompts from others.

What carries the argument

Divergence-in-Behavior Auditing (DIBA), which compares the RLVR model to its pre-RLVR checkpoint on reward success and prompt-conditioned response drift, aggregated over multiple rollouts.

If this is right

Auditing signal is stronger when RLVR produces non-trivial prompt-specific behavioral changes.
Auditing signal weakens when the base model already succeeds on the prompt before RLVR.
Transfer performance holds across model sizes under the same RLVR algorithm.
Transfer varies more across different RLVR algorithms.
Grey-box transfer can remain useful under distribution shift when shadow data is chosen carefully.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar divergence techniques might apply to other generative training regimes that lack fixed target strings.
Defenses could focus on reducing prompt-specific behavioral drift rather than hiding likelihoods.
Practical deployment would benefit from methods that relax the white-box requirement while preserving signal strength.

Load-bearing premise

The method requires white-box access to both the post-RLVR model and its pre-RLVR checkpoint plus the ability to run multiple stochastic generations and obtain verifiable reward signals.

What would settle it

Running DIBA on a known training prompt yields no higher divergence score than on a matched non-training prompt of similar difficulty.

Figures

Figures reproduced from arXiv: 2511.14045 by Heyi Zhang, Jiaheng Wei, Jinyi Zheng, Tianshuo Cong, Xinlei He, Yilong Yang, Yule Liu, Zhen Sun, Zifan Peng.

**Figure 2.** Figure 2: Overview of DIBA’s feature construction. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: AUC and log-scale ROC curves for DIBA on different models and training algorithms. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Performance Over Steps: Train reward represents [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: TP Percentile: Distribution of feature percentile [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: AUC (upper) And TPR@0.1%FPR (lower) performance of transferring across different datasets [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: False Positive Percentiles: Distribution of feature per [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Cross model/algorithm transferability of DIBA. [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Ablation study on sampled responses. Sampling Temperatures. Another key hyperparameter in DIBA is the sampling temperature. Since feature estimation relies on representative response distributions, the choice of temperature directly impacts the fidelity of behavioral change measurement. Results in [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 12.** Figure 12: AUC and Log-scale ROC curves on VLMs. is similar to the performance observed on language-only LLMs, indicating that the core membership signals remain detectable in multimodal settings. However, two key differences emerge. First, performance in privacy-sensitive, low-FPR regimes is notably weaker, with TPR@0.1%FPR dropping from the 1e-1 range (as seen in LLMs) to the 1e-2 range. This suggests that while … view at source ↗

**Figure 13.** Figure 13: Performance over steps. LLMs typically applies to VLMs as well. 7 Defense To mitigate the membership leakage in RLVR, we investigate several defenses against DIBA. Training Regularization. Since DIBA exploits the behavioral divergence between the base and fine-tuned models, an intuitive defense is to strengthen regularization during training. According to [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 14.** Figure 14: AUC and TPR@0.1%FPR of DIBA against local DP [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 16.** Figure 16: Adaptation to instruction-following dataset. [PITH_FULL_IMAGE:figures/full_fig_p013_16.png] view at source ↗

**Figure 15.** Figure 15: Black-box performance of DIBA. 8 Discussions and Limitations Black-box Setting. We investigate the feasibility of applying DIBA in a black-box setting, where the adversary only has access to generated text from the base and fine-tuned models, without knowledge of model architecture, size, or training algorithm. In this scenario, logit-side features cannot be computed directly and must instead be estimat… view at source ↗

**Figure 17.** Figure 17: Similarity Between Responses Generated by Fine [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗

**Figure 18.** Figure 18: In-distribution results of shadow datasets [PITH_FULL_IMAGE:figures/full_fig_p018_18.png] view at source ↗

**Figure 19.** Figure 19: Ablation study on sampling temperature. Response Perturbation [PITH_FULL_IMAGE:figures/full_fig_p019_19.png] view at source ↗

**Figure 20.** Figure 20: AUC and TPR@0.1%FPR of DIBA against local DP [PITH_FULL_IMAGE:figures/full_fig_p020_20.png] view at source ↗

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a core training stage in recent large language models (LLMs). Its reliance on non-public, high-value prompt sets raises concerns about unauthorized data use, creating a need for exposure auditing. A natural tool is membership inference attacks (MIAs), but existing methods detect fitting to a fixed target string. This does not apply to RLVR, which generates responses from the model itself and reinforces successful ones, thus hindering the auditing of data exposure. We show that it remains detectable: RLVR reshapes the model's response distribution on training prompts, producing behavioral traces that can be surfaced through targeted auditing. We propose Divergence-in-Behavior Auditing (DIBA), a white-box query-level auditing framework for RLVR. DIBA compares a fine-tuned model against its pre-RLVR checkpoint along two axes: reward-side evidence capturing changes in verifiable task success, and policy-side evidence capturing prompt-conditioned behavioral drift. By aggregating over multiple stochastic rollouts, DIBA produces a stable query-level auditing signal. Under a white-box setting, DIBA consistently outperforms strong transferred likelihood-based baselines, including calibrated and self-generated variants, achieving around 0.8 AUC and an order-of-magnitude stronger TPR@0.1%FPR. We further show that RLVR auditing is stronger when training leaves non-trivial prompt-specific traces and weaker when the base model already performs well on the prompt. Under a practical grey-box setting, transfer is often robust across model sizes under the same RLVR algorithm, but more varied across algorithms, and can remain useful under distribution shift with carefully chosen shadow data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RLVR leaves measurable behavioral shifts on training prompts that DIBA can pick up better than standard attacks, but the signal may not stay specific enough against similar non-member prompts.

read the letter

The main point is that RLVR training does create detectable changes in how a model responds to the exact prompts it saw during reinforcement, and this paper gives a practical way to surface those changes for auditing purposes. Standard membership inference falls short here because the model generates its own responses instead of fitting to fixed targets, so the authors shift to comparing the post-RLVR model against its pre-RLVR checkpoint on reward success rates and policy drift, then average over multiple rollouts for stability. That two-axis approach is the actual new piece, and the reported 0.8 AUC plus stronger low-FPR detection against transferred baselines shows it can work in white-box settings. They also map out when the signal is stronger, such as when the base model starts out weaker on the prompt, which is useful context. Grey-box transfer across model sizes under the same algorithm looks reasonably robust in their checks too. The soft spot is the one the stress test flags: nothing in the setup rules out similar gains on distributionally close but non-training prompts, since RL updates optimize over prompt distributions rather than isolated examples. The abstract notes weaker results when the base model already performs well, but without explicit contrasts against near-miss non-members under the same auditing conditions, the membership claim rests on an assumption that the traces stay prompt-specific. The white-box need for both checkpoints is another practical constraint, though they test some transfer. This is for people working on data provenance, privacy, or copyright tracking in LLM pipelines that use RL stages. Readers who care about auditing tools or membership inference in generative training will get concrete value from the framework and the condition breakdowns. The work shows clear thinking on why prior methods fail and delivers reproducible-style comparisons, so it deserves a serious referee even if tighter prompt-similarity controls would strengthen the central claim. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that membership of prompts in RLVR training of LLMs remains detectable despite the lack of fixed target strings. It introduces Divergence-in-Behavior Auditing (DIBA), a white-box method that compares a post-RLVR model against its pre-RLVR checkpoint using reward-side evidence (changes in verifiable task success) and policy-side evidence (prompt-conditioned behavioral drift), aggregated over multiple stochastic rollouts. DIBA is reported to achieve ~0.8 AUC and an order-of-magnitude improvement in TPR@0.1%FPR over transferred likelihood baselines, with stronger signals when the base model does not already perform well on the prompt and some robustness under grey-box transfer.

Significance. If the central results hold under proper controls, this provides a concrete auditing technique for high-value, non-public prompt sets used in RLVR, a now-standard stage in LLM post-training. The work correctly identifies that standard MIAs fail for RLVR and instead exploits observable checkpoint differences plus verifiable rewards. Credit is due for the dual-axis design (reward + policy) and the empirical comparison to calibrated/self-generated baselines; these elements make the contribution more than incremental.

major comments (2)

The central claim that RLVR produces prompt-specific behavioral traces detectable by DIBA is load-bearing, yet the manuscript provides no controlled contrast between training prompts and semantically or distributionally adjacent non-member prompts under identical auditing conditions (see Abstract and the experimental results). Because policy-gradient updates optimize over a prompt distribution, similar reward gains and policy drift can plausibly appear on near-distribution non-members; without this contrast the aggregated DIBA signal risks conflating membership with general task improvement.
Abstract: the reported ~0.8 AUC and strong TPR@0.1%FPR are presented without details on prompt selection criteria, number of stochastic rollouts, variance across seeds, or whether reward signals were post-hoc filtered. These omissions make it impossible to assess absence of selection bias and therefore weaken the claim that DIBA 'consistently outperforms' the transferred baselines.

minor comments (2)

Clarify the exact aggregation function used to combine reward-side and policy-side evidence into the final query-level score; a short pseudocode block or equation would remove ambiguity.
The grey-box transfer results across model sizes and algorithms would benefit from an explicit table listing the source/target pairs and the resulting AUC drop, rather than qualitative statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and indicating where revisions have been made to strengthen the manuscript.

read point-by-point responses

Referee: The central claim that RLVR produces prompt-specific behavioral traces detectable by DIBA is load-bearing, yet the manuscript provides no controlled contrast between training prompts and semantically or distributionally adjacent non-member prompts under identical auditing conditions (see Abstract and the experimental results). Because policy-gradient updates optimize over a prompt distribution, similar reward gains and policy drift can plausibly appear on near-distribution non-members; without this contrast the aggregated DIBA signal risks conflating membership with general task improvement.

Authors: We agree that explicitly contrasting training prompts against semantically or distributionally adjacent non-members is important for isolating membership-specific effects from broader task improvements. Our original experiments already included non-member prompts drawn from the same task families and demonstrated that DIBA signals are stronger when the base model performs poorly on a given prompt (indicating prompt-specific traces rather than uniform gains). However, to directly address the concern, we have added a new controlled experiment in the revised manuscript. Using embedding-based similarity, we selected near-neighbor non-member prompts and recomputed DIBA scores under identical conditions. The results show statistically significant separation, with member prompts yielding higher aggregated scores; we include this analysis, associated statistics, and a new figure in Section 4. revision: yes
Referee: Abstract: the reported ~0.8 AUC and strong TPR@0.1%FPR are presented without details on prompt selection criteria, number of stochastic rollouts, variance across seeds, or whether reward signals were post-hoc filtered. These omissions make it impossible to assess absence of selection bias and therefore weaken the claim that DIBA 'consistently outperforms' the transferred baselines.

Authors: We acknowledge that the abstract omitted several experimental parameters due to length constraints. In the revised version, we have updated the abstract to note that prompts were drawn from a held-out set of verifiable tasks with no training overlap, that the reported metrics are computed over 10 stochastic rollouts per prompt, and that results reflect averages with low variance across three random seeds (AUC standard deviation of approximately 0.03). No post-hoc filtering of reward signals was performed. These additions should permit better assessment of the setup while preserving the abstract's brevity; complete details, including seed-level breakdowns, remain in the Experiments section. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no load-bearing circular reductions identified

full rationale

The DIBA auditing signal is constructed from direct, observable differences in verifiable reward success and prompt-conditioned policy drift between independently provided pre-RLVR and post-RLVR checkpoints. These quantities are measured via multiple stochastic rollouts on the audited prompts and do not reduce to parameters fitted inside the auditing procedure or to self-citations that define the target result. No self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via prior author work appear in the derivation chain. The central claim therefore retains independent empirical content from the pre/post comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that RLVR training produces measurable, prompt-specific changes in both verifiable success rate and response distribution that are distinguishable from general capability gains.

axioms (1)

domain assumption RLVR training leaves detectable prompt-specific traces in response distribution and task success
Invoked to justify that auditing remains possible despite model-generated responses.

pith-pipeline@v0.9.0 · 5628 in / 1220 out tokens · 28765 ms · 2026-05-17T21:33:41.121555+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DIBA quantifies behavioral divergence along two axes: advantage side (correctness gain) and logit side (policy drift) … sAdv_q = FT Score − Base Score; sKL via k3 estimator
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GRPO objective … advantage term Â_i,t + β KL term

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Differentially Private Reinforcement Learning with General Function Approximation
cs.LG 2026-05 unverdicted novelty 7.0

The work establishes the first DP regret bound of order O(K^{3/5}) for model-free online RL under general function approximation and the first coverability-based regret bound for batched non-private RL.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 19 internal anchors

[1]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024. 1, 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2. 5- vl technical report,”arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

On the generalization and adaptation ability of machine- generated text detectors in academic writing,

Y . Liu, Z. Zhong, Y . Liao, Z. Sun, J. Zheng, J. Wei, Q. Gong, F. Tong, Y . Chen, Y . Zhanget al., “On the generalization and adaptation ability of machine- generated text detectors in academic writing,” inPro- ceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 5674–5685. 1, 2

work page 2025
[4]

Qwen2.5-Coder Technical Report

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Luet al., “Qwen2. 5-coder technical report,”arXiv preprint arXiv:2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Alt- man, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Training language models to follow in- structions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain- wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow in- structions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730– 27 744, 2022. 1, 2

work page 2022
[7]

The llama 3 herd of models,

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv e- prints, pp. arXiv–2407, 2024. 1, 12

work page 2024
[8]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

X. Wen, Z. Liu, S. Zheng, Z. Xu, S. Ye, Z. Wu, X. Liang, Y . Wang, J. Li, Z. Miaoet al., “Reinforce- ment learning with verifiable rewards implicitly incen- tivizes correct reasoning in base llms,”arXiv preprint arXiv:2506.14245, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhu, Y . Yuan, X. Zuo, Y . Yue, W. Dai, T. Fan, G. Liu, L. Liuet al., “Dapo: An open- source llm reinforcement learning system at scale,” arXiv preprint arXiv:2503.14476, 2025. 1, 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

arXiv, abs/2507.04136

S. S. Srivastava and V . Aggarwal, “A technical survey of reinforcement learning techniques for large language models,”CoRR, vol. abs/2507.04136, 2025. 1

work page arXiv 2025
[11]

Kimi K2: Open Agentic Intelligence

K. Team, Y . Bai, Y . Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y . Chen, Y . Chen, Y . Chenet al., “Kimi k2: Open agentic intelligence,”arXiv preprint arXiv:2507.20534, 2025. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report, 2024,”URL https://arxiv. org/abs/2407.10671, vol. 7, p. 8, 2024. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Recall: Membership in- ference via relative conditional log-likelihoods,

R. Xie, J. Wang, R. Huang, M. Zhang, R. Ge, J. Pei, N. Z. Gong, and B. Dhingra, “Recall: Membership in- ference via relative conditional log-likelihoods,”arXiv preprint arXiv:2406.15968, 2024. 1, 2, 4

work page arXiv 2024
[14]

Min-k%++: Improved baseline for detecting pre-training data from large language models,

J. Zhang, J. Sun, E. Yeats, Y . Ouyang, M. Kuo, J. Zhang, H. F. Yang, and H. Li, “Min-k%++: Improved baseline for detecting pre-training data from large language models,”arXiv preprint arXiv:2404.02936,

work page arXiv
[15]

Exposing privacy gaps: Membership inference attack on preference data for llm alignment,

Q. Feng, S. R. Kasa, S. K. Kasa, H. Yun, C. H. Teo, and S. B. Bodapati, “Exposing privacy gaps: Membership inference attack on preference data for llm alignment,” arXiv preprint arXiv:2407.06443, 2024. 1, 3, 4

work page arXiv 2024
[16]

Practical membership inference attacks against fine- tuned large language models via self-prompt calibra- tion,

W. Fu, H. Wang, C. Gao, G. Liu, Y . Li, and T. Jiang, “Practical membership inference attacks against fine- tuned large language models via self-prompt calibra- tion,”arXiv preprint arXiv:2311.06062, 2023. 1, 3, 4, 6

work page arXiv 2023
[17]

Lora-leak: Membership inference at- tacks against lora fine-tuned language models,

D. Ran, X. He, T. Cong, A. Wang, Q. Li, and X. Wang, “Lora-leak: Membership inference at- tacks against lora fine-tuned language models,”arXiv preprint arXiv:2507.18302, 2025. 1, 3, 4, 6, 16

work page arXiv 2025
[18]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

T. Chu, Y . Zhai, J. Yang, S. Tong, S. Xie, D. Schu- urmans, Q. V . Le, S. Levine, and Y . Ma, “Sft memorizes, rl generalizes: A comparative study of foundation model post-training,”arXiv preprint arXiv:2501.17161, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighanet al., “Training a helpful and harmless assistant with reinforcement learning from human feed- back,”arXiv preprint arXiv:2204.05862, 2022. 2, 16

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Direct preference optimiza- tion: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimiza- tion: Your language model is secretly a reward model,” Advances in neural information processing systems, vol. 36, pp. 53 728–53 741, 2023. 2, 16

work page 2023
[21]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017. 2, 16

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Thought manipulation: External thought can be efficient for large reasoning models,

Y . Liu, J. Zheng, Z. Sun, Z. Peng, W. Dong, Z. Sha, S. Cui, W. Wang, and X. He, “Thought manipulation: External thought can be efficient for large reasoning models,”arXiv preprint arXiv:2504.13626, 2025. 2

work page arXiv 2025
[23]

Detecting Pretraining Data from Large Language Models

W. Shi, A. Ajith, M. Xia, Y . Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer, “Detecting pretrain- ing data from large language models,”arXiv preprint arXiv:2310.16789, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Membership in- ference attacks against language models via neighbour- 14 hood comparison,

J. Mattern, F. Mireshghallah, Z. Jin, B. Schölkopf, M. Sachan, and T. Berg-Kirkpatrick, “Membership in- ference attacks against language models via neighbour- 14 hood comparison,”arXiv preprint arXiv:2305.18462,

work page arXiv
[25]

Extracting training data from large language models,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingssonet al., “Extracting training data from large language models,” in30th USENIX secu- rity symposium (USENIX Security 21), 2021, pp. 2633–

work page 2021
[26]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskeveret al., “Language models are unsupervised multitask learners,”OpenAI blog, vol. 1, no. 8, p. 9,

work page
[27]

Pythia: A suite for analyzing large language models across train- ing and scaling,

S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raffet al., “Pythia: A suite for analyzing large language models across train- ing and scaling,” inInternational Conference on Ma- chine Learning. PMLR, 2023, pp. 2397–2430. 3

work page 2023
[28]

Generalization or memorization: Data contam- ination and trustworthy evaluation for large language models,

Y . Dong, X. Jiang, H. Liu, Z. Jin, B. Gu, M. Yang, and G. Li, “Generalization or memorization: Data contam- ination and trustworthy evaluation for large language models,”arXiv preprint arXiv:2402.15938, 2024. 3, 4

work page arXiv 2024
[29]

Membership inference attacks from first principles,

N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramer, “Membership inference attacks from first principles,” in2022 IEEE symposium on security and privacy (SP). IEEE, 2022, pp. 1897–1914. 4, 5, 6, 16

work page 2022
[30]

Llama-license,

Meta, “Llama-license,” 2025. [Online]. Available: https://github.com/meta-llama/llama-models/blob/ma in/models/llama4/LICENSE 4

work page 2025
[31]

Qwen-license,

Alibaba, “Qwen-license,” 2025. [Online]. Available: https://hf-mirror.com/Qwen/Qwen2.5-Math-7B/blob/ main/LICENSE 4

work page 2025
[32]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Mea- suring mathematical problem solving with the math dataset,”arXiv preprint arXiv:2103.03874, 2021. 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Numinamath,

J. LI, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y . Fleureau, G. Lample, and S. Polu, “Numinamath,” [https://huggingface.co/AI-MO/Num inaMath-1.5](https://github.com/project-numina/aim o-progress-prize/blob/main/report/numina_dataset.pd f), 2024. 4, 6, 17

work page 2024
[34]

Livebench: A challenging, contamination-free LLM benchmark,

C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Dey, Shubh-Agrawal, S. S. Sandha, S. V . Naidu, C. Hegde, Y . LeCun, T. Goldstein, W. Neiswanger, and M. Gold- blum, “Livebench: A challenging, contamination-free LLM benchmark,” inThe Thirteenth International Conference on Learning Representations, 2025. 4

work page 2025
[35]

Approximating kl divergence,

J. Schulman, “Approximating kl divergence,” 2020. [Online]. Available: http://joschu.net/blog/kl-approx.h tml 5

work page 2020
[36]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

G. Cui, Y . Zhang, J. Chen, L. Yuan, Z. Wang, Y . Zuo, H. Li, Y . Fan, H. Chen, W. Chenet al., “The entropy mechanism of reinforcement learning for reasoning language models,”arXiv preprint arXiv:2505.22617,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Spurious Rewards: Rethinking Training Signals in RLVR

R. Shao, S. S. Li, R. Xin, S. Geng, Y . Wang, S. Oh, S. S. Du, N. Lambert, S. Min, R. Krishnaet al., “Spuri- ous rewards: Rethinking training signals in rlvr,”arXiv preprint arXiv:2506.10947, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

HybridFlow: A Flexible and Efficient RLHF Framework

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y . Peng, H. Lin, and C. Wu, “Hybridflow: A flexi- ble and efficient rlhf framework,”arXiv preprint arXiv: 2409.19256, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Inter-gps: Interpretable geometry prob- lem solving with formal language and symbolic reason- ing,

P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu, “Inter-gps: Interpretable geometry prob- lem solving with formal language and symbolic reason- ing,” inThe Joint Conference of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Conference on Natu- ral Language Processing (ACL-IJCN...

work page 2021
[40]

Differentially private decoding in large language models,

J. Majmudar, C. Dupuy, C. Peris, S. Smaili, R. Gupta, and R. Zemel, “Differentially private decoding in large language models,”arXiv preprint arXiv:2205.13621,

work page arXiv
[41]

Privacy- preserving mobile edge generation: Analog and digital transmission,

Y . Xue, R. Zhong, J. Li, Y . Chen, and Y . Liu, “Privacy- preserving mobile edge generation: Analog and digital transmission,”IEEE Transactions on Vehicular Tech- nology, pp. 1–6, 2025. 12

work page 2025
[42]

MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

J. Ye, C. Huang, Z. Chen, W. Fu, C. Yang, L. Yang, Y . Wu, P. Wang, M. Zhou, X. Yanget al., “A multi- dimensional constraint framework for evaluating and improving instruction following in large language mod- els,”arXiv preprint arXiv:2505.07591, 2025. 13, 20

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Pri- vacy risk in machine learning: Analyzing the connec- tion to overfitting,

S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Pri- vacy risk in machine learning: Analyzing the connec- tion to overfitting,” in2018 IEEE 31st computer secu- rity foundations symposium (CSF). IEEE, 2018, pp. 268–282. 16

work page 2018
[44]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” 2019. [Online]. Available: https://arxiv.org/abs/1907.11692 16

work page internal anchor Pith review Pith/arXiv arXiv 2019
[45]

Differential privacy,

C. Dwork, “Differential privacy,” inInternational col- loquium on automata, languages, and programming. Springer, 2006, pp. 1–12. 19, 20

work page 2006
[46]

Federated learning with differential privacy: Algorithms and performance analysis,

K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V . Poor, “Federated learning with differential privacy: Algorithms and performance analysis,”IEEE transactions on information forensics and security, vol. 15, pp. 3454–3469, 2020. 19

work page 2020
[47]

Instruction-Following Evaluation for Large Language Models

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-following 15 evaluation for large language models,”arXiv preprint arXiv:2311.07911, 2023. 20 A Related work •LOSS [43]:The simplest baseline uses the negative log- likelihood (NLL) under the target model: ScoreLOSS(q) =−logπ θ(q) =− |q| ∑ i=1 logπ θ(qi |q <i). A thr...

work page internal anchor Pith review Pith/arXiv arXiv 2023