pith. the verified trust layer for science.

arXiv: 2511.14045 · v2 · submitted 2025-11-18 · 💻 cs.CR · cs.AI · cs.CL

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards

Pith reviewed 2026-05-17 21:33 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL
keywords membership inference · reinforcement learning · verifiable rewards · LLM auditing · data membership · behavioral divergence · white-box auditing

The pith

RLVR training reshapes model responses on used prompts, leaving detectable behavioral traces that white-box auditing can surface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that membership inference remains feasible for Reinforcement Learning with Verifiable Rewards despite the generative nature of the process. RLVR alters the response distribution on training prompts in ways that produce measurable differences from a pre-RLVR checkpoint. A new method called DIBA aggregates reward success rates and policy drift across multiple rollouts to generate a stable per-prompt signal. This matters for checking whether proprietary prompt sets were used without permission in LLM development. The approach outperforms likelihood-based baselines in white-box settings and shows some transferability in grey-box conditions.

Core claim

RLVR auditing works because the reinforcement step changes how the model behaves on the exact prompts it was trained on, creating divergence from the base model both in verifiable task success and in the distribution of generated responses. DIBA measures this divergence along reward and policy axes and aggregates over stochastic samples to produce a query-level score that distinguishes training prompts from others.
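
A minimal Python sketch of how a per-prompt score along these two axes could be assembled. The `Rollout` container, the two feature definitions, and the weights in `audit_score` are illustrative assumptions, not the paper's feature set or aggregation rule. The sketch also assumes each rollout already carries a verifiable reward and a sequence log-probability under both checkpoints.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class Rollout:
    """One sampled response to the audited prompt (hypothetical container)."""
    reward: float      # verifiable reward, e.g. 1.0 if the answer checks out
    logp_tuned: float  # sequence log-probability under the post-RLVR model
    logp_base: float   # sequence log-probability under the pre-RLVR checkpoint


def behavior_divergence(tuned: Sequence[Rollout],
                        base: Sequence[Rollout]) -> Dict[str, float]:
    """Illustrative reward-side and policy-side features for one prompt.

    `tuned` holds rollouts sampled from the post-RLVR model, `base` holds
    rollouts sampled from the pre-RLVR checkpoint.
    """
    if not tuned or not base:
        raise ValueError("need at least one rollout from each model")
    # Reward axis: change in average verifiable success after RLVR.
    reward_gain = mean(r.reward for r in tuned) - mean(r.reward for r in base)
    # Policy axis: mean log-ratio on the tuned model's own samples, which is a
    # Monte Carlo estimate of the sequence-level KL(tuned || base).
    policy_drift = mean(r.logp_tuned - r.logp_base for r in tuned)
    return {"reward_gain": reward_gain, "policy_drift": policy_drift}


def audit_score(features: Dict[str, float],
                w_reward: float = 1.0,
                w_policy: float = 1.0) -> float:
    """Combine both axes with hypothetical weights; higher is more member-like."""
    return (w_reward * features["reward_gain"]
            + w_policy * features["policy_drift"])
```

Averaging over several rollouts per prompt is what keeps the score stable here: a single sample gives a noisy estimate of both the success rate and the log-ratio.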

What carries the argument

Divergence-in-Behavior Auditing (DIBA), which compares the RLVR model to its pre-RLVR checkpoint on reward success and prompt-conditioned response drift, aggregated over multiple rollouts.

If this is right

  • Auditing signal is stronger when RLVR produces non-trivial prompt-specific behavioral changes.
  • Auditing signal weakens when the base model already succeeds on the prompt before RLVR.
  • Transfer performance holds across model sizes under the same RLVR algorithm.
  • Transfer varies more across different RLVR algorithms.
  • Grey-box transfer can remain useful under distribution shift when shadow data is chosen carefully.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar divergence techniques might apply to other generative training regimes that lack fixed target strings.
  • Defenses could focus on reducing prompt-specific behavioral drift rather than hiding likelihoods.
  • Practical deployment would benefit from methods that relax the white-box requirement while preserving signal strength.

Load-bearing premise

The method requires white-box access to both the post-RLVR model and its pre-RLVR checkpoint plus the ability to run multiple stochastic generations and obtain verifiable reward signals.

What would settle it

The claim would be refuted if, across many known training prompts, DIBA yielded divergence scores no higher than on matched non-training prompts of similar difficulty.
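
One way to run such a check, sketched in Python under stated assumptions: each known training prompt is paired with a non-training prompt of similar difficulty, both are scored under identical auditing conditions, and a one-sided sign test asks whether members win more often than chance. The function name `paired_win_rate` and the choice of a sign test are illustrative; the paper does not prescribe this procedure.

```python
from math import comb
from typing import Sequence, Tuple


def paired_win_rate(pairs: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (win rate, one-sided sign-test p-value).

    Each pair is (member_score, matched_non_member_score). Tied pairs are
    dropped, as in the standard sign test.
    """
    wins = sum(1 for member, other in pairs if member > other)
    losses = sum(1 for member, other in pairs if member < other)
    trials = wins + losses
    if trials == 0:
        raise ValueError("all pairs are tied; nothing to test")
    # P(X >= wins) under the null hypothesis X ~ Binomial(trials, 0.5).
    p_value = sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials
    return wins / trials, p_value
```

A win rate near 0.5 with a large p-value over many matched pairs would be the outcome that undermines the core claim.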

Figures

Figures reproduced from arXiv: 2511.14045 by Heyi Zhang, Jiaheng Wei, Jinyi Zheng, Tianshuo Cong, Xinlei He, Yilong Yang, Yule Liu, Zhen Sun, Zifan Peng.

Figure 1. MIAs in different LLM training paradigms.
Figure 2. Overview of DIBA’s feature construction.
Figure 3. AUC and log-scale ROC curves for DIBA on different models and training algorithms.
Figure 4. Performance Over Steps: Train reward represents
Figure 6. TP Percentile: Distribution of feature percentile
Figure 7. AUC (upper) and TPR@0.1%FPR (lower) performance of transferring across different datasets
Figure 8. False Positive Percentiles: Distribution of feature per
Figure 9. Cross model/algorithm transferability of DIBA.
Figure 10. Ablation study on sampled responses. Sampling Temperatures. Another key hyperparameter in DIBA is the sampling temperature. Since feature estimation relies on representative response distributions, the choice of temperature directly impacts the fidelity of behavioral change measurement. Results in
Figure 12. AUC and log-scale ROC curves on VLMs. is similar to the performance observed on language-only LLMs, indicating that the core membership signals remain detectable in multimodal settings. However, two key differences emerge. First, performance in privacy-sensitive, low-FPR regimes is notably weaker, with TPR@0.1%FPR dropping from the 1e-1 range (as seen in LLMs) to the 1e-2 range. This suggests that while …
Figure 13. Performance over steps. LLMs typically applies to VLMs as well. 7 Defense. To mitigate the membership leakage in RLVR, we investigate several defenses against DIBA. Training Regularization. Since DIBA exploits the behavioral divergence between the base and fine-tuned models, an intuitive defense is to strengthen regularization during training. According to
Figure 14. AUC and TPR@0.1%FPR of DIBA against local DP
Figure 16. Adaptation to instruction-following dataset.
Figure 15. Black-box performance of DIBA. 8 Discussions and Limitations. Black-box Setting. We investigate the feasibility of applying DIBA in a black-box setting, where the adversary only has access to generated text from the base and fine-tuned models, without knowledge of model architecture, size, or training algorithm. In this scenario, logit-side features cannot be computed directly and must instead be estimat…
Figure 17. Similarity Between Responses Generated by Fine
Figure 18. In-distribution results of shadow datasets
Figure 19. Ablation study on sampling temperature. Response Perturbation
Figure 20. AUC and TPR@0.1%FPR of DIBA against local DP

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a core training stage in recent large language models (LLMs). Its reliance on non-public, high-value prompt sets raises concerns about unauthorized data use, creating a need for exposure auditing. A natural tool is membership inference attacks (MIAs), but existing methods detect fitting to a fixed target string. This does not apply to RLVR, which generates responses from the model itself and reinforces successful ones, thus hindering the auditing of data exposure. We show that it remains detectable: RLVR reshapes the model's response distribution on training prompts, producing behavioral traces that can be surfaced through targeted auditing. We propose Divergence-in-Behavior Auditing (DIBA), a white-box query-level auditing framework for RLVR. DIBA compares a fine-tuned model against its pre-RLVR checkpoint along two axes: reward-side evidence capturing changes in verifiable task success, and policy-side evidence capturing prompt-conditioned behavioral drift. By aggregating over multiple stochastic rollouts, DIBA produces a stable query-level auditing signal. Under a white-box setting, DIBA consistently outperforms strong transferred likelihood-based baselines, including calibrated and self-generated variants, achieving around 0.8 AUC and an order-of-magnitude stronger TPR@0.1%FPR. We further show that RLVR auditing is stronger when training leaves non-trivial prompt-specific traces and weaker when the base model already performs well on the prompt. Under a practical grey-box setting, transfer is often robust across model sizes under the same RLVR algorithm, but more varied across algorithms, and can remain useful under distribution shift with carefully chosen shadow data.
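
The two headline metrics, AUC and TPR at a fixed low FPR, can be computed from per-prompt auditing scores as sketched below. This is a plain-Python illustration, not the paper's evaluation code; the function names and the strict "score > threshold" decision rule are assumptions.

```python
from typing import Sequence


def auc(member_scores: Sequence[float],
        nonmember_scores: Sequence[float]) -> float:
    """AUC as the probability that a member outscores a non-member.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               max_fpr: float = 0.001) -> float:
    """Largest TPR of a rule 'score > t' whose FPR stays at or below max_fpr."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # false positives we may accept
    # Flagging strictly above the (allowed + 1)-th highest non-member score
    # produces at most `allowed` false positives.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for m in member_scores if m > threshold)
    return flagged / len(member_scores)
```

With fewer than 1,000 non-member prompts, a 0.1% FPR budget allows zero false positives, so the threshold falls back to the highest non-member score. That is one reason low-FPR results are sensitive to evaluation set size.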

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that membership of prompts in RLVR training of LLMs remains detectable despite the lack of fixed target strings. It introduces Divergence-in-Behavior Auditing (DIBA), a white-box method that compares a post-RLVR model against its pre-RLVR checkpoint using reward-side evidence (changes in verifiable task success) and policy-side evidence (prompt-conditioned behavioral drift), aggregated over multiple stochastic rollouts. DIBA is reported to achieve ~0.8 AUC and an order-of-magnitude improvement in TPR@0.1%FPR over transferred likelihood baselines, with stronger signals when the base model does not already perform well on the prompt and some robustness under grey-box transfer.

Significance. If the central results hold under proper controls, this provides a concrete auditing technique for high-value, non-public prompt sets used in RLVR, a now-standard stage in LLM post-training. The work correctly identifies that standard MIAs fail for RLVR and instead exploits observable checkpoint differences plus verifiable rewards. Credit is due for the dual-axis design (reward + policy) and the empirical comparison to calibrated/self-generated baselines; these elements make the contribution more than incremental.

major comments (2)
  1. The central claim that RLVR produces prompt-specific behavioral traces detectable by DIBA is load-bearing, yet the manuscript provides no controlled contrast between training prompts and semantically or distributionally adjacent non-member prompts under identical auditing conditions (see Abstract and the experimental results). Because policy-gradient updates optimize over a prompt distribution, similar reward gains and policy drift can plausibly appear on near-distribution non-members; without this contrast the aggregated DIBA signal risks conflating membership with general task improvement.
  2. Abstract: the reported ~0.8 AUC and strong TPR@0.1%FPR are presented without details on prompt selection criteria, number of stochastic rollouts, variance across seeds, or whether reward signals were post-hoc filtered. These omissions make it impossible to assess absence of selection bias and therefore weaken the claim that DIBA 'consistently outperforms' the transferred baselines.
minor comments (2)
  1. Clarify the exact aggregation function used to combine reward-side and policy-side evidence into the final query-level score; a short pseudocode block or equation would remove ambiguity.
  2. The grey-box transfer results across model sizes and algorithms would benefit from an explicit table listing the source/target pairs and the resulting AUC drop, rather than qualitative statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and indicating where revisions have been made to strengthen the manuscript.

point-by-point responses
  1. Referee: The central claim that RLVR produces prompt-specific behavioral traces detectable by DIBA is load-bearing, yet the manuscript provides no controlled contrast between training prompts and semantically or distributionally adjacent non-member prompts under identical auditing conditions (see Abstract and the experimental results). Because policy-gradient updates optimize over a prompt distribution, similar reward gains and policy drift can plausibly appear on near-distribution non-members; without this contrast the aggregated DIBA signal risks conflating membership with general task improvement.

    Authors: We agree that explicitly contrasting training prompts against semantically or distributionally adjacent non-members is important for isolating membership-specific effects from broader task improvements. Our original experiments already included non-member prompts drawn from the same task families and demonstrated that DIBA signals are stronger when the base model performs poorly on a given prompt (indicating prompt-specific traces rather than uniform gains). However, to directly address the concern, we have added a new controlled experiment in the revised manuscript. Using embedding-based similarity, we selected near-neighbor non-member prompts and recomputed DIBA scores under identical conditions. The results show statistically significant separation, with member prompts yielding higher aggregated scores; we include this analysis, associated statistics, and a new figure in Section 4.

    revision: yes

  2. Referee: Abstract: the reported ~0.8 AUC and strong TPR@0.1%FPR are presented without details on prompt selection criteria, number of stochastic rollouts, variance across seeds, or whether reward signals were post-hoc filtered. These omissions make it impossible to assess absence of selection bias and therefore weaken the claim that DIBA 'consistently outperforms' the transferred baselines.

    Authors: We acknowledge that the abstract omitted several experimental parameters due to length constraints. In the revised version, we have updated the abstract to note that prompts were drawn from a held-out set of verifiable tasks with no training overlap, that the reported metrics are computed over 10 stochastic rollouts per prompt, and that results reflect averages with low variance across three random seeds (AUC standard deviation of approximately 0.03). No post-hoc filtering of reward signals was performed. These additions should permit better assessment of the setup while preserving the abstract's brevity; complete details, including seed-level breakdowns, remain in the Experiments section.

    revision: yes
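
The near-neighbor control described in the first response can be sketched as follows. The simulated rebuttal says only "embedding-based similarity", so the use of cosine similarity, the prompt-id dictionary, and the function names are assumptions made for illustration. The embeddings are taken as given.

```python
from math import sqrt
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-norm vector has no direction")
    return dot / norm


def nearest_nonmember(member_vec: Sequence[float],
                      nonmember_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the id of the non-member prompt closest to the member prompt."""
    if not nonmember_vecs:
        raise ValueError("need at least one candidate non-member")
    return max(nonmember_vecs,
               key=lambda pid: cosine(member_vec, nonmember_vecs[pid]))
```

Each member prompt and its selected neighbor would then be scored under identical rollout settings, so that any remaining gap in scores is harder to attribute to general task improvement.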

Circularity Check

0 steps flagged

Derivation self-contained; no load-bearing circular reductions identified

The DIBA auditing signal is constructed from direct, observable differences in verifiable reward success and prompt-conditioned policy drift between independently provided pre-RLVR and post-RLVR checkpoints. These quantities are measured via multiple stochastic rollouts on the audited prompts and do not reduce to parameters fitted inside the auditing procedure or to self-citations that define the target result. No self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via prior author work appear in the derivation chain. The central claim therefore retains independent empirical content from the pre/post comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that RLVR training produces measurable, prompt-specific changes in both verifiable success rate and response distribution that are distinguishable from general capability gains.

axioms (1)
  • domain assumption: RLVR training leaves detectable prompt-specific traces in response distribution and task success
    Invoked to justify that auditing remains possible despite model-generated responses.

pith-pipeline@v0.9.0 · 5628 in / 1220 out tokens · 28765 ms · 2026-05-17T21:33:41.121555+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Differentially Private Reinforcement Learning with General Function Approximation

    cs.LG 2026-05 unverdicted novelty 7.0

    The work establishes the first DP regret bound of order O(K^{3/5}) for model-free online RL under general function approximation and the first coverability-based regret bound for batched non-private RL.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 19 internal anchors

  1. [1]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.

  2. [2]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923,

  3. [3]

    On the generalization and adaptation ability of machine-generated text detectors in academic writing

    Y. Liu, Z. Zhong, Y. Liao, Z. Sun, J. Zheng, J. Wei, Q. Gong, F. Tong, Y. Chen, Y. Zhang et al., “On the generalization and adaptation ability of machine-generated text detectors in academic writing,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 5674–5685.

  4. [4]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu et al., “Qwen2.5-Coder technical report,” arXiv preprint arXiv:2409.12186,

  5. [5]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

  6. [6]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.

  7. [7]

    The Llama 3 herd of models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv e-prints, pp. arXiv–2407, 2024.

  8. [8]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    X. Wen, Z. Liu, S. Zheng, Z. Xu, S. Ye, Z. Wu, X. Liang, Y. Wang, J. Li, Z. Miao et al., “Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs,” arXiv preprint arXiv:2506.14245, 2025.

  9. [9]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu et al., “DAPO: An open-source LLM reinforcement learning system at scale,” arXiv preprint arXiv:2503.14476, 2025.

  10. [10]

    A technical survey of reinforcement learning techniques for large language models

    S. S. Srivastava and V. Aggarwal, “A technical survey of reinforcement learning techniques for large language models,” CoRR, vol. abs/2507.04136, 2025.

  11. [11]

    Kimi K2: Open Agentic Intelligence

    K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen et al., “Kimi K2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025.

  12. [12]

    Qwen2 Technical Report

    A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang et al., “Qwen2 technical report, 2024,” URL https://arxiv.org/abs/2407.10671, vol. 7, p. 8, 2024.

  13. [13]

    Recall: Membership inference via relative conditional log-likelihoods

    R. Xie, J. Wang, R. Huang, M. Zhang, R. Ge, J. Pei, N. Z. Gong, and B. Dhingra, “Recall: Membership inference via relative conditional log-likelihoods,” arXiv preprint arXiv:2406.15968, 2024.

  14. [14]

    Min-k%++: Improved baseline for detecting pre-training data from large language models

    J. Zhang, J. Sun, E. Yeats, Y. Ouyang, M. Kuo, J. Zhang, H. F. Yang, and H. Li, “Min-k%++: Improved baseline for detecting pre-training data from large language models,” arXiv preprint arXiv:2404.02936,

  15. [15]

    Exposing privacy gaps: Membership inference attack on preference data for LLM alignment

    Q. Feng, S. R. Kasa, S. K. Kasa, H. Yun, C. H. Teo, and S. B. Bodapati, “Exposing privacy gaps: Membership inference attack on preference data for LLM alignment,” arXiv preprint arXiv:2407.06443, 2024.

  16. [16]

    Practical membership inference attacks against fine-tuned large language models via self-prompt calibration

    W. Fu, H. Wang, C. Gao, G. Liu, Y. Li, and T. Jiang, “Practical membership inference attacks against fine-tuned large language models via self-prompt calibration,” arXiv preprint arXiv:2311.06062, 2023.

  17. [17]

    Lora-leak: Membership inference attacks against LoRA fine-tuned language models

    D. Ran, X. He, T. Cong, A. Wang, Q. Li, and X. Wang, “Lora-leak: Membership inference attacks against LoRA fine-tuned language models,” arXiv preprint arXiv:2507.18302, 2025.

  18. [18]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma, “SFT memorizes, RL generalizes: A comparative study of foundation model post-training,” arXiv preprint arXiv:2501.17161, 2025.

  19. [19]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.

  20. [20]

    Direct preference optimization: Your language model is secretly a reward model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in neural information processing systems, vol. 36, pp. 53728–53741, 2023.

  21. [21]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

  22. [22]

    Thought manipulation: External thought can be efficient for large reasoning models

    Y. Liu, J. Zheng, Z. Sun, Z. Peng, W. Dong, Z. Sha, S. Cui, W. Wang, and X. He, “Thought manipulation: External thought can be efficient for large reasoning models,” arXiv preprint arXiv:2504.13626, 2025.

  23. [23]

    Detecting Pretraining Data from Large Language Models

    W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer, “Detecting pretraining data from large language models,” arXiv preprint arXiv:2310.16789, 2023.

  24. [24]

    Membership inference attacks against language models via neighbourhood comparison

    J. Mattern, F. Mireshghallah, Z. Jin, B. Schölkopf, M. Sachan, and T. Berg-Kirkpatrick, “Membership inference attacks against language models via neighbourhood comparison,” arXiv preprint arXiv:2305.18462,

  25. [25]

    Extracting training data from large language models

    N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in 30th USENIX security symposium (USENIX Security 21), 2021, pp. 2633–

  26. [26]

    Language models are unsupervised multitask learners

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9,

  27. [27]

    Pythia: A suite for analyzing large language models across training and scaling

    S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” in International Conference on Machine Learning. PMLR, 2023, pp. 2397–2430.

  28. [28]

    Generalization or memorization: Data contamination and trustworthy evaluation for large language models

    Y. Dong, X. Jiang, H. Liu, Z. Jin, B. Gu, M. Yang, and G. Li, “Generalization or memorization: Data contamination and trustworthy evaluation for large language models,” arXiv preprint arXiv:2402.15938, 2024.

  29. [29]

    Membership inference attacks from first principles

    N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramer, “Membership inference attacks from first principles,” in 2022 IEEE symposium on security and privacy (SP). IEEE, 2022, pp. 1897–1914.

  30. [30]

    Llama-license

    Meta, “Llama-license,” 2025. [Online]. Available: https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE

  31. [31]

    Qwen-license

    Alibaba, “Qwen-license,” 2025. [Online]. Available: https://hf-mirror.com/Qwen/Qwen2.5-Math-7B/blob/main/LICENSE

  32. [32]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the MATH dataset,” arXiv preprint arXiv:2103.03874, 2021.

  33. [33]

    NuminaMath

    J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu, “NuminaMath,” [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024.

  34. [34]

    Livebench: A challenging, contamination-free LLM benchmark

    C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Dey, Shubh-Agrawal, S. S. Sandha, S. V. Naidu, C. Hegde, Y. LeCun, T. Goldstein, W. Neiswanger, and M. Goldblum, “Livebench: A challenging, contamination-free LLM benchmark,” in The Thirteenth International Conference on Learning Representations, 2025.

  35. [35]

    Approximating KL divergence

    J. Schulman, “Approximating KL divergence,” 2020. [Online]. Available: http://joschu.net/blog/kl-approx.html

  36. [36]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen et al., “The entropy mechanism of reinforcement learning for reasoning language models,” arXiv preprint arXiv:2505.22617,

  37. [37]

    Spurious Rewards: Rethinking Training Signals in RLVR

    R. Shao, S. S. Li, R. Xin, S. Geng, Y. Wang, S. Oh, S. S. Du, N. Lambert, S. Min, R. Krishna et al., “Spurious rewards: Rethinking training signals in RLVR,” arXiv preprint arXiv:2506.10947, 2025.

  38. [38]

    HybridFlow: A Flexible and Efficient RLHF Framework

    G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu, “HybridFlow: A flexible and efficient RLHF framework,” arXiv preprint arXiv:2409.19256, 2024.

  39. [39]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu, “Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning,” in The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCN...

  40. [40]

    Differentially private decoding in large language models

    J. Majmudar, C. Dupuy, C. Peris, S. Smaili, R. Gupta, and R. Zemel, “Differentially private decoding in large language models,” arXiv preprint arXiv:2205.13621,

  41. [41]

    Privacy-preserving mobile edge generation: Analog and digital transmission

    Y. Xue, R. Zhong, J. Li, Y. Chen, and Y. Liu, “Privacy-preserving mobile edge generation: Analog and digital transmission,” IEEE Transactions on Vehicular Technology, pp. 1–6, 2025.

  42. [42]

    MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

    J. Ye, C. Huang, Z. Chen, W. Fu, C. Yang, L. Yang, Y. Wu, P. Wang, M. Zhou, X. Yang et al., “A multi-dimensional constraint framework for evaluating and improving instruction following in large language models,” arXiv preprint arXiv:2505.07591, 2025.

  43. [43]

    Privacy risk in machine learning: Analyzing the connection to overfitting

    S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Privacy risk in machine learning: Analyzing the connection to overfitting,” in 2018 IEEE 31st computer security foundations symposium (CSF). IEEE, 2018, pp. 268–282.

  44. [44]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” 2019. [Online]. Available: https://arxiv.org/abs/1907.11692

  45. [45]

    Differential privacy

    C. Dwork, “Differential privacy,” in International colloquium on automata, languages, and programming. Springer, 2006, pp. 1–12.

  46. [46]

    Federated learning with differential privacy: Algorithms and performance analysis

    K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor, “Federated learning with differential privacy: Algorithms and performance analysis,” IEEE transactions on information forensics and security, vol. 15, pp. 3454–3469, 2020.

  47. [47]

    Instruction-Following Evaluation for Large Language Models

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023.

A Related work

  • LOSS [43]: The simplest baseline uses the negative log-likelihood (NLL) under the target model: $\mathrm{Score}_{\mathrm{LOSS}}(q) = -\log \pi_\theta(q) = -\sum_{i=1}^{|q|} \log \pi_\theta(q_i \mid q_{<i})$. A thr...