Pith · machine review for the scientific record

arxiv: 2605.04477 · v1 · submitted 2026-05-06 · 💻 cs.LG

Recognition: unknown

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords: online RLHF · exploration · preference optimization · data-dependent regret · LLM alignment · sample efficiency

The pith

DEPO adds uncertainty bonuses from historical preferences to guide exploration in online RLHF.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online RLHF for aligning LLMs struggles with exploration because on-policy uncertainty estimates are unreliable with sparse preference data, causing early avoidance of under-explored but potentially valuable outputs. DEPO constructs an additional data-dependent bonus that targets high-uncertainty regions to encourage sampling there. The method yields a regret bound that depends on the actual data distribution and task difficulty rather than worst-case assumptions. This bound can be strictly tighter when the problem is easier than the worst case. Experiments show consistent gains in sample efficiency over strong baselines on standard benchmarks.
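For intuition on what a data-dependent bound buys, a hedged sketch in the style of linear-bandit analyses, which this line of work typically builds on; the paper's actual theorem is not quoted here and may take a different form. With feature maps φ_t, regularizer λ, and empirical covariance Σ_T = Σ_{t≤T} φ_t φ_tᵀ, an elliptical-potential argument bounds regret by an observed-data quantity:

\[
\mathrm{Regret}(T) \;\lesssim\; \sqrt{T \,\log\det\!\left(I + \lambda^{-1}\Sigma_T\right)} \;\le\; \widetilde{O}\!\left(\sqrt{dT}\right)
\]

up to confidence-width factors. The middle quantity shrinks on easy instances, where sampled features concentrate in a well-covered subspace, while the right-hand side is the instance-independent worst case.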

Core claim

DEPO is a simple, scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. It supplies a data-dependent regret bound that adapts to the hardness of the learning task and can be tighter than worst-case bounds in practice, while delivering improved sample efficiency across benchmarks.

What carries the argument

The data-dependent uncertainty bonus built from historical preference data, which augments the exploration signal inside the preference optimization objective.
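A minimal sketch of how such a bonus could be computed and coupled to a preference loss, assuming a linear-bandit-style uncertainty measure over response features and a DPO-style objective. The function names, feature construction, and exact coupling are illustrative assumptions rather than the paper's published formulation; only the bonus scale c_b = 2e−2 is taken from Figure 1.

import torch
import torch.nn.functional as F

def uncertainty_bonus(features, hist_features, lam=1.0):
    # Hypothetical data-dependent bonus: an elliptical-potential-style width
    # computed from features of responses compared in earlier rounds.
    # features: (B, d) candidates; hist_features: (N, d) historical data.
    d = hist_features.shape[1]
    # Regularized covariance of the history (lam * I keeps it invertible).
    cov = hist_features.T @ hist_features + lam * torch.eye(d)
    cov_inv = torch.linalg.inv(cov)
    # Mahalanobis-style width: large where the history covers a region poorly.
    return torch.sqrt((features @ cov_inv * features).sum(dim=-1))

def depo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, bonus_w,
                    beta=0.1, c_b=2e-2):
    # DPO-style preference loss minus a scaled exploration bonus on the
    # preferred response; DEPO's actual objective may couple these differently.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pref_loss = -F.logsigmoid(margin).mean()
    return pref_loss - c_b * bonus_w.mean()  # subtracting rewards high uncertainty

The shape of the idea is what matters: the bonus is built from historical data alone, so it requires no on-policy expectation of the kind the abstract identifies as unreliable.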

Load-bearing premise

Uncertainty estimates constructed from limited historical preference data reliably identify high-value regions without introducing systematic bias or over-exploration in low-value areas.

What would settle it

The central claim would be falsified by a benchmark on which adding the historical-data bonus yields no gain, or a loss, in sample efficiency relative to baselines that omit it.

Figures

Figures reproduced from arXiv: 2605.04477 by Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama, Yuting Tang, Zhen-Yu Zhang.

Figure 1. Ablation on the bonus scale c_b. A moderate choice c_b = 2e−2 yields the strongest overall trend.

Model                WR     AvgR
(πt, πt)-iter2       88.6   −2.84
(πt, πt)-iter3       92.5    0.21
(πt−1, πref)-iter2   87.8   −4.09
(πt−1, πref)-iter3   91.7   −2.63
Original abstract

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration, which requires algorithms that enable the LLMs to generate informative comparisons that improve sample-efficiency in online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DEPO, a method for online RLHF that constructs an extra uncertainty bonus from historical preference data to encourage exploration in high-uncertainty regions that may contain high-value behaviors. It claims a data-dependent regret bound that adapts to the hardness of the learning task and can be tighter than worst-case bounds, together with empirical results showing consistent outperformance and improved sample efficiency over strong baselines across benchmarks.

Significance. If the data-dependent regret bound is valid and the uncertainty estimates reliably support adaptation without systematic bias, the work would advance sample-efficient online RLHF beyond standard worst-case analyses, with direct relevance to LLM alignment. The empirical gains, if robust to implementation details, could influence practical exploration strategies in preference optimization.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis section: the data-dependent regret bound is presented as adapting to task hardness via the uncertainty bonus, but the derivation does not appear to include an analysis of how estimation error in the bonus (from sparse early-stage preference data) affects the bound. This is load-bearing for the central claim that the bound is tighter than worst-case in practice, as noisy or biased estimates could cause the adaptation property to fail.
  2. [Experimental results] Experimental results section: the reported outperformance relies on the uncertainty bonus identifying high-value regions, yet the manuscript provides insufficient detail on the exact estimator, data exclusion rules, or sensitivity to limited historical data. Without these, it is difficult to verify that the gains are not due to post-hoc choices or over-exploration of low-value areas, undermining the sample-efficiency claim.
minor comments (2)
  1. [Method] The notation for the uncertainty bonus could be introduced with an explicit equation in the main body rather than deferred to the appendix for easier reading.
  2. [Experiments] Figure captions for benchmark curves should include the number of runs and whether error bars represent standard deviation or standard error.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major concerns below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: the data-dependent regret bound is presented as adapting to task hardness via the uncertainty bonus, but the derivation does not appear to include an analysis of how estimation error in the bonus (from sparse early-stage preference data) affects the bound. This is load-bearing for the central claim that the bound is tighter than worst-case in practice, as noisy or biased estimates could cause the adaptation property to fail.

    Authors: We appreciate the referee pointing out this potential gap in the analysis. The data-dependent regret bound in our paper is derived by incorporating the uncertainty bonus, which is computed from historical preference data, directly into the regret decomposition. This allows the bound to adapt based on the actual data observed, including the sparsity in early stages. However, to make the adaptation property more robust, we will add a subsection in the theoretical analysis that bounds the estimation error of the bonus using Hoeffding-type inequalities on the preference data, showing that the additional error term vanishes as more data is collected and does not invalidate the data-dependent tightness. This revision will explicitly address how the bound remains valid despite initial estimation inaccuracies. revision: yes

  2. Referee: [Experimental results] Experimental results section: the reported outperformance relies on the uncertainty bonus identifying high-value regions, yet the manuscript provides insufficient detail on the exact estimator, data exclusion rules, or sensitivity to limited historical data. Without these, it is difficult to verify that the gains are not due to post-hoc choices or over-exploration of low-value areas, undermining the sample-efficiency claim.

    Authors: We agree that more implementation details are essential for full reproducibility and to support the empirical claims. In the revised manuscript, we will expand the experimental section to include: (1) the precise mathematical formulation of the uncertainty estimator used in DEPO, which is based on variance estimates from the historical preference dataset; (2) explicit data exclusion rules, specifying that only preferences collected in previous rounds are used to avoid lookahead bias; and (3) additional sensitivity experiments varying the volume of historical data (e.g., using 10%, 50%, and 100% of available data) to demonstrate robustness. These additions will clarify that the performance gains arise from the adaptive exploration mechanism rather than specific tuning choices. revision: yes
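Two hedged sketches of what the promised revisions could look like; neither is quoted from the paper, and all symbols and names below are illustrative. For the first response, a standard Hoeffding-type concentration step: if the bonus estimate b̂_t averages n_t preference signals bounded in [0, B], then for any ε > 0,

\[
\Pr\left(\,\left|\hat b_t - b_t\right| \ge \epsilon\,\right) \;\le\; 2\exp\!\left(-\frac{2 n_t \epsilon^2}{B^2}\right),
\]

so with probability at least 1 − δ the estimation error is at most B√(log(2/δ) / (2 n_t)), a term that vanishes as preference data accumulates, which is what would let the data-dependent tightness survive early-stage sparsity.

For the second response, a minimal sketch of the sensitivity protocol under the stated assumptions (variance-based estimator over past-round preferences only, history subsampled at 10%/50%/100%); the function names are hypothetical:

import numpy as np

def variance_estimator(prefs):
    # prefs: dict mapping prompt -> list of binary preference outcomes
    # collected in previous rounds only (the no-lookahead rule above).
    # Prompts with at most one observation default to maximal uncertainty.
    return {q: float(np.var(y)) if len(y) > 1 else 1.0
            for q, y in prefs.items()}

def sensitivity_sweep(history, fractions=(0.10, 0.50, 1.00), seed=0):
    # Rerun the estimator on random subsets of the history to check how
    # robust the resulting bonuses are to limited data.
    rng = np.random.default_rng(seed)
    results = {}
    for frac in fractions:
        sub = {q: [y for y in ys if rng.random() < frac]
               for q, ys in history.items()}
        results[frac] = variance_estimator(sub)
    return results

Each label is kept independently with probability frac, so subset sizes are approximate; an exact partition would serve the stated 10%/50%/100% splits equally well.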

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes DEPO using historical preference data for an uncertainty bonus and claims a data-dependent regret bound that adapts to task hardness. No equations or steps are quoted that reduce the bound by construction to fitted inputs, self-definitions, or self-citations. The theoretical claim is presented as independent analysis, and empirical outperformance on benchmarks is separate from the bound derivation. No load-bearing self-citation, ansatz smuggling, or renaming of known results is evident. The derivation chain remains self-contained against external benchmarks and does not exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that historical data yields usable uncertainty estimates; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Historical preference data can be used to construct reliable uncertainty estimates for exploration bonuses.
    Invoked when describing how the bonus is built from past data.

pith-pipeline@v0.9.0 · 5495 in / 1081 out tokens · 41762 ms · 2026-05-08T16:55:48.119204+00:00 · methodology

