Response Time Enhances Alignment with Heterogeneous Preferences
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-11 00:58 UTC · model grok-4.3
The pith
Incorporating response times into preference data restores identifiability of the population-average preference under heterogeneous anonymous labelers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling each binary choice as arising from a drift-diffusion process, the authors derive a novel estimator for the average preference parameter across a heterogeneous population. They prove that this estimator converges in probability to the true population average as the number of observations grows, even in the limiting case where each anonymous labeler provides exactly one choice. Empirical tests on both simulated data and real preference datasets confirm that the method avoids the bias floor encountered by choice-only baselines.
What carries the argument
A drift-diffusion model (DDM) estimator that uses observed response times to infer and correct for varying preference strengths across anonymous labelers.
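To make the mechanism concrete: for a symmetric two-boundary Wiener/DDM process (start at 0, absorbing boundaries ±b, unit diffusion; this notation is illustrative, not necessarily the paper's), the standard first-passage identities are

  \Pr(\text{choose } + \mid v) = \frac{1}{1 + e^{-2bv}}, \qquad \mathbb{E}[T \mid v] = \frac{b}{v}\,\tanh(bv).

With heterogeneous drifts v ~ F, pooled choice frequencies identify only E_F[1/(1+e^{-2bv})], and inverting the logistic at that value does not return E_F[v] (Jensen's gap): that is the bias floor. Mean response time varies with |v|, so the joint law of choice and time carries the extra moment needed to pin down E_F[v].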
If this is right
- The estimator is asymptotically consistent for the average preference.
- It succeeds where standard methods plateau at a fixed bias.
- No need for repeated measurements or user identifiers.
- Improves the social benefit of data-collection pipelines, since response times are essentially free to record.
Where Pith is reading between the lines
- Future work could replace the DDM with other sequential sampling models if response times follow different patterns.
- This technique might extend to other domains like survey design or market research where single anonymous responses are common.
- Collecting response times could be combined with other cheap signals to further reduce identifiability issues.
Load-bearing premise
The observed response times must be generated by a drift-diffusion process in which the drift parameter corresponds to each individual's preference strength.
What would settle it
Collect single-choice responses with timings from a population whose heterogeneous preferences have a known average; the estimator should recover that average when the DDM holds, and should deviate when timings are randomized or generated by a different process.
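A minimal simulation sketch of that check, assuming the symmetric unit-diffusion DDM sketched above; the function names and parameter values here are ours, and the paper's RT-weighted corrector is not reproduced, only the bias floor it claims to remove:

  import numpy as np

  rng = np.random.default_rng(0)

  def simulate_ddm(v, b=1.0, dt=1e-3):
      # Euler-Maruyama walk with drift v between absorbing bounds +/-b;
      # returns (choice in {+1, -1}, decision time).
      x, t = 0.0, 0.0
      while abs(x) < b:
          x += v * dt + np.sqrt(dt) * rng.standard_normal()
          t += dt
      return (1 if x >= b else -1), t

  # One single-shot choice per anonymous labeler, heterogeneous drifts, known mean 0.5.
  drifts = rng.normal(0.5, 1.0, size=1000)
  observations = [simulate_ddm(v) for v in drifts]
  choices = np.array([z for z, _ in observations])

  # Choice-only baseline: invert the pooled logistic P(+) = 1 / (1 + exp(-2*b*v)).
  p = (choices == 1).mean()
  v_choice_only = 0.5 * np.log(p / (1.0 - p))  # logit(p) / (2b) with b = 1
  print(f"true mean drift: 0.50   choice-only estimate: {v_choice_only:.2f}")
  # The gap persists as n grows (Jensen's inequality); shuffling the recorded
  # times across labelers gives the randomized-timing control arm described above.

If the model holds, an RT-based corrector fed the (choice, time) pairs should close the printed gap; fed shuffled times, it should not.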
Original abstract
Aligning large language models (LLMs) to human preferences typically relies on aggregating pooled feedback into a single reward model. However, this standard approach assumes that all labelers share the same underlying preferences, ignoring the fact that real-world labelers are highly heterogeneous and usually anonymous. Consequently, relying solely on binary choice data fundamentally distorts the learned policy, making the true population-average preference unidentifiable. To overcome this critical limitation, we demonstrate that augmenting preference datasets with a simple, secondary signal -- the user's response time -- can restore the identifiability of the population's average preference. By modeling each decision as a Drift-Diffusion Model (DDM), we introduce a novel, consistent estimator of heterogeneous preferences that successfully corrects the distortions of standard choice-only labels. We prove that our estimator asymptotically converges to the true average preference even in extreme cases where each anonymous labeler contributes only a single choice. Empirically, across both synthetic and real-world datasets, our method consistently outperforms standard baselines that otherwise fail and plateau at a bias floor. Because response times are essentially free to record and require zero user tracking or identification, our results bring promises and open up new opportunities for future data-collection pipelines to improve the social benefit without requiring user-level identifiers or repeated elicitations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that augmenting binary preference labels with response times (RTs), modeled via a Drift-Diffusion Model (DDM) in which the drift rate encodes heterogeneous preference strength, yields a consistent estimator of the population-average preference. This holds asymptotically even when each anonymous labeler supplies only a single choice, because the empirical distribution of RTs can be inverted to recover the mean drift. The approach is shown to outperform choice-only baselines both on synthetic data generated from the assumed DDM and on real-world datasets where standard methods plateau at a bias floor.
Significance. If the consistency result holds under the stated DDM assumptions, the work provides a practical, zero-cost signal (response time) that restores identifiability of average preferences without user identifiers or repeated elicitations. This directly addresses a core limitation of current RLHF pipelines that treat heterogeneous labelers as exchangeable. The explicit proof of asymptotic convergence and the empirical demonstration of bias reduction are concrete strengths.
major comments (3)
- [Theoretical section / consistency proof] The consistency theorem (likely Theorem 1 or the main result in the theoretical section) requires that observed response times are generated exactly by the DDM with preference-encoded drift rates, constant non-decision time, and identical boundary separation and diffusion variance across all labelers. The manuscript should state these assumptions explicitly and provide a formal statement of the conditions under which the inversion recovers the true mean drift rather than a different functional of the RT distribution.
- [Experiments section] The empirical evaluation on synthetic data is generated from the same DDM used by the estimator, so it does not test robustness to misspecification. The real-world results therefore carry the risk that any performance gain is driven by partial satisfaction of the DDM assumptions rather than by the estimator's general properties; a sensitivity analysis or alternative RT-generating process should be added.
- [Estimator derivation] The paper asserts that the estimator is 'parameter-free' or requires no additional tuning beyond the DDM link, yet the DDM itself introduces three free parameters (drift rate, boundary separation, non-decision time) whose identifiability from single observations per labeler is not fully derived. The proof sketch should clarify how many of these are estimated from the same data versus fixed a priori.
minor comments (2)
- [Notation / preliminaries] Notation for the recovered preference distribution and the empirical RT histogram should be introduced earlier and used consistently; current presentation mixes population quantities with sample estimates without clear distinction.
- [Results] The abstract states that the method 'successfully corrects the distortions,' but the main text should quantify the remaining bias after correction rather than only reporting outperformance relative to baselines.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the theoretical clarity and empirical robustness of the work.
Point-by-point responses
Referee: [Theoretical section / consistency proof] The consistency theorem (likely Theorem 1 or the main result in the theoretical section) requires that observed response times are generated exactly by the DDM with preference-encoded drift rates, constant non-decision time, and identical boundary separation and diffusion variance across all labelers. The manuscript should state these assumptions explicitly and provide a formal statement of the conditions under which the inversion recovers the true mean drift rather than a different functional of the RT distribution.
Authors: We agree that the assumptions underlying the consistency result should be stated more explicitly. In the revised manuscript we will add a dedicated subsection in the theoretical section that enumerates all modeling assumptions (constant non-decision time, identical boundary separation and diffusion variance across labelers, and the DDM generative process). We will also restate Theorem 1 formally, specifying the precise conditions under which the inversion of the empirical RT distribution recovers the population-mean drift rate rather than another functional of the RT law. revision: yes
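For concreteness, one way the promised restatement might read (our paraphrase of the conditions the authors enumerate, not their final wording):

  A1 (generative model): conditional on v_i, the pair (z_i, t_i) follows the first-passage law of dX_t = v_i\,dt + \sigma\,dW_t started at 0 with absorbing boundaries \pm b, plus a constant non-decision time t_0.
  A2 (homogeneous nuisance parameters): b, \sigma, and t_0 are common to all labelers.
  A3 (sampling): v_1, \dots, v_n \overset{iid}{\sim} F with \mathbb{E}_F|v| < \infty, one observation per labeler.
  Conclusion: \hat{\mu}_n \xrightarrow{p} \mathbb{E}_F[v] as n \to \infty, i.e., the RT inversion recovers \mathbb{E}_F[v] rather than another functional of the RT law.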
Referee: [Experiments section] The empirical evaluation on synthetic data is generated from the same DDM used by the estimator, so it does not test robustness to misspecification. The real-world results therefore carry the risk that any performance gain is driven by partial satisfaction of the DDM assumptions rather than by the estimator's general properties; a sensitivity analysis or alternative RT-generating process should be added.
Authors: We acknowledge that the current synthetic experiments assume the exact DDM generative process. To address this limitation we will add, in the revised version, a new set of synthetic experiments that employ alternative response-time models (e.g., a race model and a linear ballistic accumulator) together with a sensitivity analysis that perturbs DDM parameters. These additions will help separate the estimator's performance from exact satisfaction of the modeling assumptions. revision: yes
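As a sketch of one such alternative generator, a minimal linear ballistic accumulator (LBA); the parameter values are illustrative and not drawn from the paper:

  import numpy as np

  rng = np.random.default_rng(1)

  def simulate_lba(v_plus, v_minus, chi=1.0, A=0.5, s=0.3):
      # Two deterministic racers: uniform start point in [0, A], trial-wise rate
      # drawn from N(mean rate, s); first to reach threshold chi wins the choice.
      starts = rng.uniform(0.0, A, size=2)
      rates = np.maximum(rng.normal([v_plus, v_minus], s), 1e-6)  # keep both racers finite
      times = (chi - starts) / rates
      k = int(np.argmin(times))
      return (1 if k == 0 else -1), times[k]

Feeding the DDM-based estimator data from this generator, and sweeping s and A, would separate gains intrinsic to the estimator from gains that depend on the exact DDM law.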
Referee: [Estimator derivation] The paper asserts that the estimator is 'parameter-free' or requires no additional tuning beyond the DDM link, yet the DDM itself introduces three free parameters (drift rate, boundary separation, non-decision time) whose identifiability from single observations per labeler is not fully derived. The proof sketch should clarify how many of these are estimated from the same data versus fixed a priori.
Authors: We appreciate the request for greater precision on parameter handling. The phrase 'parameter-free' in the manuscript refers to the absence of additional tuning hyperparameters beyond the DDM structure itself. In the revised derivation we will explicitly state that boundary separation and non-decision time are treated as fixed a priori (drawing on values from the cognitive-science literature or estimated from aggregate data), while the distribution of drift rates—and hence its mean—is recovered from the observed RT distribution. We will expand the proof sketch to show identifiability of the mean drift under these conditions even when each labeler contributes only a single observation. revision: partial
Circularity Check
No circularity: the consistency theorem rests on external DDM assumptions rather than a self-referential reduction.
Full rationale
The paper derives its estimator by positing that observed choices and response times are generated by a Drift-Diffusion Model whose drift rate encodes heterogeneous preference strength. It then proves asymptotic consistency to the population-average preference (even with one observation per anonymous labeler) by inverting the model's likelihood or moment conditions over the empirical RT distribution. This is a standard conditional modeling result and does not reduce the claimed convergence to a tautology or a fitted quantity defined by the paper's own data; the target parameter (mean drift) is recovered from an independently specified generative process. No self-citations are load-bearing, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation therefore retains independent mathematical content once the DDM assumptions are granted.
Axiom & Free-Parameter Ledger
free parameters (1)
- DDM parameters (drift rate, boundary separation, non-decision time)
axioms (1)
- domain assumption: Human binary choices and response times are generated by a drift-diffusion process whose drift rate encodes preference strength
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes: "By modeling each decision as a Drift-Diffusion Model (DDM)... the estimator μ̂ₙ = (1/n) Σᵢ zᵢ w_b(tᵢ), with w_b(t) defined via a ratio of series with c_k(b) = (2k+1)b and Laplace inversion of √(2s)/sinh(b√(2s))"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear: "Theorem 2... b̂ₙ := −log L̂ₙ(λₙ)/√(2λₙ) ... Richardson-extrapolated B̃ₙ = [log L̂ₙ(λₙ) − log L̂ₙ(4λₙ)]/√(2λₙ)"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear: "We assume... common boundary b... |V| ≤ M a.s."