How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Hanzhao Wang; Shang Liu; Xiaocheng Li; Zhongyao Ma

arXiv: 2502.06387 · v2 · submitted 2025-02-10 · 💻 cs.LG · cs.GT · econ.TH

Pith reviewed 2026-05-23 03:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.GT · econ.TH

keywords preference annotation · LLM alignment · principal-agent model · contract design · self-consistency monitoring · sample complexity · Fisher information · continuous effort

The pith

Linear contracts achieve a shortfall of Θ(1/(I n)) to the perfect-observation benchmark and are rate-optimal when annotator effort is continuous.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to monitor the quality of human preference annotators for LLM alignment and how to design incentives for them to exert high effort. It proposes self-consistency checks as a monitoring signal whose statistical reliability can be compared to expert review. It then embeds the resulting performance measure in a principal-agent contract model with continuous effort choices and derives the exact rates at which simple contracts approach the ideal benchmark of perfect observability. The analysis shows that linear contracts converge faster than binary ones and are optimal among all contracts in this continuous setting, reversing the known optimality of binary contracts in discrete effort models.

Core claim

Under continuous action space, the shortfall to the ideal benchmark scales as Θ(1/√(I n log n)) for binary contracts and Θ(1/(I n)) for linear contracts, where I is the Fisher information of the monitoring signal and n is the number of samples; linear contracts are rate-optimal among general contracts. This contrasts with the discrete-action result that binary contracts are optimal and achieve exponential convergence.
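To get a feel for how far apart the two stated rates are, the following minimal sketch evaluates both envelopes numerically. It assumes the hidden constants inside Θ(·) are equal to 1, which the source does not state, so only the trend across n is meaningful; the function names `binary_shortfall` and `linear_shortfall` are hypothetical.

```python
import math


def binary_shortfall(n: int, fisher_info: float) -> float:
    """Envelope 1/sqrt(I * n * log n); the Theta constant is assumed to be 1."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    return 1.0 / math.sqrt(fisher_info * n * math.log(n))


def linear_shortfall(n: int, fisher_info: float) -> float:
    """Envelope 1/(I * n); the Theta constant is assumed to be 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1.0 / (fisher_info * n)


if __name__ == "__main__":
    fisher_info = 1.0  # illustrative value, not taken from the paper
    for n in (10, 100, 1_000, 10_000):
        b = binary_shortfall(n, fisher_info)
        lin = linear_shortfall(n, fisher_info)
        print(f"n={n:>6}  binary~{b:.5f}  linear~{lin:.5f}  ratio~{b / lin:.1f}")
```

Under these assumptions the ratio of the two envelopes is √(I n / log n), so the advantage of the linear rate widens as n grows.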

What carries the argument

Principal-agent contract model in which the monitoring signal (self-consistency or expert review) supplies Fisher information I independent of contract form, used to bound the performance gap when annotator effort is chosen from a continuous interval.

If this is right

Self-consistency monitoring requires fewer inspected samples than expert review when annotators are heterogeneous and downstream model performance is noisy.
Linear contracts reach near-ideal performance with far fewer monitored samples than binary contracts once effort is continuous.
The optimal contract form depends on whether the underlying effort space is modeled as discrete or continuous.
A finite but explicit number of monitored samples suffices to make the contract performance arbitrarily close to the first-best benchmark.
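The last point can be illustrated by inverting the two rate envelopes to get an order-of-magnitude sample count for a target shortfall ε. This is only a sketch: it assumes the Θ(·) constants are 1 (not given in the source), and the helper names `samples_for_linear` and `samples_for_binary` are hypothetical.

```python
import math


def samples_for_linear(eps: float, fisher_info: float) -> int:
    """Smallest n with 1/(I * n) <= eps, assuming a unit constant."""
    return math.ceil(1.0 / (fisher_info * eps))


def samples_for_binary(eps: float, fisher_info: float) -> int:
    """Smallest n >= 2 with 1/sqrt(I * n * log n) <= eps, assuming a unit constant.

    Equivalent to n * log(n) >= 1 / (I * eps^2); n * log(n) is increasing
    for n >= 2, so a doubling search followed by bisection finds the threshold.
    """
    target = 1.0 / (fisher_info * eps ** 2)
    lo, hi = 2, 2
    while hi * math.log(hi) < target:
        hi *= 2
    while lo < hi:
        mid = (lo + hi) // 2
        if mid * math.log(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        print(eps, samples_for_linear(eps, 1.0), samples_for_binary(eps, 1.0))
```

With unit constants, the linear envelope needs on the order of 1/(I ε) monitored samples, while the binary envelope needs n log n on the order of 1/(I ε²).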

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Contract designers facing continuous effort should default to linear rather than threshold-based payments.
Self-consistency checks could replace expert review in large-scale preference datasets if the derived sample thresholds are met.
The same monitoring-plus-contract framework could be applied to other human feedback tasks such as instruction following or safety labeling.

Load-bearing premise

The annotator's effort choice lives in a continuous action space and the monitoring signal supplies Fisher information I that does not depend on the chosen contract.
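A small worked contrast may help show what this premise rules in and out. It is an illustration only, not the paper's model: the Gaussian and Bernoulli signal families, the noise scale σ, and the response curve p(a) are assumptions introduced here, with a denoting the effort level.

```latex
% Additive Gaussian noise around effort a:  X \sim \mathcal{N}(a, \sigma^2)
\mathcal{I}_{\mathrm{Gauss}}(a)
  = \mathbb{E}\!\left[\left(\frac{\partial}{\partial a}\log f(X; a)\right)^{2}\right]
  = \frac{1}{\sigma^{2}}

% Binary pass/fail signal:  X \sim \mathrm{Bernoulli}\big(p(a)\big)
\mathcal{I}_{\mathrm{Bern}}(a)
  = \frac{p'(a)^{2}}{p(a)\,\bigl(1 - p(a)\bigr)}
```

In the Gaussian case the information is the same at every effort level, so it cannot depend on which effort a contract induces. In the Bernoulli case it generally changes with a, and therefore with the contract, unless p(·) has a special shape.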

What would settle it

An empirical test that varies the number of monitored samples n while holding the monitoring signal's Fisher information fixed and measures whether the realized performance gap under linear versus binary contracts follows the predicted 1/(I n) versus 1/√(I n log n) scalings.
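One way such a test could be analyzed is to regress the log of the measured gap on log n and compare the fitted slope with the predicted exponents. The sketch below does this on synthetic gaps generated from the two formulas with multiplicative noise. The data are simulated purely for illustration, and the names `loglog_slope` and `synthetic_gaps`, the noise level, and the Fisher information value are all assumptions.

```python
import math
import random


def loglog_slope(ns, gaps):
    """Ordinary least-squares slope of log(gap) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def synthetic_gaps(ns, fisher_info, noise_sd=0.05, seed=0):
    """Simulated gaps following the two predicted rates (unit constants assumed)."""
    rng = random.Random(seed)
    linear = [math.exp(rng.gauss(0.0, noise_sd)) / (fisher_info * n) for n in ns]
    binary = [
        math.exp(rng.gauss(0.0, noise_sd)) / math.sqrt(fisher_info * n * math.log(n))
        for n in ns
    ]
    return linear, binary


if __name__ == "__main__":
    ns = [100, 200, 400, 800, 1600, 3200]
    linear, binary = synthetic_gaps(ns, fisher_info=2.0)
    print("linear slope:", round(loglog_slope(ns, linear), 3))
    print("binary slope:", round(loglog_slope(ns, binary), 3))
```

A slope near -1 would be consistent with the linear-contract rate. For the binary rate the slope should sit slightly below -1/2, because the log n factor adds a slowly varying correction.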

Figures

Figures reproduced from arXiv: 2502.06387 by Hanzhao Wang, Shang Liu, Xiaocheng Li, Zhongyao Ma.

**Figure 1.** How expert-based monitoring fails on real preference data. Upper four plots: histograms of $P(y_{\text{chosen}} \succ y_{\text{rejected}} \mid x)$, where $y_{\text{chosen}}$ and $y_{\text{rejected}}$ denote the chosen (preferred) and rejected responses, respectively. Lower four plots: the lower bound of the sum of two types of errors against the number of tested annotations $n$ at different $\eta_0$ with $\eta_1 = 1$ (see Proposition 3.1). The observations align with Proposit…

**Figure 2.** Comparison between self-consistency monitoring (upper bound) and expert-based monitoring (lower bound). For the sum of two types of errors, we plot the upper bound for self-consistency monitoring with various values of $\delta$ (blue, thick line) and the lower bound for expert-based monitoring (red, dashed line), evaluated at $\eta_0 \in \{0.8, 0.9\}$ and $\eta_1 = 1$ for two datasets. Even with a nontrivial disagreement probabi…

**Figure 3.** Normalized principal utility gap ($C - C_n$ and $C - \tilde{C}_n$) under different monitoring and contract settings. In these experiments, we set $U_0 = 0$, $\delta = 0.02$, $\mu(\eta) = \tfrac{1}{2}\eta^{4/5}$, $G_a(w_a) = 1 - \exp(-w_a)$, and $E(\eta) = 0.18\eta^{2}$ (see Appendix B.1.4 for further details and additional configurations). (i) The self-consistency monitoring consistently outperforms the expert-based monitoring given the same second-best formulatio…

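**Figure 4.** Illustration for Lemma A.8. (d) If $a \le (1 - p)p \le b$, then $\exp\!\left(-\frac{n-1}{a}(p - \tilde{p})^{2}\right) \le \frac{\frac{\partial}{\partial p} P(X_n(p) \ge k)}{\left.\frac{\partial}{\partial p} P(X_n(p) \ge k)\right|_{p = \tilde{p}}} \le \exp\!\left(-\frac{n-1}{b}(p - \tilde{p})^{2}\right)$, where $\tilde{p} = \frac{k-1}{n-1}$. In other words, the curve of $\frac{\partial}{\partial p} P(X_n(p) \ge k)$ is like a bell curve centered at $\tilde{p}$. (e) $\frac{\partial}{\partial p} P(X_n(p) \ge k)$ monotonically increases for $p < \tilde{p}$ and monotonically decreases for $p > \tilde{p}$. If $k = cn + O(1)$ for some $c \in (0, 1)$, then $\frac{\partial}{\partial p} P$…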

**Figure 5.** Calibration for two datasets. (Top row) Empirical preference probability $p(x, y_1, y_2)$ vs. the predicted probability before and after calibration. The dashed line ($x = y$) represents perfect alignment between predictions and empirical observations. (Bottom row) Histogram of the (predicted) preference probability $p(x, y_1, y_2)$ before and after calibration. We can see the calibration procedure improves alignmen…

**Figure 6.** Additional results for …

**Figure 7.** Additional results for …

**Figure 8.** Agent utility under the optimal solution, where we set the leisure utility $U_0 = 0$. For all datasets, monitoring methods, contract types, and second-best formulations, the resulting agent utility matches the leisure utility, i.e., the corresponding constraint is binding.

**Figure 9.** More principal utility gap results for …

Original abstract

Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we study two connected questions: how to monitor the quality of human preference annotators and how to incentivize them to provide high-quality annotations. In current practice, expert-based monitoring is a natural workhorse for quality control, but it performs poorly in preference annotation because annotators are heterogeneous and downstream model performance is an indirect and noisy proxy for annotation quality. We therefore propose a self-consistency monitoring scheme tailored to preference annotation, and analyze the statistical sample complexity of both methods. This practitioner-facing analysis identifies how many inspected samples are needed to reliably assess an annotator and shows when self-consistency monitoring can outperform expert-based monitoring. We then use the resulting monitoring signal as the performance measure in a principal-agent model, which lets us study a second sample-complexity question: how many monitored samples are needed before simple contracts perform close to the ideal benchmark in which annotation quality is perfectly observable. Under this continuous action space, we show that this shortfall scales as $\Theta(1/\sqrt{\mathcal{I} n \log n})$ for binary contracts and $\Theta(1/(\mathcal{I}n))$ for linear contracts, where $\mathcal{I}$ is the Fisher information and $n$ is the number of samples; we further show that the linear contracts are rate-optimal among general contracts. This contrasts with the known result that binary contracts are optimal and achieve a shortfall of $\exp(-\Theta(n))$ when the action space is discrete (Frick et al., 2023).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper derives new sample-complexity rates for linear versus binary contracts in a continuous-action principal-agent model for preference annotators, but the fixed-I assumption is load-bearing and questionable.

The main thing here is the extension of contract-theory sample complexity to continuous annotator effort. They get shortfall scaling as Θ(1/√(I n log n)) for binary contracts and the faster Θ(1/(I n)) for linear ones, plus a rate-optimality claim for linear contracts. This is positioned against the known discrete-action exp(-Θ(n)) result, so the continuous modeling choice drives the difference in rates.

The self-consistency monitoring scheme is a reasonable practical suggestion for when expert review is too noisy or expensive given heterogeneous annotators. The monitoring analysis itself looks like standard concentration plus Fisher information arguments, which is fine as far as it goes. The contract part applies the monitoring signal as the observable performance measure and works out the incentive shortfall from the first-best benchmark. That framing is clean and the rates are explicit, which is more than most applied papers deliver.

The soft spot is exactly the one flagged in the stress test. The analysis treats the Fisher information I of the monitoring signal as independent of the contract parameters. In reality the annotator's effort choice (which the contract shapes) will likely change the distribution of self-consistency outcomes or expert-review signals, so I becomes contract-dependent. That coupling would alter both the derived rates and the optimality conclusion for linear contracts. The paper does not appear to relax or sensitivity-check this. Continuous effort also feels like an idealization; real annotators probably face discrete effort levels or thresholds. The rates appear to come from standard tools with no circularity, which is good, but the full proofs would need checking for how they handle the continuous case.

This is useful for researchers who work on the economics of preference data collection or RLHF pipelines. It is theoretically grounded enough to merit referee time even if the independence assumption needs pressure in review.
I would send it out rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript studies quality monitoring for human preference annotators in LLM alignment, proposing a self-consistency scheme and deriving its sample complexity relative to expert review. It then embeds the resulting signal into a principal-agent model with continuous action space to analyze incentive contracts, establishing that the performance shortfall to the first-best benchmark scales as Θ(1/√(ℐ n log n)) for binary contracts and Θ(1/(ℐ n)) for linear contracts, with linear contracts rate-optimal among general contracts. This is contrasted with the exp(-Θ(n)) result known for discrete actions.

Significance. If the derivations hold, the work supplies explicit, quantitative guidance on the number of monitored samples needed for reliable annotator assessment and for contracts to approach ideal performance. The use of Fisher information to parameterize monitoring quality and the clean separation of rates by contract type provide a bridge between statistical learning theory and contract theory that is directly relevant to data pipelines for alignment. The continuous-action analysis and its contrast to the discrete case constitute a clear theoretical contribution.

major comments (1)

[Abstract and principal-agent model] The stated rates Θ(1/√(ℐ n log n)) and Θ(1/(ℐ n)) and the rate-optimality of linear contracts are derived under the assumption that the Fisher information ℐ of the monitoring signal (self-consistency or expert review) is fixed and independent of the contract parameters. Because the contract directly shapes the annotator’s effort choice, and effort can alter the distribution of the monitoring signal, ℐ is plausibly endogenous to the contract. This dependence would couple the monitoring and contracting analyses and change both the sample-complexity bounds and the optimality conclusion. The manuscript should either prove independence under its modeling assumptions or extend the analysis to the contract-dependent case.

minor comments (1)

[Abstract] Notation: ensure that the symbol ℐ is introduced with its precise definition (Fisher information of which random variable) at first use and used consistently thereafter.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the insightful comment on the principal-agent model. We respond to the major comment below.

Referee: [Abstract and principal-agent model] The stated rates Θ(1/√(ℐ n log n)) and Θ(1/(ℐ n)) and the rate-optimality of linear contracts are derived under the assumption that the Fisher information ℐ of the monitoring signal (self-consistency or expert review) is fixed and independent of the contract parameters. Because the contract directly shapes the annotator’s effort choice, and effort can alter the distribution of the monitoring signal, ℐ is plausibly endogenous to the contract. This dependence would couple the monitoring and contracting analyses and change both the sample-complexity bounds and the optimality conclusion. The manuscript should either prove independence under its modeling assumptions or extend the analysis to the contract-dependent case.

Authors: In the principal-agent model, the Fisher information ℐ is a fixed parameter of the monitoring technology (self-consistency or expert review) and is independent of the contract by construction. The contract influences the agent's effort choice, but the conditional distribution of the monitoring signal given effort is modeled with a noise structure whose Fisher information with respect to the action remains constant and does not depend on the chosen effort level or contract parameters. This is a standard modeling choice that separates the statistical monitoring analysis from the incentive design. We will add an explicit statement of this assumption and its implications in the principal-agent model section.

Revision: partial

Circularity Check

0 steps flagged

No circularity: scalings derived from standard Fisher-information concentration under stated assumptions

The paper derives the Θ(1/√(I n log n)) and Θ(1/(I n)) shortfall bounds, plus rate-optimality of linear contracts, from the continuous-action principal-agent model using Fisher information I of the monitoring signal and standard concentration arguments. These steps do not reduce to any fitted parameter defined by the paper itself, nor to a self-citation chain; the discrete-action contrast is imported via an external citation to Frick et al. (2023). The independence of I from contract design is an explicit modeling assumption, not a definitional tautology. No self-definitional, fitted-input, or ansatz-smuggling patterns appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on a statistical model of annotator responses that yields Fisher information I, standard concentration inequalities for sample complexity, and the principal-agent framework with continuous effort choice. No new entities are postulated.

axioms (2)

domain assumption: Annotator responses admit a parametric model whose Fisher information I governs the monitoring signal quality. Invoked when defining the monitoring schemes and deriving the rates.
standard math: Standard large-deviation and information-theoretic bounds apply to the estimation of annotator quality. Used to obtain the Θ(1/√(I n log n)) and Θ(1/(I n)) expressions.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Incentivizing High-Quality Human Annotations with Golden Questions
cs.GT · 2025-05 · unverdicted · novelty 7.0
The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

Users as Annotators: LLM Preference Learning from Comparison Mode
cs.CL · 2025-10 · unverdicted · novelty 5.0
Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[3] Acemoglu, Daron, Ali Makhdoumi, Azarakhsh Malekian, Asu Ozdaglar. 2022. Too much data: Prices and inefficiencies in data markets. American Economic Journal: Microeconomics 14 (4) 218–256
[4] Adida, Elodie, Fernanda Bravo. 2019. Contracts for healthcare referral services: Coordination via outcome-based penalty contracts. Management Science 65 (3) 1322–1341
[5] Agarwal, Anish, Munther Dahleh, Tuhin Sarkar. 2019. A marketplace for data: An algorithmic solution. Proceedings of the 2019 ACM Conference on Economics and Computation. 701–726
[6] Alon, Tal, Paul Dütting, Yingkai Li, Inbal Talgam-Cohen. 2022. Bayesian analysis of linear contracts. arXiv preprint arXiv:2211.06850
[7] Ananthakrishnan, Nivasini, Stephen Bates, Michael Jordan, Nika Haghtalab. 2024a. Delegating data collection in decentralized machine learning. International Conference on Artificial Intelligence and Statistics. PMLR, 478–486
[8] Ananthakrishnan, Nivasini, Nika Haghtalab, Chara Podimata, Kunhe Yang. 2024b. Is knowledge power? On the (im)possibility of learning from strategic interactions. The Thirty-eighth Annual Conference on Neural Information Processing Systems
[9] Artstein, Ron, Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics 34 (4) 555–596
[10] Askell, Amanda, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861
[11] Bacon, David F, Yiling Chen, Ian Kash, David C Parkes, Malvika Rao, Manu Sridharan. 2012. Predicting your own effort. Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2 (AAMAS). 695–702
[12] Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862
[13] Bareket, Dan, Reut Tsarfaty. 2021. Neural modeling for named entities and morphology (NEMO²). Transactions of the Association for Computational Linguistics 9 909–928
[14] Barron, Daniel, George Georgiadis, Jeroen Swinkels. 2020. Optimal contracts with a risk-taking agent. Theoretical Economics 15 (2) 715–761
[15] Bastan, Mohaddeseh, Mahnaz Koupaee, Youngseo Son, Richard Sicoli, Niranjan Balasubramanian. 2020. Author's sentiment prediction. arXiv preprint arXiv:2011.06128
[16] Bergemann, Dirk, Alessandro Bonatti. 2019. Markets for information: An introduction. Annual Review of Economics 11 (1) 85–107
[17] Boyd, Stephen. 2004. Convex optimization. Cambridge UP
[18] Bradley, Ralph Allan, Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39 (3/4) 324–345
[19] Bretagnolle, Jean, Catherine Huber. 1978. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg 12 342–363
[20] Cai, Yang, Constantinos Daskalakis, Christos Papadimitriou. 2015. Optimum statistical estimation with strategic data sources. Conference on Learning Theory. PMLR, 280–296
[21] Callison-Burch, Chris, Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. 1–12
[22] Carroll, Gabriel. 2015. Robustness and linear contracts. American Economic Review 105 (2) 536–563
[23] Chen, Junjie, Minming Li, Haifeng Xu. 2022. Selling data to a machine learner: Pricing via costly signaling. International Conference on Machine Learning. PMLR, 3336–3359
[24] Chowdhury, Sayak Ray, Anush Kini, Nagarajan Natarajan. 2024. Provably robust DPO: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409
[25] Collina, Natalie, Varun Gupta, Aaron Roth. 2024. Repeated contracting with multiple non-myopic agents: Policy regret and limited liability. Proceedings of the 25th ACM Conference on Economics and Computation. EC '24, Association for Computing Machinery, New York, NY, USA, 640–668. doi:10.1145/3670865.3673607
[26] Corbett, Charles J, Gregory A DeCroix, Albert Y Ha. 2005. Optimal shared-savings contracts in supply chains: Linear contracts and double moral hazard. European Journal of Operational Research 163 (3) 653–667
[27] Corbett, Charles J, Christopher S Tang. 1999. Designing supply contracts: Contract type and information asymmetry. Quantitative Models for Supply Chain Management 269–297
[28] Cui, Ganqu, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun. 2023. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377
[29] Dai, Josef, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. 2024. Safe RLHF: Safe reinforcement learning from human feedback. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=TyFrPOKYXw
[30] Dasgupta, Anirban, Arpita Ghosh. 2013. Crowdsourced judgement elicitation with endogenous proficiency. Proceedings of the 22nd International Conference on World Wide Web. 319–330
[31] de Zegher, Joann F, Dan A Iancu, Hau L Lee. 2019. Designing contracts and sourcing channels to create shared value. Manufacturing & Service Operations Management 21 (2) 271–289
[32] Dütting, Paul, Vahab Mirrokni, Renato Paes Leme, Haifeng Xu, Song Zuo. 2024. Mechanism design for large language models. Proceedings of the ACM on Web Conference 2024. 144–155
[33] Dütting, Paul, Michal Feldman, Inbal Talgam-Cohen, et al. 2024. Algorithmic contract theory: A survey. Foundations and Trends in Theoretical Computer Science 16 (3-4) 211–412
[34] Dütting, Paul, Tim Roughgarden, Inbal Talgam-Cohen. 2019. Simple versus optimal contracts. Proceedings of the 2019 ACM Conference on Economics and Computation. 369–387
[35] Dütting, Paul, Tim Roughgarden, Inbal Talgam-Cohen. 2021. The complexity of contracts. SIAM Journal on Computing 50 (1) 211–254
[36] Frick, Mira, Ryota Iijima, Yuhta Ishii. 2023. Monitoring with rich data. arXiv preprint arXiv:2312.16789
[37] Gao, Yang, Dana Alon, Donald Metzler. 2024. Impact of preference noise on the alignment performance of generative language models. arXiv preprint arXiv:2404.09824
[38] Georgiadis, George, Balazs Szentes. 2020. Optimal monitoring design. Econometrica 88 (5) 2075–2107
[39] Ghosal, Deepanway, Siqi Shen, Navonil Majumder, Rada Mihalcea, Soujanya Poria. 2022. CICERO: A dataset for contextualized commonsense inference in dialogues. arXiv preprint arXiv:2203.13926
[40] Goldwasser, Shafi, Guy N Rothblum, Jonathan Shafer, Amir Yehudayoff. 2021. Interactive proofs for verifying machine learning. 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik
[41] Grossman, Sanford J, Oliver D Hart. 1992. An analysis of the principal-agent problem. Foundations of Insurance Economics: Readings in Economics and Finance. Springer, 302–340
[42] Guo, Chuan, Geoff Pleiss, Yu Sun, Kilian Q Weinberger. 2017. On calibration of modern neural networks. International Conference on Machine Learning. PMLR, 1321–1330
[43] Hao, Shugang, Lingjie Duan. 2024. Online learning from strategic human feedback in LLM fine-tuning. arXiv preprint arXiv:2412.16834
[44] Harris, Keegan, Nicole Immorlica, Brendan Lucier, Aleksandrs Slivkins. 2023. Algorithmic persuasion through simulation: Information design in the age of generative AI. arXiv preprint arXiv:2311.18138
[45] Harris, Milton, Artur Raviv. 1979. Optimal incentive contracts with imperfect information. Journal of Economic Theory 20 (2) 231–259
[46] Herweg, Fabian, Daniel Müller, Philipp Weinschenk. 2010. Binary payment schemes: Moral hazard and loss aversion. American Economic Review 100 (5) 2451–2477
[47] Ho, Chien-Ju, Aleksandrs Slivkins, Jennifer Wortman Vaughan. 2014. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Proceedings of the Fifteenth ACM Conference on Economics and Computation. 359–376
[48] Holmström, Bengt. 1979. Moral hazard and observability. The Bell Journal of Economics 74–91
[49] Holmström, Bengt, Paul Milgrom. 1987. Aggregation and linearity in the provision of intertemporal incentives. Econometrica: Journal of the Econometric Society 303–328
[50] Ivanov, Dima, Paul Dütting, Inbal Talgam-Cohen, Tonghan Wang, David C Parkes. 2024. Principal-agent reinforcement learning: Orchestrating AI agents with contracts. arXiv preprint arXiv:2407.18074
[51] Jain, Nitish, Sameer Hasija, Dana G Popescu. 2013. Optimal contracts for outsourcing of repair and restoration services. Operations Research 61 (6) 1295–1311
[52] Jewitt, Ian. 2006. Information order in decision and agency problems
[53] Ji, Jiaming, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang. 2024. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. arXiv preprint arXiv:2406.15513
[54] Karlin, Samuel, Herman Rubin. 1956. The theory of decision procedures for distributions with monotone likelihood ratio. The Annals of Mathematical Statistics 272–299
[55] Kaufmann, Timo, Paul Weng, Viktor Bengs, Eyke Hüllermeier. 2023. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925
[56] Kim, Son Ku. 1995. Efficiency of an information system in an agency model. Econometrica: Journal of the Econometric Society 89–102
[57] Klie, Jan-Christoph, Richard Eckart de Castilho, Iryna Gurevych. 2024a. Analyzing dataset annotation quality management in the wild. Computational Linguistics 50 (3) 817–866
[58] Klie, Jan-Christoph, Juan Haladjian, Marc Kirchner, Rahul Nair. 2024b. On efficient and statistical quality estimation for data annotation. arXiv preprint arXiv:2405.11919
[59] Krippendorff, Klaus. 2004. Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research 30 (3) 411–433
[60] Krippendorff, Klaus, et al. 1989. Content analysis. International Encyclopedia of Communication 1 (1) 403–407
[61]

Laffont, Jean-Jacques, David Martimort. 2009. The theory of incentives: the principal-agent model. The theory of incentives\/ . Princeton university press

work page 2009
[62]

Lazear, Edward P, Paul Oyer. 2007. Personnel economics. Working Paper 13480, National Bureau of Economic Research. doi:10.3386/w13480. ://www.nber.org/papers/w13480

work page doi:10.3386/w13480 2007
[63]

Le Cam, Lucien. 2012. Asymptotic methods in statistical decision theory\/ . Springer Science & Business Media

work page 2012
[64]

Liang, Xize, Chao Chen, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jieping Ye. 2024. Robust preference optimization with provable noise tolerance for llms. arXiv preprint arXiv:2404.04102\/

work page arXiv 2024
[65]

Liao, JG, Arthur Berg. 2019. Sharpening jensen's inequality. The American Statistician\/

work page 2019
[66]

Liu, Chris Yuhao, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou. 2024 a . Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451\/

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

Liu, Jinsong, Dongdong Ge, Ruihao Zhu. 2024 b . Reward learning from preference with ties. arXiv preprint arXiv:2410.05328\/

work page arXiv 2024
[68]

Lopomo, Giuseppe, Luca Rigotti, Chris Shannon. 2011. Knightian uncertainty and moral hazard. Journal of Economic Theory\/ 146 (3) 1148--1172

work page 2011
[69]

Miller, Nolan, Paul Resnick, Richard Zeckhauser. 2005. Eliciting informative feedback: The peer-prediction method. Management Science\/ 51 (9) 1359--1373

[70]

Monarch, Robert Munro. 2021. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster

[71]

Moscarini, Giuseppe, Lones Smith. 2002. The law of large demand for information. Econometrica 70 (6) 2351--2366

[72]

Munos, Rémi, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. 2023. Nash learning from human feedback. arXiv preprint arXiv:2312.00886

[73]

Northcutt, Curtis, Lu Jiang, Isaac Chuang. 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70 1373--1411

[74]

Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 27730--27744

[75]

Polyanskiy, Yury, Yihong Wu. 2025. Information Theory: From Coding to Learning. Cambridge University Press

[76]

Pustejovsky, James, Amber Stubbs. 2012. Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media, Inc.

[77]

Qian, Kun, Ahmad Beirami, Zhouhan Lin, Ankita De, Alborz Geramifard, Zhou Yu, Chinnadhurai Sankar. 2021. Annotation inconsistency and entity bias in MultiWOZ. arXiv preprint arXiv:2105.14150

[78]

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36

[79]

Saig, Eden, Ohad Einav, Inbal Talgam-Cohen. 2024a. Incentivizing quality text generation via statistical contracts. The Thirty-eighth Annual Conference on Neural Information Processing Systems. openreview.net/forum?id=wZgw4CrxwK

[80]

Saig, Eden, Inbal Talgam-Cohen, Nir Rosenfeld. 2024b. Delegated classification. Advances in Neural Information Processing Systems 36


Showing first 80 references.


