
arXiv: 2502.06387 · v2 · submitted 2025-02-10 · 💻 cs.LG · cs.GT · econ.TH

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Pith reviewed 2026-05-23 03:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.GT · econ.TH
keywords preference annotation · LLM alignment · principal-agent model · contract design · self-consistency monitoring · sample complexity · Fisher information · continuous effort

The pith

When annotator effort is continuous, linear contracts fall short of the perfect-observation benchmark by Θ(1/(I n)) and are rate-optimal, where I is the Fisher information of the monitoring signal and n is the number of monitored samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to monitor the quality of human preference annotators for LLM alignment and how to design incentives for them to exert high effort. It proposes self-consistency checks as a monitoring signal whose statistical reliability can be compared to expert review. It then embeds the resulting performance measure in a principal-agent contract model with continuous effort choices and derives the exact rates at which simple contracts approach the ideal benchmark of perfect observability. The analysis shows that linear contracts converge faster than binary ones and are optimal among all contracts in this continuous setting, reversing the known optimality of binary contracts in discrete effort models.

Core claim

Under continuous action space, the shortfall to the ideal benchmark scales as Θ(1/√(I n log n)) for binary contracts and Θ(1/(I n)) for linear contracts, where I is the Fisher information of the monitoring signal and n is the number of samples; linear contracts are rate-optimal among general contracts. This contrasts with the discrete-action result that binary contracts are optimal and achieve exponential convergence.
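To get a feel for the gap between the two rates, the sketch below evaluates both expressions with every hidden constant set to 1. That normalization is an assumption made only for illustration: the Θ(·) statements fix the scaling, not the constants, so only the trend in n is meaningful. The function names and the value of I are hypothetical.

```python
import math


def binary_rate(fisher_info: float, n: int) -> float:
    """Binary-contract scaling 1/sqrt(I * n * log n); hidden constant assumed to be 1."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    return 1.0 / math.sqrt(fisher_info * n * math.log(n))


def linear_rate(fisher_info: float, n: int) -> float:
    """Linear-contract scaling 1/(I * n); hidden constant assumed to be 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1.0 / (fisher_info * n)


if __name__ == "__main__":
    fisher_info = 1.0  # assumed value, for illustration only
    for n in (10, 100, 1_000, 10_000):
        b = binary_rate(fisher_info, n)
        lin = linear_rate(fisher_info, n)
        # The ratio equals sqrt(I * n / log n), so it grows with n.
        print(f"n={n:>6}  binary~{b:.5f}  linear~{lin:.5f}  ratio={b / lin:.1f}")
```

Under these assumptions the ratio of the two expressions is √(I n / log n), which is one way to read the claim that linear contracts converge faster.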

What carries the argument

A principal-agent contract model in which the monitoring signal (self-consistency or expert review) supplies Fisher information I that is independent of the contract form. This I is then used to bound the performance gap when annotator effort is chosen from a continuous interval.

If this is right

  • Self-consistency monitoring requires fewer inspected samples than expert review when annotators are heterogeneous and downstream model performance is noisy.
  • Linear contracts reach near-ideal performance with far fewer monitored samples than binary contracts once effort is continuous.
  • The optimal contract form depends on whether the underlying effort space is modeled as discrete or continuous.
  • A finite but explicit number of monitored samples suffices to make the contract performance arbitrarily close to the first-best benchmark.
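The last bullet can be made concrete by inverting the two rate expressions: given a target shortfall ε, find the smallest n at which each expression drops below it. The sketch below does this by a doubling search followed by bisection. It assumes unit constants inside the Θ(·) terms, so the resulting sample counts are illustrative and are not the paper's thresholds; `samples_needed` is a hypothetical helper.

```python
import math


def linear_rate(fisher_info: float, n: int) -> float:
    """1/(I * n), with the hidden constant assumed to be 1."""
    return 1.0 / (fisher_info * n)


def binary_rate(fisher_info: float, n: int) -> float:
    """1/sqrt(I * n * log n), with the hidden constant assumed to be 1 (needs n >= 2)."""
    return 1.0 / math.sqrt(fisher_info * n * math.log(n))


def samples_needed(rate, fisher_info: float, eps: float, n_max: int = 10**12) -> int:
    """Smallest n >= 2 with rate(fisher_info, n) <= eps.

    Relies on the rate being decreasing in n for n >= 2,
    which holds for both formulas above.
    """
    lo, hi = 2, 2
    # Doubling phase: find some hi that already meets the target.
    while rate(fisher_info, hi) > eps:
        lo, hi = hi, hi * 2
        if hi > n_max:
            raise ValueError("target not reachable below n_max")
    # Bisection phase: shrink [lo, hi] to the first n meeting the target.
    while lo < hi:
        mid = (lo + hi) // 2
        if rate(fisher_info, mid) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return hi


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        n_lin = samples_needed(linear_rate, 1.0, eps)
        n_bin = samples_needed(binary_rate, 1.0, eps)
        print(f"eps={eps}: linear needs n={n_lin}, binary needs n={n_bin}")
```

Because the linear expression scales as 1/n and the binary one roughly as 1/√n, the required n grows roughly like 1/ε in the first case and like 1/ε² (up to the log factor) in the second.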

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Contract designers facing continuous effort should default to linear rather than threshold-based payments.
  • Self-consistency checks could replace expert review in large-scale preference datasets if the derived sample thresholds are met.
  • The same monitoring-plus-contract framework could be applied to other human feedback tasks such as instruction following or safety labeling.
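For readers unfamiliar with the two contract shapes compared above, they can be sketched as payment rules on a monitoring score. This is a minimal illustration, not the paper's formulation: the score (here, a fraction of monitored samples that pass a check), the parameter names, and the numeric values are all assumptions.

```python
def binary_contract(score: float, threshold: float, bonus: float, base: float = 0.0) -> float:
    """Threshold payment: the bonus is paid only if the score reaches the threshold."""
    return base + (bonus if score >= threshold else 0.0)


def linear_contract(score: float, slope: float, base: float = 0.0) -> float:
    """Linear payment: pay rises in proportion to the score."""
    return base + slope * score


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to make the shapes visible.
    for score in (0.70, 0.79, 0.80, 0.95):
        pay_binary = binary_contract(score, threshold=0.8, bonus=10.0)
        pay_linear = linear_contract(score, slope=10.0)
        print(f"score={score:.2f}  binary pay={pay_binary:5.2f}  linear pay={pay_linear:5.2f}")
```

The threshold rule pays the same amount for every score on one side of the cutoff, while the linear rule changes with every increment of the score. That difference in shape is what the bullets above are about; the code makes no claim about which parameters are optimal.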

Load-bearing premise

The annotator's effort choice lives in a continuous action space and the monitoring signal supplies Fisher information I that does not depend on the chosen contract.

What would settle it

An empirical test that varies the number of monitored samples n while holding the monitoring signal's Fisher information fixed and measures whether the realized performance gap under linear versus binary contracts follows the predicted 1/(I n) versus 1/√(I n log n) scalings.
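One way such a test could be analyzed is to regress log(gap) on log(n) and compare the fitted slope with the predicted exponents: about −1 for linear contracts, and a little steeper than −1/2 for binary contracts because of the log n factor. The sketch below uses synthetic gaps generated from the rate formulas themselves, with unit constants, purely to show the mechanics. No real measurements are involved, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(ns, gaps):
    """Least-squares slope of log(gap) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    fisher_info = 1.0  # held fixed, as the proposed test requires
    ns = [100, 200, 400, 800, 1600, 3200]
    # Synthetic gaps: the predicted scalings with constants assumed to be 1.
    linear_gaps = [1.0 / (fisher_info * n) for n in ns]
    binary_gaps = [1.0 / math.sqrt(fisher_info * n * math.log(n)) for n in ns]
    print("linear slope:", round(loglog_slope(ns, linear_gaps), 3))
    print("binary slope:", round(loglog_slope(ns, binary_gaps), 3))
```

With real data the gaps would be noisy estimates, so the fitted slopes would need confidence intervals before being compared with the predicted exponents.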

Figures

Figures reproduced from arXiv: 2502.06387 by Hanzhao Wang, Shang Liu, Xiaocheng Li, Zhongyao Ma.

Figure 1: How expert-based monitoring fails on real preference data. Upper four plots: histograms of $P(y_{\text{chosen}} \succ y_{\text{rejected}} \mid x)$, where $y_{\text{chosen}}$ and $y_{\text{rejected}}$ are the chosen (preferred) and rejected responses, respectively. Lower four plots: the lower bound on the sum of the two types of errors against the number of tested annotations $n$ at different $\eta_0$ with $\eta_1 = 1$ (see Proposition 3.1). The observations align with Proposit…
Figure 2: Comparison between self-consistency monitoring (upper bound) and expert-based monitoring (lower bound). For the sum of two types of errors, we plot the upper bound for self-consistency monitoring with various values of $\delta$ (blue, thick line) and the lower bound for expert-based monitoring (red, dashed line), evaluated at $\eta_0 \in \{0.8, 0.9\}$ and $\eta_1 = 1$ for two datasets. Even with a nontrivial disagreement probabi…
Figure 3: Normalized principal utility gap ($C - C_n$ and $C - \tilde{C}_n$) under different monitoring and contract settings. In these experiments, we set $U_0 = 0$, $\delta = 0.02$, $\mu(\eta) = \frac{1}{2}\eta^{4/5}$, $G_a(w_a) = 1 - \exp(-w_a)$, and $E(\eta) = 0.18\eta^2$ (see Appendix B.1.4 for further details and additional configurations). (i) The self-consistency monitoring consistently outperforms the expert-based monitoring given the same second-best formulatio…
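Figure 4: Illustration for Lemma A.8. (d) If $a \le (1-p)p \le b$, then $\exp\left(-\frac{n-1}{a}(p-\tilde{p})^2\right) \le \frac{\frac{\partial}{\partial p} P(X_n(p) \ge k)}{\frac{\partial}{\partial p} P(X_n(p) \ge k)\big|_{p=\tilde{p}}} \le \exp\left(-\frac{n-1}{b}(p-\tilde{p})^2\right)$, where $\tilde{p} = \frac{k-1}{n-1}$. In other words, the curve of $\frac{\partial}{\partial p} P(X_n(p) \ge k)$ is like a bell curve centered at $\tilde{p}$. (e) $\frac{\partial}{\partial p} P(X_n(p) \ge k)$ monotonically increases for $p < \tilde{p}$ and monotonically decreases for $p > \tilde{p}$. If $k = cn + O(1)$ for some $c \in (0, 1)$, then $\frac{\partial}{\partial p} P$…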
Figure 5: Calibration for two datasets. (Top row) Empirical preference probability $p(x, y_1, y_2)$ vs. the predicted probability before and after calibration. The dashed line ($x = y$) represents perfect alignment between predictions and empirical observations. (Bottom row) Histogram of the (predicted) preference probability $p(x, y_1, y_2)$ before and after calibration. We can see the calibration procedure improves alignmen…
Figure 6: Additional results for …
Figure 7: Additional results for …
Figure 8: Agent utility under the optimal solution, where we set the leisure utility $U_0 = 0$. For all datasets, monitoring methods, contract types, and second-best formulations, the resulting agent utility matches the leisure utility, i.e., the corresponding constraint is binding.
B.2 Examples for hard-to-choose responses. In the following, we present a few examples from HelpSteer (Wang et al., 2023) for which we think it…
Figure 9: More principal utility gap results for …
Original abstract

Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we study two connected questions: how to monitor the quality of human preference annotators and how to incentivize them to provide high-quality annotations. In current practice, expert-based monitoring is a natural workhorse for quality control, but it performs poorly in preference annotation because annotators are heterogeneous and downstream model performance is an indirect and noisy proxy for annotation quality. We therefore propose a self-consistency monitoring scheme tailored to preference annotation, and analyze the statistical sample complexity of both methods. This practitioner-facing analysis identifies how many inspected samples are needed to reliably assess an annotator and shows when self-consistency monitoring can outperform expert-based monitoring. We then use the resulting monitoring signal as the performance measure in a principal-agent model, which lets us study a second sample-complexity question: how many monitored samples are needed before simple contracts perform close to the ideal benchmark in which annotation quality is perfectly observable. Under this continuous action space, we show that this shortfall scales as $\Theta(1/\sqrt{\mathcal{I} n \log n})$ for binary contracts and $\Theta(1/(\mathcal{I}n))$ for linear contracts, where $\mathcal{I}$ is the Fisher information and $n$ is the number of samples; we further show that the linear contracts are rate-optimal among general contracts. This contrasts with the known result that binary contracts are optimal and achieve a shortfall of $\exp(-\Theta(n))$ when the action space is discrete (Frick et al., 2023).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript studies quality monitoring for human preference annotators in LLM alignment, proposing a self-consistency scheme and deriving its sample complexity relative to expert review. It then embeds the resulting signal into a principal-agent model with continuous action space to analyze incentive contracts, establishing that the performance shortfall to the first-best benchmark scales as Θ(1/√(ℐ n log n)) for binary contracts and Θ(1/(ℐ n)) for linear contracts, with linear contracts rate-optimal among general contracts. This is contrasted with the exp(-Θ(n)) result known for discrete actions.

Significance. If the derivations hold, the work supplies explicit, quantitative guidance on the number of monitored samples needed for reliable annotator assessment and for contracts to approach ideal performance. The use of Fisher information to parameterize monitoring quality and the clean separation of rates by contract type provide a bridge between statistical learning theory and contract theory that is directly relevant to data pipelines for alignment. The continuous-action analysis and its contrast to the discrete case constitute a clear theoretical contribution.

major comments (1)
  1. [Abstract and principal-agent model] Abstract and principal-agent model section: the stated rates Θ(1/√(ℐ n log n)) and Θ(1/(ℐ n)) and the rate-optimality of linear contracts are derived under the assumption that the Fisher information ℐ of the monitoring signal (self-consistency or expert review) is fixed and independent of the contract parameters. Because the contract directly shapes the annotator’s effort choice, and effort can alter the distribution of the monitoring signal, ℐ is plausibly endogenous to the contract. This dependence would couple the monitoring and contracting analyses and change both the sample-complexity bounds and the optimality conclusion. The manuscript should either prove independence under its modeling assumptions or extend the analysis to the contract-dependent case.
minor comments (1)
  1. [Abstract] Notation: ensure that the symbol ℐ is introduced with its precise definition (Fisher information of which random variable) at first use and used consistently thereafter.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the insightful comment on the principal-agent model. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract and principal-agent model] Abstract and principal-agent model section: the stated rates Θ(1/√(ℐ n log n)) and Θ(1/(ℐ n)) and the rate-optimality of linear contracts are derived under the assumption that the Fisher information ℐ of the monitoring signal (self-consistency or expert review) is fixed and independent of the contract parameters. Because the contract directly shapes the annotator’s effort choice, and effort can alter the distribution of the monitoring signal, ℐ is plausibly endogenous to the contract. This dependence would couple the monitoring and contracting analyses and change both the sample-complexity bounds and the optimality conclusion. The manuscript should either prove independence under its modeling assumptions or extend the analysis to the contract-dependent case.

    Authors: In the principal-agent model, the Fisher information ℐ is a fixed parameter of the monitoring technology (self-consistency or expert review) and is independent of the contract by construction. The contract influences the agent's effort choice, but the conditional distribution of the monitoring signal given effort is modeled with a noise structure whose Fisher information with respect to the action remains constant and does not depend on the chosen effort level or contract parameters. This is a standard modeling choice that separates the statistical monitoring analysis from the incentive design. We will add an explicit statement of this assumption and its implications in the principal-agent model section. revision: partial
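The constancy claimed in this response holds for some signal families and fails for others, which is why stating it as an explicit assumption matters. As a textbook illustration (not the paper's model): a Gaussian signal whose mean equals effort, with fixed variance σ², carries Fisher information 1/σ² at every effort level, while a single Bernoulli draw whose success probability p equals effort carries Fisher information 1/(p(1−p)), which changes with p. The function names below are hypothetical.

```python
def fisher_gaussian_mean(effort: float, sigma: float) -> float:
    """Fisher information about the mean of N(effort, sigma^2).

    Equals 1/sigma^2 and does not depend on the effort level.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return 1.0 / sigma ** 2


def fisher_bernoulli(p: float) -> float:
    """Fisher information about p from one Bernoulli(p) draw: 1/(p(1-p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 1.0 / (p * (1.0 - p))


if __name__ == "__main__":
    for effort in (0.5, 0.7, 0.9):
        print(
            f"effort={effort:.1f}  "
            f"Gaussian I={fisher_gaussian_mean(effort, sigma=0.5):.2f}  "
            f"Bernoulli I={fisher_bernoulli(effort):.2f}"
        )
```

In the Gaussian case the contract cannot move I by moving effort; in the Bernoulli case it can, which is the coupling the referee comment describes.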

Circularity Check

0 steps flagged

No circularity: scalings derived from standard Fisher-information concentration under stated assumptions

Full rationale

The paper derives the Θ(1/√(I n log n)) and Θ(1/(I n)) shortfall bounds, plus rate-optimality of linear contracts, from the continuous-action principal-agent model using Fisher information I of the monitoring signal and standard concentration arguments. These steps do not reduce to any fitted parameter defined by the paper itself, nor to a self-citation chain; the discrete-action contrast is imported via external citation to Frick et al. (2023). The independence of I from contract design is an explicit modeling assumption, not a definitional tautology. No self-definitional, fitted-input, or ansatz-smuggling patterns appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on a statistical model of annotator responses that yields Fisher information I, standard concentration inequalities for sample complexity, and the principal-agent framework with continuous effort choice. No new entities are postulated.

axioms (2)
  • domain assumption: Annotator responses admit a parametric model whose Fisher information I governs the monitoring signal quality.
    Invoked when defining the monitoring schemes and deriving the rates.
  • standard math: Standard large-deviation and information-theoretic bounds apply to the estimation of annotator quality.
    Used to obtain the Θ(1/√(I n log n)) and Θ(1/(I n)) expressions.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Incentivizing High-Quality Human Annotations with Golden Questions

    cs.GT · 2025-05 · unverdicted · novelty 7.0

    The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

  2. Users as Annotators: LLM Preference Learning from Comparison Mode

    cs.CL · 2025-10 · unverdicted · novelty 5.0

    Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 2 Pith papers · 6 internal anchors


  3. [3]

    Acemoglu, Daron, Ali Makhdoumi, Azarakhsh Malekian, Asu Ozdaglar. 2022. Too much data: Prices and inefficiencies in data markets. American Economic Journal: Microeconomics 14 (4) 218–256

  4. [4]

    Adida, Elodie, Fernanda Bravo. 2019. Contracts for healthcare referral services: Coordination via outcome-based penalty contracts. Management Science 65 (3) 1322–1341

  5. [5]

    Agarwal, Anish, Munther Dahleh, Tuhin Sarkar. 2019. A marketplace for data: An algorithmic solution. Proceedings of the 2019 ACM Conference on Economics and Computation. 701–726

  6. [6]

    Alon, Tal, Paul Dütting, Yingkai Li, Inbal Talgam-Cohen. 2022. Bayesian analysis of linear contracts. arXiv preprint arXiv:2211.06850

  7. [7]

    Ananthakrishnan, Nivasini, Stephen Bates, Michael Jordan, Nika Haghtalab. 2024a. Delegating data collection in decentralized machine learning. International Conference on Artificial Intelligence and Statistics. PMLR, 478–486

  8. [8]

    Ananthakrishnan, Nivasini, Nika Haghtalab, Chara Podimata, Kunhe Yang. 2024b. Is knowledge power? on the (im)possibility of learning from strategic interactions. The Thirty-eighth Annual Conference on Neural Information Processing Systems

  9. [9]

    Artstein, Ron, Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational linguistics 34 (4) 555–596

  10. [10]

    Askell, Amanda, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861

  11. [11]

    Bacon, David F, Yiling Chen, Ian Kash, David C Parkes, Malvika Rao, Manu Sridharan. 2012. Predicting your own effort. Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2 (AAMAS). 695–702

  12. [12]

    Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

  13. [13]

    Bareket, Dan, Reut Tsarfaty. 2021. Neural modeling for named entities and morphology (nemo^2). Transactions of the Association for Computational Linguistics 9 909–928

  14. [14]

    Barron, Daniel, George Georgiadis, Jeroen Swinkels. 2020. Optimal contracts with a risk-taking agent. Theoretical Economics 15 (2) 715–761

  15. [15]

    Bastan, Mohaddeseh, Mahnaz Koupaee, Youngseo Son, Richard Sicoli, Niranjan Balasubramanian. 2020. Author's sentiment prediction. arXiv preprint arXiv:2011.06128

  16. [16]

    Bergemann, Dirk, Alessandro Bonatti. 2019. Markets for information: An introduction. Annual Review of Economics 11 (1) 85–107

  17. [17]

    Boyd, Stephen. 2004. Convex optimization. Cambridge UP

  18. [18]

    Bradley, Ralph Allan, Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39 (3/4) 324–345

  19. [19]

    Bretagnolle, Jean, Catherine Huber. 1978. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg 12 342–363

  20. [20]

    Cai, Yang, Constantinos Daskalakis, Christos Papadimitriou. 2015. Optimum statistical estimation with strategic data sources. Conference on Learning Theory. PMLR, 280–296

  21. [21]

    Callison-Burch, Chris, Mark Dredze. 2010. Creating speech and language data with amazon’s mechanical turk. Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk. 1–12

  22. [22]

    Carroll, Gabriel. 2015. Robustness and linear contracts. American Economic Review 105 (2) 536–563

  23. [23]

    Chen, Junjie, Minming Li, Haifeng Xu. 2022. Selling data to a machine learner: Pricing via costly signaling. International Conference on Machine Learning. PMLR, 3336–3359

  24. [24]

    Chowdhury, Sayak Ray, Anush Kini, Nagarajan Natarajan. 2024. Provably robust dpo: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409

  25. [25]

    Collina, Natalie, Varun Gupta, Aaron Roth. 2024. Repeated contracting with multiple non-myopic agents: Policy regret and limited liability. Proceedings of the 25th ACM Conference on Economics and Computation. EC '24, Association for Computing Machinery, New York, NY, USA, 640–668. doi:10.1145/3670865.3673607. https://doi.org/10.1145/3670865.3673607

  26. [26]

    Corbett, Charles J, Gregory A DeCroix, Albert Y Ha. 2005. Optimal shared-savings contracts in supply chains: Linear contracts and double moral hazard. European journal of operational research 163 (3) 653–667

  27. [27]

    Corbett, Charles J, Christopher S Tang. 1999. Designing supply contracts: Contract type and information asymmetry. Quantitative models for supply chain management 269–297

  28. [28]

    Cui, Ganqu, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377

  29. [29]

    Dai, Josef, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. 2024. Safe RLHF: Safe reinforcement learning from human feedback. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=TyFrPOKYXw

  30. [30]

    Dasgupta, Anirban, Arpita Ghosh. 2013. Crowdsourced judgement elicitation with endogenous proficiency. Proceedings of the 22nd international conference on World Wide Web. 319–330

  31. [31]

    de Zegher, Joann F, Dan A Iancu, Hau L Lee. 2019. Designing contracts and sourcing channels to create shared value. Manufacturing & Service Operations Management 21 (2) 271–289

  32. [32]

    Duetting, Paul, Vahab Mirrokni, Renato Paes Leme, Haifeng Xu, Song Zuo. 2024. Mechanism design for large language models. Proceedings of the ACM on Web Conference 2024. 144–155

  33. [33]

    Dütting, Paul, Michal Feldman, Inbal Talgam-Cohen, et al. 2024. Algorithmic contract theory: A survey. Foundations and Trends in Theoretical Computer Science 16 (3-4) 211–412

  34. [34]

    Dütting, Paul, Tim Roughgarden, Inbal Talgam-Cohen. 2019. Simple versus optimal contracts. Proceedings of the 2019 ACM Conference on Economics and Computation. 369–387

  35. [35]

    Dutting, Paul, Tim Roughgarden, Inbal Talgam-Cohen. 2021. The complexity of contracts. SIAM Journal on Computing 50 (1) 211–254

  36. [36]

    Frick, Mira, Ryota Iijima, Yuhta Ishii. 2023. Monitoring with rich data. arXiv preprint arXiv:2312.16789

  37. [37]

    Gao, Yang, Dana Alon, Donald Metzler. 2024. Impact of preference noise on the alignment performance of generative language models. arXiv preprint arXiv:2404.09824

  38. [38]

    Georgiadis, George, Balazs Szentes. 2020. Optimal monitoring design. Econometrica 88 (5) 2075–2107

  39. [39]

    Ghosal, Deepanway, Siqi Shen, Navonil Majumder, Rada Mihalcea, Soujanya Poria. 2022. Cicero: A dataset for contextualized commonsense inference in dialogues. arXiv preprint arXiv:2203.13926

  40. [40]

    Goldwasser, Shafi, Guy N Rothblum, Jonathan Shafer, Amir Yehudayoff. 2021. Interactive proofs for verifying machine learning. 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Schloss-Dagstuhl-Leibniz Zentrum für Informatik

  41. [41]

    Grossman, Sanford J, Oliver D Hart. 1992. An analysis of the principal-agent problem. Foundations of Insurance Economics: Readings in Economics and Finance. Springer, 302–340

  42. [42]

    Guo, Chuan, Geoff Pleiss, Yu Sun, Kilian Q Weinberger. 2017. On calibration of modern neural networks. International conference on machine learning. PMLR, 1321–1330

  43. [43]

    Hao, Shugang, Lingjie Duan. 2024. Online learning from strategic human feedback in llm fine-tuning. arXiv preprint arXiv:2412.16834

  44. [44]

    Harris, Keegan, Nicole Immorlica, Brendan Lucier, Aleksandrs Slivkins. 2023. Algorithmic persuasion through simulation: Information design in the age of generative ai. arXiv preprint arXiv:2311.18138

  45. [45]

    Harris, Milton, Artur Raviv. 1979. Optimal incentive contracts with imperfect information. Journal of economic theory 20 (2) 231–259

  46. [46]

    Herweg, Fabian, Daniel Müller, Philipp Weinschenk. 2010. Binary payment schemes: Moral hazard and loss aversion. American Economic Review 100 (5) 2451–2477

  47. [47]

    Ho, Chien-Ju, Aleksandrs Slivkins, Jennifer Wortman Vaughan. 2014. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Proceedings of the fifteenth ACM conference on Economics and computation. 359–376

  48. [48]

    Holmström, Bengt. 1979. Moral hazard and observability. The Bell journal of economics 74–91

  49. [49]

    Holmstrom, Bengt, Paul Milgrom. 1987. Aggregation and linearity in the provision of intertemporal incentives. Econometrica: Journal of the Econometric Society 303–328

  50. [50]

    Ivanov, Dima, Paul Dütting, Inbal Talgam-Cohen, Tonghan Wang, David C Parkes. 2024. Principal-agent reinforcement learning: Orchestrating ai agents with contracts. arXiv preprint arXiv:2407.18074

  51. [51]

    Jain, Nitish, Sameer Hasija, Dana G Popescu. 2013. Optimal contracts for outsourcing of repair and restoration services. Operations Research 61 (6) 1295–1311

  52. [52]

    Jewitt, Ian. 2006. Information order in decision and agency problems

  53. [53]

    Ji, Jiaming, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang. 2024. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. arXiv preprint arXiv:2406.15513

  54. [54]

    Karlin, Samuel, Herman Rubin. 1956. The theory of decision procedures for distributions with monotone likelihood ratio. The Annals of Mathematical Statistics 272–299

  55. [55]

    Kaufmann, Timo, Paul Weng, Viktor Bengs, Eyke Hüllermeier. 2023. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925

  56. [56]

    Kim, Son Ku. 1995. Efficiency of an information system in an agency model. Econometrica: Journal of the Econometric Society 89–102

  57. [57]

    Klie, Jan-Christoph, Richard Eckart de Castilho, Iryna Gurevych. 2024a. Analyzing dataset annotation quality management in the wild. Computational Linguistics 50 (3) 817–866

  58. [58]

    Klie, Jan-Christoph, Juan Haladjian, Marc Kirchner, Rahul Nair. 2024b. On efficient and statistical quality estimation for data annotation. arXiv preprint arXiv:2405.11919

  59. [59]

    Krippendorff, Klaus. 2004. Reliability in content analysis: Some common misconceptions and recommendations. Human communication research 30 (3) 411–433

  60. [60]

    Krippendorff, Klaus, et al. 1989. Content analysis. International encyclopedia of communication 1 (1) 403–407

  61. [61]

    Laffont, Jean-Jacques, David Martimort. 2009. The theory of incentives: the principal-agent model. The theory of incentives. Princeton university press

  62. [62]

    Lazear, Edward P, Paul Oyer. 2007. Personnel economics. Working Paper 13480, National Bureau of Economic Research. doi:10.3386/w13480. https://www.nber.org/papers/w13480

  63. [63]

    Le Cam, Lucien. 2012. Asymptotic methods in statistical decision theory. Springer Science & Business Media

  64. [64]

    Liang, Xize, Chao Chen, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jieping Ye. 2024. Robust preference optimization with provable noise tolerance for llms. arXiv preprint arXiv:2404.04102

  65. [65]

    Liao, JG, Arthur Berg. 2019. Sharpening jensen's inequality. The American Statistician

  66. [66]

    Liu, Chris Yuhao, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou. 2024a. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451

  67. [67]

    Liu, Jinsong, Dongdong Ge, Ruihao Zhu. 2024b. Reward learning from preference with ties. arXiv preprint arXiv:2410.05328

  68. [68]

    Lopomo, Giuseppe, Luca Rigotti, Chris Shannon. 2011. Knightian uncertainty and moral hazard. Journal of Economic Theory 146 (3) 1148–1172

  69. [69]

    Miller, Nolan, Paul Resnick, Richard Zeckhauser. 2005. Eliciting informative feedback: The peer-prediction method. Management Science 51 (9) 1359–1373

  70. [70]

    Monarch, Robert Munro. 2021. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster

  71. [71]

    Moscarini, Giuseppe, Lones Smith. 2002. The law of large demand for information. Econometrica 70 (6) 2351–2366

  72. [72]

    Munos, Rémi, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. 2023. Nash learning from human feedback. arXiv preprint arXiv:2312.00886

  73. [73]

    Northcutt, Curtis, Lu Jiang, Isaac Chuang. 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70 1373–1411

  74. [74]

    Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 27730–27744

  75. [75]

    Polyanskiy, Yury, Yihong Wu. 2025. Information Theory: From Coding to Learning. Cambridge University Press

  76. [76]

    Pustejovsky, James, Amber Stubbs. 2012. Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media, Inc.

  77. [77]

    Qian, Kun, Ahmad Beirami, Zhouhan Lin, Ankita De, Alborz Geramifard, Zhou Yu, Chinnadhurai Sankar. 2021. Annotation inconsistency and entity bias in multiwoz. arXiv preprint arXiv:2105.14150

  78. [78]

    Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36

  79. [79]

    Saig, Eden, Ohad Einav, Inbal Talgam-Cohen. 2024a. Incentivizing quality text generation via statistical contracts. The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=wZgw4CrxwK

  80. [80]

    Saig, Eden, Inbal Talgam-Cohen, Nir Rosenfeld. 2024b. Delegated classification. Advances in Neural Information Processing Systems 36

Showing first 80 references.