Adaptive Budget Allocation in LLM-Augmented Surveys
Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3
The pith
An adaptive algorithm allocates limited human verification budget to LLM survey questions by learning per-question reliability in real time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%.
What carries the argument
The online adaptive allocation rule that estimates LLM reliability from accumulating human labels and reallocates the remaining budget proportionally to current error estimates.
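The review does not reproduce the paper's algorithm, but the rule it describes can be sketched roughly as follows (all names and the plug-in error estimator are hypothetical; the paper's exact update and tie-breaking may differ):

```python
import numpy as np

def adaptive_allocate(llm_correct, budget, rng=None):
    """Spend a human-label budget one label at a time.

    llm_correct: (Q, R) 0/1 matrix; llm_correct[q, r] = 1 iff the LLM
    matches respondent r's answer on question q (a simulator stand-in
    for collecting a real human label).  Returns per-question counts.
    """
    rng = np.random.default_rng(rng)
    Q, R = llm_correct.shape
    counts = np.zeros(Q, dtype=int)   # labels spent per question
    errors = np.zeros(Q)              # observed LLM mistakes per question
    # Seed one label per question so every error estimate is defined.
    for q in range(Q):
        r = rng.integers(R)
        counts[q] += 1
        errors[q] += 1 - llm_correct[q, r]
    for _ in range(budget - Q):
        # Plug-in error rate with a small prior to keep exploration alive.
        err_hat = (errors + 1.0) / (counts + 2.0)
        # Greedy step toward allocation proportional to estimated error:
        # label the question furthest below its proportional share.
        target = err_hat / err_hat.sum()
        q = int(np.argmax(target - counts / counts.sum()))
        r = rng.integers(R)
        counts[q] += 1
        errors[q] += 1 - llm_correct[q, r]
    return counts
```

On a toy instance where the LLM is always wrong on one question and always right on another, the rule ends up spending most of the budget on the unreliable question, which is the qualitative behavior the review describes.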
Load-bearing premise
Human labels collected in real time can be used both to correct responses and to learn question-specific LLM reliability without any prior information or separate calibration phase.
What would settle it
Replicating on independent survey data whether the waste stays in the 2-6% range and continues to shrink toward zero as the budget grows would confirm or refute the performance claims.
read the original abstract
Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
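The abstract's notion of budget "waste" is not defined in the text shown here, but one plausible reading illustrates why the gap grows with heterogeneity. The sketch below assumes a hypothetical Neyman-allocation-style metric in which the total variance of the corrected estimates is sum_q sigma_q^2 / n_q, the oracle allocates n_q proportional to sigma_q, and waste is the fraction of variance reduction the uniform allocation gives up:

```python
import numpy as np

def uniform_waste(sigma):
    """Fraction of a uniform budget wasted relative to a Neyman-style
    oracle allocation n_q proportional to sigma_q, when total variance
    is sum(sigma_q^2 / n_q).  Hypothetical metric, used only to show
    how heterogeneity drives the gap; zero iff all sigma_q are equal.
    """
    sigma = np.asarray(sigma, dtype=float)
    Q = sigma.size
    oracle_var = sigma.sum() ** 2          # times 1/B: (sum sigma)^2 / B
    uniform_var = Q * (sigma ** 2).sum()   # times 1/B: Q * sum sigma^2 / B
    return 1.0 - oracle_var / uniform_var  # >= 0 by Cauchy-Schwarz

uniform_waste([1.0, 1.0, 1.0, 1.0])   # homogeneous questions: no waste
uniform_waste([0.1, 0.1, 0.1, 2.0])   # heterogeneous: large waste
```

Under this (assumed) metric, identical sigma_q give zero waste and spreading them apart drives waste up, matching the abstract's claim that the advantage grows as questions become more heterogeneous in LLM prediction quality.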
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an adaptive algorithm for allocating a fixed human-labeling budget across survey questions when LLMs generate initial responses whose per-question reliability is unknown a priori. Each human label is used simultaneously to correct the LLM output and to update an online estimate of that question's LLM accuracy; the algorithm then directs future labels toward questions where the LLM is least reliable. The central claims are (i) a proof that the allocation gap relative to the optimal (oracle) allocation vanishes asymptotically as the total budget B grows, and (ii) empirical results on a 68-question real survey showing that the method reduces budget waste from 10-12% (uniform allocation) to 2-6% while achieving equivalent estimation quality with fewer human samples and without a pilot study.
Significance. If the asymptotic guarantee and the reported waste reductions hold under the stated assumptions, the work offers a practical, theoretically grounded method for efficient hybrid LLM-human data collection. The dual-use of each label, the absence of any pre-calibration phase, and the validation on real heterogeneous survey data are concrete strengths that distinguish the contribution from purely heuristic or offline allocation schemes.
major comments (2)
- [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).
- [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.
minor comments (3)
- [§2.1] The definition of the per-question reliability parameter r_q is introduced only after the algorithm description; moving the formal definition and its relation to the loss function earlier would improve readability.
- [Figure 4] The synthetic-data plots show cumulative waste versus B, but the x-axis is not labeled with the total budget scale used in the theorem statements, making direct comparison to the O(1/sqrt(B)) rate difficult.
- [§6] The discussion of limitations mentions only computational cost; a brief note on the sensitivity of the method to the choice of the exploration parameter epsilon would be useful for practitioners.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We appreciate the positive assessment of our work and the specific suggestions for improvement. Below we provide point-by-point responses to the major comments, indicating the revisions we plan to incorporate.
read point-by-point responses
- Referee: [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).
Authors: We thank the referee for highlighting this subtlety in the proof. The proof of Theorem 1 in §4.2 indeed uses a martingale concentration bound (specifically, a version of Freedman's inequality or similar) that is designed to handle adaptive sampling where the allocation at step t depends on previous observations. The key is that the difference sequence is a martingale with respect to the filtration generated by the history up to t-1, and the bounded differences property holds conditionally. However, we acknowledge that the current sketch is brief and does not spell this out explicitly. In the revised version, we will expand the proof to include a detailed explanation of how the adaptive nature is accounted for in the martingale framework, ensuring the concentration inequality applies directly to bound the allocation gap. This will make the argument fully rigorous without altering the result. revision: yes
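For reference, one standard form of the Freedman inequality the rebuttal invokes is given below; the paper's exact variant is not shown in this review.

```latex
% Freedman's inequality (one standard form).  Let (X_t)_{t \ge 1} be a
% martingale difference sequence adapted to a filtration (\mathcal{F}_t),
% with |X_t| \le R almost surely, and predictable quadratic variation
% V_T = \sum_{t=1}^{T} \mathbb{E}\!\left[X_t^2 \mid \mathcal{F}_{t-1}\right].
% Then for all a, v > 0,
\Pr\!\left[\, \sum_{t=1}^{T} X_t \ge a \ \text{ and }\ V_T \le v \,\right]
\;\le\; \exp\!\left( - \frac{a^2}{2\left(v + R a / 3\right)} \right).
```

Because the conditioning is on the history up to t-1, the bound tolerates exactly the kind of adaptivity the referee raises: the allocation at step t may depend on all earlier observations.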
- Referee: [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.
Authors: We agree that including measures of statistical variability would improve the presentation of the empirical results. The waste percentages in Table 3 are computed from the real survey data with 68 questions and over 2000 respondents, but we did not report variability across possible respondent subsamples or bootstrap replicates. In the revised manuscript, we will augment Table 3 with bootstrap standard errors or 95% confidence intervals for the waste figures under both our algorithm and the uniform allocation. Additionally, we will include a paired statistical test (e.g., Wilcoxon signed-rank test across questions) to assess the significance of the reduction. This will allow readers to better evaluate the robustness of the 2-6% improvement, particularly as heterogeneity increases. revision: yes
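A minimal sketch of the proposed additions, assuming per-question waste vectors for both methods are available (function names hypothetical; uses a percentile bootstrap over questions and scipy.stats.wilcoxon):

```python
import numpy as np
from scipy.stats import wilcoxon

def waste_ci(per_question_waste, n_boot=2000, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for mean waste,
    resampling questions with replacement."""
    rng = np.random.default_rng(rng)
    w = np.asarray(per_question_waste, dtype=float)
    idx = rng.integers(0, w.size, size=(n_boot, w.size))
    means = w[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return w.mean(), (lo, hi)

def paired_test(uniform_waste, adaptive_waste):
    """One-sided Wilcoxon signed-rank test on per-question waste
    differences (uniform minus adaptive)."""
    stat, p = wilcoxon(uniform_waste, adaptive_waste, alternative="greater")
    return p
```

Resampling questions (rather than respondents) matches the paired test across the 68 questions that the rebuttal proposes; a respondent-level bootstrap would be the natural complement for assessing sensitivity to the respondent pool.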
Circularity Check
No significant circularity
full rationale
The paper presents an adaptive online allocation algorithm whose core guarantee is that the gap to the optimal allocation vanishes asymptotically as the human-label budget grows. This is framed as a formal property of the dual-use rule (each label corrects the response and updates the per-question LLM reliability estimate) without any prior knowledge. No equations, fitted parameters, or self-citations are exhibited in the provided text that would reduce the claimed prediction or proof to the inputs by construction. The empirical waste reduction (2-6% vs 10-12%) is reported as separate validation on real survey data. The derivation therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human labels collected during the survey can be used both to obtain ground-truth answers and to estimate question-specific LLM prediction quality in real time.
Reference graph
Works this paper leans on
- [1] Agrawal S, Devanur N (2016) Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems 29
- [2] Angelopoulos AN, Bates S, Fannjiang C, Jordan MI, Zrnic T (2023a) Prediction-powered inference. Science 382(6671):669--674
- [3] Angelopoulos AN, Duchi JC, Zrnic T (2023b) PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453
- [4] Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D (2023) Out of one, many: Using language models to simulate human samples. Political Analysis 31(3):337--351
- [5] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235--256
- [6] Badanidiyuru A, Kleinberg R, Slivkins A (2018) Bandits with knapsacks. Journal of the ACM 65(3):1--55
- [7] Bhat S, Lyons JB, Shi C, Yang XJ (2025) Effects of learning state dependence of reward weights on trust and team performance in a human-robot sequential decision-making task. 2025 IEEE 5th International Conference on Human-Machine Systems (ICHMS), 35--40 (IEEE)
- [8] Brand J, Israeli A, Ngwe D (2023) Using LLMs for market research. HBS Working Paper (23-062)
- [9] Broska D, Howes M, van Loon A (2025) The mixed subjects design: Treating large language models as potentially informative observations. Sociological Methods & Research 54(3):1074--1109
- [10] Brucks M, Toubia O (2025) Prompt architecture induces methodological artifacts in large language models. PLOS ONE 20(4):e0319159
- [11] Carpentier A, Munos R, Antos A (2015) Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research 16(69):2231--2271
- [12] Cohen MC, Miao S, Wang Y (2025) Dynamic pricing with fairness constraints. Operations Research 73(6):3027--3043
- [13] Dai T, Swaminathan JM (2025) Artificial intelligence and operations: A foundational framework of emerging research and practice. Production and Operations Management
- [14] DiSorbo MD, Ferreira KJ, Balakrishnan M, Tong J (2025) Warnings and endorsements: Improving human-AI collaboration in the presence of outliers. Manufacturing & Service Operations Management 27(6):1814--1831
- [15] Dominguez-Olmedo R, Hardt M, Mendler-Dünner C (2024) Questioning the survey responses of large language models. Advances in Neural Information Processing Systems 37:45850--45878
- [16] Fügener A, Walzner DD, Gupta A (2026) Roles of artificial intelligence in collaboration with humans: Automation, augmentation, and the future of work. Management Science 72(1):538--557
- [17] Ge H, Bastani H, Bastani O (2023) Rethinking algorithmic fairness for human-AI collaboration. arXiv preprint arXiv:2310.03647
- [18] Huang C, Wu Y, Wang K (2025) How many human survey respondents is a large language model worth? An uncertainty quantification perspective. International Conference on Machine Learning (ICML)
- [19] Ji W, Lei L, Zrnic T (2025) Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. arXiv preprint arXiv:2501.09731
- [20] Krsteski S, Russo G, Chang S, West R, Gligorić K (2025) Valid survey simulations with limited human data. arXiv preprint arXiv:2510.11408
- [21] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press)
- [22] Li G, Liang A, Liu M, Lei M, Jasin S, Yang F, Baxi P (2026) Asymptotically optimal sequential testing with heterogeneous LLMs. arXiv preprint arXiv:2604.01086
- [23] Li P, Castelo N, Katona Z, Sarvary M (2024) Frontiers: Determining the validity of large language models for automated perceptual analysis. Marketing Science 43(2):254--266
- [24] Maurer A, Pontil M (2009) Empirical Bernstein bounds and sample variance penalization. Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 115--124
- [25] Motoki F, Pinho Neto V, Rodrigues V (2024) More human than human: Measuring ChatGPT political bias. Public Choice 198:3--23
- [26] Mozer R (2026) PPI is the difference estimator: Recognizing the survey sampling roots of prediction-powered inference. arXiv preprint arXiv:2603.19160
- [27] Neyman J (1934) On the two different aspects of the representative method. Journal of the Royal Statistical Society 97(4):558--625
- [28] Peng T, Gui G, Merlau DJ, Fan GJ, Sliman MB, Brucks M, Johnson EJ, Morwitz V, et al. (2025) A mega-study of digital twins reveals strengths, weaknesses and opportunities for further improvement. arXiv preprint arXiv:2509.19088
- [29] Raghunathan TE, Grizzle JE (1995) A split questionnaire survey design. Journal of the American Statistical Association 90(429):54--63
- [30] Simchi-Levi D, Wang C (2025) Multi-armed bandit experimental design: Online decision-making and adaptive inference. Management Science 71(6):4828--4846
- [31] Toubia O, Gui GZ, Peng T, Merlau DJ, Li A, Chen H (2025) Database report: Twin-2K-500: A data set for building digital twins of over 2,000 people based on their answers to over 500 questions. Marketing Science 44(6):1446--1455
- [32] Toubia O, Simester DI, Hauser JR, Dahan E (2003) Fast polyhedral adaptive conjoint estimation. Marketing Science 22(3):273--303
- [33] Vafa K, Athey S, Blei DM (2025) Estimating wage disparities using foundation models. Proceedings of the National Academy of Sciences 122(22):e2427298122
- [34] Wang L, Ye Z, Zhao J (2025) Efficient inference using large language models with limited human data: Fine-tuning then rectification. arXiv preprint arXiv:2511.19486
- [35] Wang M, Zhang DJ, Zhang H (2024) Large language models for market research: A data-augmentation approach. arXiv preprint arXiv:2412.19363
- [36] Ye Z, Yoganarasimhan H, Zheng Y (2025) LOLA: LLM-assisted online learning algorithm for content experiments. Marketing Science 44(5):995--1016
- [37] Yin QE, Xin L (2025) Synthetic but not infinite: How much LLM-generated data to use in market research. Available at SSRN 6078686
- [38] Ziems C, Held W, Shaikh O, Chen J, Zhang Z, Yang D (2024) Can large language models transform computational social science? Computational Linguistics 50(1):237--291