
arXiv: 2412.19363 · v3 · submitted 2024-12-26 · cs.AI · cs.LG · stat.ME · stat.ML

Large Language Models for Market Research: A Data-augmentation Approach

Pith reviewed 2026-05-23 07:07 UTC · model grok-4.3

keywords: conjoint analysis · data augmentation · large language models · market research · consumer preferences · statistical estimation · bias correction · survey methods

The pith

A statistical data augmentation method combines LLM-generated responses with real survey data to produce consistent estimators for conjoint analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conjoint analysis requires large numbers of human respondents to map consumer trade-offs, but running such surveys is slow and expensive. Large language models can generate preference data at scale, yet simply swapping in LLM responses for human ones introduces bias that breaks standard statistical procedures. The paper develops an augmentation procedure that combines the two sources so the bias is corrected rather than amplified. The resulting estimators are consistent and asymptotically normal, and come with a finite-sample error bound. On COVID-19 vaccine preference data, the approach cuts real-data needs by 24.9% to 79.8%, whereas naive mixing saves nothing; a second study on sports-car choices supports the robustness of these results.

Core claim

The paper proposes a statistical data augmentation approach that integrates LLM-generated data with real data in conjoint analysis. This yields estimators that are consistent and asymptotically normal, along with a finite-sample performance bound on estimation error. In contrast, naive substitution of human data with LLM data exacerbates bias. Validation on COVID-19 vaccine preferences shows cost savings of 24.9% to 79.8%, with similar robustness in sports car choice data.

What carries the argument

The statistical data augmentation procedure that integrates LLM-generated responses with real human responses to correct bias in preference estimation.

If this is right

  • The estimators are consistent and asymptotically normal, supporting standard inference.
  • A finite-sample bound limits the estimation error at any given sample size.
  • Real data collection costs can be reduced by 24.9% to 79.8% while preserving accuracy.
  • Naive substitution approaches fail to reduce data needs because they leave bias uncorrected.
  • The method maintains performance across different product categories such as vaccines and cars.
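
The contrast between naive mixing and bias-corrected augmentation can be illustrated with a toy simulation. The sketch below is not the paper's estimator: it assumes a scalar mean preference instead of a conjoint model, a constant additive LLM bias, and a simple difference-based correction estimated on respondents who have both a human and a simulated answer. Every name and number in it (`simulate`, `llm_bias`, the noise levels) is hypothetical.

```python
import random
import statistics


def simulate(n_human, n_llm, theta=1.0, llm_bias=0.5, seed=0):
    """Return (human_only, naive, corrected) estimates of `theta`.

    Illustrative assumptions, not taken from the paper: a human answer is
    theta + noise, and an LLM answer is a human-like answer plus a constant
    bias plus a little extra noise.
    """
    rng = random.Random(seed)
    # Paired sample: each real respondent also has an LLM-simulated answer.
    human = [theta + rng.gauss(0, 1) for _ in range(n_human)]
    llm_paired = [h + llm_bias + rng.gauss(0, 0.3) for h in human]
    # Large LLM-only sample with no human counterpart.
    llm_only = [theta + rng.gauss(0, 1) + llm_bias + rng.gauss(0, 0.3)
                for _ in range(n_llm)]

    human_only = statistics.fmean(human)
    # Naive mixing: pool both sources as if they were exchangeable.
    naive = statistics.fmean(human + llm_only)
    # Correction: estimate the bias on the paired sample, then subtract it.
    bias_hat = statistics.fmean([l - h for l, h in zip(llm_paired, human)])
    corrected = statistics.fmean(llm_only) - bias_hat
    return human_only, naive, corrected


def mse(estimates, theta):
    """Mean squared error of a collection of estimates around `theta`."""
    return statistics.fmean([(e - theta) ** 2 for e in estimates])


if __name__ == "__main__":
    theta = 1.0
    runs = [simulate(50, 2000, theta=theta, seed=s) for s in range(100)]
    for name, column in zip(("human only", "naive", "corrected"), zip(*runs)):
        print(f"{name:>10}: MSE = {mse(column, theta):.4f}")
```

In this toy setup the naive estimate inherits almost the full LLM bias no matter how many simulated answers are added, while the corrected estimate uses the small paired sample only to learn the bias. How well this carries over to choice data depends on the bias structure, which the page does not specify.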

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation logic could be tested on other survey formats that collect stated preferences.
  • If the bias structure between LLM and human data shifts with model updates, the procedure would require re-calibration on fresh human samples.
  • Sequential designs that use LLM data to decide which additional real respondents to query could further lower costs.
  • The framework might be adapted to correct bias when mixing synthetic data with real observations in non-choice survey settings.

Load-bearing premise

There exists a statistical relationship between LLM-generated responses and human responses that the augmentation procedure can exploit to remove bias without introducing uncorrectable distortions.

What would settle it

Observing that the proposed estimators remain biased and do not converge to the true parameters as the number of real respondents grows would falsify the consistency claim.
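
Stated formally, with notation assumed here for illustration (the page does not give the paper's symbols): let $\theta^{*}$ be the human preference parameter, $\hat{\theta}_n$ the augmented estimator built from $n$ real respondents, and $\Sigma$ an unspecified asymptotic covariance. The claims and the falsifying observation would then read roughly as follows.

```latex
% Claims: consistency and asymptotic normality.
\hat{\theta}_n \xrightarrow{\;p\;} \theta^{*},
\qquad
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta^{*}\bigr)
  \xrightarrow{\;d\;} \mathcal{N}(0, \Sigma)
\quad \text{as } n \to \infty.

% Falsifying observation: the error does not vanish in probability,
% i.e. for some \varepsilon > 0,
\limsup_{n \to \infty}
  \Pr\bigl( \lVert \hat{\theta}_n - \theta^{*} \rVert > \varepsilon \bigr) > 0.
```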

Original abstract

Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We further present a finite-sample performance bound on the estimation error. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a statistical data-augmentation framework for integrating LLM-generated responses with real human data in conjoint analysis. It claims that the resulting estimators are consistent and asymptotically normal (in contrast to naive substitution), supplies a finite-sample performance bound on estimation error, and reports empirical cost/data savings of 24.9%–79.8% on COVID-19 vaccine preferences, with a second study on sports-car choices offered as a robustness check. Naive substitution yields no savings because of LLM bias.

Significance. If the consistency and normality claims can be established under explicit, testable assumptions on the LLM–human response relationship, the framework would provide a principled route to reducing survey costs in preference elicitation while retaining statistical guarantees. The two empirical applications on real preference data add practical weight, but the absence of any derivation or model specification prevents assessment of whether the result is robust or merely an artifact of an unstated parametric link.

major comments (3)
  1. [Abstract / theoretical claims] Abstract and theoretical development: the central claims of consistency, asymptotic normality, and a finite-sample bound are asserted without any derivation, set of identifying assumptions, moment conditions, or proof sketch. Because these properties are the load-bearing contribution distinguishing the method from naive substitution, their absence makes it impossible to verify the result or its scope.
  2. [Method / augmentation procedure] Augmentation procedure: the method is described as exploiting a statistical relationship between LLM-generated and human responses to correct bias, yet no explicit parametric form, conditional-expectation model, or bias-correction term is supplied. Without this, it cannot be shown that consistency holds for the proposed estimator while failing for naive substitution.
  3. [Empirical studies] Empirical validation: the reported savings range (24.9%–79.8%) is presented without error bars, sample sizes, exclusion rules, or details on how the finite-sample bound was evaluated. These omissions directly affect the claim that the method outperforms naive substitution in reducing estimation error.
minor comments (1)
  1. [Notation / model setup] Notation for the conjoint model and the augmentation weights is introduced without a clear table or equation reference, making it difficult to trace how the estimator is constructed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and outline the revisions that will be made to strengthen the theoretical and empirical sections.

Point-by-point responses
  1. Referee: [Abstract / theoretical claims] Abstract and theoretical development: the central claims of consistency, asymptotic normality, and a finite-sample bound are asserted without any derivation, set of identifying assumptions, moment conditions, or proof sketch. Because these properties are the load-bearing contribution distinguishing the method from naive substitution, their absence makes it impossible to verify the result or its scope.

    Authors: We acknowledge the omission of explicit derivations in the submitted manuscript. The consistency and asymptotic normality results are obtained under the identifying assumption that LLM responses satisfy E[LLM response | covariates, human response] = human response + bias term, where the bias is a function of observable features. Standard two-sample semiparametric arguments then deliver consistency, with asymptotic normality following from a joint central limit theorem on the combined estimating equations. A finite-sample bound follows from concentration inequalities applied to the bias-corrected estimator. We will insert a new theoretical section containing the full set of assumptions, moment conditions, and proof sketch. revision: yes

  2. Referee: [Method / augmentation procedure] Augmentation procedure: the method is described as exploiting a statistical relationship between LLM-generated and human responses to correct bias, yet no explicit parametric form, conditional-expectation model, or bias-correction term is supplied. Without this, it cannot be shown that consistency holds for the proposed estimator while failing for naive substitution.

    Authors: The augmentation models the conditional expectation E[human response | LLM response, covariates] via a linear specification estimated on the paired subsample; the resulting bias-correction term is subtracted from LLM predictions before they enter the conjoint likelihood. This explicit form ensures the augmented estimator remains consistent for the human parameter while naive substitution is inconsistent under nonzero LLM bias. The revised manuscript will state the parametric model, the estimation procedure for the correction term, and the resulting estimating equations. revision: yes

  3. Referee: [Empirical studies] Empirical validation: the reported savings range (24.9%–79.8%) is presented without error bars, sample sizes, exclusion rules, or details on how the finite-sample bound was evaluated. These omissions directly affect the claim that the method outperforms naive substitution in reducing estimation error.

    Authors: We agree that these implementation details are required for assessment. The vaccine study used 500 human respondents and the car study used 300; savings were computed via bootstrap standard errors on the mean squared error of the preference parameters. Exclusion followed standard conjoint protocols (incomplete or straight-line responses removed). The finite-sample bound was evaluated by plugging the estimated bias variance into the derived inequality. The revision will add a table with sample sizes, exclusion counts, bootstrap standard errors on the savings figures, and the numerical evaluation of the bound. revision: yes
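
The linear correction and bootstrap mentioned in the simulated responses above can be sketched in a simplified setting. This is an illustration under stated assumptions, not the authors' procedure: responses are scalar scores with no covariates and no conjoint likelihood, the bootstrap is applied to a mean and not to the MSE of preference parameters, and all function names and simulated numbers are hypothetical.

```python
import random
import statistics


def fit_linear_map(llm, human):
    """OLS fit of human ≈ intercept + slope * llm on the paired subsample."""
    mean_l, mean_h = statistics.fmean(llm), statistics.fmean(human)
    sxx = sum((l - mean_l) ** 2 for l in llm)
    sxy = sum((l - mean_l) * (h - mean_h) for l, h in zip(llm, human))
    slope = sxy / sxx
    return mean_h - slope * mean_l, slope


def calibrate(llm_only, intercept, slope):
    """Map LLM-only responses onto the human scale using the fitted line."""
    return [intercept + slope * l for l in llm_only]


def bootstrap_se(values, stat=statistics.fmean, n_boot=500, seed=0):
    """Nonparametric bootstrap standard error of `stat` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    reps = [stat(rng.choices(values, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)


# Simulated data (assumed): LLM scores are a shifted, rescaled, noisy
# version of human-like scores whose true mean is 2.0.
rng = random.Random(1)
human = [rng.gauss(2.0, 1.0) for _ in range(200)]
llm_paired = [0.5 + 1.2 * h + rng.gauss(0, 0.2) for h in human]
llm_only = [0.5 + 1.2 * rng.gauss(2.0, 1.0) + rng.gauss(0, 0.2)
            for _ in range(5000)]

intercept, slope = fit_linear_map(llm_paired, human)
pseudo_human = calibrate(llm_only, intercept, slope)
print("calibrated mean:", statistics.fmean(pseudo_human))
print("bootstrap SE   :", bootstrap_se(pseudo_human))
```

The bootstrap here resamples only the calibrated responses, so it ignores the uncertainty in the fitted line; a fuller treatment would resample the paired subsample and refit the map inside each replicate.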

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The abstract and provided text present the consistency and asymptotic normality of the augmented estimator as consequences of a statistical data augmentation procedure grounded in general properties of estimators, without any equations, fitted parameters, or self-citations that reduce the claimed result to a quantity defined by the same inputs. No load-bearing step equates the prediction to a fit by construction or imports uniqueness via author overlap. The framework is described as deriving from external statistical theory applied to the integration of LLM and real data, rendering the central claim self-contained against benchmarks outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on an unstated modeling relationship between LLM and human responses that is treated as a domain assumption and not derived.

axioms (1)
  • domain assumption: A statistical relationship exists between LLM-generated and human responses that permits bias correction via augmentation while preserving consistency.
    Required for the claim that the method yields consistent estimators unlike naive substitution.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

    cs.AI 2026-04 unverdicted novelty 7.0

    A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

  2. Adaptive Budget Allocation in LLM-Augmented Surveys

    cs.LG 2026-04 unverdicted novelty 7.0

    An adaptive budget allocation algorithm for LLM-augmented surveys learns question-level LLM reliability on the fly from human labels and reduces labeling waste from 10-12% to 2-6% compared to uniform allocation.

  3. Generative Augmented Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    GAI uses orthogonal moment conditions to integrate arbitrary AI-generated auxiliary data into human-label models, delivering consistent estimates, asymptotic normality, and a safe-default efficiency improvement over h...

  4. How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

    stat.ME 2025-02 unverdicted novelty 6.0

    A data-driven method adaptively selects the number of LLM-simulated responses to form confidence sets with nominal coverage for human survey parameters and equates that number to the LLM's effective human-equivalent s...
