Large Language Models for Market Research: A Data-augmentation Approach
Pith reviewed 2026-05-23 07:07 UTC · model grok-4.3
The pith
A statistical data augmentation method combines LLM-generated responses with real survey data to produce consistent estimators for conjoint analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes a statistical data augmentation approach that integrates LLM-generated data with real data in conjoint analysis. This yields estimators that are consistent and asymptotically normal, along with a finite-sample performance bound on estimation error. In contrast, naive substitution of human data with LLM data exacerbates bias. Validation on COVID-19 vaccine preferences shows cost savings of 24.9% to 79.8%, with similar robustness in sports car choice data.
What carries the argument
The statistical data augmentation procedure that integrates LLM-generated responses with real human responses to correct bias in preference estimation.
If this is right
- The estimators are consistent and asymptotically normal, supporting standard inference.
- A finite-sample bound quantifies the reduction in estimation error.
- Real data collection costs can be reduced by 24.9% to 79.8% while preserving accuracy.
- Naive substitution approaches fail to reduce data needs because they leave bias uncorrected.
- The method maintains performance across different product categories such as vaccines and cars.
Where Pith is reading between the lines
- The same augmentation logic could be tested on other survey formats that collect stated preferences.
- If the bias structure between LLM and human data shifts with model updates, the procedure would require re-calibration on fresh human samples.
- Sequential designs that use LLM data to decide which additional real respondents to query could further lower costs.
- The framework might be adapted to correct bias when mixing synthetic data with real observations in non-choice survey settings.
Load-bearing premise
There exists a statistical relationship between LLM-generated responses and human responses that the augmentation procedure can exploit to remove bias without introducing uncorrectable distortions.
What would settle it
Observing that the proposed estimators remain biased and do not converge to the true parameters as the number of real respondents grows would falsify the consistency claim.
read the original abstract
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We further present a finite-sample performance bound on the estimation error. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a statistical data-augmentation framework for integrating LLM-generated responses with real human data in conjoint analysis. It claims that the resulting estimators are consistent and asymptotically normal (in contrast to naive substitution), supplies a finite-sample performance bound on estimation error, and reports empirical cost/data savings of 24.9%–79.8% on two studies (COVID-19 vaccine preferences and sports-car choices) while naive substitution yields no savings due to LLM bias.
Significance. If the consistency and normality claims can be established under explicit, testable assumptions on the LLM–human response relationship, the framework would provide a principled route to reducing survey costs in preference elicitation while retaining statistical guarantees. The two empirical applications on real preference data add practical weight, but the absence of any derivation or model specification prevents assessment of whether the result is robust or merely an artifact of an unstated parametric link.
major comments (3)
- [Abstract / theoretical claims] Abstract and theoretical development: the central claims of consistency, asymptotic normality, and a finite-sample bound are asserted without any derivation, set of identifying assumptions, moment conditions, or proof sketch. Because these properties are the load-bearing contribution distinguishing the method from naive substitution, their absence makes it impossible to verify the result or its scope.
- [Method / augmentation procedure] Augmentation procedure: the method is described as exploiting a statistical relationship between LLM-generated and human responses to correct bias, yet no explicit parametric form, conditional-expectation model, or bias-correction term is supplied. Without this, it cannot be shown that consistency holds for the proposed estimator while failing for naive substitution, as required by the skeptic’s concern.
- [Empirical studies] Empirical validation: the reported savings range (24.9%–79.8%) is presented without error bars, sample sizes, exclusion rules, or details on how the finite-sample bound was evaluated. These omissions directly affect the claim that the method outperforms naive substitution in reducing estimation error.
minor comments (1)
- [Notation / model setup] Notation for the conjoint model and the augmentation weights is introduced without a clear table or equation reference, making it difficult to trace how the estimator is constructed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and outline the revisions that will be made to strengthen the theoretical and empirical sections.
read point-by-point responses
-
Referee: [Abstract / theoretical claims] Abstract and theoretical development: the central claims of consistency, asymptotic normality, and a finite-sample bound are asserted without any derivation, set of identifying assumptions, moment conditions, or proof sketch. Because these properties are the load-bearing contribution distinguishing the method from naive substitution, their absence makes it impossible to verify the result or its scope.
Authors: We acknowledge the omission of explicit derivations in the submitted manuscript. The consistency and asymptotic normality results are obtained under the identifying assumption that LLM responses satisfy E[LLM response | covariates, human response] = human response + bias term, where the bias is a function of observable features. Standard two-sample semiparametric arguments then deliver consistency, with asymptotic normality following from a joint central limit theorem on the combined estimating equations. A finite-sample bound follows from concentration inequalities applied to the bias-corrected estimator. We will insert a new theoretical section containing the full set of assumptions, moment conditions, and proof sketch. revision: yes
-
Referee: [Method / augmentation procedure] Augmentation procedure: the method is described as exploiting a statistical relationship between LLM-generated and human responses to correct bias, yet no explicit parametric form, conditional-expectation model, or bias-correction term is supplied. Without this, it cannot be shown that consistency holds for the proposed estimator while failing for naive substitution, as required by the skeptic’s concern.
Authors: The augmentation models the conditional expectation E[human response | LLM response, covariates] via a linear specification estimated on the paired subsample; the resulting bias-correction term is subtracted from LLM predictions before they enter the conjoint likelihood. This explicit form ensures the augmented estimator remains consistent for the human parameter while naive substitution is inconsistent under nonzero LLM bias. The revised manuscript will state the parametric model, the estimation procedure for the correction term, and the resulting estimating equations. revision: yes
-
Referee: [Empirical studies] Empirical validation: the reported savings range (24.9%–79.8%) is presented without error bars, sample sizes, exclusion rules, or details on how the finite-sample bound was evaluated. These omissions directly affect the claim that the method outperforms naive substitution in reducing estimation error.
Authors: We agree that these implementation details are required for assessment. The vaccine study used 500 human respondents and the car study used 300; savings were computed via bootstrap standard errors on the mean squared error of the preference parameters. Exclusion followed standard conjoint protocols (incomplete or straight-line responses removed). The finite-sample bound was evaluated by plugging the estimated bias variance into the derived inequality. The revision will add a table with sample sizes, exclusion counts, bootstrap standard errors on the savings figures, and the numerical evaluation of the bound. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The abstract and provided text present the consistency and asymptotic normality of the augmented estimator as consequences of a statistical data augmentation procedure grounded in general properties of estimators, without any equations, fitted parameters, or self-citations that reduce the claimed result to a quantity defined by the same inputs. No load-bearing step equates the prediction to a fit by construction or imports uniqueness via author overlap. The framework is described as deriving from external statistical theory applied to the integration of LLM and real data, rendering the central claim self-contained against benchmarks outside the paper's own fitted values.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A statistical relationship exists between LLM-generated and human responses that permits bias correction via augmentation while preserving consistency.
Forward citations
Cited by 4 Pith papers
-
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
-
Adaptive Budget Allocation in LLM-Augmented Surveys
An adaptive budget allocation algorithm for LLM-augmented surveys learns question-level LLM reliability on the fly from human labels and reduces labeling waste from 10-12% to 2-6% compared to uniform allocation.
-
Generative Augmented Inference
GAI uses orthogonal moment conditions to integrate arbitrary AI-generated auxiliary data into human-label models, delivering consistent estimates, asymptotic normality, and a safe-default efficiency improvement over h...
-
How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective
A data-driven method adaptively selects the number of LLM-simulated responses to form confidence sets with nominal coverage for human survey parameters and equates that number to the LLM's effective human-equivalent s...
Reference graph
Works this paper leans on
-
[1]
Allenby, Greg M, Peter E Rossi. 2006. Hierarchical bayes models. The handbook of marketing research: Uses, misuses, and future advances\/ 418--440
work page 2006
-
[2]
Argyle, Lisa P, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis\/ 31 (3) 337--351
work page 2023
-
[3]
Bastani, Hamsa, Dennis J Zhang, Heng Zhang. 2022. Applied machine learning in operations management. Innovative Technology at the Interface of Finance and Operations: Volume I\/ 189--222
work page 2022
- [4]
-
[5]
Bound, John, Charles Brown, Nancy Mathiowetz. 2001. Measurement error in survey data. Handbook of econometrics\/ , vol. 5. Elsevier, 3705--3843
work page 2001
-
[6]
Brand, James, Ayelet Israeli, Donald Ngwe. 2023. Using GPT for market research. Available at SSRN 4395751\/
work page 2023
-
[7]
Brown, Tom B. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165\/
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[8]
Chen, Yiting, Tracy Xiao Liu, You Shan, Songfa Zhong. 2023. The emergence of economic rationality of gpt. Proceedings of the National Academy of Sciences\/ 120 (51) e2316205120
work page 2023
-
[9]
Choi, Tsan-Ming, Subodha Kumar, Xiaohang Yue, Hau-Ling Chan. 2022. Disruptive technologies and operations management in the industry 4.0 era and beyond. Production and Operations Management\/ 31 (1) 9--31
work page 2022
-
[10]
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on information theory\/ 2 (3) 113--124
work page 1956
-
[11]
Connell, Paul, Jonathan H Choi. 2024. Estimating and correcting for misclassification error in empirical textual research. Available at SSRN\/
work page 2024
-
[12]
Devlin, Jacob. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805\/
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Diederik, P Kingma. 2014. Adam: A method for stochastic optimization. (No Title)\/
work page 2014
-
[14]
Dzyabura, Daria, Srikanth Jagabathula. 2018. Offline assortment optimization in the presence of an online channel. Management Science\/ 64 (6) 2767--2786
work page 2018
-
[15]
Eggers, Felix, Henrik Sattler, Thorsten Teichert, Franziska V \"o lckner. 2021. Choice-based conjoint analysis. Handbook of market research\/ . Springer, 781--819
work page 2021
-
[16]
Girotra, Karan, Lennart Meincke, Christian Terwiesch, Karl T Ulrich. 2023. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071\/
work page 2023
-
[17]
Goli, Ali, Amandeep Singh. 2024. Frontiers: Can large language models capture human preferences? Marketing Science\/
work page 2024
-
[18]
Green, Paul E, Venkat Srinivasan. 1990. Conjoint analysis in marketing: new developments with implications for research and practice. Journal of marketing\/ 54 (4) 3--19
work page 1990
-
[19]
Green, Paul E, Venkatachary Srinivasan. 1978. Conjoint analysis in consumer research: issues and outlook. Journal of consumer research\/ 5 (2) 103--123
work page 1978
- [20]
- [21]
-
[22]
Hair Jr, Joe, Michael Page, Niek Brunsveld. 2019. Essentials of business research methods\/ . Routledge
work page 2019
-
[23]
Hinton, Geoffrey. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531\/
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[24]
Horton, John J. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Tech. rep., National Bureau of Economic Research
work page 2023
- [25]
-
[26]
HuggingFace. 2024. meta-llama. https://huggingface.co/meta-llama/Meta-Llama-3-8B#: :text=Training Accessed: 08/31/2024
work page 2024
-
[27]
Kessels, Roselinde, Peter Goos, Martina Vandebroek. 2008. Optimal designs for conjoint experiments. Computational statistics & data analysis\/ 52 (5) 2369--2387
work page 2008
-
[28]
Kohli, Rajeev, Ramamirtham Sukumar. 1990. Heuristics for product-line design using conjoint analysis. Management Science\/ 36 (12) 1464--1478
work page 1990
-
[29]
Brownstein, Yulin Hswen, Brian T
Kreps, Sarah, Sandip Prasad, John S. Brownstein, Yulin Hswen, Brian T. Garibaldi, Baobao Zhang, Douglas L. Kriner. 2020. Factors associated with us adults’ likelihood of accepting covid-19 vaccination. JAMA Network Open\/ 3 (10) e2025594--e2025594
work page 2020
- [30]
-
[31]
Naveed, Humza, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435\/
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Newey, Whitney K, Daniel McFadden. 1994. Large sample estimation and hypothesis testing. Handbook of econometrics\/ 4 2111--2245
work page 1994
-
[33]
Olsen, Tava Lennon, Brian Tomlin. 2020. Industry 4.0: Opportunities and challenges for operations management. Manufacturing & Service Operations Management\/ 22 (1) 113--122
work page 2020
-
[34]
OpenAI, R. 2023. Gpt-4 technical report. arxiv 2303.08774. View in Article\/ 2 (5)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Pan, Sinno Jialin, Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering\/ 22 (10) 1345--1359
work page 2009
-
[36]
Parthasarathy, Venkatesh Balavadhani, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid. 2024. The ultimate guide to fine-tuning llms from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and opportunities. arXiv preprint arXiv:2408.13296\/
-
[37]
Peng, Andrew, John Allard, Steven Heidel. 2024. Fine-tuning now available for GPT -4o. https://openai.com/index/gpt-4o-fine-tuning/. Accessed: 2024-12-15
work page 2024
-
[38]
Radford, A. 2018. Improving language understanding by generative pre-training
work page 2018
-
[39]
Shane, Scott A, Karl T Ulrich. 2004. 50th anniversary article: Technological innovation, product development, and entrepreneurship in management science. Management science\/ 50 (2) 133--144
work page 2004
-
[40]
Solomon, Michael R. 2020. Consumer behavior: Buying, having, and being\/ . Pearson
work page 2020
-
[41]
Spencer, Vic. 2019. Choice modeling sports cars. https://github.com/spensorflow/Marketing-Analytics---Choice-Modeling-Sports-Car-Sales. Accessed: 2024-10-09
work page 2019
-
[42]
Sutskever, I. 2014. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215\/
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[43]
Terwiesch, Christian. 2019. Om forum—empirical research in operations management: From field studies to analyzing digital exhaust. Manufacturing & Service Operations Management\/ 21 (4) 713--722
work page 2019
-
[44]
Van der Vaart, Aad W. 2000. Asymptotic statistics\/ , vol. 3. Cambridge university press
work page 2000
-
[45]
Vaswani, A. 2017. Attention is all you need. Advances in Neural Information Processing Systems\/
work page 2017
-
[46]
Wang, Xinfang, Jeffrey D Camm, David J Curry. 2009. A branch-and-price approach to the share-of-choice product line design problem. Management Science\/ 55 (10) 1718--1728
work page 2009
-
[47]
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems\/ 35 24824--24837
work page 2022
- [48]
-
[49]
Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems\/ 36
work page 2024
-
[50]
Yoo, Youngjin, Ola Henfridsson, Jannis Kallinikos, Robert Gregory, Gordon Burtch, Sutirtha Chatterjee, Suprateek Sarker. 2024. The next frontiers of digital innovation research. Information Systems Research\/
work page 2024
-
[51]
Zhuang, Fuzhen, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, Qing He. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE\/ 109 (1) 43--76
work page 2020
-
[52]
Ziems, Caleb, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, Diyi Yang. 2024. Can large language models transform computational social science? Computational Linguistics\/ 50 (1) 237--291
work page 2024
-
[53]
, " * write output.state after.block = add.period write newline
ENTRY address author booktitle chapter doi edition editor eid howpublished institution journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid.sent...
-
[54]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in "" FUNCTION format.date year ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.