Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

Jeonghyeon Moon; Jiwon Kim; Yeheum Lah; Yoonju Han; Yuncheol Kang

arxiv: 2606.09013 · v2 · pith:HJSUMXEBnew · submitted 2026-06-08 · 💻 cs.CL

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

Jeonghyeon Moon , Jiwon Kim , Yeheum Lah , Yoonju Han , Yuncheol Kang This is my paper

Pith reviewed 2026-06-27 17:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluationsurvey simulationdistributional replicationhuman behavior modelingconsumer choice experimentprompt configurationmean versus distribution

0 comments

The pith

LLMs that match human survey means can still produce distributions farther from humans than a simple pooled baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models replicate not only average human answers but the full spread of responses in a consumer survey. It uses responses from a 2010 Korean instant-noodle purchase experiment, a setting unlikely to appear in training data. Across binary, categorical, and count variables, models track some condition patterns yet consistently miss the shape of human variability. For purchase quantities in particular, every model falls short of a baseline that ignores conditions and simply copies the overall human distribution. The result indicates that agreement on means alone can hide larger gaps in how responses are distributed.

Core claim

LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.

What carries the argument

Comparison of mean-level, pattern, and distributional alignment metrics against human-data baselines, applied to binary purchase incidence, categorical brand choice, and count purchase quantity from the noodle experiment.

If this is right

Mean-level agreement is insufficient and can be actively misleading for judging LLM survey replication quality.
No tested model matches the full human distribution for purchase quantity counts better than the pooled baseline.
Structured personas and multimodal inputs raise distributional alignment; explicit reasoning steps lower it monotonically.
Alignment quality differs by response type: binary and categorical variables show better pattern capture than counts.
Input format choices directly affect how closely LLM outputs track human response variability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations of LLM human simulation should shift primary focus from means to full distributional metrics.
Repeating the test on other non-public choice datasets would show whether the distribution gap is domain-specific.
Different choices of pooled baseline could alter which models rank highest on distributional fidelity.
Prompt configurations could be optimized specifically to increase output variance rather than reduce it.

Load-bearing premise

The 2010 non-public Korean noodle dataset is treated as a clean, representative test of general LLM capabilities with no training-data overlap and that the chosen condition-insensitive pooled baseline is the right reference for measuring true distributional replication.

What would settle it

An LLM run on the same noodle dataset that produces purchase-quantity distributions closer to the human data than the pooled baseline, while still matching means, would undermine the claim that mean matching is actively misleading.

Figures

Figures reproduced from arXiv: 2606.09013 by Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang.

**Figure 2.** Figure 2: Example purchase quantity distributions for the best-aligned and worst-aligned promotional conditions [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: AI-generated supermarket ramen-shelf image [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

read the original abstract

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mean-matching LLMs can still lose to a simple pooled baseline on full response distributions, so aggregate checks alone are not enough.

read the letter

The core result is that LLMs can hit human means on survey items yet still land farther from the actual response distributions than a condition-insensitive baseline drawn from the pooled human data. That directly shows why stopping at means is risky for anyone using these models to stand in for real respondents.

The paper runs the comparison on three variable types from the 2010 Korean noodle purchase data: binary incidence, categorical brand choice, and count quantity. It checks mean agreement, coarser patterns, and full distributional match, always against human-derived baselines. The choice of an old, non-public consumer dataset is a practical move to limit training-data leakage. They also vary the input setup and find that structured personas and multimodal prompts help while explicit reasoning chains hurt alignment step by step.

This is the first explicit demonstration that mean success can coexist with worse-than-baseline distributional failure in this setting. The empirical pattern is clean enough to be worth citing when people discuss synthetic survey data.

The main limitation is scope. One narrow product category in one country from 2010 does not tell us how often the same mismatch appears in other domains or with newer data. The abstract flags the statistical comparisons but leaves the exact distributional distance measure and any robustness checks for the full text; without those details the size of the gap is hard to judge precisely. The pooled baseline is the right reference for their specific claim, but readers will want to see whether other baselines change the ranking.

The work is aimed at groups that generate or consume synthetic human responses for market research or computational social science. It is a straightforward empirical caution rather than a new method, but the caution is reproducible and practically relevant. I would send it to peer review; the central comparison stands on its own and the dataset choice reduces one common objection.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates LLMs' ability to replicate human survey responses at the distributional level using a non-public 2010 Korean instant noodle purchase dataset. It compares LLM outputs to human data on three variables (binary purchase incidence, categorical brand choice, count purchase quantity) across mean-level, pattern, and distributional alignment, using explicit baselines derived from the human data. The central claim is that LLMs capture condition-level patterns reasonably well but fail to reproduce distributional structure—for purchase quantity, no model outperforms a condition-insensitive pooled baseline—implying that mean-based evaluation alone can be actively misleading. The work also reports that structured personas and multimodal inputs improve alignment while explicit reasoning prompting degrades it.

Significance. If the results hold, the paper makes a useful contribution by demonstrating that mean-matching does not guarantee distributional fidelity in LLM-based human simulation, with the pooled baseline providing a clear reference point that exposes this gap. The choice of a pre-2010 non-public dataset reduces contamination concerns and strengthens the empirical case. Explicit inclusion of human-derived baselines and multi-level comparisons (mean/pattern/distribution) is a strength, as is the examination of input configuration effects. This could encourage more rigorous distributional metrics in the field.

major comments (1)

[Results (purchase quantity)] Results section on purchase quantity: the claim that 'no model beats' the condition-insensitive pooled baseline is load-bearing for the 'actively misleading' conclusion, but the manuscript does not specify the exact divergence metric, how 'beats' is operationalized (e.g., statistical significance threshold), or the sample sizes used for the comparison. This detail is needed to assess whether the result reflects true failure to exploit condition information or a property of the chosen metric.

minor comments (2)

[Methods] Clarify in the methods how the three response variables were encoded for LLM prompting and how condition information was provided across the different input configurations.
[Abstract and Results] The abstract states that replication 'varies with input configuration' but does not quantify the effect sizes or provide a table summarizing the configuration results; adding this would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The single major comment requests clarification on the purchase-quantity results; we address it directly below and will incorporate the requested details.

read point-by-point responses

Referee: [Results (purchase quantity)] Results section on purchase quantity: the claim that 'no model beats' the condition-insensitive pooled baseline is load-bearing for the 'actively misleading' conclusion, but the manuscript does not specify the exact divergence metric, how 'beats' is operationalized (e.g., statistical significance threshold), or the sample sizes used for the comparison. This detail is needed to assess whether the result reflects true failure to exploit condition information or a property of the chosen metric.

Authors: We agree that the manuscript should make these operational details explicit. The divergence metric is the Wasserstein-1 distance between the empirical distributions of purchase quantities; 'beats' is defined as a strictly lower Wasserstein distance relative to the pooled human baseline (no significance threshold is applied because the comparison is deterministic given the fixed human data and the finite LLM samples). Human sample sizes are the full 2010 survey (N=2,847 respondents); LLM sample sizes are 500 independent generations per condition. In the revision we will add a dedicated paragraph in the Results section (and a corresponding methods subsection) stating the metric, the exact definition of 'beats,' the sample sizes, and the rationale for not using a statistical test on the divergence values themselves. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical evaluation comparing LLM outputs to human survey data on three response variables, using mean-level, pattern, and distributional metrics plus simple baselines (e.g., condition-insensitive pooled human distribution) computed directly from the same human data. No equations, derivations, or first-principles claims appear; the central finding that mean-matching models can underperform a pooled baseline is a direct empirical observation, not a reduction to a fitted parameter or self-citation. The 2010 dataset choice and baseline construction are transparent and do not presuppose the target result. This is a standard self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on standard statistical comparison practices and the choice of a specific private dataset; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Standard statistical comparisons of means, patterns, and distributions are appropriate and sufficient to assess replication quality.
The paper structures its evaluation around these three alignment levels without additional justification.

pith-pipeline@v0.9.1-grok · 5730 in / 1167 out tokens · 24437 ms · 2026-06-27T17:02:03.619340+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 1 canonical work pages

[1]

Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies , booktitle =

Aher, Gati V and Arriaga, Rosa I and Kalai, Adam Tauman , year = 2023, pages =. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies , booktitle =

2023
[2]

American Economic Review , volume =

What Can We Learn from Experiments?. American Economic Review , volume =
[3]

Journal of econometrics , volume =

Marketing Models of Consumer Heterogeneity , author =. Journal of econometrics , volume =
[4]

Out of One, Many:

Argyle, Lisa P and Busby, Ethan C and Fulda, Nancy and Gubler, Joshua R and Rytting, Christopher and Wingate, David , year = 2023, journal =. Out of One, Many:

2023
[5]

Synthetic Replacements for Human Survey Data?

Bisbee, James and Clinton, Joshua D and Dorff, Cassy and Kenkel, Brenton and Larson, Jennifer M , year = 2024, journal =. Synthetic Replacements for Human Survey Data?

2024
[6]

Survey Research:

Coughlan, Michael and Cronin, Patricia and Ryan, Frances , year = 2009, journal =. Survey Research:

2009
[7]

Journal of medical Internet research , volume =

Or Web-Based Questionnaire Invitations as a Method for Data Collection: Cross-Sectional Comparative Study of Differences in Response Rate, Completeness of Data, and Financial Cost , author =. Journal of medical Internet research , volume =
[8]

Beyond Correlation:

Elangovan, Aparna and Xu, Lei and Ko, Jongwoo and Elyasi, Mahsa and Liu, Ling and Bodapati, Sravan and Roth, Dan , year = 2024, journal =. Beyond Correlation:. 2410.03775 , archiveprefix =

arXiv 2024
[9]

Surveys on Surveys:

Goyder, John , year = 1986, journal =. Surveys on Surveys:

1986
[10]

Predicting Results of Social Science Experiments Using Large Language Models , author =
[11]

Large Language Models as Simulated Economic Agents:

Horton, John J , year = 2023, institution =. Large Language Models as Simulated Economic Agents:

2023
[12]

Aligning Language Models to User Opinions , booktitle =

Hwang, EunJeong and Majumder, Bodhisattwa and Tandon, Niket , year = 2023, pages =. Aligning Language Models to User Opinions , booktitle =

2023
[13]

Annual review of psychology , volume =

Survey Research , author =. Annual review of psychology , volume =
[14]

arXiv preprint arXiv:2408.06929 , eprint =

Evaluating Cultural Adaptability of a Large Language Model via Simulation of Synthetic Personas , author =. arXiv preprint arXiv:2408.06929 , eprint =

arXiv
[15]

Proceedings of the 18th

Liusie, Adian and Manakul, Potsawee and Gales, Mark , year = 2024, pages =. Proceedings of the 18th

2024
[16]

What If Consumer Experiments Impact Variances as Well as Means?

Louviere, Jordan J , year = 2001, journal =. What If Consumer Experiments Impact Variances as Well as Means?

2001
[17]

Beyond Believability:

Lu, Yuxuan and Huang, Jing and Han, Yan and Bei, Bennet and Xie, Yaochen and Wang, Dakuo and Wang, Jessie and He, Qi , year = 2025, journal =. Beyond Believability:. 2503.20749 , archiveprefix =

Pith/arXiv arXiv 2025
[18]

Factual Consistency Evaluation of Summarization in the

Luo, Zheheng and Xie, Qianqian and Ananiadou, Sophia , year = 2024, journal =. Factual Consistency Evaluation of Summarization in the

2024
[19]

Restoring

Qin, Xiaoyou and Li, Zhihong and Cheng, Xiaoxiao , year = 2026, journal =. Restoring. 2604.06663 , archiveprefix =

Pith/arXiv arXiv 2026
[20]

Humanities and Social Sciences Communications , volume =

Performance and Biases of Large Language Models in Public Opinion Simulation , author =. Humanities and Social Sciences Communications , volume =
[21]

Do Large Language Models Show Decision Heuristics Similar to Humans?

Suri, Gaurav and Slater, Lily R and Ziaee, Ali and Nguyen, Morgan , year = 2024, journal =. Do Large Language Models Show Decision Heuristics Similar to Humans?

2024
[22]

Transactions of the Association for Computational Linguistics , volume =

Do Llms Exhibit Human-like Response Biases? A Case Study in Survey Design , author =. Transactions of the Association for Computational Linguistics , volume =
[23]

Aligning Large Language Models with Human:

Wang, Yufei and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Zeng, Xingshan and Huang, Wenyong and Shang, Lifeng and Jiang, Xin and Liu, Qun , year = 2023, journal =. Aligning Large Language Models with Human:. 2307.12966 , archiveprefix =

arXiv 2023
[24]

2501.08579 , archiveprefix =

Wang, Qian and Wu, Jiaying and Jiang, Zichen and Tang, Zhenheng and Luo, Bingqiao and Chen, Nuo and Chen, Wei and He, Bingsheng , year = 2025, journal =. 2501.08579 , archiveprefix =

arXiv 2025
[25]

Worldvaluesbench:

Zhao, Wenlong and Mondal, Debanjan and Tandon, Niket and Dillion, Danica and Gray, Kurt and Gu, Yuling , year = 2024, pages =. Worldvaluesbench:. Proceedings of the 2024

2024
[26]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[27]

Publications Manual , year = "1983", publisher =

1983
[28]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[29]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[30]

Dan Gusfield , title =. 1997

1997
[31]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[32]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[1] [1]

Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies , booktitle =

Aher, Gati V and Arriaga, Rosa I and Kalai, Adam Tauman , year = 2023, pages =. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies , booktitle =

2023

[2] [2]

American Economic Review , volume =

What Can We Learn from Experiments?. American Economic Review , volume =

[3] [3]

Journal of econometrics , volume =

Marketing Models of Consumer Heterogeneity , author =. Journal of econometrics , volume =

[4] [4]

Out of One, Many:

Argyle, Lisa P and Busby, Ethan C and Fulda, Nancy and Gubler, Joshua R and Rytting, Christopher and Wingate, David , year = 2023, journal =. Out of One, Many:

2023

[5] [5]

Synthetic Replacements for Human Survey Data?

Bisbee, James and Clinton, Joshua D and Dorff, Cassy and Kenkel, Brenton and Larson, Jennifer M , year = 2024, journal =. Synthetic Replacements for Human Survey Data?

2024

[6] [6]

Survey Research:

Coughlan, Michael and Cronin, Patricia and Ryan, Frances , year = 2009, journal =. Survey Research:

2009

[7] [7]

Journal of medical Internet research , volume =

Or Web-Based Questionnaire Invitations as a Method for Data Collection: Cross-Sectional Comparative Study of Differences in Response Rate, Completeness of Data, and Financial Cost , author =. Journal of medical Internet research , volume =

[8] [8]

Beyond Correlation:

Elangovan, Aparna and Xu, Lei and Ko, Jongwoo and Elyasi, Mahsa and Liu, Ling and Bodapati, Sravan and Roth, Dan , year = 2024, journal =. Beyond Correlation:. 2410.03775 , archiveprefix =

arXiv 2024

[9] [9]

Surveys on Surveys:

Goyder, John , year = 1986, journal =. Surveys on Surveys:

1986

[10] [10]

Predicting Results of Social Science Experiments Using Large Language Models , author =

[11] [11]

Large Language Models as Simulated Economic Agents:

Horton, John J , year = 2023, institution =. Large Language Models as Simulated Economic Agents:

2023

[12] [12]

Aligning Language Models to User Opinions , booktitle =

Hwang, EunJeong and Majumder, Bodhisattwa and Tandon, Niket , year = 2023, pages =. Aligning Language Models to User Opinions , booktitle =

2023

[13] [13]

Annual review of psychology , volume =

Survey Research , author =. Annual review of psychology , volume =

[14] [14]

arXiv preprint arXiv:2408.06929 , eprint =

Evaluating Cultural Adaptability of a Large Language Model via Simulation of Synthetic Personas , author =. arXiv preprint arXiv:2408.06929 , eprint =

arXiv

[15] [15]

Proceedings of the 18th

Liusie, Adian and Manakul, Potsawee and Gales, Mark , year = 2024, pages =. Proceedings of the 18th

2024

[16] [16]

What If Consumer Experiments Impact Variances as Well as Means?

Louviere, Jordan J , year = 2001, journal =. What If Consumer Experiments Impact Variances as Well as Means?

2001

[17] [17]

Beyond Believability:

Lu, Yuxuan and Huang, Jing and Han, Yan and Bei, Bennet and Xie, Yaochen and Wang, Dakuo and Wang, Jessie and He, Qi , year = 2025, journal =. Beyond Believability:. 2503.20749 , archiveprefix =

Pith/arXiv arXiv 2025

[18] [18]

Factual Consistency Evaluation of Summarization in the

Luo, Zheheng and Xie, Qianqian and Ananiadou, Sophia , year = 2024, journal =. Factual Consistency Evaluation of Summarization in the

2024

[19] [19]

Restoring

Qin, Xiaoyou and Li, Zhihong and Cheng, Xiaoxiao , year = 2026, journal =. Restoring. 2604.06663 , archiveprefix =

Pith/arXiv arXiv 2026

[20] [20]

Humanities and Social Sciences Communications , volume =

Performance and Biases of Large Language Models in Public Opinion Simulation , author =. Humanities and Social Sciences Communications , volume =

[21] [21]

Do Large Language Models Show Decision Heuristics Similar to Humans?

Suri, Gaurav and Slater, Lily R and Ziaee, Ali and Nguyen, Morgan , year = 2024, journal =. Do Large Language Models Show Decision Heuristics Similar to Humans?

2024

[22] [22]

Transactions of the Association for Computational Linguistics , volume =

Do Llms Exhibit Human-like Response Biases? A Case Study in Survey Design , author =. Transactions of the Association for Computational Linguistics , volume =

[23] [23]

Aligning Large Language Models with Human:

Wang, Yufei and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Zeng, Xingshan and Huang, Wenyong and Shang, Lifeng and Jiang, Xin and Liu, Qun , year = 2023, journal =. Aligning Large Language Models with Human:. 2307.12966 , archiveprefix =

arXiv 2023

[24] [24]

2501.08579 , archiveprefix =

Wang, Qian and Wu, Jiaying and Jiang, Zichen and Tang, Zhenheng and Luo, Bingqiao and Chen, Nuo and Chen, Wei and He, Bingsheng , year = 2025, journal =. 2501.08579 , archiveprefix =

arXiv 2025

[25] [25]

Worldvaluesbench:

Zhao, Wenlong and Mondal, Debanjan and Tandon, Niket and Dillion, Danica and Gray, Kurt and Gu, Yuling , year = 2024, pages =. Worldvaluesbench:. Proceedings of the 2024

2024

[26] [26]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[27] [27]

Publications Manual , year = "1983", publisher =

1983

[28] [28]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[29] [29]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[30] [30]

Dan Gusfield , title =. 1997

1997

[31] [31]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[32] [32]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =