Digital Twins as Funhouse Mirrors: Five Key Distortions
Pith reviewed 2026-05-18 14:18 UTC · model grok-4.3
The pith
Digital twins built from individual survey answers predict new behavior only modestly better than generic LLMs and display five systematic distortions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Digital twins trained on each individual's prior responses to over 500 questions produce predictions that are only modestly more accurate than those of a homogeneous base LLM and exhibit weak correlation with human responses (average r = 0.20) across 164 diverse outcomes. The models display five systematic distortions: insufficient individuation, stereotyping, representation bias, ideological bias, and hyper-rationality.
What carries the argument
The side-by-side comparison of digital-twin outputs against both human answers and base-LLM outputs, used to surface the five named behavioral distortions.
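The comparison can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function names (`pearson_r`, `mean_abs_error`, `benchmark`), the toy data, and the choice of mean absolute error as the accuracy measure are all assumptions, since this page does not state the paper's exact accuracy metric.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def mean_abs_error(preds, truths):
    """Assumed accuracy measure: lower is better."""
    return mean(abs(p - t) for p, t in zip(preds, truths))


def benchmark(outcomes):
    """outcomes maps an outcome name to person-matched 'human', 'twin', 'base' lists."""
    rows = {}
    for name, d in outcomes.items():
        rows[name] = {
            "twin_r": pearson_r(d["twin"], d["human"]),
            "twin_mae": mean_abs_error(d["twin"], d["human"]),
            "base_mae": mean_abs_error(d["base"], d["human"]),
        }
    # Average the per-outcome correlations, as in a headline "average r".
    avg_r = mean(row["twin_r"] for row in rows.values())
    return rows, avg_r


# Hypothetical toy data: one outcome on a 1-5 scale, four matched people.
toy = {
    "share_intent": {
        "human": [1, 2, 4, 5],
        "twin": [2, 2, 3, 4],
        "base": [3, 3, 3, 3],  # homogeneous base LLM: same answer for everyone
    }
}
rows, avg_r = benchmark(toy)
print(rows, avg_r)
```

On this toy input the twin's mean absolute error is 0.75 against 1.5 for the constant base prediction. Real inputs would be the 164 outcomes from the released dataset, whose layout is not described here.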
If this is right
- Policy and social-science applications should treat digital-twin outputs as provisional until the documented distortions are reduced.
- Development efforts should target better individuation and explicit checks for stereotyping and ideological skew.
- The released dataset and code provide a common testbed that future models must beat on the reported accuracy and correlation benchmarks.
- Aggregate-level simulations may remain useful even if individual-level fidelity stays low.
Where Pith is reading between the lines
- Persistent distortions across base models would suggest limits inherent to current language-model training rather than fixable data shortages.
- The weak individual correlations imply digital twins may work better for estimating population averages than for forecasting single-person decisions.
- Real-world deployment in high-stakes settings such as hiring or misinformation policy would allow direct measurement of whether the listed biases alter outcomes.
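The point about population averages can be illustrated with a small synthetic simulation. Everything here is assumed for illustration: the function `simulate`, the scale midpoint of 3, and the noise levels, which are chosen so that the expected individual-level correlation is 0.20. The sketch also assumes the twin's errors are unbiased noise, which the paper's documented biases would violate, so it shows only the best case for aggregate use.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def simulate(n=5000, seed=0):
    """Generate matched human and twin answers that share a weak individual signal."""
    rng = random.Random(seed)
    humans, twins = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)                      # stable signal shared by both
        humans.append(3 + trait + rng.gauss(0, 2))   # human answer: signal + own noise
        twins.append(3 + trait + rng.gauss(0, 2))    # twin answer: signal + independent noise
    return humans, twins


humans, twins = simulate()
individual_r = pearson_r(twins, humans)          # expected value is 1 / 5 = 0.20
mean_gap = abs(mean(twins) - mean(humans))       # gap between population averages
print(f"individual r = {individual_r:.2f}, gap in means = {mean_gap:.3f}")
```

Under these assumptions the individual-level correlation stays near 0.20 while the two population means nearly coincide. Adding a systematic shift to the twin answers would break the second property, which is why the distortions matter even for aggregate estimates.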
Load-bearing premise
Training on each person's prior answers to more than 500 questions supplies enough information to model that person's choices in new situations.
What would settle it
A replication study on the same or similar outcomes that obtains average correlations above 0.40 between digital-twin predictions and fresh human responses while eliminating the five listed distortions.
Original abstract
Scientists and practitioners are increasingly moving to deploy digital twins--LLM-based models of real individuals--across social science and policy research. We conduct 19 pre-registered studies spanning 164 diverse outcomes (e.g., attitudes toward hiring algorithms, intentions to share misinformation), comparing human responses to those of their corresponding digital twins, which are trained on each individual's prior responses to over 500 questions. We establish an empirical benchmark for digital twin performance: their predictions are only modestly more accurate than those of a homogeneous base LLM and exhibit weak correlation with human responses (average $r = 0.20$). To inform future development, we identify five systematic distortions in digital twin behavior: (i) insufficient individuation, (ii) stereotyping, (iii) representation bias, (iv) ideological bias, and (v) hyper-rationality. Finally, we release our full dataset and code as a standardized testbed for evaluating and improving digital twin methodologies. Together, our findings caution against premature deployment while laying the groundwork for a transparent, replicable, and iterative science of responsible digital twin development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from 19 pre-registered studies spanning 164 diverse outcomes (e.g., attitudes toward hiring algorithms, intentions to share misinformation) that compare human responses to those of LLM-based digital twins trained on each individual's prior responses to over 500 questions. It establishes an empirical benchmark showing that the twins' predictions are only modestly more accurate than those of a homogeneous base LLM and exhibit weak correlation with human responses (average r = 0.20). The authors identify five systematic distortions—insufficient individuation, stereotyping, representation bias, ideological bias, and hyper-rationality—and release the full dataset and code as a standardized testbed, cautioning against premature deployment while calling for responsible development.
Significance. If the results hold, the work supplies a valuable pre-registered benchmark and concrete taxonomy of distortions for the emerging practice of individual-level digital twins in social science and policy. The pre-registered design across diverse outcomes and the public release of the dataset and code are clear strengths that enable reproducibility and iterative improvement; these elements directly support the paper's call for a transparent science of digital twin evaluation.
major comments (1)
- [Results section (and abstract)] The interpretation of the average r = 0.20 as evidence of systematic failure and support for the five distortions (insufficient individuation, stereotyping, etc.) is load-bearing for the central claims and the caution against deployment. The manuscript does not report test-retest reliability, intra-class correlations, or any human consistency baseline for the 164 outcome measures. Without this anchor, it remains possible that r = 0.20 captures most of the reliable human variance rather than demonstrating fundamental limits of the twins.
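The concern can be made concrete with the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two measures' reliabilities. The sketch below is illustrative only: the function names are hypothetical, the reliability values are made up because the manuscript reports none, and treating the twin as perfectly reliable (`rel_twin=1.0`) is an assumption that stochastic LLM outputs may not satisfy.

```python
from math import sqrt


def _check(rel):
    """Reliabilities must lie in (0, 1]."""
    if not 0 < rel <= 1:
        raise ValueError("reliability must be in (0, 1]")


def max_observable_r(rel_human, rel_twin=1.0):
    """Ceiling on the observed correlation under classical test theory."""
    _check(rel_human)
    _check(rel_twin)
    return sqrt(rel_human * rel_twin)


def disattenuated_r(r_obs, rel_human, rel_twin=1.0):
    """Spearman's correction: estimated correlation between error-free scores."""
    return r_obs / max_observable_r(rel_human, rel_twin)


# Hypothetical test-retest reliabilities for the human measures.
for rel in (0.3, 0.5, 0.8):
    ceiling = max_observable_r(rel)
    corrected = disattenuated_r(0.20, rel)
    print(f"reliability={rel:.1f}  ceiling={ceiling:.2f}  corrected r={corrected:.2f}")
```

With a hypothetical human reliability of 0.25, for example, the ceiling would be 0.50 and the observed 0.20 would correspond to a corrected 0.40. Without measured reliabilities, neither reading of r = 0.20 can be ruled out.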
minor comments (2)
- [Abstract] The abstract claims the twins are 'only modestly more accurate' than the base LLM but does not specify the exact accuracy metric (e.g., mean absolute error, classification accuracy) or the statistical test used for the comparison; adding these details would strengthen the benchmark claim.
- [Discussion] The description of the five distortions would benefit from a brief table or explicit mapping to the 164 outcomes so readers can see which distortion is evidenced by which subset of results.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address the single major comment below and have revised the manuscript to incorporate additional discussion of this issue.
Point-by-point responses
- Referee: The interpretation of the average r = 0.20 as evidence of systematic failure and support for the five distortions (insufficient individuation, stereotyping, etc.) is load-bearing for the central claims and the caution against deployment. The manuscript does not report test-retest reliability, intra-class correlations, or any human consistency baseline for the 164 outcome measures. Without this anchor, it remains possible that r = 0.20 captures most of the reliable human variance rather than demonstrating fundamental limits of the twins.
  Authors: We agree that a human consistency baseline would aid interpretation of the reported correlations. Our central claims, however, rest on two additional pillars that do not require such a baseline: (1) the digital twins, despite being trained on more than 500 individual-specific responses, still produce only modest accuracy gains relative to a homogeneous base LLM, and (2) the five distortions are documented through targeted, pre-registered contrasts (e.g., demographic-stratified response gaps for stereotyping and representation bias, and divergence from human ideological patterns). These patterns are observable even if overall correlations partly reflect measurement noise. We have added a dedicated paragraph in the Discussion section acknowledging the absence of test-retest or intra-class correlation estimates as a limitation and outlining how future work could collect repeated measures to establish such benchmarks. We have also tempered language in the Results and abstract to frame r = 0.20 as a comparative benchmark rather than an absolute claim of failure.
  Revision: partial
Circularity Check
No significant circularity; empirical benchmark from held-out comparisons
full rationale
The paper's claims derive from pre-registered empirical comparisons: digital twins trained on each individual's >500 prior responses are evaluated against held-out human answers on 164 outcomes, producing measured accuracy gains over a base LLM and an observed average correlation r=0.20. These quantities are computed directly from the data splits and are not equivalent to the training inputs by construction, nor do they rely on fitted parameters renamed as predictions. The five distortions are identified via post-hoc analysis of the same held-out discrepancies. No self-definitional equations, load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the reported chain. The public dataset and pre-registration further anchor the results externally, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based models trained on an individual's prior survey responses can serve as proxies for that individual's future responses
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We focus on performance measures that leverage this feature of the data. Specifically, for each outcome, we match each twin’s response to its corresponding human’s response and compute two metrics: Individual-level accuracy … Correlation …
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Twins exhibited systematic strengths and weaknesses—performing better in social and personality domains, but worse in political ones
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
  A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
- Adaptive Budget Allocation in LLM-Augmented Surveys
  An adaptive budget allocation algorithm for LLM-augmented surveys learns question-level LLM reliability on the fly from human labels and reduces labeling waste from 10-12% to 2-6% compared to uniform allocation.