LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

Aaron Shaw; Benjamin Mako Hill; Carolyn Q. Zou; Carrie Cai; Jonne Kamphorst; Joon Sung Park; Meredith Ringel Morris; Michael S. Bernstein; Niles Egan; Percy Liang

arXiv: 2411.10109 · v2 · submitted 2024-11-15 · cs.AI · cs.HC · cs.LG


Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, Michael S. Bernstein


Pith reviewed 2026-05-23 17:21 UTC · model grok-4.3


keywords: LLM agents · generative agents · self-report data · individual simulation · General Social Survey · personality traits · behavior prediction · test-retest consistency


The pith

LLM agents grounded in self-report interviews and surveys simulate individual responses to new questions at 82-86% of human test-retest consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can build general-purpose simulations of specific people by feeding them detailed self-report data from interviews or surveys. Using responses from a national sample of Americans, the resulting agents predict answers on held-out survey items nearly as reliably as the original participants do when retested after two weeks. This method outperforms agents that receive only demographic information and also narrows accuracy gaps across racial and ideological groups. The approach requires no task-specific training data for each new outcome, suggesting a route to flexible individual-level behavioral simulation.

Core claim

Agents constructed from two-hour semi-structured interviews, structured surveys including the General Social Survey and Big Five inventory, or both sources combined reach 83%, 82%, and 86% of participants' two-week test-retest consistency on held-out GSS items. Demographics-only agents reach only 74%. The same agents predict personality traits and experimental behaviors at comparable levels and reduce accuracy disparities across racial and ideological groups relative to the demographics baseline.
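The percentages in this claim are ratios: an agent's raw accuracy on held-out items divided by the participant's own two-week test-retest consistency. A minimal sketch of that arithmetic follows, assuming categorical answers, exact-match agreement, and that the agent is scored against the first survey wave. The function names and toy data are illustrative and are not taken from the paper.

```python
from typing import Sequence

def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two answer lists match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def normalized_accuracy(agent: Sequence[str],
                        wave1: Sequence[str],
                        wave2: Sequence[str]) -> float:
    """Agent accuracy against wave 1, divided by the participant's own
    wave-1 vs. wave-2 consistency (treated here as the human ceiling)."""
    ceiling = agreement(wave1, wave2)
    if ceiling == 0:
        raise ValueError("test-retest consistency is zero; ratio undefined")
    return agreement(agent, wave1) / ceiling

# Toy data: 10 held-out items for one hypothetical participant.
wave1 = ["a", "b", "a", "c", "b", "a", "a", "c", "b", "a"]
wave2 = ["a", "b", "a", "c", "b", "a", "b", "c", "a", "a"]  # 8 of 10 stable
agent = ["a", "b", "a", "c", "b", "c", "b", "a", "b", "a"]  # 7 of 10 match wave 1

score = normalized_accuracy(agent, wave1, wave2)
print(f"normalized accuracy: {score:.3f}")  # 0.7 / 0.8
```

Under this reading, a normalized score near 1.0 means the agent agrees with the participant about as often as the participant agrees with their own earlier answers.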

What carries the argument

Person-specific generative agents created by prompting an LLM with an individual's qualitative interview transcripts or quantitative survey responses to produce responses to new questions.
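To make the mechanism concrete, here is a hedged sketch of how grounding data might be assembled into a prompt for a new question. The abstract does not publish prompt templates, so the `SelfReport` container, the `build_agent_prompt` function, and the wording of the prompt are all hypothetical. No model call is made; the sketch only builds the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SelfReport:
    """Hypothetical container for one participant's grounding data."""
    interview_transcript: str = ""
    survey_answers: Dict[str, str] = field(default_factory=dict)

def build_agent_prompt(report: SelfReport, question: str, options: List[str]) -> str:
    """Assemble a prompt asking the model to answer as the participant would.
    The wording is an assumption, not the paper's template."""
    parts = ["You are simulating a specific survey participant."]
    if report.interview_transcript:
        parts.append("Interview transcript:\n" + report.interview_transcript)
    if report.survey_answers:
        lines = [f"- {q}: {a}" for q, a in report.survey_answers.items()]
        parts.append("Prior survey answers:\n" + "\n".join(lines))
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    parts.append(f"New question: {question}\nOptions:\n{numbered}")
    parts.append("Reply with the number of the option this person would choose.")
    return "\n\n".join(parts)

# Placeholder data for one hypothetical participant.
report = SelfReport(
    interview_transcript="Interviewer: (question)\nParticipant: (answer)",
    survey_answers={"Trust in others": "Most people can be trusted"},
)
prompt = build_agent_prompt(report, "Hypothetical held-out item?", ["Agree", "Disagree"])
print(prompt)
```

Leaving either field empty yields an interview-only or survey-only agent, which mirrors the three conditions the paper compares.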

If this is right

Self-report data alone supports simulation of individuals across multiple outcomes without separate training for each task.
Combining interview and survey sources yields higher consistency with human retest reliability than either source alone.
The resulting agents narrow prediction gaps that appear when using only demographic information.
No task-specific labeled data is required beyond the initial self-reports to generate predictions on new items.
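The claim about narrowed prediction gaps can be checked with a simple disparity measure. The sketch below assumes the gap is the difference between the best and worst mean accuracy across groups; whether the paper uses exactly this definition is not stated in the abstract, and the group labels and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

def group_means(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean per-participant accuracy for each group label."""
    by_group = defaultdict(list)
    for group, acc in records:
        by_group[group].append(acc)
    return {g: mean(v) for g, v in by_group.items()}

def accuracy_gap(records: Iterable[Tuple[str, float]]) -> float:
    """Difference between the best- and worst-served group."""
    means = group_means(records)
    return max(means.values()) - min(means.values())

# Invented (group, accuracy) pairs for two hypothetical groups A and B.
demographics_only = [("A", 0.80), ("A", 0.70), ("B", 0.60), ("B", 0.50)]
grounded = [("A", 0.85), ("A", 0.85), ("B", 0.80), ("B", 0.80)]

gap_demo = accuracy_gap(demographics_only)
gap_grounded = accuracy_gap(grounded)
print(f"demographics-only gap: {gap_demo:.2f}, grounded gap: {gap_grounded:.2f}")
```

A smaller gap for the grounded agents than for the demographics-only agents is what the claim predicts.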

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding method could be used to forecast how specific people would respond to proposed policies or interventions before rollout.
Testing whether the agents predict real-world actions outside survey or lab settings would clarify the scope of the simulation claim.
Extending the approach to multi-agent interactions might reveal emergent group-level patterns that single-person simulations cannot capture.

Load-bearing premise

The LLM's outputs from the supplied self-report data reflect stable individual traits that govern answers to unseen questions rather than only the model's pre-trained knowledge or prompt effects.

What would settle it

Agents built from the same self-report data perform no better than demographics-only agents when tested on a fresh set of held-out questions drawn from a different domain or time period.

Original abstract

Machine learning can predict human behavior well when substantial structured data and well-defined outcomes are available, but these models are typically limited to specific outcomes and cannot readily be applied to new domains. We test whether large language models (LLMs) can support a more general-purpose approach by building person-specific simulations (i.e., "generative agents") grounded in self-report data. Using data from a diverse national sample of 1,052 Americans, we build agents from (i) two-hour, semi-structured interviews (elicited using the American Voices Project interview schedule), (ii) structured surveys (the General Social Survey and Big Five personality inventory), or (iii) both sources combined. On held-out General Social Survey items, agent accuracy reached 83% (interview only), 82% (surveys only), and 86% (combined) of participants' two-week test-retest consistency, compared with agents prompted only with individuals' demographics (74%). Agents predicted personality traits and behaviors in experiments with similar accuracy, and reduced disparities in accuracy across racial and ideological groups relative to demographics-only baselines. Together, these results show that LLM agents grounded in rich qualitative or quantitative self-report data can support general-purpose simulation of individuals across outcomes, without requiring task-specific training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM agents grounded in self-reports hit 82-86% of human test-retest on held-out GSS items versus 74% for demographics, but the results do not yet separate faithful individual modeling from LLM priors activated by the prompts.


The main thing here is that agents built from two-hour interviews or from GSS plus Big Five data reach most of a person's own two-week consistency on held-out survey questions, and combining both sources edges it a bit higher. The demographics-only baseline sits noticeably lower, and the approach also narrows gaps across racial and ideological groups on the same items. They further test the agents on personality trait prediction and on behavior in experiments, with similar relative performance to the human benchmark. That is the concrete empirical core on a 1,052-person national sample using American Voices Project interviews plus structured surveys.

The test-retest ceiling is a sensible choice because it avoids claiming the agents are perfect and instead measures how close they come to stable human responding. The scale and the mix of qualitative and quantitative grounding data are the clearest points of novelty relative to earlier agent work. The paper does a reasonable job showing that richer self-report input improves over coarse demographics.

The soft spot is the one flagged in the stress-test note. Nothing in the reported results rules out the possibility that the LLM is matching the detailed self-report text to patterns already in its training distribution rather than constructing a stable person-specific model. The demographics baseline rules out the coarsest version of that problem but leaves finer-grained prompt or prior effects open. The abstract gives no prompt templates, model version, or preprocessing details, so the central claim rests on numbers whose robustness is hard to judge from what is shown. If the full paper includes strong ablations or controls that address this, the picture changes; on the given evidence the concern stands.

This is for computational social scientists or AI groups that want to simulate individuals without task-specific training data. A reader who needs a practical method with real-sample numbers will get something usable even if the mechanism requires more work. The sample size and the external benchmark are enough to merit a serious referee who can check the methods and statistics in detail. I would send it out for peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLM-based generative agents grounded in self-report data—either two-hour interviews, GSS/Big Five surveys, or both—from a national sample of 1,052 Americans can simulate individual responses to held-out GSS items at 83% (interview), 82% (surveys), and 86% (combined) of participants' two-week test-retest consistency, outperforming a demographics-only baseline (74%). Similar accuracy is reported for personality traits and experimental behaviors, with reduced accuracy disparities across racial and ideological groups.

Significance. If the central empirical results hold after addressing transparency and control issues, the work demonstrates a scalable route to general-purpose, person-specific behavioral simulation that does not require outcome-specific training data, extending beyond narrow predictive models in social science.

major comments (2)

[Methods] Methods section: No prompt templates, exact model version (e.g., GPT-4 vs. others), temperature settings, or preprocessing steps for interview transcripts/survey responses are provided. These details are load-bearing for the claim that performance reflects grounding in self-reports rather than prompt-induced artifacts or model priors.
[Results] Results (held-out GSS evaluation): The demographics baseline rules out coarse group-level priors, but the design lacks controls such as permuted self-report data or matched-profile prompts without individual grounding to test whether accuracies derive from stable trait inference versus activation of pre-trained distributional patterns on similar profiles.
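The permuted self-report control requested in the second comment can be sketched as follows: simulate each participant from someone else's self-report and compare accuracy with the correctly matched condition. This is an illustrative outline under stated assumptions, not the authors' procedure. `toy_predict` stands in for the LLM agent, and the participant IDs and data are invented.

```python
import random
from typing import Callable, Dict, List

Report = Dict[str, str]

def derange(ids: List[str], rng: random.Random) -> Dict[str, str]:
    """Assign every participant someone else's self-report (no fixed points)."""
    if len(ids) < 2:
        raise ValueError("need at least two participants to permute")
    while True:
        shuffled = ids[:]
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(ids, shuffled)):
            return dict(zip(ids, shuffled))

def mean_accuracy(
    predict: Callable[[Report, str], str],
    reports: Dict[str, Report],
    truth: Dict[str, Dict[str, str]],
    source_of: Dict[str, str],
) -> float:
    """Exact-match accuracy when participant p is simulated from the
    self-report belonging to source_of[p]."""
    hits = total = 0
    for pid, answers in truth.items():
        report = reports[source_of[pid]]
        for item, answer in answers.items():
            hits += predict(report, item) == answer
            total += 1
    return hits / total

def toy_predict(report: Report, item: str) -> str:
    """Stand-in for the LLM agent: echoes the stance stated in the report."""
    return report["stance"]

# Invented data: three participants with distinct stances on two items.
reports = {"p1": {"stance": "agree"}, "p2": {"stance": "neutral"}, "p3": {"stance": "disagree"}}
truth = {pid: {"q1": r["stance"], "q2": r["stance"]} for pid, r in reports.items()}

ids = sorted(reports)
matched = mean_accuracy(toy_predict, reports, truth, {p: p for p in ids})
permuted_map = derange(ids, random.Random(0))
permuted = mean_accuracy(toy_predict, reports, truth, permuted_map)
print(f"matched: {matched:.2f}, permuted: {permuted:.2f}")
```

If accuracy with real agents barely drops under permutation, the gains would be more plausibly attributed to model priors than to individual grounding. A large drop would support the grounding interpretation.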

minor comments (1)

[Abstract] Abstract and §4: Clarify the exact subset of GSS items used for the held-out test and the precise two-week test-retest protocol (number of items, participant overlap) to allow direct comparison with the agent accuracies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that enhance transparency and add controls where feasible.


Referee: [Methods] Methods section: No prompt templates, exact model version (e.g., GPT-4 vs. others), temperature settings, or preprocessing steps for interview transcripts/survey responses are provided. These details are load-bearing for the claim that performance reflects grounding in self-reports rather than prompt-induced artifacts or model priors.

Authors: We agree these details are essential for reproducibility. The revised manuscript will include the exact model (GPT-4), temperature (0.7), full prompt templates for agent responses, and preprocessing steps for transcripts and surveys. This will allow verification that results stem from self-report grounding. revision: yes
Referee: [Results] Results (held-out GSS evaluation): The demographics baseline rules out coarse group-level priors, but the design lacks controls such as permuted self-report data or matched-profile prompts without individual grounding to test whether accuracies derive from stable trait inference versus activation of pre-trained distributional patterns on similar profiles.

Authors: We acknowledge the value of stronger controls. In revision we will add a permuted self-report ablation on a subset of items to isolate individual grounding effects, and clarify how the demographics baseline differs from matched-profile prompts. This addresses the concern without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on held-out data and external benchmarks


The paper reports an empirical study in which generative agents are constructed directly from provided self-report data (interviews or surveys) and evaluated on held-out GSS items against an independent human two-week test-retest consistency benchmark and a demographics-only control. No equations, fitted parameters, or self-citations are invoked to derive the reported accuracies; the central claims are measured outcomes on external data rather than quantities defined by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can faithfully simulate stable individual traits from biographical text without additional task-specific training or post-hoc parameter fitting.

axioms (1)

Domain assumption: LLMs prompted with self-report data can produce responses that reflect the stable traits of the source individual rather than model priors alone.
Invoked to justify building and evaluating person-specific agents.

pith-pipeline@v0.9.0 · 5799 in / 1161 out tokens · 50859 ms · 2026-05-23T17:21:25.604215+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "generative agents... grounded in self-report data... normalized accuracy of 0.85... outperforming demographic-based agents (0.71)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "interview-based generative agents... Big Five... economic games"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Narrative Sharpens Gender Gaps: Surveying Film Characters with LLM Agents
cs.HC 2026-05 unverdicted novelty 7.0

LLM agents built from movie scripts reproduce and exaggerate real-world gender attitude gaps, indicating that film narratives sharpen rather than smooth gender contrasts.
From Role to Person: Trust Calibration Challenges in Twin Agents
cs.HC 2026-05 unverdicted novelty 7.0

Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
cs.AI 2026-05 conditional novelty 7.0

SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
cs.AI 2026-05 unverdicted novelty 7.0

ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed...
Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors
cs.CL 2026-05 unverdicted novelty 7.0

A clustering and divergence method reveals a large distributional gap between real and LLM-simulated user behaviors on coding and writing tasks, partially closed by combining complementary simulators.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 7.0

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
WhatIf: Interactive Exploration of LLM-Powered Social Simulations for Policy Reasoning
cs.HC 2026-04 unverdicted novelty 7.0

WhatIf provides an interactive platform for real-time exploration of LLM-driven social simulations, enabling policymakers to iteratively test plans, reflect on assumptions, and uncover vulnerabilities in emergency pre...
IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
cs.SI 2026-04 unverdicted novelty 7.0

IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
Text-Based Personas for Simulating User Privacy Decisions
cs.CR 2026-03 unverdicted novelty 7.0

Narriva generates behavior-grounded text personas from survey data that achieve up to 87% accuracy in predicting privacy decisions, improve 6-17 points over baselines, cut tokens by 80-95%, and reproduce aggregate dis...
Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC 2025-09 conditional novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
cs.AI 2025-08 unverdicted novelty 7.0

ChatCLIDS creates a library of expert-validated virtual patients and tests LLM agents using evidence-based persuasive strategies in simulated longitudinal and adversarial health counseling sessions for closed-loop ins...
You Can't Fool Us: Understanding the Resilience of LLM-driven Agent Communities to Misinformation
cs.CY 2026-05 unverdicted novelty 6.0

LLM agent simulations show higher actively open-minded thinking boosts resistance to and recovery from misinformation while ideological moderation supports more reliable correction than polarization.
SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
cs.AI 2026-05 unverdicted novelty 6.0

SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignm...
PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior
cs.CR 2026-05 unverdicted novelty 6.0

PrivacySIM shows that conditioning LLMs on user personas like demographics and attitudes improves simulation of privacy choices but reaches only 40.4% accuracy against real responses from 1,000 users.
Post-training makes large language models less human-like
cs.CL 2026-05 unverdicted novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 6.0

PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
The Collapse of Heterogeneity in Silicon Philosophers
cs.CY 2026-04 unverdicted novelty 6.0

Large language models collapse philosophical heterogeneity by over-correlating judgments across domains, creating artificial consensus unlike the views of 277 professional philosophers.
CHORUS: An Agentic Framework for Generating Realistic Deliberation Data
cs.AI 2026-04 unverdicted novelty 6.0

Chorus generates realistic deliberation discussions via LLM agents with memory and Poisson-timed participation, validated by 30 experts on realism, coherence, and utility.
Behavioral Transfer in AI Agents: Evidence and Privacy Implications
econ.GN 2026-04 unverdicted novelty 6.0

AI agents on Moltbook reflect the specific behavioral traits of their linked human owners across multiple dimensions, with stronger transfer linked to greater privacy risks.
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
cs.CL 2026-04 unverdicted novelty 6.0

Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...
Explicit Trait Inference for Multi-Agent Coordination
cs.AI 2026-04 unverdicted novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
Can LLM Agents Simulate Dynamic Networks? A Case Study on Email Networks with Phishing Synthesis
cs.SI 2026-03 unverdicted novelty 6.0

LLM multi-agent systems augmented with data-driven event triggers and Hawkes processes simulate both micro-level interactions and macroscopic topologies in dynamic email networks for realistic phishing synthesis.
Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice
cs.HC 2026-03 conditional novelty 6.0

Agora uses AI to ground policy discussions in real human voices and a small study shows it improves users' perspective-taking compared to numerical summaries alone.
StreetDesignAI: Broadening Designer Perspectives Through Multi-Persona Evaluation of Cycling Infrastructure
cs.HC 2026-01 unverdicted novelty 6.0

StreetDesignAI uses multi-persona AI feedback to help designers identify and resolve experiential conflicts in cycling infrastructure through structured evaluation and iterative modification.
StreetDesignAI: Broadening Designer Perspectives Through Multi-Persona Evaluation of Cycling Infrastructure
cs.HC 2026-01 unverdicted novelty 6.0

StreetDesignAI provides structured multi-persona feedback on cycling designs and a user study shows it broadens designers' grasp of diverse cyclist perspectives and improves design decision confidence.
Graph-Based Alternatives to LLMs for Human Simulation
cs.CL 2025-11 conditional novelty 6.0

GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three ...
Synthia: Scalable Grounded Persona Generation from Social Media Data
cs.CL 2025-07 unverdicted novelty 6.0

Synthia creates scalable personas from Bluesky posts that better match human survey responses than prior methods, uses smaller models, and retains social network structure for network-aware analysis.
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
cs.MA 2025-07 accept novelty 6.0

TinyTroupe provides a toolkit for fine-grained persona-based LLM multi-agent simulations with built-in support for population sampling, experimentation, and validation.
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
cs.CL 2025-02 unverdicted novelty 6.0

Fine-tuning LLMs on the SubPOP dataset of 3,362 questions and 70K pairs reduces the gap between LLM predictions and human survey responses by up to 46% and generalizes to unseen surveys and subpopulations.
AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society
cs.SI 2025-02 unverdicted novelty 6.0

AgentSociety is a large-scale LLM agent-based social simulator validated on polarization, UBI, disasters, and sustainability issues with alignment to real experiments.
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
cs.CL 2026-05 unverdicted novelty 5.0

Expert alignment in subjective LLM evaluations is difficult because expert judgments are heterogeneous, partly tacit, dimension-dependent, and temporally unstable.
Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
cs.CL 2026-04 conditional novelty 5.0

Persona prompting creates stable but minimally differentiated LLM behavior on urban sentiment tasks, with a no-persona baseline frequently matching or exceeding persona-conditioned agreement to human labels.
From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes
cs.CY 2026-04 conditional novelty 5.0

Demographic-only LLM agents for retirement survey prediction exhibit central tendency bias, fail to reproduce incorrect or 'don't know' answers, and miss factor interactions in regressions, unlike survey-anchored agents.
JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew
cs.CL 2026-04 unverdicted novelty 5.0

A pipeline using causal language modeling and synthetic instruction-tuning personalizes LLMs to replicate individual Hebrew judges' reasoning, outperforming baselines on similarity metrics with outputs indistinguishab...
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
cs.CL 2026-04 unverdicted novelty 5.0

In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
AI and Collective Decisions: Strengthening Legitimacy and Losers' Consent
cs.HC 2026-04 unverdicted novelty 5.0

An AI system that elicits personal experiences and visualizes policy support increased perceived legitimacy and perspective-taking in collective decisions despite unfavorable outcomes.
Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities
cs.HC 2026-03 unverdicted novelty 5.0

Frontier LLMs homogenize toward systematic and analytical personalities, suppressing emotional traits like remorseful or sycophantic, indicating an implicit consensus on optimal assistant behavior.
When AI Agents Learn from Each Other: Insights from Emergent AI Agent Communities on OpenClaw for Human-AI Partnership in Education
cs.CY 2026-03 unverdicted novelty 5.0

Qualitative observations of over 167,000 AI agents in open platforms reveal emergent peer learning, shared memory architectures, and trust dynamics that can inform multi-agent educational AI design.
Simulating Online Social Media Conversations on Controversial Topics Using AI Agents Calibrated on Real-World Data
cs.SI 2025-09 conditional novelty 5.0

LLM agents calibrated on Italian election data produce coherent posts and realistic network structure but show less tone and toxicity variation than real users, with opinion changes resembling traditional mathematical models.
Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation
cs.AI 2025-09 conditional novelty 5.0

Introduces PAS and FAS task abstractions plus the LLM-S^3 benchmark to evaluate LLMs on generating sociodemographic survey responses across 11 real datasets and multiple models.
The Rise of AI Companions: Interaction with AI Companions and Psychological Well-being
cs.HC 2025-06 conditional novelty 5.0

Survey and chat data from CharacterAI users link companionship-focused AI use to lower well-being, with stronger ties for users who have small offline networks and engage intensively or disclosively.
AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations
cs.MA 2025-04 unverdicted novelty 5.0

AgentDynEx introduces nudging and a Configuration Matrix to help set up and maintain balanced mechanics and dynamics in multi-agent LLM simulations.
Can LLMs Emulate Human Belief Dynamics?
cs.SI 2026-05 unverdicted novelty 4.0

LLMs fail to emulate human belief dynamics: they mismatch initial distributions and show higher conformity than humans in network interactions.
Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
cs.CL 2026-04 unverdicted novelty 4.0

Persona prompting in multimodal LLMs for urban sentiment yields high within-persona stability but limited cross-persona variation, with no-persona models often matching or exceeding persona-conditioned agreement to hu...
Network Effects and Agreement Drift in LLM Debates
cs.SI 2026-04 unverdicted novelty 4.0

LLM agents in controlled network debates show agreement drift toward specific opinion positions, requiring separation of structural effects from LLM biases before using them as human behavioral proxies.
We Need Strong Preconditions For Using Simulations In Policy
cs.CY 2026-04 unverdicted novelty 4.0

Societal-scale LLM agent simulations for policy need three preconditions: avoid neutral treatment of marginalized population simulations, require population participation, ensure accountability, plus development and d...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 43 Pith papers

[1]

E.Bruch,J.Atwell,Agent-BasedModelsinEmpiricalSocialResearch.Sociological Methods&Research44,186-221(2015)

work page 2015
[2]

C.Schelling,Dynamicmodelsofsegregation.JournalofMathematicalSociology1, 143-186(1971)

T. C.Schelling,Dynamicmodelsofsegregation.JournalofMathematicalSociology1, 143-186(1971)

work page 1971
[3]

J.M.Epstein,R.L.Axtell,GrowingArtificialSocieties:SocialSciencefromtheBottom Up(TheMITPress,1996)

work page 1996
[4]

Whyagents?Onthevariedmotivationsforagentcomputinginthesocial sciences

R.Axtell,"Whyagents?Onthevariedmotivationsforagentcomputinginthesocial sciences"(CenteronSocialandEconomicDynamicsWorkingPaperNo.17,2000)

work page 2000
[5]

OrganizationScience3, 20-46(1992)

K.M.Carley, Organizationallearningandpersonnelturnover. OrganizationScience3, 20-46(1992)

work page 1992
[6]

JournalofArtificialSocietiesandSocial Simulation5,3(2002)

I.S.Lustick,PS-I:Auser-friendlyagent-basedmodelingplatformfortestingtheoriesof politicalidentityandpoliticalstability. JournalofArtificialSocietiesandSocial Simulation5,3(2002)

work page 2002
[7]

C.Schelling,MicromotivesandMacrobehavior(W

T. C.Schelling,MicromotivesandMacrobehavior(W. W. Norton&Company, 1978)

work page 1978
[8]

E.Bonabeau,Agent-basedmodeling:Methodsandtechniquesforsimulatinghuman systems.Proc.Natl.Acad.Sci.U.S.A.99(suppl.3),7280-7287(2002); https://doi.org/10.1073/pnas.082080899

work page doi:10.1073/pnas.082080899 2002
[9]

Largelanguagemodelsassimulatedeconomicagents:Whatcanwelearn fromhomosilicus?

M.W. Macy, R.Willer, FromFactorstoActors:ComputationalSociologyand Agent-BasedModeling.Annu.Rev. Sociol.28,143-166(2002); https://doi.org/10.1146/annurev.soc.28.110601.141117 10.J.vonNeumann,O.Morgenstern,TheoryofGamesandEconomicBehavior(Princeton UniversityPress,1944). 11.McFadden,D.(1974).Conditionallogitanalysisofqualitativechoicebehavior. InP. Zarembk...

work page doi:10.1146/annurev.soc.28.110601.141117 2002
[10]

aim to simulate how [they] might behave in specific situations or respond to certain survey questions,

Constructing the Agent Bank. We created over 1,000 generative agents, each modeling a real individual in the U.S., collectively forming a representative sample of the U.S. population. To achieve this, we recruited a stratified sample of 1,052 individuals from the U.S. and conducted two-hour voice-to-voice interviews using an AI interviewer (SM2). In addition, we collected each participant's responses to a serie...

[11]

I was born in New Hampshire… I really enjoyed nature there,

Creating the AI Interviewer Agent. To ensure the high quality and consistency of the rich training data needed for creating generative agents, we developed an AI interviewer agent to conduct semi-structured interviews with study participants. We sought interviews rather than surveys because we anticipated that interviews could yield more comprehensive and nuanced information, enabling the creation of genera...

[12]

Assess the interview progress by reasoning step by step -- what did the interviewee say so far, and in your view, what would count as the interview objective being achieved? Write a short (3~4 sentences) assessment on whether the interview objective is being achieved. While staying on the current topic, what kind of follow-up questions should the interviewer further ask the interviewee to better achieve...

[13]

Author the interviewer's next utterance. To not go too far astray from the interview objective, author a follow-up question that would better achieve the interview objective. On average, with this implementation, our AI interviewer agent spoke 5372.59 (std=2406.12) words during the interview, asking on average 81.71 (std=54.39) follow-up questions from 99 scripted questions, to which our participants r...

[14]

memory stream

Generative Agent Architecture. Generative agents are software systems that simulate human behavior, powered by a language model augmented with a set of memories to define their behaviors (14, 15). These memories, stored in a database (or "memory stream") in text form, are retrieved as needed to generate the agent's behaviors using a language model. This is paired with a reflection module that synthesizes these m...

[15]

treatments

Surveys and Experimental Constructs. To evaluate the fidelity of our generative agents, we aimed to assess their predictive accuracy regarding the attitudes and behaviors of the underlying sample across surveys and experimental constructs from a broad array of social scientific disciplines and methods. To operationalize this, we identified four existing constructs commonly deployed in the social sciences. I...

[16]

Evaluation Methods. Given our survey and experimental constructs, we set out to evaluate the predictive power of the generative agents. In this section, we describe the metrics and evaluation methods used for this purpose. The individual subsection headers in this section are organized to match the presentation in the main document. Study 1. Predicting Individuals’ Attitudes and Behaviors. To determine whether...

[17]

We use participants' responses from the first phase of participation to assess the accuracy rate of our agents' predictions: the number of answers predicted correctly over the total number of questions

[18]

We use the second phase of participation to assess individuals' rate of internal consistency: the participants' rate of prediction on the battery of surveys and experiments used in this study

[19]

We then calculate the normalized accuracy as follows:

$$\text{normalized accuracy} = \frac{\text{agent's prediction accuracy}}{\text{internal consistency}}$$

Conceptually, a normalized accuracy of 1.0 means that the generative agent predicts the individual's responses as accurately as the person replicates their own responses two weeks later. The diverse response types in our surveys and experimental constructs present a challenge ...

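The normalized accuracy defined above can be illustrated with a minimal sketch. The function names and the toy answers are hypothetical and do not come from the paper; the sketch assumes that both the agent's predictions and the participant's second-phase answers are scored against the first-phase answers.

```python
def accuracy_rate(predicted, actual):
    """Share of questions on which two equal-length answer lists agree."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("answer lists must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def normalized_accuracy(agent_answers, phase1_answers, phase2_answers):
    """Agent accuracy divided by the participant's own test-retest agreement.

    Assumption: phase 1 is the reference for both quantities.
    """
    agent_acc = accuracy_rate(agent_answers, phase1_answers)
    consistency = accuracy_rate(phase2_answers, phase1_answers)
    if consistency == 0:
        raise ValueError("internal consistency is zero; ratio is undefined")
    return agent_acc / consistency


# Toy data (hypothetical): five categorical survey items.
phase1 = ["agree", "disagree", "agree", "neutral", "agree"]
phase2 = ["agree", "disagree", "neutral", "neutral", "agree"]  # matches 4 of 5
agent = ["agree", "agree", "neutral", "neutral", "agree"]      # matches 3 of 5

print(normalized_accuracy(agent, phase1, phase2))  # approximately 0.75
```

A value above 1.0 is possible under this definition whenever the agent matches the first-phase answers more often than the participant's own retest does.
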
[20]

Report metrics appropriate for each response type (e.g., accuracy rate for categorical, MAE for numerical).

[21]

Ensure metric interpretability across different community norms (e.g., accuracy/MAE in machine learning literature, correlation in social science literature)

[22]

For the average person, how accurate are the generative agents?

Provide a metric allowing comparison across different constructs. To meet these criteria, we report accuracy rates for evaluation constructs with categorical-ordinal response types, MAE for numerical response types, and Pearson correlation coefficient as a metric comparable across constructs. Additionally, we present these metrics alongside the normalized accuracy to provide a comprehensive evaluatio...

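The choice of metric by response type described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the names `score`, `mean_absolute_error`, and `pearson_r`, and the response-type labels, are hypothetical.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between numerical predictions and answers."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation coefficient, used as the cross-construct metric."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def score(response_type, predicted, actual):
    """Return (metric name, value) for the given response type."""
    if response_type == "categorical":  # covers categorical and ordinal items
        matches = sum(p == a for p, a in zip(predicted, actual))
        return ("accuracy", matches / len(actual))
    if response_type == "numerical":
        return ("mae", mean_absolute_error(predicted, actual))
    raise ValueError(f"unknown response type: {response_type}")


print(score("categorical", ["a", "b"], ["a", "c"]))
print(score("numerical", [1, 2, 3], [1, 3, 5]))
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

Keeping the correlation separate from `score` mirrors the text's distinction between a metric suited to each response type and one that is comparable across constructs.
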
[23]

Applying Fisher's z-transformation to each correlation coefficient: $z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$

[24]

Calculating the average of the z-values

[25]

Your age,

Applying the inverse Fisher's z-transformation to the average z-value: $r = \tanh(z)$. Below, we describe in more detail the evaluation methods and our reporting strategies for the individual constructs. The General Social Survey. The subset of the core module of the GSS that fits our inclusion criteria, as described in the prior section, includes 183 questions, most of which (177) are categorical or ordinal (cat...

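The three Fisher steps listed above (transform each correlation, average the z-values, transform back) can be sketched in a few lines. The function names are hypothetical, and the sample correlations are made up for illustration.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation: z = 0.5 * ln((1 + r) / (1 - r))."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1 + r) / (1 - r))


def average_correlation(correlations):
    """Average correlations in z-space, then map the mean back with tanh."""
    if not correlations:
        raise ValueError("need at least one correlation coefficient")
    z_values = [fisher_z(r) for r in correlations]
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)


# Hypothetical per-participant correlations.
print(average_correlation([0.2, 0.8]))
```

Averaging in z-space gives a result that differs from the arithmetic mean of the raw coefficients: for 0.2 and 0.8 it lies above 0.5, because the transformation stretches values near the ends of the range.
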
[26]

Our main analysis, conducted with all agents in our agent bank and detailed in the main article, compares the predictive performance of interview-based generative agents against agents created using the known practices from recent literature that studied human behavioral simulations with language models

[27]

childhood_town

Exploratory analyses conducted on a random subset of 100 agents in the agent bank, investigating a broader range of design spaces. This includes examining generative agents with interview lesions and agents informed by survey data instead of interview data. The main analysis aims to establish a baseline for predictive performance grounded in prior literature and evaluate whether our agent architecture surpasse...

[28]

Supplementary Results. In this section, we present a higher-level interpretation of supplementary results that, while not central to our main findings, offer valuable insights. The methods are as outlined in the previous section, and detailed tables of these results can be found in Section 8. Numerical GSS Prediction Results. In addition to our predictive performance on the primary categorical questions of the Gene...

[29]

Research Access for the Agent Bank. In this section, we outline a framework that defines the key elements of our research access and present a plan for providing scientific access to the agent bank. Access to the agent bank offers value to the scientific community, with important implications for two key domains: ● In social science, agents from the agent bank can be used to develop simulations involving individual o...

[30]

Supplementary Tables.
Age: 18 to 24, 11.03%; 25 to 34, 13.88%; 35 to 44, 17.49%; 45 to 54, 19.77%; 55 to 64, 21.48%; 65 to 74, 13.50%; 75 or more, 2.85%.
Census division: New England, 6.65%; Middle Atlantic, 12.83%; E.N. Central, 18.73%; W.N. Central, 8.08%; South Atlantic, 10.08%; E.S. Central, 11.5%; W.S. Central, 8.65%; Mountain, 5.13%; Pacific, 15.78%; Foreign, 2.57%.
Education: Less than high school graduat...

