Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Hui Liu; Kaiqi Yang; Sanguk Lee; Tai-Quan Peng

arxiv: 2606.03137 · v2 · pith:AAYXPVHRnew · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-07-02 23:08 UTC · model grok-4.3

keywords: multi-agent simulation, LLM agents, internal evaluation, opinion dynamics, social simulation, turn allocation, dissonance appraisal, silence pressure

The pith

TBS separates agents' private internal evaluations from public utterances in multi-agent simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TBS as an interval-based framework where agents first update five structured internal states from shared history and memory before any public speech occurs. These states track dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The system then resolves competing intentions through an orchestrator to produce one public utterance per turn. Results from climate policy town hall simulations show that the internal traces change systematically with turn allocation, silence rules, and memory conditions, with dissonance appraisal raising speaking willingness and silence pressure lowering it. Once intention forms, turn-allocation rules determine the final public expression. This separation lets researchers examine the usually hidden pathway from private appraisal to observable dialogue.

Core claim

TBS has every agent update five internal states at each interval based on the dialogue history and its own memory, then passes the resulting willingness-to-speak values to an orchestrator that selects and commits one utterance to the shared record. In the evaluated town-hall runs, the resulting internal-state traces remain coherent and differ predictably across turn-allocation, silence, and memory conditions; dissonance-related appraisal raises willingness to speak while silence-pressure appraisal lowers it; once an agent forms a speaking intention, turn-allocation rules become the dominant factor shaping what is actually expressed publicly.

What carries the argument

Interval-based update of five structured internal states (dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, willingness to speak) followed by orchestrator resolution of speaking intentions into public utterances.
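The think-then-speak loop described here can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the `Agent` and `InternalState` names, the scalar `stance`, and especially the arithmetic in `update_state`, which stands in for the LLM-driven update and simply hard-codes the reported directions (dissonance raises willingness, isolation risk lowers it). Private memory is omitted for brevity.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class InternalState:
    dissonance: float = 0.0        # dissonance-related appraisal
    opinion_climate: float = 0.0   # perceived opinion climate
    isolation_risk: float = 0.0    # perceived isolation risk
    strategy: str = "stay_silent"  # response strategy
    willingness: float = 0.0       # willingness to speak


@dataclass
class Agent:
    name: str
    stance: float  # hypothetical scalar position in [-1, 1]
    state: InternalState = field(default_factory=InternalState)

    def update_state(self, history):
        """Toy stand-in for the LLM-driven private update (assumption)."""
        stances = [s for _, s in history]
        last = stances[-1] if stances else self.stance
        climate = mean(stances) if stances else self.stance
        dissonance = abs(self.stance - last) / 2
        isolation = abs(self.stance - climate) / 2
        willingness = min(1.0, max(0.0, 0.5 + 0.5 * dissonance - 0.5 * isolation))
        strategy = "speak" if willingness >= 0.5 else "stay_silent"
        self.state = InternalState(dissonance, climate, isolation, strategy, willingness)


def orchestrate(agents):
    """Pick one speaker among willing agents; ties go to the alphabetically first name."""
    candidates = [a for a in agents if a.state.strategy == "speak"]
    if not candidates:
        return None
    return min(candidates, key=lambda a: (-a.state.willingness, a.name))


def run_interval(agents, history):
    """One interval: every agent thinks first, then at most one utterance is committed."""
    for agent in agents:
        agent.update_state(history)
    speaker = orchestrate(agents)
    if speaker is not None:
        history.append((speaker.name, speaker.stance))
    return speaker
```

The point of the split is that `update_state` runs for every agent on every interval, so the internal traces exist even for agents who never reach the public record.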

If this is right

Internal evaluation and public expression co-evolve over successive intervals in the simulation.
Dissonance-related appraisal increases agents' willingness to speak.
Silence-pressure appraisal decreases agents' willingness to speak.
Once speaking intention is formed, turn-allocation rules primarily determine which utterance reaches the public record.
The framework makes the full pathway from private appraisal to public speech observable and analyzable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support controlled tests of classic opinion-formation theories such as the spiral of silence by varying the internal appraisal parameters.
Policy-deliberation simulations could become more transparent by exposing how perceived isolation risk influences who stays silent.
Different memory-update rules could be compared directly to isolate their effect on long-term opinion stability in the same agent population.

Load-bearing premise

Large language models can reliably update the five structured internal states so that the values reflect the intended psychological constructs rather than prompt artifacts or model idiosyncrasies.

What would settle it

Internal-state traces that fail to vary systematically across turn-allocation, silence, and memory conditions, or that show no positive link between higher dissonance-related appraisal and higher willingness to speak, would indicate the framework does not produce faithful internal evaluations.
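A rough sketch of how such a check could be run over logged traces follows. The record layout (`condition`, `dissonance`, `willingness`), the function names, and the `min_gap` threshold are all hypothetical, and a plain Pearson correlation is used only as a simple stand-in for whatever model the authors fit.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def settle_check(records, min_gap=0.05):
    """Summarise the two signals named in the text.

    records: list of dicts with keys 'condition', 'dissonance', 'willingness'
    min_gap: hypothetical threshold for calling condition means 'different'
    """
    r = pearson([rec["dissonance"] for rec in records],
                [rec["willingness"] for rec in records])
    by_condition = defaultdict(list)
    for rec in records:
        by_condition[rec["condition"]].append(rec["willingness"])
    condition_means = {c: mean(v) for c, v in by_condition.items()}
    spread = max(condition_means.values()) - min(condition_means.values())
    return {
        "dissonance_willingness_r": r,
        "positive_link": r > 0,
        "condition_spread": spread,
        "varies_by_condition": spread >= min_gap,
    }
```

A result with `positive_link` false, or `varies_by_condition` false across turn-allocation, silence, and memory conditions, would be the kind of evidence this section describes. A real analysis would also need uncertainty estimates, which this sketch leaves out.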

Figures

Figures reproduced from arXiv: 2606.03137 by Hui Liu, Kaiqi Yang, Sanguk Lee, Tai-Quan Peng.

**Figure 1.** Framework of Hierarchical Multi-agent System. Sequential Turn-taking: In sequential turn-taking, each round ri consists of Ai intervals, as each agent speaks per interval. For the selected agent α, the generation is given by: XGroupContext = …

**Figure 1.** Framework of TBS System. for qualitative robustness checks, but we leave systematic cross-backbone evaluation to future work. Turning Mode. To control the allocation of dialogue intervals, following prior work, we deploy two turn-allocation modes. The willing mode maintains an open discussion setting, where agents autonomously apply for speaking opportunities. When multiple agents express willingness to sp…

**Figure 2.** Framework of Sequential Multi-agent System. 4 TBS: Efficient Time-Aware Social Simulation. In this section, we introduce TBS, a flexible multi-agent framework that manages agents’ speaking and thinking through a controllable interaction pipeline. The framework supports interval-based interaction, continuous internal reasoning, and conflict-resolved speaking allocation. We first describe the agent design, …

**Figure 2.** Framework of Hierarchical Multi-agent System.

**Figure 3.** Framework of TBS System. 5 Experiments. In this section, we describe the setup and settings of experiments. To evaluate the framework, we run simulation with societal-important topics with real human profiles, and analyze the generated discussion logs to present the key features from the views of dialogue and communication studies. 5.1 Experimental Setup. We use a town hall discussion as the task scenario an…

**Figure 3.** Framework of Sequential Multi-agent System. emerges from internal evaluation rather than from fixed turn order or immediate reactive response alone. The outcome was whether an agent wanted to speak at a given interval. The logistic mixed-effects model included the dissonance index, the silence-pressure index, centered interval, persona ecology, turn-allocation rule, Force Speak setting, and memory mode. T…

Original abstract

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TBS gives a workable separation of internal states from public speech in LLM multi-agent sims, but the results hinge on untested state updates with no supporting details.

The main takeaway is that this paper builds TBS, an interval-based setup where agents maintain five structured internal states before any utterance is generated. The orchestrator then picks one speaker based on those states. That split between private evaluation and public output is a clear addition to existing dialogue frameworks.

It does a decent job laying out how the states evolve with shared history and memory, then showing that traces shift across turn-allocation, silence, and memory conditions. The reported pattern—that dissonance appraisal raises willingness to speak while silence pressure lowers it, with turn rules dominating once intention forms—follows logically from the design if the states behave as intended.

The soft spot is exactly the one the reader flagged: nothing shows that the LLM populates those states in a way that tracks the psychological constructs rather than prompt artifacts. The abstract supplies no methods, no consistency checks, no error bars, and no tests, so the systematic variation and appraisal effects cannot be evaluated. That assumption carries the whole claim.

This is for people already working on LLM social simulations who need more internal visibility than standard turn-taking setups provide. A reader focused on mechanism-sensitive modeling would find the architecture useful even if the specific results stay provisional.

It deserves peer review. The framework is simple enough to implement and directly targets an observability gap, so referees should see the full implementation and any validation steps.

Referee Report

2 major / 1 minor

Summary. The paper introduces TBS (Think-Before-Speak), an interval-based multi-agent simulation framework using LLMs. Agents maintain and update five structured internal states (dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, willingness to speak) from shared dialogue history and private memory at each interval. An orchestrator resolves competing speaking intentions to produce public utterances. In simulated town-hall discussions on climate policy, the framework yields coherent internal-state traces that vary systematically with turn-allocation, silence, and memory conditions; dissonance appraisal raises willingness to speak while silence-pressure appraisal lowers it; once intention forms, public expression is governed primarily by turn-allocation rules.

Significance. If the internal-state updates can be shown to track the intended psychological constructs rather than prompt artifacts, TBS would supply a mechanism-sensitive instrument for studying how private evaluation translates into public expression in social simulations—an observable pathway that most existing turn-exchange frameworks leave opaque. The approach directly addresses a recognized limitation in LLM multi-agent work and could support falsifiable experiments on opinion dynamics.

major comments (2)

[§4] §4 (Experimental Setup / State Update Mechanism): The five internal states are updated exclusively through LLM prompts, yet the manuscript provides no validation—neither consistency checks across prompt paraphrases, nor ablation on prompt sensitivity, nor any external anchoring to human judgments. This assumption is load-bearing for every reported result on systematic variation and appraisal effects.
[§5] §5 (Results): Claims of 'coherent internal-state traces' and 'systematic variation' across conditions, as well as the directional effects of dissonance and silence-pressure appraisals, are presented without statistical tests, effect sizes, confidence intervals, or even basic quantitative metrics; only qualitative descriptions appear to be supplied.
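For concreteness, the paraphrase consistency check raised in the first major comment could look something like the sketch below. It assumes, hypothetically, that the same agent-interval items have been re-run under several paraphrased update prompts and that each run yields a categorical output such as the response strategy; the function name and data layout are illustrative only.

```python
from itertools import combinations


def pairwise_agreement(variants):
    """Agreement rate for every pair of prompt variants.

    variants: dict mapping a variant name to a list of categorical outputs,
              aligned so that index i refers to the same agent-interval item.
    """
    names = sorted(variants)
    lengths = {len(variants[name]) for name in names}
    if len(names) < 2:
        raise ValueError("need at least two prompt variants")
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("variants must be non-empty and aligned item by item")
    rates = {}
    for a, b in combinations(names, 2):
        matches = sum(x == y for x, y in zip(variants[a], variants[b]))
        rates[(a, b)] = matches / len(variants[a])
    return rates
```

Raw agreement does not correct for chance, so a chance-corrected statistic would be a natural next step if the categories are unbalanced.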

minor comments (1)

[§3] Notation for the five internal states is introduced in the abstract and §3 but never given explicit formal definitions or update equations; a short table or pseudocode block would improve clarity.
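As an illustration of what an explicit definition might look like, here is a small schema sketch for the five states. The field names, the [0, 1] ranges, the strategy vocabulary, and the JSON layout are assumptions made for this example; the paper's own definitions may differ.

```python
import json
from dataclasses import dataclass

# Hypothetical vocabulary for the response-strategy state.
STRATEGIES = {"speak", "hedge", "stay_silent"}


@dataclass(frozen=True)
class InternalState:
    dissonance: float            # dissonance-related appraisal, assumed in [0, 1]
    opinion_climate: float       # perceived opinion climate, assumed in [0, 1]
    isolation_risk: float        # perceived isolation risk, assumed in [0, 1]
    response_strategy: str       # one of STRATEGIES
    willingness_to_speak: float  # assumed in [0, 1]

    def __post_init__(self):
        for name in ("dissonance", "opinion_climate",
                     "isolation_risk", "willingness_to_speak"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        if self.response_strategy not in STRATEGIES:
            raise ValueError(f"unknown response strategy: {self.response_strategy}")


def parse_state(raw):
    """Parse one JSON-formatted state update and validate it against the schema."""
    data = json.loads(raw)
    return InternalState(
        dissonance=float(data["dissonance"]),
        opinion_climate=float(data["opinion_climate"]),
        isolation_risk=float(data["isolation_risk"]),
        response_strategy=str(data["response_strategy"]),
        willingness_to_speak=float(data["willingness_to_speak"]),
    )
```

Rejecting out-of-range or unknown values at parse time makes malformed model outputs visible instead of letting them flow silently into the traces.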

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the manuscript. We address each major comment below and outline the planned revisions.

Referee: [§4] §4 (Experimental Setup / State Update Mechanism): The five internal states are updated exclusively through LLM prompts, yet the manuscript provides no validation—neither consistency checks across prompt paraphrases, nor ablation on prompt sensitivity, nor any external anchoring to human judgments. This assumption is load-bearing for every reported result on systematic variation and appraisal effects.

Authors: We agree that validation of the LLM-prompt-based state updates is essential given their central role. In the revised manuscript, we will add consistency checks by paraphrasing the update prompts and reporting agreement rates across variants. We will also include an ablation analysis on prompt sensitivity by systematically varying key instructions and measuring impacts on state distributions. External anchoring to human judgments is a valuable direction but requires separate data collection; we will expand the limitations and future work sections to discuss this explicitly rather than claiming it in the current study.
Revision: partial

Referee: [§5] §5 (Results): Claims of 'coherent internal-state traces' and 'systematic variation' across conditions, as well as the directional effects of dissonance and silence-pressure appraisals, are presented without statistical tests, effect sizes, confidence intervals, or even basic quantitative metrics; only qualitative descriptions appear to be supplied.

Authors: The current presentation emphasizes qualitative illustration of the traces to demonstrate the framework. We will revise the results section to incorporate quantitative support, including mean values and standard deviations for key states (e.g., willingness to speak) across conditions, along with statistical tests such as t-tests or ANOVA to evaluate systematic differences and directional effects. Effect sizes (e.g., Cohen's d) and confidence intervals will be reported for the main comparisons.
Revision: yes
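The effect-size reporting promised in this response could be computed along the following lines. This is a sketch under stated assumptions: two independent samples of willingness-to-speak values from two conditions, a pooled-variance Cohen's d, and a normal-approximation interval (z = 1.96) that is only rough for small samples. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (mean(a) - mean(b)) / pooled


def mean_diff_ci(a, b, z=1.96):
    """Approximate confidence interval for the difference in means (normal approximation)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff - z * se, diff + z * se
```

Because intervals from the same agent are not independent, a per-interval comparison like this would overstate precision; aggregating per agent or per run first, or fitting a mixed-effects model, would be the more defensible choice.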

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained simulation

The paper introduces TBS as a new interval-based multi-agent framework where LLM agents update five explicitly defined internal states (dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, willingness to speak) from dialogue history and memory, after which an orchestrator selects utterances. Results consist of observed systematic variation in these states across experimental conditions (turn-allocation, silence, memory) and reported correlations between appraisals and speaking willingness. No equations, fitted parameters, or self-citation chains are present that would reduce any claimed outcome to an input by construction; the simulation outputs are generated externally via LLM execution and are not tautological with the state definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can maintain and update psychologically meaningful internal states without introducing artifacts; it introduces new invented entities in the form of the five structured internal states with no independent evidence outside the simulation itself.

axioms (1)

Domain assumption: LLMs can reliably update structured internal states (dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, willingness to speak) from dialogue history and memory in a manner that reflects the intended constructs.
Invoked when the paper states that agents update these states at each interval based on shared history and own memory.

invented entities (1)

Structured internal states (dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, willingness to speak). Independent evidence: none.
Purpose: To separate private reasoning from public utterance generation and make the pathway from internal evaluation to speaking observable.
These five states are newly introduced constructs in the TBS framework; the abstract provides no external validation or falsifiable handle outside the simulation runs.

Reference graph

Works this paper leans on

47 extracted references · 26 canonical work pages · 6 internal anchors


[1] [1]

Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, and Michele Tufano. 2024. Copilot evaluation harness: Evaluating LLM-guided soft- ware programming.arXiv preprint arXiv:2402.14261(2024)

work page arXiv 2024

[2] [2]

Argyle, Christopher A

Lisa P. Argyle, Christopher A. Bail, Ethan C. Busby, Joshua R. Gubler, Thomas Howe, Christopher Rytting, Taylor Sorensen, and David Wingate. 2023. Leverag- ing AI for democratic discourse: Chat interventions can improve online political conversations at scale.Proceedings of the National Academy of Sciences120, 41 (2023), e2311627120. doi:10.1073/pnas.2311627120

work page doi:10.1073/pnas.2311627120 2023

[3]

Joshua Ashkinaze, Emily Fry, Narendra Edara, Eric Gilbert, and Ceren Budak. 2025. Plurals: A System for Guiding LLMs via Simulated Social Ensembles. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, 1–21. doi:10.1145/3706598.3713675

[5]

Hsuan-Ting Chen. 2018. Spiral of Silence on Social Media and the Moderating Role of Disagreement and Publicness in the Network: Analyzing Expressive and Withdrawal Behaviors. New Media & Society 20, 10 (2018), 3917–3936. doi:10.1177/1461444818763384

[6]

Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, and Qing Wang. 2026. From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws. arXiv preprint arXiv:2606.06324 (2026).

[7]

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. ArXiv abs/2305.14387 (2023). https://api.semanticscholar.org/CorpusID:258865545

[8]

Amitai Etzioni. 1972. Minerva: An electronic town hall. Policy Sciences 3 (1972), 457–474. https://api.semanticscholar.org/CorpusID:154572096

[9]

William P. Eveland and Myiah H. Hively. 2009. Political Discussion Frequency, Network Size, and “Heterogeneity” of Discussion as Predictors of Political Knowledge and Participation. Journal of Communication 59, 2 (2009), 205–224. doi:10.1111/j.1460-2466.2009.01412.x

[10]

Wangxuan Fan, Xiaoyu Nie, and Zhongxiang Dai. 2026. Harness-MU: A Safe, Governed, and Effective Harness for Multi-User LLM Agents. arXiv preprint arXiv:2606.21856 (2026).

[11]

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. 2025. Group-in-Group Policy Optimization for LLM Agent Training. ArXiv abs/2505.10978 (2025). https://api.semanticscholar.org/CorpusID:278715074

[12]

Leon Festinger. 1957. A Theory of Cognitive Dissonance. Stanford University Press, Stanford, CA.

[13]

Sherice Gearhart and Weiwu Zhang. 2015. “Was It Something I Said?” “No, It Was Something You Posted!” A Study of the Spiral of Silence Theory in Social Media Contexts. Cyberpsychology, Behavior, and Social Networking 18, 4 (2015), 208–213. doi:10.1089/cyber.2014.0443

[14]

Önder Gürcan. 2024. LLM-augmented agent-based modelling for social simulations: Challenges and opportunities. In HHAI 2024: Hybrid Human AI Systems for the Social Good: Proceedings of the Third International Conference on Hybrid Human-Artificial Intelligence. SAGE Publications 1 Oliver’s Yard, 55 City Road, London, EC1Y 1SP, 134–144.

[15]

Andrew F. Hayes, Carroll J. Glynn, and James Shanahan. 2005. Validating the Willingness to Self-Censor Scale: Individual Differences in the Effect of the Climate of Opinion on Opinion Expression. International Journal of Public Opinion Research 17, 4 (2005), 443–455. doi:10.1093/ijpor/edh072

[16]

Andrew F. Hayes, Carroll J. Glynn, and James Shanahan. 2005. Willingness to Self-Censor: A Construct and Measurement Tool for Public Opinion Research. International Journal of Public Opinion Research 17, 3 (2005), 298–323. doi:10.1093/ijpor/edh073

[17]

Monica Webb Hooper, Charlene Mitchell, Vanessa J. Marshall, Chesley Cheatham, Kris Austin, Kimberly D. Sanders, Smitha S. Krishnamurthi, and Lena L. Grafton. 2019. Understanding Multilevel Factors Related to Urban Community Trust in Healthcare and Research. International Journal of Environmental Research and Public Health 16 (2019). https://api.semanticscholar.org/CorpusID:202407482

[19]

Minwoo Jeong, Jeeyun Chang, and Yoonjin Yoon. 2025. Speak to Simulate: An LLM-Guided Agentic Framework for Traffic Simulation in SUMO. Proceedings of the 8th ACM SIGSPATIAL International Workshop on Geospatial Simulation (2025). https://api.semanticscholar.org/CorpusID:282395843

[20]

Anthony Leiserowitz, Connie Roser-Renouf, Jennifer Marlon, and Edward Maibach. 2021. Global Warming’s Six Americas: a review and recommendations for climate change communication. Current Opinion in Behavioral Sciences 42 (2021), 97–103.

[21]

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. 2024. A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1 (2024). https://api.semanticscholar.org/CorpusID:273218743

[22]

Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, et al. 2026. Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. arXiv preprint arXiv:2605.30621 (2026).

[23]

Chenxi Liu, Sun Yang, Qianxiong Xu, Zhishuai Li, Cheng Long, Ziyue Li, and Rui Zhao. 2024. Spatial-Temporal Large Language Model for Traffic Prediction. 2024 25th IEEE International Conference on Mobile Data Management (MDM) (2024), 31–40. https://api.semanticscholar.org/CorpusID:267035019

[24]

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P Murphy. 2026. Autoharness: improving LLM agents by automatically synthesizing a code harness. arXiv preprint arXiv:2603.03329 (2026).

[25]

Moshe Maor, Sharon Gilad, and Pazit Ben-Nun Bloom. 2013. Organizational Reputation, Regulatory Talk, and Strategic Silence. Journal of Public Administration Research and Theory 23 (2013), 581–608. https://api.semanticscholar.org/CorpusID:154790924

[26]

Jörg Matthes, Johannes Knoll, and Christian von Sikorski. 2018. The “Spiral of Silence” Revisited: A Meta-Analysis on the Relationship Between Perceptions of Opinion Support and Political Opinion Expression. Communication Research 45, 1 (2018), 3–33. doi:10.1177/0093650217745429

[27]

Miriam J. Metzger, Ethan H. Hartsell, and Andrew J. Flanagin. 2020. Cognitive Dissonance or Credibility? A Comparison of Two Theoretical Explanations for Selective Exposure to Partisan News. Communication Research 47, 1 (2020), 3–28. doi:10.1177/0093650215613136

[28]

Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Huang Xuanjing, et al. 2026. From individual to society: A survey on social simulation driven by large language model-based agents. Comput. Surveys 58, 11 (2026), 1–41.

[29]

Diana C. Mutz. 2006. Hearing the Other Side: Deliberative versus Participatory Democracy. Cambridge University Press, New York.

[30]

Diana C. Mutz and Paul S. Martin. 2001. Facilitating Communication across Lines of Political Difference: The Role of Mass Media. American Political Science Review 95, 1 (2001), 97–114. doi:10.1017/S0003055401000223

[31]

German Neubaum and Nicole C. Krämer. 2017. Opinion Climates in Social Media: Blending Mass and Interpersonal Communication. Human Communication Research 43, 4 (2017), 464–476. doi:10.1111/hcre.12118

[32]

Elisabeth Noelle-Neumann. 1974. The Spiral of Silence: A Theory of Public Opinion. Journal of Communication 24, 2 (1974), 43–51. doi:10.1111/j.1460-2466.1974.tb00367.x

[33]

Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S. Bernstein. 2024. LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals. https://api.semanticscholar.org/CorpusID:274117080

[34]

Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. 2025. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. (2025).

[35]

John Rountree and John Gastil. 2026. The Case for Using Generative AI to Run Deliberation Simulations. Journal of Deliberative Democracy 1, 1 (Feb. 2026). doi:10.16997/jdd.1625

[36]

David G. Taylor. 1982. Pluralistic Ignorance and the Spiral of Silence: A Formal Analysis. Public Opinion Quarterly 46 (1982), 311–335. https://api.semanticscholar.org/CorpusID:144108774

[37]

Michael Henry Tessler, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, Lucy Campbell-Gillingham, Tantum Collins, David C. Parkes, Matthew Botvinick, and Christopher Summerfield. 2024. AI can help humans find common ground in democratic deliberation. Science 386, 6719 (Oct. 2024), eadq2852. doi:10.1126/science.adq2852

[38]

Magdalena E. Wojcieszak and Vincent Price. 2012. Perceived Versus Actual Disagreement: Which Influences Deliberative Experiences? Journal of Communication 62, 3 (2012), 418–436. doi:10.1111/j.1460-2466.2012.01645.x

[39]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang (Eric) Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. https://api.semanticscholar.org/CorpusID:263611068

[40]

Tianshi Xu, Huifeng Wen, and Meng Li. 2026. Adapting the interface, not the model: Runtime harness adaptation for deterministic LLM agents. arXiv preprint arXiv:2605.22166 (2026).

[41]

Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, and Jing Shao. 2024. OASIS: Open Agent Social Interaction Simulations with One Million...

[42]

Xinli Yu, Zheng Chen, Yuan Ling, Shujing Dong, Zongying Liu, and Yanbin Lu. 2023. Temporal Data Meets LLM - Explainable Financial Time Series Forecasting. ArXiv abs/2306.11025 (2023). https://api.semanticscholar.org/CorpusID:259203723

[44]

Lan Zhang, Yuxuan Hu, Weihua Li, Quan wei Bai, and Parma Nand. 2025. LLM-AIDSim: LLM-Enhanced Agent-Based Influence Diffusion Simulation in Social Networks. Syst. 13 (2025), 29. https://api.semanticscholar.org/CorpusID:275313752

SciSoc Agents & LLMs ’26, August 9, 2026, Jeju, Republic of Korea. Yang et al.

[45]

Yuxuan Zhang, Haoyang Yu, Lanxiang Hu, Haojian Jin, and Hao Zhang. 2025. General Modular Harness for LLM Agents in Multi-Turn Gaming Environments. arXiv preprint arXiv:2507.11633 (2025).

[46]

Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, et al. 2026. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224 (2026).

A Theoretical Elaboration on Research Questions

This section provides a ful...

make a compelling case for using generative AI to run deliberation simulations that complement, rather than replace, human judgment. They frame such simulations as “deliberation-making” tools rather than decision-making shortcuts, with potential applications in facilitator training, time-sensitive policy consultation, classroom deliberation, and theor...