Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Anding Wang; Ji-Rong Wen; Lei Shi; Lei Wang; Nan Lu; Xiaoxing Fu; Xu Chen; Yang Wang; Yuanzi Li

arxiv: 2605.23783 · v2 · pith:F4T4AZIKnew · submitted 2026-05-22 · 💻 cs.CY

Pith reviewed 2026-05-25 02:47 UTC · model grok-4.3

classification 💻 cs.CY

keywords LLM simulation · community governance · life-history narratives · curriculum-LoRA · fidelity-cost tradeoff · parameter-efficient adaptation · resident profiling · policy evaluation

The pith

Curriculum-LoRA matches the strongest LLM simulation fidelity at roughly 10x lower per-call cost using life-history narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding rich first-person life histories from 92 detailed resident interviews raises how closely LLMs reproduce specific individuals' stated views on community governance issues. This gain requires longer prompts that increase token cost, creating a practical barrier for local use. Curriculum-LoRA, a parameter-efficient adaptation method, closes the gap by delivering equivalent fidelity at about one-tenth the cost and dominating other approaches on the cost-fidelity trade-off. The full pipeline then supports closed-loop testing of governance policies through simulation before real deployment, making individualized modeling feasible for resource-limited administrations.

Core claim

Roughly 1.2 million characters of interview data across nine governance domains, evaluated with 18 LLMs, show that life-history profiles raise fidelity above the no-profile baseline but increase input cost. Curriculum-LoRA then matches the strongest baseline's fidelity while cutting per-call cost by a factor of roughly 10 and Pareto-dominating every configuration tested. The resulting system enables in-silico pre-evaluation of community policies.
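To make the cost side of this claim concrete, here is a minimal sketch of how prompt length drives per-call cost. The function name, the token counts, and the prices are all illustrative assumptions; none of them come from the paper.

```python
# Hypothetical per-call cost model. All prices and token counts below are
# made-up illustrations, not figures reported in the paper.

def per_call_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one call when input and output tokens are billed per 1,000."""
    return ((input_tokens / 1000) * price_in_per_1k
            + (output_tokens / 1000) * price_out_per_1k)

# A short prompt versus a long life-history prompt (assumed sizes),
# both producing the same number of output tokens.
short_prompt = per_call_cost(200, 150, price_in_per_1k=0.5, price_out_per_1k=1.5)
long_prompt = per_call_cost(12000, 150, price_in_per_1k=0.5, price_out_per_1k=1.5)

cost_ratio = long_prompt / short_prompt
print(f"short prompt: {short_prompt:.4f}")
print(f"long prompt:  {long_prompt:.4f}")
print(f"ratio:        {cost_ratio:.1f}x")
```

Under these assumed numbers the input side dominates the bill, which is why moving profile information out of the prompt and into adapted weights can lower per-call cost.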

What carries the argument

Curriculum-LoRA, a parameter-efficient personalization framework that adapts models to individual life-history profiles to generate resident-specific responses.
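For readers unfamiliar with why LoRA-style adaptation counts as parameter-efficient, the sketch below compares the size of a full weight matrix with that of a generic rank-r adapter. It describes plain LoRA only; the curriculum component and the paper's actual layer sizes and rank are not given in the abstract, so the dimensions here are assumptions.

```python
# Generic LoRA parameter arithmetic (not the paper's curriculum variant).
# A LoRA adapter replaces a full update of a (d_out x d_in) weight matrix
# with two low-rank factors: B (d_out x r) and A (r x d_in).

def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full_matrix_params, adapter_params) for one linear layer."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter

# Assumed layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full matrix: {full:,} parameters")
print(f"adapter:     {adapter:,} parameters")
print(f"reduction:   {full // adapter}x fewer trainable parameters")
```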

If this is right

Rich life-history profiles raise simulation fidelity above the no-profile baseline across the tested LLMs.
Standard prompting with full profiles increases input token counts and therefore per-call cost.
Curriculum-LoRA matches the strongest baseline fidelity at roughly 10 times lower per-call cost.
The method Pareto-dominates every prompting and adaptation configuration tested on the fidelity-cost frontier.
Individual-level resident simulation becomes reachable for resource-constrained local administrations.
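The Pareto-dominance claim in the list above can be read as follows: one configuration dominates another if it is no worse on both fidelity and cost and strictly better on at least one. The sketch below implements that definition. The configuration names and numbers are hypothetical and are not results from the paper.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    fidelity: float  # higher is better
    cost: float      # lower is better

def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.fidelity >= b.fidelity and a.cost <= b.cost
    strictly_better = a.fidelity > b.fidelity or a.cost < b.cost
    return no_worse and strictly_better

def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

# Hypothetical values for illustration only.
configs = [
    Config("no-profile prompt", 0.55, 1.0),
    Config("demographic prompt", 0.60, 1.2),
    Config("full life-history prompt", 0.75, 10.0),
    Config("adapted model (hypothetical)", 0.75, 1.0),
]

front = pareto_front(configs)
print([c.name for c in front])
```

Note that matching the strongest baseline's fidelity at lower cost is enough to dominate it: equal fidelity satisfies "no worse", and the lower cost supplies the strict improvement.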

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The interview dataset could be reused as a public benchmark for testing other personalization methods on governance attitudes.
Deployment in a real community decision process would test whether the fidelity scores translate into accurate forecasts of votes or survey responses.
Similar narrative-based adaptation might reduce costs in adjacent simulation settings such as patient preference modeling or student learning profiles.
Scaling the approach to additional residents could reveal systematic patterns linking life-history elements to attitude clusters.

Load-bearing premise

The benchmark's fidelity metric, which measures how closely LLM outputs match residents' interview statements, is a valid proxy for how well the simulations would predict actual resident behavior and preferences in real governance decisions.

What would settle it

Run a follow-up round of interviews or votes on new policy proposals with the same 92 residents and check whether the simulated responses from curriculum-LoRA models align with those actual answers at the reported fidelity levels.
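A check of this kind could be scored as simply as the sketch below, which computes the share of resident-proposal pairs where the simulated stance equals the actual one. The paper's fidelity metric is not defined in the abstract, so exact-match agreement is an assumption, as are the identifiers and stance labels.

```python
# Exact-match agreement between simulated and actual responses.
# This is an assumed stand-in for the paper's fidelity metric, which the
# abstract does not define. Keys are hypothetical (resident_id, proposal_id).

def agreement_rate(simulated: dict, actual: dict) -> float:
    """Fraction of shared (resident, proposal) pairs with identical stances."""
    shared = simulated.keys() & actual.keys()
    if not shared:
        raise ValueError("no overlapping resident-proposal pairs to compare")
    matches = sum(simulated[key] == actual[key] for key in shared)
    return matches / len(shared)

simulated = {
    ("r01", "p1"): "support",
    ("r01", "p2"): "oppose",
    ("r02", "p1"): "support",
    ("r02", "p2"): "support",
}
actual = {
    ("r01", "p1"): "support",
    ("r01", "p2"): "oppose",
    ("r02", "p1"): "oppose",
    ("r02", "p2"): "support",
}

rate = agreement_rate(simulated, actual)
print(f"agreement: {rate:.2f}")
```

Comparing this rate with the fidelity reported on the interview benchmark would show whether benchmark scores carry over to new proposals.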

Figures

Figures reproduced from arXiv: 2605.23783 by Anding Wang, Ji-Rong Wen, Lei Shi, Lei Wang, Nan Lu, Xiaoxing Fu, Xu Chen, Yang Wang, Yuanzi Li.

**Figure 2.** Benchmarking 18 mainstream LLMs for individual-resident simulation.

**Figure 3.**

**Figure 4.** Generalization to unseen residents (a) and unseen governance domains (b).

**Figure 5.** A closed-loop, end-to-end platform for policy simulation and optimization.

Original abstract

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible way to study human attitudes and behaviors at low cost. However, these studies typically prompt the model with just a few demographic variables (age, gender, income), simulating only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but this gain comes with more input tokens per call from the longer prompts they require. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that, by closing this fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper supplies a sizable first-person interview dataset and shows curriculum-LoRA can cut token cost while keeping alignment to self-reported views, but offers no test that this alignment predicts real resident behavior in governance settings.

The main things here are a 1.2 million character dataset from 92 residents across nine governance domains and the curriculum-LoRA method that reportedly matches high-fidelity baselines at about one-tenth the per-call cost. The benchmark across 18 models demonstrates that richer life-history prompts improve fidelity over simple demographics, and the LoRA approach closes the cost gap enough to Pareto-dominate the tested setups. That combination is the concrete advance: a reusable dataset plus a practical personalization trick for this use case. The system-level claim about closed-loop policy evaluation follows from integrating the model, but rests on the same benchmark numbers.

The soft spot is the missing link between matching residents' stated views in interviews and actually forecasting how they would act or vote on real policies. The abstract frames fidelity as alignment with self-reports, yet the governance application requires that this proxy tracks revealed preferences or observed outcomes; nothing in the provided text shows external validation on held-out decisions. Methods details, error bars, and statistical tests are also absent from the abstract, so the strength of the 10x cost claim cannot be judged without the full tables.

The work is aimed at groups already running LLM-based social simulations who want cheaper personalization for local policy work. It is coherent on its own terms and engages the literature enough to merit referee time, even if the applied payoff needs more evidence. I would send it for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a dataset of ~1.2M characters of first-person life-history narratives from semi-structured interviews with 92 residents across nine community-governance domains; benchmarks 18 LLMs under four prompting strategies to show that rich profiles increase fidelity to residents' stated views (at the cost of longer prompts); proposes curriculum-LoRA, a parameter-efficient personalization method claimed to match the strongest baseline fidelity at roughly 10x lower per-call cost while Pareto-dominating tested configurations; and integrates the method into a closed-loop policy-evaluation pipeline for in-silico governance decisions.

Significance. If the reported fidelity-cost trade-off holds under rigorous evaluation and the interview-alignment metric proves predictive of real behavior, the work could lower barriers for resource-constrained local administrations to pre-test policies. The dataset and benchmark also supply a concrete testbed for personalization techniques. However, the absence of any external validation tying benchmark scores to observed votes, participation, or policy outcomes substantially limits the immediate applied significance.

major comments (2)

[Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.
[Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.

minor comments (2)

[Abstract] The abstract refers to 'four prompting strategies' and '18 mainstream LLMs' without naming them or indicating where the full list and exact protocol appear; this should be clarified with a table or section reference for reproducibility.
[Abstract] Notation for the fidelity metric and cost metric is not defined in the abstract; explicit definitions (even if deferred to §3) would help readers assess the Pareto-dominance claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below with specific revisions where appropriate and clarifications on scope.

Referee: [Abstract] Abstract: the headline claim that curriculum-LoRA 'matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested' is presented without any accompanying numbers, tables, error bars, or statistical tests, preventing evaluation of whether the result is load-bearing or an artifact of the chosen fidelity metric.

Authors: We agree the abstract should include quantitative support. The main text (Section 4, Tables 2-4) reports mean fidelity scores with standard errors, per-token costs, and paired t-tests showing curriculum-LoRA matches the strongest baseline (p>0.05) at 9.8x lower cost while dominating all other configurations on the Pareto front. In revision we will insert these specific values, error bars, and test results into the abstract to make the claim self-contained.
Revision: yes

Referee: [Abstract] Abstract (final sentence) and system description: the assertion that the framework enables 'systematically pre-evaluated in silico' community-governance decisions rests on the untested assumption that fidelity to self-reported interview views is a valid proxy for actual resident behavior or revealed preferences; no correlation with held-out votes, participation rates, or real policy outcomes is reported, leaving the translation from benchmark score to governance utility unsupported.

Authors: The manuscript evaluates fidelity strictly to the collected interview narratives and does not claim or demonstrate correlation with external behavioral data. We will revise the abstract and system description to state explicitly that the in-silico pipeline is intended for simulation conditioned on interview profiles, with the acknowledged limitation that predictive validity for real-world actions remains untested in this study.
Revision: partial
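The first response leans on paired t-tests, so a minimal sketch of the statistic may help readers judge what such a test measures. It uses only the Python standard library and returns the t statistic and degrees of freedom; converting these to a p-value requires a t-distribution CDF, which is left out here. The per-resident scores are invented for illustration and are not the paper's data.

```python
import math
import statistics

def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Invented per-resident fidelity scores for two methods (illustration only).
adapted_model = [0.70, 0.74, 0.68, 0.77, 0.72]
strongest_baseline = [0.71, 0.73, 0.69, 0.76, 0.72]

t_stat, dof = paired_t_statistic(adapted_model, strongest_baseline)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A t statistic near zero, as in this toy data, is the kind of result that would yield p>0.05. Failing to reject a difference is weaker evidence than a formal equivalence test, which is worth keeping in mind when reading "matches the strongest baseline".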

standing simulated objections not resolved

Absence of external validation linking fidelity scores to observed votes, participation, or policy outcomes, as no such ground-truth data were collected.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct fidelity measurements

The paper reports an empirical study: interview data collection (1.2M characters from 92 residents), benchmarking of 18 LLMs across prompting strategies, definition of fidelity as alignment with residents' stated interview views, and evaluation of curriculum-LoRA showing it matches baseline fidelity at lower cost. No equations, derivations, or first-principles claims exist. Performance results are measured outcomes on the held-out or tested interview responses rather than quantities defined in terms of fitted parameters or reduced by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central claims remain independent empirical findings on the provided dataset and benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. curriculum-LoRA implies training hyperparameters and a curriculum schedule, but none are enumerated.
