Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

Amira Roess; Andrew Crooks; Emma Von Hoene; Hamdi Kavak; Orhan Yagizer Cinar; Sara Von Hoene; Taylor Anderson

arxiv: 2605.27401 · v1 · pith:6AJQPEHUnew · submitted 2026-04-23 · 💻 cs.CY · cs.AI

Pith reviewed 2026-07-04 20:02 UTC · model glm-5.2

keywords: LLM-generated survey data · population synthesis · iterative proportional fitting · BRFSS · geographically explicit · synthetic populations · zero-shot generation · spatial validation

The pith

LLM-Generated Health Surveys Capture State Contrasts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether zero-shot LLM-generated health survey responses can replace real survey data as input to iterative proportional fitting (IPF), a standard method for building geographically explicit synthetic populations. The authors prompt GPT-4.1 and Gemini-2.5-Pro to generate synthetic BRFSS-like survey records for Colorado and Mississippi, feed those records into an IPF pipeline, and validate the resulting census tract-level populations against external benchmarks (ACS and CDC PLACES). The central finding is that zero-shot LLM-generated survey data captures broad state-level health contrasts and sometimes produces spatial patterns that correlate reasonably well with ground truth, but performance is highly variable-dependent: some variables are reproduced almost perfectly while others diverge substantially. A key mechanism finding is that IPF does not simply propagate upstream errors in a predictable direction—it sometimes amplifies them, sometimes reduces them, and occasionally an LLM-based population outperforms a real-survey-based one on specific variables. The paper concludes that zero-shot LLM-generated survey data is a promising supplementary input for population synthesis when real survey data is unavailable, but not yet a drop-in replacement.

Core claim

The paper establishes two findings. First, zero-shot LLMs can generate state-conditioned health survey data that reproduces broad geographic contrasts (e.g., Colorado being healthier than Mississippi), but accuracy varies dramatically by variable: sex and age are near-perfect while health insurance and income diverge significantly. Second, the relationship between survey-data accuracy and downstream synthetic-population accuracy is not monotonic: IPF partially regularizes differences between LLM-generated datasets, sometimes reducing divergence from ground truth and sometimes worsening it, meaning that evaluating LLM-generated survey data in isolation gives an incomplete picture of its real value for synthesis.

What carries the argument

The central machinery is the IPF (iterative proportional fitting) pipeline: LLM-generated individual survey records serve as the joint-distributional template, which IPF then reweights to match census tract-level demographic marginals from the American Community Survey. The evaluation uses Jensen-Shannon divergence to measure distributional similarity and Pearson correlation against external benchmarks (ACS for insurance, CDC PLACES for general health) to assess spatial accuracy.
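To make the reweighting step concrete, here is a minimal two-variable sketch of iterative proportional fitting. It is not the authors' implementation: the function name `ipf_2d`, the seed counts, and the tract totals are invented for illustration, and the paper's pipeline fits five variables rather than two.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its margins match the target totals."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Row step: scale each row so its sum equals the row target.
        row_sums = table.sum(axis=1, keepdims=True)
        table *= np.divide(row_targets[:, None], row_sums,
                           out=np.ones_like(row_sums), where=row_sums > 0)
        # Column step: scale each column so its sum equals the column target.
        col_sums = table.sum(axis=0, keepdims=True)
        table *= np.divide(col_targets[None, :], col_sums,
                           out=np.ones_like(col_sums), where=col_sums > 0)
        # Columns are exact after the column step; stop once rows agree too.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0.0, atol=tol):
            break
    return table

# Hypothetical seed: survey record counts cross-tabulated by age group x sex.
seed = np.array([[30.0, 20.0],
                 [10.0, 40.0]])
# Hypothetical tract-level marginal totals (both sum to 1000 people).
age_totals = np.array([600.0, 400.0])
sex_totals = np.array([480.0, 520.0])

fitted = ipf_2d(seed, age_totals, sex_totals)
seed_odds = (seed[0, 0] * seed[1, 1]) / (seed[0, 1] * seed[1, 0])
fitted_odds = (fitted[0, 0] * fitted[1, 1]) / (fitted[0, 1] * fitted[1, 0])
```

The fitted table matches both sets of totals while keeping the seed's odds ratio, which is the sense in which the survey records act as a joint-distributional template. Because IPF only rescales existing cells, a cell that is zero in the seed stays zero, so a category the generator never produces cannot appear downstream.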

If this is right

Population synthesis practitioners in data-sparse regions or domains could use LLM-generated survey data as a provisional input when real surveys are unavailable, provided they validate the specific variables of interest downstream rather than at the generation stage alone.
Variable-level validation is essential: some health variables (insurance, income, heart disease) are systematically harder for LLMs to reproduce and may require targeted correction or constrained generation rather than zero-shot prompting.
The finding that IPF can either amplify or reduce upstream errors suggests that the choice of synthesis method interacts non-trivially with input data quality, and future work should characterize which method-error combinations are self-correcting versus error-amplifying.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the BRFSS survey structure and state-level health statistics are present in LLM pretraining data, the apparent geographic differentiation may partly reflect memorization rather than reasoning, which would limit generalizability to less-prominent surveys or geographies not well-represented in training corpora.
The variable-dependent accuracy pattern may correlate with category cardinality and rarity: variables with many categories or rare outcomes (insurance, heart disease) are harder for LLMs to reproduce, suggesting that few-shot or constrained generation could disproportionately improve the weakest variables.
The occasional outperformance of LLM-based populations over BRFSS-based ones for specific variables (education, flu vaccination) may indicate that LLMs smooth noisy sampling distributions, which could be a feature rather than a bug for small-sample survey contexts.

Load-bearing premise

The paper assumes that the LLMs are generating plausible synthetic survey data from general geographic knowledge rather than regurgitating memorized distributions from the BRFSS survey, which is a prominent public dataset likely present in their pretraining data. If the models are recalling seen examples rather than reasoning about state-level context, the generalizability to less-prominent surveys or geographies would be undermined.

What would settle it

If LLM-generated survey data for a survey instrument and geography not present in pretraining data showed the same level of state-level contrast reproduction, the geographic reasoning claim would be strengthened; if performance collapsed, the memorization concern would be confirmed.

Figures

Figures reproduced from arXiv: 2605.27401 by Amira Roess, Andrew Crooks, Emma Von Hoene, Hamdi Kavak, Orhan Yagizer Cinar, Sara Von Hoene, Taylor Anderson.

**Figure 2.** Distributions for CO and MS comparing ground truth and the LLM-generated individual …

**Figure 3.** Residuals between the ground truth and the synthetic populations for …

**Figure 4.** Spatial distribution of residuals for health insurance and general health across CO census tracts. Maps (A–C) show residuals for health insurance coverage compared against ACS 2023 estimates for: (A) BRFSS-based; (B) GPT-based; and (C) Gemini-based synthetic populations. Maps (D–F) show residuals for poor health status compared against 2023 CDC PLACES estimates for: (D) BRFSS-based; (E) GPT-based; and (F) …

**Figure 5.** Spatial distribution of residuals for health insurance and general health across Missis…

Original abstract

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Solid empirical study with an honest reporting style; the main soft spot is the memorization confound, which the authors acknowledge but don't test.

This paper asks a practical question: can zero-shot LLM-generated survey data serve as input to IPF-based population synthesis? The answer is a qualified yes: some variables reproduce well, others don't, and IPF has a complicated relationship with upstream error. That's a useful, honest finding for people building synthetic populations in data-sparse settings.

The code and prompts are public, which matters here. The evaluation uses external benchmarks (ACS, CDC PLACES) for spatial validation rather than relying solely on internal consistency, and the authors transparently report failures; GPT generating zero uninsured individuals is a good example of the kind of honesty that makes the results credible. The finding that IPF sometimes amplifies and sometimes reduces errors in the generated data is genuinely useful and not something you'd get from evaluating the LLM outputs alone. The two-state contrast (Colorado vs. Mississippi) is a reasonable design choice for showing whether the models produce geographically differentiated data.

What's new here is specifically the downstream evaluation: prior work looked at LLM-generated survey data in isolation, but this paper traces what happens when you feed it through a standard synthesis pipeline. That's a real contribution.

The main soft spot is the memorization concern, and it's real. The prompt explicitly tells the LLMs to use their knowledge of BRFSS, which is a widely disseminated public dataset. So when the models reproduce state-level contrasts, we can't tell whether they're reasoning about geographic context or retrieving memorized marginals. The authors flag this in Section 4, which is good, but they don't test it. A replication with a less prominent survey or a novel geography would go a long way. The variable-dependent accuracy (insurance is poorly reproduced) offers weak evidence against pure memorization, but as the stress-test notes, incomplete memorization could produce the same pattern.

A secondary concern: the evaluation focuses on marginal distributions per variable, but the paper's own framing identifies joint distributions as the real challenge for IPF. The spatial validation provides indirect evidence on this, but a direct check on bivariate or higher-order joint distributions would strengthen the claims substantially. The two-state, two-model scope limits generalizability, though the authors are upfront about this.

This is a well-executed empirical study that fills a genuine gap. It's not a breakthrough, but it's the kind of careful, honest work that practitioners in spatial population synthesis will read and use. It deserves a serious referee who can push the authors to address the memorization confound directly, ideally with a less prominent survey, and to add joint distribution checks.

Referee Report

1 major / 7 minor

Summary. This manuscript evaluates whether zero-shot LLM-generated health survey data (using GPT-4.1 and Gemini-2.5-Pro) can serve as input to an IPF-based geographically explicit population synthesis workflow. The authors generate synthetic BRFSS-like survey records for Colorado and Mississippi, use them in an IPF pipeline fitted to ACS marginal controls, and evaluate the resulting census tract-level synthetic populations against external benchmarks (ACS for insurance, CDC PLACES for general health). The evaluation uses JS divergence for marginal distributions and Pearson correlation for spatial agreement. The authors find that LLMs capture broad state-level contrasts but performance is strongly variable-dependent, with downstream IPF effects being mixed (sometimes amplifying, sometimes reducing errors). The paper is transparent about limitations, including cases where LLM data fails entirely (e.g., no uninsured individuals generated by GPT).

Significance. The paper addresses a practically important question at the intersection of LLM-generated synthetic data and population synthesis. The contribution is well-scoped: rather than evaluating LLM-generated records in isolation, the authors evaluate them as inputs to an established synthesis pipeline, which is the context where joint distributions matter. The use of external benchmarks (ACS, CDC PLACES) for spatial validation is a strength, as is the transparent reporting of variable-dependent failures. The code and prompts are publicly available (OSF DOI), supporting reproducibility. The two-state design (CO vs. MS) provides a meaningful geographic contrast. The finding that IPF partially regularizes but does not fix LLM-generated data errors is useful for the community.

major comments (1)

Section 2.1: The prompt explicitly instructs LLMs to 'use their knowledge of the Behavioral Risk Factor Surveillance System Survey.' Since BRFSS is a widely disseminated public dataset with published state-level tables, and both models have training cutoffs that likely include 2023 BRFSS data, the evaluation may conflate memorization of seen distributions with genuine geographic reasoning. The authors acknowledge this risk in Section 4 but do not test it. This is load-bearing because the paper's claim that zero-shot generation 'produces geographically differentiated survey data' (Abstract) could be an artifact of the models reproducing memorized state-level marginals. A concrete test would be to run the same pipeline on a less-prominent survey instrument or a geography with less public data, or to compare results with and without the BRFSS name in the prompt. Without such a test, the 'ge

minor comments (7)

Table 1: The row mean for 'Insurance' (0.129) is reported as the highest divergence, but the text on the same page says 'insurance has the maintains the lowest mean divergence of 0.070' — this appears to be a typo conflating Table 1 and Table 2 values. Please clarify.
Section 3.2.1, paragraph discussing Table 3: 'insurance has the maintains the lowest mean divergence of 0.070' is grammatically broken. Also, 0.070 is the highest row mean in Table 2, not the lowest, so the claim appears incorrect.
Figure 1 caption: 'A negative residual indicates that the LLM overestimates the category relative to the ground truth data and a positive residual indicates that the LLM underestimates.' This sign convention is non-intuitive (negative = overestimate). Consider clarifying or reversing the sign for reader intuition.
Section 2.1: The batch size of 75 was selected based on experiments testing sizes of 50, 75, 100, 150, and 200, but no quantitative results from these experiments are reported. A brief table or sentence summarizing the trade-offs would strengthen the justification.
Section 2.3: The JS divergence is defined with values 'approaching 1' for maximum dissimilarity. With log base 2, the JS divergence between two distributions is bounded above by 1 regardless of the number of categories, so the maximum for 14-category variables is still 1. The log base and this bound should be stated explicitly for clarity.
Table 2: Several BRFSS-based population divergences are non-zero (e.g., Education 0.038 for CO BRFSS, Flu Vaccination 0.041 for CO BRFSS). The text explains these arise from the fitting and expansion process, but a brief note in the table caption would help readers interpret why the 'ground truth' reference is not zero.
Section 4: The phrase 'the BRFSS is a prominent public survey, so some of its structure is likely reflected in LLM pretraining data' is an important limitation. Consider elevating this to a more prominent position (e.g., in the Introduction or Methods) rather than burying it in the Discussion, as it affects interpretation of all results.
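The point about the JS divergence bound in the Section 2.3 comment can be checked numerically. The sketch below is a plain NumPy implementation written for this note, not code from the paper; the function name `js_divergence` and the example distributions are illustrative assumptions.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (log base 2) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()          # normalize counts or shares to probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution

    def kl(a, b):
        mask = a > 0         # 0 * log(0) is treated as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give zero divergence.
same = js_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])

# Two 14-category distributions with no overlapping support:
# all mass on the first 7 categories vs. all mass on the last 7.
p14 = np.array([1.0] * 7 + [0.0] * 7)
q14 = np.array([0.0] * 7 + [1.0] * 7)
disjoint_14 = js_divergence(p14, q14)

# The two-category case reaches the same maximum.
disjoint_2 = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Both disjoint cases reach the same maximum of 1 bit, so the upper bound does not depend on the number of categories. Note that `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of the divergence), so values from it are not directly comparable with divergences unless squared.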

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for a careful and constructive review. The referee raises one major concern about the potential confounding of memorization with genuine geographic reasoning. We agree this is an important issue and outline below how we will address it.

Referee: Section 2.1: The prompt explicitly instructs LLMs to 'use their knowledge of the Behavioral Risk Factor Surveillance System Survey.' Since BRFSS is a widely disseminated public dataset with published state-level tables, and both models have training cutoffs that likely include 2023 BRFSS data, the evaluation may conflate memorization of seen distributions with genuine geographic reasoning. The authors acknowledge this risk in Section 4 but do not test it. This is load-bearing because the paper's claim that zero-shot generation 'produces geographically differentiated survey data' (Abstract) could be an artifact of the models reproducing memorized state-level marginals. A concrete test would be to run the same pipeline on a less-prominent survey instrument or a geography with less public data, or to compare results with and without the BRFSS name in the prompt.

Authors: We agree this is the most important limitation of the study, and we appreciate the referee framing it so precisely. The concern is valid: because BRFSS is a prominent public dataset with widely available state-level tables, the models may be reproducing memorized marginals rather than performing genuine geographic reasoning. We already flag this in Section 4, but the referee is right that acknowledging a limitation is not the same as testing it.

We will take two concrete steps in revision. First, we will run an ablation in which the BRFSS name is removed from the prompt, replacing the instruction to 'use knowledge of BRFSS' with a neutral instruction to generate realistic health survey responses for the specified state population. This directly tests whether naming the instrument is driving the results. Second, we will soften the abstract claim from 'produces geographically differentiated survey data' to language that does not presuppose the mechanism (e.g., 'produces survey data that differs across states'), and we will add a sentence in the abstract noting that the role of memorization cannot be ruled out with the current design.

We note honestly that we cannot fully resolve the memorization question within this paper. The ablation will provide evidence about whether naming BRFSS matters, but even without the name, the models may have internalized BRFSS distributions from pretraining. A definitive test would require a less-prominent survey instrument or a geography with minimal public data, which we identify as a priority for future work and will state explicitly. We believe the ablation and the revised framing meaningfully address the referee's concern without overclaiming what the current study can establish.

Revision: partial

standing simulated objections not resolved

We cannot definitively distinguish memorization from genuine geographic reasoning, even with the proposed ablation, because both models may have internalized BRFSS distributions during pretraining regardless of whether the instrument is named in the prompt. A fully convincing test would require a survey instrument or geography not represented in the training data, which is beyond the scope of the current revision.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained against external benchmarks

The paper's central claim—that zero-shot LLM-generated survey data can serve as supplementary input to IPF-based population synthesis—is evaluated against external benchmarks (BRFSS ground truth, ACS estimates, CDC PLACES) that are independent of the generation process. The LLM-generated data is not defined in terms of the evaluation targets, and no parameter is fitted to a subset of data and then 'predicted' on closely related data. The one self-citation (Von Hoene et al., 2025, co-authored by Emma Von Hoene) appears in Section 2.2 as a methodological reference for standard IPF procedures alongside Lovelace and Ballas (2013) and Huang and Williamson (2001); it is not load-bearing for the paper's central claim, which rests on empirical comparison against external data. The authors explicitly acknowledge that BRFSS-based synthetic populations are 'not a fully independent validation benchmark' (Section 3.2.1) and flag the memorization risk in Section 4. These are correctness/validity concerns, not circularity. The derivation chain—generate data, compare to benchmarks, feed into IPF, compare synthetic populations to external benchmarks—does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 0 invented entities

The paper introduces no new entities, particles, or theoretical constructs. It applies existing tools (LLMs, IPF) to a new domain. The free parameters are practical engineering choices (batch size, temperature) rather than theoretical constants. The axioms are domain assumptions about LLM capabilities and data independence, with the pretraining-memorization concern being the most significant untested premise.

free parameters (4)

Batch size (75 rows per call) = 75
Selected empirically from tested sizes of 50, 75, 100, 150, 200 based on output truncation and malformed JSON rates. Not a principled derivation.
Temperature = 1.0
Set to maximize diversity in generated responses. Standard choice but not derived from the data.
Top-p = 1.0
Set alongside temperature for diversity. Not tuned to the specific task.
IPF fitting variables (age, race, gender, income, education) = 5 variables
Selected based on established relationships with health attributes. Not optimized or validated against alternatives.
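The batching parameters above can be pictured with a small sketch. Everything here is an assumption for illustration: `call_model` is a hypothetical callable standing in for whatever API client the authors used, the JSON-array response format is assumed, and the failure-counting logic is not taken from the paper. Only the values 75, 1.0, and 1.0 come from the ledger.

```python
import json

BATCH_SIZE = 75    # rows per call, as listed in the ledger
TEMPERATURE = 1.0  # as listed in the ledger
TOP_P = 1.0        # as listed in the ledger

def plan_batches(n_total, batch_size=BATCH_SIZE):
    """Split a requested record count into per-call batch sizes."""
    full, rest = divmod(n_total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def parse_batch(raw_text, expected_rows):
    """Return the parsed rows, or None if the output is malformed or truncated."""
    try:
        rows = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(rows, list) or len(rows) != expected_rows:
        return None
    return rows

def generate(n_total, call_model):
    """Collect records batch by batch and count batches that failed validation."""
    records, failures = [], 0
    for size in plan_batches(n_total):
        rows = parse_batch(call_model(size, TEMPERATURE, TOP_P), size)
        if rows is None:
            failures += 1
        else:
            records.extend(rows)
    return records, failures

def fake_model(n_rows, temperature, top_p):
    """Hypothetical stand-in for an LLM call: returns well-formed JSON rows."""
    return json.dumps([{"respondent": i} for i in range(n_rows)])

records, failures = generate(200, fake_model)
```

Counting failed batches per candidate batch size is one way the reported trade-off between truncation and malformed JSON could be quantified, which is the evidence the referee's comment on Section 2.1 asks for.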

axioms (4)

Domain assumption: LLMs can generate state-specific survey responses reflecting geographic context from zero-shot prompts
Invoked in Section 2.1: 'We instructed the LLMs to generate survey responses that are representative of the 2023 adult population of the two U.S. states.' This is the core premise being tested.
Domain assumption: BRFSS data is not substantially memorized in LLM pretraining corpora
Acknowledged as a limitation in Section 4: 'the BRFSS is a prominent public survey, so some of its structure is likely reflected in LLM pretraining data.' If false, performance is inflated.
Standard math: IPF preserves joint distributions from input survey data
Standard property of IPF invoked in Section 1: 'IPF seeks to preserve the joint distributions present in the input records.' Well-established in the literature.
Domain assumption: ACS and CDC PLACES estimates are valid external benchmarks for tract-level health outcomes
Used in Section 2.3 for spatial validation. CDC PLACES is model-derived from BRFSS, introducing partial dependence on the same source.

pith-pipeline@v1.1.0-glm · 15609 in / 2613 out tokens · 127502 ms · 2026-07-04T20:02:48.937621+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 7 internal anchors

[1]

Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser

doi: 10.1017/pan.2023.2. Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. Urban mobility assessment using llms. In Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems, pages 67–79,

work page doi:10.1017/pan.2023.2 2023
[2]

2023 brfss survey data and documentation,

CDC. 2023 brfss survey data and documentation,

work page 2023
[3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Available at https://www.cdc.gov/places. Google DeepMind. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

doi: 10.48550/arXiv.2507.06261. John J Grefenstette, Shawn T Brown, Roni Rosenfeld, Jay DePasse, Nathan TB Stone, Phillip C Cooley, William D Wheaton, Alona Fyshe, David D Galloway, Anuroop Sriram, et al. Fred (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using ce...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261
[5]

David Han, Samiul Islam, Taylor Anderson, Andrew T Crooks, and Hamdi Kavak

doi: 10.1186/1471-2458-13-940. David Han, Samiul Islam, Taylor Anderson, Andrew T Crooks, and Hamdi Kavak. Quantitative comparison of population synthesis techniques. In 2025 Winter Simulation Conference (WSC), pages 151–162. IEEE,

work page doi:10.1186/1471-2458-13-940
[6]

Deirdre A Hennessy, William M Flanagan, Peter Tanuseputro, Carol Bennett, Meltem Tuna, Jacek Kopec, Michael C Wolfson, and Douglas G Manuel

doi: 10.1109/WSC68292.2025.11338945. Deirdre A Hennessy, William M Flanagan, Peter Tanuseputro, Carol Bennett, Meltem Tuna, Jacek Kopec, Michael C Wolfson, and Douglas G Manuel. The population health model (pohem): an overview of rationale, methods and applications. Population Health Metrics, 13(1):24,

work page doi:10.1109/wsc68292.2025.11338945 2025
[7]

doi: 10.1186/s12963-015-0057-x. Z. Huang and P. Williamson. A comparison of synthetic reconstruction and combinatorial optimisation approaches to the creation of small-area microdata. Technical report, Department of Geography, University of Liverpool,

work page doi:10.1186/s12963-015-0057-x
[8]

doi: 10.1371/journal.pcbi.1009149. Ansley J Kunnath, Daniel E Sack, and Consuelo H Wilkins. Relative predictive value of sociodemographic factors for chronic diseases among all of us participants: a descriptive analysis. BMC Public Health, 24(1):405,

work page doi:10.1371/journal.pcbi
[9]

David T Levy, Patricia L Mabry, Amanda L Graham, C Tracy Orleans, and David B Abrams

doi: 10.1186/s12889-024-17834-1. David T Levy, Patricia L Mabry, Amanda L Graham, C Tracy Orleans, and David B Abrams. Reaching healthy people 2010 by 2013: a simsmoke simulation. American Journal of Preventive Medicine, 38(3):S373–S381,

work page doi:10.1186/s12889-024-17834-1 2010
[10]

Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang

doi: 10.1016/j.amepre.2009.11.018. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On llms-driven synthetic data generation, curation, and evaluation: A survey. In Findings of the Association for Computational Linguistics, pages 11065–11082,

work page doi:10.1016/j.amepre.2009.11.018 2009
[11]

Robin Lovelace and Dimitris Ballas

doi: 10.18653/v1/2024.findings-acl.658. Robin Lovelace and Dimitris Ballas. ‘truncate, replicate, sample’: a method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41,

work page doi:10.18653/v1/2024.findings-acl.658 2024
[12]

doi: 10.1016/j.compenvurbsys.2013.03

work page doi:10.1016/j.compenvurbsys.2013.03 2013
[13]

Deirdre A Hennessy, William M Flanagan, Peter Tanuseputro, Carol Bennett, Meltem Tuna, Jacek Kopec, Michael C Wolfson, and Douglas G Manuel

doi: 10.1109/WSC68292.2025.11339080. Pedro Nascimento de Lima, Christopher Maerzluft, Jonathan Ozik, Nicholson Collier, and Carolyn M Rutter. Stress-testing u.s. colorectal cancer screening guidelines: Decennial colonoscopy from age 45 is robust to natural history uncertainty and colonoscopy sensitivity assumptions. Medical Decision Making, 45(5):557–568,

work page doi:10.1109/wsc68292.2025.11339080 2025
[14]

doi: 10.1145/3764919.3770885. OpenAI. GPT-4.1. https://openai.com/index/gpt-4-1/,

work page doi:10.1145/3764919
[16]

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

doi: 10.48550/arXiv.2602.03545. Zhenlin Qin, Yancheng Ling, Leizhen Wang, Francisco Câmara Pereira, and Zhenliang Ma. Semapop: Semantic-persona conditioned population synthesis. arXiv preprint arXiv:2602.11569,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

doi: 10.48550/arXiv.2602.11569. Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar. Curated LLM: synergy of LLMs and data curation for tabular augmentation in low-data regimes. In Proceedings of the 41st International Conference on Machine Learning,

work page doi:10.48550/arxiv.2602
[18]

LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

doi: 10.1016/j.healthplace.2015.03.015. Yihong Tang, Menglin Kong, Junlin He, Tong Nie, and Lijun Sun. Llmsynthor: Macro-aligned micro-records synthesis with large language models. arXiv preprint arXiv:2505.14752,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.healthplace.2015.03.015 2015
[19]

LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

doi: 10.48550/arXiv.2505.14752. US Census Bureau. American community survey (acs),

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.14752
[20]

David Villarreal-Zegarra and Luciana Bellido-Boza

Available at https://www.census.gov/programs-surveys/acs. David Villarreal-Zegarra and Luciana Bellido-Boza. Generation of synthetic data in health surveys using large language models. medRxiv, pages 2026–01,

work page 2026
[21]

Emma Von Hoene, Amira Roess, Hamdi Kavak, and Taylor Anderson

doi: 10.64898/2026.01.27.26345015. Emma Von Hoene, Amira Roess, Hamdi Kavak, and Taylor Anderson. Synthetic population generation with public health characteristics for spatial agent-based models. PLOS Computational Biology, 21(3):1–22, 03

work page doi:10.64898/2026.01.27.26345015 2026
[22]

Sean J Westwood

doi: 10.1371/journal.pcbi.1012439. Sean J Westwood. The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47):e2518075122,

work page doi:10.1371/journal.pcbi.1012439
[23]

Fuzhen Yin, Na Jiang, Andrew Crooks, and Lucie Laurian

doi: 10.1073/pnas.2518075122. Fuzhen Yin, Na Jiang, Andrew Crooks, and Lucie Laurian. Agent-based modeling of covid-19 vaccine uptake in new york state: Information diffusion in hybrid spaces. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on GeoSpatial Simulation, pages 11–20,

work page doi:10.1073/pnas.2518075122
[24]

doi: 10.1145/3681770.369857. X. Zhang, J.B. Holt, S. Yun, H. Lu, K.J. Greenlund, and J.B. Croft. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the behavioral risk factor surveillance system. American Journal of Epidemiology, 182(2):127–137,

work page doi:10.1145/3681770.369857
[25]

Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

doi: 10.1093/aje/kwv002. Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, and Denghui Zhang. Large language models as virtual survey respondents: Evaluating sociodemographic response generation. arXiv preprint arXiv:2509.06337,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/aje/kwv002
[26]

Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

doi: 10.48550/arXiv.2509.06337. AUTHOR BIOGRAPHIES TA YLOR ANDERSONis an Associate Professor in the Department of Geography and Geoinformation Science (GGS) at George Mason University (GMU). Her research focuses on modeling the spread of diseases in human and ecological systems. Her e-mail address istander6@gmu.eduand her website ishttps://science.gmu.edu...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.06337

doi: 10.1017/pan.2023.2.

[1] Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. Urban mobility assessment using LLMs. In Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems, pages 67–79.

[2] CDC. 2023 BRFSS survey data and documentation.

Available at https://www.cdc.gov/places.

[3] Google DeepMind. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261. doi: 10.48550/arXiv.2507.06261.

[4] John J Grefenstette, Shawn T Brown, Roni Rosenfeld, Jay DePasse, Nathan TB Stone, Phillip C Cooley, William D Wheaton, Alona Fyshe, David D Galloway, Anuroop Sriram, et al. FRED (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using ce... doi: 10.1186/1471-2458-13-940.

[5] David Han, Samiul Islam, Taylor Anderson, Andrew T Crooks, and Hamdi Kavak. Quantitative comparison of population synthesis techniques. In 2025 Winter Simulation Conference (WSC), pages 151–162. IEEE. doi: 10.1109/WSC68292.2025.11338945.

[6] Deirdre A Hennessy, William M Flanagan, Peter Tanuseputro, Carol Bennett, Meltem Tuna, Jacek Kopec, Michael C Wolfson, and Douglas G Manuel. The Population Health Model (POHEM): an overview of rationale, methods and applications. Population Health Metrics, 13(1):24. doi: 10.1186/s12963-015-0057-x.

[7] Z. Huang and P. Williamson. A comparison of synthetic reconstruction and combinatorial optimisation approaches to the creation of small-area microdata. Technical report, Department of Geography, University of Liverpool.

doi: 10.1371/journal.pcbi.1009149.

[8] Ansley J Kunnath, Daniel E Sack, and Consuelo H Wilkins. Relative predictive value of sociodemographic factors for chronic diseases among All of Us participants: a descriptive analysis. BMC Public Health, 24(1):405. doi: 10.1186/s12889-024-17834-1.

[9] David T Levy, Patricia L Mabry, Amanda L Graham, C Tracy Orleans, and David B Abrams. Reaching Healthy People 2010 by 2013: a SimSmoke simulation. American Journal of Preventive Medicine, 38(3):S373–S381. doi: 10.1016/j.amepre.2009.11.018.

[10] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Findings of the Association for Computational Linguistics, pages 11065–11082. doi: 10.18653/v1/2024.findings-acl.658.

[11] Robin Lovelace and Dimitris Ballas. 'Truncate, replicate, sample': a method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41. doi: 10.1016/j.compenvurbsys.2013.03

[12] doi: 10.1109/WSC68292.2025.11339080.

[13] Pedro Nascimento de Lima, Christopher Maerzluft, Jonathan Ozik, Nicholson Collier, and Carolyn M Rutter. Stress-testing U.S. colorectal cancer screening guidelines: Decennial colonoscopy from age 45 is robust to natural history uncertainty and colonoscopy sensitivity assumptions. Medical Decision Making, 45(5):557–568.

doi: 10.1145/3764919.3770885.

[14] OpenAI. GPT-4.1. https://openai.com/index/gpt-4-1/.

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts. doi: 10.48550/arXiv.2602.03545.

[15] Zhenlin Qin, Yancheng Ling, Leizhen Wang, Francisco Câmara Pereira, and Zhenliang Ma. SemaPop: Semantic-persona conditioned population synthesis. arXiv preprint arXiv:2602.11569. doi: 10.48550/arXiv.2602.11569.

[16] Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar. Curated LLM: synergy of LLMs and data curation for tabular augmentation in low-data regimes. In Proceedings of the 41st International Conference on Machine Learning.

doi: 10.1016/j.healthplace.2015.03.015.

[17] Yihong Tang, Menglin Kong, Junlin He, Tong Nie, and Lijun Sun. LLMSynthor: Macro-aligned micro-records synthesis with large language models. arXiv preprint arXiv:2505.14752. doi: 10.48550/arXiv.2505.14752.

[18] US Census Bureau. American Community Survey (ACS). Available at https://www.census.gov/programs-surveys/acs.

[19] David Villarreal-Zegarra and Luciana Bellido-Boza. Generation of synthetic data in health surveys using large language models. medRxiv, pages 2026–01. doi: 10.64898/2026.01.27.26345015.

[20] Emma Von Hoene, Amira Roess, Hamdi Kavak, and Taylor Anderson. Synthetic population generation with public health characteristics for spatial agent-based models. PLOS Computational Biology, 21(3):1–22, 03. doi: 10.1371/journal.pcbi.1012439.

[21] Sean J Westwood. The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47):e2518075122. doi: 10.1073/pnas.2518075122.

[22] Fuzhen Yin, Na Jiang, Andrew Crooks, and Lucie Laurian. Agent-based modeling of COVID-19 vaccine uptake in New York State: Information diffusion in hybrid spaces. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on GeoSpatial Simulation, pages 11–20. doi: 10.1145/3681770.369857.

[23] X. Zhang, J.B. Holt, S. Yun, H. Lu, K.J. Greenlund, and J.B. Croft. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the Behavioral Risk Factor Surveillance System. American Journal of Epidemiology, 182(2):127–137. doi: 10.1093/aje/kwv002.

[24] Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, and Denghui Zhang. Large language models as virtual survey respondents: Evaluating sociodemographic response generation. arXiv preprint arXiv:2509.06337. doi: 10.48550/arXiv.2509.06337.

AUTHOR BIOGRAPHIES

TAYLOR ANDERSON is an Associate Professor in the Department of Geography and Geoinformation Science (GGS) at George Mason University (GMU). Her research focuses on modeling the spread of diseases in human and ecological systems. Her e-mail address is tander6@gmu.edu and her website is https://science.gmu.edu...