pith.

arxiv: 2605.27401 · v1 · pith:6AJQPEHU · submitted 2026-04-23 · 💻 cs.CY · cs.AI

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

Pith reviewed 2026-07-04 20:02 UTC · model glm-5.2

classification 💻 cs.CY cs.AI
keywords LLM-generated survey data, population synthesis, iterative proportional fitting, BRFSS, geographically explicit, synthetic populations, zero-shot generation, spatial validation

The pith

LLM-Generated Health Surveys Capture State Contrasts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether zero-shot LLM-generated health survey responses can replace real survey data as input to iterative proportional fitting (IPF), a standard method for building geographically explicit synthetic populations. The authors prompt GPT-4.1 and Gemini-2.5-Pro to generate synthetic BRFSS-like survey records for Colorado and Mississippi, feed those records into an IPF pipeline, and validate the resulting census tract-level populations against external benchmarks (ACS and CDC PLACES). The central finding is that zero-shot LLM-generated survey data captures broad state-level health contrasts and sometimes produces spatial patterns that correlate reasonably well with ground truth, but performance is highly variable-dependent: some variables are reproduced almost perfectly while others diverge substantially. A key mechanism finding is that IPF does not simply propagate upstream errors in a predictable direction—it sometimes amplifies them, sometimes reduces them, and occasionally an LLM-based population outperforms a real-survey-based one on specific variables. The paper concludes that zero-shot LLM-generated survey data is a promising supplementary input for population synthesis when real survey data is unavailable, but not yet a drop-in replacement.

Core claim

The paper establishes two findings. First, zero-shot LLMs can generate state-conditioned health survey data that reproduces broad geographic contrasts (e.g., Colorado being healthier than Mississippi), but accuracy varies dramatically by variable: sex and age are near-perfect while health insurance and income diverge significantly. Second, the relationship between survey-data accuracy and downstream synthetic-population accuracy is not monotonic. IPF partially regularizes differences between LLM-generated datasets, sometimes reducing divergence from ground truth and sometimes worsening it, so evaluating LLM-generated survey data in isolation gives an incomplete picture of its real value for population synthesis.

What carries the argument

The central machinery is the IPF (iterative proportional fitting) pipeline: LLM-generated individual survey records serve as the joint-distributional template, which IPF then reweights to match census tract-level demographic marginals from the American Community Survey. The evaluation uses Jensen-Shannon divergence to measure distributional similarity and Pearson correlation against external benchmarks (ACS for insurance, CDC PLACES for general health) to assess spatial accuracy.
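The reweighting step can be illustrated with a minimal pure-Python sketch. It assumes categorical records held as dicts and the target marginals of a single tract; the function name `ipf_weights`, the toy records, and the target counts are hypothetical and are not taken from the paper's released code. Note that the health variable (`insured`) is never a constraint: its weighted prevalence is inherited entirely from how it co-occurs with the fitting variables in the seed records, which is how errors in LLM-generated joint distributions can carry into the synthetic population.

```python
from collections import defaultdict

def ipf_weights(records, targets, max_iter=100, tol=1e-9):
    """Iterative proportional fitting over categorical seed records.

    records: list of dicts, e.g. generated survey rows (the seed/template).
    targets: {variable: {category: target_count}} for one tract; every
             category that occurs in the records needs a positive target.
    Returns one weight per record.
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_change = 0.0
        for var, cat_targets in targets.items():
            # Current weighted total for each category of this fitting variable.
            totals = defaultdict(float)
            for w, rec in zip(weights, records):
                totals[rec[var]] += w
            # Rescale each record so this variable's margin hits its target.
            for i, rec in enumerate(records):
                new_w = weights[i] * cat_targets[rec[var]] / totals[rec[var]]
                max_change = max(max_change, abs(new_w - weights[i]))
                weights[i] = new_w
        if max_change < tol:  # stop once a full sweep barely moves any weight
            break
    return weights

# Hypothetical seed: one generated record per age-by-sex cell.
records = [
    {"age": "18-44", "sex": "F", "insured": True},
    {"age": "18-44", "sex": "M", "insured": False},
    {"age": "45+", "sex": "F", "insured": True},
    {"age": "45+", "sex": "M", "insured": True},
]
# Hypothetical tract marginals (in the paper these come from the ACS).
targets = {"age": {"18-44": 600, "45+": 400}, "sex": {"F": 520, "M": 480}}

weights = ipf_weights(records, targets)
insured_share = sum(w for w, r in zip(weights, records) if r["insured"]) / sum(weights)
print([round(w, 6) for w in weights])  # [312.0, 288.0, 208.0, 192.0]
print(round(insured_share, 3))         # 0.712
```

Because this toy seed contains every age-by-sex combination exactly once, IPF converges in one sweep to the independence solution. The paper's pipeline also turns weights into individuals (the report refers to "the fitting and expansion process"), and its reference list includes Lovelace and Ballas's "truncate, replicate, sample" integerisation method; this sketch stops at fractional weights.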

If this is right

  • Population synthesis practitioners in data-sparse regions or domains could use LLM-generated survey data as a provisional input when real surveys are unavailable, provided they validate the specific variables of interest downstream rather than at the generation stage alone.
  • Variable-level validation is essential: some health variables (insurance, income, heart disease) are systematically harder for LLMs to reproduce and may require targeted correction or constrained generation rather than zero-shot prompting.
  • The finding that IPF can either amplify or reduce upstream errors suggests that the choice of synthesis method interacts non-trivially with input data quality, and future work should characterize which method-error combinations are self-correcting versus error-amplifying.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the BRFSS survey structure and state-level health statistics are present in LLM pretraining data, the apparent geographic differentiation may partly reflect memorization rather than reasoning, which would limit generalizability to less-prominent surveys or geographies not well-represented in training corpora.
  • The variable-dependent accuracy pattern may correlate with category cardinality and rarity: variables with many categories or rare outcomes (insurance, heart disease) are harder for LLMs to reproduce, suggesting that few-shot or constrained generation could disproportionately improve the weakest variables.
  • The occasional outperformance of LLM-based populations over BRFSS-based ones for specific variables (education, flu vaccination) may indicate that LLMs smooth noisy sampling distributions, which could be a feature rather than a bug for small-sample survey contexts.

Load-bearing premise

The paper assumes that the LLMs are generating plausible synthetic survey data from general geographic knowledge rather than regurgitating memorized distributions from the BRFSS survey, which is a prominent public dataset likely present in their pretraining data. If the models are recalling seen examples rather than reasoning about state-level context, the generalizability to less-prominent surveys or geographies would be undermined.

What would settle it

If LLM-generated survey data for a survey instrument and geography not present in pretraining data showed the same level of state-level contrast reproduction, the geographic reasoning claim would be strengthened; if performance collapsed, the memorization concern would be confirmed.

Figures

Figures reproduced from arXiv: 2605.27401 by Amira Roess, Andrew Crooks, Emma Von Hoene, Hamdi Kavak, Orhan Yagizer Cinar, Sara Von Hoene, Taylor Anderson.

Figure 1. Residuals between ground truth and generated survey data for health insurance and …
Figure 2. Distributions for CO and MS comparing ground truth and the LLM-generated individual …
Figure 3. Residuals between the ground truth and the synthetic populations for …
Figure 4. Spatial distribution of residuals for health insurance and general health across CO census tracts. Maps (A–C) show residuals for health insurance coverage compared against ACS 2023 estimates for: (A) BRFSS-based; (B) GPT-based; and (C) Gemini-based synthetic populations. Maps (D–F) show residuals for poor health status compared against 2023 CDC PLACES estimates for: (D) BRFSS-based; (E) GPT-based; and (F) …
Figure 5. Spatial distribution of residuals for health insurance and general health across Mississippi …
Original abstract

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 7 minor

Summary. This manuscript evaluates whether zero-shot LLM-generated health survey data (using GPT-4.1 and Gemini-2.5-Pro) can serve as input to an IPF-based geographically explicit population synthesis workflow. The authors generate synthetic BRFSS-like survey records for Colorado and Mississippi, use them in an IPF pipeline fitted to ACS marginal controls, and evaluate the resulting census tract-level synthetic populations against external benchmarks (ACS for insurance, CDC PLACES for general health). The evaluation uses JS divergence for marginal distributions and Pearson correlation for spatial agreement. The authors find that LLMs capture broad state-level contrasts but performance is strongly variable-dependent, with downstream IPF effects being mixed (sometimes amplifying, sometimes reducing errors). The paper is transparent about limitations, including cases where LLM data fails entirely (e.g., no uninsured individuals generated by GPT).
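The two evaluation metrics named in the summary can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming category shares (or counts) as plain lists and tract-level prevalence values as paired lists; it is not the authors' code, and the function names and example numbers are hypothetical. Using log base 2 keeps the Jensen-Shannon value between 0 and 1 for any number of categories.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so 0 <= JS <= 1.

    p, q: non-negative counts or shares over the same categories, same order.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pearson(x, y):
    """Pearson correlation between paired tract-level values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical insurance shares [insured, uninsured]: a generator that
# produces no uninsured respondents versus a reference with 10% uninsured.
print(js_divergence([0.9, 0.1], [1.0, 0.0]))
# Hypothetical tract-level prevalence: synthetic population vs. benchmark.
print(pearson([0.10, 0.12, 0.20, 0.25], [0.11, 0.13, 0.18, 0.27]))
```

Pearson correlation is unchanged when a constant is added to every tract, so a synthetic population with a uniform bias can still correlate strongly with its benchmark; residual maps such as those in Figures 4 and 5 are what expose such offsets.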

Significance. The paper addresses a practically important question at the intersection of LLM-generated synthetic data and population synthesis. The contribution is well-scoped: rather than evaluating LLM-generated records in isolation, the authors evaluate them as inputs to an established synthesis pipeline, which is the context where joint distributions matter. The use of external benchmarks (ACS, CDC PLACES) for spatial validation is a strength, as is the transparent reporting of variable-dependent failures. The code and prompts are publicly available (OSF DOI), supporting reproducibility. The two-state design (CO vs. MS) provides a meaningful geographic contrast. The finding that IPF partially regularizes but does not fix LLM-generated data errors is useful for the community.

major comments (1)
  1. Section 2.1: The prompt explicitly instructs LLMs to 'use their knowledge of the Behavioral Risk Factor Surveillance System Survey.' Since BRFSS is a widely disseminated public dataset with published state-level tables, and both models have training cutoffs that likely include 2023 BRFSS data, the evaluation may conflate memorization of seen distributions with genuine geographic reasoning. The authors acknowledge this risk in Section 4 but do not test it. This is load-bearing because the paper's claim that zero-shot generation 'produces geographically differentiated survey data' (Abstract) could be an artifact of the models reproducing memorized state-level marginals. A concrete test would be to run the same pipeline on a less-prominent survey instrument or a geography with less public data, or to compare results with and without the BRFSS name in the prompt. Without such a test, the 'ge…
minor comments (7)
  1. Table 1: The row mean for 'Insurance' (0.129) is reported as the highest divergence, but the text on the same page says 'insurance has the maintains the lowest mean divergence of 0.070' — this appears to be a typo conflating Table 1 and Table 2 values. Please clarify.
  2. Section 3.2.1, paragraph discussing Table 3: 'insurance has the maintains the lowest mean divergence of 0.070' is grammatically broken. Also, 0.070 is the highest row mean in Table 2, not the lowest, so the claim appears incorrect.
  3. Figure 1 caption: 'A negative residual indicates that the LLM overestimates the category relative to the ground truth data and a positive residual indicates that the LLM underestimates.' This sign convention is non-intuitive (negative = overestimate). Consider clarifying or reversing the sign for reader intuition.
  4. Section 2.1: The batch size of 75 was selected based on experiments testing sizes of 50, 75, 100, 150, and 200, but no quantitative results from these experiments are reported. A brief table or sentence summarizing the trade-offs would strengthen the justification.
  5. Section 2.3: The JS divergence is defined with values 'approaching 1' for maximum dissimilarity. Computed with log base 2, JS divergence is bounded above by 1 for distributions over any number of categories (the general bound is log 2, which equals 1 in base 2), so the statement holds for the 14-category variables as well as for binary ones. The log base should nevertheless be stated explicitly, since with natural logarithms the upper bound would be ln 2 ≈ 0.693 rather than 1.
  6. Table 2: Several BRFSS-based population divergences are non-zero (e.g., Education 0.038 for CO BRFSS, Flu Vaccination 0.041 for CO BRFSS). The text explains these arise from the fitting and expansion process, but a brief note in the table caption would help readers interpret why the 'ground truth' reference is not zero.
  7. Section 4: The phrase 'the BRFSS is a prominent public survey, so some of its structure is likely reflected in LLM pretraining data' is an important limitation. Consider elevating this to a more prominent position (e.g., in the Introduction or Methods) rather than burying it in the Discussion, as it affects interpretation of all results.
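Minor comment 3 concerns the residual sign. The sketch below encodes the convention described in the Figure 1 caption (residual = ground-truth share minus generated share, so negative means the LLM overestimates) and adds an option for the reversed sign the referee suggests. The function name, its parameter, and the category shares are hypothetical illustrations, not values from the paper.

```python
def category_residuals(ground_truth, generated, generated_minus_truth=False):
    """Per-category residuals between two {category: share} dictionaries.

    Default (Figure 1 caption convention): ground truth - generated, so a
    negative residual means the generator overestimates that category.
    generated_minus_truth=True flips the sign, so positive = overestimate.
    """
    cats = sorted(set(ground_truth) | set(generated))
    sign = -1.0 if generated_minus_truth else 1.0
    # A category missing from one side is treated as a share of 0.
    return {c: sign * (ground_truth.get(c, 0.0) - generated.get(c, 0.0)) for c in cats}

# Hypothetical shares: the generator produces no uninsured respondents.
truth = {"insured": 0.9, "uninsured": 0.1}
llm = {"insured": 1.0, "uninsured": 0.0}
print(category_residuals(truth, llm))
print(category_residuals(truth, llm, generated_minus_truth=True))
```

Under the caption's convention the overproduced "insured" category gets a negative residual; flipping the sign makes overestimates positive, which is the more intuitive reading the referee asks for.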

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for a careful and constructive review. The referee raises one major concern about the potential confounding of memorization with genuine geographic reasoning. We agree this is an important issue and outline below how we will address it.

Point-by-point responses
  1. Referee: Section 2.1: The prompt explicitly instructs LLMs to 'use their knowledge of the Behavioral Risk Factor Surveillance System Survey.' Since BRFSS is a widely disseminated public dataset with published state-level tables, and both models have training cutoffs that likely include 2023 BRFSS data, the evaluation may conflate memorization of seen distributions with genuine geographic reasoning. The authors acknowledge this risk in Section 4 but do not test it. This is load-bearing because the paper's claim that zero-shot generation 'produces geographically differentiated survey data' (Abstract) could be an artifact of the models reproducing memorized state-level marginals. A concrete test would be to run the same pipeline on a less-prominent survey instrument or a geography with less public data, or to compare results with and without the BRFSS name in the prompt.

    Authors: We agree this is the most important limitation of the study, and we appreciate the referee framing it so precisely. The concern is valid: because BRFSS is a prominent public dataset with widely available state-level tables, the models may be reproducing memorized marginals rather than performing genuine geographic reasoning. We already flag this in Section 4, but the referee is right that acknowledging a limitation is not the same as testing it. We will take two concrete steps in revision. First, we will run an ablation in which the BRFSS name is removed from the prompt—replacing the instruction to 'use knowledge of BRFSS' with a neutral instruction to generate realistic health survey responses for the specified state population. This directly tests whether naming the instrument is driving the results. Second, we will soften the abstract claim from 'produces geographically differentiated survey data' to language that does not presuppose the mechanism—e.g., 'produces survey data that differs across states'—and we will add a sentence in the abstract noting that the role of memorization cannot be ruled out with the current design. We note honestly that we cannot fully resolve the memorization question within this paper. The ablation will provide evidence about whether naming BRFSS matters, but even without the name, the models may have internalized BRFSS distributions from pretraining. A definitive test would require a less-prominent survey instrument or a geography with minimal public data, which we identify as a priority for future work and will state explicitly. We believe the ablation and the revised framing meaningfully address the referee's concern without overclaiming what the current study can establish. revision: partial
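The first revision step amounts to a paired prompt ablation in which only the instrument-naming sentence changes. A hypothetical sketch of such a prompt builder follows; the wording paraphrases the instructions quoted in the report and is not the paper's actual prompt, and `build_prompt` and its parameters are illustrative names.

```python
def build_prompt(state, n_rows, name_instrument=True, year=2023):
    """Hypothetical generation prompt; only the first sentence differs by arm."""
    if name_instrument:
        # Treatment arm: names the survey instrument, as the original prompt did.
        source = ("Use your knowledge of the Behavioral Risk Factor "
                  "Surveillance System Survey.")
    else:
        # Ablation arm: neutral wording with no instrument name.
        source = ("Produce realistic health survey responses for the "
                  "specified state population.")
    task = (f"Generate {n_rows} survey responses that are representative of "
            f"the {year} adult population of {state}. "
            "Return them as a JSON array of objects.")
    return f"{source} {task}"

# Paired design: identical states, batch size, and task text in both arms.
arms = {
    (state, named): build_prompt(state, 75, name_instrument=named)
    for state in ("Colorado", "Mississippi")
    for named in (True, False)
}
for (state, named), prompt in arms.items():
    print(f"{state} | instrument named: {named}\n  {prompt}")
```

Keeping everything except the first sentence fixed means any per-variable change in JS divergence between the two arms can be attributed to naming the instrument rather than to other prompt differences.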

standing simulated objections not resolved
  • We cannot definitively distinguish memorization from genuine geographic reasoning, even with the proposed ablation, because both models may have internalized BRFSS distributions during pretraining regardless of whether the instrument is named in the prompt. A fully convincing test would require a survey instrument or geography not represented in the training data, which is beyond the scope of the current revision.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained against external benchmarks

Full rationale

The paper's central claim—that zero-shot LLM-generated survey data can serve as supplementary input to IPF-based population synthesis—is evaluated against external benchmarks (BRFSS ground truth, ACS estimates, CDC PLACES) that are independent of the generation process. The LLM-generated data is not defined in terms of the evaluation targets, and no parameter is fitted to a subset of data and then 'predicted' on closely related data. The one self-citation (Von Hoene et al., 2025, co-authored by Emma Von Hoene) appears in Section 2.2 as a methodological reference for standard IPF procedures alongside Lovelace and Ballas (2013) and Huang and Williamson (2001); it is not load-bearing for the paper's central claim, which rests on empirical comparison against external data. The authors explicitly acknowledge that BRFSS-based synthetic populations are 'not a fully independent validation benchmark' (Section 3.2.1) and flag the memorization risk in Section 4. These are correctness/validity concerns, not circularity. The derivation chain—generate data, compare to benchmarks, feed into IPF, compare synthetic populations to external benchmarks—does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 0 invented entities

The paper introduces no new entities, particles, or theoretical constructs. It applies existing tools (LLMs, IPF) to a new domain. The free parameters are practical engineering choices (batch size, temperature) rather than theoretical constants. The axioms are domain assumptions about LLM capabilities and data independence, with the pretraining-memorization concern being the most significant untested premise.

free parameters (4)
  • Batch size (75 rows per call) = 75
    Selected empirically from tested sizes of 50, 75, 100, 150, 200 based on output truncation and malformed JSON rates. Not a principled derivation.
  • Temperature = 1.0
    Set to maximize diversity in generated responses. Standard choice but not derived from the data.
  • Top-p = 1.0
    Set alongside temperature for diversity. Not tuned to the specific task.
  • IPF fitting variables (age, race, gender, income, education) = 5 variables
    Selected based on established relationships with health attributes. Not optimized or validated against alternatives.
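The batch-size choice trades the number of API calls against the risk of truncated or malformed JSON. The sketch below shows, under stated assumptions, the bookkeeping such a batch-size experiment needs: splitting a requested row count into batches and rejecting outputs that fail to parse or return the wrong number of rows. Only the batch size of 75 comes from the paper; the function names, field names, and the 200-row total are hypothetical.

```python
import json

BATCH_SIZE = 75  # rows per call, as reported in the paper

def batch_sizes(total_rows, batch_size=BATCH_SIZE):
    """Split a requested number of rows into per-call batch sizes."""
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def parse_batch(raw_text, expected_rows, required_fields):
    """Return the parsed rows, or None if the output is unusable."""
    try:
        rows = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # malformed, e.g. output cut off mid-object
    if not isinstance(rows, list) or len(rows) != expected_rows:
        return None  # wrong shape or too few/many rows
    for row in rows:
        if not isinstance(row, dict) or not set(required_fields).issubset(row):
            return None  # a row is missing a required survey field
    return rows

def failure_rate(outputs, expected_rows, required_fields):
    """Share of raw outputs that parse_batch rejects."""
    failed = sum(parse_batch(o, expected_rows, required_fields) is None for o in outputs)
    return failed / len(outputs)

fields = {"sex", "age"}  # hypothetical subset of survey fields
print(batch_sizes(200))  # [75, 75, 50]
good = json.dumps([{"sex": "F", "age": 34}, {"sex": "M", "age": 61}])
cut_off = '[{"sex": "F", "age": 34}, {"sex": "M"'
print(failure_rate([good, cut_off], expected_rows=2, required_fields=fields))  # 0.5
```

Running the same validation at each candidate batch size (50, 75, 100, 150, 200) would yield the per-size failure rates that minor comment 4 asks the authors to report.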
axioms (4)
  • domain assumption: LLMs can generate state-specific survey responses reflecting geographic context from zero-shot prompts
    Invoked in Section 2.1: 'We instructed the LLMs to generate survey responses that are representative of the 2023 adult population of the two U.S. states.' This is the core premise being tested.
  • domain assumption: BRFSS data is not substantially memorized in LLM pretraining corpora
    Acknowledged as a limitation in Section 4: 'the BRFSS is a prominent public survey, so some of its structure is likely reflected in LLM pretraining data.' If false, performance is inflated.
  • standard math: IPF preserves joint distributions from input survey data
    Standard property of IPF invoked in Section 1: 'IPF seeks to preserve the joint distributions present in the input records.' Well-established in the literature.
  • domain assumption: ACS and CDC PLACES estimates are valid external benchmarks for tract-level health outcomes
    Used in Section 2.3 for spatial validation. CDC PLACES is model-derived from BRFSS, introducing partial dependence on the same source.

pith-pipeline@v1.1.0-glm · 15609 in / 2613 out tokens · 127502 ms · 2026-07-04T20:02:48.937621+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 7 internal anchors

  1. doi: 10.1017/pan.2023.2.

  2. Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. Urban mobility assessment using LLMs. In Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems, pages 67–79.

  3. CDC. 2023 BRFSS survey data and documentation.

  4. CDC PLACES. Available at https://www.cdc.gov/places.

  5. Google DeepMind. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261. doi: 10.48550/arXiv.2507.06261.

  6. John J Grefenstette, Shawn T Brown, Roni Rosenfeld, Jay DePasse, Nathan TB Stone, Phillip C Cooley, William D Wheaton, Alona Fyshe, David D Galloway, Anuroop Sriram, et al. FRED (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using ce... doi: 10.1186/1471-2458-13-940.

  7. David Han, Samiul Islam, Taylor Anderson, Andrew T Crooks, and Hamdi Kavak. Quantitative comparison of population synthesis techniques. In 2025 Winter Simulation Conference (WSC), pages 151–162. IEEE. doi: 10.1109/WSC68292.2025.11338945.

  8. Deirdre A Hennessy, William M Flanagan, Peter Tanuseputro, Carol Bennett, Meltem Tuna, Jacek Kopec, Michael C Wolfson, and Douglas G Manuel. The population health model (POHEM): an overview of rationale, methods and applications. Population Health Metrics, 13(1):24. doi: 10.1186/s12963-015-0057-x.

  9. Z. Huang and P. Williamson. A comparison of synthetic reconstruction and combinatorial optimisation approaches to the creation of small-area microdata. Technical report, Department of Geography, University of Liverpool.

  10. doi: 10.1371/journal.pcbi.1009149.

  11. Ansley J Kunnath, Daniel E Sack, and Consuelo H Wilkins. Relative predictive value of sociodemographic factors for chronic diseases among All of Us participants: a descriptive analysis. BMC Public Health, 24(1):405. doi: 10.1186/s12889-024-17834-1.

  12. David T Levy, Patricia L Mabry, Amanda L Graham, C Tracy Orleans, and David B Abrams. Reaching Healthy People 2010 by 2013: a SimSmoke simulation. American Journal of Preventive Medicine, 38(3):S373–S381. doi: 10.1016/j.amepre.2009.11.018.

  13. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Findings of the Association for Computational Linguistics, pages 11065–11082. doi: 10.18653/v1/2024.findings-acl.658.

  14. Robin Lovelace and Dimitris Ballas. 'Truncate, replicate, sample': a method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems, 41. doi: 10.1016/j.compenvurbsys.2013.03...

  15. doi: 10.1109/WSC68292.2025.11339080.

  16. Pedro Nascimento de Lima, Christopher Maerzluft, Jonathan Ozik, Nicholson Collier, and Carolyn M Rutter. Stress-testing U.S. colorectal cancer screening guidelines: Decennial colonoscopy from age 45 is robust to natural history uncertainty and colonoscopy sensitivity assumptions. Medical Decision Making, 45(5):557–568.

  17. doi: 10.1145/3764919.3770885.

  18. OpenAI. GPT-4.1. https://openai.com/index/gpt-4-1/.

  19. Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts. doi: 10.48550/arXiv.2602.03545.

  20. Zhenlin Qin, Yancheng Ling, Leizhen Wang, Francisco Câmara Pereira, and Zhenliang Ma. SemaPop: Semantic-persona conditioned population synthesis. arXiv preprint arXiv:2602.11569. doi: 10.48550/arXiv.2602.11569.

  21. Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar. Curated LLM: Synergy of LLMs and data curation for tabular augmentation in low-data regimes. In Proceedings of the 41st International Conference on Machine Learning.

  22. doi: 10.1016/j.healthplace.2015.03.015.

  23. Yihong Tang, Menglin Kong, Junlin He, Tong Nie, and Lijun Sun. LLMSynthor: Macro-aligned micro-records synthesis with large language models. arXiv preprint arXiv:2505.14752. doi: 10.48550/arXiv.2505.14752.

  24. US Census Bureau. American Community Survey (ACS). Available at https://www.census.gov/programs-surveys/acs.

  25. David Villarreal-Zegarra and Luciana Bellido-Boza. Generation of synthetic data in health surveys using large language models. medRxiv, pages 2026–01. doi: 10.64898/2026.01.27.26345015.

  26. Emma Von Hoene, Amira Roess, Hamdi Kavak, and Taylor Anderson. Synthetic population generation with public health characteristics for spatial agent-based models. PLOS Computational Biology, 21(3):1–22, 03. doi: 10.1371/journal.pcbi.1012439.

  27. Sean J Westwood. The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47):e2518075122. doi: 10.1073/pnas.2518075122.

  28. Fuzhen Yin, Na Jiang, Andrew Crooks, and Lucie Laurian. Agent-based modeling of COVID-19 vaccine uptake in New York State: Information diffusion in hybrid spaces. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on GeoSpatial Simulation, pages 11–20. doi: 10.1145/3681770.369857.

  29. X. Zhang, J.B. Holt, S. Yun, H. Lu, K.J. Greenlund, and J.B. Croft. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the Behavioral Risk Factor Surveillance System. American Journal of Epidemiology, 182(2):127–137. doi: 10.1093/aje/kwv002.

  30. Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, and Denghui Zhang. Large language models as virtual survey respondents: Evaluating sociodemographic response generation. arXiv preprint arXiv:2509.06337. doi: 10.48550/arXiv.2509.06337.