pith. machine review for the scientific record.

arxiv: 2605.08630 · v1 · submitted 2026-05-09 · 💻 cs.HC

Recognition: no theorem link

Sycamore: Characterizing Synthetic Personas for Evaluating Genomics Visualization Retrieval

Astrid van den Brandt, Huyen N. Nguyen, Nils Gehlenborg


Pith reviewed 2026-05-12 01:21 UTC · model grok-4.3

classification 💻 cs.HC
keywords synthetic personas · LLM evaluation · genomics visualization · user study · visualization retrieval · human-computer interaction · grounding · expert feedback

The pith

Grounding synthetic personas in real interview data shifts their feedback on a genomics visualization tool toward documented user concerns; ungrounded versions fixate on operational details, and both miss the experts' preference for image-based queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests what happens when large language models generate synthetic personas to evaluate a real genomics visualization search engine called Geranium. It runs a three-condition comparison: generic ungrounded personas, personas constrained by voice-of-customer notes from an earlier expert interview study, and the original published expert evaluations. Grounding moves the synthetic comments closer to the language and priorities of actual users, while ungrounded ones emphasize implementation specifics that experts ignored. Both synthetic conditions settle on a find-and-adapt evaluation pattern and overlook the experts' preference for image-based queries. The work matters for fields where domain experts are scarce, because it clarifies how synthetic evaluators can be steered and where they still diverge from human judgments.

Core claim

In the Sycamore three-condition probe, grounding synthetic personas with voice-of-customer artifacts from a prior interview study shifted their feedback toward the language and concerns of documented users, while ungrounded evaluators drifted toward operational specifics that real participants did not raise; both synthetic conditions converged on a find-and-adapt frame and missed the image-modality preference observed in the expert study.

What carries the argument

The three-condition probe design that contrasts ungrounded synthetic personas drawn from generic LLM priors, grounded synthetic personas constrained by voice-of-customer artifacts, and a published baseline of real domain experts evaluating the Geranium multimodal genomics visualization search engine.
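The review does not reproduce the prompt templates (they are deposited in the OSF materials); as a purely hypothetical sketch of what the contrast between the two synthetic conditions might look like, with all wording, persona fields, and the `voc_notes` structure invented rather than taken from the paper:

```python
# Hypothetical sketch of the two synthetic-persona conditions. The prompt
# wording and the voice-of-customer note format are assumptions, not the
# authors' actual templates (those are in the OSF repository).

def ungrounded_prompt(role: str) -> str:
    """Persona drawn only from the model's generic priors."""
    return (
        f"You are a {role} evaluating Geranium, a search engine for "
        "multimodal genomics visualizations. Follow the evaluation "
        "protocol and give feedback in your own words."
    )

def grounded_prompt(role: str, voc_notes: list[str]) -> str:
    """Persona additionally constrained by voice-of-customer artifacts
    from a prior interview study."""
    notes = "\n".join(f"- {n}" for n in voc_notes)
    return (
        ungrounded_prompt(role)
        + "\nGround every comment in the documented concerns below; "
        "do not raise issues these notes give you no basis for:\n"
        + notes
    )

print(grounded_prompt(
    "computational biologist",
    ["Wants to find an existing figure and adapt it",
     "Cares about linking views to raw genomic coordinates"],
))
```

The third condition needs no prompt at all: it is the published expert study, read as a fixed baseline.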

If this is right

  • Grounding via prior user artifacts can steer synthetic feedback to resemble documented user language and priorities in visualization evaluation.
  • Both grounded and ungrounded synthetic conditions may systematically converge on a find-and-adapt evaluation frame.
  • Synthetic evaluations can miss expert preferences such as image-modality use that appear in real studies.
  • Synthetic personas could serve as an initial probe alongside targeted expert studies rather than a full replacement in niche domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The grounding technique might extend to other visualization domains if the same pattern of alignment and missed modalities holds in new interview datasets.
  • Prompt engineering targeted at modality preferences could be tested as a way to reduce the gap between synthetic and expert evaluations.
  • The find-and-adapt frame may reflect a general property of current LLM reasoning about search tools rather than something specific to genomics.
  • Combining synthetic probes with small expert validation sets could become a practical workflow when expert time is limited.

Load-bearing premise

The observed differences in feedback patterns between the three conditions are driven primarily by the grounding manipulation rather than by prompt phrasing, model choice, or the specific properties of the Geranium tool and the single prior interview study.

What would settle it

A replication that swaps in a different LLM, revises the ungrounded condition's prompts, or uses a new visualization tool and interview dataset, and still finds no measurable difference in feedback language or missed preferences between grounded and ungrounded synthetic personas, would falsify the claim that grounding is the main driver.
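"Measurable shift in feedback language" invites a concrete metric. The paper's own analysis is qualitative, so the following is only a hypothetical sketch of one way a replication could quantify it: bag-of-words cosine similarity of each synthetic condition's comments to the expert baseline. The three comment strings are invented stand-ins, not data from the study.

```python
# Hypothetical metric for "shift in feedback language": cosine similarity of
# bag-of-words vectors against the expert baseline. Toy strings only.
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased word counts; a deliberately minimal text representation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented stand-ins for the three conditions (not quotes from the paper).
expert = "find an existing chart and adapt it; prefer image queries"
grounded = "I would find a published chart and adapt it to my data"
ungrounded = "what database backend and indexing latency does the engine use"

sim_grounded = cosine(bow(grounded), bow(expert))
sim_ungrounded = cosine(bow(ungrounded), bow(expert))
print(sim_grounded > sim_ungrounded)  # grounding should pull feedback closer
```

If grounding drives the effect, the grounded condition's similarity to the baseline should exceed the ungrounded one's across replications; a null result on such a metric is what the falsification test above asks for.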

Figures

Figures reproduced from arXiv: 2605.08630 by Astrid van den Brandt, Huyen N. Nguyen, Nils Gehlenborg.

Figure 1. Diagram of the Sycamore system. Left: three-condition evaluation on a common visualization retrieval system.
Figure 2. The object of evaluation, the Geranium [16] multimodal retrieval system. Users can search with text, image, or Gosling specification and rank their modality preferences at the end of the evaluation.
Figure 3. Four-step pipeline for instantiating synthetic evaluators.
Figure 4. The Sycamore session viewer streaming a grounded evaluator (CB1, Computational Biologist) through the Geranium protocol.
read the original abstract

Evaluating visualization systems in niche domains such as genomics is challenging due to scarcity of domain experts and difficulty recruiting a representative user base. While LLM-based synthetic personas are increasingly used to ease evaluation bottlenecks, they face well-founded skepticism. Rather than weighing synthetic personas as substitutes for real users, we ask a fundamental open question: when synthetic personas evaluate a real visualization system, what do they actually produce, and how does that output change when grounded in documented human contexts? We present Sycamore, an exploratory three-condition probe design using Geranium, a search engine for multimodal genomics visualization, as a case study. Sycamore evaluates Geranium using: (1) ungrounded synthetic personas from generic LLM priors; (2) grounded synthetic personas constrained by voice-of-customer artifacts from a prior interview study; and (3) a published baseline study of real domain experts. We observe that grounding shifts synthetic feedback toward the language and concerns of documented users, while ungrounded evaluators drift toward operational specifics that real participants did not raise; both synthetic conditions, however, converge on a find-and-adapt frame and miss the image-modality preference observed in the expert study. We discuss what these observations imply for where synthetic personas might fit alongside expert studies in domain-specific visualization evaluation. All supplemental materials are available at https://osf.io/kdfr3/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Sycamore, an exploratory three-condition probe using Geranium (a multimodal genomics visualization search engine) as a case study. It compares (1) ungrounded synthetic personas based on generic LLM priors, (2) grounded synthetic personas constrained by voice-of-customer artifacts from a prior interview study, and (3) a published baseline of real domain experts. The central claim is that grounding shifts synthetic feedback toward the language and concerns of documented users while ungrounded evaluators drift toward operational specifics; both synthetic conditions converge on a find-and-adapt frame and miss the image-modality preference observed in the expert study. The authors discuss implications for fitting synthetic personas alongside expert studies in domain-specific visualization evaluation, with all materials available on OSF.

Significance. If the observations hold, the work offers a constructive empirical characterization of synthetic personas in a niche domain where expert recruitment is difficult. It explicitly credits the open release of all supplemental materials on OSF and the direct comparison against a published expert baseline. The findings highlight both the potential value of grounding (alignment with real-user language) and its limits (persistent convergence on find-and-adapt and omission of modality preferences), contributing to HCI discussions on LLM use in visualization evaluation without claiming substitution for real users.

major comments (2)
  1. [§3 (three-condition probe design)] The grounded condition necessarily incorporates detailed artifacts from the single prior interview study, producing systematically longer, more concrete, and more structured prompts than the generic ungrounded condition. No ablation holds prompt length, specificity, and structure constant while varying only the presence of grounding content, so the observed shifts in feedback language, concerns, and missed modalities cannot be unambiguously attributed to the grounding manipulation rather than to prompt variation or LLM properties. This directly underpins the central claim.
  2. [§4 (qualitative observations and analysis)] The reported differences (e.g., convergence on the find-and-adapt frame, omission of the image-modality preference) rest on interpretive comparison without quantitative metrics, inter-rater reliability, or a transparent coding scheme. Because the study is a single three-condition qualitative probe, the absence of these elements makes it difficult to assess the robustness or replicability of the pattern differences that support the main conclusions.
minor comments (2)
  1. [Methods] The methods section would benefit from explicit inclusion or direct OSF links to the exact prompt templates used in each condition to allow readers to evaluate the concreteness differences noted above.
  2. [Discussion] A few sentences in the discussion repeat observations already detailed in the results; tightening these would improve flow without altering content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our exploratory probe. We address each major comment below and have revised the manuscript to increase transparency and explicitly discuss design limitations while preserving the original scope and claims.

read point-by-point responses
  1. Referee: §3 (three-condition probe design): The grounded condition necessarily incorporates detailed artifacts from the single prior interview study, producing systematically longer, more concrete, and more structured prompts than the generic ungrounded condition. No ablation holds prompt length, specificity, and structure constant while varying only the presence of grounding content, so the observed shifts in feedback language, concerns, and missed modalities cannot be unambiguously attributed to the grounding manipulation rather than prompt variation or LLM properties. This directly underpins the central claim.

    Authors: We agree that the grounded prompts are longer and more structured by construction because they embed specific voice-of-customer artifacts. This difference is inherent to testing grounding as it would be used in practice rather than an unintended confound. An additional control condition that artificially lengthens and structures non-grounded prompts would not correspond to how ungrounded synthetic personas are typically deployed. In the revised manuscript we have added an explicit limitations paragraph in §5 that acknowledges this attribution challenge and clarifies that the central claim concerns the observable differences between the two realistic synthetic conditions as implemented, not a fully isolated causal effect of grounding content alone. revision: partial

  2. Referee: §4 (qualitative observations and analysis): The reported differences (e.g., convergence on find-and-adapt frame, omission of image-modality preference) rest on interpretive comparison without quantitative metrics, inter-rater reliability, or a transparent coding scheme. Because the study is a single three-condition qualitative probe, the absence of these elements makes it difficult to assess the robustness or replicability of the pattern differences that support the main conclusions.

    Authors: The analysis is qualitative and interpretive, as appropriate for an exploratory three-condition probe. To improve transparency and replicability we have now deposited the full set of prompts, all generated persona outputs, the coding scheme, and annotated examples in the OSF repository. The methods section has been expanded to describe the iterative theme identification and cross-condition comparison process. Because this was a single-coder analysis conducted by the first author for this initial probe, inter-rater reliability was not computed; we now state this limitation explicitly and note that future larger-scale work could incorporate multiple coders and quantitative agreement metrics. The patterns remain useful for characterizing synthetic-persona behavior in this domain-specific case study. revision: partial
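The "quantitative agreement metrics" the rebuttal defers to future work can be made concrete. As a minimal hypothetical sketch, assuming two coders each assign one code per synthetic comment (the code labels below are invented, not the paper's actual coding scheme), Cohen's kappa corrects raw agreement for chance:

```python
# Cohen's kappa for two coders, one categorical code per item. The labels
# are invented examples, not the paper's coding scheme.
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the two coders' marginal label frequencies.
    ca, cb = Counter(coder_a), Counter(coder_b)
    expected = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

a = ["find-adapt", "find-adapt", "operational", "modality", "operational"]
b = ["find-adapt", "operational", "operational", "modality", "operational"]
print(round(cohens_kappa(a, b), 3))
```

A single-coder probe cannot produce such a number, which is exactly the limitation the revised manuscript now states.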

Circularity Check

0 steps flagged

No significant circularity: purely empirical three-condition comparison

full rationale

The paper conducts an exploratory empirical probe comparing feedback from ungrounded synthetic personas, grounded synthetic personas (using artifacts from a prior published interview study), and a published baseline of real domain experts evaluating the Geranium system. There are no equations, fitted parameters, predictions, derivations, uniqueness theorems, or ansatzes. The central observations (shifts in language/concerns, convergence on find-and-adapt frame, omission of image-modality preference) are direct qualitative comparisons of outputs from the three independent conditions against the external baseline. No step reduces by construction to the inputs or relies on self-citation as load-bearing justification; references to the prior study serve only as grounding data for one condition. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical exploratory HCI study with no mathematical model, no fitted parameters, no new axioms, and no invented entities; the only background assumptions are standard qualitative analysis practices and the validity of the prior interview study used for grounding.

pith-pipeline@v0.9.0 · 5544 in / 1301 out tokens · 46865 ms · 2026-05-12T01:21:53.390487+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1] ATLAS.ti Scientific Software Development GmbH. ATLAS.ti Mac, version 24.0.1. https://atlasti.com, 2024.
  2. [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A…
  3. [3] A. Cooper et al. The Inmates Are Running the Asylum: Why High-Tech Products Drive Us Crazy and How to Restore the Sanity, vol. 2. Sams, Indianapolis, 2004.
  4. [4] A. Crisan, B. Fiore-Gartland, and M. Tory. Passing the data baton: A retrospective analysis on data science work and workers. IEEE Transactions on Visualization and Computer Graphics, 27(2):1860–1870, 2021. doi: 10.1109/TVCG.2020.3030340
  5. [5] B. Gao, Z. Zeng, Y. Yu, I. P. Werry, C. L. Chan, M. Chen, H. Zhang, B. Huang, J. Ji, C. Leung, and C. Miao. "It seems to understand my heart": An empirical study of persona-driven persuasive AI agent for aging-in-place in Singapore. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, CHI '26. Association for Computing Machin…
  6. [6] S. Jain, C. Park, M. Viana, A. Wilson, and D. Calacci. Interaction context often increases sycophancy in LLMs. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, CHI '26. Association for Computing Machinery, New York, NY, USA, 2026. doi: 10.1145/3772318.3791915
  7. [7] James Newhook, Interaction Design Foundation. Are AI-Generated Synthetic Users Replacing Personas? What UX Designers Need to Know, 2024.
  8. [8] I. Kaate, J. Salminen, S.-G. Jung, T. T. T. Xuan, E. Häyhänen, J. Y. Azem, and B. J. Jansen. "You always get an answer": Analyzing users' interaction with AI-generated personas given unanswerable questions and risk of hallucination. In Proceedings of the 30th International Conference on Intelligent User Interfaces, pp. 1624–1638, 2025.
  9. [9] A. B. Kocaballi, M. Prpa, J. Salminen, D. Amin, and B. J. Jansen. From generation to simulation: Responsible use of AI personas in human-centered design and research. In Proceedings of the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, CHI EA '26. Association for Computing Machinery, New York, NY, USA, 2026. doi: 10…
  10. [10] M. Krzywinski, J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra. Circos: An information aesthetic for comparative genomics. Genome Research, 19(9):1639–1645, 2009.
  11. [11] S. L'Yi, Q. Wang, F. Lekschas, and N. Gehlenborg. Gosling: A grammar-based toolkit for scalable and interactive genomics data visualization. IEEE Transactions on Visualization and Computer Graphics, 28(1):140–150, 2021.
  12. [12] S. L'Yi, A. van den Brandt, E. Adams, H. N. Nguyen, and N. Gehlenborg. Learnable and expressive visualization authoring through blended interfaces. IEEE Transactions on Visualization and Computer Graphics, 31(1):459–469, 2025. doi: 10.1109/TVCG.2024.3456598
  13. [13] S. L'Yi, Q. Wang, and N. Gehlenborg. The role of visualization in genomics data analysis workflows: The interviews. In 2023 IEEE Visualization and Visual Analytics (VIS), pp. 101–105. IEEE, 2023.
  14. [14] H. N. Nguyen and N. Gehlenborg. Safire: Similarity framework for visualization retrieval. In 2025 IEEE Visualization and Visual Analytics (VIS), pp. 246–250, 2025. doi: 10.1109/VIS60296.2025.00055
  15. [15] H. N. Nguyen and N. Gehlenborg. Visualization retrieval for data literacy: Position paper. CHI 2026 Workshop on Data Literacy, Mar. doi: 10.48550/arXiv.2604.09598
  16. [16] H. N. Nguyen, S. L'Yi, T. C. Smits, S. Gao, M. Zitnik, and N. Gehlenborg. Geranium: Multimodal retrieval of genomics data visualizations. IEEE Transactions on Visualization and Computer Graphics, pp. 1–17, 2026. doi: 10.1109/TVCG.2026.3683429
  17. [17] A. Pandey, S. L'Yi, Q. Wang, M. A. Borkin, and N. Gehlenborg. GenoREC: A recommendation system for interactive genomics data visualization. IEEE Transactions on Visualization and Computer Graphics, 29(1):570–580, 2023. doi: 10.1109/TVCG.2022.3209407
  18. [18] J. S. Park, J. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. UIST '23. Association for Computing Machinery, New York, NY, USA, 2023. doi: 10.1145/3586183.3606763
  19. [19] J. S. Park, C. Q. Zou, J. Kamphorst, N. Egan, A. Shaw, B. M. Hill, C. Cai, M. R. Morris, P. Liang, R. Willer, and M. S. Bernstein. LLM agents grounded in self-reports enable general-purpose simulation of individuals, 2026.
  20. [20] J. Salminen, C. Liu, W. Pian, J. Chi, E. Häyhänen, and B. J. Jansen. Deus ex machina and personas from large language models: Investigating the composition of AI-generated persona descriptions. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24. Association for Computing Machinery, New York, NY, USA, 2024. do…
  21. [21] H. Thorvaldsdóttir, J. T. Robinson, and J. P. Mesirov. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2):178–192, 2013.
  22. [22] M. Truss. Personacite: VoC-grounded interviewable agentic synthetic AI personas for verifiable user and design research. In Proceedings of the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2026.
  23. [23] A. van den Brandt, S. L'Yi, H. N. Nguyen, A. Vilanova, and N. Gehlenborg. Understanding visualization authoring techniques for genomics data in the context of personas and tasks. IEEE Transactions on Visualization and Computer Graphics, 31(1):1180–1190. doi: 10.1109/TVCG.2024.3456298
  24. [24] L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta, and M. V. Schneider. Bioinformatics curriculum guidelines: Toward a definition of core competencies. PLOS Computational Biology, 10(3):e1003496, 2014.