pith. machine review for the scientific record.

arxiv: 2604.18724 · v2 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords language model generations · distribution visualization · text graph · interactive visualization · user study · prompt iteration · stochastic outputs

The pith

GROVE represents multiple language model generations as overlapping paths in a text graph to expose hidden distributional structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current interactions with language models rely on single outputs, which obscure the full distribution of possible responses including modes, rare cases, and prompt sensitivities. This leads users to overgeneralize from limited examples when developing prompts for open-ended tasks. The paper presents GROVE, an interactive visualization that displays generations as paths through a shared text graph to highlight common structures and branches. Evaluations from three user studies indicate that this graph view strengthens judgments about diversity and overall structure, whereas examining raw outputs remains preferable for detailed analysis, pointing to a hybrid evaluation approach.
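
The problem is concrete: a single response is one draw from a distribution, and only repeated sampling at a fixed prompt and temperature exposes its modes and edge cases. A minimal sketch of that sampling step, assuming an OpenAI-style chat-completions client; the model name and parameters are illustrative, not the paper's setup:

```python
from collections import Counter
from openai import OpenAI  # assumed client; any API returning multiple samples works

client = OpenAI()

def sample_generations(prompt, n=50, temperature=1.0):
    """Draw n independent completions for one prompt to expose its output distribution."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name, not the paper's setup
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
        max_tokens=64,
    )
    return [c.message.content.strip() for c in resp.choices]

# A single sample hides the modes; tallying repeats makes them visible.
outputs = sample_generations("Name a Greek god or goddess.")
for text, count in Counter(outputs).most_common(5):
    print(f"{count:3d}  {text}")
```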

Core claim

We introduce GROVE as an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. Formative research with LM researchers informed the design, and three crowdsourced studies targeting distributional tasks support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.

What carries the argument

GROVE, the interactive visualization system that models LM outputs as paths in a text graph to show overlaps and divergences.
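
To make the text-graph idea concrete, here is a minimal sketch of a position-aligned merge: each generation becomes a path over its tokens, and edge weights count how many generations share a transition. GROVE's actual merge step uses a token-similarity score that is not reproduced here; this prefix-style alignment on whitespace tokens is an assumption for illustration only.

```python
from collections import defaultdict

def build_text_graph(generations):
    """Merge generations into one graph: nodes are (depth, token) pairs and
    edge weights count how many generations traverse each transition.
    Whitespace tokens stand in for model tokens; GROVE merges by token
    similarity, which this position-aligned simplification does not attempt."""
    nodes = defaultdict(int)   # (depth, token) -> number of paths through the node
    edges = defaultdict(int)   # ((depth, tok), (depth + 1, tok')) -> shared-path count
    for text in generations:
        tokens = ["<start>"] + text.split()
        for depth in range(len(tokens) - 1):
            a, b = tokens[depth], tokens[depth + 1]
            nodes[(depth, a)] += 1
            edges[((depth, a), (depth + 1, b))] += 1
        nodes[(len(tokens) - 1, tokens[-1])] += 1
    return nodes, edges

nodes, edges = build_text_graph([
    "Zeus is the king of the gods",
    "Zeus is the god of thunder",
    "Athena is the goddess of wisdom",
])
# Heavy edges mark shared structure; weight drops mark branching points.
for (src, dst), w in sorted(edges.items(), key=lambda kv: -kv[1])[:5]:
    print(w, src, "->", dst)
```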

Load-bearing premise

The tasks and participant pools in the three crowdsourced user studies sufficiently represent the real-world needs and behaviors of researchers and practitioners who use language models for open-ended tasks.

What would settle it

If a controlled experiment finds that users perform no better at distributional tasks with GROVE than with standard single-output interfaces, the value of the graph visualization would be questioned.

Figures

Figures reproduced from arXiv: 2604.18724 by Claire Yang, Deniz Nazar, Emily Reif, Jared Hwang, Jeff Heer, Noah A. Smith.

Figure 1: GROVE visualizes a set of outputs from one or more prompts or model configurations. Here we see a distribution of generations…

Figure 2: The basic GROVE interface. A: the global controls for generation, including the prompt, number of generations, model, and temperature. B: the graph of generations. C: the graph controls, including various ways to simplify and render it. D: the original raw text outputs, in list form, which can be expanded or minimized.

Figure 3: A comparison of the prompts "summarize the [Trump/Obama/Taft] presidency in one sentence". The nodes in the graph are colored by the…

Figure 4: Per-participant difference (Graph − List) in accuracy for all three studies. Diversity (n=36): graph yielded higher accuracy, p=0.012. Single distribution (n=26): list yielded higher accuracy, p=0.009. Comparison (n=40): list yielded higher accuracy, p=0.002.

Figure 5: Overall interface preference (1 = graph, 7 = list) by study. For diversity, participants strongly favored the graph. For single distribution, preferences were polarized (participants strongly favored either graph or list). For the comparison task, preferences were more evenly spread.

Figure 7: Outputs from the open-ended prompt "name a Greek god or…"

Figure 8: Examples of how graph visualizations surface distributional structure: (a) temporally aligned translation outputs, (b) structural templates in…
read the original abstract

Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GROVE, an interactive visualization that renders multiple LM generations as overlapping paths in a text graph to expose distributional features such as modes, branches, and clusters while retaining access to raw text. Grounded in a formative study with 13 LM researchers, the authors evaluate GROVE via three crowdsourced user studies (N=47, 44, 40) on complementary tasks and conclude that the results support a hybrid workflow: graph summaries aid structural judgments (e.g., diversity assessment) while direct output inspection remains preferable for detail-oriented questions.

Significance. If the empirical support holds, the work addresses a practical gap in LM interfaces by making stochasticity and distributional structure visible rather than hidden behind single samples. The formative study with domain experts provides useful grounding for design choices, and the hybrid-workflow finding offers a concrete, actionable recommendation for prompt iteration and model evaluation. The graph-based approach for text distributions is a novel contribution that could influence future tooling in open-ended generation tasks.

major comments (2)
  1. §5 (User Studies): The three crowdsourced evaluations (N=47/44/40) lack reported details on statistical methods, effect sizes, participant demographics, screening criteria, or raw data/analysis code. Because the central hybrid-workflow claim rests entirely on these studies, the strength of the evidence cannot be assessed without this information.
  2. §3 (Formative Study) and §5: No validation or discussion is provided of whether the crowdsourced participants' reasoning patterns, domain knowledge, or task behaviors align with those of the n=13 researchers from the formative study. If crowd workers treat the tasks as abstract puzzles rather than real prompt-engineering workflows, the measured advantages for GROVE on diversity and structural tasks may not generalize to the target practitioner population, which is load-bearing for the recommended hybrid interface.
minor comments (2)
  1. Figure 3 (or an equivalent visualization figure): Adding explicit callouts or legends for shared prefixes, branching points, and cluster boundaries would improve readability for readers new to the graph representation.
  2. Discussion: The limitations section could more explicitly address the crowdsourced setup and how it may differ from expert use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights key areas for improving the transparency and generalizability of our user studies. We address each major comment below, indicating the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: §5 (User Studies): The three crowdsourced evaluations (N=47/44/40) lack reported details on statistical methods, effect sizes, participant demographics, screening criteria, or raw data/analysis code. Because the central hybrid-workflow claim rests entirely on these studies, the strength of the evidence cannot be assessed without this information.

    Authors: We agree that these details are necessary for readers to fully assess the evidence. In the revised manuscript, we will expand §5 with a new subsection on study methodology and analysis. This will include: (1) full description of statistical methods (e.g., specific tests, p-values, and corrections), (2) effect sizes for all key comparisons, (3) participant demographics (age, gender, education, LM usage frequency), (4) screening criteria (attention checks, platform qualifications), and (5) a link to anonymized raw data and analysis code in a public repository. These additions will directly support evaluation of the hybrid-workflow findings. revision: yes
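
For concreteness, the promised analysis, a paired non-parametric test on per-participant (Graph − List) accuracy with an effect size, matching Figure 4's framing, might look like the sketch below. This is an assumption about the shape of the analysis, not the authors' code, and the accuracy arrays are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

def paired_interface_test(acc_graph, acc_list):
    """Wilcoxon signed-rank test on per-participant (Graph - List) accuracy,
    with a matched-pairs rank-biserial correlation as the effect size."""
    diff = np.asarray(acc_graph) - np.asarray(acc_list)
    _, p = wilcoxon(diff)   # paired, non-parametric test
    d = diff[diff != 0]     # the signed-rank test discards zero differences
    ranks = rankdata(np.abs(d))
    r = (ranks[d > 0].sum() - ranks[d < 0].sum()) / ranks.sum()
    return p, r

# Hypothetical per-participant accuracies for a 36-person study (illustrative only).
rng = np.random.default_rng(0)
graph_acc = rng.uniform(0.5, 1.0, size=36)
list_acc = rng.uniform(0.4, 0.9, size=36)
p, r = paired_interface_test(graph_acc, list_acc)
print(f"p = {p:.3f}, rank-biserial r = {r:+.2f}")
```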

  2. Referee: §3 (Formative Study) and §5: No validation or discussion is provided of whether the crowdsourced participants' reasoning patterns, domain knowledge, or task behaviors align with those of the n=13 researchers from the formative study. If crowd workers treat the tasks as abstract puzzles rather than real prompt-engineering workflows, the measured advantages for GROVE on diversity and structural tasks may not generalize to the target practitioner population, which is load-bearing for the recommended hybrid interface.

    Authors: We acknowledge this limitation in generalizability. The revised manuscript will add explicit discussion in §5 and the limitations section comparing the two participant groups. We will explain how task designs were derived from formative study insights to approximate real workflows, report available background data on crowdsourced participants, and qualify the hybrid-workflow recommendation as based on complementary but distinct populations. We will also outline plans for future expert-user validation studies. This addresses the concern without overstating current evidence. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims grounded in independent empirical studies

full rationale

The paper introduces GROVE after a formative study (n=13) and evaluates it via three new crowdsourced user studies (N=47/44/40). The central claim, that graph summaries aid structural judgments while direct inspection aids detail-oriented ones, is presented as a direct outcome of those fresh participant results. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described content. The reported advantages derive from the new experiments rather than reducing to prior author work or internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard HCI assumptions about the validity of crowdsourced studies for evaluating visualization tools and the assumption that the text-graph representation faithfully captures distributional properties of LM outputs without introducing misleading artifacts.

axioms (1)
  • domain assumption: Crowdsourced participants performing the described tasks provide reliable evidence about visualization effectiveness for LM users.
    The evaluation depends on the three user studies with the given sample sizes.
invented entities (1)
  • GROVE (no independent evidence)
    purpose: Interactive system that renders multiple LM generations as overlapping paths in a text graph.
    New tool introduced by the authors; no independent evidence outside the paper's own studies.

pith-pipeline@v0.9.0 · 5478 in / 1378 out tokens · 47446 ms · 2026-05-10T04:02:47.757957+00:00 · methodology

discussion (0)

