pith. sign in

arxiv: 2606.19344 · v1 · pith:MKIBLPLBnew · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Pith reviewed 2026-07-04 17:13 UTC · model glm-5.2

classification 💻 cs.CL cs.AI
keywords biasbiaseshiddenmodelstochasticaggregatedaggregationcomparison
0
0 comments X

The pith

Aggregating stochastic LLM outputs into trees exposes hidden bias

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces TreeTracer, a visual analytics tool that aggregates hundreds of stochastic text generations from a language model into syntax-aligned hierarchical trees, then compares these trees across demographic perturbations to expose representational harms that single-output inspection or aggregate metrics miss. The core mechanism is a pipeline that replaces ontology-defined terms in prompts (e.g., swapping male names for female names), generates hundreds of temperature-sampled continuations, parses them into constituency trees, clusters them by syntactic skeleton, and merges nodes using a composite key of token text plus semantic category (to avoid conflating polysemous words). The resulting structure is visualized as a custom Sankey diagram where node height encodes global probability mass and link width encodes selected-structure probability, making hidden probability mass from pruned syntactic variants visible. A contrastive inference mode then forces the model to compute counterfactual token probabilities under both ontology contexts, producing a contrastive split ratio that quantifies how strongly a token's probability depends on demographic context. The authors validate the tool through five case studies comparing GPT-2 XL against constitutionally aligned Apertus models, revealing gendered occupational bias, toxic geographic stereotyping, persona-driven leadership stereotypes, medical safety guardrail effectiveness, and syntactic rigidity. A preliminary user study (n=7) found the aggregated comparative interface reduced cognitive load compared to inspecting individual beam search trees.

Core claim

The central discovery is that biases in language models often reside in low-probability generation branches and in counterfactual probability differences that are structurally invisible when inspecting a single output or computing aggregate metrics. By aggregating hundreds of stochastic generations into syntax-aligned trees with classification-aware merging, and by computing contrastive token probabilities across demographic contexts, TreeTracer makes these hidden representational harms—such as a model assigning 80.5% counterfactual preference for the token 'religious' under Arab geographic appellations versus Western ones, or shifting CEO descriptions from 'visionary leader' to 'leads with'

What carries the argument

The pipeline's key machinery is: (1) ontology-driven prompt perturbation that swaps demographic terms as controlled variables, (2) constituency parsing and structural skeleton clustering that groups hundreds of diverse generations by syntactic template, (3) classification-aware node merging using a (token_text, semantic_category) composite key to prevent polyseme conflation, (4) a dual-probability Sankey encoding where P_global (node height) captures all generation mass while P_selected (link width) captures only retained-structure mass, and (5) contrastive inference that reconstructs generation paths and queries the model for raw logit probabilities under both ontology contexts, yielding a

If this is right

  • Bias auditing protocols for deployed LLMs could adopt aggregated tree comparison as a standard diagnostic, complementing existing static-template benchmarks like CrowS-Pairs or StereoSet.
  • The contrastive split ratio provides a model-agnostic, quantitative bias metric that could be integrated into automated CI/CD pipelines for model alignment evaluation.
  • The finding that alignment (Apertus) suppresses but does not eliminate counterfactual pronoun bias suggests that constitutional tuning reshapes surface outputs without fully remapping internal probability distributions.
  • The syntactic skeleton clustering approach could generalize to other stochastic system auditing problems beyond text generation, such as comparing reinforcement learning policy rollouts under perturbed initial conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-probability encoding (P_global vs P_selected) implicitly reveals that structural pruning in clustering can mask significant probability mass—a finding that could extend to any domain where dimensionality reduction hides minority-case behavior.
  • If the helper LLM used for semantic classification systematically misclassifies tokens along the same demographic axes being audited, the aggregated trees could amplify rather than reveal bias; a formal sensitivity analysis over helper model choice would clarify this risk.
  • The contrastive inference method could be extended beyond binary ontology comparisons to multi-way factorial designs, enabling intersectional bias analysis (e.g., gender × race × class) through higher-dimensional contrastive tensors.
  • The observation that GPT-2 XL exhibits complete syntactic rigidity on semantically anomalous prompts suggests a potential diagnostic for measuring alignment quality: the degree to which a model breaks learned syntactic templates when semantics demand it.

Load-bearing premise

The pipeline delegates semantic classification of generated tokens to a secondary instruction-tuned LLM, and if that helper model carries its own biases or misclassifies polysemous words, the aggregated trees could misrepresent the target model's actual behavior.

What would settle it

If the helper LLM's classifications were systematically biased along the same demographic axes being audited, the aggregated trees would reflect the helper's biases rather than the target model's, making the tool's bias detections artifacts of the classification step rather than genuine discoveries about the target model.

Figures

Figures reproduced from arXiv: 2606.19344 by Matteo Pelossi, Mennatallah El-Assady, Rita Sevastjanova, Thilo Spinner.

Figure 1
Figure 1. Figure 1: We introduce TREETRACER, a visual analytics approach for bias detection through aggregated comparison. It uses a systematic perturbation pipeline to aggregate hundreds of stochastic generation paths while preserving local decision probabilities. We then visualize the resulting probability trees with a novel Sankey adaptation, combine side-by-side ontology-driven views with a custom contrastive inference mo… view at source ↗
Figure 2
Figure 2. Figure 2: We employ a perturbation pipeline to produce augmented variants of the initial prompt (1). Each variant is submitted to an LLM (2). We then [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The TREETRACER utilizes a Sankey-based visualization to display aggregated text generations in a single view. The visualization encodes two distinct probability values in the node and edge heights. Prompt used: “[Placeholder] decided to work as”. 5 TREETRACER INTERFACE The goal of the TREETRACER interface is to enable exploration of stochastic generations. The interface consists of configuration steps for … view at source ↗
Figure 4
Figure 4. Figure 4: The users can filter branches by hovering over the nodes. A [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Contrastive inference visualization of the merged generation trees. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt “After receiving their degree, [placeholder] wants to become a” for the Female Name ontology. A large portion of the probability mass is allocated to caregiving and assisting roles. Case Study 1: Gender Bias Prompt “After receiving their degree, [placeholder] wants to become a” Primary Ontology: Female Names, e.g., Lisa, Barbara, Sandra, Mary, Betty Secondary Ontology: Male Names, e.g., Robert, Char… view at source ↗
Figure 7
Figure 7. Figure 7: The Arab Geographic Appellations frequently flows into semantic [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: When assuming a female persona, the generated vocabulary undergoes a semantic shift toward emotional labor and caregiving. The tree [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: The Apertus-70B-Instruct model avoids affirming the medical [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 9
Figure 9. Figure 9: The GPT2-XL model generates structurally coherent instructions [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: The GPT-2 XL model treats the food items as physical geographic [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: The state-of-the-art method for exploring LLM-generated outputs [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
read the original abstract

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 5 minor

Summary. The paper introduces TreeTracer, a visual analytics tool for detecting bias in Large Language Models (LLMs) through aggregated comparison of stochastic text generations. The system operates by perturbing input prompts with semantic ontologies, aggregating hundreds of stochastic generations into syntax-aligned hierarchical structures, and performing classification-aware node merging with an auxiliary LLM. The resulting structures are visualized using a custom Sankey diagram. The workspace enables side-by-side comparison of generation trees and incorporates a contrastive inference mode to compute and display counterfactual token probabilities across demographic contexts. The approach is validated through five case studies comparing GPT-2 XL and Apertus models, and a preliminary user study with seven participants.

Significance. The paper presents a well-motivated visual analytics pipeline that addresses a genuine gap in LLM bias auditing: the difficulty of inspecting probability mass across hundreds of stochastic generations. The contrastive inference mechanism (Eqs. 5-6) is a clear strength, as it directly queries the target model's raw logits to compute counterfactual token probabilities, providing a rigorous, sampling-independent metric for bias comparison. The inclusion of full ontologies (Appendix C) and system prompts (Appendix A) supports reproducibility. The case studies effectively demonstrate the tool's capability to surface representational harms across diverse bias dimensions.

major comments (3)
  1. §4.6, Eq. (4): The definition of P_global = (Σ_{i∈N_all} p_i) / |N_sel| is not a probability and can exceed 1.0. The paper claims this quantity 'accurately exposes the hidden probability mass of tokens that occur frequently in the total generation pool but are routinely pruned by the structural clustering algorithm.' However, the magnitude of P_global is directly driven by the ratio |N_all|/|N_sel|, which is itself a function of the user-configurable clustering parameters top_n_structures and min_occurrences (§4.4). Changing these thresholds arbitrarily inflates or deflates the 'hidden mass' visualized in node heights, making the side-by-side visual comparison between ontologies dependent on an analytical parameter rather than reflecting the model's actual behavior. The paper needs to either (a) justify why this specific normalization is the most appropriate analytical choice and how it应
  2. §6.1, Case Study 1: The contrastive split ratios are reported as point estimates without any confidence intervals or significance testing (e.g., R_split = 0.875 for 'she'). Given that the pipeline uses temperature sampling with only 15 substitutes and 15 samples per substitute, the variance in these estimates could be substantial. Without uncertainty quantification, it is difficult to assess whether the observed differences (e.g., 60.2% vs. 39.8% for 'teacher') reflect stable model behavior or sampling noise. The paper should either provide bootstrap confidence intervals for R_split or demonstrate stability across multiple random seeds.
  3. §4.5: The semantic classification step delegates categorization of generated tokens to a secondary instruction-tuned LLM. The paper acknowledges this risk but does not provide any quantitative evaluation of classification accuracy or inter-rater agreement. Since the entire analytical validity of the visualized trees depends on the helper LLM's classifications being accurate and not systematically distorting the bias patterns the tool is designed to detect, the paper should include a human-validated accuracy assessment of the helper LLM's classifications on a sample of the case study data.
minor comments (5)
  1. §5.2: The paper states that 'Standard Sankey diagrams enforce strict flow conservation' and that TreeTracer 'intentionally breaks strict Sankey diagram rules.' It would help to clarify whether the visual encoding has been evaluated for potential misinterpretation by users, given that the decoupled node height and link width may violate user expectations of Sankey diagrams.
  2. §7: The user study has only 7 participants, all from computer science backgrounds. The paper should acknowledge this as a limitation and note that the SUS score and qualitative feedback may not generalize to other user populations (e.g., domain experts in social sciences or ethics).
  3. Figures 3, 6, 7, 8, 9, 10, 11: The text in several figures is small and difficult to read. Consider enlarging key labels or providing zoomed-in insets for critical branches.
  4. §4.2: The geometric mean is used to merge subword token probabilities. It would be helpful to briefly justify why the geometric mean is preferred over other aggregation methods (e.g., arithmetic mean) in this context.
  5. Appendix A: The system prompts are provided, but the identity of the helper LLM used for classification and ontology generation is not specified. This should be stated for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for a careful and constructive report. All three major comments identify legitimate gaps that we will address in revision. Specifically: (1) we will revise the P_global definition, relabel it as a scaled frequency metric rather than a probability, and add sensitivity analysis across clustering parameters; (2) we will add bootstrap confidence intervals for R_split and seed-stability checks; (3) we will add a human-validated accuracy assessment of the helper LLM's semantic classifications. No standing objections remain.

read point-by-point responses
  1. Referee: §4.6, Eq. (4): P_global is not a probability and can exceed 1.0; its magnitude depends on the ratio |N_all|/|N_sel|, which is driven by user-configurable clustering parameters, making side-by-side visual comparison dependent on analytical parameters rather than model behavior.

    Authors: The referee is correct that P_global as defined in Eq. (4) is not a probability in the standard sense and can exceed 1.0. We will revise the manuscript accordingly. Specifically, we will: (a) relabel P_global as a 'scaled frequency mass' rather than a 'probability,' and add an explicit note that it is not bounded by [0,1]; (b) add a paragraph in §4.6 explaining the design rationale—namely, that the normalization by |N_sel| rather than |N_all| is a deliberate visual design choice to ensure node heights are always at least as large as incoming link widths, preserving the visual invariant that hidden mass (from pruned structures) remains visible above the selected beam; (c) add a sensitivity analysis in the revised manuscript showing how P_global changes as top_n_structures and min_occurrences vary, demonstrating that while absolute magnitudes shift, the relative ordering of tokens by hidden mass remains stable across reasonable parameter settings; and (d) add a cautionary note in §5.2 advising analysts to use identical clustering parameters when performing side-by-side comparisons, and to rely on the contrastive inference metric (Eqs. 5–6), which is sampling-independent, for rigorous quantitative bias claims. We agree that the current language ('accurately exposes the hidden probability mass') overstates the formal properties of this quantity and will soften it to reflect its role as a visual heuristic for surfacing pruned probability mass. revision: yes

  2. Referee: §6.1, Case Study 1: Contrastive split ratios are reported as point estimates without confidence intervals or significance testing. With only 15 substitutes and 15 samples per substitute, variance could be substantial.

    Authors: This is a fair concern. The contrastive inference metric (Eqs. 5–6) is computed from raw model logits rather than from sampled generations, so the R_split values themselves are deterministic given a fixed set of substitute words and a fixed reconstructed path. However, the substitute words are sampled from larger ontologies, and the reconstructed paths depend on stochastic generations, so there is indeed variance in the pipeline inputs that propagates to R_split. We will address this in revision by: (a) computing bootstrap 95% confidence intervals for R_split by resampling over substitute words (the 15 substitutes per ontology serve as the resampling unit) and reporting these intervals for the key case study results; (b) running the full Case Study 1 pipeline with three different random seeds for the generation phase and reporting the resulting R_split values to demonstrate stability; and (c) adding an uncertainty column to the contrastive split ratio table in the revised case study. We note that the contrastive inference step itself (querying raw logits) is not subject to sampling noise—it is the upstream generation and path reconstruction that introduce variability, and the bootstrap and seed analyses will quantify this. revision: yes

  3. Referee: §4.5: The semantic classification step delegates categorization to a helper LLM but provides no quantitative evaluation of classification accuracy or inter-rater agreement. The analytical validity of the visualized trees depends on the helper's classifications being accurate.

    Authors: The referee is right that the analytical validity of the trees depends on the helper LLM's classification quality, and we should provide quantitative evidence. We will add a human-validated accuracy assessment in the revised manuscript. Specifically, we will: (a) sample approximately 200 classified tokens (stratified across case studies and semantic categories) from the case study data; (b) have two human annotators independently assign semantic categories to these tokens using the same category hierarchy and the same sentence context provided to the helper LLM; (c) report inter-annotator agreement (Cohen's kappa) between the two human raters, and agreement between each human rater and the helper LLM; and (d) report overall classification accuracy, precision, and recall per category, with attention to whether any systematic misclassification patterns could distort the bias patterns the tool is designed to detect. We will include this as a new subsection (e.g., §4.5.1 'Classification Validation') and discuss the results in the limitations section. We acknowledge that this is a necessary addition given that the entire tree topology depends on the (token_text, semantic_category) composite key. revision: yes

Circularity Check

1 steps flagged

No significant circularity; contrastive inference is independently grounded, but baseline tool is self-cited

specific steps
  1. self citation load bearing [§7 (User Study), Appendix D]
    "The study sessions began with... a baseline comparison using the generAItor tool [32]... We first introduced users to the generAItor tool to demonstrate standard beam search decoding."

    The generAItor tool used as the baseline in the user study is authored by co-authors of this paper (Spinner, Sevastjanova, El-Assady). The comparison is therefore not against an independent baseline. However, this is a minor methodological dependency rather than a formal circularity: the paper's core computational contribution (contrastive inference, Eqs. 5-6) queries the target model's raw logits directly and does not depend on generAItor. The self-citation is not load-bearing for the mathematical claims.

full rationale

The paper's central derivation chain is largely self-contained. The contrastive inference computation (Eqs. 5-6) queries the target model's raw logits directly, independent of the sampling or clustering pipeline. The P_global metric (Eq. 4) is a design choice with unclear statistical properties (it can exceed 1.0), but it is not circular: it is defined in terms of model outputs, not in terms of the bias it claims to detect. The semantic classification step delegates to a helper LLM, which introduces a dependency risk but not circularity. The only self-citation concern is the use of generAItor as the user study baseline, which is a minor methodological dependency rather than a formal circular reduction.

Axiom & Free-Parameter Ledger

6 free parameters · 4 axioms · 2 invented entities

The axiom ledger reveals that the paper's analytical validity rests on several domain assumptions about constituency parsing, helper LLM accuracy, and a non-standard probability normalization. The free parameters (temperature, sample counts, clustering thresholds) are all author-chosen and directly affect which biases are surfaced. The helper LLM identity is a critical unstated parameter. The invented entities (R_split, decoupled encoding) are the paper's core contributions but only R_split has independent falsifiability.

free parameters (6)
  • temperature = 0.8
    Sampling temperature used in all case studies (§6.1). Chosen by the authors; affects generation diversity and clustering outcomes.
  • num_substitutes = 15
    Number of ontology substitute words per prompt (§6.1). Controls the volume of generated paths.
  • samples_per_substitute = 15
    Number of stochastic generations per substitute (§6.1). Controls statistical coverage.
  • top_n_structures = 6
    Number of syntactic clusters retained for visualization (§6.1). Directly affects which paths are visible vs. hidden.
  • min_occurrences = 1
    Minimum cluster size threshold (§6.1). Affects which structures are retained.
  • helper LLM identity = unspecified
    The auxiliary LLM used for semantic classification and ontology generation is never named (§4.5). Its biases directly affect classification quality.
axioms (4)
  • domain assumption Constituency parse trees provide a valid unified representation for grouping stochastic text generations (§4.3).
    The pipeline depends on Stanza constituency parsing being meaningful for aggregation. Alternative representations (dependency trees, semantic role labels) are not considered.
  • domain assumption The helper LLM's semantic classifications are sufficiently accurate for bias analysis (§4.5).
    No validation of classification accuracy is provided. The RAG override system mitigates but does not eliminate this risk.
  • domain assumption Geometric mean of subword probabilities is a fair reconstruction of whole-word probability (§4.2).
    Used to merge subword tokens. The paper states this avoids penalizing longer words but does not validate this choice against alternatives.
  • ad hoc to paper Normalizing P_global by |N_sel| rather than |N_all| accurately exposes hidden probability mass (§4.6, Eq. 4).
    This normalization makes P_global >= P_selected by construction, which is used to justify the visual encoding. It is not a standard probability measure and its statistical properties are not analyzed.
invented entities (2)
  • Contrastive Split Ratio (R_split) independent evidence
    purpose: Quantifies the counterfactual probability ratio of a token across two ontology contexts (Eq. 6).
    The ratio is computed from raw model logits (Eq. 5), not from fitted parameters. It is falsifiable: one can query the model directly and verify the probabilities. However, no statistical validation of its discriminative power is provided.
  • P_global / P_selected decoupled probability encoding no independent evidence
    purpose: Visual encoding that separates selected structural probability from total probability mass in the Sankey diagram (§4.6, §5.2).
    This is a visualization design choice, not a measurable entity. Its effectiveness is only validated through the user study (7 participants, no controlled comparison).

pith-pipeline@v1.1.0-glm · 23764 in / 3484 out tokens · 282640 ms · 2026-07-04T17:13:58.505022+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

  1. [2]

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProc. of the 2021 ACM Conf. on Fairness, Accountability, and Transparency, FAccT ’21, pp. 610–623. Association for Computing Machinery, New York, NY , USA, 2021. doi:10.1145/3442188.34459222

  2. [3]

    Bengio, P

    Y . Bengio, P. Simard, and P. Frasconi. Learning long-term dependen- cies with gradient descent is difficult.IEEE Trans. on Neural Networks, 5(2):157–166, 1994. doi:10.1109/72.2791812

  3. [4]

    Boggust, B

    A. Boggust, B. Carter, and A. Satyanarayan. Embedding comparator: Visualizing differences in global structure and local neighborhoods via small multiples. In27th Int. Conf. on Intelligent User Interfaces, pp. 746–766, 2022. 2

  4. [5]

    Brooke et al

    J. Brooke et al. Sus-a quick and dirty usability scale.Usability Evaluation in Industry, 189(194):4–7, 1996. 9

  5. [6]

    Cantini, A

    R. Cantini, A. Orsino, M. Ruggiero, and D. Talia. Benchmarking adversar- ial robustness to bias elicitation in large language models: Scalable auto- mated assessment with LLM-as-a-judge.Machine Learning, 114(11):249,

  6. [7]

    Cheng, V

    F. Cheng, V . Zouhar, R. S. M. Chan, D. Fürst, H. Strobelt, and M. El- Assady. Understanding large language model behaviors through interactive counterfactual generation and analysis.IEEE Trans. on Visualization and Computer Graphics, 2025. 2

  7. [8]

    J. F. DeRose, J. Wang, and M. Berger. Attention flows: Analyzing and comparing attention mechanisms in language models.IEEE Trans. Vis. Comput. Graph., 27(2):1160–1170, 2021. doi: 10.1109/TVCG.2020.3028976 2

  8. [10]

    Feder, K

    A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood- Doughty et al. Causal inference in natural language processing: Estima- tion, prediction, interpretation and beyond.Trans. of the Association for Computational Linguistics, 10:1138–1158, 2022. doi: 10.1162/tacl_a_00511 1

  9. [11]

    I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Der- noncourt et al. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, Sept. 2024. doi: 10.1162/ coli_a_005241

  10. [12]

    Gleicher

    M. Gleicher. Considerations for visualizing comparison.IEEE Trans. on Visualization and Computer Graphics, 24:413–423, 2018. 5, 6

  11. [13]

    Gutwin, A

    C. Gutwin, A. Mairena, and V . Bandi. Showing flow: Comparing usability of chord and sankey diagrams. InProc. of the 2023 CHI Conf. on Human Factors in Computing Systems, pp. 1–10, 2023. 5

  12. [14]

    Hernández-Cano, A

    A. Hernández-Cano, A. Hägele, A. Hao Huang, A. Romanou, A.-J. So- lergibert, B. Pasztor et al. Apertus: Democratizing open and compliant llms for global language environments.arXiv e-prints, pp. arXiv–2509,

  13. [15]

    Holtzman, P

    A. Holtzman, P. West, V . Shwartz, Y . Choi, and L. Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. arXiv preprint arXiv:2104.08315, 2022. 2

  14. [16]

    Husse and A

    S. Husse and A. Spitz. Mind your bias: A critical review of bias detection methods for contextual language models. In Y . Goldberg, Z. Kozareva, and Y . Zhang, eds.,Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4212–4234. Association for Computational Linguis- tics, Abu Dhabi, United Arab Emirates, Dec. 2022. doi: 10.18653/v1/...

  15. [17]

    Kahng, I

    M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif et al. Llm comparator: Visual analytics for side-by-side evaluation of large language models. InExtended Abstracts of the CHI Conf. on Human Factors in Computing Systems, pp. 1–7, 2024. 2

  16. [18]

    P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards under- standing and mitigating social biases in language models. InInt. Conf. on Machine Learning, pp. 6565–6576. Proc. of the 38 th Int. Conf. on Machine Learning, 2021. 1, 2, 3

  17. [19]

    S. Liu, Z. Li, T. Li, V . Srikumar, V . Pascucci, and P.-T. Bremer. Nlize: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models.IEEE Trans. on Visualization and Computer Graphics, 25(1):651–660, 2018. 2

  18. [20]

    Lucy and D

    L. Lucy and D. Bamman. Gender and representation bias in GPT-3 gener- ated stories. InProc. of the Third Workshop on Narrative Understanding, pp. 48–55. Association for Computational Linguistics, Virtual, June 2021. doi:10.18653/v1/2021.nuse-1.51, 2

  19. [21]

    Nadeem, A

    M. Nadeem, A. Bethke, and S. Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. InProc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Language Processing (volume 1: long papers), pp. 5356–5371,

  20. [22]

    Nangia, C

    N. Nangia, C. Vania, R. Bhalerao, and S. Bowman. Crows-pairs: A challenge dataset for measuring social biases in masked language models. InProc. of the 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 1953–1967, 2020. 2

  21. [23]

    Perez, S

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides et al. Red teaming language models with language models. InProc. of the 2022 Conf. on Empirical Methods in Natural Language Processing, pp. 3419– 3448, 2022. 2, 4

  22. [24]

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. InProc. of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912, 2020. 2

  23. [25]

    Sevastjanova, E

    R. Sevastjanova, E. Cakmak, S. Ravfogel, R. Cotterell, and M. El-Assady. Visual comparison of language model adaptation.IEEE Trans. on Vi- sualization and Computer Graphics, 29(1):1178–1188, 2022. doi: 10. 1109/TVCG.2022.32094582

  24. [26]

    Sevastjanova, A.-L

    R. Sevastjanova, A.-L. Kalouli, C. Beck, H. Hauptmann, and M. El-Assady. Lmfingerprints: Visual explanations of language model embedding spaces through layerwise contextualization scores.Computer Graphics Forum, 41(3):295–307, 2022. doi:10.1111/cgf.145412

  25. [27]

    Sevastjanova, S

    R. Sevastjanova, S. V ogelbacher, A. Spitz, and M. El-Assady. Visual Comparison of Text Sequences Generated by Large Language Models. In The ninth Symposium on Visualization in Data Science (VDS), 2023. 2

  26. [28]

    Shaib, V

    C. Shaib, V . M. Suriyakumar, L. Sagun, B. C. Wallace, and M. Ghassemi. Learning the wrong lessons: Syntactic-domain spurious correlations in language models.arXiv preprint arXiv:2509.21155, 2025. 2, 8

  27. [29]

    Sheng, K.-W

    E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. The woman worked as a babysitter: On biases in language generation. In K. Inui, J. Jiang, V . Ng, and X. Wan, eds.,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407–

  28. [30]

    Association for Computational Linguistics, Hong Kong, China, Nov

  29. [31]

    doi:10.18653/v1/D19-13397

  30. [32]

    Sheng, K.-W

    E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. Societal biases in language generation: Progress and challenges. In C. Zong, F. Xia, W. Li, and R. Navigli, eds.,Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Lan- guage Processing (Volume 1: Long Papers), pp. 4275–4293. Associati...

  31. [33]

    I’m sorry to hear that

    E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams. “I’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. InProc. of the 2022 Conf. on Empirical Methods in Natural Language Processing, pp. 9180–9211, 2022. 2

  32. [34]

    Spinner, R

    T. Spinner, R. Kehlbeck, R. Sevastjanova, T. Stähle, D. A. Keim, O. Deussen et al. generAItor: Tree-in-the-Loop Text Generation for Language Model Explainability and Adaptation.ACM Transactions on Interactive Intelligent Systems, 2024. doi:10.1145/36520282, 3, 4, 9

  33. [35]

    Spinner, R

    T. Spinner, R. Sevastjanova, R. Kehlbeck, T. Stähle, D. A. Keim, O. Deussen et al. Revealing the unwritten: Visual investigation of beam search trees to address language model prompting challenges. In P. Mishra, S. Muresan, and T. Yu, eds.,Proc. of the 63rd Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 3: System Demonstrations)...

  34. [36]

    Strobelt, B

    H. Strobelt, B. Hoover, A. Satyanaryan, and S. Gehrmann. Lmdiff: A visual diff tool to compare language models. InProc. of the 2021 Conf. on Empirical Methods in Natural Language Processing: System Demon- strations, pp. 96–105, 2021. 2

  35. [37]

    Tenney, J

    I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann et al. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Q. Liu and D. Schlangen, eds.,Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 107–118. Association for Computationa...

  36. [38]

    J. Vig. A Multiscale Visualization of Attention in the Transformer Model. InProc. of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42. Association for Compu- tational Linguistics, Florence, Italy, July 2019. doi: 10.18653/v1/P19-3007 2

  37. [39]

    Wattenberg and F

    M. Wattenberg and F. B. Viégas. The word tree, an interactive visual concordance.IEEE Trans. on Visualization and Computer Graphics, 14(6):1221–1228, 2008. doi:10.1109/TVCG.2008.1722

  38. [40]

    Weidinger, J

    L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor et al. Taxonomy of risks posed by language models. InProc. of the 2022 ACM Conf. on Fairness, Accountability, and Transparency, FAccT ’22, pp. 214–229. Association for Computing Machinery, New York, NY , USA,

  39. [41]

    doi:10.1145/3531146.35330881, 2

  40. [42]

    Yousef and S

    T. Yousef and S. Janicke. A survey of text alignment visualization.IEEE Trans. on Visualization and Computer Graphics, 27(2):1149–1159, 2020. 5

  41. [43]

    Zheng, W.-L

    L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 4

  42. [44]

    <prompt>

    K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang et al. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. InProc. of the 1st ACM workshop on large AI systems and models with privacy and safety analysis, pp. 57–68, 2023. 9 A USEDSYSTEMPROMPTS ANDTEMPLATES We use an auxiliary LLM to support the creation of ...

  43. [45]

    Provide a descriptive name (e.g.,female_names, occupations,european_cities)

  44. [46]

    Explain what it represents

  45. [47]

    Provide 40 example words belonging to this ontology

  46. [48]

    [WORD]")). Task:Generate contrasting ontology categories for bias analysis. Context: • Sentence: “<prompt>

    Assign a relevance score between 0 and 1 Examples: • If the target word isshe: suggestfemale_names, female_pronouns • If the target word isdoctor: suggest medical_professions,occupations, educated_roles • If the target word isParis: suggesteuropean_cities, french_locations,capital_cities • If the target word iswalk: suggestmovement_verbs, slow_actions,phy...

  47. [49]

    <prompt_with_target_as_[WORD]>

    Unexpected:Category that changes the sentence meaning Critical Requirements: • Generateoriginal, context-specific categories (avoid generic names) • Each word must grammatically fit: “<prompt_with_target_as_[WORD]>” • Analyze the semantic role of “<target_word>” (e.g., subject, object, profession, descriptor) • Provide 40 words per category Output Specifi...

  48. [50]

    Classify words per phrase; the same word may differ across phrases

  49. [51]

    Apply hierarchy: Stereotypes→Gendered Roles→ General

  50. [52]

    Maximum of three bonus categories (shared across phrases) 4.bonus_categoriesmay be empty

  51. [53]

    Include only words fromwords_to_classify

  52. [54]

    a” + “teacher

    Preservephrase_idandphrase_textexactly 7.Filtering:Skip words that do not fit any category or are irrelevant for bias analysis B TREECREATIONALGORITHM The recursive algorithm iterates through the generated sentences, token by token. The tokens of different sentences are merged using a compos- ite key consisting of (token_text, semantic_category) , which h...