Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Matteo Pelossi; Mennatallah El-Assady; Rita Sevastjanova; Thilo Spinner

arxiv: 2606.19344 · v1 · pith:MKIBLPLBnew · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-04 17:13 UTC · model glm-5.2

classification 💻 cs.CL cs.AI

keywords: bias · biases · hidden · model · stochastic · aggregated · aggregation · comparison

The pith

Aggregating stochastic LLM outputs into trees exposes hidden bias

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces TreeTracer, a visual analytics tool that aggregates hundreds of stochastic text generations from a language model into syntax-aligned hierarchical trees, then compares these trees across demographic perturbations to expose representational harms that single-output inspection or aggregate metrics miss. The core mechanism is a pipeline that replaces ontology-defined terms in prompts (e.g., swapping male names for female names), generates hundreds of temperature-sampled continuations, parses them into constituency trees, clusters them by syntactic skeleton, and merges nodes using a composite key of token text plus semantic category (to avoid conflating polysemous words). The resulting structure is visualized as a custom Sankey diagram where node height encodes global probability mass and link width encodes selected-structure probability, making hidden probability mass from pruned syntactic variants visible. A contrastive inference mode then forces the model to compute counterfactual token probabilities under both ontology contexts, producing a contrastive split ratio that quantifies how strongly a token's probability depends on demographic context. The authors validate the tool through five case studies comparing GPT-2 XL against constitutionally aligned Apertus models, revealing gendered occupational bias, toxic geographic stereotyping, persona-driven leadership stereotypes, medical safety guardrail effectiveness, and syntactic rigidity. A preliminary user study (n=7) found the aggregated comparative interface reduced cognitive load compared to inspecting individual beam search trees.
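The classification-aware merging step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` fields, the function name `build_merged_tree`, and the example categories are all assumptions made here. It shows only why keying nodes on (token_text, semantic_category) keeps two senses of the same word in separate branches.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One merged tree node; the field names are illustrative assumptions."""
    count: int = 0          # number of generations passing through this node
    prob_sum: float = 0.0   # summed token probabilities across those generations
    children: dict = field(default_factory=dict)


def build_merged_tree(paths):
    """Merge token paths into a prefix tree keyed by (token_text, semantic_category)."""
    root = Node()
    for path in paths:
        node = root
        for token_text, semantic_category, prob in path:
            key = (token_text, semantic_category)  # composite merge key
            child = node.children.setdefault(key, Node())
            child.count += 1
            child.prob_sum += prob
            node = child
    return root


# Hypothetical generations: "bank" appears in two different senses.
paths = [
    [("the", "function_word", 0.8), ("bank", "finance", 0.5)],
    [("the", "function_word", 0.8), ("bank", "geography", 0.2)],
    [("the", "function_word", 0.8), ("bank", "finance", 0.4)],
]
tree = build_merged_tree(paths)
the_node = tree.children[("the", "function_word")]
for (token, category), child in the_node.children.items():
    print(token, category, child.count, round(child.prob_sum, 3))
```

Merging on token text alone would collapse both "bank" nodes into one; with the composite key they stay apart, which is the polyseme-conflation problem the summary refers to.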

Core claim

The central discovery is that biases in language models often reside in low-probability generation branches and in counterfactual probability differences that are structurally invisible when inspecting a single output or computing aggregate metrics. By aggregating hundreds of stochastic generations into syntax-aligned trees with classification-aware merging, and by computing contrastive token probabilities across demographic contexts, TreeTracer makes these hidden representational harms—such as a model assigning 80.5% counterfactual preference for the token 'religious' under Arab geographic appellations versus Western ones, or shifting CEO descriptions from 'visionary leader' to 'leads with'

What carries the argument

The pipeline's key machinery is: (1) ontology-driven prompt perturbation that swaps demographic terms as controlled variables, (2) constituency parsing and structural skeleton clustering that groups hundreds of diverse generations by syntactic template, (3) classification-aware node merging using a (token_text, semantic_category) composite key to prevent polyseme conflation, (4) a dual-probability Sankey encoding where P_global (node height) captures all generation mass while P_selected (link width) captures only retained-structure mass, and (5) contrastive inference that reconstructs generation paths and queries the model for raw logit probabilities under both ontology contexts, yielding a

If this is right

Bias auditing protocols for deployed LLMs could adopt aggregated tree comparison as a standard diagnostic, complementing existing static-template benchmarks like CrowS-Pairs or StereoSet.
The contrastive split ratio provides a model-agnostic, quantitative bias metric that could be integrated into automated CI/CD pipelines for model alignment evaluation.
The finding that alignment (Apertus) suppresses but does not eliminate counterfactual pronoun bias suggests that constitutional tuning reshapes surface outputs without fully remapping internal probability distributions.
The syntactic skeleton clustering approach could generalize to other stochastic system auditing problems beyond text generation, such as comparing reinforcement learning policy rollouts under perturbed initial conditions.
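To make the contrastive split ratio concrete, here is a small sketch. The exact form of Eqs. 5–6 is not reproduced on this page, so the normalized ratio p_primary / (p_primary + p_secondary) used below is an assumption (it is consistent with the complementary percentages quoted in the referee report, but that is not confirmation). The vocabulary and logits are toy values, and `softmax` and `split_ratio` are names chosen for this illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_ratio(p_primary, p_secondary):
    """ASSUMED form: share of a token's probability attributable to the primary context."""
    denom = p_primary + p_secondary
    return 0.5 if denom == 0 else p_primary / denom


vocab = ["she", "he", "they"]        # toy vocabulary
logits_primary = [2.0, 0.5, 0.1]     # hypothetical next-token logits, context A
logits_secondary = [0.5, 2.0, 0.1]   # hypothetical next-token logits, context B

probs_a = softmax(logits_primary)
probs_b = softmax(logits_secondary)
ratios = {tok: split_ratio(pa, pb) for tok, pa, pb in zip(vocab, probs_a, probs_b)}
for tok, r in ratios.items():
    print(f"{tok}: {r:.3f}")
```

A ratio near 0.5 means the token is about equally likely in both contexts; values near 0 or 1 indicate strong dependence on the swapped term. Because the function needs only next-token probabilities, it could in principle run against any model that exposes logits.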

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-probability encoding (P_global vs P_selected) implicitly reveals that structural pruning in clustering can mask significant probability mass—a finding that could extend to any domain where dimensionality reduction hides minority-case behavior.
If the helper LLM used for semantic classification systematically misclassifies tokens along the same demographic axes being audited, the aggregated trees could amplify rather than reveal bias; a formal sensitivity analysis over helper model choice would clarify this risk.
The contrastive inference method could be extended beyond binary ontology comparisons to multi-way factorial designs, enabling intersectional bias analysis (e.g., gender × race × class) through higher-dimensional contrastive tensors.
The observation that GPT-2 XL exhibits complete syntactic rigidity on semantically anomalous prompts suggests a potential diagnostic for measuring alignment quality: the degree to which a model breaks learned syntactic templates when semantics demand it.

Load-bearing premise

The pipeline delegates semantic classification of generated tokens to a secondary instruction-tuned LLM, and if that helper model carries its own biases or misclassifies polysemous words, the aggregated trees could misrepresent the target model's actual behavior.

What would settle it

If the helper LLM's classifications were systematically biased along the same demographic axes being audited, the aggregated trees would reflect the helper's biases rather than the target model's, making the tool's bias detections artifacts of the classification step rather than genuine discoveries about the target model.

Figures

Figures reproduced from arXiv: 2606.19344 by Matteo Pelossi, Mennatallah El-Assady, Rita Sevastjanova, Thilo Spinner.

**Figure 1.** We introduce TREETRACER, a visual analytics approach for bias detection through aggregated comparison. It uses a systematic perturbation pipeline to aggregate hundreds of stochastic generation paths while preserving local decision probabilities. We then visualize the resulting probability trees with a novel Sankey adaptation, combine side-by-side ontology-driven views with a custom contrastive inference mo…

**Figure 2.** We employ a perturbation pipeline to produce augmented variants of the initial prompt (1). Each variant is submitted to an LLM (2). We then …

**Figure 3.** The TREETRACER utilizes a Sankey-based visualization to display aggregated text generations in a single view. The visualization encodes two distinct probability values in the node and edge heights. Prompt used: “[Placeholder] decided to work as”. 5 TREETRACER INTERFACE: The goal of the TREETRACER interface is to enable exploration of stochastic generations. The interface consists of configuration steps for …

**Figure 4.** The users can filter branches by hovering over the nodes. A …

**Figure 5.** Contrastive inference visualization of the merged generation trees.

**Figure 6.** Prompt “After receiving their degree, [placeholder] wants to become a” for the Female Name ontology. A large portion of the probability mass is allocated to caregiving and assisting roles. Case Study 1: Gender Bias. Prompt: “After receiving their degree, [placeholder] wants to become a”. Primary Ontology: Female Names, e.g., Lisa, Barbara, Sandra, Mary, Betty. Secondary Ontology: Male Names, e.g., Robert, Char…

**Figure 7.** The Arab Geographic Appellations frequently flows into semantic …

**Figure 8.** When assuming a female persona, the generated vocabulary undergoes a semantic shift toward emotional labor and caregiving. The tree …

**Figure 9.** The GPT2-XL model generates structurally coherent instructions …

**Figure 10.** The Apertus-70B-Instruct model avoids affirming the medical …

**Figure 11.** The GPT-2 XL model treats the food items as physical geographic …

**Figure 13.** The state-of-the-art method for exploring LLM-generated outputs …

Original abstract

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Visual analytics tool for LLM bias auditing with a genuinely novel contrastive inference feature, but the central visual encoding has a statistical problem and validation is thin.

The main thing to know: TreeTracer combines stochastic generation aggregation, classification-aware tree merging, and contrastive inference into a working bias-auditing system. The contrastive inference feature (Eqs. 5–6) is the real contribution: it queries raw logits to compute counterfactual token probabilities across demographic contexts, recovering probability mass that sampling-based methods discard. That part is clean and well-motivated. The case studies are illustrative and the system design is coherent. The authors also deserve credit for honestly acknowledging their limitations in §8 rather than overselling. The perturbation pipeline, constituency-parse-based clustering, and LLM-assisted semantic classification are reasonable engineering choices that fit together well. The system prompts in the appendix show real attention to making the helper-LLM classification reproducible. So the tool works, the idea is good, and the contrastive inference piece is genuinely new.

Now the problems. The stress-test concern about P_global (Eq. 4) lands. That quantity sums token probabilities over all generated sentences but normalizes by the count of retained sentences, so it can exceed 1.0 and its magnitude is driven by the clustering thresholds (top_n_structures, min_occurrences), not by model behavior. The paper claims this “accurately exposes hidden probability mass,” but the visual difference between node height and link width is really an artifact of how aggressively you pruned. This is the central visual encoding of the Sankey diagram, the paper's primary interface, and its statistical meaning is unclear. The contrastive inference metrics (Eqs. 5–6) don't have this problem; they're the rigorous part.

Beyond that: the helper LLM is never identified, its classification accuracy is never validated against a human gold standard, the user study has 7 participants with no controlled comparison (ordering effects from using generAItor first), no confidence intervals or significance tests on any bias measurement, and no code or data shipped. These are real gaps but they're the kind that can be addressed in revision.

The reader's take is fair: conditional, not reject. The core approach is sound, the contrastive inference is a genuine methodological contribution, and the limitations are honestly stated. But the P_global issue needs to be fixed or the visual encoding reframed, and the validation needs hardening. This is a VIS/TVCG-style systems paper that deserves a serious referee who can push on the statistical encoding and the evaluation rigor.
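The P_global concern is easy to see with toy numbers. The sketch below uses the definition quoted in the referee report (sum over all generations, divided by the number of retained generations). The companion `p_selected` function is an assumption made here for comparison, and the probabilities and counts are invented for illustration.

```python
def p_global(all_probs, n_selected):
    """Eq. (4) as quoted in the report: sum over ALL generations, divided by the RETAINED count."""
    return sum(all_probs) / n_selected


def p_selected(selected_probs):
    """ASSUMED companion quantity: mean token probability over retained generations only."""
    return sum(selected_probs) / len(selected_probs)


# Hypothetical node: the token occurs in 10 generations with probability 0.6 each.
all_probs = [0.6] * 10

strict = p_global(all_probs, n_selected=4)   # aggressive pruning keeps 4 generations
loose = p_global(all_probs, n_selected=8)    # looser pruning keeps 8 generations
link = p_selected(all_probs[:4])

print(f"P_global (4 kept): {strict:.2f}")
print(f"P_global (8 kept): {loose:.2f}")
print(f"P_selected:        {link:.2f}")
```

With these inputs the same node reads 1.50 under strict pruning and 0.75 under loose pruning, while the underlying token probabilities never change. That is the sense in which node height tracks the pruning thresholds rather than the model.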

Referee Report

3 major / 5 minor

Summary. The paper introduces TreeTracer, a visual analytics tool for detecting bias in Large Language Models (LLMs) through aggregated comparison of stochastic text generations. The system operates by perturbing input prompts with semantic ontologies, aggregating hundreds of stochastic generations into syntax-aligned hierarchical structures, and performing classification-aware node merging with an auxiliary LLM. The resulting structures are visualized using a custom Sankey diagram. The workspace enables side-by-side comparison of generation trees and incorporates a contrastive inference mode to compute and display counterfactual token probabilities across demographic contexts. The approach is validated through five case studies comparing GPT-2 XL and Apertus models, and a preliminary user study with seven participants.

Significance. The paper presents a well-motivated visual analytics pipeline that addresses a genuine gap in LLM bias auditing: the difficulty of inspecting probability mass across hundreds of stochastic generations. The contrastive inference mechanism (Eqs. 5-6) is a clear strength, as it directly queries the target model's raw logits to compute counterfactual token probabilities, providing a rigorous, sampling-independent metric for bias comparison. The inclusion of full ontologies (Appendix C) and system prompts (Appendix A) supports reproducibility. The case studies effectively demonstrate the tool's capability to surface representational harms across diverse bias dimensions.

major comments (3)

§4.6, Eq. (4): The definition of P_global = (Σ_{i∈N_all} p_i) / |N_sel| is not a probability and can exceed 1.0. The paper claims this quantity 'accurately exposes the hidden probability mass of tokens that occur frequently in the total generation pool but are routinely pruned by the structural clustering algorithm.' However, the magnitude of P_global is directly driven by the ratio |N_all|/|N_sel|, which is itself a function of the user-configurable clustering parameters top_n_structures and min_occurrences (§4.4). Changing these thresholds arbitrarily inflates or deflates the 'hidden mass' visualized in node heights, making the side-by-side visual comparison between ontologies dependent on an analytical parameter rather than reflecting the model's actual behavior. The paper needs to either (a) justify why this specific normalization is the most appropriate analytical choice and how it should …
§6.1, Case Study 1: The contrastive split ratios are reported as point estimates without any confidence intervals or significance testing (e.g., R_split = 0.875 for 'she'). Given that the pipeline uses temperature sampling with only 15 substitutes and 15 samples per substitute, the variance in these estimates could be substantial. Without uncertainty quantification, it is difficult to assess whether the observed differences (e.g., 60.2% vs. 39.8% for 'teacher') reflect stable model behavior or sampling noise. The paper should either provide bootstrap confidence intervals for R_split or demonstrate stability across multiple random seeds.
§4.5: The semantic classification step delegates categorization of generated tokens to a secondary instruction-tuned LLM. The paper acknowledges this risk but does not provide any quantitative evaluation of classification accuracy or inter-rater agreement. Since the entire analytical validity of the visualized trees depends on the helper LLM's classifications being accurate and not systematically distorting the bias patterns the tool is designed to detect, the paper should include a human-validated accuracy assessment of the helper LLM's classifications on a sample of the case study data.
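The bootstrap intervals requested in the §6.1 comment could look roughly like the following. This is a sketch under stated assumptions: the split ratio is taken to be the normalized form p_primary / (p_primary + p_secondary), per-ontology probabilities are aggregated by a simple mean over substitutes, and the 15 values per ontology are invented. None of this is confirmed by the paper.

```python
import random


def split_ratio(p_primary, p_secondary):
    """ASSUMED normalized form of the contrastive split ratio."""
    denom = p_primary + p_secondary
    return 0.5 if denom == 0 else p_primary / denom


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(primary, secondary, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap that resamples substitute words with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        a = [rng.choice(primary) for _ in primary]
        b = [rng.choice(secondary) for _ in secondary]
        stats.append(split_ratio(mean(a), mean(b)))
    stats.sort()
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-substitute probabilities of one token (15 substitutes per ontology).
primary = [0.30, 0.28, 0.35, 0.22, 0.31, 0.27, 0.40, 0.25,
           0.33, 0.29, 0.26, 0.38, 0.24, 0.32, 0.30]
secondary = [0.10, 0.12, 0.08, 0.15, 0.09, 0.11, 0.07, 0.14,
             0.10, 0.13, 0.09, 0.06, 0.12, 0.11, 0.10]

point = split_ratio(mean(primary), mean(secondary))
low, high = bootstrap_ci(primary, secondary)
print(f"R_split = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Fixing the seed makes the interval reproducible. Repeating the run with several seeds for the generation phase, as the report also suggests, would be a separate outer loop around the whole pipeline.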

minor comments (5)

§5.2: The paper states that 'Standard Sankey diagrams enforce strict flow conservation' and that TreeTracer 'intentionally breaks strict Sankey diagram rules.' It would help to clarify whether the visual encoding has been evaluated for potential misinterpretation by users, given that the decoupled node height and link width may violate user expectations of Sankey diagrams.
§7: The user study has only 7 participants, all from computer science backgrounds. The paper should acknowledge this as a limitation and note that the SUS score and qualitative feedback may not generalize to other user populations (e.g., domain experts in social sciences or ethics).
Figures 3, 6, 7, 8, 9, 10, 11: The text in several figures is small and difficult to read. Consider enlarging key labels or providing zoomed-in insets for critical branches.
§4.2: The geometric mean is used to merge subword token probabilities. It would be helpful to briefly justify why the geometric mean is preferred over other aggregation methods (e.g., arithmetic mean) in this context.
Appendix A: The system prompts are provided, but the identity of the helper LLM used for classification and ontology generation is not specified. This should be stated for reproducibility.
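On the §4.2 comment about subword merging, a short comparison shows what the geometric mean buys. The function names and probabilities below are illustrative assumptions. The contrast is with the plain product of subword probabilities, which shrinks with every additional subword.

```python
import math


def word_prob_geometric(subword_probs):
    """Geometric mean of subword probabilities (all values must be > 0)."""
    log_sum = sum(math.log(p) for p in subword_probs)
    return math.exp(log_sum / len(subword_probs))


def word_prob_product(subword_probs):
    """Plain joint probability: penalizes words split into more subwords."""
    return math.prod(subword_probs)


short_word = [0.5]             # hypothetical word tokenized as one piece
long_word = [0.5, 0.5, 0.5]    # hypothetical word tokenized as three pieces

print(word_prob_geometric(short_word), word_prob_product(short_word))
print(word_prob_geometric(long_word), word_prob_product(long_word))
```

Both words score 0.5 under the geometric mean, while the product drops the three-piece word to 0.125. The arithmetic mean would also avoid the length penalty, which is why the report asks the authors to justify the choice between the two.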

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for a careful and constructive report. All three major comments identify legitimate gaps that we will address in revision. Specifically: (1) we will revise the P_global definition, relabel it as a scaled frequency metric rather than a probability, and add sensitivity analysis across clustering parameters; (2) we will add bootstrap confidence intervals for R_split and seed-stability checks; (3) we will add a human-validated accuracy assessment of the helper LLM's semantic classifications. No standing objections remain.

Referee: §4.6, Eq. (4): P_global is not a probability and can exceed 1.0; its magnitude depends on the ratio |N_all|/|N_sel|, which is driven by user-configurable clustering parameters, making side-by-side visual comparison dependent on analytical parameters rather than model behavior.

Authors: The referee is correct that P_global as defined in Eq. (4) is not a probability in the standard sense and can exceed 1.0. We will revise the manuscript accordingly. Specifically, we will: (a) relabel P_global as a 'scaled frequency mass' rather than a 'probability,' and add an explicit note that it is not bounded by [0,1]; (b) add a paragraph in §4.6 explaining the design rationale, namely that the normalization by |N_sel| rather than |N_all| is a deliberate visual design choice to ensure node heights are always at least as large as incoming link widths, preserving the visual invariant that hidden mass (from pruned structures) remains visible above the selected beam; (c) add a sensitivity analysis in the revised manuscript showing how P_global changes as top_n_structures and min_occurrences vary, demonstrating that while absolute magnitudes shift, the relative ordering of tokens by hidden mass remains stable across reasonable parameter settings; and (d) add a cautionary note in §5.2 advising analysts to use identical clustering parameters when performing side-by-side comparisons, and to rely on the contrastive inference metric (Eqs. 5–6), which is sampling-independent, for rigorous quantitative bias claims. We agree that the current language ('accurately exposes the hidden probability mass') overstates the formal properties of this quantity and will soften it to reflect its role as a visual heuristic for surfacing pruned probability mass.

revision: yes

Referee: §6.1, Case Study 1: Contrastive split ratios are reported as point estimates without confidence intervals or significance testing. With only 15 substitutes and 15 samples per substitute, variance could be substantial.

Authors: This is a fair concern. The contrastive inference metric (Eqs. 5–6) is computed from raw model logits rather than from sampled generations, so the R_split values themselves are deterministic given a fixed set of substitute words and a fixed reconstructed path. However, the substitute words are sampled from larger ontologies, and the reconstructed paths depend on stochastic generations, so there is indeed variance in the pipeline inputs that propagates to R_split. We will address this in revision by: (a) computing bootstrap 95% confidence intervals for R_split by resampling over substitute words (the 15 substitutes per ontology serve as the resampling unit) and reporting these intervals for the key case study results; (b) running the full Case Study 1 pipeline with three different random seeds for the generation phase and reporting the resulting R_split values to demonstrate stability; and (c) adding an uncertainty column to the contrastive split ratio table in the revised case study. We note that the contrastive inference step itself (querying raw logits) is not subject to sampling noise; it is the upstream generation and path reconstruction that introduce variability, and the bootstrap and seed analyses will quantify this.

revision: yes

Referee: §4.5: The semantic classification step delegates categorization to a helper LLM but provides no quantitative evaluation of classification accuracy or inter-rater agreement. The analytical validity of the visualized trees depends on the helper's classifications being accurate.

Authors: The referee is right that the analytical validity of the trees depends on the helper LLM's classification quality, and we should provide quantitative evidence. We will add a human-validated accuracy assessment in the revised manuscript. Specifically, we will: (a) sample approximately 200 classified tokens (stratified across case studies and semantic categories) from the case study data; (b) have two human annotators independently assign semantic categories to these tokens using the same category hierarchy and the same sentence context provided to the helper LLM; (c) report inter-annotator agreement (Cohen's kappa) between the two human raters, and agreement between each human rater and the helper LLM; and (d) report overall classification accuracy, precision, and recall per category, with attention to whether any systematic misclassification patterns could distort the bias patterns the tool is designed to detect. We will include this as a new subsection (e.g., §4.5.1 'Classification Validation') and discuss the results in the limitations section. We acknowledge that this is a necessary addition given that the entire tree topology depends on the (token_text, semantic_category) composite key.

revision: yes
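For the agreement statistic named in the last response, Cohen's kappa can be computed directly from two label sequences. The sketch below is a plain-Python version with invented labels; the category names are hypothetical and do not come from the paper's hierarchy.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for four tokens.
human = ["occupation", "occupation", "emotion", "emotion"]
helper = ["occupation", "occupation", "emotion", "occupation"]

print(cohens_kappa(human, helper))
```

Here the raters agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving kappa = 0.5. Raw accuracy alone would hide how much of that agreement chance would produce anyway.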

Circularity Check

1 step flagged

No significant circularity; contrastive inference is independently grounded, but baseline tool is self-cited

specific steps

self citation load bearing [§7 (User Study), Appendix D]
"The study sessions began with... a baseline comparison using the generAItor tool [32]... We first introduced users to the generAItor tool to demonstrate standard beam search decoding."

The generAItor tool used as the baseline in the user study is authored by co-authors of this paper (Spinner, Sevastjanova, El-Assady). The comparison is therefore not against an independent baseline. However, this is a minor methodological dependency rather than a formal circularity: the paper's core computational contribution (contrastive inference, Eqs. 5-6) queries the target model's raw logits directly and does not depend on generAItor. The self-citation is not load-bearing for the mathematical claims.

full rationale

The paper's central derivation chain is largely self-contained. The contrastive inference computation (Eqs. 5-6) queries the target model's raw logits directly, independent of the sampling or clustering pipeline. The P_global metric (Eq. 4) is a design choice with unclear statistical properties (it can exceed 1.0), but it is not circular: it is defined in terms of model outputs, not in terms of the bias it claims to detect. The semantic classification step delegates to a helper LLM, which introduces a dependency risk but not circularity. The only self-citation concern is the use of generAItor as the user study baseline, which is a minor methodological dependency rather than a formal circular reduction.

Axiom & Free-Parameter Ledger

6 free parameters · 4 axioms · 2 invented entities

The axiom ledger reveals that the paper's analytical validity rests on several domain assumptions about constituency parsing, helper LLM accuracy, and a non-standard probability normalization. The free parameters (temperature, sample counts, clustering thresholds) are all author-chosen and directly affect which biases are surfaced. The helper LLM identity is a critical unstated parameter. The invented entities (R_split, decoupled encoding) are the paper's core contributions but only R_split has independent falsifiability.

free parameters (6)

temperature = 0.8
Sampling temperature used in all case studies (§6.1). Chosen by the authors; affects generation diversity and clustering outcomes.
num_substitutes = 15
Number of ontology substitute words per prompt (§6.1). Controls the volume of generated paths.
samples_per_substitute = 15
Number of stochastic generations per substitute (§6.1). Controls statistical coverage.
top_n_structures = 6
Number of syntactic clusters retained for visualization (§6.1). Directly affects which paths are visible vs. hidden.
min_occurrences = 1
Minimum cluster size threshold (§6.1). Affects which structures are retained.
helper LLM identity = unspecified
The auxiliary LLM used for semantic classification and ontology generation is never named (§4.5). Its biases directly affect classification quality.
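The effect of the two clustering thresholds listed above can be sketched as follows. The selection rule here (rank skeletons by frequency, drop those below min_occurrences, keep the top top_n_structures) is an assumption about how the parameters interact; the function name and the skeleton strings are hypothetical. In the paper the skeletons are derived from constituency parses, whereas this sketch takes them as given.

```python
from collections import Counter


def select_structures(skeletons, top_n_structures=6, min_occurrences=1):
    """Keep the most frequent syntactic skeletons that meet the size threshold."""
    counts = Counter(skeletons)
    eligible = [(s, c) for s, c in counts.most_common() if c >= min_occurrences]
    kept = eligible[:top_n_structures]
    retained_share = sum(c for _, c in kept) / len(skeletons)
    return [s for s, _ in kept], retained_share


# Hypothetical skeletons, one per generation.
skeletons = ["DT NN"] * 5 + ["DT JJ NN"] * 3 + ["NN"] + ["DT NN PP"]

default_kept, default_share = select_structures(skeletons)
strict_kept, strict_share = select_structures(
    skeletons, top_n_structures=2, min_occurrences=2
)
print(default_kept, default_share)
print(strict_kept, strict_share)
```

With the defaults from the ledger every skeleton in this toy pool survives; tightening either threshold drops the two rare skeletons and 20% of the generations with them, which is the pruned mass the dual-probability encoding tries to keep visible.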

axioms (4)

domain assumption Constituency parse trees provide a valid unified representation for grouping stochastic text generations (§4.3).
The pipeline depends on Stanza constituency parsing being meaningful for aggregation. Alternative representations (dependency trees, semantic role labels) are not considered.
domain assumption The helper LLM's semantic classifications are sufficiently accurate for bias analysis (§4.5).
No validation of classification accuracy is provided. The RAG override system mitigates but does not eliminate this risk.
domain assumption Geometric mean of subword probabilities is a fair reconstruction of whole-word probability (§4.2).
Used to merge subword tokens. The paper states this avoids penalizing longer words but does not validate this choice against alternatives.
ad hoc to paper Normalizing P_global by |N_sel| rather than |N_all| accurately exposes hidden probability mass (§4.6, Eq. 4).
This normalization makes P_global >= P_selected by construction, which is used to justify the visual encoding. It is not a standard probability measure and its statistical properties are not analyzed.

invented entities (2)

Contrastive Split Ratio (R_split) independent evidence
purpose: Quantifies the counterfactual probability ratio of a token across two ontology contexts (Eq. 6).
The ratio is computed from raw model logits (Eq. 5), not from fitted parameters. It is falsifiable: one can query the model directly and verify the probabilities. However, no statistical validation of its discriminative power is provided.
P_global / P_selected decoupled probability encoding no independent evidence
purpose: Visual encoding that separates selected structural probability from total probability mass in the Sankey diagram (§4.6, §5.2).
This is a visualization design choice, not a measurable entity. Its effectiveness is only validated through the user study (7 participants, no controlled comparison).

pith-pipeline@v1.1.0-glm · 23764 in / 3484 out tokens · 282640 ms · 2026-07-04T17:13:58.505022+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

[2]

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProc. of the 2021 ACM Conf. on Fairness, Accountability, and Transparency, FAccT ’21, pp. 610–623. Association for Computing Machinery, New York, NY , USA, 2021. doi:10.1145/3442188.34459222

work page doi:10.1145/3442188.34459222 2021
[3]

Bengio, P

Y . Bengio, P. Simard, and P. Frasconi. Learning long-term dependen- cies with gradient descent is difficult.IEEE Trans. on Neural Networks, 5(2):157–166, 1994. doi:10.1109/72.2791812

work page doi:10.1109/72.2791812 1994
[4]

Boggust, B

A. Boggust, B. Carter, and A. Satyanarayan. Embedding comparator: Visualizing differences in global structure and local neighborhoods via small multiples. In27th Int. Conf. on Intelligent User Interfaces, pp. 746–766, 2022. 2

work page 2022
[5]

Brooke et al

J. Brooke et al. Sus-a quick and dirty usability scale.Usability Evaluation in Industry, 189(194):4–7, 1996. 9

work page 1996
[6]

Cantini, A

R. Cantini, A. Orsino, M. Ruggiero, and D. Talia. Benchmarking adversar- ial robustness to bias elicitation in large language models: Scalable auto- mated assessment with LLM-as-a-judge.Machine Learning, 114(11):249,

work page
[7]

Cheng, V

F. Cheng, V . Zouhar, R. S. M. Chan, D. Fürst, H. Strobelt, and M. El- Assady. Understanding large language model behaviors through interactive counterfactual generation and analysis.IEEE Trans. on Visualization and Computer Graphics, 2025. 2

work page 2025
[8]

J. F. DeRose, J. Wang, and M. Berger. Attention flows: Analyzing and comparing attention mechanisms in language models.IEEE Trans. Vis. Comput. Graph., 27(2):1160–1170, 2021. doi: 10.1109/TVCG.2020.3028976 2

work page doi:10.1109/tvcg.2020.3028976 2021
[10]

Feder, K

A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood- Doughty et al. Causal inference in natural language processing: Estima- tion, prediction, interpretation and beyond.Trans. of the Association for Computational Linguistics, 10:1138–1158, 2022. doi: 10.1162/tacl_a_00511 1

[11] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt et al. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, Sept. 2024. doi:10.1162/coli_a_00524

[12] M. Gleicher. Considerations for visualizing comparison. IEEE Trans. on Visualization and Computer Graphics, 24:413–423, 2018.

[13] C. Gutwin, A. Mairena, and V. Bandi. Showing flow: Comparing usability of chord and sankey diagrams. In Proc. of the 2023 CHI Conf. on Human Factors in Computing Systems, pp. 1–10, 2023.

[14] A. Hernández-Cano, A. Hägele, A. Hao Huang, A. Romanou, A.-J. Solergibert, B. Pasztor et al. Apertus: Democratizing open and compliant LLMs for global language environments. arXiv e-prints, pp. arXiv–2509,

[15] A. Holtzman, P. West, V. Shwartz, Y. Choi, and L. Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. arXiv preprint arXiv:2104.08315, 2022.

[16] S. Husse and A. Spitz. Mind your bias: A critical review of bias detection methods for contextual language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, eds., Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4212–4234. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, Dec. 2022. doi:10.18653/v1/...

[17] M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif et al. LLM comparator: Visual analytics for side-by-side evaluation of large language models. In Extended Abstracts of the CHI Conf. on Human Factors in Computing Systems, pp. 1–7, 2024.

[18] P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards understanding and mitigating social biases in language models. In Int. Conf. on Machine Learning, pp. 6565–6576. Proc. of the 38th Int. Conf. on Machine Learning, 2021.

[19] S. Liu, Z. Li, T. Li, V. Srikumar, V. Pascucci, and P.-T. Bremer. NLIZE: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models. IEEE Trans. on Visualization and Computer Graphics, 25(1):651–660, 2018.

[20] L. Lucy and D. Bamman. Gender and representation bias in GPT-3 generated stories. In Proc. of the Third Workshop on Narrative Understanding, pp. 48–55. Association for Computational Linguistics, Virtual, June 2021. doi:10.18653/v1/2021.nuse-1.5

[21] M. Nadeem, A. Bethke, and S. Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Language Processing (Volume 1: Long Papers), pp. 5356–5371,

[22] N. Nangia, C. Vania, R. Bhalerao, and S. Bowman. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 1953–1967, 2020.

[23] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides et al. Red teaming language models with language models. In Proc. of the 2022 Conf. on Empirical Methods in Natural Language Processing, pp. 3419–3448, 2022.

[24] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912, 2020.

[25] R. Sevastjanova, E. Cakmak, S. Ravfogel, R. Cotterell, and M. El-Assady. Visual comparison of language model adaptation. IEEE Trans. on Visualization and Computer Graphics, 29(1):1178–1188, 2022. doi:10.1109/TVCG.2022.3209458

[26] R. Sevastjanova, A.-L. Kalouli, C. Beck, H. Hauptmann, and M. El-Assady. LMFingerprints: Visual explanations of language model embedding spaces through layerwise contextualization scores. Computer Graphics Forum, 41(3):295–307, 2022. doi:10.1111/cgf.14541

[27] R. Sevastjanova, S. Vogelbacher, A. Spitz, and M. El-Assady. Visual Comparison of Text Sequences Generated by Large Language Models. In The Ninth Symposium on Visualization in Data Science (VDS), 2023.

[28] C. Shaib, V. M. Suriyakumar, L. Sagun, B. C. Wallace, and M. Ghassemi. Learning the wrong lessons: Syntactic-domain spurious correlations in language models. arXiv preprint arXiv:2509.21155, 2025.

[29] E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. The woman worked as a babysitter: On biases in language generation. In K. Inui, J. Jiang, V. Ng, and X. Wan, eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407–. Association for Computational Linguistics, Hong Kong, China, Nov. 2019. doi:10.18653/v1/D19-1339

[32] E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. Societal biases in language generation: Progress and challenges. In C. Zong, F. Xia, W. Li, and R. Navigli, eds., Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Language Processing (Volume 1: Long Papers), pp. 4275–4293. Associati...

[33] E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams. “I’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proc. of the 2022 Conf. on Empirical Methods in Natural Language Processing, pp. 9180–9211, 2022.

[34] T. Spinner, R. Kehlbeck, R. Sevastjanova, T. Stähle, D. A. Keim, O. Deussen et al. generAItor: Tree-in-the-Loop Text Generation for Language Model Explainability and Adaptation. ACM Transactions on Interactive Intelligent Systems, 2024. doi:10.1145/3652028

[35] T. Spinner, R. Sevastjanova, R. Kehlbeck, T. Stähle, D. A. Keim, O. Deussen et al. Revealing the unwritten: Visual investigation of beam search trees to address language model prompting challenges. In P. Mishra, S. Muresan, and T. Yu, eds., Proc. of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)...

[36] H. Strobelt, B. Hoover, A. Satyanarayan, and S. Gehrmann. LMdiff: A visual diff tool to compare language models. In Proc. of the 2021 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 96–105, 2021.

[37] I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann et al. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Q. Liu and D. Schlangen, eds., Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 107–118. Association for Computationa...

[38] J. Vig. A Multiscale Visualization of Attention in the Transformer Model. In Proc. of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42. Association for Computational Linguistics, Florence, Italy, July 2019. doi:10.18653/v1/P19-3007

[39] M. Wattenberg and F. B. Viégas. The word tree, an interactive visual concordance. IEEE Trans. on Visualization and Computer Graphics, 14(6):1221–1228, 2008. doi:10.1109/TVCG.2008.172

[40] L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor et al. Taxonomy of risks posed by language models. In Proc. of the 2022 ACM Conf. on Fairness, Accountability, and Transparency, FAccT ’22, pp. 214–229. Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3531146.3533088

[42] T. Yousef and S. Janicke. A survey of text alignment visualization. IEEE Trans. on Visualization and Computer Graphics, 27(2):1149–1159, 2020.

[43] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[44] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang et al. PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts. In Proc. of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, pp. 57–68, 2023.

A USED SYSTEM PROMPTS AND TEMPLATES

We use an auxiliary LLM to support the creation of ...

1. Provide a descriptive name (e.g., female_names, occupations, european_cities)
2. Explain what it represents
3. Provide 40 example words belonging to this ontology
4. Assign a relevance score between 0 and 1

Examples:
• If the target word is she: suggest female_names, female_pronouns
• If the target word is doctor: suggest medical_professions, occupations, educated_roles
• If the target word is Paris: suggest european_cities, french_locations, capital_cities
• If the target word is walk: suggest movement_verbs, slow_actions, phy...

[WORD]")). Task: Generate contrasting ontology categories for bias analysis.

Context:
• Sentence: “<prompt>

Unexpected: Category that changes the sentence meaning

Critical Requirements:
• Generate original, context-specific categories (avoid generic names)
• Each word must grammatically fit: “<prompt_with_target_as_[WORD]>”
• Analyze the semantic role of “<target_word>” (e.g., subject, object, profession, descriptor)
• Provide 40 words per category

Output Specifi...

1. Classify words per phrase; the same word may differ across phrases
2. Apply hierarchy: Stereotypes → Gendered Roles → General
3. Maximum of three bonus categories (shared across phrases)
4. bonus_categories may be empty
5. Include only words from words_to_classify
6. Preserve phrase_id and phrase_text exactly
7. Filtering: Skip words that do not fit any category or are irrelevant for bias analysis

B TREE CREATION ALGORITHM

The recursive algorithm iterates through the generated sentences, token by token. The tokens of different sentences are merged using a composite key consisting of (token_text, semantic_category), which h...
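The merging step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `insert`, `build_tree`, and `share` are hypothetical, tokens are assumed to arrive already tagged with a semantic category, and sentence counts stand in for probability mass. The constituency parsing and syntactic clustering stages of the pipeline are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed input format: each token is (token_text, semantic_category);
# the category is None when no ontology category applies.
Token = Tuple[str, Optional[str]]


@dataclass
class Node:
    """One merged token position in the aggregated tree (hypothetical structure)."""
    text: str
    category: Optional[str]
    count: int = 0  # number of sentences passing through this node
    children: Dict[Token, "Node"] = field(default_factory=dict)


def insert(node: Node, tokens: List[Token], i: int = 0) -> None:
    """Recursively merge tokens[i:] into the subtree below `node`."""
    node.count += 1
    if i == len(tokens):
        return
    key = tokens[i]  # composite key: (token_text, semantic_category)
    child = node.children.get(key)
    if child is None:
        child = Node(text=key[0], category=key[1])
        node.children[key] = child
    insert(child, tokens, i + 1)


def build_tree(sentences: List[List[Token]]) -> Node:
    """Merge all sentences into a single tree rooted at a synthetic node."""
    root = Node(text="<root>", category=None)
    for sentence in sentences:
        insert(root, sentence)
    return root


def share(node: Node, root: Node) -> float:
    """Fraction of all sentences that pass through `node`."""
    return node.count / root.count if root.count else 0.0


# Toy data: "bank" appears with two different (assumed) categories.
sentences = [
    [("the", None), ("bank", "finance")],
    [("the", None), ("bank", "geography")],
    [("the", None), ("bank", "finance")],
]
tree = build_tree(sentences)
```

In this toy input, the two senses of "bank" end up in separate child nodes under "the" because their composite keys differ, whereas keying on the token text alone would merge them into one node.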
