Recognition: 2 theorem links
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
Pith reviewed 2026-05-14 20:05 UTC · model grok-4.3
The pith
General-purpose safety benchmarks for language models miss disability harms because those harms are personal, intersectional, and community-defined.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DisaBench supplies a taxonomy of twelve disability harm categories developed with people who have lived experience, a method that pairs benign and adversarial prompts in seven life domains, and a set of 175 prompts whose 525 responses were labeled by four evaluators with disabilities. The annotations establish that harm rates vary by disability type and intensify outside text, that terminology harms are culturally and temporally specific, and that standard safety checks catch only overt problems while missing subtler ones visible only to domain experts. Disability harm therefore cannot be separated from a person's full identity and community, so general-purpose benchmarks miss it by design.
What carries the argument
The taxonomy of twelve disability harm categories co-created with evaluators who have lived experience, which structures the prompt pairs and human annotations to surface context-dependent harms.
If this is right
- Safety pipelines must add participatory review steps to catch harms that standard red-teaming overlooks.
- Evaluation datasets need separate tracks for each disability type rather than a single aggregate score.
- Terminology checks in benchmarks require regular updates to match shifting cultural standards.
- Non-text model outputs will require new testing layers because harms compound there.
- The framework can slot directly into current safety tools without extra infrastructure.
Where Pith is reading between the lines
- Model developers will need to treat community involvement as a recurring requirement rather than a one-time audit.
- The same participatory structure could be adapted to measure harms experienced by other groups whose identities are not captured in broad benchmarks.
- Training data filtering rules may need revision once subtle harms become measurable.
- Real-world deployment of these models should include ongoing feedback loops with affected communities.
Load-bearing premise
The taxonomy and 175 prompts developed with four evaluators who have lived experience are enough to represent the full range of disability harms across cultures, time, and non-text forms.
What would settle it
A side-by-side test in which the same 175 prompts and four lived-experience annotators are run through an existing general safety benchmark and produce the same detection rates for the subtle, context-specific harms that DisaBench identifies.
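That comparison can be operationalized as a per-category detection-rate gap. The sketch below is illustrative only: the category, labels, and rates are invented for this example, not taken from the paper.

```python
# Illustrative comparison of detection rates between lived-experience
# annotators and a generic safety classifier on the same responses.
# All flag values below are hypothetical.

def detection_rate(flags):
    """Fraction of responses flagged as harmful (1 = flagged, 0 = not)."""
    return sum(flags) / len(flags)

# Six responses in one hypothetical "subtle terminology harm" category.
human_flags = [1, 1, 1, 0, 1, 1]       # domain experts catch subtle cases
classifier_flags = [1, 0, 0, 0, 0, 0]  # generic classifier catches overt ones

gap = detection_rate(human_flags) - detection_rate(classifier_flags)
print(f"human: {detection_rate(human_flags):.2f}, "
      f"classifier: {detection_rate(classifier_flags):.2f}, gap: {gap:.2f}")
```

A gap near zero across categories would count against the claim that general-purpose benchmarks systematically miss the subtle harms; a large gap concentrated in the subtle categories would support it.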
Original abstract
General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DisaBench, a participatory framework for evaluating disability-related harms in LLMs. It presents a 12-category taxonomy co-created with people with disabilities and red-teaming experts, a methodology pairing benign and adversarial prompts across seven life domains, and a dataset of 175 prompts yielding 525 human-annotated prompt-response pairs. Annotations by four evaluators with lived experience support three findings: harm rates vary by disability type and compound in non-text modalities; terminology-driven harm is culturally and temporally bound; and standard safety evaluations miss subtle harms recognizable only with domain expertise. The central claim is that disability harm is personal, intersectional, and community-defined, so general-purpose benchmarks systematically fail to capture it. The artifacts are slated for open release.
Significance. If the participatory methodology and reported patterns hold under broader validation, the work supplies concrete, integrable artifacts (taxonomy, prompts, dataset) that could improve detection of nuanced disability harms currently overlooked by existing safety pipelines. The emphasis on lived-experience annotation and the open red-teaming framework constitute a practical contribution to the field.
Major comments (2)
- §3 (Methodology) and abstract: The three findings rest on annotations from only four evaluators with lived experience. No inter-annotator agreement statistics, prompt sampling protocol, or quantitative overlap metrics with existing safety classifiers on the 525 pairs are reported, leaving the claims that harms 'vary sharply by type' and that standard benchmarks 'systematically miss' subtle cases without sufficient reliability or comparative evidence.
- §4 (Findings) and §5 (Discussion): The assertion that disability harm is 'community-defined' and cannot be isolated from personal/intersectional context is supported solely by the n=4 participatory input and 175-prompt set. This sample size is too small to underwrite the general claim that general-purpose benchmarks miss such harms across cultures and modalities; external validation against larger disability communities or direct quantitative comparison is required.
Minor comments (2)
- The abstract states the dataset will be released via Hugging Face but provides no licensing, versioning, or exact schema details; these should be added for reproducibility.
- The notation for the seven life domains and the benign/adversarial prompt pairing could be clarified with an explicit table or diagram in §3.
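On the schema point, a minimal illustrative record layout helps fix ideas. Every field name and value below is an assumption for illustration; the paper's actual release schema is not specified in the text reviewed here.

```python
# Hypothetical record layout for one prompt-response pair.
# Field names and values are illustrative, not the released schema.
record = {
    "prompt_id": "P-017",
    "harm_category": "terminology",    # one of the 12 taxonomy categories
    "life_domain": "employment",       # one of the 7 life domains
    "variant": "adversarial",          # "benign" or "adversarial" (paired)
    "response_id": "P-017-r2",         # one of several responses per prompt
    "annotator_labels": [1, 1, 0, 1],  # four lived-experience evaluators
}

# The reported counts imply three labeled responses per prompt:
print(175 * 3)  # 525 prompt-response pairs
```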
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our participatory evaluation framework. We address the major comments point by point below, providing clarifications on our methodology while committing to targeted revisions that strengthen transparency without altering the core participatory approach.
Point-by-point responses
Referee: §3 (Methodology) and abstract: The three findings rest on annotations from only four evaluators with lived experience. No inter-annotator agreement statistics, prompt sampling protocol, or quantitative overlap metrics with existing safety classifiers on the 525 pairs are reported, leaving the claims that harms 'vary sharply by type' and that standard benchmarks 'systematically miss' subtle cases without sufficient reliability or comparative evidence.
Authors: We agree that inter-annotator agreement statistics should have been reported for full transparency. In the revised manuscript we will add Fleiss' kappa (or equivalent) computed across the four evaluators' annotations on the 525 pairs. The prompt sampling protocol was taxonomy-driven and systematically spanned the 12 harm categories across the seven life domains through iterative co-creation with the participatory group; we will expand §3 to describe this process explicitly, including how prompts were paired as benign/adversarial. We will also add a quantitative overlap analysis comparing our annotations against outputs from standard safety classifiers (e.g., Perspective API, OpenAI moderation) on the same 525 pairs. The choice of four evaluators with lived experience follows established participatory design principles in disability studies, where depth of expertise is prioritized; we will articulate this rationale more clearly while acknowledging the trade-off in scale. revision: partial
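The promised agreement statistic is straightforward to compute. A minimal Fleiss' kappa sketch follows; the rating counts are invented for illustration, not the paper's annotation data.

```python
# Fleiss' kappa for N items, each rated by n raters into k categories.
# counts[i][j] = number of raters who assigned item i to category j.

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])  # raters per item (assumed constant)
    total = n_items * n_raters
    n_cats = len(counts[0])
    # Mean per-item agreement: agreeing rater pairs over all rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_cats)
    )
    if p_e == 1.0:  # degenerate case: every rater used one category
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

# Four raters, two categories (harmful / not harmful), invented counts.
print(fleiss_kappa([[4, 0], [0, 4]]))  # perfect agreement -> 1.0
print(fleiss_kappa([[2, 2], [2, 2]]))  # split decisions -> below zero
```

With only four annotators and 525 pairs this runs instantly; the harder question, as the referee notes, is whether n=4 yields a stable estimate at all.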
Referee: §4 (Findings) and §5 (Discussion): The assertion that disability harm is 'community-defined' and cannot be isolated from personal/intersectional context is supported solely by the n=4 participatory input and 175-prompt set. This sample size is too small to underwrite the general claim that general-purpose benchmarks miss such harms across cultures and modalities; external validation against larger disability communities or direct quantitative comparison is required.
Authors: We accept that the n=4 participatory input and 175-prompt set limit broad generalizability, and we do not claim the results constitute exhaustive proof across all cultures or modalities. The central argument is that disability harm is inherently personal and community-defined, which our participatory process was designed to surface; the findings illustrate specific cases where standard benchmarks fail to detect subtle harms that domain experts recognize. We will revise §5 to more explicitly frame the work as an initial demonstration and to include a stronger call for external validation by larger disability communities. The artifacts (taxonomy, prompts, dataset) are released precisely to enable such follow-on studies. Direct quantitative comparison with existing classifiers will be added as noted in the response to §3. revision: partial
Circularity Check
No significant circularity; the contribution is new artifact creation and empirical annotation.
Full rationale
The paper introduces a new taxonomy of 12 categories, 175 prompts, and 525 annotated pairs created via participatory co-design with four evaluators having lived disability experience. The three reported findings (varying harm rates, cultural bounding of terminology harm, and missed subtle cases) are direct observations from annotating this newly constructed dataset rather than any mathematical derivation, fitted parameter, or self-referential reduction. No equations, predictive models, uniqueness theorems, or self-citations appear in the provided text that would make claims equivalent to inputs by construction. The central assertion that general-purpose benchmarks miss harms follows from the new evaluation framework itself, which is self-contained as an independent artifact.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Participatory co-creation with people who have disabilities produces more valid harm categories than expert-only design.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "taxonomy of twelve disability harm categories co-created with people with disabilities... dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "harm rates vary sharply by disability type... standard safety evaluation catches overt failures while missing the subtle harms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.