Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

Amy Oh; Anne R. Links; Didi Zhou; Faith Kamau; Jen-tse Huang; Mark Dredze; Mary Catherine Beach; Somnath Saha

arxiv: 2605.17228 · v1 · pith:EMJYWM5Tnew · submitted 2026-05-17 · 💻 cs.CL

Jen-tse Huang, Didi Zhou, Faith Kamau, Amy Oh, Anne R. Links, Mark Dredze, Mary Catherine Beach, Somnath Saha

Pith reviewed 2026-05-20 14:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords stigmatizing language · large language models · clinical decision support · AI bias · health disparities · clinical NLP · LLM robustness

The pith

Stigmatizing language in clinical notes causes large language models to recommend less aggressive patient management

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models pick up on the stigmatizing language that doctors sometimes write in patient notes, and whether that language changes the medical decisions the models suggest. Researchers created patient vignettes for four conditions and added sentences that express doubt, assign blame, or malign the patient at different strengths. All nine frontier models they tried showed the same pattern: even one such sentence shifted recommendations toward less aggressive care, and more stigmatizing language produced stronger shifts. Common fixes such as asking the model to reason step by step or to correct its own bias did not remove the effect. A sympathetic reader would care because hospitals are starting to use these models to help with real clinical choices, and this bias could quietly widen differences in how patients are treated.

Core claim

All nine evaluated frontier large language models exhibit substantial bias when processing clinical vignettes that contain stigmatizing language. Clinical decision-making is significantly skewed toward less aggressive patient management, with a clear dose-response relationship in which a single stigmatizing sentence is sufficient to alter model outputs. Standard prompt-based mitigation strategies, including Chain-of-Thought reasoning and model self-debiasing, show limited efficacy because models struggle to identify stigmatizing language explicitly while remaining implicitly influenced by it.
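A dose-response claim of this kind can be checked with a few lines of code. The sketch below is only an illustration: the function name `dose_response`, the score scale, and the numbers in `example_scores` are assumptions made up for this example and are not results from the paper. It reports the shift in mean treatment score relative to the neutral baseline at each dose and whether the means never increase as the dose grows.

```python
from statistics import mean

def dose_response(scores_by_dose):
    """Summarize how mean scores move as the number of SL sentences grows.

    scores_by_dose maps a dose (number of injected SL sentences, 0 = neutral)
    to a list of treatment scores collected at that dose.
    """
    doses = sorted(scores_by_dose)
    means = [mean(scores_by_dose[d]) for d in doses]
    # Shift of each dose level relative to the lowest dose (the baseline).
    deltas = [m - means[0] for m in means]
    # True when every step up in dose leaves the mean equal or lower.
    monotone = all(later <= earlier for earlier, later in zip(means, means[1:]))
    return doses, deltas, monotone

# Hypothetical scores for illustration only (not data from the paper).
example_scores = {0: [80, 82], 1: [75, 77], 4: [70, 70], 7: [66, 64]}
doses, deltas, monotone = dose_response(example_scores)
```

A monotone decline in this summary is a descriptive check, not a significance test; a real analysis would still need the statistical procedures the referee report asks for.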

What carries the argument

Stigmatizing language phenotypes (doubt, blame, and maligning) are injected at varying intensities into otherwise matched clinical vignettes for four medical conditions; this serves as the controlled variable that isolates effects on model-generated clinical management decisions.
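The injection design can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the helper names `make_bank` and `inject`, the placeholder sentences, the bank size of seven sentences per phenotype, and the random placement of sentences are all hypothetical choices made for illustration.

```python
import random

PHENOTYPES = ("doubt", "blame", "maligning")

def make_bank(per_type=7):
    """Build a placeholder sentence bank (hypothetical).

    A real study would use clinician-reviewed sentences instead of these tokens.
    """
    return {p: [f"[{p} sentence {i + 1}]" for i in range(per_type)] for p in PHENOTYPES}

def inject(neutral_sentences, bank, phenotype, dose, seed=0):
    """Return a copy of the vignette with `dose` SL sentences inserted.

    phenotype is one of PHENOTYPES, or "all" to draw from the mixed pool.
    The neutral sentences keep their original order; only the insert
    positions are randomized (an assumption of this sketch).
    """
    rng = random.Random(seed)
    if phenotype == "all":
        pool = [s for p in PHENOTYPES for s in bank[p]]
    else:
        pool = bank[phenotype]
    if dose > len(pool):
        raise ValueError("dose exceeds the number of available sentences")
    chosen = rng.sample(pool, dose)
    out = list(neutral_sentences)
    for sentence in chosen:
        out.insert(rng.randint(0, len(out)), sentence)
    return out
```

Calling `inject(vignette, make_bank(), "doubt", 4)` yields one stigmatized variant; looping over phenotypes and doses yields the full grid of conditions while the neutral text stays fixed.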

If this is right

Models used for clinical decision support will systematically recommend less aggressive care when patient notes contain stigmatizing language.
A single stigmatizing sentence is enough to change model outputs, showing high sensitivity to linguistic framing.
Chain-of-Thought and self-debiasing prompts fail to eliminate the implicit influence of stigmatizing language.
The bias appears consistently across all nine tested frontier models, pointing to a general limitation in current LLM clinical applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hospitals deploying LLMs may need to scan and rewrite incoming notes to remove stigmatizing phrasing before feeding them to models.
The dose-response pattern implies that the prevalence of stigmatizing language in training corpora could determine how strongly future models inherit this bias.
Similar effects could appear when LLMs process any human text that carries subtle negative framing, such as legal or employment records.
Routine linguistic audits of training data and input text may be required to prevent automated reinforcement of existing health disparities.
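As a rough illustration of what scanning incoming notes could look like, the sketch below flags sentences that contain terms from a small lexicon. Everything in it is an assumption made for this example: the `LEXICON` entries are illustrative and not taken from the paper, and `flag_sentences` is a hypothetical helper. A keyword list is also a crude instrument; the paper reports that models struggle to identify stigmatizing language explicitly, so real screening would likely need clinician review.

```python
import re

# Illustrative terms only (assumption); not the paper's sentence bank.
LEXICON = {
    "doubt": ["claims", "insists", "supposedly"],
    "blame": ["non-compliant", "noncompliant", "refuses"],
    "maligning": ["drug-seeking", "difficult patient"],
}

def flag_sentences(note):
    """Return (sentence, categories) pairs for sentences matching the lexicon."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]
    flagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        hits = sorted(
            category
            for category, terms in LEXICON.items()
            if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms)
        )
        if hits:
            flagged.append((sentence, hits))
    return flagged
```

Flagged sentences would then go to a human or a rewriting step; the sketch deliberately stops at detection and does not attempt automatic rewording.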

Load-bearing premise

The artificially constructed vignettes with added stigmatizing language produce the same model behavior as stigmatizing language that occurs naturally in real human-written clinical documentation.

What would settle it

Run the same models on pairs of real hospital clinical notes that differ only in the presence or absence of stigmatizing language and check whether the models still recommend less aggressive management for the stigmatized versions.

Figures

Figures reproduced from arXiv: 2605.17228 by Amy Oh, Anne R. Links, Didi Zhou, Faith Kamau, Jen-tse Huang, Mark Dredze, Mary Catherine Beach, Somnath Saha.

**Figure 1.** The presence of stigmatizing language within clinical notes can bias LLMs to favor less aggressive management.

**Figure 2.** Impact of varying intensities of SL on LLM clinical decision-making and simulated attitudes across four disease scenarios (SCD, Obesity, Cirrhosis, and Fibromyalgia) evaluated on nine frontier models. Across both panels, markers denote the dose of SL injected into the clinical vignette: Neutral baseline (light purple circles), 7 SL sentences (purple squares), 14 SL sentences (dark purple downward triangles…

**Figure 3.** Comparative effect sizes of patient demographics versus SL on model outputs. The lollipop charts illustrate the magnitude of influence each variable exerts on the LLMs' responses.

**Figure 4.** Impact of SL type and amount. The grouped bar chart illustrates the effect of varying doses (1, 4, and 7 sentences) of SL across Doubt (red), Blame (blue), Maligning (yellow), and a Mixed set of all three (All, grey).

**Figure 5.** Impact of SL across clinical scenarios and models. The heatmap displays the delta between the stigmatized and neutral baseline. Darker purple cells indicate more severe disparities. Values in parentheses represent the marginal average.

**Figure 6.** Panels: SCD (DeepSeek), Obesity (Qwen), Cirrhosis (Gemini), Fibromyalgia (LLaMA). x-axis conditions: N, S(1), S(4), S(7), S(14), S(21). y-axes: Treatment Score (0 to 100) and PASS Score (15 to 40).

**Figure 7.** The prompt used for our debiasing (mitigation strategy 2). Notes are appended at the end, replacing “note”.

**Figure 8.** One pair (neutral and stigmatized) of example prompts used for testing models on the SCD scenario. Demographic information and other relevant descriptions (e.g., pronouns) are varied across one trial. Red highlights doubt language, blue represents blame language, and yellow denotes maligning language.

**Figure 9.** One pair (neutral and stigmatized) of example prompts used for testing models on the obesity scenario. Demographic information and other relevant descriptions (e.g., pronouns) are varied across one trial. Red highlights doubt language, blue represents blame language, and yellow denotes maligning language.

**Figure 10.** One pair (neutral and stigmatized) of example prompts used for testing models on the cirrhosis scenario. Demographic information and other relevant descriptions (e.g., pronouns) are varied across one trial. Red highlights doubt language, blue represents blame language, and yellow denotes maligning language.

**Figure 11.** One pair (neutral and stigmatized) of example prompts used for testing models on the fibromyalgia scenario. Demographic information and other relevant descriptions (e.g., pronouns) are varied across one trial. Red highlights doubt language, blue represents blame language, and yellow denotes maligning language.

Original abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows LLMs shift toward less aggressive clinical decisions when stigmatizing language is added to vignettes, with effects from a single sentence, but the design leaves room for length and structure confounds.

The core finding is that frontier LLMs, when given clinical vignettes with added stigmatizing language, tend to recommend less aggressive management across the conditions tested. A single stigmatizing sentence appears enough to move outputs, and the effect scales with intensity. They also check whether chain-of-thought or self-debiasing prompts reduce the shift and report limited success. That combination of multi-model testing, multiple stigmatizing language types, and explicit mitigation checks is the main new piece relative to earlier bias work in other domains.

The paper does a reasonable job documenting consistent directional patterns and the dose-response observation, which at least suggests the result is not fragile to model choice. The clinical framing also makes the stakes concrete without overclaiming deployment impact.

The soft spots sit mainly in the methods details that are missing from the abstract. Without reported statistical tests, vignette validation steps, or explicit controls that match sentence length and structure with neutral additions, it is hard to rule out that some of the shift comes from prompt length or added clinical detail rather than the stigmatizing content itself. The stress-test note on this point is worth checking in the full text; if the paper only compares original versus SL-injected versions without neutral-sentence baselines, the attribution to stigmatizing language specifically stays somewhat insecure.

This work is aimed at people studying fairness and robustness in clinical NLP or LLM deployment in medicine. A reader already tracking bias literature would get value from the scale of the model sweep and the mitigation results, even if the controls need tightening. It is coherent enough on its own terms to deserve a serious referee, though the review will likely focus on experimental controls and statistical reporting. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper evaluates nine frontier LLMs on clinical vignettes for four stigmatized conditions, injecting varying intensities and phenotypes of stigmatizing language (doubt, blame, maligning). It reports that all models exhibit bias toward less aggressive patient management, that a single SL sentence suffices to shift outputs, and that a dose-response pattern appears; standard mitigations (CoT, self-debiasing) show limited success in removing the implicit influence.

Significance. If the attribution to SL holds after proper controls, the work supplies concrete empirical evidence that current LLMs can propagate subtle linguistic biases from human clinical notes into high-stakes decisions. The multi-model, multi-condition design and the finding that even minimal SL alters outputs would be useful for informing robustness requirements in clinical NLP deployments.

major comments (2)

[Methods / Vignette Design] Vignette construction (Methods): the manuscript compares SL-injected vignettes against the original versions but does not describe matched neutral-sentence controls of equivalent length, sentence count, or lexical diversity. Without these controls, shifts in model outputs cannot be isolated to the stigmatizing content rather than incidental prompt-length or structural changes, which directly undermines the central claim that SL itself skews decision-making.
[Results] Results reporting: the abstract and main findings state consistent directional effects and a dose-response relationship, yet no details are supplied on the precise operationalization of 'less aggressive management,' the statistical tests used, effect sizes, or vignette validation procedures. These omissions make it impossible to judge the reliability or magnitude of the reported bias.
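The matched-control concern in the first major comment can be made concrete with a small check on surface features. The sketch below is an illustration under assumptions: the tokenizer, the thresholds `max_len_diff` and `max_ttr_diff`, and the helper names are hypothetical, and type-token ratio is used here as one common proxy for lexical diversity rather than as the measure the paper uses.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_ratio(text):
    """Number of distinct tokens divided by total tokens; 0.0 for empty text."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def is_matched(sl_sentence, neutral_sentence, max_len_diff=2, max_ttr_diff=0.1):
    """Check that a neutral control resembles the SL sentence on surface features.

    The thresholds are arbitrary illustrative defaults, not validated values.
    """
    len_diff = abs(len(tokens(sl_sentence)) - len(tokens(neutral_sentence)))
    ttr_diff = abs(type_token_ratio(sl_sentence) - type_token_ratio(neutral_sentence))
    return len_diff <= max_len_diff and ttr_diff <= max_ttr_diff
```

If outputs shift for the SL sentence but not for a neutral sentence that passes `is_matched`, prompt length and lexical variety become less plausible explanations for the shift.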

minor comments (2)

[Methods] The description of the four medical conditions and the exact SL phenotypes would benefit from an explicit table listing the injected sentences for each intensity level.
[Results / Figures] Figure captions for the dose-response plots should include the number of model runs or seeds and whether error bars represent standard error or confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have prompted us to strengthen the methodological controls and reporting transparency in our work. We address each major comment in detail below and outline the revisions we will make.

Referee: [Methods / Vignette Design] Vignette construction (Methods): the manuscript compares SL-injected vignettes against the original versions but does not describe matched neutral-sentence controls of equivalent length, sentence count, or lexical diversity. Without these controls, shifts in model outputs cannot be isolated to the stigmatizing content rather than incidental prompt-length or structural changes, which directly undermines the central claim that SL itself skews decision-making.

Authors: We appreciate this methodological concern. Our original vignettes function as the primary baseline, with SL phrases inserted into otherwise fixed clinical content and structure across conditions. However, we acknowledge that this design does not fully rule out confounds from added sentence length or lexical shifts. To address the point directly, the revised manuscript will incorporate a new set of matched neutral-sentence controls constructed to preserve equivalent length, sentence count, and lexical diversity (measured via type-token ratio and word frequency norms). We will detail the construction procedure in Methods, present comparative results, and discuss how these controls confirm that the observed shifts are attributable to SL content rather than surface features. revision: yes
Referee: [Results] Results reporting: the abstract and main findings state consistent directional effects and a dose-response relationship, yet no details are supplied on the precise operationalization of 'less aggressive management,' the statistical tests used, effect sizes, or vignette validation procedures. These omissions make it impossible to judge the reliability or magnitude of the reported bias.

Authors: We thank the referee for identifying these reporting gaps. In the revision we will expand the Results and Methods sections with the following: (1) explicit operationalization of 'less aggressive management' via concrete decision metrics (e.g., binary choice of aggressive vs. conservative treatment options and a 1-5 intensity rating scale); (2) full description of statistical procedures, including the specific tests (paired t-tests or McNemar's tests with Bonferroni correction), significance thresholds, and power considerations; (3) effect sizes (Cohen's d for continuous outcomes and odds ratios for categorical decisions); and (4) vignette validation details, including review by two board-certified clinicians for clinical plausibility and face validity. These additions will enable readers to evaluate both the magnitude and robustness of the reported effects. revision: yes
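The statistical procedures named in this response can be sketched with the standard library alone. The code below is illustrative and not the authors' analysis: it uses the exact binomial form of McNemar's test and the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences), both of which are choices made for this sketch, and all function names are hypothetical.

```python
from math import comb
from statistics import mean, stdev

def cohens_d_paired(neutral, stigmatized):
    """Paired effect size: mean of (stigmatized - neutral) over the SD of the differences."""
    diffs = [s - n for n, s in zip(neutral, stigmatized)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; d is undefined")
    return mean(diffs) / spread

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs where only the neutral vignette received the aggressive option.
    c: pairs where only the stigmatized vignette received it.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail probability under the null of equal discordance (p = 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

With nine models and four conditions, for example, `bonferroni_alpha(0.05, 36)` gives the per-comparison threshold if each model-condition pair is tested once; whether that is the right family of tests is a design decision the sketch does not settle.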

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

The paper conducts a controlled empirical comparison of LLM outputs on clinical vignettes with and without injected stigmatizing language across multiple models and conditions. No mathematical derivations, equations, fitted parameters, or first-principles predictions are presented that could reduce to inputs by construction. Results derive directly from model inference runs and statistical analysis of output differences, with no self-citation chains or ansatzes serving as load-bearing premises for the central claims. The evaluation is self-contained against external benchmarks of model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central findings rest on the assumption that the chosen stigmatizing language categories and vignette modifications isolate the intended bias effect without introducing unrelated textual artifacts.

axioms (1)

domain assumption: Stigmatizing language in clinical notes can be reliably categorized into doubt, blame, and maligning phenotypes with controllable intensity.
Invoked to construct the experimental vignettes and interpret the dose-response results.

pith-pipeline@v0.9.0 · 5778 in / 1172 out tokens · 53324 ms · 2026-05-20T14:44:52.279956+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

166 extracted references · 166 canonical work pages · 40 internal anchors

[1]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=

work page
[2]

OpenAI Blog , year=

Improving language understanding by generative pre-training , author=. OpenAI Blog , year=

work page
[3]

OpenAI blog , year=

Language models are unsupervised multitask learners , author=. OpenAI blog , year=

work page
[4]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[5]

OpenAI Blog Nov 30 2022 , url=

Introducing ChatGPT , author=. OpenAI Blog Nov 30 2022 , url=

work page 2022
[6]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

OpenAI Blog Apr 14 2025 , url=

Introducing GPT-4.1 in the API , author=. OpenAI Blog Apr 14 2025 , url=

work page 2025
[9]

OpenAI Blog Feb 27 2025 , url=

Introducing GPT-4.5 , author=. OpenAI Blog Feb 27 2025 , url=

work page 2025
[10]

OpenAI GPT-5 System Card

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

OpenAI Blog Nov 12 2025 , url=

GPT-5.1: A smarter, more conversational ChatGPT , author=. OpenAI Blog Nov 12 2025 , url=

work page 2025
[12]

OpenAI Blog Dec 11 2025 , url=

Introducing GPT-5.2 , author=. OpenAI Blog Dec 11 2025 , url=

work page 2025
[13]

OpenAI Blog Mar 5 2026 , url=

Introducing GPT-5.4 , author=. OpenAI Blog Mar 5 2026 , url=

work page 2026
[14]

OpenAI Blog Sep 15 2025 , url=

Introducing upgrades to Codex , author=. OpenAI Blog Sep 15 2025 , url=

work page 2025
[15]

OpenAI Blog Nov 19 2025 , url=

Building more with GPT‑5.1‑Codex‑Max , author=. OpenAI Blog Nov 19 2025 , url=

work page 2025
[16]

OpenAI Blog Dec 18 2025 , url=

Introducing GPT‑5.2‑Codex , author=. OpenAI Blog Dec 18 2025 , url=

work page 2025
[17]

OpenAI Blog Feb 5 2026 , url=

Introducing GPT‑5.3‑Codex , author=. OpenAI Blog Feb 5 2026 , url=

work page 2026
[18]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

OpenAI Blog Apr 16 2025 , url=

Introducing OpenAI o3 and o4-mini , author=. OpenAI Blog Apr 16 2025 , url=

work page 2025
[21]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv preprint arXiv:2403.05530 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Google Blog Dec 11 2024 , url=

Introducing Gemini 2.0: our new AI model for the agentic era , author=. Google Blog Dec 11 2024 , url=

work page 2024
[24]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Google Blog Nov 18 2025 , url=

A new era of intelligence with Gemini 3 , author=. Google Blog Nov 18 2025 , url=

work page 2025
[26]

Google Blog Dec 17 2025 , url=

Gemini 3 Flash: frontier intelligence built for speed , author=. Google Blog Dec 17 2025 , url=

work page 2025
[27]

Google Blog Feb 12 2026 , url=

Gemini 3 Deep Think: Advancing science, research and engineering , author=. Google Blog Feb 12 2026 , url=

work page 2026
[28]

Google Blog Feb 19 2026 , url=

Gemini 3.1 Pro: A smarter model for your most complex tasks , author=. Google Blog Feb 19 2026 , url=

work page 2026
[29]

Gemma: Open Models Based on Gemini Research and Technology

Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving open language models at a practical size , author=. arXiv preprint arXiv:2408.00118 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Anthropic Blog Mar 24 2023 , url=

Introducing Claude , author=. Anthropic Blog Mar 24 2023 , url=

work page 2023
[33]

Anthropic Blog Jul 11 2023 , url=

Claude 2 , author=. Anthropic Blog Jul 11 2023 , url=

work page 2023
[34]

Anthropic Blog Nov 21 2023 , url=

Introducing Claude 2.1 , author=. Anthropic Blog Nov 21 2023 , url=

work page 2023
[35]

Anthropic Blog Mar 13 2024 , url=

Claude 3 Haiku: our fastest model yet , author=. Anthropic Blog Mar 13 2024 , url=

work page 2024
[36]

Anthropic Blog Jun 21 2024 , url=

Claude 3.5 Sonnet , author=. Anthropic Blog Jun 21 2024 , url=

work page 2024
[37]

Anthropic Blog Feb 24 2025 , url=

Claude 3.7 Sonnet and Claude Code , author=. Anthropic Blog Feb 24 2025 , url=

work page 2025
[38]

Anthropic Blog Mar 22 2025 , url=

Introducing Claude 4 , author=. Anthropic Blog Mar 22 2025 , url=

work page 2025
[39]

Anthropic Blog Aug 5 2025 , url=

Claude Opus 4.1 , author=. Anthropic Blog Aug 5 2025 , url=

work page 2025
[40]

Anthropic Blog Sep 29 2025 , url=

Introducing Claude Sonnet 4.5 , author=. Anthropic Blog Sep 29 2025 , url=

work page 2025
[41]

Anthropic Blog Oct 15 2025 , url=

Introducing Claude Haiku 4.5 , author=. Anthropic Blog Oct 15 2025 , url=

work page 2025
[42]

Anthropic Blog Nov 24 2025 , url=

Introducing Claude Opus 4.5 , author=. Anthropic Blog Nov 24 2025 , url=

work page 2025
[43]

Anthropic Blog Feb 5 2026 , url=

Introducing Claude Opus 4.6 , author=. Anthropic Blog Feb 5 2026 , url=

work page 2026
[44]

Anthropic Blog Feb 17 2026 , url=

Introducing Claude Sonnet 4.6 , author=. Anthropic Blog Feb 17 2026 , url=

work page 2026
[45]

xAI Blogs Nov 3 2023 , url=

Announcing Grok , author=. xAI Blogs Nov 3 2023 , url=

work page 2023
[46]

xAI Blogs Mar 28 2024 , url=

Announcing Grok-1.5 , author=. xAI Blogs Mar 28 2024 , url=

work page 2024
[47]

xAI Blogs Aug 13 2024 , url=

Grok-2 Beta Release , author=. xAI Blogs Aug 13 2024 , url=

work page 2024
[48]

xAI Blogs Feb 19 2025 , url=

Grok 3 Beta — The Age of Reasoning Agents , author=. xAI Blogs Feb 19 2025 , url=

work page 2025
[49]

xAI Blogs Jul 9 2025 , url=

Grok 4 , author=. xAI Blogs Jul 9 2025 , url=

work page 2025
[50]

xAI Blogs Nov 17 2025 , url=

Grok 4.1 , author=. xAI Blogs Nov 17 2025 , url=

work page 2025
[51]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Meta Blog Jul 23 2024 , url=

Introducing Llama 3.1: Our most capable models to date , author=. Meta Blog Jul 23 2024 , url=

work page 2024
[55]

Meta Blog Sep 25 2024 , url=

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models , author=. Meta Blog Sep 25 2024 , url=

work page 2024
[56]

Meta Blog Dec 6 2024 , url=

The Meta Llama 3.3 70B Instruct , author=. Meta Blog Dec 6 2024 , url=

work page 2024
[57]

Meta Blog Apr 5 2025 , url=

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , author=. Meta Blog Apr 5 2025 , url=

work page 2025
[58]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

work page
[59]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[60]

LLaVA Blogs Jan 2024 , url=

