arxiv: 2604.06264 · v1 · submitted 2026-04-07 · 🧬 q-bio.QM · cs.AI

ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

Jueon Park, Wonjune Jang, Chanhwi Kim, Yein Park, Jaewoo Kang

Pith reviewed 2026-05-10 19:09 UTC · model grok-4.3

keywords toxicity prediction · adverse outcome pathway · large language models · mechanistic reasoning · benchmark · chemical toxicity · AOP · drug-target interaction

The pith

Strong toxicity prediction accuracy in LLMs does not guarantee reliable mechanistic reasoning about biological pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxReason, a benchmark grounded in the Adverse Outcome Pathway framework to test whether large language models can reason mechanistically about chemical toxicity from molecular initiating events through to organ-level adverse outcomes. It combines experimental drug-target interaction data with toxicity labels across multiple organs, forcing models to connect predictions to specific biological mechanisms rather than relying on pattern matching alone. Evaluations across diverse LLMs show that high accuracy on toxicity labels frequently coincides with explanations that contradict or ignore known pathways. The work further demonstrates that training methods emphasizing reasoning improve both the quality of mechanistic explanations and the accuracy of the underlying predictions. These results indicate that trustworthy toxicity modeling requires explicit attention to reasoning processes during both evaluation and model development.

Core claim

We introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction.

What carries the argument

ToxReason benchmark, which requires models to link toxicity labels to AOP chains by integrating drug-target interaction evidence from MIE to AO across organs.

If this is right

Strong predictive performance does not necessarily imply reliable reasoning.
Reasoning-aware training improves mechanistic reasoning quality.
Reasoning-aware training improves toxicity prediction performance.
LLMs can generate fluent but biologically unfaithful explanations.
Benchmarks must assess grounding in valid mechanisms rather than accuracy alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be adapted to test mechanistic reasoning in related areas such as drug efficacy or environmental hazard assessment.
Prioritizing explanation faithfulness alongside accuracy may increase regulatory acceptance of LLM-based toxicity predictions.
Combining the AOP structure with graph neural networks for molecular inputs could further strengthen the link between structure and mechanism.
Scaling reasoning-aware training to larger models might reveal whether improved pathway fidelity generalizes across different chemical classes.

Load-bearing premise

The AOP framework combined with integrated drug-target interaction evidence and toxicity labels provides a faithful and sufficient representation of biological mechanisms for evaluating LLM reasoning.

What would settle it

If models achieve high accuracy on ToxReason toxicity labels yet produce explanations that contradict established AOP mechanisms, or if reasoning-aware training produces no measurable gain in either reasoning quality or prediction accuracy on held-out AOP tasks.

Figures

Figures reproduced from arXiv: 2604.06264 by Chanhwi Kim, Jaewoo Kang, Jueon Park, Wonjune Jang, Yein Park.

**Figure 2.** Overview of ToxReason. (A) Organ-specific AOPs are selected from AOP-Wiki, disease-linked chemicals are retrieved from CTD, and ChEMBL-derived MIE data are used to construct training and test sets under similarity constraints (▲ Activation, ▼ Inhibition). (B) Learning framework built on ToxReason, combining supervised finetuning and reinforcement learning for mechanistic toxicity reasoning. Training Data T…

Original abstract

Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ToxReason is a timely benchmark for checking if LLMs trace real toxicity mechanisms or just predict accurately, though its linear AOP setup leaves some biological complexity untested.

The main takeaway is that this paper gives researchers a concrete benchmark to separate fluent toxicity predictions from actual mechanistic reasoning. ToxReason pulls together adverse outcome pathways with drug-target interaction data and organ-level labels, forcing models to connect a molecular initiating event to an adverse outcome rather than stopping at a yes/no call. The authors test several LLMs, show that high prediction accuracy often pairs with weak or unfaithful reasoning chains, and demonstrate that adding reasoning supervision during training lifts both explanation quality and downstream accuracy. That dissociation is the useful finding, and the benchmark construction itself is new relative to standard toxicity datasets that ignore pathways. The work does a clean job of stating the evaluation setup and the training intervention without overclaiming. One limitation stands out: AOPs are curated linear sequences, so they leave out network redundancy, feedback loops, cell-type differences, and multi-pathway toxicities that occur in real systems. If the ground-truth chains in the benchmark do not flag these gaps, models could match the labels while still missing compensatory biology or off-target effects, which would weaken the claim that the benchmark reliably measures mechanistic fidelity. The abstract and available details do not show sensitivity checks against known non-AOP toxicities or coverage statistics for edge cases. This paper is aimed at groups building or evaluating LLMs for chemical safety and drug development who need better tests for explanation faithfulness. It deserves peer review because the core idea fills a documented gap and the results point to a practical training fix, even if the dataset construction and validation details will need tightening before the benchmark can be adopted widely.

Referee Report

1 major / 2 minor

Summary. The paper introduces ToxReason, a benchmark for evaluating LLMs on mechanistic chemical toxicity reasoning using the Adverse Outcome Pathway (AOP) framework. It integrates drug-target interaction evidence with toxicity labels to assess models' ability to infer toxic outcomes and mechanisms from MIE to AO across organs. Evaluations show that strong predictive performance does not imply reliable reasoning, and reasoning-aware training improves both reasoning quality and prediction accuracy.

Significance. This benchmark addresses a critical gap in assessing whether LLMs truly understand biological mechanisms in toxicity prediction rather than relying on superficial patterns. If the results hold, it could guide the development of more trustworthy AI models for toxicology, with implications for drug safety and regulatory science. The emphasis on reasoning-aware training is a positive contribution.

major comments (1)

[§3 (Benchmark Design)] The central claim that the benchmark evaluates 'reliable reasoning' rests on the fidelity of AOPs as mechanistic representations. However, AOPs are linear curated pathways that omit biological complexities such as network redundancy, feedback loops, cell-type specificity, and multi-pathway toxicities. Without validation against known cases where AOPs fail to capture the full mechanism (e.g., idiosyncratic toxicities), the distinction between predictive performance and reasoning quality may not be robustly established.

minor comments (2)

[Abstract] The abstract would be strengthened by including specific quantitative metrics (e.g., accuracy scores, reasoning quality scores) and details on the number of AOPs or test cases in ToxReason to allow readers to assess the scale and impact of the findings.
[Evaluation] Clarify the exact metrics used for 'reasoning quality' and how they are computed, as this is crucial for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and positive review, which recognizes the benchmark's potential contribution to trustworthy AI in toxicology. We address the single major comment below with a proposed partial revision to strengthen the manuscript's discussion of AOP limitations.

Referee: The central claim that the benchmark evaluates 'reliable reasoning' rests on the fidelity of AOPs as mechanistic representations. However, AOPs are linear curated pathways that omit biological complexities such as network redundancy, feedback loops, cell-type specificity, and multi-pathway toxicities. Without validation against known cases where AOPs fail to capture the full mechanism (e.g., idiosyncratic toxicities), the distinction between predictive performance and reasoning quality may not be robustly established.

Authors: We agree that AOPs are inherently simplified linear representations that do not capture the full spectrum of biological complexities, including network redundancy, feedback loops, cell-type specificity, and multi-pathway toxicities. This is a substantive limitation when interpreting the benchmark results as evidence of 'reliable' mechanistic reasoning in an absolute sense. Our benchmark is explicitly grounded in the AOP framework because it provides the most standardized, evidence-linked structure currently available for tracing toxicity from MIE to AO in regulatory and research contexts. The evaluation therefore measures how well models follow these established pathways using integrated experimental evidence, rather than claiming to validate against all possible biological realities. In the revised manuscript, we will add a dedicated paragraph to the Discussion section that explicitly acknowledges these AOP simplifications, cites relevant literature on their shortcomings (particularly for idiosyncratic toxicities), and clarifies that the reported gap between predictive accuracy and reasoning quality holds relative to AOP ground truth. We will also note that exhaustive validation against known AOP failure cases would require additional curated datasets of mechanistic discrepancies, which lies beyond the scope of the current benchmark construction and will be identified as an important direction for future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark and claims are empirically grounded in external sources

The paper constructs ToxReason as a new benchmark by integrating external AOP knowledge bases with experimental DTI evidence and toxicity labels; no equations, fitted parameters, or derivations are presented. The central claims (predictive accuracy does not imply reliable reasoning; reasoning-aware training improves both) are empirical results from LLM evaluations on this externally sourced benchmark. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. The derivation chain is self-contained against external data and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a benchmark rather than new theory, so it rests on the established AOP framework as a domain assumption with no free parameters or invented entities introduced.

axioms (1)

domain assumption The Adverse Outcome Pathway (AOP) framework accurately structures toxicity mechanisms from Molecular Initiating Event to Adverse Outcome across organs.
Invoked as the grounding for the entire benchmark and evaluation.

pith-pipeline@v0.9.0 · 5501 in / 1158 out tokens · 66162 ms · 2026-05-10T19:09:30.998758+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO).
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ an LLM-based evaluator to assess the reasoning quality based on four complementary metrics... Hallucination Avoidance, Causal Coherence, Biological Fidelity, Overall

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
cs.AI 2026-05 unverdicted novelty 6.0

MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES
cs.AI 2026-05 unverdicted novelty 6.0

HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark wh...

Reference graph

Works this paper leans on

44 extracted references · 5 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

The Llama 3 Herd of Models

Adverse outcome pathways: a conceptual framework to support ecotoxicology research and risk assessment. Environmental toxicology and chemistry, 29(3):730–741. Anthropic. 2025a. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5. Released: 2025-10-16. Anthropic. 2025b. Introducing Claude Sonnet 4.5. https://www.anthropic.com/ne...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Mol-llama: Towards general understanding of molecules in large molecular language model. arXiv preprint arXiv:2502.13449. Marcel Leist, Ahmed Ghallab, Rabea Graepel, Rosemarie Marchan, Reham Hassan, Susanne Hougaard Bennekou, Alice Limonciel, Mathieu Vinken, Stefan Schildknecht, Tanja Waldmann, and 1 others. 2017. Adverse outcome pathways: opportunities,...

work page arXiv 2017
[3]

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations. arXiv preprint arXiv:2505.21318. Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, and Huimin Zhao. 2025. Fgbench: A dataset and benchmark for molecular property reasoning at functional group-level in large language models. arXiv preprint arXiv:2508.01055. Saul B ...

work page arXiv 2025
[4]

Payal Rana, Stephen Kogut, Xuerong Wen, Fatemeh Akhlaghi, and Michael D Aleo

Cotox: Chain-of-thought-based molecular toxicity reasoning and prediction. arXiv preprint arXiv:2508.03159. Payal Rana, Stephen Kogut, Xuerong Wen, Fatemeh Akhlaghi, and Michael D Aleo. 2020. Most influential physicochemical and in vitro assay descriptors for hepatotoxicity and nephrotoxicity prediction. Chemical Research in Toxicology, 33(7):1780–1790....

work page arXiv 2020
[5]

Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Madeleine Yang, Lauren T May, Geoffrey I Webb, Li Li, Shirui Pan, and George Church

Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623. Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Madeleine Yang, Lauren T May, Geoffrey I Webb, Li Li, Shirui Pan, and George Church. 2025. Large language models for drug discovery and development. Patterns, 6(10). Jiaxi Zhuang, Yaorui Shi, J...

2025
[6]

Johanna Zilliacus, Monica K Draskau, Hanna KL Johansson, Terje Svingen, and Anna Beronius

Reasoning-enhanced large language models for molecular property prediction. arXiv preprint arXiv:2510.10248. Johanna Zilliacus, Monica K Draskau, Hanna KL Johansson, Terje Svingen, and Anna Beronius. 2024. Building an adverse outcome pathway network for estrogen-, androgen-and steroidogenesis-mediated reproductive toxicity. Frontiers in Toxicology, 6:1357...

work page arXiv 2024
[7]

A high score means little to no hallucination and strong grounding in the provided AOP and inputs

Hallucination_Avoidance — The degree to which the model avoids inventing unsupported facts. A high score means little to no hallucination and strong grounding in the provided AOP and inputs
[8]

Each step should follow causally from the previous one (MIE→KE→AO→Organ Toxicity) without unjustified jumps, contradictions, or reversed order

Causal_Coherence — Logical consistency of the mechanistic chain. Each step should follow causally from the previous one (MIE→KE→AO→Organ Toxicity) without unjustified jumps, contradictions, or reversed order
[9]

Uses correct terminology, accurate MIE/KE/AO relationships, and reflects realistic heart/liver/kidney toxicology and physiology

Biological_Fidelity — Biological validity of the mechanism. Uses correct terminology, accurate MIE/KE/AO relationships, and reflects realistic heart/liver/kidney toxicology and physiology
[10]

Hallucination_Avoidance

Overall — An overall quality score summarizing the four criteria above. Output Format Requirement: You must output a single valid JSON object with the following structure: { "Hallucination_Avoidance": <number from 1 to 10>, "Causal_Coherence": <number from 1 to 10>, "Biological_Fidelity": <number from 1 to 10>, "Overall": <number from 1 to 10>, "Explanati...
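The judge output format quoted above can be checked mechanically before scores are aggregated. The following is a minimal sketch, not the authors' code: it assumes only the four score keys visible in the excerpt and ignores any other fields, since the rest of the structure is truncated in the source. The function name `parse_judge_scores` is hypothetical.

```python
import json

# The four score fields shown in the excerpt; any other keys are ignored
# because the remainder of the format is truncated in the source.
SCORE_KEYS = (
    "Hallucination_Avoidance",
    "Causal_Coherence",
    "Biological_Fidelity",
    "Overall",
)


def parse_judge_scores(raw: str) -> dict:
    """Parse a judge reply and return the four scores (hypothetical helper).

    Raises ValueError if the reply is not a JSON object or if any score is
    missing, non-numeric, or outside the 1 to 10 range.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a single JSON object")

    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
        if not 1 <= value <= 10:
            raise ValueError(f"{key} must be between 1 and 10")
        scores[key] = value
    return scores


# Scores taken from the ToxReason-4B-GRPO case study quoted on this page.
reply = (
    '{"Hallucination_Avoidance": 8, "Causal_Coherence": 9, '
    '"Biological_Fidelity": 8, "Overall": 8}'
)
print(parse_judge_scores(reply))
```

Rejecting malformed replies up front, rather than coercing them, keeps a judge formatting failure from being silently counted as a low reasoning score.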
[11]

non-active

When interpreting reference evidence: - Use ONLY Activation Examples to infer activation. - Use ONLY Inhibition Examples to infer inhibition. - Never infer inhibition from “non-active”. - Never infer activation from “non-inhibit”. - Give more weight to examples with higher similarity scores. - Base all conclusions on structural similarity + target evidence
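One way to read the evidence rules above is as a similarity-weighted vote per target. The sketch below is an illustration under stated assumptions, not the paper's procedure: the tuple layout `(target, kind, similarity)`, the function name `infer_mie_direction`, and the choice to sum similarity scores are all hypothetical. What it does take from the prompt is that only explicit Activation or Inhibition examples count, and that higher-similarity examples weigh more.

```python
from collections import defaultdict

DIRECTIONS = ("Activation", "Inhibition")


def infer_mie_direction(examples):
    """Infer a direction per target from reference examples (hypothetical helper).

    examples: iterable of (target, kind, similarity) tuples.
    Returns {target: "Activation" | "Inhibition"} for targets with a clear winner.
    """
    weights = defaultdict(lambda: {"Activation": 0.0, "Inhibition": 0.0})
    for target, kind, similarity in examples:
        if kind not in DIRECTIONS:
            # "non-active" or "non-inhibit" evidence is never converted
            # into support for the opposite direction; it is skipped.
            continue
        # Assumption: weight each example by its similarity score.
        weights[target][kind] += similarity

    result = {}
    for target, w in weights.items():
        if w["Activation"] == w["Inhibition"]:
            continue  # tie: leave the target undecided
        result[target] = max(w, key=w.get)
    return result


# Illustrative values only; these are not data from the benchmark.
examples = [
    ("KCNH2", "Inhibition", 0.9),
    ("KCNH2", "Activation", 0.4),
    ("GR", "non-active", 0.95),
]
print(infer_mie_direction(examples))
```

With these illustrative inputs, the high-similarity "non-active" record for GR contributes nothing, so GR stays undecided rather than being labeled as inhibited.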
[12]

Step 1, Step 2, Step 3

For EACH inferred MIE, produce mechanistic reasoning describing how it can lead to organ toxicity: - Use “Step 1, Step 2, Step 3...” format. - Each step ≤ 2 sentences. - Steps must follow: MIE → Key Events (KEs) → Adverse Outcome (AO) → organ toxicity
[13]

cardiotoxicity

Only consider the following toxicity types: - “cardiotoxicity” - “liver toxicity” - “kidney toxicity” Choose exactly ONE organ toxicity per MIE. { "MIE_Prediction": { "Target1": "Activation or Inhibition", "Target2": "Activation or Inhibition", ... }, "Toxicity_Reasoning": [ { "MIE": "", "Reasoning_Steps": [ "Step 1: ...", "Step 2: ...", "Step 3: ..." ], ...
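The response constraints quoted above lend themselves to simple format checks. This is a minimal sketch that covers only what the excerpt shows: direction values in `MIE_Prediction`, consecutively numbered `Reasoning_Steps`, and the three allowed toxicity labels. The function names are hypothetical, the name of the field that carries the toxicity label is not visible in the truncated schema and is therefore not assumed, and the two-sentence limit per step is not checked.

```python
import re

ALLOWED_DIRECTIONS = {"Activation", "Inhibition"}
ALLOWED_TOXICITIES = {"cardiotoxicity", "liver toxicity", "kidney toxicity"}
# A step must look like "Step <n>: <text>".
STEP_PATTERN = re.compile(r"^Step (\d+): \S")


def check_mie_prediction(mie_prediction: dict) -> list:
    """Return a problem message for each target with an unexpected direction."""
    return [
        f"{target}: unexpected direction {direction!r}"
        for target, direction in mie_prediction.items()
        if direction not in ALLOWED_DIRECTIONS
    ]


def check_reasoning_steps(steps: list) -> list:
    """Return a problem message for each step that is mislabeled or out of order."""
    problems = []
    for expected, step in enumerate(steps, start=1):
        match = STEP_PATTERN.match(step)
        if match is None or int(match.group(1)) != expected:
            problems.append(f"position {expected}: expected 'Step {expected}: ...'")
    return problems


def is_allowed_toxicity(label: str) -> bool:
    """True if the label is one of the three organ toxicities in the prompt."""
    return label in ALLOWED_TOXICITIES


# Step text taken from the ToxReason-4B-GRPO case study quoted on this page.
steps = [
    "Step 1: Glucocorticoid receptor activation leads to reduced mitochondrial fatty acid β-oxidation.",
    "Step 2: Reduced β-oxidation results in increased fatty acid availability in hepatocytes.",
]
print(check_mie_prediction({"GR": "Activation"}))
print(check_reasoning_steps(steps))
print(is_allowed_toxicity("liver toxicity"))
```

Checks like these only establish that a response is well-formed; whether the chain is biologically faithful is what the judge rubric is for.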
[14]

Activation of the glucocorticoid receptor (GR) leads to dysregulation of gene expression involved in metabolism and inflammation
[15]

This dysregulation results in increased hepatic lipid accumulation and oxidative stress, contributing to liver injury
[16]

•Predicted Toxicity:Liver toxicity ToxReason-4B-GRPO •MIE: GR activation •Model Reasoning:

Chronic liver injury progresses to hepatocyte necrosis and inflammation, leading to liver toxicity. •Predicted Toxicity:Liver toxicity ToxReason-4B-GRPO •MIE: GR activation •Model Reasoning:
[17]

Glucocorticoid receptor (GR) activation leads to reduced hepatic fatty acid β-oxidation in the liver
[18]

Reduced fatty acid β-oxidation leads to decreased fatty acid oxidation and increased fatty acid accumulation
[19]

Increased fatty acid accumulation leads to increased hepatic triglyceride accumulation (steatosis)
[20]

•Predicted Toxicity:Liver toxicity LLM-as-a-Judge Evaluation: Qwen-4B-Inst

Increased hepatic steatosis leads to liver toxicity. •Predicted Toxicity:Liver toxicity LLM-as-a-Judge Evaluation: Qwen-4B-Inst. •Judge Scores: –Hallucination Avoidance:4 –Causal Coherence:5 –Biological Fidelity:5 –Overall:4 •Judge Summary:
[21]

Correctly identifies glucocorticoid receptor (GR) activation as the initiating molecular event
[22]

Mentions general lipid accumulation but omits critical intermediate steps such as decreased acyl-CoA dehydrogenases and impaired mitochondrial β-oxidation
[23]

ToxReason-4B-GRPO •Judge Scores: –Hallucination Avoidance:8 –Causal Coherence:9 –Biological Fidelity:8 –Overall:8 •Judge Summary:

Introduces unsupported mechanisms (oxidative stress, inflammation, hepatocyte necrosis) not specified in the ground-truth AOP. ToxReason-4B-GRPO •Judge Scores: –Hallucination Avoidance:8 –Causal Coherence:9 –Biological Fidelity:8 –Overall:8 •Judge Summary:
[24]

Glucocorticoid receptor activation leads to reduced mitochondrial fatty acid β-oxidation
[25]

Reduced β-oxidation results in increased fatty acid availability in hepatocytes
[26]

Excess fatty acids are stored as triglycerides, causing hepatic steatosis
[27]

Table G:Case study combining model-generated AOP reasoning (highlighted for alignment) and LLM-as-a-Judge evaluation (no highlighting)

Hepatic steatosis is directly linked to liver toxicity. Table G:Case study combining model-generated AOP reasoning (highlighted for alignment) and LLM-as-a-Judge evaluation (no highlighting). The upper block shows the reference AOP context and each model’s reasoning; the lower block reports judge scores and diagnostic summaries for the same responses. Cas...
[28]

Inhibition of acetylcholinesterase increases synaptic acetylcholine in cardiac parasympathetic pathways
[29]

Excess acetylcholine overstimulates M2 muscarinic receptors, leading to slowed cardiac electrical conduction and increased vagal tone
[30]

Resulting electrical disturbances progress to arrhythmias and impaired cardiac output, culminating in cardiotoxicity. •Predicted Toxicity:cardiotoxicity GPT-5 Final Predicted Toxicity: cardiotoxicity Table H:Case study example illustrating strong alignment between an AChE-mediated Gold AOP (Aop:559) and the GPT-5 reasoning path. Highlighted segments den...
[31]

Inhibition of COX1 in renal tissues reduces synthesis of vasodilatory prostaglandins that normally maintain renal blood flow and glomerular filtration
[32]

Reduced prostaglandin levels cause renal vasoconstriction, decreased perfusion, and medullary ischemia accompanied by oxidative stress
[33]

Sustained ischemic and oxidative injury progresses to proximal tubular damage and impaired renal function, resulting in kidney toxicity. •Predicted Toxicity:kidney toxicity GPT-5.1 Final Predicted Toxicity: kidney toxicity Table I:Case study example illustrating alignment between a COX1-mediated renal toxicity AOP (Aop:177) and the GPT-5.1 reasoning path....
[34]

The molecule activates the glucocorticoid receptor (GR) in hepatocytes, altering transcription of metabolic genes
[35]

GR-driven metabolic reprogramming promotes lipid storage and functionally reduces fatty-acid oxidation capacity, consistent with decreased mitochondrial β-oxidation
[36]

Persistent lipid accumulation hepatic triglyceride buildup, progressing to liver steatosis and liver toxicity. •Predicted Toxicity:liver toxicity o3 Final Predicted Toxicity: liver toxicity Table J:Case study example illustrating strong alignment between a GR-mediated Gold AOP (Aop:318) and the model reasoning path. Highlighted segments denote aligned mol...
[37]

OAT1 inhibition reduces the renal uptake of organic anions, including drugs and endogenous metabolites
[38]

This leads to decreased renal clearance and elevated plasma concentrations of potentially nephrotoxic compounds
[39]

•Predicted Toxicity:kidney toxicity Qwen3-4B Final Predicted Toxicity: kidney toxicity Table K:Case study example illustrating a transporter-mediated kidney toxicity pathway

Accumulation of these substances induces tubular damage, impairing nephron function and progressing to kidney injury. •Predicted Toxicity:kidney toxicity Qwen3-4B Final Predicted Toxicity: kidney toxicity Table K:Case study example illustrating a transporter-mediated kidney toxicity pathway. The Qwen3-4B model reproduces the core OAT1 inhibition mechanism...
[40]

KCNH2 (potassium channel hERG) inhibition reduces potassium efflux during cardiac action potential repolarization, preventing normal membrane potential restoration
[41]

Reduced repolarization leads to prolongation of action potential duration
[42]

Action potential prolongation causes QT interval prolongation, reflecting delayed ventricular repolarization
[43]

Extended repolarization creates conditions for early afterdepolarizations and premature depolarizations, increasing arrhythmogenicity
[44]

•Predicted Toxicity:cardiotoxicity ToxReason-4B-GRPO Final Predicted Toxicity: cardiotoxicity Table L:Case study example illustrating an ion-channel–mediated cardiotoxicity pathway

These electrophysiological disturbances culminate in heart failure. •Predicted Toxicity:cardiotoxicity ToxReason-4B-GRPO Final Predicted Toxicity: cardiotoxicity Table L:Case study example illustrating an ion-channel–mediated cardiotoxicity pathway. The ToxReason-4B-GRPO model accurately reproduces the hERG/KCNH2 inhibition–driven electrophysiological me...