pith. sign in

arxiv: 2604.10874 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Adverse Outcome PathwaysRetrieval-Augmented GenerationLarge Language ModelsToxicologyQuestion AnsweringAOP-WikiHallucination Mitigation
0
0 comments X

The pith

A retrieval-augmented system called AOP-Smart lifts large language model accuracy on adverse outcome pathway tasks from under 35 percent to 95-100 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that retrieval from the official AOP-Wiki database of key events, key event relationships, and pathway details can ground large language model answers to toxicological questions and largely eliminate factual errors. Without this retrieval step the tested models answered only 15 to 35 percent of a 20-question benchmark correctly; with it they reached 95 to 100 percent. A sympathetic reader would care because adverse outcome pathways organize mechanistic toxicology knowledge used in risk assessment, and reliable automated support could speed up pathway construction and interpretation without replacing expert review.

Core claim

By feeding user questions to an LLM together with retrieved excerpts from AOP-Wiki XML records on key events and their relationships, the AOP-Smart framework produces answers whose accuracy on identification, upstream-downstream, and complex retrieval tasks matches or exceeds human-curated references.

What carries the argument

Retrieval-Augmented Generation pipeline that selects and injects relevant Key Events, Key Event Relationships, and AOP metadata from the AOP-Wiki XML corpus before the language model generates its response.

If this is right

  • Complex AOP retrieval tasks that currently require manual literature searches can be automated at high reliability.
  • Hallucination rates in LLM outputs on toxicology knowledge drop sharply once official pathway data are retrieved and supplied.
  • The same retrieval pattern can be applied to any domain whose knowledge is stored in structured XML or ontology formats.
  • Downstream applications such as automated AOP network construction or regulatory report drafting become more feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the accuracy gains hold on larger and more diverse question sets, AOP-Smart could serve as a first-pass tool that toxicologists then verify rather than a replacement for expert curation.
  • The approach suggests a general template for reducing LLM errors in any scientific field that maintains a public, machine-readable knowledge base.
  • Testing the framework on questions that cross multiple AOPs or that require inference across pathways would reveal whether the current retrieval scope is sufficient.

Load-bearing premise

The authors' 20-question test set represents typical real-world AOP tasks and that correctness was judged without bias or post-selection adjustments.

What would settle it

An independently assembled collection of at least 100 AOP questions on which the RAG-enhanced models fall below 80 percent accuracy or on which the unaugmented models match or exceed the reported scores.

Figures

Figures reproduced from arXiv: 2604.10874 by Lu Yan, Qinjiang Niu.

Figure 1
Figure 1. Figure 1: Overview of the AOP-Smart framework. (A) In the absence of retrieval augmentation, the LLM directly generates responses based solely on its internal parameters, which may lead to incomplete or hallucinated outputs. (B) With the proposed AOP-Smart framework, user queries are first used to retrieve relevant KE candidates from an indexed KE list, followed by structured expansion to KE, KER, and AOP-level info… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the KE-based expansion process, in￾cluding KE augmentation, KER reconstruction, and AOP retrieval, followed by integration with LLM-based reason￾ing. in the knowledge base, and selects those relationships that connect any two KEs within the expanded KE set, forming an expanded KER set. Through this joint ex￾pansion process, the system can not only identify the event nodes most relevant to the q… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Python-based system interface, including configuration, input, and output modules, with adjustable parameters for model selection and KE-based re￾trieval control. model output, where a lower value leads to more stable output and a higher value leads to more diverse gen￾erated results; and the Top-N KE parameter, which is the core control variable in this study, used to specify the number of… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the 20-question AOP benchmark, grouped into four task categories: KE identification, down￾stream KE retrieval, upstream KE retrieval, and complex AOP query. According to different task types, the test set is di￾vided into four categories: (1) Key Event Identification (KE Identifica￾tion): Given a KE ID, the model is required to identify the corresponding biological event and its applicable spec… view at source ↗
read the original abstract

Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0\%, 35.0\%, and 20.0\%, respectively; after using RAG, their accuracies increased to 95.0\%, 100.0\%, and 95.0\%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AOP-Smart, a retrieval-augmented generation (RAG) framework that retrieves Key Events (KEs), Key Event Relationships (KERs), and AOP information from AOP-Wiki XML data to ground large language model responses for AOP-related question answering. It evaluates the framework on three LLMs (GPT, DeepSeek, Gemini) using a custom test set of 20 AOP tasks covering KE identification, upstream/downstream KE retrieval, and complex AOP retrieval, reporting accuracy gains from 15.0%/35.0%/20.0% (no RAG) to 95.0%/100.0%/95.0% (with RAG) and concluding that the approach significantly alleviates hallucination.

Significance. If the reported gains are shown to generalize, AOP-Smart would provide a practical, domain-specific RAG pipeline for improving LLM reliability in toxicological mechanistic reasoning. The direct grounding in official AOP-Wiki XML is a clear methodological strength, and the magnitude of the before/after delta on structured knowledge tasks suggests RAG can be particularly effective here. The work is timely given growing interest in LLMs for regulatory science.

major comments (2)
  1. [Evaluation section] Evaluation section (test set construction and results): The central empirical claim rests on accuracy improvements observed on an author-constructed set of only 20 questions, yet the manuscript provides no sampling protocol, diversity metrics, question sourcing details, or inter-annotator agreement for correctness judgments. This directly affects the load-bearing assertion that the method 'significantly alleviate[s] the hallucination problem,' as the observed delta cannot be distinguished from performance on a narrow, potentially favorable subset without these controls.
  2. [Results] Results paragraph and Table (implied by the reported percentages): No error analysis, breakdown of remaining 5% errors after RAG, or comparison against standard (non-AOP-specific) RAG baselines is presented. Without these, it is unclear whether the gains are attributable to the AOP-oriented retrieval design or simply to any retrieval augmentation on this particular test distribution.
minor comments (2)
  1. [Abstract] The abstract and method description could more explicitly note the small scale and custom nature of the test set as a limitation when claiming broad alleviation of hallucinations.
  2. [Experiments] Notation for the three LLMs (GPT, DeepSeek, Gemini) is used inconsistently with the full model names in the results; clarifying the exact versions would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The concerns regarding the evaluation methodology are important, and we address each major comment below with our responses and planned revisions.

read point-by-point responses
  1. Referee: [Evaluation section] Evaluation section (test set construction and results): The central empirical claim rests on accuracy improvements observed on an author-constructed set of only 20 questions, yet the manuscript provides no sampling protocol, diversity metrics, question sourcing details, or inter-annotator agreement for correctness judgments. This directly affects the load-bearing assertion that the method 'significantly alleviate[s] the hallucination problem,' as the observed delta cannot be distinguished from performance on a narrow, potentially favorable subset without these controls.

    Authors: We acknowledge that the manuscript provides only a high-level description of the 20-task test set and omits detailed information on its construction. The tasks were selected to span the categories of KE identification, upstream/downstream retrieval, and complex AOP retrieval using content from AOP-Wiki. To address this concern, we will revise the Evaluation section to include a more explicit account of question sourcing, selection criteria, and how ground-truth correctness was determined by reference to the official XML data. These additions will improve transparency without altering the reported results. revision: yes

  2. Referee: [Results] Results paragraph and Table (implied by the reported percentages): No error analysis, breakdown of remaining 5% errors after RAG, or comparison against standard (non-AOP-specific) RAG baselines is presented. Without these, it is unclear whether the gains are attributable to the AOP-oriented retrieval design or simply to any retrieval augmentation on this particular test distribution.

    Authors: We agree that the absence of error analysis and baseline comparisons limits interpretation of the results. In the revised manuscript we will add a short error analysis describing the nature of the residual errors after RAG. A quantitative comparison to generic RAG was not performed in the original work because the retrieval component is deliberately specialized to the AOP-Wiki XML schema and KE/KER structure; however, we will expand the discussion to explain this design choice and its expected advantages over off-the-shelf RAG. If space and resources allow, we will also include a minimal generic-RAG baseline for the same test set. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical accuracy gains are measured outcomes on an external benchmark

full rationale

The paper describes a standard RAG pipeline over AOP-Wiki XML data and reports direct before/after accuracy measurements on a 20-question test set. No equations, fitted parameters, or derivations are present that could reduce results to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation, while limited by the small author-constructed set, consists of observable accuracy deltas rather than tautological restatements, satisfying the criteria for a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes that keyword or structure-based retrieval from AOP-Wiki XML will supply complete, noise-free context for LLM prompts without introducing new errors; no free parameters are fitted and no new entities are postulated.

axioms (1)
  • domain assumption Retrieval from official AOP-Wiki XML using KEs, KERs, and AOP metadata will provide accurate and sufficient context to suppress LLM hallucinations on AOP tasks
    Invoked when describing how the RAG component augments the models; treated as given rather than empirically validated within the reported experiments.

pith-pipeline@v0.9.0 · 5608 in / 1323 out tokens · 66122 ms · 2026-05-10T16:33:53.139329+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Ankley, Richard S

    Gerald T. Ankley, Richard S. Bennett, Rus- sell J. Erickson, Dale J. Hoff, Michael W. Hor- nung, Rodney D. Johnson, David R. Mount, John W. Nichols, Christine L. Russom, Patricia K. 5 Schmieder, Jose A. Serrrano, Joseph E. Tietge, and Daniel L. Villeneuve. Adverse outcome path- ways: A conceptual framework to support eco- toxicology research and risk asse...

  2. [2]

    (Bette) Meek

    Anna Bal-Price and M.E. (Bette) Meek. Adverse outcome pathways: Application to enhance mech- anistic understanding of neurotoxicity.Pharmacol- ogy & Therapeutics, 179:84–95, November 2017

  3. [3]

    Aop-helpfinder webserver: a tool for com- prehensive analysis of the literature to support ad- verse outcome pathways development.Bioinfor- matics, 38(4):1173–1175, October 2021

    Florence Jornod, Thomas Jaylet, Ludek Blaha, Denis Sarigiannis, Luc Tamisier, and Karine Au- douze. Aop-helpfinder webserver: a tool for com- prehensive analysis of the literature to support ad- verse outcome pathways development.Bioinfor- matics, 38(4):1173–1175, October 2021

  4. [4]

    Aopwiki-explorer: An in- teractive graph-based query engine leveraging large language models.Computational Toxicology, 30:100308, June 2024

    Saurav Kumar, Deepika Deepika, Karin Slater, and Vikas Kumar. Aopwiki-explorer: An in- teractive graph-based query engine leveraging large language models.Computational Toxicology, 30:100308, June 2024

  5. [5]

    Attention is all you need.Advances in neural information pro- cessing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information pro- cessing systems, 30, 2017

  6. [6]

    Language models are few-shot learners.Advances in neural infor- mation processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural infor- mation processing systems, 33:1877–1901, 2020

  7. [7]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, Septem- ber 2019

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, Septem- ber 2019

  8. [8]

    Haochun Shi and Yanbin Zhao. Integration of ad- vanced large language models into the construction of adverse outcome pathways: Opportunities and challenges.Environmental Science & Technology, 58(35):15355–15358, August 2024

  9. [9]

    Survey of hal- lucination in natural language generation.ACM Computing Surveys, 55(12):1–38, March 2023

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, An- drea Madotto, and Pascale Fung. Survey of hal- lucination in natural language generation.ACM Computing Surveys, 55(12):1–38, March 2023

  10. [10]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meet- ing of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

  11. [11]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Ad- vances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Pik- tus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨ uttler, Mike Lewis, Wen-tau Yih, Tim Rockt¨ aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Ad- vances in neural information processing systems, 33:9459–9474, 2020

  12. [12]

    Reading wikipedia to answer open- domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and An- toine Bordes. Reading wikipedia to answer open- domain questions. InProceedings of the 55th Annual Meeting of the Association forComputa- tional Linguistics (Volume 1: Long Papers), page 1870–1879. Association for Computational Lin- guistics, 2017

  13. [13]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceed- ings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020. 6