pith. machine review for the scientific record

arxiv: 2604.23938 · v2 · submitted 2026-04-27 · 💻 cs.CL

Recognition: 2 theorem links


TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Klas Hatje, Melanie Guerard, Tatyana Doktorova, Xiaochen Zheng, Zhiwen Jiang

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords Target Safety Assessment · multi-agent framework · human-in-the-loop · evidence synthesis · report drafting · biomedical data · agentic AI · toxicology automation

The pith

TSAssistant deploys specialized AI sub-agents to draft citable sections of target safety assessment reports while humans retain editing and approval control through an interactive loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a modular multi-agent system that breaks target safety assessment report writing into separate sections, each handled by a dedicated sub-agent. These agents pull structured data, literature, and other evidence from biomedical sources using standardized tools and output individually referenced content. A hierarchical set of instructions guides the agents, and an interactive refinement loop lets users edit sections, add sources, or trigger revisions while the system keeps memory of prior steps. The goal is to shift the mechanical work of gathering and organizing heterogeneous evidence onto the agents so that toxicologists focus on judgment and final decisions. If the approach works, it would make the iterative process of evaluating therapeutic target safety more scalable and reproducible without removing expert oversight.

Core claim

We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations.

What carries the argument

A coordinated pipeline of specialised sub-agents, each assigned to one TSA report section, that retrieve evidence via tool interfaces and operate under a hierarchical instruction architecture plus an interactive refinement loop that preserves conversational memory.
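The paper publishes no code, so the shape of this pipeline can only be sketched. The following is a minimal illustration, under assumptions, of what a section-per-sub-agent orchestrator with a hierarchical instruction stack (system prompt, skill module, runtime user instructions) might look like; the section names, class names, and method signatures are all hypothetical, and the retrieval and synthesis steps are placeholders for tool calls and LLM calls.

```python
from dataclasses import dataclass

# Hypothetical names throughout: the paper does not publish code, so the
# section list, prompt layers, and agent interface below are illustrative.

SECTIONS = ["genetics", "transcriptomics", "target_homology",
            "pharmacology", "clinical"]  # assumed TSA sections

@dataclass
class SubAgent:
    section: str
    system_prompt: str   # global behaviour rules
    skill_module: str    # domain-specific instructions for this section

    def run(self, target: str, user_instructions: str) -> dict:
        # Hierarchical instruction architecture: system prompt, then skill
        # module, then runtime user instructions, composed in that order.
        prompt = "\n\n".join(
            [self.system_prompt, self.skill_module, user_instructions])
        evidence = self.retrieve(target)  # standardised tool interfaces
        draft = self.synthesise(prompt, evidence)
        return {"section": self.section, "text": draft,
                "citations": [e["source"] for e in evidence]}

    def retrieve(self, target: str) -> list:
        # Placeholder: real sub-agents would query curated biomedical
        # sources (e.g. GWAS Catalog, UniProt, DrugBank) via tool calls.
        return [{"source": f"{self.section}-db:{target}", "text": "..."}]

    def synthesise(self, prompt: str, evidence: list) -> str:
        # Placeholder for an LLM call conditioned on prompt + evidence.
        return (f"[{self.section} findings, "
                f"{len(evidence)} evidence item(s)]")

def draft_report(target: str, user_instructions: str = "") -> list:
    """Orchestrator: one sub-agent per TSA section, citable output each."""
    agents = [SubAgent(s, "You are a TSA drafting agent.",
                       f"Skill module: {s} evidence synthesis.")
              for s in SECTIONS]
    return [a.run(target, user_instructions) for a in agents]

report = draft_report("EGFR")
```

The point of the decomposition is that each returned section carries its own citation list, which is what would make sections "individually citable" rather than attributing a whole report to a pooled evidence set.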

If this is right

  • The system produces individually citable, evidence-grounded sections for each part of a TSA report.
  • It reduces the mechanical burden of evidence synthesis and report drafting for toxicologists.
  • It enables a hybrid workflow in which agentic AI handles synthesis while humans keep final decision authority.
  • The interactive loop allows users to edit sections, upload new sources, or re-run specific agents while maintaining memory across iterations.
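The loop described in the bullets above can be sketched as a small stateful session. This is an illustrative reconstruction, not the authors' API: the class, method names, and memory representation are assumptions chosen to mirror the four user actions (edit, append, upload, re-invoke) and the claim that conversational memory persists across iterations.

```python
# Illustrative sketch of the interactive refinement loop; class and
# method names are assumptions, not the authors' implementation.

class RefinementSession:
    def __init__(self, sections: dict):
        self.sections = sections       # section name -> current draft
        self.memory: list = []         # conversational memory across turns

    def edit(self, name: str, new_text: str):
        """(a) user manually edits a section."""
        self.memory.append(("edit", name))
        self.sections[name] = new_text

    def add_source(self, name: str, source: str):
        """(c) user uploads an additional source for later revisions."""
        self.memory.append(("source", name, source))

    def revise(self, name: str):
        """(d) re-invoke the sub-agent with memory of prior steps."""
        history = [m for m in self.memory if m[1] == name]
        self.memory.append(("revise", name))
        # Placeholder for a sub-agent call conditioned on draft + history.
        self.sections[name] += f" [revised with {len(history)} prior event(s)]"

session = RefinementSession({"genetics": "Initial draft."})
session.add_source("genetics", "user-uploaded source")  # hypothetical source
session.revise("genetics")
```

The design choice worth noting is that the memory log doubles as provenance: every human intervention and agent re-invocation is recorded, which is what would make the audit-trail extension suggested below in the editorial reading plausible.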

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same section-by-section agent structure could be applied to other regulatory or scientific documents that integrate many data types.
  • By logging all human edits and agent revisions, the framework creates a traceable record that might help audit reproducibility across different assessment teams.
  • The design offers a practical testbed for measuring how often agent hallucinations occur in specialized biomedical domains and how effectively human feedback reduces them over multiple rounds.

Load-bearing premise

Specialised sub-agents can reliably pull accurate, relevant, and unbiased evidence from heterogeneous biomedical sources and turn it into citable sections without introducing factual errors or hallucinations that humans must later catch.

What would settle it

A controlled test on a set of completed TSA cases in which experts compare TSAssistant-generated sections against the original expert-written versions and count the rate of factual errors, missing citations, or required major revisions.
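The proposed comparison reduces to simple per-section annotation counts. A minimal sketch of that scoring, assuming hypothetical annotation fields (the paper specifies no evaluation protocol):

```python
# Sketch of the controlled comparison proposed above: experts annotate
# each agent-drafted section against the expert-written original, and we
# compute per-category rates. Annotation field names are hypothetical.

def error_rates(annotations: list) -> dict:
    """annotations: one record per section pair, with integer counts of
    factual errors and missing citations, plus a major-revision flag."""
    n = len(annotations)
    return {
        "factual_errors_per_section":
            sum(a["factual_errors"] for a in annotations) / n,
        "missing_citations_per_section":
            sum(a["missing_citations"] for a in annotations) / n,
        "major_revision_rate":
            sum(a["needs_major_revision"] for a in annotations) / n,
    }

sample = [
    {"factual_errors": 0, "missing_citations": 1, "needs_major_revision": False},
    {"factual_errors": 2, "missing_citations": 0, "needs_major_revision": True},
]
rates = error_rates(sample)
```

Averaging per section, rather than per report, matters here: the section-per-agent architecture means error rates can be attributed to individual sub-agents and their sources.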

Figures

Figures reproduced from arXiv: 2604.23938 by Klas Hatje, Melanie Guerard, Tatyana Doktorova, Xiaochen Zheng, Zhiwen Jiang.

Figure 1
Figure 1: Hierarchical agent architecture of TSASSISTANT. An Orchestrator decomposes the assessment into Research Subagents and Synthesis Subagents, each targeting a single TSA domain. Pre-execution hooks handle security checks, memory injection, path validation, and sequential control; post-execution hooks perform citation validation, memory compression, state tracking, and output verification. Runtime hooks provid… view at source ↗
Figure 2
Figure 2: Interactive refinement loop in TSASSISTANT. After initial section generation, the user reviews each section and may: (a) manually edit content, (b) append new information, (c) upload additional sources or graphics, or (d) re-invoke the subagent for targeted revision. Conversational memory preserves context across iterations; user feedback progressively adapts retrieval strategies and prompt templates. anal… view at source ↗
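The Figure 1 caption describes pre- and post-execution hooks wrapped around each sub-agent call. The wiring below is a generic sketch of that pattern, assuming nothing about the authors' implementation beyond the caption; the hook bodies are toy stand-ins for the real security, path-validation, and citation-validation checks.

```python
# Sketch of the pre-/post-execution hook pattern from Figure 1; hook
# names mirror the caption but the wiring is an assumption.

def run_with_hooks(agent_fn, payload, pre_hooks, post_hooks):
    """Wrap a sub-agent call in ordered pre- and post-execution hooks."""
    for hook in pre_hooks:    # e.g. security check, path validation
        payload = hook(payload)
    result = agent_fn(payload)
    for hook in post_hooks:   # e.g. citation validation, state tracking
        result = hook(result)
    return result

# Toy hooks standing in for the real checks.
def validate_path(p):
    assert not p["path"].startswith(".."), "path validation failed"
    return p

def validate_citations(r):
    r["citations_ok"] = all(c.strip() for c in r.get("citations", []))
    return r

out = run_with_hooks(
    lambda p: {"text": f"section for {p['target']}", "citations": ["src1"]},
    {"target": "EGFR", "path": "reports/egfr"},
    pre_hooks=[validate_path],
    post_hooks=[validate_citations],
)
```

Keeping validation in hooks rather than inside each sub-agent is what lets the same citation and output checks apply uniformly across all TSA sections.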
Original abstract

Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert-driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TSAssistant, a multi-agent framework for assisting in Target Safety Assessment (TSA) report drafting. It decomposes the process into specialized sub-agents that retrieve structured/unstructured data and literature from biomedical sources via standardized tool interfaces, governed by hierarchical prompts (system, domain-specific, runtime user instructions). A human-in-the-loop interactive refinement loop allows editing, appending sources, or re-invoking agents while maintaining conversational memory. The central claim is that this produces individually citable, evidence-grounded TSA sections, reducing mechanical burden while retaining toxicologist oversight.

Significance. The modular, section-based architecture with human-in-the-loop safeguards represents a thoughtful design for augmenting expert-driven workflows in a high-stakes domain. If empirically validated, it could improve scalability and reproducibility of TSA processes. However, the manuscript provides no performance data, so any assessment of significance remains speculative at present.

major comments (2)
  1. Abstract: The claim that the framework produces 'individually citable, evidence-grounded sections' is load-bearing for the paper's contribution, yet the manuscript contains no empirical evaluation, accuracy metrics, hallucination rates, error analysis, or comparison against expert-written sections or existing workflows to support it.
  2. Abstract and system description: The assumption that specialized sub-agents can reliably retrieve accurate, relevant evidence from heterogeneous biomedical sources and synthesize it without factual errors (the 'weakest assumption' in the architecture) is untested; no case studies on real targets, qualitative assessments, or safeguards beyond human review are reported.
minor comments (2)
  1. The manuscript would benefit from explicit discussion of related work on multi-agent systems for scientific report generation and existing TSA automation efforts to better situate the novelty of the hierarchical prompt architecture and tool interfaces.
  2. Clarify how the system handles conflicting evidence across sub-agents or maintains citation consistency in the final report, as this is central to the 'evidence-grounded' claim but described only at a high level.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review of our manuscript on TSAssistant. We appreciate the recognition of the modular, section-based, human-in-the-loop design. The comments correctly identify that the current work is a system description without accompanying empirical evaluations. We address each point below and outline targeted revisions.

Point-by-point responses
  1. Referee: Abstract: The claim that the framework produces 'individually citable, evidence-grounded sections' is load-bearing for the paper's contribution, yet the manuscript contains no empirical evaluation, accuracy metrics, hallucination rates, error analysis, or comparison against expert-written sections or existing workflows to support it.

    Authors: We acknowledge that the manuscript provides no quantitative evaluations, metrics, or comparisons, as it is a framework description paper rather than an empirical study. The claim of 'individually citable, evidence-grounded sections' is grounded in the architecture: each specialized sub-agent uses standardized tool interfaces to retrieve from specific biomedical sources, preserving traceable citations, while the interactive refinement loop enables expert verification and editing. We do not assert error-free automation. We will revise the abstract to clarify that these properties are achieved by design through source attribution and human oversight, and we will add an explicit limitations section discussing the absence of empirical validation along with plans for future evaluations. revision: partial

  2. Referee: Abstract and system description: The assumption that specialized sub-agents can reliably retrieve accurate, relevant evidence from heterogeneous biomedical sources and synthesize it without factual errors (the 'weakest assumption' in the architecture) is untested; no case studies on real targets, qualitative assessments, or safeguards beyond human review are reported.

    Authors: We agree this assumption is central and remains untested in the presented work. The framework addresses reliability through hierarchical prompts (system, domain-specific, and runtime), standardized tool interfaces for retrieval, and conversational memory. The primary safeguard is the human-in-the-loop, where toxicologists review, edit, append sources, or re-invoke agents. No case studies or qualitative assessments on real targets are included because the manuscript focuses on the architectural paradigm and workflow rather than deployment results. We will expand the system description and discussion sections to more explicitly detail these safeguards, state the reliance on human review, and note the lack of empirical testing as a limitation for future research. revision: partial

Standing simulated objections (unresolved)
  • Quantitative performance data, accuracy metrics, hallucination rates, error analysis, case studies on real targets, qualitative assessments, and comparisons against expert-written sections or existing workflows.

Circularity Check

0 steps flagged

No significant circularity; no derivations or self-referential reductions present

full rationale

The manuscript describes an architectural multi-agent framework for TSA report generation using specialized sub-agents, tool interfaces, hierarchical prompts, and human-in-the-loop refinement. No equations, fitted parameters, predictions, or derivation chains appear in the provided text or abstract. Claims rest on system design and intended workflow rather than reducing any result to a self-definition, fitted input, or self-citation chain. The absence of quantitative validation is a separate empirical concern, not a circularity issue. The paper is self-contained as a framework description with no load-bearing steps that collapse to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on the assumption that current large language models and retrieval tools can be orchestrated to produce accurate, citable biomedical evidence summaries; no free parameters, formal axioms, or new invented entities are introduced beyond standard agentic AI components.

pith-pipeline@v0.9.0 · 5527 in / 1247 out tokens · 34040 ms · 2026-05-11T01:44:34.333497+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1] Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., McMahon, A., Morales, J., Mountjoy, E., Sollis, E., et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research, 47(D1):D1005–D1012.
  2. [2] Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., Tomašev, N., Ktena, I., Kenton, Z., Rodriguez, M., et al. The ethics of advanced AI assistants. arXiv preprint arXiv:2404.16244.
  3. [3] Gao, S., Zhu, R., Sui, P., Kong, Z., Aldogom, S., Huang, Y., Noori, A., Shamji, R., Parvataneni, K., Tsiligkaridis, T., et al. Democratizing AI scientists using ToolUniverse. arXiv preprint arXiv:2509.23426.
  4. [4] Gillespie, M., Jassal, B., Stephan, R., Milacic, M., Rothfels, K., Senff-Ribeiro, A., Griss, J., Sevilla, C., Matthews, L., Gong, C., et al. The Reactome pathway knowledgebase 2022. Nucleic Acids Research, 50(D1):D419–D426.
  5. [5] Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864.
  6. [6] Harrison, R. K. Phase II and phase III failures: 2013–2015. Nature Reviews Drug Discovery, 15(12):817–818.
  7. [7] Kim, Y., Gu, K., Park, C., Park, C., Schmidgall, S., Heydari, A. A., Yan, Y., Zhang, Z., Zhuang, Y., Malhotra, M., et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296.
  8. [8] Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
  9. [9] Piñero, J., Ramírez-Anguita, J. M., Saüch-Pitarch, J., Ronzano, F., Centeno, E., Sanz, F., and Furlong, L. I. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research, 48(D1):D845–D855.
  10. [10] The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics, 224(1):iyad031.
  11. [11] UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research, 51(D1):D523–D531.
  12. [12] Wang, Z., Zhu, Y., Zhao, H., Zheng, X., Sui, D., Wang, T., Tang, W., Wang, Y., Harrison, E., Pan, C., et al. ColaCare: Enhancing electronic health record modeling through large language model-driven multi-agent collaboration. In Proceedings of the ACM on Web Conference 2025, pp. 2250–2261.
  13. [13] Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research, 46(D1):D1074–D1082.
  14. [14] Wu, T., Jiang, E., Donsbach, A., Gray, J., Molina, A., Terry, M., and Cai, C. J. PromptChainer: Chaining large language model prompts through visual programming. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–10. ACM.
  15. [15] Zhou, Y., Song, L., and Shen, J. MAM: Modular multi-agent framework for multi-modal medical diagnosis via role-specialized collaboration. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 25319–25333.