PRAXIS: Case-distilled and code-verified AI agents for biological research

Chunyi Yang; Jingyi Zhu; Limei Xu; Min Xiao; Xukai Jiang; Yuyang Song; Zhenyu Ma

arxiv: 2605.23169 · v1 · pith:J3DXCT4Xnew · submitted 2026-05-22 · 🧬 q-bio.QM

Zhenyu Ma, Yuyang Song, Chunyi Yang, Jingyi Zhu, Limei Xu, Min Xiao, Xukai Jiang

Pith reviewed 2026-05-25 02:49 UTC · model grok-4.3

classification 🧬 q-bio.QM

keywords AI agents · case distillation · biological research · long-term memory · biocomputational tasks · agent framework · method selection · workflow organization

The pith

PRAXIS converts research cases into structured memory so AI agents can handle biological tasks with better method selection and fewer errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRAXIS as a framework that distills literature, successful cases, negative cases, domain rules, and procedures into long-term memory to guide AI agents. It claims that coordinating this memory enables reliable performance across problem definition, object validation, method selection, workflow execution, result interpretation, and review in biocomputational tasks. A sympathetic reader would care because, on the paper's account, prompt engineering and general retrieval alone do not deliver domain-specific judgment, whereas this approach turns experience into executable and auditable capabilities. The authors report that evaluations through object validation, case retrieval, memory ablation, public benchmarks, and cross-agent workflows show case-based learning improving method selection, error suppression, and workflow organization.

Core claim

PRAXIS converts research experience, failure boundaries, domain rules, and executable procedures into structured long-term memory. By coordinating successful cases, negative cases, rules, and skills, PRAXIS supports problem definition, object validation, method selection, workflow execution, result interpretation, and review feedback across diverse biocomputational tasks. The results show that case-based learning improves method selection, error suppression, and workflow organization in complex biological research tasks.

What carries the argument

Structured long-term memory that coordinates successful cases, negative cases, rules, and skills to drive agent decisions in biological research.
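
As a rough illustration of what such a memory could look like in code, the sketch below stores the four component types named in the abstract and groups whatever matches a task's tags. The class names, fields, tag-overlap matching, and sample entries are all assumptions made for illustration; the abstract does not specify a schema.

```python
from dataclasses import dataclass, field

# Hypothetical set of memory kinds, taken from the four components
# the abstract names (successful cases, negative cases, rules, skills).
ALLOWED_KINDS = ("success", "negative", "rule", "skill")


@dataclass(frozen=True)
class MemoryEntry:
    """One unit of long-term memory (assumed schema, not the paper's)."""
    kind: str          # one of ALLOWED_KINDS
    summary: str       # short human-readable description
    tags: frozenset    # task descriptors used for matching


@dataclass
class LongTermMemory:
    entries: list = field(default_factory=list)

    def add(self, kind, summary, tags):
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown memory kind: {kind}")
        self.entries.append(MemoryEntry(kind, summary, frozenset(tags)))

    def recall(self, task_tags):
        """Group every entry that shares at least one tag with the task."""
        task_tags = frozenset(task_tags)
        grouped = {kind: [] for kind in ALLOWED_KINDS}
        for entry in self.entries:
            if entry.tags & task_tags:
                grouped[entry.kind].append(entry.summary)
        return grouped


# Placeholder entries, invented for this example.
memory = LongTermMemory()
memory.add("success", "method A reproduced the published result",
           {"differential-expression"})
memory.add("negative", "method B failed when replicates were pooled",
           {"differential-expression", "pooled-replicates"})
memory.add("rule", "resolve every identifier against its source database first",
           {"object-validation"})
memory.add("skill", "run_method_a", {"differential-expression"})

context = memory.recall({"differential-expression"})
```

Here `context` holds the matching success case, negative case, and skill, while the `rule` list stays empty because its tag does not overlap the task. Returning the four kinds side by side, rather than one ranked list, is one possible way to let an agent weigh them together.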

If this is right

Case-based learning improves method selection in complex biological research tasks.
Coordinating successful and negative cases reduces errors and improves workflow organization.
PRAXIS supports the full cycle of tasks from problem definition through result interpretation and review feedback.
The framework provides a general pathway for turning research experience into executable, auditable, and transferable agent capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory structure could be tested in non-biological domains to check whether case distillation generalizes beyond biology.
Running PRAXIS on live experimental data rather than benchmarks would expose whether its gains hold when ground truth is uncertain.
Combining the agent with automated experimental platforms could create closed-loop systems that propose, execute, and interpret experiments.

Load-bearing premise

Converting research experience, failure boundaries, domain rules, and executable procedures into structured long-term memory produces reliable domain-specific scientific judgment where prompt engineering or general retrieval cannot.

What would settle it

A head-to-head test on the same biocomputational tasks where general RAG or prompt-only agents match or exceed PRAXIS on method selection accuracy, error rates, and workflow completion would refute the claim that case distillation is necessary.
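
One way such a head-to-head comparison could be scored is with paired per-task outcomes and an exact McNemar test on the tasks where the two agents disagree. The sketch below is an assumption about how a reader might set this up, not a protocol from the paper; the function name and the outcome lists are invented for illustration.

```python
from math import comb


def exact_mcnemar(praxis_ok, baseline_ok):
    """Exact two-sided McNemar test on paired success/failure outcomes.

    Only tasks where exactly one agent succeeded carry information.
    Returns (tasks only PRAXIS solved, tasks only the baseline solved, p-value).
    """
    if len(praxis_ok) != len(baseline_ok):
        raise ValueError("both agents must be scored on the same tasks")
    pairs = list(zip(praxis_ok, baseline_ok))
    only_praxis = sum(1 for p, b in pairs if p and not b)
    only_baseline = sum(1 for p, b in pairs if b and not p)
    n = only_praxis + only_baseline
    if n == 0:
        return only_praxis, only_baseline, 1.0
    k = min(only_praxis, only_baseline)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_praxis, only_baseline, min(1.0, 2 * tail)


# Made-up outcomes for ten tasks, purely to exercise the function.
praxis_ok = [True] * 8 + [False] * 2
baseline_ok = [True] * 3 + [False] * 5 + [True, False]

wins_praxis, wins_baseline, p_value = exact_mcnemar(praxis_ok, baseline_ok)
```

With these invented outcomes the agents disagree on six tasks, and the p-value is well above conventional thresholds, which illustrates why a small task set cannot settle the question either way.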

Original abstract

Large language models are moving scientific research from text assistance toward agentic workflows, yet biological research requires strong object validation, methodological suitability, reproducibility, and auditability. Prompt engineering, general RAG, or tool use alone cannot reliably produce domain-specific scientific judgment. Here, we present PRAXIS, a verifiable biological research agent framework driven by literature learning and case distillation. PRAXIS converts research experience, failure boundaries, domain rules, and executable procedures into structured long-term memory. By coordinating successful cases, negative cases, rules, and skills, PRAXIS supports problem definition, object validation, method selection, workflow execution, result interpretation, and review feedback across diverse biocomputational tasks. We instantiated PRAXIS as an agent suite for biomedical computing and evaluated it through object validation, case retrieval, memory ablation, public benchmarks, and cross-agent workflows. The results show that case-based learning improves method selection, error suppression, and workflow organization in complex biological research tasks. Rather than replacing scientists, PRAXIS provides a general pathway for transforming research experience into executable, auditable, and transferable agent capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PRAXIS sketches a case-distilled memory system for biology agents, but the abstract supplies no numbers or methods to show that it actually works better.

The core idea is turning research cases, failures, rules, and procedures into structured long-term memory that an agent can draw on for tasks like method selection and workflow execution in biology. That combination, applied to biocomputational problems with explicit negative cases and code verification, looks like the main new piece. It targets a genuine gap: general RAG or prompting often fails at domain judgment, and this tries to make experience reusable and auditable instead of starting from scratch each time.

The framework description is clear on the six supported stages and how memory components coordinate. That part is useful on its own. The evaluation plan mentions public benchmarks, memory ablation, object validation, and cross-agent tests, which is the right set of checks. The problem is the abstract states that case-based learning improves method selection, error suppression, and organization but gives zero numbers, baselines, dataset sizes, or ablation tables. Without those, the central claim stays unevidenced. The full text might contain the details, but based on what is here the soundness is thin.

This is for people already working on agent architectures for science who want concrete patterns for memory design. A reader could extract the memory structure and try it even if the results section is missing. It is coherent on its own terms and engages the right literature on agents and reproducibility, so it is worth a serious referee who can ask for the missing quantitative evidence and code. I would send it to review rather than desk reject, but only if the full manuscript actually shows the results and verification steps.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PRAXIS, a verifiable biological research agent framework that converts research experience, failure boundaries, domain rules, and executable procedures into structured long-term memory (successful cases, negative cases, rules, and skills). It claims that this case-distilled approach, when coordinated across an agent suite for biomedical computing, supports problem definition, object validation, method selection, workflow execution, result interpretation, and review feedback. Evaluations through object validation, case retrieval, memory ablation, public benchmarks, and cross-agent workflows are asserted to demonstrate that case-based learning improves method selection, error suppression, and workflow organization.

Significance. If the quantitative claims hold, the work could offer a concrete mechanism for embedding domain-specific scientific judgment into LLM agents beyond prompt engineering or generic RAG, with potential value for reproducibility and auditability in biocomputational tasks. The emphasis on verifiable, transferable memory structures is a strength if supported by the promised ablation and benchmark data.

major comments (2)

[Abstract] Abstract: the central claim that 'the results show that case-based learning improves method selection, error suppression, and workflow organization' is unsupported because the abstract (and the provided manuscript description) supplies no quantitative results, error bars, dataset descriptions, baseline comparisons, or ablation tables from the public benchmarks or memory ablation studies.
[Evaluation] Evaluation description: the manuscript states that evaluations were performed via 'object validation, case retrieval, memory ablation, public benchmarks, and cross-agent workflows' yet provides no methodological details, success metrics, or statistical tests, rendering the improvement claims unevidenced and preventing assessment of whether gains exceed those from general tool-use or RAG baselines.
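
For readers unfamiliar with the term, a memory ablation of the kind the manuscript mentions could be organised as a leave-one-out loop over the memory components. The sketch below is only a guess at the shape of such a protocol: the component names are taken from the abstract, while `leave_one_out_ablation` and the placeholder scorer `toy_evaluate` are hypothetical and produce no real results.

```python
def leave_one_out_ablation(components, evaluate):
    """Score the full memory, then re-score with each component removed.

    `evaluate` is any callable mapping a frozenset of active component
    names to a benchmark score. Returns the full score and, for each
    component, the score lost when it is removed.
    """
    full_set = frozenset(components)
    full_score = evaluate(full_set)
    drops = {}
    for name in sorted(full_set):
        drops[name] = full_score - evaluate(full_set - {name})
    return full_score, drops


# Component names follow the abstract; everything below is a stand-in.
COMPONENTS = {"success_cases", "negative_cases", "rules", "skills"}


def toy_evaluate(active):
    # Placeholder scorer: the fraction of components still active.
    # A real study would run the agent on benchmark tasks here.
    return len(active) / len(COMPONENTS)


full_score, score_drops = leave_one_out_ablation(COMPONENTS, toy_evaluate)
```

Reporting the per-component drop next to a RAG-only or tool-use-only baseline would let a referee judge whether any gain comes from case distillation specifically.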

minor comments (2)

The distinction between 'case-distilled' memory and standard RAG retrieval is not clearly operationalized; a concrete example of how a negative case alters agent behavior would clarify the mechanism.
[Abstract] The abstract's phrasing that 'PRAXIS supports problem definition...' would be strengthened by explicit mapping to the agent components responsible for each capability.
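
To make the first minor comment concrete, the sketch below contrasts similarity-only retrieval with a ranking in which a recorded negative case vetoes a method whenever all of its failure conditions hold for the current task. This is an assumed mechanism written for illustration; the function names, method names, scores, and conditions are hypothetical and do not come from the manuscript.

```python
def rank_by_similarity(candidates):
    """Plain retrieval: order candidate methods by similarity score alone."""
    ordered = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _score in ordered]


def rank_with_negative_cases(candidates, negative_cases, task_conditions):
    """Same ranking, minus methods whose recorded failure applies here.

    `negative_cases` is a list of (method, failure_conditions) pairs.
    A method is vetoed when every failure condition is present in the task.
    """
    task_conditions = set(task_conditions)
    vetoed = {
        method
        for method, conditions in negative_cases
        if set(conditions) <= task_conditions
    }
    return [name for name in rank_by_similarity(candidates) if name not in vetoed]


# Invented inputs, for illustration only.
candidates = {"method_a": 0.91, "method_b": 0.84, "method_c": 0.40}
negative_cases = [
    ("method_a", {"small_sample"}),
    ("method_c", {"missing_controls", "batch_effects"}),
]
task_conditions = {"small_sample", "batch_effects"}

plain = rank_by_similarity(candidates)
guarded = rank_with_negative_cases(candidates, negative_cases, task_conditions)
```

Plain retrieval puts `method_a` first because it scores highest, while the guarded ranking drops it because its recorded failure condition is present. `method_c` survives because only one of its two failure conditions applies.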

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential of our case-distilled framework. We address the two major comments below, agreeing that additional quantitative details are needed to support the claims. We will make revisions to incorporate these elements.

Referee: [Abstract] Abstract: the central claim that 'the results show that case-based learning improves method selection, error suppression, and workflow organization' is unsupported because the abstract (and the provided manuscript description) supplies no quantitative results, error bars, dataset descriptions, baseline comparisons, or ablation tables from the public benchmarks or memory ablation studies.

Authors: We agree with this observation. The abstract summarizes the findings without including specific numbers to keep it concise. In the revision, we will update the abstract to include key quantitative results from our evaluations, such as improvements in method selection accuracy and error rates from the memory ablation studies and public benchmarks, along with brief mentions of baselines and dataset sizes.

revision: yes

Referee: [Evaluation] Evaluation description: the manuscript states that evaluations were performed via 'object validation, case retrieval, memory ablation, public benchmarks, and cross-agent workflows' yet provides no methodological details, success metrics, or statistical tests, rendering the improvement claims unevidenced and preventing assessment of whether gains exceed those from general tool-use or RAG baselines.

Authors: We acknowledge that the current manuscript version lacks the detailed methodological information in the evaluation section. We will revise the manuscript to include comprehensive descriptions of the evaluation protocols, specific success metrics used (e.g., precision in case retrieval, accuracy in method selection), statistical tests applied, and direct comparisons to baseline approaches including standard RAG and tool-use agents without case distillation.

revision: yes
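
The two metrics the simulated rebuttal names as examples are simple to state precisely. The sketch below uses the common definitions of precision at k and plain accuracy; the function names, case identifiers, and method labels are invented for illustration, and nothing here reflects numbers from the paper.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved cases that are actually relevant.

    Divides by k even if fewer than k cases were retrieved, which is
    the usual convention and penalises short result lists.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for case_id in retrieved_ids[:k] if case_id in relevant)
    return hits / k


def selection_accuracy(chosen_methods, expected_methods):
    """Fraction of tasks where the agent chose the expected method."""
    if len(chosen_methods) != len(expected_methods):
        raise ValueError("need one expected method per task")
    if not chosen_methods:
        raise ValueError("need at least one task")
    correct = sum(1 for c, e in zip(chosen_methods, expected_methods) if c == e)
    return correct / len(chosen_methods)


# Invented identifiers and labels, used only to exercise the functions.
p_at_3 = precision_at_k(["c3", "c7", "c1", "c9"], {"c1", "c3"}, k=3)
accuracy = selection_accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])
```

Both functions need a labelled reference set (relevant cases per query, expected method per task), so the revision would also have to say how those labels were produced.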

Circularity Check

0 steps flagged

No significant circularity

The paper presents an agent framework evaluated via memory ablation, public benchmarks, and cross-agent workflows. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear. Claims rest on empirical comparisons rather than reducing by construction to the framework's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces the PRAXIS framework at a conceptual level without enumerating fitted parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5746 in / 1164 out tokens · 28633 ms · 2026-05-25T02:49:22.541714+00:00 · methodology
