EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
Pith reviewed 2026-06-27 06:48 UTC · model grok-4.3
The pith
No AI agent system passes a majority of verifiable epigenomics analysis tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that no model-harness pair passes a majority of its attempts across the 106 evaluations: GPT-5.5 / Pi leads at 45.0 percent (143/318 attempts), and the other leading pairs sit at or below 39.9 percent. Agents frequently produce partial answers and handle file and computation steps, yet fail when deeper, assay-specific judgment is required.
What carries the argument
EpiBench, a collection of 106 short-horizon, deterministically gradable evaluations that present agents with realistic epigenomics workflow states and require assay-specific decisions.
If this is right
- Current agents cannot yet perform independent epigenomics analysis at majority reliability.
- Success rates differ across assay types, pointing to uneven coverage of domain knowledge.
- Partial correctness in many failed runs shows agents reach intermediate results but stall on interpretation.
- The benchmark supplies a repeatable yardstick for measuring future gains in scientific decision-making.
Where Pith is reading between the lines
- The same evaluation style could be applied to longer multi-step workflows or to other omics domains to test generalization.
- Hybrid setups that route only the judgment steps to human experts might raise effective success rates faster than pure agent improvement.
- Training on assay-protocol corpora might lift the specific failure modes observed here.
Load-bearing premise
The 106 evaluations accurately capture realistic workflow states and decisions that require assay-specific scientific judgment, and the deterministic grading scheme correctly measures correctness without bias.
What would settle it
A new agent system that passes more than 50 percent of its attempts on the same 106 evaluations under identical grading rules would falsify the claim that no system passes a majority.
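Read against attempts (the abstract reports passes per attempt, 318 per model-harness pair), the threshold can be made concrete. The following is a minimal sketch; the function names are illustrative and not taken from the paper.

```python
def min_passes_for_majority(attempts: int) -> int:
    """Smallest pass count that is strictly more than half of the attempts."""
    return attempts // 2 + 1


def passes_majority(passes: int, attempts: int) -> bool:
    """True when strictly more than 50 percent of attempts passed."""
    return 2 * passes > attempts


ATTEMPTS = 318        # attempts per model-harness pair, from the abstract
LEADER_PASSES = 143   # GPT-5.5 / Pi, from the abstract

needed = min_passes_for_majority(ATTEMPTS)
shortfall = needed - LEADER_PASSES
print(needed, shortfall, passes_majority(LEADER_PASSES, ATTEMPTS))
```

Under this reading a system would need 160 passing attempts out of 318, which is 17 more than the leading pair recorded.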
Original abstract
We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
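The headline figures in the abstract can be checked with simple arithmetic. The sketch below recomputes the reported percentages from the pass counts and confirms that 16 pairs at 318 attempts each give 5,088 trajectories. Two things here are assumptions rather than statements from the paper: that 318 corresponds to three attempts on each of the 106 evaluations (the numbers are consistent with this, but the abstract does not say so), and the use of a plain Wald binomial interval, included only for comparison because the abstract does not name its interval method.

```python
import math

ATTEMPTS_PER_PAIR = 318

# (passes, reported percent) as given in the abstract.
reported = {
    "GPT-5.5 / Pi": (143, 45.0),
    "GPT-5.5 / OpenAI Codex": (127, 39.9),
    "Claude Opus 4.8 Max / Pi": (124, 39.0),
    "GPT-5.4 / Pi": (124, 39.0),
}


def pass_rate_percent(passes: int, attempts: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passes / attempts, 1)


def wald_interval_percent(passes: int, attempts: int, z: float = 1.96):
    """Naive Wald interval that treats every attempt as independent.

    Assumption: used for comparison only; the paper's method is not stated.
    """
    p = passes / attempts
    half_width = z * math.sqrt(p * (1.0 - p) / attempts)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


for name, (passes, percent) in reported.items():
    assert pass_rate_percent(passes, ATTEMPTS_PER_PAIR) == percent, name

assert 16 * ATTEMPTS_PER_PAIR == 5088
assert ATTEMPTS_PER_PAIR == 3 * 106  # consistent with 3 attempts per evaluation

low, high = wald_interval_percent(143, ATTEMPTS_PER_PAIR)
print(f"Naive Wald 95% interval for 143/318: {low:.1f} to {high:.1f}")
```

The naive interval comes out at roughly 39.5 to 50.4, which is narrower than the reported 36.3 to 53.7. A wider reported interval would be consistent with a method that accounts for repeated attempts on the same evaluation, but that is an inference and not something the abstract confirms.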
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EpiBench, a verifiable benchmark for short-horizon epigenomics analysis consisting of 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. It reports results from 5,088 valid trajectories across 16 model-harness pairs, finding that no system passed a majority of attempts, with the leading system (GPT-5.5 / Pi) achieving 45.0% success (143/318 attempts; 95% CI, 36.3–53.7). Agents frequently locate correct files and produce useful intermediates but fail on tasks requiring deeper assay-specific scientific judgment.
Significance. If the task definitions and deterministic grading hold, EpiBench supplies a reproducible, large-scale empirical testbed for AI agents in a specialized scientific domain. The scale (over 5,000 trajectories) and reporting of confidence intervals are strengths. The consistent sub-50% ceiling across systems provides a clear, falsifiable baseline for future agent development. Credit is due for the emphasis on short-horizon, gradable tasks that avoid open-ended evaluation.
major comments (2)
- [Evaluation Protocol] The definition of the 5,088 'valid trajectories' and the precise exclusion criteria are not stated with sufficient operational detail to allow independent replication or auditing of the reported success counts (e.g., 143/318). This directly affects the central performance claims.
- [Benchmark Construction] The criteria used to select and validate the 106 evaluations are not described in enough detail to confirm they capture realistic workflow states without selection bias; this is load-bearing for interpreting the 45% ceiling as representative rather than an artifact of task curation.
minor comments (2)
- [Results] A supplementary table breaking down success rates by assay type (CUT&Tag, ATAC-seq, etc.) would make the statement that 'performance varies across assay types' more precise and useful.
- [Abstract] Model identifiers such as 'GPT-5.5' and 'Claude Opus 4.8 Max' should be clarified or footnoted with exact version strings or harness configurations for reproducibility.
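The per-assay breakdown requested in the first minor comment amounts to a grouped pass-rate table. A minimal sketch follows; the record layout (`assay`, `passed`) is hypothetical and the records are toy values, not data from the paper.

```python
from collections import defaultdict

# Toy records for illustration only; these are NOT results from the paper.
records = [
    {"assay": "ATAC-seq", "passed": True},
    {"assay": "ATAC-seq", "passed": False},
    {"assay": "ChIP-seq", "passed": True},
    {"assay": "ChIP-seq", "passed": True},
    {"assay": "ChIP-seq", "passed": False},
    {"assay": "DNA methylation", "passed": False},
]


def pass_rate_by_assay(rows):
    """Return {assay: (passes, attempts, pass_rate)} for graded attempts."""
    totals = defaultdict(lambda: [0, 0])  # assay -> [passes, attempts]
    for row in rows:
        totals[row["assay"]][0] += int(row["passed"])
        totals[row["assay"]][1] += 1
    return {
        assay: (passes, attempts, passes / attempts)
        for assay, (passes, attempts) in totals.items()
    }


for assay, (passes, attempts, rate) in sorted(pass_rate_by_assay(records).items()):
    print(f"{assay}: {passes}/{attempts} ({100 * rate:.1f}%)")
```

Reporting the raw counts next to each rate, as the abstract does for the overall figures, would let readers judge how much of the variation across assay types could be due to small per-assay sample sizes.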
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of EpiBench as a reproducible benchmark. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Evaluation Protocol] The definition of the 5,088 'valid trajectories' and the precise exclusion criteria are not stated with sufficient operational detail to allow independent replication or auditing of the reported success counts (e.g., 143/318). This directly affects the central performance claims.
  Authors: We agree that the current description lacks sufficient operational detail for independent replication. In the revised manuscript we will add a dedicated 'Trajectory Validation' subsection in Methods that defines a valid trajectory, enumerates all exclusion criteria with examples and pseudocode, and documents the exact filtering pipeline that produced the 5,088 trajectories and the per-system counts (e.g., 143/318). Revision: yes.
- Referee: [Benchmark Construction] The criteria used to select and validate the 106 evaluations are not described in enough detail to confirm they capture realistic workflow states without selection bias; this is load-bearing for interpreting the 45% ceiling as representative rather than an artifact of task curation.
  Authors: We acknowledge the need for greater transparency on task selection. The revised 'Benchmark Construction' section will detail the workflow-state generation process, the expert validation protocol, quantitative diversity metrics across assay types, and an explicit discussion of potential curation biases together with their implications for the observed performance ceiling. Revision: yes.
Circularity Check
No significant circularity
This is an empirical benchmark paper reporting direct experimental measurements of agent performance on 106 tasks across 5,088 trajectories. No derivation chain, equations, fitted parameters, or predictions exist that could reduce the reported success rates (e.g., 45.0% for GPT-5.5 / Pi) to quantities defined by the authors' own prior choices or self-citations. The results follow directly from the stated task definitions and deterministic grading rules once those are accepted; the benchmark construction itself is definitional to the work rather than a circular step within a claimed derivation.