EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Harihara Muralidharan; Kenny Workman; Reema Baskar; Soo Hee Lee; Tim Proctor

arxiv: 2606.13602 · v1 · pith:3SSO3FKMnew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:48 UTC · model grok-4.3

classification 💻 cs.AI

keywords: epigenomics · AI agents · benchmark · CUT&Tag · ATAC-seq · ChIP-seq · DNA methylation · verifiable evaluation

The pith

No AI agent system passes a majority of verifiable epigenomics analysis tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EpiBench provides 106 deterministically graded evaluations that test whether AI agents can make analysis decisions from realistic workflow states in CUT&Tag, ATAC-seq, ChIP-seq, and DNA methylation pipelines. Across 5,088 trajectories from 16 model-harness pairs, the strongest result reaches only 45 percent success, and no system clears 50 percent. A sympathetic reader cares because the failures concentrate at points that require assay-specific scientific judgment even when agents locate files and compute intermediates correctly.

Core claim

The paper establishes that no model-harness pair succeeds on a majority of the 106 evaluations, with GPT-5.5 / Pi reaching 45.0 percent (143/318 attempts) and other leading pairs at or below 39.9 percent. Agents frequently produce partial answers and handle file and computation steps, yet consistently fail when deeper, assay-specific judgment is required.
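The headline figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: it assumes every model-harness pair has the same number of attempts and that attempts are spread evenly over the evaluations, neither of which the text states outright.

```python
# Figures quoted in the text above.
N_EVALUATIONS = 106       # evaluations in the benchmark
N_PAIRS = 16              # model-harness pairs
N_TRAJECTORIES = 5088     # valid trajectories overall
ATTEMPTS_PER_PAIR = 318   # denominator reported for GPT-5.5 / Pi
TOP_PASSES = 143          # passes reported for GPT-5.5 / Pi

# Assumption (inferred, not stated in the source): uniform attempts per pair
# and per evaluation.
attempts_per_evaluation = ATTEMPTS_PER_PAIR / N_EVALUATIONS
total_if_uniform = N_PAIRS * ATTEMPTS_PER_PAIR

# Success rate of the leading pair, in percent.
top_rate = 100.0 * TOP_PASSES / ATTEMPTS_PER_PAIR

# Smallest pass count that is strictly more than half of the attempts.
majority_threshold = ATTEMPTS_PER_PAIR // 2 + 1
shortfall = majority_threshold - TOP_PASSES

print(f"attempts per evaluation (inferred): {attempts_per_evaluation}")
print(f"total trajectories if uniform: {total_if_uniform}")
print(f"top success rate: {top_rate:.1f}%")
print(f"passes needed for a majority: {majority_threshold} (short by {shortfall})")
```

Under these assumptions the numbers are mutually consistent: 318 attempts is three per evaluation, 16 pairs at 318 attempts give 5,088 trajectories, and the leading pair would need 160 passes to clear a majority.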

What carries the argument

EpiBench, a collection of 106 short-horizon, deterministically gradable evaluations that present agents with realistic epigenomics workflow states and require assay-specific decisions.
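To make "deterministically gradable" concrete, here is a minimal sketch of what such a grader could look like. The answer types, the tolerance parameter, and the example values are all hypothetical; the summary does not describe EpiBench's actual grading rules.

```python
import math
from typing import Any


def grade(answer: Any, expected: Any, rel_tol: float = 0.0) -> bool:
    """Hypothetical deterministic grader: same inputs always give the same verdict."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(expected, bool):
        return isinstance(answer, bool) and answer == expected
    if isinstance(expected, (int, float)):
        if isinstance(answer, bool) or not isinstance(answer, (int, float)):
            return False
        # Numeric answers pass when within a fixed relative tolerance.
        return math.isclose(answer, expected, rel_tol=rel_tol, abs_tol=0.0)
    if isinstance(expected, str):
        # Label answers are compared after trimming whitespace and lowercasing.
        return isinstance(answer, str) and answer.strip().lower() == expected.strip().lower()
    if isinstance(expected, (set, frozenset)):
        # Unordered collections must match exactly as sets.
        return isinstance(answer, (set, frozenset, list, tuple)) and set(answer) == set(expected)
    raise TypeError(f"unsupported expected type: {type(expected).__name__}")


# Illustrative calls with made-up values.
print(grade(0.41, 0.40, rel_tol=0.05))
print(grade(" Narrow ", "narrow"))
print(grade(["sample_b", "sample_a"], {"sample_a", "sample_b"}))
```

The point of the design is that no model or human judgment enters the verdict: a grader of this shape involves no sampling or subjective scoring, which is what makes pass counts such as 143/318 auditable.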

If this is right

Current agents cannot yet perform independent epigenomics analysis at majority reliability.
Success rates differ across assay types, pointing to uneven coverage of domain knowledge.
Partial correctness in many failed runs shows agents reach intermediate results but stall on interpretation.
The benchmark supplies a repeatable yardstick for measuring future gains in scientific decision-making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evaluation style could be applied to longer multi-step workflows or to other omics domains to test generalization.
Hybrid setups that route only the judgment steps to human experts might raise effective success rates faster than pure agent improvement.
Training on assay-protocol corpora might lift the specific failure modes observed here.

Load-bearing premise

The 106 evaluations accurately capture realistic workflow states and decisions that require assay-specific scientific judgment, and the deterministic grading scheme correctly measures correctness without bias.

What would settle it

A new agent system that succeeds on more than 50 percent of the same 106 evaluations under identical grading rules would falsify the claim that no system passes a majority.

Original abstract

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
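The reported intervals can be compared with a naive binomial interval computed from the pass counts alone. The sketch below uses the normal-approximation (Wald) formula purely as a reference point; the abstract does not say how its intervals were computed, so this is not a reconstruction of the authors' method.

```python
import math


def wald_interval(passes, attempts, z=1.96):
    """Normal-approximation 95% interval for a proportion, in percent."""
    p = passes / attempts
    half_width = z * math.sqrt(p * (1.0 - p) / attempts)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


ATTEMPTS = 318

# (passes, reported CI low, reported CI high), copied from the abstract.
REPORTED = {
    "GPT-5.5 / Pi": (143, 36.3, 53.7),
    "GPT-5.5 / OpenAI Codex": (127, 31.6, 48.3),
    "Claude Opus 4.8 Max / Pi": (124, 30.2, 47.8),
    "GPT-5.4 / Pi": (124, 31.0, 47.0),
}

comparison = {}
for name, (passes, lo, hi) in REPORTED.items():
    w_lo, w_hi = wald_interval(passes, ATTEMPTS)
    comparison[name] = (w_lo, w_hi, lo, hi)
    print(f"{name}: naive {w_lo:.1f} to {w_hi:.1f}, reported {lo} to {hi}")
```

In each case the naive interval sits inside the reported one, and the two pairs with 124 passes get identical naive intervals but different reported ones. Both observations suggest the reported intervals use more than the pass count, for example per-evaluation results, though that reading is an inference.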

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EpiBench sets up 106 short, gradable epigenomics tasks and shows no current agent clears even half of them.

The paper's main contribution is EpiBench, a new benchmark of 106 concrete tasks spread across CUT&Tag, ATAC-seq, ChIP-seq, and methylation workflows. It tests 16 model-harness pairs on 5,088 trajectories and reports a best result of 45% success (GPT-5.5 / Pi), with confidence intervals. Agents often locate files and produce intermediate outputs but fail when the step needs assay-specific judgment.

What stands out is the focus on short-horizon, deterministically gradable decisions rather than open-ended biology. The authors note that many failed runs still contain parts of the correct answer and that performance differs by assay type. That gives a clearer picture than many agent papers that rely on subjective scoring.

The soft spot is whether the 106 tasks actually reflect the judgment calls that matter in real epigenomics work. The abstract claims they come from realistic workflow states, but without the full task definitions and exclusion rules it is hard to judge selection or grading bias. The 45% ceiling follows directly from the chosen tasks and rules, so the result is only as useful as those choices.

This is a benchmark paper aimed at groups building or evaluating AI agents for biological data analysis. Readers who care about measurable progress in that niche will find the numbers and setup worth looking at. It is coherent on its own terms and reports new empirical data, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces EpiBench, a verifiable benchmark for short-horizon epigenomics analysis consisting of 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. It reports results from 5,088 valid trajectories across 16 model-harness pairs, finding that no system passed a majority of attempts, with the leading system (GPT-5.5 / Pi) achieving 45.0% success (143/318 attempts; 95% CI 36.3–53.7). Agents frequently locate correct files and produce useful intermediates but fail on tasks requiring deeper assay-specific scientific judgment.

Significance. If the task definitions and deterministic grading hold, EpiBench supplies a reproducible, large-scale empirical testbed for AI agents in a specialized scientific domain. The scale (over 5,000 trajectories) and reporting of confidence intervals are strengths. The consistent sub-50% ceiling across systems provides a clear, falsifiable baseline for future agent development. Credit is due for the emphasis on short-horizon, gradable tasks that avoid open-ended evaluation.

major comments (2)

[Evaluation Protocol] The definition of the 5,088 'valid trajectories' and the precise exclusion criteria are not stated with sufficient operational detail to allow independent replication or auditing of the reported success counts (e.g., 143/318). This directly affects the central performance claims.
[Benchmark Construction] The criteria used to select and validate the 106 evaluations are not described in enough detail to confirm they capture realistic workflow states without selection bias; this is load-bearing for interpreting the 45% ceiling as representative rather than an artifact of task curation.

minor comments (2)

[Results] A supplementary table breaking down success rates by assay type (CUT&Tag, ATAC-seq, etc.) would make the statement that 'performance varies across assay types' more precise and useful.
[Abstract] Model identifiers such as 'GPT-5.5' and 'Claude Opus 4.8 Max' should be clarified or footnoted with exact version strings or harness configurations for reproducibility.
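
The per-assay breakdown requested in the first minor comment could be tabulated along these lines. This is a minimal sketch, assuming per-trajectory records of the form (assay, passed); the record layout and the toy records are placeholders for illustration, not data from the paper.

```python
from collections import defaultdict


def success_by_assay(records):
    """Aggregate (assay, passed) records into passes, attempts, and percent."""
    totals = defaultdict(lambda: [0, 0])  # assay -> [passes, attempts]
    for assay, passed in records:
        totals[assay][0] += int(passed)
        totals[assay][1] += 1
    return {
        assay: (passes, attempts, round(100.0 * passes / attempts, 1))
        for assay, (passes, attempts) in totals.items()
    }


# Placeholder records only: these values are invented for the example.
toy_records = [
    ("ATAC-seq", True),
    ("ATAC-seq", False),
    ("ChIP-seq", False),
    ("ChIP-seq", False),
    ("DNA methylation", True),
    ("CUT&Tag", True),
    ("CUT&Tag", False),
    ("CUT&Tag", False),
]

table = success_by_assay(toy_records)
for assay, (passes, attempts, pct) in sorted(table.items()):
    print(f"{assay}: {passes}/{attempts} ({pct}%)")
```

Reporting the counts alongside the percentages keeps small per-assay denominators visible, which matters when evaluations are unevenly split across assays.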

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of EpiBench as a reproducible benchmark. We address each major comment below and will revise the manuscript to incorporate the requested details.

Referee: [Evaluation Protocol] The definition of the 5,088 'valid trajectories' and the precise exclusion criteria are not stated with sufficient operational detail to allow independent replication or auditing of the reported success counts (e.g., 143/318). This directly affects the central performance claims.

Authors: We agree that the current description lacks sufficient operational detail for independent replication. In the revised manuscript we will add a dedicated 'Trajectory Validation' subsection in Methods that defines a valid trajectory, enumerates all exclusion criteria with examples and pseudocode, and documents the exact filtering pipeline that produced the 5,088 trajectories and the per-system counts (e.g., 143/318).

revision: yes

Referee: [Benchmark Construction] The criteria used to select and validate the 106 evaluations are not described in enough detail to confirm they capture realistic workflow states without selection bias; this is load-bearing for interpreting the 45% ceiling as representative rather than an artifact of task curation.

Authors: We acknowledge the need for greater transparency on task selection. The revised 'Benchmark Construction' section will detail the workflow-state generation process, the expert validation protocol, quantitative diversity metrics across assay types, and an explicit discussion of potential curation biases together with their implications for the observed performance ceiling.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

This is an empirical benchmark paper reporting direct experimental measurements of agent performance on 106 tasks across 5,088 trajectories. No derivation chain, equations, fitted parameters, or predictions exist that could reduce the reported success rates (e.g., 45.0% for GPT-5.5 / Pi) to quantities defined by the authors' own prior choices or self-citations. The results follow directly from the stated task definitions and deterministic grading rules once those are accepted; the benchmark construction itself is definitional to the work rather than a circular step within a claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required or introduced; the contribution is the construction and application of the benchmark itself.

pith-pipeline@v0.9.1-grok · 5781 in / 1073 out tokens · 24923 ms · 2026-06-27T06:48:01.625246+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 1 linked inside Pith

[1] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)

[2] Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015)

[3] Mitchener, L., Laurent, J. M., Andonian, A., Tenmann, B., Narayanan, S., Wellawatte, G. P., White, A., Sani, L. & Rodriques, S. G. BixBench: a comprehensive benchmark for LLM-based agents in computational biology. arXiv 2503.00096 (2025)

[4] Nair, S., Gunsalus, L., Orcutt-Jahns, B., Rossen, J., Lal, A., De Donno, C., Celik, M. H., Fletez-Brant, K., Xie, X., Corrada Bravo, H. & Eraslan, G. Agentic systems are adept at solving well-scoped, verifiable problems in computational biology. bioRxiv 2026.04.06.716850 (2026)

[5] Anthropic. Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench. Anthropic Research (2026). anthropic.com/research/BioMysteryBench

[6] Laurent, J. M., Janizek, J. D., Ruzo, M., Hinks, M. M., Hammerling, M. J., Narayanan, S., Ponnapati, M., White, A. D. & Rodriques, S. G. LAB-Bench: measuring capabilities of language models for biology research. arXiv 2407.10362 (2024)

[7] Qu, Y., Lu, Y., Tu, X., Zhang, S., She, T., Shaw, A. G., Shih, J.-H., Zhao, B. et al. BiomniBench: process-level evaluation of LLM agents for real-world biomedical research. bioRxiv 2026.05.12.724604 (2026)

[8] Li, J. & Ho, A. GeneBench: assessing AI agents for multi-stage inference problems in genomics and quantitative biology. bioRxiv 2026.04.22.720113 (2026)

[9] Diedrich, J. D. et al. Profiling chromatin accessibility in pediatric acute lymphoblastic leukemia identifies subtype-specific chromatin landscapes and gene regulatory networks. Leukemia 35, 3078–3091 (2021). GEO: GSE161501

[10] Barnett, K. R. et al. Epigenomic mapping reveals distinct B cell acute lymphoblastic leukemia chromatin architectures and regulators. Cell Genomics 3, 100442 (2023). GEO: GSE211631

[11] Cao, W. et al. Multi-faceted epigenetic dysregulation of gene expression promotes esophageal squamous cell carcinoma. Nature Communications 11, 3675 (2020). GEO: GSE149608 and GSE149609

[12] Workman, K., Yang, Z., Muralidharan, H. & Le, H. SpatialBench: Can agents analyze real-world spatial biology data? arXiv 2512.21907 (2025)

[13] Workman, K., Yang, Z., Muralidharan, H., Abdulali, A. & Le, H. scBench: Evaluating AI agents on single-cell RNA-seq analysis. arXiv 2602.09063 (2026)

[14] Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012)

[15] Krueger, F. & Andrews, S. R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27, 1571–1572 (2011)

[16] Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213–1218 (2013)

[17] Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biology 9, R137 (2008)
