Camyla: Scaling Autonomous Research in Medical Image Segmentation
Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Camyla is an autonomous system that turns raw medical image datasets into novel models and complete manuscripts, beating the best of 14 human-designed baselines on 24 of 31 datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Camyla performs fully autonomous research in medical image segmentation by combining Quality-Weighted Branch Exploration, Layered Reflective Memory, and Divergent Diagnostic Feedback to generate over 2,700 novel model implementations and 40 complete manuscripts that surpass the strongest per-dataset baseline chosen from 14 established architectures on 24 of 31 datasets under identical training budgets and zero-intervention conditions.
What carries the argument
Three coupled mechanisms carry the argument: Quality-Weighted Branch Exploration for allocating search effort, Layered Reflective Memory for compressing cross-trial knowledge at multiple scales, and Divergent Diagnostic Feedback for diversifying recovery paths. Together they support long-horizon autonomous experimentation without drift or repetition.
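The paper's reference graph includes the multi-armed bandit literature, but the exploration rule itself is not published in this review. A minimal sketch of what quality-weighted effort allocation could look like under a UCB1-style assumption; the class name `BranchExplorer`, its methods, and the constant `c` are hypothetical, not from the paper:

```python
import math

class BranchExplorer:
    """UCB1-style trial allocation across proposal branches.

    Hypothetical reading of Quality-Weighted Branch Exploration: this exact
    rule, the class name, and the constant c are assumptions for illustration.
    """

    def __init__(self, c=1.4):
        self.c = c
        self.stats = {}  # branch name -> [quality_sum, trial_count]

    def record(self, branch, quality):
        """Log one finished trial's quality score (e.g. validation Dice) for a branch."""
        s = self.stats.setdefault(branch, [0.0, 0])
        s[0] += quality
        s[1] += 1

    def next_branch(self):
        """Pick the branch maximizing mean quality plus an exploration bonus."""
        n_total = sum(n for _, n in self.stats.values())

        def ucb(branch):
            total, n = self.stats[branch]
            return total / n + self.c * math.sqrt(math.log(n_total) / n)

        return max(self.stats, key=ucb)
```

With ten trials averaging 0.8 on branch `a` and two trials averaging 0.6 on branch `b`, the exploration bonus makes `b` the next pick despite its lower mean, which is the behavior the paper attributes to the mechanism: effort does not collapse onto a single early-leading branch.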
If this is right
- Camyla produces more novel model implementations and complete manuscripts than prior automated baselines under the same constraints.
- The generated manuscripts receive senior human reviewer scores at the T1/T2 boundary of contemporary medical imaging journals.
- Camyla exceeds AutoML and NAS systems on aggregate segmentation performance and outperforms six other open-ended research agents on task completion and frequency of beating baselines.
- Domain-scale autonomous research becomes feasible in medical image segmentation when search drift, knowledge loss, and repetitive failure recovery are addressed together.
Where Pith is reading between the lines
- The same three-mechanism structure could be tested in other data-rich scientific domains such as protein design or climate modeling to check whether long-horizon autonomy generalizes.
- If the reflective memory and diagnostic feedback scale with larger compute clusters, the number of datasets Camyla can handle in parallel would increase without proportional human oversight.
- A natural next measurement is whether the novel models transfer to new patient cohorts or scanner types beyond the 31 datasets used here.
- The system’s ability to produce journal-level manuscripts suggests that automated literature grounding can reduce the time from data collection to submission in imaging research.
Load-bearing premise
The CamylaBench benchmark is built exclusively from 2025 publications, free of prior contamination, and all evaluations follow a strict zero-intervention protocol.
What would settle it
Independent retraining of the generated models on the same datasets yielding performance below the reported baselines, or external reviewers finding that the manuscripts contain only incremental or previously published ideas, would show the autonomous research claim does not hold.
Original abstract
We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.
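The per-run counts in the abstract (22 and 18 wins, union 24 of 31) are mutually consistent: they imply an overlap of 22 + 18 − 24 = 16 datasets won in both runs. A toy check with hypothetical win sets chosen only to reproduce those counts (the specific dataset indices are invented):

```python
# Hypothetical per-run win sets over datasets indexed 0..30; the indices are
# made up, only the counts match the abstract's figures.
run1_wins = set(range(22))       # 22 wins in run 1
run2_wins = set(range(6, 24))    # 18 wins in run 2

print(len(run1_wins | run2_wins))  # union: 24
print(len(run1_wins & run2_wins))  # overlap: 16
```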
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Camyla, a fully autonomous system for conducting research in medical image segmentation. It integrates three mechanisms—Quality-Weighted Branch Exploration, Layered Reflective Memory, and Divergent Diagnostic Feedback—to handle search drift, knowledge degradation, and failure recovery in long-horizon experimentation. The system is evaluated on CamylaBench, a benchmark comprising 31 datasets sourced exclusively from 2025 publications and asserted to be contamination-free. Under a zero-intervention protocol across two independent runs, Camyla generates over 2,700 novel model implementations and 40 complete manuscripts. It surpasses the strongest baseline from 14 established architectures (including nnU-Net) on 24 of 31 datasets (union across runs), with generated manuscripts receiving senior reviewer scores at the T1/T2 level of medical imaging journals. The system also outperforms AutoML, NAS, and other research agents in aggregate performance and task completion.
Significance. If the results are robust and the benchmark construction ensures no data leakage or prior knowledge contamination, this work could significantly advance the field of autonomous AI research by demonstrating end-to-end scientific discovery at scale in a specialized domain. The ability to produce journal-quality manuscripts and models that exceed strong baselines like nnU-Net under controlled conditions highlights the potential for such systems to accelerate medical imaging research. The scale of experimentation (thousands of models) and the multi-run evaluation provide a strong empirical foundation, though the ultimate impact depends on independent verification of the benchmark's integrity and the reproducibility of the agent's components.
major comments (2)
- The assertion that CamylaBench is 'contamination-free' and built exclusively from 2025 publications is load-bearing for the outperformance claims (abstract and evaluation section). The manuscript should detail the specific enforcement mechanisms, such as literature search cutoffs, exclusion lists for source papers, or independent audits, to allow readers to assess the risk of implicit retrieval of prior analogs. Without this, the 'novel model implementations' could partly reflect memorized knowledge rather than autonomous discovery.
- The claim of surpassing the strongest per-dataset baseline on 24/31 datasets requires explicit description of how the 'strongest' baseline is determined for each dataset and confirmation that all models, including Camyla's, were trained under strictly identical budgets and protocols (evaluation section). Any deviation in hyperparameter search or data augmentation could undermine the direct comparability.
minor comments (2)
- The abstract states results from 'two independent runs'; including variance or statistical tests across runs in the results section would strengthen the reliability of the 22/18/24 outperformance figures.
- Provide pseudocode or flow diagrams for the three core mechanisms (Quality-Weighted Branch Exploration, Layered Reflective Memory, Divergent Diagnostic Feedback) to aid understanding and replication.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the two major comments below, indicating the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: The assertion that CamylaBench is 'contamination-free' and built exclusively from 2025 publications is load-bearing for the outperformance claims (abstract and evaluation section). The manuscript should detail the specific enforcement mechanisms, such as literature search cutoffs, exclusion lists for source papers, or independent audits, to allow readers to assess the risk of implicit retrieval of prior analogs. Without this, the 'novel model implementations' could partly reflect memorized knowledge rather than autonomous discovery.
Authors: We agree that providing explicit details on the benchmark construction process is necessary to substantiate the contamination-free claim. Although the manuscript states that CamylaBench comprises datasets from exclusively 2025 publications, it does not elaborate on the specific mechanisms used to enforce this. In the revised manuscript, we will insert a detailed paragraph in the CamylaBench section describing the literature search protocol (cutoff after 2024), the exclusion criteria applied to avoid analogs from prior years, and the independent audit performed by external experts to confirm no data leakage. This will enable readers to evaluate the risk of implicit knowledge retrieval. revision: yes
-
Referee: The claim of surpassing the strongest per-dataset baseline on 24/31 datasets requires explicit description of how the 'strongest' baseline is determined for each dataset and confirmation that all models, including Camyla's, were trained under strictly identical budgets and protocols (evaluation section). Any deviation in hyperparameter search or data augmentation could undermine the direct comparability.
Authors: We appreciate this observation regarding the need for greater clarity on baseline comparison. The current manuscript specifies that the strongest baseline is selected from 14 architectures and that all models, including those generated by Camyla, were trained under identical training budgets. However, to address the referee's concern directly, we will revise the evaluation section to include an explicit description: the strongest baseline for each dataset is the one with the highest performance among the 14 when evaluated under the same fixed protocol, including identical data splits, augmentation pipelines, and hyperparameter optimization budgets. We will also add confirmation that no deviations were allowed in training procedures. A new table will be included to list the selected strongest baseline per dataset for transparency. revision: yes
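The selection rule the authors promise to spell out can be stated in a few lines. A sketch under the assumption that "strongest" means highest score per dataset under the shared protocol; the function names, architecture subset, and Dice values below are illustrative, not from the paper:

```python
def strongest_baseline(per_arch_scores):
    """Pick the best baseline architecture for one dataset.

    per_arch_scores: dict architecture name -> score under the fixed protocol.
    Illustrative only; the real comparison spans 14 architectures.
    """
    return max(per_arch_scores.items(), key=lambda kv: kv[1])

def count_surpassed(agent_scores, baseline_scores):
    """Count datasets where the agent's model beats the per-dataset strongest baseline."""
    return sum(
        agent_scores[d] > strongest_baseline(baseline_scores[d])[1]
        for d in agent_scores
    )

# Made-up two-dataset example with two of the baseline architectures.
baselines = {
    "dataset_A": {"nnU-Net": 0.91, "V-Net": 0.86},
    "dataset_B": {"nnU-Net": 0.78, "V-Net": 0.80},
}
camyla = {"dataset_A": 0.93, "dataset_B": 0.79}
print(count_surpassed(camyla, baselines))  # 1: beats 0.91 on A, not 0.80 on B
```

Publishing the equivalent of `strongest_baseline`'s output per dataset, as the rebuttal's promised table would, is what makes the 24/31 claim auditable.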
Circularity Check
No significant circularity; empirical claims rest on external benchmark comparisons
Full rationale
The paper presents a systems description of an autonomous research agent evaluated through empirical runs on a newly constructed benchmark (CamylaBench) consisting of 2025 publications. No mathematical derivations, equations, fitted parameters, or predictions are present that reduce to self-defined inputs by construction. Performance claims (e.g., surpassing baselines on 24/31 datasets) are grounded in direct comparisons to 14 external architectures under fixed training budgets, with no load-bearing self-citations, uniqueness theorems, or ansatzes invoked. The mechanisms (Quality-Weighted Branch Exploration, Layered Reflective Memory, Divergent Diagnostic Feedback) are described as design choices without circular reductions to the reported outcomes. This is a standard non-circular empirical systems paper.
Axiom & Free-Parameter Ledger
invented entities (3)
- Quality-Weighted Branch Exploration: no independent evidence
- Layered Reflective Memory: no independent evidence
- Divergent Diagnostic Feedback: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction; washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
The system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications... surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024
2024
-
[2]
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025
2025
-
[3]
Agent laboratory: Using LLM agents as research assistants
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP, 2025
2025
-
[4]
DiNTS: Differentiable neural network topology search for 3D medical image segmentation
Yufan He, Dong Yang, Holger Roth, Can Zhao, and Daguang Xu. DiNTS: Differentiable neural network topology search for 3D medical image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2021
2021
-
[5]
C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation
Qihang Yu, Dong Yang, Holger R. Roth, Yutong Bai, Yixiao Zhang, Alan L. Yuille, and Daguang Xu. C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4126–4135, 2020
2020
-
[6]
Auto-nnU-Net: Towards automated medical image segmentation
Jannis Becktepe, Leona Hennig, Steffen Oeltze-Jafra, and Marius Lindauer. Auto-nnU-Net: Towards automated medical image segmentation. In International Conference on Automated Machine Learning, pages 25–1. PMLR, 2025
2025
-
[7]
Automated Machine Learning: Methods, Systems, Challenges
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019
2019
-
[8]
autoresearch
Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch, 2026
2026
-
[9]
nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation
Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021
2021
-
[10]
V-Net: Fully convolutional neural networks for volumetric medical image segmentation
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Fourth International Conference on 3D Vision, pages 565–571, 2016
2016
-
[11]
Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool
Abdel Aziz Taha and Allan Hanbury. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Medical Informatics and Decision Making, 15(29), 2015
2015
-
[12]
The medical segmentation decathlon
Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022
2022
-
[13]
A comprehensive dataset of germinoma on MRI/CT with clinical and radiomic data
Lixuan Huang et al. A comprehensive dataset of germinoma on MRI/CT with clinical and radiomic data. Scientific Data, 12: 312, 2025. doi: 10.1038/s41597-025-04596-7
-
[14]
Zhangnan Zhong et al. A comprehensive dataset of magnetic resonance enterography images with intestinal segment annotations. Scientific Data, 12:425, 2025. doi: 10.1038/s41597-025-04760-z
-
[15]
A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Yin Li et al. A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation. Scientific Data, 12: 1450, 2025. doi: 10.1038/s41597-025-05815-x
-
[16]
A dataset of synthetic, maturation-informed magnetic resonance images of the human fetal brain
Helene Lajous et al. A dataset of synthetic, maturation-informed magnetic resonance images of the human fetal brain. Scientific Data, 12:602, 2025. doi: 10.1038/s41597-025-04926-9
-
[17]
A longitudinal MRI dataset of brain metastases with tumor segmentations, clinical and radiomic data
Dimitra Flouri et al. A longitudinal MRI dataset of brain metastases with tumor segmentations, clinical and radiomic data. Scientific Data, 12:1828, 2025. doi: 10.1038/s41597-025-06131-0
-
[18]
Fernanda L. Ribeiro et al. An annotated multi-site and multi-contrast magnetic resonance imaging dataset for the study of the human tongue musculature. Scientific Data, 12:790, 2025. doi: 10.1038/s41597-025-05092-8
-
[19]
BOston Neonatal Brain Injury Data for Hypoxic Ischemic Encephalopathy (BONBID-HIE): I
Rina Bao et al. BOston Neonatal Brain Injury Data for Hypoxic Ischemic Encephalopathy (BONBID-HIE): I. MRI and lesion labeling. Scientific Data, 12:53, 2025. doi: 10.1038/s41597-024-03986-7
-
[20]
A radiograph dataset for the classification, localization, and segmentation of primary bone tumors
Shunhan Yao et al. A radiograph dataset for the classification, localization, and segmentation of primary bone tumors. Scientific Data, 12:88, 2025. doi: 10.1038/s41597-024-04311-y
-
[21]
BUS-UCLM: Breast ultrasound lesion segmentation dataset
Noelia Vallez et al. BUS-UCLM: Breast ultrasound lesion segmentation dataset. Scientific Data, 12:242, 2025. doi: 10.1038/ s41597-025-04562-3
2025
-
[22]
Large scale MRI collection and segmentation of cirrhotic liver
Debesh Jha et al. Large scale MRI collection and segmentation of cirrhotic liver. Scientific Data, 12:896, 2025. doi: 10.1038/ s41597-025-05201-7
2025
-
[23]
Core-penumbra hyperacute ischemic stroke dataset
D. Umerenkov et al. Core-penumbra hyperacute ischemic stroke dataset. Scientific Data, 12:707, 2025. doi: 10.1038/ s41597-025-05000-0
2025
-
[24]
DenPAR: Annotated intra-oral periapical radiographs dataset for machine learning
Sumudu Rasnayaka et al. DenPAR: Annotated intra-oral periapical radiographs dataset for machine learning. Scientific Data, 12:1615, 2025. doi: 10.1038/s41597-025-05906-9
-
[25]
Giulia Rotunno et al. DERMA-OCTA: A comprehensive dataset and preprocessing pipeline for dermatological OCTA vessel segmentation. Scientific Data, 12:1473, 2025. doi: 10.1038/s41597-025-05763-6
-
[26]
Pietro Mascagni et al. Endoscapes, a critical view of safety and surgical scene segmentation dataset for laparoscopic cholecystectomy. Scientific Data, 12:331, 2025. doi: 10.1038/s41597-025-04642-4
-
[27]
Claudio S. Ravasio et al. FOVEA: Preoperative and intraoperative retinal fundus images with optic disc and retinal vessel annotations. Scientific Data, 12:703, 2025. doi: 10.1038/s41597-025-04965-2
-
[28]
A fundus image dataset for AI-based artery-vein vessel segmentation
Zhuo Deng et al. A fundus image dataset for AI-based artery-vein vessel segmentation. Scientific Data, 12:1298, 2025. doi: 10.1038/s41597-025-05381-2
-
[29]
High-resolution ultrasound data for AI-based segmentation in mouse brain tumor
S. Dorosti et al. High-resolution ultrasound data for AI-based segmentation in mouse brain tumor. Scientific Data, 12:1322, 2025. doi: 10.1038/s41597-025-05619-z
-
[31]
L. Gou et al. Dynamic key vascular anatomy dataset for D2 lymph node dissection during laparoscopic gastric cancer surgery. Scientific Data, 12:903, 2025. doi: 10.1038/s41597-025-05255-7
-
[32]
D. S. Carmo et al. Manual segmentation of opacities and consolidations on CT of long COVID patients from multiple annotators. Scientific Data, 12:402, 2025. doi: 10.1038/s41597-025-04709-2
-
[33]
F. Guarnera et al. MSLesSeg: baseline and benchmarking of a new multiple sclerosis lesion segmentation dataset. Scientific Data, 12:920, 2025. doi: 10.1038/s41597-025-05250-y
-
[34]
E. Mahmoud et al. MU-Glioma Post: A comprehensive dataset of automated MR multi-sequence segmentation and clinical features. Scientific Data, 12:1847, 2025. doi: 10.1038/s41597-025-06011-7
-
[35]
K.-H. Chen et al. NLSTseg: A pixel-level lung cancer dataset based on NLST LDCT images. Scientific Data, 12:1475, 2025. doi: 10.1038/s41597-025-05742-x
-
[36]
OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers
M. Arikan et al. OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers. Scientific Data, 12:267, 2025. doi: 10.1038/s41597-024-04259-z
-
[38]
M. Popa et al. PediMS: A pediatric multiple sclerosis lesion segmentation dataset. Scientific Data, 12:1184, 2025. doi: 10.1038/s41597-025-05346-5
-
[39]
Comprehensive multi-phase 3D contrast-enhanced CT imaging for primary liver cancer
Jiawei Luo et al. Comprehensive multi-phase 3D contrast-enhanced CT imaging for primary liver cancer. Scientific Data, 12: 768, 2025. doi: 10.1038/s41597-025-05125-2
-
[40]
Xin Shi et al. PW-BALFC, a clinical dataset for detection and instance segmentation of bronchoalveolar lavage fluid cell. Scientific Data, 12:1074, 2025. doi: 10.1038/s41597-025-05452-4
-
[41]
Spine endoscopic atlas: an open-source dataset for surgical instrument segmentation
Zhipeng Xu et al. Spine endoscopic atlas: an open-source dataset for surgical instrument segmentation. Scientific Data, 12: 1611, 2025. doi: 10.1038/s41597-025-05897-7
-
[42]
A multi-modal dental dataset for semi-supervised deep learning image segmentation
Yaqi Wang et al. A multi-modal dental dataset for semi-supervised deep learning image segmentation. Scientific Data, 12:117, 2025. doi: 10.1038/s41597-024-04306-9
-
[44]
TOM500: A multi-organ annotated orbital MRI dataset for thyroid eye disease
Haiyang Zhang et al. TOM500: A multi-organ annotated orbital MRI dataset for thyroid eye disease. Scientific Data, 12:60, 2025. doi: 10.1038/s41597-025-04427-9
-
[46]
William Ndzimbong et al. TRUSTED: The paired 3D transabdominal ultrasound and CT human data for kidney segmentation and registration research. Scientific Data, 12:615, 2025. doi: 10.1038/s41597-025-04467-1
-
[47]
A multi-modal pelvic MRI dataset for deep learning-based pelvic organ segmentation in endometriosis
Xiaomin Liang et al. A multi-modal pelvic MRI dataset for deep learning-based pelvic organ segmentation in endometriosis. Scientific Data, 12:1292, 2025. doi: 10.1038/s41597-025-05623-3
-
[48]
U-mamba: Enhancing long-range dependency for biomedical image segmentation
Jun Ma, Feifei Li, and Bo Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024
-
[49]
Jason Priem, Heather Piwowar, and Richard Orr. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833, 2022
-
[50]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002
2002
-
[51]
Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3): 203–230, 2011
2011
-
[52]
Mastering the game of Go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017
2017
-
[53]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023
2023
-
[54]
DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024
2024
-
[55]
Automated 3D segmentation of kidneys and tumors in MICCAI KiTS 2023 challenge
Andriy Myronenko, Dong Yang, Yufan He, and Daguang Xu. Automated 3D segmentation of kidneys and tumors in MICCAI KiTS 2023 challenge. In KiTS@MICCAI, 2023
2023
-
[56]
MONAI: An open-source framework for deep learning in healthcare
M. Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murray, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022
2022
-
[57]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
2021
-
[58]
SWE-bench: Can language models resolve real-world GitHub issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024
2024
-
[59]
SWE-agent: Agent-computer interfaces enable automated software engineering
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024
2024
-
[60]
OpenHands: An open platform for AI software developers as generalist agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations, 2025
2025
-
[61]
MLR-copilot: Autonomous machine learning research based on large language models agents
Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. MLR-copilot: Autonomous machine learning research based on large language models agents. arXiv preprint arXiv:2408.14033, 2024
-
[62]
CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation
Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Daniel S. Weld, and Peter Clark. CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation. In Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[63]
DeepScientist: Advancing frontier-pushing scientific findings progressively
Yixuan Weng, Minjun Zhu, Qiujie Xie, QiYao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. DeepScientist: Advancing frontier-pushing scientific findings progressively. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[64]
ResearchAgent: Iterative research idea generation over scientific literature with large language models
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In North American Chapter of the Association for Computational Linguistics, 2024
2024
-
[65]
Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system
Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system. In Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[66]
EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lu Zhou, and Xiaohu Yan. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. arXiv preprint, 2026
2026
-
[67]
IdeaBench: Benchmarking large language models for research idea generation
Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, and Aidong Zhang. IdeaBench: Benchmarking large language models for research idea generation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025
2025
-
[68]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024
2024
-
[69]
Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. arXiv preprint arXiv:2505.13259, 2025
-
[70]
Towards a medical AI scientist
Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, and Yixuan Yuan. Towards a medical AI scientist. arXiv preprint, 2026
2026
-
[71]
OpenLens AI: Fully autonomous research agent for health informatics
Yuxiao Cheng and Jinli Suo. OpenLens AI: Fully autonomous research agent for health informatics. arXiv preprint arXiv:2509.14778, 2025
-
[72]
DORA AI Scientist: Multi-agent virtual research team for scientific exploration discovery and automated report generation
Vladimir Naumov, Diana Zagirova, Sha Lin, Yupeng Xie, Wenhao Gou, Anatoly Urban, Nina Tikhonova, Khadija M. Alawi, Mike Durymanov, Filip Galkin, et al. DORA AI Scientist: Multi-agent virtual research team for scientific exploration discovery and automated report generation. bioRxiv preprint 2025.03.06.641840, 2025
2025
-
[73]
SpatialAgent: An autonomous AI agent for spatial biology
Hanchen Wang, Yichun He, Paula Coelho, Massimo Bucci, Asma Nazir, Bo Chen, Loi Trinh, Serena Zhang, Kexin Huang, et al. SpatialAgent: An autonomous AI agent for spatial biology. bioRxiv preprint 2025.04.03.646459, 2025
2025
-
[74]
PharmAgents: Building a virtual pharma with large language model agents
Bowen Gao, Yanwen Huang, Yiqiao Liu, Wenxuan Xie, Weiying Ma, Ya-Qin Zhang, and Yanyan Lan. PharmAgents: Building a virtual pharma with large language model agents. arXiv preprint arXiv:2503.22164, 2025
-
[75]
Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for LLMs
Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko. Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for LLMs. arXiv preprint, 2026
2026
-
[76]
Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory
Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, and Huaxiu Yao. Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory. arXiv preprint, 2026
2026
-
[77]
Bilevel autoresearch: Meta-autoresearching itself
Yao Qu and Meng Lu. Bilevel autoresearch: Meta-autoresearching itself. arXiv preprint, 2026
2026
-
[78]
nnU-Net revisited: A call for rigorous validation in 3D medical image segmentation
Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, and Paul F. Jaeger. nnU-Net revisited: A call for rigorous validation in 3D medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024
2024
-
[79]
NAS-Unet: Neural architecture search for medical image segmentation
Yu Weng, Tianbao Zhou, Yujie Li, and Xiaoyu Qiu. NAS-Unet: Neural architecture search for medical image segmentation. IEEE Access, 7:44247–44257, 2019
2019
-
[80]
V-NAS: Neural architecture search for volumetric medical image segmentation
Zhuotun Zhu, Chenxi Liu, Dong Yang, Alan Yuille, and Daguang Xu. V-NAS: Neural architecture search for volumetric medical image segmentation. In International Conference on 3D Vision, pages 240–248, 2019
2019