Camyla: Scaling Autonomous Research in Medical Image Segmentation
Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Camyla is an autonomous system that turns raw medical image datasets into novel models and complete manuscripts, beating the best of 14 human-designed baselines on 24 of 31 datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Camyla performs fully autonomous research in medical image segmentation by combining Quality-Weighted Branch Exploration, Layered Reflective Memory, and Divergent Diagnostic Feedback to generate over 2,700 novel model implementations and 40 complete manuscripts that surpass the strongest per-dataset baseline chosen from 14 established architectures on 24 of 31 datasets under identical training budgets and zero-intervention conditions.
What carries the argument
Three coupled mechanisms carry the argument: Quality-Weighted Branch Exploration for allocating search effort, Layered Reflective Memory for compressing cross-trial knowledge at multiple scales, and Divergent Diagnostic Feedback for diversifying recovery paths. Together they support long-horizon autonomous experimentation without drift or repetition.
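The paper's reference graph includes the multi-armed bandit literature, but the exploration rule itself is not published in this review. A minimal sketch of what quality-weighted effort allocation could look like under a UCB1-style assumption; the class name `BranchExplorer`, its methods, and the constant `c` are hypothetical, not from the paper:

```python
import math

class BranchExplorer:
    """UCB1-style trial allocation across proposal branches.

    Hypothetical reading of Quality-Weighted Branch Exploration: this exact
    rule, the class name, and the constant c are assumptions for illustration.
    """

    def __init__(self, c=1.4):
        self.c = c
        self.stats = {}  # branch name -> [quality_sum, trial_count]

    def record(self, branch, quality):
        """Log one finished trial's quality score (e.g. validation Dice) for a branch."""
        s = self.stats.setdefault(branch, [0.0, 0])
        s[0] += quality
        s[1] += 1

    def next_branch(self):
        """Pick the branch maximizing mean quality plus an exploration bonus."""
        n_total = sum(n for _, n in self.stats.values())

        def ucb(branch):
            total, n = self.stats[branch]
            return total / n + self.c * math.sqrt(math.log(n_total) / n)

        return max(self.stats, key=ucb)
```

With ten trials averaging 0.8 on branch `a` and two trials averaging 0.6 on branch `b`, the exploration bonus makes `b` the next pick despite its lower mean, which is the behavior the paper attributes to the mechanism: effort does not collapse onto a single early-leading branch.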
If this is right
- Camyla produces more novel model implementations and complete manuscripts than prior automated baselines under the same constraints.
- The generated manuscripts receive senior human reviewer scores at the T1/T2 boundary of contemporary medical imaging journals.
- Camyla exceeds AutoML and NAS systems on aggregate segmentation performance and outperforms six other open-ended research agents on task completion and frequency of beating baselines.
- Domain-scale autonomous research becomes feasible in medical image segmentation when search drift, knowledge loss, and repetitive failure recovery are addressed together.
Where Pith is reading between the lines
- The same three-mechanism structure could be tested in other data-rich scientific domains such as protein design or climate modeling to check whether long-horizon autonomy generalizes.
- If the reflective memory and diagnostic feedback scale with larger compute clusters, the number of datasets Camyla can handle in parallel would increase without proportional human oversight.
- A natural next measurement is whether the novel models transfer to new patient cohorts or scanner types beyond the 31 datasets used here.
- The system’s ability to produce journal-level manuscripts suggests that automated literature grounding can reduce the time from data collection to submission in imaging research.
Load-bearing premise
The CamylaBench benchmark is built exclusively from 2025 publications, free of prior contamination, and all evaluations follow a strict zero-intervention protocol.
What would settle it
Independent retraining of the generated models on the same datasets yielding performance below the reported baselines, or external reviewers finding that the manuscripts contain only incremental or previously published ideas, would show the autonomous research claim does not hold.
Original abstract
We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.
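The per-run counts in the abstract (22 and 18 wins, union 24 of 31) are mutually consistent: they imply an overlap of 22 + 18 − 24 = 16 datasets won in both runs. A toy check with hypothetical win sets chosen only to reproduce those counts (the specific dataset indices are invented):

```python
# Hypothetical per-run win sets over datasets indexed 0..30; the indices are
# made up, only the counts match the abstract's figures.
run1_wins = set(range(22))       # 22 wins in run 1
run2_wins = set(range(6, 24))    # 18 wins in run 2

print(len(run1_wins | run2_wins))  # union: 24
print(len(run1_wins & run2_wins))  # overlap: 16
```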
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Camyla, a fully autonomous system for conducting research in medical image segmentation. It integrates three mechanisms—Quality-Weighted Branch Exploration, Layered Reflective Memory, and Divergent Diagnostic Feedback—to handle search drift, knowledge degradation, and failure recovery in long-horizon experimentation. The system is evaluated on CamylaBench, a benchmark comprising 31 datasets sourced exclusively from 2025 publications and asserted to be contamination-free. Under a zero-intervention protocol across two independent runs, Camyla generates over 2,700 novel model implementations and 40 complete manuscripts. It surpasses the strongest baseline from 14 established architectures (including nnU-Net) on 24 of 31 datasets (union across runs), with generated manuscripts receiving senior reviewer scores at the T1/T2 level of medical imaging journals. The system also outperforms AutoML, NAS, and other research agents in aggregate performance and task completion.
Significance. If the results are robust and the benchmark construction ensures no data leakage or prior knowledge contamination, this work could significantly advance the field of autonomous AI research by demonstrating end-to-end scientific discovery at scale in a specialized domain. The ability to produce journal-quality manuscripts and models that exceed strong baselines like nnU-Net under controlled conditions highlights the potential for such systems to accelerate medical imaging research. The scale of experimentation (thousands of models) and the multi-run evaluation provide a strong empirical foundation, though the ultimate impact depends on independent verification of the benchmark's integrity and the reproducibility of the agent's components.
major comments (2)
- The assertion that CamylaBench is 'contamination-free' and built exclusively from 2025 publications is load-bearing for the outperformance claims (abstract and evaluation section). The manuscript should detail the specific enforcement mechanisms, such as literature search cutoffs, exclusion lists for source papers, or independent audits, to allow readers to assess the risk of implicit retrieval of prior analogs. Without this, the 'novel model implementations' could partly reflect memorized knowledge rather than autonomous discovery.
- The claim of surpassing the strongest per-dataset baseline on 24/31 datasets requires explicit description of how the 'strongest' baseline is determined for each dataset and confirmation that all models, including Camyla's, were trained under strictly identical budgets and protocols (evaluation section). Any deviation in hyperparameter search or data augmentation could undermine the direct comparability.
minor comments (2)
- The abstract states results from 'two independent runs'; including variance or statistical tests across runs in the results section would strengthen the reliability of the 22/18/24 outperformance figures.
- Provide pseudocode or flow diagrams for the three core mechanisms (Quality-Weighted Branch Exploration, Layered Reflective Memory, Divergent Diagnostic Feedback) to aid understanding and replication.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the two major comments below, indicating the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: The assertion that CamylaBench is 'contamination-free' and built exclusively from 2025 publications is load-bearing for the outperformance claims (abstract and evaluation section). The manuscript should detail the specific enforcement mechanisms, such as literature search cutoffs, exclusion lists for source papers, or independent audits, to allow readers to assess the risk of implicit retrieval of prior analogs. Without this, the 'novel model implementations' could partly reflect memorized knowledge rather than autonomous discovery.
Authors: We agree that providing explicit details on the benchmark construction process is necessary to substantiate the contamination-free claim. Although the manuscript states that CamylaBench comprises datasets from exclusively 2025 publications, it does not elaborate on the specific mechanisms used to enforce this. In the revised manuscript, we will insert a detailed paragraph in the CamylaBench section describing the literature search protocol (cutoff after 2024), the exclusion criteria applied to avoid analogs from prior years, and the independent audit performed by external experts to confirm no data leakage. This will enable readers to evaluate the risk of implicit knowledge retrieval. revision: yes
-
Referee: The claim of surpassing the strongest per-dataset baseline on 24/31 datasets requires explicit description of how the 'strongest' baseline is determined for each dataset and confirmation that all models, including Camyla's, were trained under strictly identical budgets and protocols (evaluation section). Any deviation in hyperparameter search or data augmentation could undermine the direct comparability.
Authors: We appreciate this observation regarding the need for greater clarity on baseline comparison. The current manuscript specifies that the strongest baseline is selected from 14 architectures and that all models, including those generated by Camyla, were trained under identical training budgets. However, to address the referee's concern directly, we will revise the evaluation section to include an explicit description: the strongest baseline for each dataset is the one with the highest performance among the 14 when evaluated under the same fixed protocol, including identical data splits, augmentation pipelines, and hyperparameter optimization budgets. We will also add confirmation that no deviations were allowed in training procedures. A new table will be included to list the selected strongest baseline per dataset for transparency. revision: yes
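The selection rule the authors promise to spell out can be stated in a few lines. A sketch under the assumption that "strongest" means highest score per dataset under the shared protocol; the function names, architecture subset, and Dice values below are illustrative, not from the paper:

```python
def strongest_baseline(per_arch_scores):
    """Pick the best baseline architecture for one dataset.

    per_arch_scores: dict architecture name -> score under the fixed protocol.
    Illustrative only; the real comparison spans 14 architectures.
    """
    return max(per_arch_scores.items(), key=lambda kv: kv[1])

def count_surpassed(agent_scores, baseline_scores):
    """Count datasets where the agent's model beats the per-dataset strongest baseline."""
    return sum(
        agent_scores[d] > strongest_baseline(baseline_scores[d])[1]
        for d in agent_scores
    )

# Made-up two-dataset example with two of the baseline architectures.
baselines = {
    "dataset_A": {"nnU-Net": 0.91, "V-Net": 0.86},
    "dataset_B": {"nnU-Net": 0.78, "V-Net": 0.80},
}
camyla = {"dataset_A": 0.93, "dataset_B": 0.79}
print(count_surpassed(camyla, baselines))  # 1: beats 0.91 on A, not 0.80 on B
```

Publishing the equivalent of `strongest_baseline`'s output per dataset, as the rebuttal's promised table would, is what makes the 24/31 claim auditable.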
Circularity Check
No significant circularity; empirical claims rest on external benchmark comparisons
Full rationale
The paper presents a systems description of an autonomous research agent evaluated through empirical runs on a newly constructed benchmark (CamylaBench) consisting of 2025 publications. No mathematical derivations, equations, fitted parameters, or predictions are present that reduce to self-defined inputs by construction. Performance claims (e.g., surpassing baselines on 24/31 datasets) are grounded in direct comparisons to 14 external architectures under fixed training budgets, with no load-bearing self-citations, uniqueness theorems, or ansatzes invoked. The mechanisms (Quality-Weighted Branch Exploration, Layered Reflective Memory, Divergent Diagnostic Feedback) are described as design choices without circular reductions to the reported outcomes. This is a standard non-circular empirical systems paper.
Axiom & Free-Parameter Ledger
invented entities (3)
- Quality-Weighted Branch Exploration: no independent evidence
- Layered Reflective Memory: no independent evidence
- Divergent Diagnostic Feedback: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction; washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
The system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications... surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024
2024
-
[2]
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025
2025
-
[3]
Agent laboratory: Using LLM agents as research assistants
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP, 2025
2025
-
[4]
DiNTS: Differentiable neural network topology search for 3D medical image segmentation
Yufan He, Dong Yang, Holger Roth, Can Zhao, and Daguang Xu. DiNTS: Differentiable neural network topology search for 3D medical image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2021
2021
-
[5]
C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation
Qihang Yu, Dong Yang, Holger R. Roth, Yutong Bai, Yixiao Zhang, Alan L. Yuille, and Daguang Xu. C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4126–4135, 2020
2020
-
[6]
Auto-nnU-Net: Towards automated medical image segmentation
Jannis Becktepe, Leona Hennig, Steffen Oeltze-Jafra, and Marius Lindauer. Auto-nnU-Net: Towards automated medical image segmentation. In International Conference on Automated Machine Learning, pages 25–1. PMLR, 2025
2025
-
[7]
Automated Machine Learning: Methods, Systems, Challenges
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019
2019
-
[8]
autoresearch
Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch, 2026
2026
-
[9]
nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation
Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021
2021
-
[10]
V-Net: Fully convolutional neural networks for volumetric medical image segmentation
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Fourth International Conference on 3D Vision, pages 565–571, 2016
2016
-
[11]
Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool
Abdel Aziz Taha and Allan Hanbury. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Medical Informatics and Decision Making, 15(29), 2015
2015
-
[12]
The medical segmentation decathlon
Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022
2022
-
[13]
A comprehensive dataset of germinoma on MRI/CT with clinical and radiomic data
Lixuan Huang et al. A comprehensive dataset of germinoma on MRI/CT with clinical and radiomic data. Scientific Data, 12: 312, 2025. doi: 10.1038/s41597-025-04596-7
-
[14]
Zhangnan Zhong et al. A comprehensive dataset of magnetic resonance enterography images with intestinal segment annotations. Scientific Data, 12:425, 2025. doi: 10.1038/s41597-025-04760-z
-
[15]
A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Yin Li et al. A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation. Scientific Data, 12: 1450, 2025. doi: 10.1038/s41597-025-05815-x
-
[16]
A dataset of synthetic, maturation-informed magnetic resonance images of the human fetal brain
Helene Lajous et al. A dataset of synthetic, maturation-informed magnetic resonance images of the human fetal brain. Scientific Data, 12:602, 2025. doi: 10.1038/s41597-025-04926-9
-
[17]
A longitudinal MRI dataset of brain metastases with tumor segmentations, clinical and radiomic data
Dimitra Flouri et al. A longitudinal MRI dataset of brain metastases with tumor segmentations, clinical and radiomic data. Scientific Data, 12:1828, 2025. doi: 10.1038/s41597-025-06131-0
-
[18]
Fernanda L. Ribeiro et al. An annotated multi-site and multi-contrast magnetic resonance imaging dataset for the study of the human tongue musculature. Scientific Data, 12:790, 2025. doi: 10.1038/s41597-025-05092-8
-
[19]
BOston Neonatal Brain Injury Data for Hypoxic Ischemic Encephalopathy (BONBID-HIE): I
Rina Bao et al. BOston Neonatal Brain Injury Data for Hypoxic Ischemic Encephalopathy (BONBID-HIE): I. MRI and lesion labeling. Scientific Data, 12:53, 2025. doi: 10.1038/s41597-024-03986-7
-
[20]
A radiograph dataset for the classification, localization, and segmentation of primary bone tumors
Shunhan Yao et al. A radiograph dataset for the classification, localization, and segmentation of primary bone tumors. Scientific Data, 12:88, 2025. doi: 10.1038/s41597-024-04311-y
-
[21]
BUS-UCLM: Breast ultrasound lesion segmentation dataset
Noelia Vallez et al. BUS-UCLM: Breast ultrasound lesion segmentation dataset. Scientific Data, 12:242, 2025. doi: 10.1038/ s41597-025-04562-3
2025
-
[22]
Large scale MRI collection and segmentation of cirrhotic liver
Debesh Jha et al. Large scale MRI collection and segmentation of cirrhotic liver. Scientific Data, 12:896, 2025. doi: 10.1038/ s41597-025-05201-7
2025
-
[23]
Core-penumbra hyperacute ischemic stroke dataset
D. Umerenkov et al. Core-penumbra hyperacute ischemic stroke dataset. Scientific Data, 12:707, 2025. doi: 10.1038/ s41597-025-05000-0
2025
-
[24]
DenPAR: Annotated intra-oral periapical radiographs dataset for machine learning
Sumudu Rasnayaka et al. DenPAR: Annotated intra-oral periapical radiographs dataset for machine learning. Scientific Data, 12:1615, 2025. doi: 10.1038/s41597-025-05906-9
-
[25]
Giulia Rotunno et al. DERMA-OCTA: A comprehensive dataset and preprocessing pipeline for dermatological OCTA vessel segmentation. Scientific Data, 12:1473, 2025. doi: 10.1038/s41597-025-05763-6
-
[26]
Pietro Mascagni et al. Endoscapes, a critical view of safety and surgical scene segmentation dataset for laparoscopic cholecystectomy. Scientific Data, 12:331, 2025. doi: 10.1038/s41597-025-04642-4
-
[27]
Claudio S. Ravasio et al. FOVEA: Preoperative and intraoperative retinal fundus images with optic disc and retinal vessel annotations. Scientific Data, 12:703, 2025. doi: 10.1038/s41597-025-04965-2
-
[28]
A fundus image dataset for AI-based artery-vein vessel segmentation
Zhuo Deng et al. A fundus image dataset for AI-based artery-vein vessel segmentation. Scientific Data, 12:1298, 2025. doi: 10.1038/s41597-025-05381-2
-
[29]
High-resolution ultrasound data for AI-based segmentation in mouse brain tumor
S. Dorosti et al. High-resolution ultrasound data for AI-based segmentation in mouse brain tumor. Scientific Data, 12:1322, 2025. doi: 10.1038/s41597-025-05619-z
-
[31]
L. Gou et al. Dynamic key vascular anatomy dataset for D2 lymph node dissection during laparoscopic gastric cancer surgery. Scientific Data, 12:903, 2025. doi: 10.1038/s41597-025-05255-7
-
[32]
D. S. Carmo et al. Manual segmentation of opacities and consolidations on CT of long COVID patients from multiple annotators. Scientific Data, 12:402, 2025. doi: 10.1038/s41597-025-04709-2
-
[33]
F. Guarnera et al. MSLesSeg: baseline and benchmarking of a new multiple sclerosis lesion segmentation dataset. Scientific Data, 12:920, 2025. doi: 10.1038/s41597-025-05250-y
-
[34]
E. Mahmoud et al. MU-Glioma Post: A comprehensive dataset of automated MR multi-sequence segmentation and clinical features. Scientific Data, 12:1847, 2025. doi: 10.1038/s41597-025-06011-7
-
[35]
K.-H. Chen et al. NLSTseg: A pixel-level lung cancer dataset based on NLST LDCT images. Scientific Data, 12:1475, 2025. doi: 10.1038/s41597-025-05742-x
-
[36]
OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers
M. Arikan et al. OCT5k: A dataset of multi-disease and multi-graded annotations for retinal layers. Scientific Data, 12:267, 2025. doi: 10.1038/s41597-024-04259-z
-
[38]
M. Popa et al. PediMS: A pediatric multiple sclerosis lesion segmentation dataset. Scientific Data, 12:1184, 2025. doi: 10.1038/s41597-025-05346-5
-
[39]
Comprehensive multi-phase 3D contrast-enhanced CT imaging for primary liver cancer
Jiawei Luo et al. Comprehensive multi-phase 3D contrast-enhanced CT imaging for primary liver cancer. Scientific Data, 12: 768, 2025. doi: 10.1038/s41597-025-05125-2
-
[40]
Xin Shi et al. PW-BALFC, a clinical dataset for detection and instance segmentation of bronchoalveolar lavage fluid cell. Scientific Data, 12:1074, 2025. doi: 10.1038/s41597-025-05452-4
-
[41]
Spine endoscopic atlas: an open-source dataset for surgical instrument segmentation
Zhipeng Xu et al. Spine endoscopic atlas: an open-source dataset for surgical instrument segmentation. Scientific Data, 12: 1611, 2025. doi: 10.1038/s41597-025-05897-7
-
[42]
A multi-modal dental dataset for semi-supervised deep learning image segmentation
Yaqi Wang et al. A multi-modal dental dataset for semi-supervised deep learning image segmentation. Scientific Data, 12:117, 2025. doi: 10.1038/s41597-024-04306-9
-
[44]
TOM500: A multi-organ annotated orbital MRI dataset for thyroid eye disease
Haiyang Zhang et al. TOM500: A multi-organ annotated orbital MRI dataset for thyroid eye disease. Scientific Data, 12:60, 2025. doi: 10.1038/s41597-025-04427-9
-
[46]
William Ndzimbong et al. TRUSTED: The paired 3D transabdominal ultrasound and CT human data for kidney segmentation and registration research. Scientific Data, 12:615, 2025. doi: 10.1038/s41597-025-04467-1
-
[47]
A multi-modal pelvic MRI dataset for deep learning-based pelvic organ segmentation in endometriosis
Xiaomin Liang et al. A multi-modal pelvic MRI dataset for deep learning-based pelvic organ segmentation in endometriosis. Scientific Data, 12:1292, 2025. doi: 10.1038/s41597-025-05623-3
-
[48]
U-mamba: Enhancing long-range dependency for biomedical image segmentation
Jun Ma, Feifei Li, and Bo Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024
-
[49]
Jason Priem, Heather Piwowar, and Richard Orr. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833, 2022
-
[50]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002
2002
-
[51]
Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3): 203–230, 2011
2011
-
[52]
Mastering the game of Go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017
2017
-
[53]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023
2023
-
[54]
DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024
2024
-
[55]
Automated 3D segmentation of kidneys and tumors in MICCAI KiTS 2023 challenge
Andriy Myronenko, Dong Yang, Yufan He, and Daguang Xu. Automated 3D segmentation of kidneys and tumors in MICCAI KiTS 2023 challenge. In KiTS@MICCAI, 2023
2023
-
[56]
MONAI: An open-source framework for deep learning in healthcare
M. Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murray, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022
2022
-
[57]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
2021
-
[58]
SWE-bench: Can language models resolve real-world GitHub issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024
2024
-
[59]
SWE-agent: Agent-computer interfaces enable automated software engineering
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024
2024
-
[60]
OpenHands: An open platform for AI software developers as generalist agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations, 2025
2025
-
[61]
MLR-copilot: Autonomous machine learning research based on large language models agents
Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. MLR-copilot: Autonomous machine learning research based on large language models agents. arXiv preprint arXiv:2408.14033, 2024
-
[62]
CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation
Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Daniel S. Weld, and Peter Clark. CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation. In Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[63]
DeepScientist: Advancing frontier-pushing scientific findings progressively
Yixuan Weng, Minjun Zhu, Qiujie Xie, QiYao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. DeepScientist: Advancing frontier-pushing scientific findings progressively. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[64]
ResearchAgent: Iterative research idea generation over scientific literature with large language models
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In North American Chapter of the Association for Computational Linguistics, 2024
2024
-
[65]
Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system
Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system. In Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[66]
EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lu Zhou, and Xiaohu Yan. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. arXiv preprint, 2026
2026
-
[67]
IdeaBench: Benchmarking large language models for research idea generation
Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, and Aidong Zhang. IdeaBench: Benchmarking large language models for research idea generation. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025
2025
-
[68]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024
2024
-
[69]
Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. arXiv preprint arXiv:2505.13259, 2025
-
[70]
Towards a medical AI scientist
Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, and Yixuan Yuan. Towards a medical AI scientist. arXiv preprint, 2026
2026
-
[71]
OpenLens AI: Fully autonomous research agent for health informatics
Yuxiao Cheng and Jinli Suo. OpenLens AI: Fully autonomous research agent for health informatics. arXiv preprint arXiv:2509.14778, 2025
-
[72]
DORA AI Scientist: Multi-agent virtual research team for scientific exploration discovery and automated report generation
Vladimir Naumov, Diana Zagirova, Sha Lin, Yupeng Xie, Wenhao Gou, Anatoly Urban, Nina Tikhonova, Khadija M. Alawi, Mike Durymanov, Filip Galkin, et al. DORA AI Scientist: Multi-agent virtual research team for scientific exploration discovery and automated report generation. bioRxiv preprint 2025.03.06.641840, 2025
2025
-
[73]
SpatialAgent: An autonomous AI agent for spatial biology
Hanchen Wang, Yichun He, Paula Coelho, Massimo Bucci, Asma Nazir, Bo Chen, Loi Trinh, Serena Zhang, Kexin Huang, et al. SpatialAgent: An autonomous AI agent for spatial biology. bioRxiv preprint 2025.04.03.646459, 2025
2025
-
[74]
PharmAgents: Building a virtual pharma with large language model agents
Bowen Gao, Yanwen Huang, Yiqiao Liu, Wenxuan Xie, Weiying Ma, Ya-Qin Zhang, and Yanyan Lan. PharmAgents: Building a virtual pharma with large language model agents. arXiv preprint arXiv:2503.22164, 2025
-
[75]
Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for LLMs
Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko. Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for LLMs. arXiv preprint, 2026
2026
-
[76]
Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory
Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, and Huaxiu Yao. Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory. arXiv preprint, 2026
2026
-
[77]
Bilevel autoresearch: Meta-autoresearching itself
Yao Qu and Meng Lu. Bilevel autoresearch: Meta-autoresearching itself. arXiv preprint, 2026
2026
-
[78]
nnU-Net revisited: A call for rigorous validation in 3D medical image segmentation
Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, and Paul F. Jaeger. nnU-Net revisited: A call for rigorous validation in 3D medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024
2024
-
[79]
NAS-Unet: Neural architecture search for medical image segmentation
Yu Weng, Tianbao Zhou, Yujie Li, and Xiaoyu Qiu. NAS-Unet: Neural architecture search for medical image segmentation. IEEE Access, 7:44247–44257, 2019
2019
-
[80]
V-NAS: Neural architecture search for volumetric medical image segmentation
Zhuotun Zhu, Chenxi Liu, Dong Yang, Alan Yuille, and Daguang Xu. V-NAS: Neural architecture search for volumetric medical image segmentation. In International Conference on 3D Vision, pages 240–248, 2019
2019