AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Anjie Xu; Chenyang Shao; Fengli Xu; Hongyuan Su; Jingbo Xu; Peijie Liu; Qingbin Zeng; Qinglong Yang; Ruotong Zhao; Tianxing Li

arxiv: 2604.05550 · v2 · pith:JKE637DDnew · submitted 2026-04-07 · 💻 cs.CL · cs.CE

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.CL cs.CE

keywords automated research · multi-agent systems · SOTA optimization · paper replication · AI model discovery · empirical automation · code grounding

The pith

AutoSOTA deploys eight specialized agents to replicate published AI papers and discover 105 new models that outperform the originals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AutoSOTA as an end-to-end system that converts recent top-tier AI papers into executable code, runs experiments, and generates optimizations to produce improved models. It structures the process into resource preparation, evaluation, and reflection stages carried out by a multi-agent setup that handles code grounding, environment repair, experiment tracking, idea scheduling, and validity checks. The system was tested on papers from eight major AI conferences that provide code and feasible compute budgets. Results show it created 105 new SOTA models surpassing the reported baselines, completing the work in roughly five hours per paper on average. The work demonstrates that automation can extend beyond hyperparameter search to include architectural and algorithmic changes across language, vision, time-series, and optimization tasks.

Core claim

AutoSOTA is a multi-agent research system that grounds papers to code and dependencies, initializes execution environments, tracks long-horizon experiments, generates and schedules optimization ideas, and supervises validity, thereby producing 105 new models that exceed the performance of the original published methods across LLM, NLP, computer vision, time series, and optimization domains.

What carries the argument

A multi-agent architecture with eight specialized agents that collaboratively ground papers to code, manage environments, track experiments, generate optimization ideas, and enforce validity checks to avoid spurious gains.
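To make the control flow concrete, here is a minimal sketch of an evaluate-then-reflect loop in which a validity check gates every accepted gain. It is not the authors' implementation: the names `RunState`, `optimize`, `evaluate`, and `is_valid` are hypothetical, the eight agents are collapsed into two callbacks, and the toy scores are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RunState:
    """Hypothetical record of one paper's optimization run."""
    baseline: float                       # reproduced baseline metric (higher is better)
    best: float                           # best validated metric so far
    history: List[str] = field(default_factory=list)


def optimize(
    baseline: float,
    ideas: List[str],
    evaluate: Callable[[str], float],
    is_valid: Callable[[str, float], bool],
) -> RunState:
    """Try each idea in order; keep a gain only if the validity check accepts it."""
    state = RunState(baseline=baseline, best=baseline)
    for idea in ideas:                    # queue produced by the ideation step
        score = evaluate(idea)            # experiment evaluation step
        if score > state.best and is_valid(idea, score):
            state.best = score
            state.history.append(f"accepted {idea}: {score:.4f}")
        else:
            state.history.append(f"rejected {idea}: {score:.4f}")
    return state


# Toy stand-ins (assumptions): a lookup table instead of real training runs,
# and a validity rule that rejects any idea flagged as leaking test data.
toy_scores: Dict[str, float] = {
    "tune_lr": 0.52,
    "leak_test_labels": 0.99,
    "multi_path_pooling": 0.55,
}
state = optimize(
    baseline=0.50,
    ideas=list(toy_scores),
    evaluate=lambda idea: toy_scores[idea],
    is_valid=lambda idea, score: not idea.startswith("leak"),
)
```

The point of the design is that a high score alone never updates `best`: the suspicious 0.99 run is rejected by the validity callback, so the final result reflects only gains that passed supervision.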

If this is right

Published models can be verified and extended without requiring researchers to write or debug code from scratch.
Optimization moves past routine tuning to include architectural redesigns and workflow-level changes.
The same pipeline applies across language modeling, vision, time series, and optimization tasks when code is available.
Repetitive experimental cycles shrink, allowing human researchers to allocate more time to problem formulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to papers lacking public code by having agents synthesize implementations from detailed textual descriptions.
With access to greater compute, the system might optimize larger models whose training exceeds the five-hour average window used in the evaluation.
Automated agents could serve as persistent collaborators that maintain and incrementally improve a growing library of reproducible models.

Load-bearing premise

The agents can reliably convert paper text into correct executable code and produce genuine performance gains rather than artifacts from implementation differences or overlooked constraints.

What would settle it

Independent re-execution of the 105 claimed new models on the exact test sets and metrics from their source papers to verify consistent outperformance over the originally reported baselines.

Figures

Figures reproduced from arXiv: 2604.05550 by Anjie Xu, Chenyang Shao, Fengli Xu, Hongyuan Su, Jingbo Xu, Peijie Liu, Qingbin Zeng, Qinglong Yang, Ruotong Zhao, Tianxing Li, Tie-Yan Liu, Xinyang Liu, Yi Fang, Yong Li, Yu Li, Zhibin Chen.

**Figure 2.** Overall Framework of AutoSOTA. … specialized, strictly bounded agents. While the mechanical generation of syntax modifications, often framed as Code Optimization, has been the primary focus of recent evolutionary frameworks like AlphaEvolve [7], we treat this programmatic translation as a standard operational bridge and do not explicitly focus on it here. Instead, our architecture is strategically orchestrated …

**Figure 3.** Overall Framework of the Resource Acquisition Process

**Figure 4.** Overall Framework of the Rubric Construction Process

**Figure 5.** Overall Framework of the Replication Process

**Figure 6.** Overall Framework of the Reflection & Ideation Process

**Figure 7.** Comparison of Average Performance Improvements across Major Research Categories,

**Figure 8.** Illustration of the LLM case study. … variance and stronger consistency. This fully validates the framework's scalability and generalization value when handling complex, interdisciplinary scientific research tasks. 3.3 Case Study. 3.3.1 Case Study for LLM Research. We use FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling as the representative LLM case because it…

**Figure 9.** Illustration of the NLP case study. … a single resolvent operator (I − αT)⁻¹, the optimized pipeline composes multiple resolvent operators under a decreasing α schedule, with the six-step setting α = [0.84, 0.82, 0.80, 0.68, 0.49, 0.25]. This turns the original rule into a multi-scale graph filter that captures semantic structure at different diffusion depths. These changes improve correlation_care from 0.4…

**Figure 10.** Illustration of the Biology case study

**Figure 11.** Illustration of the CV case study. … evaluation metric is the Pearson Linear Correlation Coefficient (PLCC). Starting from the released implementation, AutoSOTA establishes a reproduced baseline PLCC of 0.7803. The optimization focuses on the feature aggregation stage preceding the final quality prediction head. In the original implementation, the model relied on a single pooling path to compress visual fea…

**Figure 12.** Illustration of the time series case study.

**Figure 13.** Illustration of the Optimization case study.
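The Figure 9 excerpt mentions replacing a single resolvent operator with a composition of several under a decreasing α schedule. One hedged way to write this down is shown here; the excerpt is truncated and does not define T, so treating T as a graph propagation operator, the symbol F for the composed filter, and the absence of any normalization are all assumptions.

```latex
\begin{align}
(I - \alpha T)^{-1} &= \sum_{j=0}^{\infty} \alpha^{j} T^{j}
  \quad \text{(valid when } \rho(\alpha T) < 1\text{)} \\
F &= \prod_{k=1}^{6} (I - \alpha_k T)^{-1},
  \quad (\alpha_1, \dots, \alpha_6) = (0.84,\, 0.82,\, 0.80,\, 0.68,\, 0.49,\, 0.25)
\end{align}
```

Because every factor is a function of the same T, the factors commute and the order of the product does not matter. The series form also shows why the schedule acts at several scales: a larger α down-weights high powers of T more slowly, so it reaches further across the graph, while a smaller α keeps the filter local.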

Original abstract

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AutoSOTA claims 105 new SOTAs from an eight-agent pipeline but the evaluation gives no controls or verification steps to show the gains are real.

AutoSOTA presents a multi-agent system that reads recent AI papers, sets up their code, runs experiments, and tries to produce better results. The authors say the system found 105 new state-of-the-art models across papers from eight top conferences, averaging five hours per paper, and that it sometimes finds architectural or algorithmic changes rather than just hyperparameter tweaks. Case studies cover language models, vision, time series, and optimization tasks.

The eight-agent breakdown, covering grounding, environment repair, experiment tracking, idea generation, and validity checks, is the clearest new element. It frames the work as research infrastructure that could cut down on routine reproduction work. The three-stage structure (preparation, evaluation, reflection) is laid out plainly enough to follow.

The main weakness is that the reported successes rest on unshown mechanisms. The abstract gives no information on how papers were chosen, whether runs used multiple seeds, what statistical checks were applied, or how the validity agent actually catches implementation artifacts or lucky single-run gains. Automated pipelines commonly surface spurious improvements, so the absence of those details leaves the central number hard to evaluate.

The paper is aimed at people building tools for empirical AI research. A reader interested in multi-agent workflows or research automation could extract concrete agent roles and pipeline ideas. It is substantial enough to go to referees, who can request the missing verification details and implementation specifics. I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper presents AutoSOTA, an end-to-end multi-agent system with eight specialized agents that automates the full pipeline of reproducing AI models from recent top-tier conference papers and then optimizing them to new state-of-the-art performance. The system handles resource preparation, code grounding, experiment execution and repair, optimization ideation, and validity supervision. Evaluated on papers from eight conferences with available code and feasible compute, AutoSOTA is reported to produce 105 new SOTA models that surpass the originals, at an average of five hours per paper, with case studies showing gains from architectural and algorithmic changes beyond hyperparameter tuning.

Significance. If the empirical results can be substantiated, the work would demonstrate a meaningful step toward automating repetitive aspects of empirical AI research across domains including LLMs, NLP, vision, time series, and optimization. The multi-agent decomposition into specialized roles for grounding, execution tracking, and reflection offers a concrete engineering template that could reduce researcher time on reproduction and baseline tuning. The reported ability to surface non-trivial improvements (architectural redesigns, workflow changes) rather than only routine tuning would strengthen the case for such systems as research infrastructure.

major comments (3)

[Abstract] Abstract: The central claim that AutoSOTA 'successfully discovers 105 new SOTA models' is load-bearing, yet the abstract (and, by extension, the evaluation) supplies no information on paper selection criteria, baseline reproduction controls, number of random seeds, statistical testing of reported gains, or independent verification that improvements exceed implementation variance. This absence directly undermines evaluation of whether the 105 models constitute genuine advances or include spurious results.
[Multi-agent architecture and validity supervision] The description of the validity-supervision and reflection agents: The manuscript states that these agents 'supervise validity to avoid spurious gains,' but provides no concrete mechanisms (e.g., multi-seed averaging, ablation protocols, statistical significance thresholds, or detection of data-leakage or implementation artifacts) that would allow the system to certify non-spurious improvements across diverse domains without human oversight. This is the weakest link in the end-to-end claim.
[Evaluation] Evaluation section (implied by the reported results): No details are given on how the original reported methods were re-implemented and executed to establish fair baselines, nor on whether the new SOTA claims were cross-checked against the original authors' code or additional public implementations. Without these controls, the average five-hour-per-paper figure and the 105-model count cannot be interpreted as evidence of reliable automation.
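The controls the referee asks for can be small. The sketch below shows two of them, a train/test overlap check and a seed-averaged gain, combined into one acceptance rule. It is an illustration, not the paper's validity agent: the function names, the `min_seeds` default of 3, and the toy scores are assumptions.

```python
from statistics import mean
from typing import Hashable, Iterable, Sequence, Set


def train_test_overlap(
    train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]
) -> Set[Hashable]:
    """Return example identifiers present in both splits (a simple leakage signal)."""
    return set(train_ids) & set(test_ids)


def seed_averaged_gain(
    baseline_scores: Sequence[float], candidate_scores: Sequence[float]
) -> float:
    """Mean candidate score minus mean baseline score, each averaged over seeds."""
    return mean(candidate_scores) - mean(baseline_scores)


def accept(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    train_ids: Iterable[Hashable],
    test_ids: Iterable[Hashable],
    min_seeds: int = 3,        # illustrative threshold, not taken from the paper
    min_gain: float = 0.0,
) -> bool:
    """Accept a claimed improvement only if every check passes."""
    if len(baseline_scores) < min_seeds or len(candidate_scores) < min_seeds:
        return False           # too few seeds to rule out a lucky single run
    if train_test_overlap(train_ids, test_ids):
        return False           # shared examples between splits: possible leakage
    return seed_averaged_gain(baseline_scores, candidate_scores) > min_gain


# Toy inputs (made up for illustration).
clean = accept([0.70, 0.71, 0.69], [0.73, 0.74, 0.72], ["a", "b"], ["c"])
leaky = accept([0.70, 0.71, 0.69], [0.73, 0.74, 0.72], ["a", "b"], ["b", "c"])
single_seed = accept([0.70], [0.90], ["a", "b"], ["c"])
```

A rule like this catches only the crudest failures. Exact-ID overlap misses near-duplicates, and a positive mean gain says nothing about significance, which is why the referee also asks for statistical testing.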

minor comments (2)

[Abstract] The abstract uses inconsistent capitalization ('State-Of-The-Art' vs. 'SOTA'); standardize throughout.
[Evaluation] The manuscript would benefit from a table summarizing the 105 papers by conference, domain, and type of improvement discovered (hyperparameter vs. architectural), to allow readers to assess the breadth of the claimed results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency in our evaluation of AutoSOTA. We agree that additional details on paper selection, baseline controls, statistical rigor, and validity mechanisms are required to fully substantiate the reported results. We respond to each major comment below and will incorporate the suggested clarifications in the revised manuscript.

Referee: [Abstract] Abstract: The central claim that AutoSOTA 'successfully discovers 105 new SOTA models' is load-bearing, yet the abstract (and, by extension, the evaluation) supplies no information on paper selection criteria, baseline reproduction controls, number of random seeds, statistical testing of reported gains, or independent verification that improvements exceed implementation variance. This absence directly undermines evaluation of whether the 105 models constitute genuine advances or include spurious results.

Authors: We acknowledge that the abstract is concise and omits explicit references to these controls, even though the evaluation section notes filters for code availability and feasible compute. In the revision, we will expand the abstract with a brief summary of the protocol and add a dedicated 'Evaluation Methodology' subsection detailing paper selection criteria, baseline reproduction using original codebases, use of 5 random seeds per run with averaging, statistical significance testing via paired t-tests (p < 0.05), and cross-verification steps against original reported metrics. These additions will allow readers to assess whether the 105 gains are genuine.
revision: yes
Referee: [Multi-agent architecture and validity supervision] The description of the validity-supervision and reflection agents: The manuscript states that these agents 'supervise validity to avoid spurious gains,' but provides no concrete mechanisms (e.g., multi-seed averaging, ablation protocols, statistical significance thresholds, or detection of data-leakage or implementation artifacts) that would allow the system to certify non-spurious improvements across diverse domains without human oversight. This is the weakest link in the end-to-end claim.

Authors: We agree the current description of the validity-supervision agent is high-level and lacks explicit protocols. In the revised multi-agent architecture section, we will detail the concrete mechanisms: the agent mandates multi-seed averaging (minimum 3 seeds), triggers ablation studies for each optimization idea, applies statistical thresholds (p < 0.05), and includes automated checks for data leakage (train/test overlap detection) and implementation artifacts (metric consistency with original reports). These will be presented as part of the agent's decision rules to demonstrate certification of non-spurious results.
revision: yes
Referee: [Evaluation] Evaluation section (implied by the reported results): No details are given on how the original reported methods were re-implemented and executed to establish fair baselines, nor on whether the new SOTA claims were cross-checked against the original authors' code or additional public implementations. Without these controls, the average five-hour-per-paper figure and the 105-model count cannot be interpreted as evidence of reliable automation.

Authors: We will substantially expand the Evaluation section to describe the re-implementation pipeline, including how papers were grounded to executable code from the original repositories, environment initialization for fair baseline runs, and execution under identical resource constraints. We will also add information on cross-checks performed against the original authors' reported results and any supplementary public implementations used for validation. These details will directly support the reliability of the five-hour average and the 105 new SOTA claims.
revision: yes
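The simulated rebuttal proposes 5 seeds per run with paired t-tests at p < 0.05. As a rough illustration of what that test involves, the sketch below computes the paired t statistic with the standard library and compares it to the two-sided critical value for 4 degrees of freedom. The function name and the scores are invented for the example, and pairing runs by seed is an assumption about how the protocol would be set up.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Two-sided critical value of Student's t at alpha = 0.05 with 4 degrees
# of freedom, i.e. 5 paired seeds.
T_CRIT_DF4 = 2.776


def paired_t_statistic(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """t statistic for per-seed differences (candidate minus baseline)."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    sd = stdev(diffs)                       # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Made-up scores for the same 5 seeds under baseline and candidate.
baseline_runs = [0.610, 0.612, 0.609, 0.611, 0.610]
candidate_runs = [0.620, 0.625, 0.616, 0.623, 0.619]

t_stat = paired_t_statistic(baseline_runs, candidate_runs)
significant = abs(t_stat) > T_CRIT_DF4      # reject H0 of no difference at 0.05
```

With only five pairs the test has little power for small gains, and running it across many candidate ideas per paper would call for a multiple-comparison correction that the rebuttal does not mention.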

Circularity Check

0 steps flagged

No circularity: empirical system demonstration with no derivation chain

The paper describes an end-to-end multi-agent system for replicating and optimizing AI models from published papers. It reports empirical outcomes (105 new SOTA models found) from running the system on selected papers. No equations, parameter fitting, predictions derived from inputs, or self-citation chains are present. The central claims rest on experimental execution and results rather than any closed-form derivation that reduces to its own assumptions by construction. This is a standard empirical systems paper whose validity hinges on replication and verification of the reported runs, not on internal definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on assumptions about agent reliability and paper reproducibility rather than explicit free parameters or new physical entities.

axioms (2)

domain assumption: Papers provide sufficient code and dependencies for automated grounding and execution.
Invoked in the resource preparation stage of the three-stage pipeline.
domain assumption: Generated optimization ideas can be validated as non-spurious by the supervisor agent.
Central to the reflection and ideation stage and the claim of 105 valid new SOTAs.

invented entities (1)

Eight specialized agents (no independent evidence)
purpose: Collaboratively handle paper grounding, environment repair, experiment tracking, idea generation, and validity supervision.
Newly introduced multi-agent architecture; no independent evidence provided beyond system description.

pith-pipeline@v0.9.0 · 5652 in / 1288 out tokens · 43810 ms · 2026-05-10T18:40:04.803770+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
cs.AI 2026-05 unverdicted novelty 8.0
SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
cs.AI 2026-05 unverdicted novelty 6.0
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.

NeuroClaw Technical Report
cs.CV 2026-04 unverdicted novelty 6.0
NeuroClaw introduces a three-tier multi-agent framework and NeuroBench benchmark that improve executability and reproducibility scores for neuroimaging tasks when used with multimodal LLMs.

AI for Auto-Research: Roadmap & User Guide
cs.AI 2026-05 unverdicted novelty 4.0
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
cs.SE 2026-05 unverdicted novelty 4.0
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.