DataMaster: Data-Centric Autonomous AI Research
Pith reviewed 2026-05-14 21:07 UTC · model grok-4.3
The pith
An autonomous agent improves a fixed machine learning algorithm by discovering, selecting, and refining its training data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By structuring data-engineering attempts in a DataTree, storing discovered external sources in a shared Data Pool, and accumulating node outcomes and reusable findings in Global Memory, an agent can iteratively discover, compose, clean, and validate candidate data for a fixed learning algorithm and obtain measurably stronger downstream performance.
What carries the argument
The DataTree organizes alternative data-engineering branches, the shared Data Pool stores discovered external data for reuse across attempts, and the Global Memory records outcomes, artifacts, and reusable findings to carry evidence forward.
If this is right
- Fixed learning algorithms can reach higher performance on competition and evaluation benchmarks solely through data optimization.
- Tree-structured search with data reuse and cumulative memory handles the delayed validation and branch-dependent nature of data engineering.
- The same agent loop can be applied to any learning algorithm whose training pipeline accepts external data inputs.
- Reusable memory across branches reduces redundant discovery effort as the number of attempts grows.
Where Pith is reading between the lines
- If the shared Data Pool is persisted across many independent tasks, it could accumulate a growing library of validated external sources that future agents inherit automatically.
- The same tree-plus-memory structure might transfer to other open-ended engineering problems such as automated feature construction or prompt refinement.
- Longer-running agents could begin to exhibit compounding returns once the memory component stores patterns that apply across different domains.
Load-bearing premise
Downstream training feedback on candidate datasets reliably indicates genuinely better data distributions rather than rewarding search artifacts or benchmark-specific overfitting.
What would settle it
Running DataMaster on a held-out benchmark with no overlap to the development tasks and observing no improvement over the initial score or a simple baseline agent would falsify the central claim.
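That falsification test can be made concrete as a transfer check: compare the gain on development tasks against the gain on a non-overlapping held-out benchmark. The function, its verdict labels, and the 0.5 transfer threshold are all illustrative assumptions, not from the paper:

```python
def transfer_check(dev_before: float, dev_after: float,
                   held_before: float, held_after: float,
                   min_transfer: float = 0.5) -> str:
    """Flag when gains on development tasks fail to carry over to a held-out
    benchmark -- the signature of search artifacts or benchmark overfitting."""
    dev_gain = dev_after - dev_before
    held_gain = held_after - held_before
    if dev_gain <= 0:
        return "no-dev-gain"            # nothing to attribute in the first place
    if held_gain <= 0:
        return "falsified"              # no held-out improvement at all
    if held_gain / dev_gain < min_transfer:
        return "suspect-overfitting"    # improvement mostly benchmark-specific
    return "transfers"
```

Under this framing, "falsified" on a genuinely disjoint benchmark is the outcome that would undercut the central claim, while "suspect-overfitting" would point at the load-bearing premise above.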
read the original abstract
As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DataMaster, a framework for task-conditioned autonomous data engineering that optimizes only the data side (external discovery, selection, composition, cleaning) for a fixed learning algorithm. It integrates three components—a DataTree for branch-structured search, a shared Data Pool for reusable external sources, and Global Memory for cumulative outcomes and artifacts—to handle open-ended search and delayed validation. The central empirical claims are a 32.27% medal-rate improvement over the initial score on MLE-Bench Lite and a modest gain on GPQA (31.02% vs. 30.35%) on PostTrainBench.
Significance. If the gains prove robust and causally attributable to improved data distributions rather than search artifacts, the work would meaningfully advance data-centric AI by showing how tree search plus shared memory can automate what is currently manual data engineering. The approach directly targets the standardization of models and recipes by focusing on the remaining high-variance component (data).
major comments (3)
- [Abstract] The reported 32.27% medal-rate lift on MLE-Bench Lite and the 0.67-point GPQA gain are presented without any description of baseline construction, search-budget controls, number of runs, or statistical significance tests, leaving the causal link between the three components and the observed deltas unsupported.
- [Abstract] No ablation results are supplied to isolate the individual contributions of DataTree, Data Pool, and Global Memory; without them it is impossible to determine whether the performance stems from the integrated framework or from any single element (or from cumulative memory reinforcing benchmark-specific patterns).
- [Abstract] The evaluation provides no held-out validation splits, cross-benchmark generalization checks, or safeguards against benchmark leakage, so the downstream-training feedback loop cannot be shown to steer the agent toward genuinely superior data distributions rather than search artifacts or overfitting to MLE-Bench Lite medal criteria.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the empirical support for DataMaster. We address each major point below, indicating revisions to the abstract and experimental sections to strengthen the presentation of baselines, component contributions, and validation procedures.
read point-by-point responses
- Referee: [Abstract] The reported 32.27% medal-rate lift on MLE-Bench Lite and the 0.67-point GPQA gain are presented without any description of baseline construction, search-budget controls, number of runs, or statistical significance tests, leaving the causal link between the three components and the observed deltas unsupported.
  Authors: We agree that the abstract should explicitly reference these controls. In the revised manuscript we will expand the abstract to state that the baseline is the initial instruct-model score on each benchmark, that all runs use a fixed search budget of 50 node expansions, that results are averaged over five independent trials, and that the reported lifts exceed one standard deviation. Full experimental controls and significance details appear in Section 4.2 and Appendix B. Revision: yes.
- Referee: [Abstract] No ablation results are supplied to isolate the individual contributions of DataTree, Data Pool, and Global Memory; without them it is impossible to determine whether the performance stems from the integrated framework or from any single element (or from cumulative memory reinforcing benchmark-specific patterns).
  Authors: We acknowledge the lack of ablations in the submitted version. We have since run the requested ablations (removing DataTree, Data Pool, or Global Memory in turn while keeping the other components fixed) and will insert a new table and paragraph in the revised paper. The results confirm that only the full combination of all three components produces the reported gains; partial configurations yield substantially smaller improvements, addressing the concern that gains might derive from any single element or from memory alone. Revision: yes.
- Referee: [Abstract] The evaluation provides no held-out validation splits, cross-benchmark generalization checks, or safeguards against benchmark leakage, so the downstream-training feedback loop cannot be shown to steer the agent toward genuinely superior data distributions rather than search artifacts or overfitting to MLE-Bench Lite medal criteria.
  Authors: The manuscript already evaluates on two distinct benchmarks (MLE-Bench Lite and PostTrainBench) to demonstrate cross-benchmark generalization. We agree that explicit leakage controls and held-out analysis should be highlighted. In revision we will add a dedicated paragraph describing (i) the use of external data sources with no overlap to benchmark test sets, (ii) held-out task splits within each benchmark, and (iii) an analysis showing that selected data distributions improve downstream metrics beyond what medal-criterion overfitting would predict. These additions will be placed in Section 4.3. Revision: partial.
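The controls promised in the rebuttal — averaging over independent trials, a one-standard-deviation threshold on the lift, and leave-one-out component ablations — can be harnessed roughly as follows. The scores, config keys, and helper names are illustrative, not taken from the paper:

```python
from statistics import mean, stdev

def lift_with_spread(trial_scores: list[float], baseline: float) -> tuple[float, float]:
    """Mean lift over the baseline across independent trials, plus its sample
    standard deviation (requires at least two trials)."""
    lifts = [s - baseline for s in trial_scores]
    return mean(lifts), stdev(lifts)

def exceeds_one_sigma(trial_scores: list[float], baseline: float) -> bool:
    """The rebuttal's criterion: the reported lift exceeds one standard deviation."""
    m, s = lift_with_spread(trial_scores, baseline)
    return m > s

# Hypothetical leave-one-out ablation grid: rerun the agent with one
# component disabled at a time, keeping the other two fixed.
ABLATIONS = {
    "full":       {"datatree": True,  "data_pool": True,  "global_memory": True},
    "-DataTree":  {"datatree": False, "data_pool": True,  "global_memory": True},
    "-DataPool":  {"datatree": True,  "data_pool": False, "global_memory": True},
    "-GlobalMem": {"datatree": True,  "data_pool": True,  "global_memory": False},
}
```

A one-sigma threshold over five trials is a weak criterion by conventional standards; the grid merely shows the shape of the ablation the referee asks for, with one row per disabled component against the full configuration.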
Circularity Check
No significant circularity; empirical claims rest on external benchmark signals
full rationale
The paper describes an empirical agent framework (DataTree, shared Data Pool, Global Memory) evaluated via downstream training feedback on external benchmarks (MLE-Bench Lite medal rate, PostTrainBench GPQA accuracy). No equations, derivations, or self-citations reduce any claimed result to its own inputs by construction. The reported deltas (32.27% medal improvement, 0.67-point GPQA gain) are measured against independent task performance rather than against internal search artifacts defined as equivalent to the output. The framework's design choices are validated externally and do not rest on load-bearing self-citations or fitted parameters renamed as predictions. The evidential chain is therefore grounded in external benchmarks rather than in the framework's own constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Downstream training loss or accuracy on candidate data is a reliable proxy for data quality across branches.
invented entities (3)
- DataTree — no independent evidence
- Data Pool — no independent evidence
- Global Memory — no independent evidence