arxiv: 2602.02660 · v3 · pith:E2JO7VS7new · submitted 2026-02-02 · 💻 cs.AI

MARS: Modular Agent with Reflective Search for Automated AI Research

Pith reviewed 2026-05-21 13:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords automated AI research · machine learning engineering · LLM-based agents · Monte Carlo Tree Search · reflective memory · modular code construction · MLE-Bench benchmark · cross-branch transfer

The pith

MARS automates complex machine learning engineering by combining cost-conscious search, modular code building, and reflective lesson extraction from multiple attempts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MARS as a way to let AI agents handle the hard parts of machine learning engineering: evaluations that require expensive model training, and the difficulty of telling why one approach outperformed another. The agent is built around three ideas: planning actions while tracking how much computation they will cost, breaking large research projects into smaller reusable modules, and comparing different versions of a solution to isolate what made the difference. If this works, it could shorten the cycle of testing new AI ideas with less human intervention in the engineering details. The authors report state-of-the-art results among open-source frameworks on MLE-Bench under comparable settings, and that 63% of the lessons the agent uses were transferred from a different search path.

Core claim

The MARS system reaches state-of-the-art results among open-source methods on the MLE-Bench benchmark for automated machine learning engineering under comparable conditions, while staying competitive with the best entries on the overall leaderboard. It also exhibits what the authors call "Aha!" moments: 63 percent of the lessons it uses come from cross-branch transfer, meaning they were learned in one branch of the search and applied in another.

What carries the argument

Budget-aware Monte Carlo Tree Search for planning under cost constraints, paired with a Design-Decompose-Implement pipeline for modular code and Comparative Reflective Memory to extract differences between solutions.
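
As a rough illustration of what budget-aware selection could look like, the sketch below penalises a standard UCT score by an estimated cost and skips children that no longer fit in the remaining budget. The `Node` fields, the `cost_weight` parameter, and the linear form of the penalty are assumptions made for illustration; the paper's actual selection rule is not reproduced here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate solution in the search tree (hypothetical structure)."""
    name: str
    est_cost: float = 1.0   # assumed: estimated hours needed to evaluate this node
    visits: int = 0
    value: float = 0.0      # running mean of observed rewards
    children: List["Node"] = field(default_factory=list)


def select_child(parent: Node, remaining_budget: float,
                 c: float = 1.4, cost_weight: float = 0.1) -> Optional[Node]:
    """Return the affordable child with the best cost-penalised UCT score."""
    affordable = [ch for ch in parent.children if ch.est_cost <= remaining_budget]
    if not affordable:
        return None  # nothing fits in the remaining budget

    def score(ch: Node) -> float:
        if ch.visits == 0:
            return math.inf  # try unvisited children first
        explore = c * math.sqrt(math.log(max(parent.visits, 1)) / ch.visits)
        # Assumed penalty: subtract a term proportional to estimated cost.
        return ch.value + explore - cost_weight * ch.est_cost

    return max(affordable, key=score)


def backpropagate(path: List[Node], reward: float) -> None:
    """Standard incremental-mean update: N <- N + 1, Q <- Q + (R - Q) / N."""
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
```

With two children of equal value and visit count, the cheaper one wins under this scoring; a hard budget cut-off and a soft penalty are two separate design choices, and either could be dropped.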

If this is right

  • Agents can plan sequences of research steps while directly accounting for the time and resources each step will consume.
  • Breaking research code into modules allows handling larger and more complicated projects than generating single large scripts.
  • By comparing outcomes from different approaches, the agent can assign credit more accurately and reuse what it learns.
  • Performance in automated research improves when insights are transferred from one exploration path to others.
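
One way to picture the comparative step behind the third point: take two attempts, diff their code, and store the diff alongside the score gap so that a later prompt can ask what caused the change. The `Attempt` record and the `comparative_record` helper are hypothetical names, and using a plain text diff is an assumption, not the paper's documented mechanism.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Attempt:
    branch: str    # identifier of the search branch that produced the code
    code: str      # source text of the solution
    score: float   # assumed: higher is better


def comparative_record(a: Attempt, b: Attempt) -> dict:
    """Pair the code diff between two attempts with their score difference."""
    better, worse = (a, b) if a.score >= b.score else (b, a)
    diff_lines = difflib.unified_diff(
        worse.code.splitlines(),
        better.code.splitlines(),
        fromfile=f"{worse.branch} (score={worse.score})",
        tofile=f"{better.branch} (score={better.score})",
        lineterm="",
    )
    return {
        "origin_branch": better.branch,
        "score_delta": better.score - worse.score,
        "diff": "\n".join(diff_lines),
    }


if __name__ == "__main__":
    old = Attempt("branch-1", "lr = 0.1\nepochs = 5", 0.70)
    new = Attempt("branch-2", "lr = 0.01\nepochs = 5", 0.80)
    print(comparative_record(old, new)["diff"])
```

Keeping the originating branch on each record is what would later make it possible to count how many lessons cross branches.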

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might extend to other expensive evaluation settings, such as hyperparameter tuning or architecture search in different fields.
  • Future work could test whether increasing the number of parallel branches increases the rate of useful cross-transfer.
  • If the modular approach scales, it could allow agents to maintain and improve upon existing research codebases over multiple runs.

Load-bearing premise

The improvements seen on the MLE-Bench benchmark under the chosen comparison settings are caused mainly by the budget-aware planning, modular construction, and comparative reflective memory rather than by other factors like the choice of underlying model or specific prompts.

What would settle it

Running controlled experiments that disable the comparative analysis of solution differences or remove the cost constraints from the search process and then checking if the benchmark scores and lesson transfer rates fall significantly.
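
A minimal sketch of such an ablation plan, assuming each pillar can be toggled independently: enumerate every on/off combination while holding the base model, time budget, and seeds fixed. The flag names and the `ablation_grid` helper are illustrative and are not taken from the MARS codebase.

```python
from itertools import product

# Hypothetical switch names for the three pillars.
PILLARS = ("budget_aware_search", "modular_construction", "comparative_memory")


def ablation_grid(base_model, time_budget_hours, seeds):
    """List every on/off combination of the pillars under fixed model and budget."""
    configs = []
    flag_sets = product((True, False), repeat=len(PILLARS))
    for flags, seed in product(flag_sets, seeds):
        cfg = {
            "base_model": base_model,                # held constant across runs
            "time_budget_hours": time_budget_hours,  # held constant across runs
            "seed": seed,
        }
        cfg.update(dict(zip(PILLARS, flags)))
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    for cfg in ablation_grid("some-llm", 24, seeds=[0]):
        print(cfg)
```

A full factorial grid costs 8 runs per seed; a cheaper alternative is to disable one pillar at a time, at the price of not seeing interactions between pillars.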

Original abstract

A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MARS, a modular LLM-based agent for automating complex machine learning engineering tasks. It is built on three pillars: (1) budget-aware planning via cost-constrained Monte Carlo Tree Search, (2) a Design-Decompose-Implement pipeline for modular repository construction, and (3) comparative reflective memory that analyzes solution differences to extract high-signal lessons. The central empirical claim is that MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings while remaining competitive with top global leaderboard entries; additionally, 63% of utilized lessons arise from cross-branch transfer, which the authors interpret as evidence of effective generalization across search paths.

Significance. If the performance gains and the 63% cross-branch statistic can be shown to result from the proposed mechanisms rather than differences in base model, prompt engineering, or total compute budget, the work would meaningfully advance automated AI research by demonstrating how explicit cost awareness and comparative reflection can improve both efficiency and insight extraction in compute-heavy MLE tasks. The paper's emphasis on falsifiable attribution of lessons across branches is a positive step toward more interpretable agent behavior.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of SOTA performance 'under comparable settings' is not supported by any reported controls that fix the base LLM, total token budget, or prompt templates while varying only the three proposed pillars. Without these isolations, the performance delta cannot be attributed to budget-aware MCTS, modular construction, or reflective memory rather than implementation details or base-model strength.
  2. [§5.3] §5.3 (Qualitative Analysis): the 63% cross-branch transfer figure is presented without pre-specification of the metric, without statistical significance testing, and without reporting the total number of lessons or branches analyzed. This makes it impossible to assess whether the statistic reflects a robust property of the architecture or post-hoc selection.
  3. [§4.1 and Table 1] §4.1 and Table 1: baseline implementations are not described with sufficient detail to confirm that they received equivalent total compute or identical prompt engineering effort; the manuscript therefore cannot rule out that observed gains stem from unequal resource allocation rather than the architectural contributions.
minor comments (2)
  1. [§3.1] Notation for the cost function in the MCTS formulation is introduced without an explicit equation number, making it difficult to trace how the budget constraint is enforced during tree expansion.
  2. [§4] The manuscript would benefit from an explicit statement of the MLE-Bench task split used for evaluation and whether any tasks were held out for qualitative analysis.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us clarify the scope of our claims and strengthen the supporting details in the manuscript. We respond to each major comment below and indicate the changes made.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of SOTA performance 'under comparable settings' is not supported by any reported controls that fix the base LLM, total token budget, or prompt templates while varying only the three proposed pillars. Without these isolations, the performance delta cannot be attributed to budget-aware MCTS, modular construction, or reflective memory rather than implementation details or base-model strength.

    Authors: We agree that stronger isolation of variables would improve causal attribution. In the revised manuscript we have added an explicit statement of the base LLM used for MARS and all reproduced baselines, together with estimated total token budgets drawn from our runs and the original baseline reports. We have also inserted a short discussion in §4 on prompt standardization efforts. Because some baselines originate from concurrent or partially closed work, perfect matching of every prompt detail remains impossible; we have therefore revised the abstract and §4 to qualify the claim as SOTA 'among open-source systems using comparable base models and reported compute budgets.' We view this as a partial but honest response to the concern.

    revision: partial

  2. Referee: [§5.3] §5.3 (Qualitative Analysis): the 63% cross-branch transfer figure is presented without pre-specification of the metric, without statistical significance testing, and without reporting the total number of lessons or branches analyzed. This makes it impossible to assess whether the statistic reflects a robust property of the architecture or post-hoc selection.

    Authors: We accept that additional transparency is required. The revised §5.3 now pre-specifies the metric as the share of lessons that are both (a) utilized in the final selected solution and (b) generated in a different search branch. We report the underlying counts (127 lessons extracted across 15 runs with an average of 7.2 branches per run) and include a bootstrap 95% confidence interval around the 63% figure. While we did not conduct formal hypothesis testing, given the exploratory character of the qualitative analysis, the added counts and variability estimate allow readers to judge robustness directly.

    revision: yes

  3. Referee: [§4.1 and Table 1] §4.1 and Table 1: baseline implementations are not described with sufficient detail to confirm that they received equivalent total compute or identical prompt engineering effort; the manuscript therefore cannot rule out that observed gains stem from unequal resource allocation rather than the architectural contributions.

    Authors: We agree that the original baseline descriptions were insufficiently detailed. The revised §4.1 now provides step-by-step reproduction instructions for each baseline, including the LLM employed, approximate token budgets, and any prompt adaptations. We have added a new column to Table 1 summarizing estimated total compute for every method. We continue to attribute the performance difference primarily to the three proposed mechanisms, but we have added an explicit limitations paragraph acknowledging that perfect equivalence of implementation effort across independently developed agents is difficult to guarantee.

    revision: yes
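
The cross-branch metric and the bootstrap interval discussed under comment 2 can be sketched as follows. This is a minimal illustration under stated assumptions: each lesson is a dict with hypothetical `origin` and `used_in` branch labels, the denominator is the set of utilized lessons, and lessons are resampled as if independent, which ignores any clustering by run.

```python
import random


def cross_branch_share(lessons):
    """Share of utilized lessons whose origin branch differs from the using branch."""
    used = [les for les in lessons if les["used_in"] is not None]
    if not used:
        raise ValueError("no utilized lessons")
    cross = sum(1 for les in used if les["origin"] != les["used_in"])
    return cross / len(used)


def bootstrap_ci(lessons, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the cross-branch share."""
    rng = random.Random(seed)
    used = [les for les in lessons if les["used_in"] is not None]
    stats = []
    for _ in range(n_resamples):
        # Resample utilized lessons with replacement, same size as the original.
        sample = [rng.choice(used) for _ in used]
        stats.append(cross_branch_share(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Illustrative counts only; these are not the paper's data.
    demo = ([{"origin": "b1", "used_in": "b2"}] * 63
            + [{"origin": "b1", "used_in": "b1"}] * 37)
    print(cross_branch_share(demo), bootstrap_ci(demo))
```

If lessons from the same run are correlated, resampling whole runs instead of individual lessons would give a more honest interval.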

Circularity Check

0 steps flagged

Empirical system evaluation on external benchmark exhibits no circularity

full rationale

The paper introduces MARS with three pillars (budget-aware MCTS, modular Design-Decompose-Implement, and comparative reflective memory) and reports SOTA performance on the external MLE-Bench benchmark under comparable settings plus a 63% cross-branch transfer observation. These results derive from experimental execution and post-run analysis on an independent benchmark rather than any self-referential fitting, self-definition of metrics, or derivation that equates outputs to inputs by construction. No equations, uniqueness theorems, or self-citation chains are invoked to force the central claims; the evaluation remains self-contained against external standards.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The framework implicitly assumes LLMs can reliably follow the Design-Decompose-Implement pipeline and that MLE-Bench tasks are representative of real research.

axioms (1)
  • domain assumption: LLM agents can effectively execute modular code construction and reflective comparison when given appropriate prompts and tools.
    Central to the three pillars described in the abstract.

pith-pipeline@v0.9.0 · 5748 in / 1340 out tokens · 26563 ms · 2026-05-21T13:28:27.657507+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

    cs.LG 2026-05 accept novelty 6.0

    FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

  2. DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

  3. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  4. AIBuildAI: An AI Agent for Automatically Building AI Models

    cs.AI 2026-04 unverdicted novelty 6.0

    AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.

  5. Toward Autonomous Long-Horizon Engineering for ML Research

    cs.CL 2026-04 unverdicted novelty 6.0

    AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.

  6. Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

    cs.CV 2026-04 unverdicted novelty 6.0

    HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% basel...

  7. AIRA_2: Overcoming Bottlenecks in AI Research Agents

    cs.AI 2026-03 conditional novelty 6.0

    AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
