MARS: Modular Agent with Reflective Search for Automated AI Research
Pith reviewed 2026-05-21 13:28 UTC · model grok-4.3
The pith
MARS automates complex machine learning engineering by combining cost-conscious search, modular code building, and reflective lesson extraction from multiple attempts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MARS system reaches state-of-the-art results among open-source methods on the MLE-Bench benchmark for automated machine learning engineering under comparable conditions, while staying close to the best entries on the overall leaderboard. It also exhibits what the authors call "Aha!" moments: 63 percent of the lessons it uses originate from cross-branch transfer, meaning they were learned in one branch of the search and applied in another.
What carries the argument
Budget-aware Monte Carlo Tree Search for planning under cost constraints, paired with a Design-Decompose-Implement pipeline for modular code and Comparative Reflective Memory to extract differences between solutions.
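As a rough illustration of what budget-aware selection could look like, the sketch below scores children with a standard UCT-style term minus a cost penalty. The text here does not give MARS's actual scoring rule, so the `Node` fields, `cost_weight`, and the form of the penalty are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical search node; field names are illustrative, not from the paper."""
    name: str
    visits: int
    mean_reward: float     # average validation score observed under this node
    est_cost_hours: float  # assumed estimate of the cost of exploring it further


def budget_aware_score(node, parent_visits, remaining_hours, c=1.4, cost_weight=0.5):
    """UCT-style score minus an assumed cost penalty; unaffordable nodes are excluded."""
    if remaining_hours <= 0 or node.est_cost_hours > remaining_hours:
        return float("-inf")
    explore = c * math.sqrt(math.log(parent_visits + 1) / (node.visits + 1))
    penalty = cost_weight * node.est_cost_hours / remaining_hours
    return node.mean_reward + explore - penalty


def select_child(children, parent_visits, remaining_hours):
    """Pick the child with the highest budget-aware score."""
    return max(children, key=lambda n: budget_aware_score(n, parent_visits, remaining_hours))


children = [
    Node("big_model", visits=3, mean_reward=0.82, est_cost_hours=10.0),
    Node("small_model", visits=3, mean_reward=0.80, est_cost_hours=1.0),
]
chosen = select_child(children, parent_visits=6, remaining_hours=12.0)
print(chosen.name)
```

With equal visit counts the exploration term cancels, so the slightly weaker but much cheaper candidate wins once the penalty is applied. Setting `cost_weight=0` recovers a plain reward-plus-exploration ranking.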
If this is right
- Agents can plan sequences of research steps while directly accounting for the time and resources each step will consume.
- Breaking research code into modules allows handling larger and more complicated projects than generating single large scripts.
- By comparing outcomes from different approaches, the agent can assign credit more accurately and reuse what it learns.
- Performance in automated research improves when insights are transferred from one exploration path to others.
Where Pith is reading between the lines
- The approach might extend to other expensive evaluation settings, such as hyperparameter tuning or architecture search in different fields.
- Future work could test whether increasing the number of parallel branches increases the rate of useful cross-transfer.
- If the modular approach scales, it could allow agents to maintain and improve upon existing research codebases over multiple runs.
Load-bearing premise
The improvements seen on the MLE-Bench benchmark under the chosen comparison settings are caused mainly by the budget-aware planning, modular construction, and comparative reflective memory rather than by other factors like the choice of underlying model or specific prompts.
What would settle it
Running controlled experiments that disable the comparative analysis of solution differences or remove the cost constraints from the search process and then checking if the benchmark scores and lesson transfer rates fall significantly.
Original abstract
A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MARS, a modular LLM-based agent for automating complex machine learning engineering tasks. It is built on three pillars: (1) budget-aware planning via cost-constrained Monte Carlo Tree Search, (2) a Design-Decompose-Implement pipeline for modular repository construction, and (3) comparative reflective memory that analyzes solution differences to extract high-signal lessons. The central empirical claim is that MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings while remaining competitive with top global leaderboard entries; additionally, 63% of utilized lessons arise from cross-branch transfer, which the authors interpret as evidence of effective generalization across search paths.
Significance. If the performance gains and the 63% cross-branch statistic can be shown to result from the proposed mechanisms rather than differences in base model, prompt engineering, or total compute budget, the work would meaningfully advance automated AI research by demonstrating how explicit cost awareness and comparative reflection can improve both efficiency and insight extraction in compute-heavy MLE tasks. The paper's emphasis on falsifiable attribution of lessons across branches is a positive step toward more interpretable agent behavior.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the claim of SOTA performance 'under comparable settings' is not supported by any reported controls that fix the base LLM, total token budget, or prompt templates while varying only the three proposed pillars. Without these isolations, the performance delta cannot be attributed to budget-aware MCTS, modular construction, or reflective memory rather than implementation details or base-model strength.
- [§5.3] §5.3 (Qualitative Analysis): the 63% cross-branch transfer figure is presented without pre-specification of the metric, without statistical significance testing, and without reporting the total number of lessons or branches analyzed. This makes it impossible to assess whether the statistic reflects a robust property of the architecture or post-hoc selection.
- [§4.1 and Table 1] §4.1 and Table 1: baseline implementations are not described with sufficient detail to confirm that they received equivalent total compute or identical prompt engineering effort; the manuscript therefore cannot rule out that observed gains stem from unequal resource allocation rather than the architectural contributions.
minor comments (2)
- [§3.1] Notation for the cost function in the MCTS formulation is introduced without an explicit equation number, making it difficult to trace how the budget constraint is enforced during tree expansion.
- [§4] The manuscript would benefit from an explicit statement of the MLE-Bench task split used for evaluation and whether any tasks were held out for qualitative analysis.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have helped us clarify the scope of our claims and strengthen the supporting details in the manuscript. We respond to each major comment below and indicate the changes made.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of SOTA performance 'under comparable settings' is not supported by any reported controls that fix the base LLM, total token budget, or prompt templates while varying only the three proposed pillars. Without these isolations, the performance delta cannot be attributed to budget-aware MCTS, modular construction, or reflective memory rather than implementation details or base-model strength.
Authors: We agree that stronger isolation of variables would improve causal attribution. In the revised manuscript we have added an explicit statement of the base LLM used for MARS and all reproduced baselines, together with estimated total token budgets drawn from our runs and the original baseline reports. We have also inserted a short discussion in §4 on prompt standardization efforts. Because some baselines originate from concurrent or partially closed work, perfect matching of every prompt detail remains impossible; we have therefore revised the abstract and §4 to qualify the claim as SOTA 'among open-source systems using comparable base models and reported compute budgets.' We view this as a partial but honest response to the concern. revision: partial
-
Referee: [§5.3] §5.3 (Qualitative Analysis): the 63% cross-branch transfer figure is presented without pre-specification of the metric, without statistical significance testing, and without reporting the total number of lessons or branches analyzed. This makes it impossible to assess whether the statistic reflects a robust property of the architecture or post-hoc selection.
Authors: We accept that additional transparency is required. The revised §5.3 now pre-specifies the metric as the share of lessons that are both (a) utilized in the final selected solution and (b) generated in a different search branch. We report the underlying counts (127 lessons extracted across 15 runs with an average of 7.2 branches per run) and include a bootstrap 95% confidence interval around the 63% figure. While we did not conduct formal hypothesis testing—given the exploratory character of the qualitative analysis—the added counts and variability estimate allow readers to judge robustness directly. revision: yes
-
Referee: [§4.1 and Table 1] §4.1 and Table 1: baseline implementations are not described with sufficient detail to confirm that they received equivalent total compute or identical prompt engineering effort; the manuscript therefore cannot rule out that observed gains stem from unequal resource allocation rather than the architectural contributions.
Authors: We agree that the original baseline descriptions were insufficiently detailed. The revised §4.1 now provides step-by-step reproduction instructions for each baseline, including the LLM employed, approximate token budgets, and any prompt adaptations. We have added a new column to Table 1 summarizing estimated total compute for every method. We continue to attribute the performance difference primarily to the three proposed mechanisms, but we have added an explicit limitations paragraph acknowledging that perfect equivalence of implementation effort across independently developed agents is difficult to guarantee. revision: yes
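The cross-branch metric and bootstrap interval described in the response on §5.3 can be sketched as follows. The lesson records, the branch labels, and the 80-of-127 split are assumed values chosen to land near 63%. The rebuttal does not state the exact numerator or denominator, so this only shows the shape of the calculation.

```python
import random

# Hypothetical lesson records: (branch where the lesson was generated,
# branch of the final solution that used it). The 80/47 split is an
# assumed example, not a figure taken from the paper.
lessons = [("b1", "b2")] * 80 + [("b2", "b2")] * 47


def cross_branch_flags(records):
    """Return 1 for a utilized lesson generated in a different branch, else 0."""
    return [1 if origin != used_in else 0 for origin, used_in in records]


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 flags."""
    rng = random.Random(seed)
    n = len(flags)
    means = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


flags = cross_branch_flags(lessons)
share = sum(flags) / len(flags)
lower, upper = bootstrap_ci(flags)
print(f"cross-branch share = {share:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

This version resamples individual lessons and so treats them as independent. If lessons cluster within runs, resampling whole runs would probably give a wider and more honest interval.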
Circularity Check
Empirical system evaluation on external benchmark exhibits no circularity
Full rationale
The paper introduces MARS with three pillars (budget-aware MCTS, modular Design-Decompose-Implement, and comparative reflective memory) and reports SOTA performance on the external MLE-Bench benchmark under comparable settings plus a 63% cross-branch transfer observation. These results derive from experimental execution and post-run analysis on an independent benchmark rather than any self-referential fitting, self-definition of metrics, or derivation that equates outputs to inputs by construction. No equations, uniqueness theorems, or self-citation chains are invoked to force the central claims; the evaluation remains self-contained against external standards.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can effectively execute modular code construction and reflective comparison when given appropriate prompts and tools.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
-
DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% basel...
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
Reference graph
Works this paper leans on
-
[1]
Selection: Starting from the root node $v_0$, the algorithm recursively traverses down the tree by selecting child nodes according to a selection policy, typically aiming to balance exploration ...
Algorithm 1: Monte Carlo Tree Search (MCTS). 1: Input: Task $P$, Time Budget $T$. 2: Output: Best Solution No...
-
[2]
Expansion: Once a leaf node $v_l$ is reached (or a node with unexplored actions), one or more child nodes are added to the tree, representing reachable states from standard actions.
-
[3]
Simulation: From the newly expanded node, a rollout policy (often random or heuristic-based) is executed to simulate a sequence of actions until a terminal state is reached or a resource limit is met. This produces a reward $R$.
-
[4]
Backpropagation: The reward $R$ obtained from the simulation is propagated back up the tree from the leaf to the root. For each node $(s, a)$ traversed during the selection phase, we update the visit count and value estimate as follows:
$$N(s, a) \leftarrow N(s, a) + 1 \tag{7}$$
$$Q(s, a) \leftarrow Q(s, a) + \frac{R - Q(s, a)}{N(s, a)} \tag{8}$$
In our MARS framework, we adapt MCTS to the space of automated ...
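The backpropagation step above (increment the visit count, then move the value estimate toward the reward by a running-mean update) can be sketched in a few lines. The `TreeNode` class and its field names are hypothetical; only the two update rules come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    """Hypothetical tree node holding the two statistics the update touches."""
    parent: Optional["TreeNode"] = None
    visits: int = 0     # N(s, a)
    value: float = 0.0  # Q(s, a)


def backpropagate(leaf, reward):
    """Walk from the leaf to the root, applying the count and value updates."""
    node = leaf
    while node is not None:
        node.visits += 1                                   # N <- N + 1
        node.value += (reward - node.value) / node.visits  # Q <- Q + (R - Q) / N
        node = node.parent


root = TreeNode()
child = TreeNode(parent=root)
leaf = TreeNode(parent=child)

backpropagate(leaf, 1.0)   # a rollout through leaf -> child -> root
backpropagate(child, 0.0)  # a second rollout that ends at child
print(root.visits, root.value)
```

Because the count is incremented before the value update, each `value` is the mean of all rewards that have passed through the node.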
-
[5]
Table 6 | Comparison of leaderboard agents’ setup and our agent’s setup.
... or MARS | Gemini-2.5-Pro or Gemini-3-Pro-Preview | 1 A100 GPU 40GB, 12 vCPUs, 220 GB of RAM, 24-hour limit | Non-parallel execution | None
MARS+ | Gemini-3-Pro-Preview | 2 H100 GPUs, 48 vCPUs, 220 GB of RAM, 24-hour limit | 2-way parallel search | None
-
[6]
Existing Validation Set: The script correctly identifies that a separate validation dataset is already available in the raw data (i.e., no new split is required).
-
[7]
The script split the data randomly instead of using stratification
Created Validation Set: The script correctly creates a new validation set by splitting the training data. Your analysis must confirm that the script's logic properly attempts to create a representative split (e.g., by using stratified or group sampling).
- JSON Response Format: Provide your review in the following JSON format.
  - analysis (string): A con...
-
[8]
Data Integrity: Ensure all analysis is strictly performed on the training set to prevent data leakage
-
[9]
Target Variable Analysis
- Distribution: Calculate the distribution of the target variable.
- Imbalance/Skew:
  - If Classification: Calculate class balance ratios.
  - If Regression: Calculate Skewness and Kurtosis to assess normality.
-
[10]
Input Data Analysis (Modality-Specific)
- If Tabular Data:
  - Numerical: Report mean, std, min, max, and outlier counts (IQR method).
  - Categorical: Report cardinality; flag columns with > 50 categories or rare labels (< 1 percent frequency).
  - Missing Values: Report count/percentage of NaNs per column.
- If Image Data:
  - Dimensions: Analyze distributions ...
-
[11]
Do longer audio files correlate with specific classes?
Feature/Signal Relationships
- Structured (Tabular) Relationships:
  - Correlation: Pearson/Spearman for numerical; Mutual Information for categorical.
  - Importance: Train a lightweight Random Forest and report top 5 features.
  - Redundancy: Report collinear pairs (Correlation > 0.90).
- Unstructured (Meta-Feature) Relationships: Analyze the relationship bet...
-
[12]
Formatting & Output
- Organize the output into distinct, capitalized sections.
- Use f-strings to format floats to 4 decimal places for readability.
Model Architecture Search Instruction
==== Task ====
Your task is to propose {num_model_candidates} distinct model architectures to solve the problem.
**Action:** Use Google Search to research state-of-the-ar...
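Returning to the data-analysis prompt excerpted above, a few of the requested checks (class balance ratios, skewness and kurtosis, IQR outlier counts) could be sketched in plain Python as below. The function names and sample data are hypothetical. The moments are the simple population versions, so bias-corrected library estimates may differ slightly on small samples.

```python
import statistics
from collections import Counter


def class_balance(labels):
    """Share of each class among the training labels."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def iqr_outlier_count(values):
    """Count points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for x in values if x < low or x > high)


# Made-up training-set values, for illustration only.
balance = class_balance(["cat", "cat", "cat", "dog"])
skew, kurt = skewness_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
n_outliers = iqr_outlier_count([1, 2, 3, 4, 5, 6, 7, 8, 100])

print("TARGET VARIABLE ANALYSIS")
for cls, ratio in balance.items():
    print(f"{cls}: {ratio:.4f}")
print(f"skewness: {skew:.4f}, excess kurtosis: {kurt:.4f}")
print(f"IQR outliers: {n_outliers}")
```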
-
[13]
IF `load_cached_data` is True: Try to load the file.
-
[14]
IF loading fails (file missing or corrupt) OR `load_cached_data` is False:
- Compute/process the data from scratch.
- Save the result to the cache directory `./working/{dir_name}/` for future runs.
-
[15]
Return the data.
- If this module handles model training:
  - **Metrics:** Print key training and validation metrics during training process.
  - **Optimization:** Implement Early Stopping to prevent overfitting and reduce runtime.
- If this module handles submission generation:
  - Generate predictions for the entire test set. Save the final predictions to `./s...
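The load-or-compute caching rule spelled out in items [13] to [15] can be sketched as below. This is a minimal illustration, not the agent's generated code: `load_or_compute`, the pickle format, and the temporary directory standing in for `./working/{dir_name}/` are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn, load_cached_data=True):
    """Return cached data when allowed and readable; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if load_cached_data:
        try:
            with cache_path.open("rb") as f:
                return pickle.load(f)
        except Exception:
            # Missing or corrupt cache: fall through and recompute.
            # Unpickling can raise several exception types, hence the broad catch.
            pass
    data = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data


calls = []


def expensive_preprocessing():
    """Stand-in for a costly data-processing step; records each real call."""
    calls.append(1)
    return {"features": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "working" / "demo" / "features.pkl"
    first = load_or_compute(path, expensive_preprocessing)   # computes and caches
    second = load_or_compute(path, expensive_preprocessing)  # served from cache
    forced = load_or_compute(path, expensive_preprocessing, load_cached_data=False)

print(len(calls))
```

Only the first and the forced call reach `expensive_preprocessing`, which is the behavior the prompt asks modules to implement.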
-
[16]
Loads Val and Test loaders
-
[17]
Gets predictions for Model A and Model B (with TTA)
-
[18]
Ensembles predictions
-
[19]
Optimizes threshold on Validation set
-
[20]
Applies threshold to Test set
-
[21]
--- Processing Validation Set ---
Saves submission.csv.
"""
device = torch.device(Config.DEVICE)
# 1. Get DataLoaders
# We don't need the train loader here
_, val_loader, test_loader = get_dataloaders(
    debug=debug,
    batch_size=Config.BATCH_SIZE,
    num_workers=Config.NUM_WORKERS,
    load_cached_data=load_cached_data,
)
models = [Config.MODEL_A_NAME, Config.MODEL_B_NAME]
# 2. Validation Inference...
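The docstring steps in items [16] to [21] (ensemble two models' predictions, tune a threshold on validation data, apply it to the test set) can be illustrated without any model code. The sketch below uses made-up probabilities and assumes F1 as the tuning metric, which the excerpt does not specify. Test-time augmentation and data loading are omitted.

```python
def ensemble(preds_a, preds_b):
    """Average two models' predicted probabilities element-wise."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


def f1_at(probs, labels, threshold):
    """F1 score when predicting positive for probs >= threshold (assumed metric)."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, candidates):
    """Pick the candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda t: f1_at(probs, labels, t))


# Made-up validation predictions from two hypothetical models.
val_labels = [1, 1, 0, 0]
val_probs = ensemble([0.875, 0.625, 0.5, 0.125], [0.625, 0.375, 0.25, 0.125])
threshold = best_threshold(val_probs, val_labels, candidates=[0.25, 0.5, 0.75])

# Apply the threshold tuned on validation data to the test predictions.
test_probs = ensemble([0.5, 0.25], [0.75, 0.25])
test_preds = [1 if p >= threshold else 0 for p in test_probs]
print(threshold, test_preds)
```

Tuning the threshold on the validation set and only then applying it to test predictions keeps the test set out of the selection loop.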