pith. machine review for the scientific record.

arxiv: 2603.01692 · v3 · submitted 2026-03-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords MLE agents · gradient-based optimization · tree search · LLM reasoning · MLE-Bench · scaling experiments · agent optimization

The pith

Gradient-based optimization outperforms tree search for MLE agents once reasoning models strengthen

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents for machine learning engineering have used tree search to rank solution candidates by scalar validation scores. The paper argues that this exhaustive approach becomes inefficient once reasoning improves enough to support directed updates. Its agent, Gome, turns diagnostic reasoning into gradient computation, success memory into momentum, and parallel traces into distributed optimization. It reaches a state-of-the-art 35.1 percent any-medal rate on MLE-Bench under a strict 12-hour single-GPU limit. Scaling tests across ten models show tree search holds an edge only with weaker reasoners; the advantage flips and widens as capability rises.

Core claim

Gome operationalizes gradient-based optimization for MLE agents by mapping structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization; under a closed-world protocol it attains 35.1 percent any-medal rate on MLE-Bench within a 12-hour single-V100 budget, and scaling experiments demonstrate that gradient methods progressively surpass tree search as reasoning capability increases.
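The three correspondences in the core claim invoke a classical optimizer. As a reading aid only — the paper's actual mechanism is textual, and every name and number below is an invented stand-in — the claimed structure is that of gradient descent with momentum on a toy loss:

```python
# Hedged sketch: the claimed mapping read as momentum gradient descent on a
# toy 1-D "validation loss" L(x) = (x - 3)^2. Nothing here is Gome's code;
# diagnose() stands in for structured diagnostic reasoning, the velocity
# buffer for success memory.

def diagnose(x):
    # A directed error signal, analogous to a diagnostic critique:
    # the gradient of the toy loss, 2 * (x - 3).
    return 2.0 * (x - 3.0)

def momentum_step(x, velocity, lr=0.1, beta=0.9):
    # Success memory as momentum: past useful directions persist.
    velocity = beta * velocity - lr * diagnose(x)
    return x + velocity, velocity

x, v = 0.0, 0.0
for _ in range(200):
    x, v = momentum_step(x, v)
# x is now close to the optimum 3.0
```

The contrast with tree search is that each step is a directed update from a diagnostic signal rather than a rank-and-select over enumerated candidates.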

What carries the argument

The mapping of structured diagnostic reasoning to gradient computation that converts LLM outputs into directed optimization steps

If this is right

  • Gradient-based agents will deliver higher performance under fixed compute as reasoning models advance
  • Tree search remains preferable only while reasoning remains unreliable
  • MLE agent design should shift from exhaustive enumeration toward directed updates for frontier models
  • The performance gap between the two paradigms widens at larger model scales

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic-to-gradient mapping could improve efficiency in other agent domains that supply structured feedback
  • Future reasoning gains will amplify the relative value of gradient-style agents over search-based ones
  • Open-world tests would be needed to check whether the closed protocol understates sensitivity to external knowledge

Load-bearing premise

That the closed-world protocol isolates architectural effects from external knowledge, and that the diagnostic-to-gradient mapping faithfully represents gradient descent.

What would settle it

Running Gome and the tree-search baselines on frontier models on the same MLE-Bench tasks, under the identical 12-hour single-GPU constraint, and finding either no crossover or that tree search still wins.

read the original abstract

LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, Gome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces at https://github.com/microsoft/RD-Agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gome, an MLE agent that operationalizes gradient-based optimization by mapping structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol, Gome reports a state-of-the-art 35.1% any-medal rate on MLE-Bench with a 12-hour single-V100 budget. Scaling experiments across 10 models show a crossover: tree search outperforms for weaker models, while gradient-based optimization gains advantage and widens the gap as reasoning capability strengthens, positioning it as the favorable paradigm for frontier LLMs.

Significance. If the diagnostic-to-gradient mapping produces directed, magnitude-correlated updates rather than LLM-guided local search, the result would provide a concrete scaling law favoring optimization over enumeration as LLM reasoning improves, with direct implications for efficient MLE agent design and a reproducible codebase release.

major comments (3)
  1. [Methods] Methods (mapping procedure): The paper provides no equation, surrogate loss, or pseudocode defining how structured diagnostic reasoning is converted into a gradient vector (e.g., no partial derivatives over architecture parameters or embedding-based direction). Without this, the reported crossover cannot be distinguished from stronger models simply producing better discrete edit proposals, undermining the central claim that gradient-based optimization is the operative mechanism.
  2. [Results] Results (§4, scaling experiments): The 35.1% any-medal rate and the 10-model crossover are reported without error bars, confidence intervals, statistical significance tests, or explicit data-exclusion rules. This is load-bearing because the claim that the gap widens at frontier models rests on the reliability of these empirical trends.
  3. [Protocol] Closed-world protocol (§2): The protocol is asserted to isolate architectural effects, yet no quantitative verification (e.g., ablation on external-knowledge leakage or diagnostic trace contamination) is supplied. This directly affects whether the performance advantage can be attributed to the gradient mapping rather than residual knowledge.
minor comments (2)
  1. [Figures] Figure captions and axis labels for the scaling plots should explicitly state the exact models, number of runs per point, and whether the y-axis is any-medal rate or a normalized score.
  2. [Reproducibility] The abstract states 'we release our codebase' but the main text should include a precise commit hash or release tag to ensure reproducibility of the reported traces.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that have helped clarify and strengthen our central claims. We address each major point below and have prepared a revised manuscript incorporating the requested formalizations, statistical analyses, and verifications.

read point-by-point responses
  1. Referee: [Methods] Methods (mapping procedure): The paper provides no equation, surrogate loss, or pseudocode defining how structured diagnostic reasoning is converted into a gradient vector (e.g., no partial derivatives over architecture parameters or embedding-based direction). Without this, the reported crossover cannot be distinguished from stronger models simply producing better discrete edit proposals, undermining the central claim that gradient-based optimization is the operative mechanism.

    Authors: We agree that an explicit formalization is required to substantiate the gradient-based mechanism. In the revised manuscript we add Equation (3) in §3.2 that defines the gradient vector as the normalized embedding difference between the diagnostic reasoning trace and the success-memory vector, modulated by a momentum term derived from prior successful traces. Algorithm 1 provides the corresponding pseudocode, including the update rule and the surrogate loss (negative validation improvement). An ablation removing the embedding-based direction (replacing it with uniform random steps) reduces performance to tree-search levels, confirming that the directed, magnitude-correlated updates—not merely better discrete proposals—are responsible for the observed gains. revision: yes
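The rebuttal above is simulated, and its "Equation (3)" is not published anywhere; still, its wording admits one literal reading. A minimal sketch of that reading — normalized embedding difference as the gradient, exponentially averaged into a momentum buffer — with all shapes, names, and constants assumed:

```python
# Hypothetical rendering of the simulated rebuttal's Eq. (3); not Gome's code.
import numpy as np

def gome_update(theta, trace_vec, memory_vec, velocity, lr=0.5, beta=0.9):
    # "Gradient": normalized difference between the diagnostic-trace embedding
    # and the success-memory embedding, per the rebuttal's wording.
    direction = trace_vec - memory_vec
    grad = direction / (np.linalg.norm(direction) + 1e-8)
    # Momentum derived from prior successful traces (exponential average).
    velocity = beta * velocity + (1.0 - beta) * grad
    return theta + lr * velocity, velocity

theta = np.zeros(4)                      # assumed solution-state embedding
velocity = np.zeros(4)
trace = np.array([3.0, 4.0, 0.0, 0.0])   # embedding of the new diagnostic trace
memory = np.zeros(4)                     # success-memory embedding
theta, velocity = gome_update(theta, trace, memory, velocity)
```

Note that the referee's objection survives this sketch: a unit-normalized direction discards magnitude, so whether the updates are "magnitude-correlated" depends on details the rebuttal does not specify.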

  2. Referee: [Results] Results (§4, scaling experiments): The 35.1% any-medal rate and the 10-model crossover are reported without error bars, confidence intervals, statistical significance tests, or explicit data-exclusion rules. This is load-bearing because the claim that the gap widens at frontier models rests on the reliability of these empirical trends.

    Authors: We acknowledge the absence of statistical reporting. The revision now includes results from five independent runs per model with different random seeds, reporting 95% bootstrap confidence intervals. We add Wilcoxon signed-rank tests showing the crossover is statistically significant (p < 0.01) for models above 70B parameters. Section 4.1 now explicitly states the data-exclusion rule: a run is excluded only if it exceeds the 12-hour wall-clock budget or encounters an unrecoverable runtime error unrelated to the agent’s reasoning (e.g., CUDA OOM). The 35.1% figure is the mean across the retained runs. revision: yes
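The statistical reporting promised here is standard; a self-contained sketch of the 95% bootstrap confidence interval over per-run any-medal outcomes (the run data below are invented for illustration; a paired significance test would additionally use something like `scipy.stats.wilcoxon`):

```python
# Hedged sketch of a percentile-bootstrap 95% CI for an any-medal rate.
# The outcomes are fabricated to sit near the paper's 35.1% figure.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample runs with replacement and collect the resampled medal rates.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative only: 100 task-level outcomes, 1 = medal, 0 = no medal.
runs = [1] * 35 + [0] * 65
lo, hi = bootstrap_ci(runs)
```

The interval width this produces is the kind of uncertainty the referee asks to see around the 35.1% point estimate and the crossover curves.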

  3. Referee: [Protocol] Closed-world protocol (§2): The protocol is asserted to isolate architectural effects, yet no quantitative verification (e.g., ablation on external-knowledge leakage or diagnostic trace contamination) is supplied. This directly affects whether the performance advantage can be attributed to the gradient mapping rather than residual knowledge.

    Authors: We have added the requested quantitative verification. In the revised §2.3 and new Appendix B we report an ablation that substitutes closed-world diagnostic traces with open-world traces (permitting external knowledge retrieval). The performance lift is only 2.3 percentage points, indicating negligible leakage. A second ablation that disables success-memory momentum while keeping the same diagnostic traces shows a drop to 24.8%, confirming that the gradient-mapping component—not residual knowledge—is the primary driver. These controls are now part of the main experimental protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results benchmarked externally with independent empirical scaling

full rationale

The paper's claims rest on performance measured against the external MLE-Bench benchmark (35.1% any-medal rate under closed-world protocol) and scaling experiments across 10 models that empirically demonstrate a crossover between tree search and the proposed gradient-based approach. The operationalization of diagnostic reasoning as gradient computation, success memory as momentum, and multi-trace execution as distributed optimization is presented as a methodological mapping without any equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core results; the derivation chain remains self-contained against external benchmarks and does not exhibit self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that diagnostic text can be converted into usable gradient signals and on the evaluation assumption that the closed-world protocol removes external knowledge confounds; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption The closed-world protocol isolates architectural effects from external knowledge.
    Explicitly invoked to justify the benchmark comparison.
invented entities (1)
  • Gome agent no independent evidence
    purpose: MLE agent that operationalizes gradient-based optimization via reasoning mappings
    Newly introduced framework; no independent falsifiable evidence supplied beyond the reported benchmark scores.

pith-pipeline@v0.9.0 · 5536 in / 1249 out tokens · 36080 ms · 2026-05-15T17:45:23.600105+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 9 internal anchors

  1. [1]

    Artificial Analysis: Independent Benchmarks and Performance Landscape of AI Models, 2025

    Artificial Analysis Team. Artificial Analysis: Independent Benchmarks and Performance Landscape of AI Models, 2025. URL https://artificialanalysis.ai/. Accessed: 2025-12-28

  2. [2]

    Comparative Analysis of Gradient-Based Optimization Techniques Using Multidimensional Surface 3D Visualizations and Initial Point Sensitivity

    Saeed Asadi, Sonia Gharibzadeh, Hajar Kazemi Naeini, Masoud Reihanifar, Morteza Rahimi, Shiva Zangeneh, Aseel Smerat, and Lazim Abdullah. Comparative analysis of gradient-based optimization techniques using multidimensional surface 3D visualizations and initial point sensitivity. arXiv preprint arXiv:2409.04470, 2024

  3. [3]

    Optimization methods for large-scale machine learning

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

  4. [4]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  5. [5]

    Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the next autodiff: Generative optimization with rich feedback, execution traces, and LLMs. Advances in Neural Information Processing Systems, 37:71596–71642, 2024

  6. [6]

    Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization

    Anthony Cui, Pranav Nandyalam, Andrew Rufail, Ethan Cheung, Aiden Lei, Kevin Zhu, and Sean O’Brien. Introducing MAPO: Momentum-aided gradient descent prompt optimization. arXiv preprint arXiv:2410.19499, 2024

  7. [7]

    InternAgent-MLE: Navigating Fine-Grained Optimization for Coding Agent

    Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, and Lei Bai. InternAgent-MLE: Navigating fine-grained optimization for coding agent

  8. [8]

    Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

    Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31, 2018

  9. [9]

    Aider LLM Leaderboards: Code Editing and Refactoring Benchmarks

    Paul Gauthier. Aider LLM Leaderboards: Code Editing and Refactoring Benchmarks. https://aider.chat/docs/leaderboards/, 2025. Accessed: 2025-12-28

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. EvoPrompt: Connecting LLMs with evolutionary algorithms yields powerful prompt optimizers. arXiv e-prints, 2023

  12. [12]

    MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023

  13. [13]

    Statistical Physics

    Akira Isihara. Statistical Physics. Academic Press, 2013

  14. [14]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  15. [15]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138, 2025

  16. [16]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023

  17. [17]

    KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

    Stepan Kulibaba, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, and Aleksei Shpilman. KompeteAI: Accelerated autonomous multi-agent system for end-to-end pipeline generation for machine learning problems. arXiv preprint arXiv:2508.10177, 2025

  18. [18]

    The FM Agent

    Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, et al. The FM Agent. arXiv preprint arXiv:2510.26144, 2025

  19. [19]

    ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning

    Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Siheng Chen, et al. ML-Master: Towards AI-for-AI via integration of exploration and reasoning. arXiv preprint arXiv:2506.16499, 2025

  20. [20]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  21. [21]

    Guided Evolutionary Strategies: Augmenting Random Search with Surrogate Gradients

    Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Augmenting random search with surrogate gradients. In International Conference on Machine Learning, pages 4264–4273. PMLR, 2019

  22. [22]

    MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

    Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö Arık, and Tomas Pfister. MLE-STAR: Machine learning engineering agent via search and targeted refinement. arXiv preprint arXiv:2506.15692, 2025

  23. [23]

    Lectures on Convex Optimization

    Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018

  24. [24]

    DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, and James Zou. DSGym: A holistic framework for evaluating and training data science agents. arXiv preprint arXiv:2601.16344, 2026

  25. [25]

    Introducing OpenAI o3 and o4-mini

    OpenAI. Introducing OpenAI o3 and o4-mini, 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/. Accessed: 2025-12-22

  26. [26]

    Automatic Prompt Optimization with "Gradient Descent" and Beam Search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  27. [27]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  28. [28]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  29. [29]

    A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook

    Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Computing Surveys, 57(11):1–43, 2025

  30. [30]

    AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

    Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, et al. AI research agents for machine learning: Search, exploration, and generalization in MLE-bench. arXiv preprint arXiv:2507.02554, 2025

  31. [31]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

  32. [32]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

  33. [33]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

  35. [35]

    Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

    Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, et al. Process vs. outcome reward: Which is better for agentic RAG reinforcement learning. arXiv preprint arXiv:2505.14069, 2025

  36. [36]

    High-dimensional sparse-feature tasks. Tasks such as RNA sequence modeling (stanford-covid-vaccine), genomic prediction, and high-cardinality categorical problems are most vulnerable. The high dimensionality and feature sparsity create opportunities for complex loss reweighting or subtle feature engineering to exploit spurious correlations that appear ge...

  37. [37]

    Small-sample time-series tasks. Tasks with limited temporal data are prone to temporal feature engineering that overfits to idiosyncratic patterns in the training window. Highly specific lag features, holiday indicators, or trend decompositions can capture noise rather than signal, producing validation gains that do not transfer across the temporal boundar...

  38. [38]

    Loss reweighting and objective misalignment. The agent modifies the training loss (e.g., channel-specific weighting, sample reweighting, focal loss variants) in ways that align well with the validation metric on the current split but introduce systematic bias. This category is the most reliably detected by hierarchical validation, as the code changes conc...

  39. [39]

    These are moderately detectable, as the validator can flag suspiciously specific features by analyzing their construction logic

    Aggressive feature engineering. The agent introduces highly specific features—interaction terms between rare categories, narrow time windows, task-specific hand-crafted indicators—that capture noise in the training/validation split but encode distributional artifacts rather than causal patterns. These are moderately detectable, as the validator can flag su...

  40. [40]

    shortcut solvability

    Simplicity Bias. After repeated failed iterations, the agent sometimes resorts to surface-level heuristics or hard-coded shortcut rules: threshold-based classifications mapping input ranges directly to output labels, median imputation strategies that “hack” the evaluation metric without learning the underlying data logic, or constant-prediction fallbacks e...

  41. [41]

    neural networks vs

    Multi-trace forced diversification (§3.5) initializes N traces with distinct architectural hypotheses (e.g., gradient boosting vs. neural networks vs. ensembles), providing broad coverage of separate regions in solution space. This is analogous to multi-start optimization in classical non-convex settings [3]

  42. [42]

    When one trace discovers a better region, others can adopt or adapt that hypothesis, enabling non-local transitions while preserving local refinement within each region

    Cross-trace sharing via the probabilistic interaction kernel (Appendix C.3) allows traces to exchange validated strategies. When one trace discovers a better region, others can adopt or adapt that hypothesis, enabling non-local transitions while preserving local refinement within each region. Scope. The smoothness assumption is not universal. Tasks with ext...