pith. machine review for the scientific record.

arxiv: 2603.01692 · v3 · submitted 2026-03-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords MLE agents · gradient-based optimization · tree search · LLM reasoning · MLE-Bench · scaling experiments · agent optimization

The pith

Gradient-based optimization outperforms tree search for MLE agents once reasoning models strengthen

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents for machine learning engineering have used tree search to rank solution candidates by scalar validation scores. The paper argues that this exhaustive approach becomes inefficient once reasoning improves enough to support directed updates. Its agent, Gome, turns diagnostic reasoning into gradient computation, success memory into momentum, and parallel traces into distributed optimization. It reaches a state-of-the-art 35.1 percent any-medal rate on MLE-Bench under a strict 12-hour single-GPU limit. Scaling tests across ten models show tree search holds an edge only with weaker reasoners; the advantage flips and widens as capability rises.

Core claim

Gome operationalizes gradient-based optimization for MLE agents by mapping structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization; under a closed-world protocol it attains 35.1 percent any-medal rate on MLE-Bench within a 12-hour single-V100 budget, and scaling experiments demonstrate that gradient methods progressively surpass tree search as reasoning capability increases.
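The three correspondences in the core claim invoke a classical optimizer. As a reading aid only — the paper's actual mechanism is textual, and every name and number below is an invented stand-in — the claimed structure is that of gradient descent with momentum on a toy loss:

```python
# Hedged sketch: the claimed mapping read as momentum gradient descent on a
# toy 1-D "validation loss" L(x) = (x - 3)^2. Nothing here is Gome's code;
# diagnose() stands in for structured diagnostic reasoning, the velocity
# buffer for success memory.

def diagnose(x):
    # A directed error signal, analogous to a diagnostic critique:
    # the gradient of the toy loss, 2 * (x - 3).
    return 2.0 * (x - 3.0)

def momentum_step(x, velocity, lr=0.1, beta=0.9):
    # Success memory as momentum: past useful directions persist.
    velocity = beta * velocity - lr * diagnose(x)
    return x + velocity, velocity

x, v = 0.0, 0.0
for _ in range(200):
    x, v = momentum_step(x, v)
# x is now close to the optimum 3.0
```

The contrast with tree search is that each step is a directed update from a diagnostic signal rather than a rank-and-select over enumerated candidates.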

What carries the argument

The mapping of structured diagnostic reasoning to gradient computation that converts LLM outputs into directed optimization steps

If this is right

  • Gradient-based agents will deliver higher performance under fixed compute as reasoning models advance
  • Tree search remains preferable only while reasoning remains unreliable
  • MLE agent design should shift from exhaustive enumeration toward directed updates for frontier models
  • The performance gap between the two paradigms widens at larger model scales

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic-to-gradient mapping could improve efficiency in other agent domains that supply structured feedback
  • Future reasoning gains will amplify the relative value of gradient-style agents over search-based ones
  • Open-world tests would be needed to check whether the closed protocol understates sensitivity to external knowledge

Load-bearing premise

That the closed-world protocol isolates architectural effects from external knowledge, and that the diagnostic-to-gradient mapping faithfully represents gradient descent.

What would settle it

Running Gome and the tree-search baselines on frontier models on the same MLE-Bench tasks, under the identical 12-hour single-GPU constraint, and finding either no crossover or that tree search still wins.

read the original abstract

LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, Gome achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces at https://github.com/microsoft/RD-Agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gome, an MLE agent that operationalizes gradient-based optimization by mapping structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol, Gome reports a state-of-the-art 35.1% any-medal rate on MLE-Bench with a 12-hour single-V100 budget. Scaling experiments across 10 models show a crossover: tree search outperforms for weaker models, while gradient-based optimization gains advantage and widens the gap as reasoning capability strengthens, positioning it as the favorable paradigm for frontier LLMs.

Significance. If the diagnostic-to-gradient mapping produces directed, magnitude-correlated updates rather than LLM-guided local search, the result would provide a concrete scaling law favoring optimization over enumeration as LLM reasoning improves, with direct implications for efficient MLE agent design and a reproducible codebase release.

major comments (3)
  1. [Methods] Methods (mapping procedure): The paper provides no equation, surrogate loss, or pseudocode defining how structured diagnostic reasoning is converted into a gradient vector (e.g., no partial derivatives over architecture parameters or embedding-based direction). Without this, the reported crossover cannot be distinguished from stronger models simply producing better discrete edit proposals, undermining the central claim that gradient-based optimization is the operative mechanism.
  2. [Results] Results (§4, scaling experiments): The 35.1% any-medal rate and the 10-model crossover are reported without error bars, confidence intervals, statistical significance tests, or explicit data-exclusion rules. This is load-bearing because the claim that the gap widens at frontier models rests on the reliability of these empirical trends.
  3. [Protocol] Closed-world protocol (§2): The protocol is asserted to isolate architectural effects, yet no quantitative verification (e.g., ablation on external-knowledge leakage or diagnostic trace contamination) is supplied. This directly affects whether the performance advantage can be attributed to the gradient mapping rather than residual knowledge.
minor comments (2)
  1. [Figures] Figure captions and axis labels for the scaling plots should explicitly state the exact models, number of runs per point, and whether the y-axis is any-medal rate or a normalized score.
  2. [Reproducibility] The abstract states 'we release our codebase' but the main text should include a precise commit hash or release tag to ensure reproducibility of the reported traces.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that have helped clarify and strengthen our central claims. We address each major point below and have prepared a revised manuscript incorporating the requested formalizations, statistical analyses, and verifications.

read point-by-point responses
  1. Referee: [Methods] Methods (mapping procedure): The paper provides no equation, surrogate loss, or pseudocode defining how structured diagnostic reasoning is converted into a gradient vector (e.g., no partial derivatives over architecture parameters or embedding-based direction). Without this, the reported crossover cannot be distinguished from stronger models simply producing better discrete edit proposals, undermining the central claim that gradient-based optimization is the operative mechanism.

    Authors: We agree that an explicit formalization is required to substantiate the gradient-based mechanism. In the revised manuscript we add Equation (3) in §3.2 that defines the gradient vector as the normalized embedding difference between the diagnostic reasoning trace and the success-memory vector, modulated by a momentum term derived from prior successful traces. Algorithm 1 provides the corresponding pseudocode, including the update rule and the surrogate loss (negative validation improvement). An ablation removing the embedding-based direction (replacing it with uniform random steps) reduces performance to tree-search levels, confirming that the directed, magnitude-correlated updates—not merely better discrete proposals—are responsible for the observed gains. revision: yes
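The rebuttal above is simulated, and its "Equation (3)" is not published anywhere; still, its wording admits one literal reading. A minimal sketch of that reading — normalized embedding difference as the gradient, exponentially averaged into a momentum buffer — with all shapes, names, and constants assumed:

```python
# Hypothetical rendering of the simulated rebuttal's Eq. (3); not Gome's code.
import numpy as np

def gome_update(theta, trace_vec, memory_vec, velocity, lr=0.5, beta=0.9):
    # "Gradient": normalized difference between the diagnostic-trace embedding
    # and the success-memory embedding, per the rebuttal's wording.
    direction = trace_vec - memory_vec
    grad = direction / (np.linalg.norm(direction) + 1e-8)
    # Momentum derived from prior successful traces (exponential average).
    velocity = beta * velocity + (1.0 - beta) * grad
    return theta + lr * velocity, velocity

theta = np.zeros(4)                      # assumed solution-state embedding
velocity = np.zeros(4)
trace = np.array([3.0, 4.0, 0.0, 0.0])   # embedding of the new diagnostic trace
memory = np.zeros(4)                     # success-memory embedding
theta, velocity = gome_update(theta, trace, memory, velocity)
```

Note that the referee's objection survives this sketch: a unit-normalized direction discards magnitude, so whether the updates are "magnitude-correlated" depends on details the rebuttal does not specify.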

  2. Referee: [Results] Results (§4, scaling experiments): The 35.1% any-medal rate and the 10-model crossover are reported without error bars, confidence intervals, statistical significance tests, or explicit data-exclusion rules. This is load-bearing because the claim that the gap widens at frontier models rests on the reliability of these empirical trends.

    Authors: We acknowledge the absence of statistical reporting. The revision now includes results from five independent runs per model with different random seeds, reporting 95% bootstrap confidence intervals. We add Wilcoxon signed-rank tests showing the crossover is statistically significant (p < 0.01) for models above 70B parameters. Section 4.1 now explicitly states the data-exclusion rule: a run is excluded only if it exceeds the 12-hour wall-clock budget or encounters an unrecoverable runtime error unrelated to the agent’s reasoning (e.g., CUDA OOM). The 35.1% figure is the mean across the retained runs. revision: yes
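The statistical reporting promised here is standard; a self-contained sketch of the 95% bootstrap confidence interval over per-run any-medal outcomes (the run data below are invented for illustration; a paired significance test would additionally use something like `scipy.stats.wilcoxon`):

```python
# Hedged sketch of a percentile-bootstrap 95% CI for an any-medal rate.
# The outcomes are fabricated to sit near the paper's 35.1% figure.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample runs with replacement and collect the resampled medal rates.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative only: 100 task-level outcomes, 1 = medal, 0 = no medal.
runs = [1] * 35 + [0] * 65
lo, hi = bootstrap_ci(runs)
```

The interval width this produces is the kind of uncertainty the referee asks to see around the 35.1% point estimate and the crossover curves.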

  3. Referee: [Protocol] Closed-world protocol (§2): The protocol is asserted to isolate architectural effects, yet no quantitative verification (e.g., ablation on external-knowledge leakage or diagnostic trace contamination) is supplied. This directly affects whether the performance advantage can be attributed to the gradient mapping rather than residual knowledge.

    Authors: We have added the requested quantitative verification. In the revised §2.3 and new Appendix B we report an ablation that substitutes closed-world diagnostic traces with open-world traces (permitting external knowledge retrieval). The performance lift is only 2.3 percentage points, indicating negligible leakage. A second ablation that disables success-memory momentum while keeping the same diagnostic traces shows a drop to 24.8%, confirming that the gradient-mapping component—not residual knowledge—is the primary driver. These controls are now part of the main experimental protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results benchmarked externally with independent empirical scaling

full rationale

The paper's claims rest on performance measured against the external MLE-Bench benchmark (35.1% any-medal rate under closed-world protocol) and scaling experiments across 10 models that empirically demonstrate a crossover between tree search and the proposed gradient-based approach. The operationalization of diagnostic reasoning as gradient computation, success memory as momentum, and multi-trace execution as distributed optimization is presented as a methodological mapping without any equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core results; the derivation chain remains self-contained against external benchmarks and does not exhibit self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that diagnostic text can be converted into usable gradient signals and on the evaluation assumption that the closed-world protocol removes external knowledge confounds; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption The closed-world protocol isolates architectural effects from external knowledge.
    Explicitly invoked to justify the benchmark comparison.
invented entities (1)
  • Gome agent no independent evidence
    purpose: MLE agent that operationalizes gradient-based optimization via reasoning mappings
    Newly introduced framework; no independent falsifiable evidence supplied beyond the reported benchmark scores.

pith-pipeline@v0.9.0 · 5536 in / 1249 out tokens · 36080 ms · 2026-05-15T17:45:23.600105+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 9 internal anchors

  1. [1]

    Artificial Analysis: Independent Benchmarks and Performance Landscape of AI Models, 2025

    Artificial Analysis Team. Artificial Analysis: Independent Benchmarks and Performance Landscape of AI Models, 2025. URL https://artificialanalysis.ai/. Accessed: 2025-12-28

  2. [2]

    Comparative Analysis of Gradient-Based Optimization Techniques Using Multidimensional Surface 3D Visualizations and Initial Point Sensitivity

    Saeed Asadi, Sonia Gharibzadeh, Hajar Kazemi Naeini, Masoud Reihanifar, Morteza Rahimi, Shiva Zangeneh, Aseel Smerat, and Lazim Abdullah. Comparative analysis of gradient-based optimization techniques using multidimensional surface 3D visualizations and initial point sensitivity. arXiv preprint arXiv:2409.04470, 2024

  3. [3]

    Optimization methods for large-scale machine learning

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

  4. [4]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  5. [5]

    Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the next autodiff: Generative optimization with rich feedback, execution traces, and LLMs. Advances in Neural Information Processing Systems, 37:71596–71642, 2024

  6. [6]

    Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization

    Anthony Cui, Pranav Nandyalam, Andrew Rufail, Ethan Cheung, Aiden Lei, Kevin Zhu, and Sean O’Brien. Introducing MAPO: Momentum-aided gradient descent prompt optimization. arXiv preprint arXiv:2410.19499, 2024

  7. [7]

    InternAgent-MLE: Navigating Fine-Grained Optimization for Coding Agent

    Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, and Lei Bai. InternAgent-MLE: Navigating fine-grained optimization for coding agent

  8. [8]

    Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

    Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31, 2018

  9. [9]

    Aider LLM Leaderboards: Code Editing and Refactoring Benchmarks

    Paul Gauthier. Aider LLM Leaderboards: Code Editing and Refactoring Benchmarks. https://aider.chat/docs/leaderboards/, 2025. Accessed: 2025-12-28

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. EvoPrompt: Connecting LLMs with evolutionary algorithms yields powerful prompt optimizers. arXiv e-prints, 2023

  12. [12]

    MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023

  13. [13]

    Statistical Physics

    Akira Isihara. Statistical Physics. Academic Press, 2013

  14. [14]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  15. [15]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138, 2025

  16. [16]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023

  17. [17]

    KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

    Stepan Kulibaba, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, and Aleksei Shpilman. KompeteAI: Accelerated autonomous multi-agent system for end-to-end pipeline generation for machine learning problems. arXiv preprint arXiv:2508.10177, 2025

  18. [18]

    The FM Agent

    Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, et al. The FM Agent. arXiv preprint arXiv:2510.26144, 2025

  19. [19]

    ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning

    Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Siheng Chen, et al. ML-Master: Towards AI-for-AI via integration of exploration and reasoning. arXiv preprint arXiv:2506.16499, 2025

  20. [20]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  21. [21]

    Guided Evolutionary Strategies: Augmenting Random Search with Surrogate Gradients

    Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Augmenting random search with surrogate gradients. In International Conference on Machine Learning, pages 4264–4273. PMLR, 2019

  22. [22]

    MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

    Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö Arık, and Tomas Pfister. MLE-STAR: Machine learning engineering agent via search and targeted refinement. arXiv preprint arXiv:2506.15692, 2025

  23. [23]

    Lectures on Convex Optimization

    Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018

  24. [24]

    DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, and James Zou. DSGym: A holistic framework for evaluating and training data science agents. arXiv preprint arXiv:2601.16344, 2026

  25. [25]

    Introducing OpenAI o3 and o4-mini

    OpenAI. Introducing OpenAI o3 and o4-mini, 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/. Accessed: 2025-12-22

  26. [26]

    Automatic Prompt Optimization with "Gradient Descent" and Beam Search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  27. [27]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  28. [28]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  29. [29]

    A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook

    Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Computing Surveys, 57(11):1–43, 2025

  30. [30]

    AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

    Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, et al. AI research agents for machine learning: Search, exploration, and generalization in MLE-bench. arXiv preprint arXiv:2507.02554, 2025

  31. [31]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

  32. [32]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

  33. [33]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  34. [34]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

  35. [35]

    Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

    Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, et al. Process vs. outcome reward: Which is better for agentic RAG reinforcement learning. arXiv preprint arXiv:2505.14069, 2025

  36. [36]

    High-dimensional sparse-feature tasks. Tasks such as RNA sequence modeling (stanford-covid-vaccine), genomic prediction, and high-cardinality categorical problems are most vulnerable. The high dimensionality and feature sparsity create opportunities for complex loss reweighting or subtle feature engineering to exploit spurious correlations that appear ge...

  37. [37]

    Small-sample time-series tasks. Tasks with limited temporal data are prone to temporal feature engineering that overfits to idiosyncratic patterns in the training window. Highly specific lag features, holiday indicators, or trend decompositions can capture noise rather than signal, producing validation gains that do not transfer across the temporal boundar...

  38. [38]

    Loss reweighting and objective misalignment. The agent modifies the training loss (e.g., channel-specific weighting, sample reweighting, focal loss variants) in ways that align well with the validation metric on the current split but introduce systematic bias. This category is the most reliably detected by hierarchical validation, as the code changes conc...

  39. [39]

    These are moderately detectable, as the validator can flag suspiciously specific features by analyzing their construction logic

    Aggressive feature engineering. The agent introduces highly specific features—interaction terms between rare categories, narrow time windows, task-specific hand-crafted indicators—that capture noise in the training/validation split but encode distributional artifacts rather than causal patterns. These are moderately detectable, as the validator can flag su...

  40. [40]

    shortcut solvability

    Simplicity Bias. After repeated failed iterations, the agent sometimes resorts to surface-level heuristics or hard-coded shortcut rules: threshold-based classifications mapping input ranges directly to output labels, median imputation strategies that “hack” the evaluation metric without learning the underlying data logic, or constant-prediction fallbacks e...

  41. [41]

    neural networks vs

    Multi-trace forced diversification (§3.5) initializes N traces with distinct architectural hypotheses (e.g., gradient boosting vs. neural networks vs. ensembles), providing broad coverage of separate regions in solution space. This is analogous to multi-start optimization in classical non-convex settings [3]

  42. [42]

    When one trace discovers a better region, others can adopt or adapt that hypothesis, enabling non-local transitions while preserving local refinement within each region

    Cross-trace sharing via the probabilistic interaction kernel (Appendix C.3) allows traces to exchange validated strategies. When one trace discovers a better region, others can adopt or adapt that hypothesis, enabling non-local transitions while preserving local refinement within each region. Scope. The smoothness assumption is not universal. Tasks with ext...