Towards Persistent Case-Based Memory for Autonomous Data Science: A CBR-Augmented R&D-Agent with a Locally Deployable Small Language Model

Felix Stocker

arxiv: 2606.05250 · v1 · pith:TMG27UOPnew · submitted 2026-06-03 · 💻 cs.SE

Pith reviewed 2026-06-28 05:16 UTC · model grok-4.3

classification 💻 cs.SE

keywords case-based reasoning · autonomous data science agents · small language models · persistent memory · Kaggle competitions · R&D agent framework · Gemma model

The pith

A CBR layer added to the R&D-Agent with a local SLM yields a directional accuracy gain and lower variance on the Spaceship Titanic Kaggle task

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper integrates a persistent case-based reasoning layer into an existing autonomous data-science agent framework, using structured cases that pair symbolic records with executable code. This CBR component overrides selected phases of the agent's loop via a toggleable subclass and applies a five-gate quality filter plus heuristic reuse detection based on embedding similarity and code fingerprints. Evaluation across two Kaggle competitions with multiple seeds shows the CBR version reaching 0.8147 accuracy versus 0.8098 for the baseline on Spaceship Titanic, accompanied by substantially reduced variance, while reuse events exhibit high semantic relevance. The work also provides the first published end-to-end test of Gemma 4 31B as the agent's backbone model.

Core claim

Overriding three phases of the R&D loop with a CBR layer that stores structured cases containing executable code snapshots and quality metadata, then retrieves them via a five-gate filter and reuse-detection heuristic combining embedding similarity (mean 0.882) and code-fingerprint overlap (mean 0.305), produces directionally higher accuracy and markedly lower variance than the CBR-disabled baseline on the Spaceship Titanic task.
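
The two similarity signals named in the core claim can be illustrated with a small sketch. It is not the paper's implementation: the fingerprint definition (token 3-grams compared with Jaccard overlap), the thresholds, and every function name are assumptions made for illustration, and the injection-provenance signal mentioned in the abstract is left out.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def code_fingerprint(code, k=3):
    """Hypothetical fingerprint: the set of k-token shingles of the code."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}


def jaccard(a, b):
    """Jaccard overlap of two fingerprint sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_reuse(emb_sim, fp_sim, emb_threshold=0.8, fp_threshold=0.6):
    """Label a retrieval event; both thresholds are invented for this sketch."""
    if emb_sim < emb_threshold:
        return "unrelated"
    return "verbatim-like" if fp_sim >= fp_threshold else "conceptual"


old_code = "model = RandomForestClassifier(n_estimators=200)"
new_code = "model = GradientBoostingClassifier(n_estimators=200)"
overlap = jaccard(code_fingerprint(old_code), code_fingerprint(new_code))
print(classify_reuse(0.9, overlap))
```

Under these invented thresholds, an event with the reported mean values (embedding similarity 0.882, fingerprint similarity 0.305) would be labelled "conceptual", which matches the paper's reading of guidance without verbatim copying. Different thresholds would of course shift that label.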

What carries the argument

The CBR layer, implemented as a surgical subclass toggled by an environment variable, that stores cases as structured records with executable code and quality metadata and retrieves them through a five-gate quality filter and heuristic reuse detection using embedding similarity plus code-fingerprint overlap.
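
The "surgical subclass" pattern can be sketched as follows. Everything here is a stand-in: `BaseLoop`, `CBRLoop`, the method names, and the `CBR_ENABLED` variable are hypothetical and do not reflect the real R&D-Agent API. Retrieval is reduced to an exact task match and the quality gating to a single placeholder check.

```python
import os


class BaseLoop:
    """Stand-in for the host agent loop (hypothetical, not the real R&D-Agent API)."""

    def propose(self, task):              # Phase 1: hypothesis generation
        return f"hypothesis for {task}"

    def write_code(self, hypothesis):     # Phase 2: coding
        return f"# implements: {hypothesis}"

    def execute(self, code):              # Phase 3: left unchanged by the subclass
        return {"score": 0.5}

    def evaluate(self, result):           # Phase 4: left unchanged by the subclass
        return result["score"]

    def record(self, task, code, score):  # Phase 5: bookkeeping
        return None

    def run_once(self, task):
        hypothesis = self.propose(task)
        code = self.write_code(hypothesis)
        score = self.evaluate(self.execute(code))
        self.record(task, code, score)
        return score


class CBRLoop(BaseLoop):
    """Overrides phases 1, 2 and 5 only; phases 3 and 4 are inherited."""

    def __init__(self, case_base):
        self.case_base = case_base
        self._retrieved = []

    def propose(self, task):
        # Placeholder retrieval: exact task match instead of two-stage retrieval.
        self._retrieved = [c for c in self.case_base if c["task"] == task]
        return super().propose(task) + f" (with {len(self._retrieved)} retrieved case(s))"

    def write_code(self, hypothesis):
        # Append stored code snapshots to what the base phase produces.
        snapshots = "\n".join(c["code"] for c in self._retrieved)
        base = super().write_code(hypothesis)
        return base + ("\n" + snapshots if snapshots else "")

    def record(self, task, code, score):
        # Placeholder gate; the paper's five gates are not modelled here.
        if score > 0.0:
            self.case_base.append({"task": task, "code": code, "score": score})


def make_loop(case_base):
    """Pick the loop class from a (hypothetical) environment variable."""
    enabled = os.environ.get("CBR_ENABLED", "0") == "1"
    return CBRLoop(case_base) if enabled else BaseLoop()
```

Setting `CBR_ENABLED=1` before calling `make_loop` switches the CBR behaviour on. Any other value returns the unmodified base loop, which is what makes a like-for-like baseline comparison cheap to run.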

If this is right

Persistent, quality-controlled case memory can be added to existing agent frameworks without replacing the core loop.
Small language models such as Gemma 4 31B can function as locally deployable backbones for full autonomous data-science pipelines.
Heuristic reuse detection supports conceptual guidance from prior cases rather than verbatim code reuse.
Lower variance across random seeds indicates more stable improvement trajectories when CBR memory is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same CBR pattern could be ported to other agent scaffolds that lack native long-term memory.
Variable code-fingerprint similarity alongside high embedding similarity points to a hybrid symbolic-neural memory design that may generalize beyond the current tasks.
Testing the five-gate filter on tasks with noisier or less structured code artefacts would clarify the limits of the current reuse heuristic.

Load-bearing premise

The five-gate quality filter and heuristic reuse-detection mechanism correctly identify transferable knowledge without introducing selection bias or false positives that inflate apparent gains.

What would settle it

Re-running the eight-loop evaluation on Spaceship Titanic with the reuse-detection heuristic disabled or replaced by random retrieval and finding that the accuracy gap and variance reduction both disappear.
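
The random-retrieval control described above could be set up roughly as follows. This is a sketch under assumptions: the `retrieve` function, its `mode` switch, and the case layout (a dict with an `embedding` list) are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(case_base, query, k, mode="similarity", seed=0):
    """Return k cases, ranked by similarity or drawn uniformly at random."""
    if mode == "random":
        rng = random.Random(seed)  # seeded so the control run is repeatable
        return rng.sample(case_base, min(k, len(case_base)))
    ranked = sorted(case_base, key=lambda c: cosine(c["embedding"], query), reverse=True)
    return ranked[:k]


cases = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.9, 0.1]},
]
print([c["id"] for c in retrieve(cases, [1.0, 0.0], k=2)])
```

Running the same seeds once with `mode="similarity"` and once with `mode="random"` would separate the effect of relevant retrieval from the effect of simply adding extra context to the prompt.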

Figures

Figures reproduced from arXiv: 2606.05250 by Felix Stocker.

**Figure 2.** Heuristic reuse-detection scatter: embedding cosine similarity (…

**Figure 3.** Appendix G: CBR-augmented R&D-Agent loop (one iteration). Dark grey boxes mark the three phases overridden by the CBR layer (Phases 1, 2, 5); light grey boxes mark unchanged R&D-Agent phases (Phases 3, 4). Phase 1 performs two-stage CBR retrieval and injects retrieved cases and Failure-Tracker patterns into the hypothesis; Phase 2 appends code snapshots to the coding prompt; Phase 5 applies the five-gate Q…

Original abstract

Most top-performing autonomous data-science agents rely on frontier cloud models and lack persistent, cross-session memory. This paper addresses two open gaps: (1) the underexplored use of formally structured, quality-controlled Case-Based Reasoning (CBR) case bases coupling symbolic case records with executable code artefacts; and (2) the untested viability of Small Language Models (SLMs) as locally deployable agent backbones. We present CBR-augmented R&D-Agent, integrating a persistent CBR layer into Microsoft's R&D-Agent framework with a custom backend for Gemma 4 31B Dense -- the first published end-to-end evaluation of Gemma 4 as an autonomous data-science agent backbone. The CBR layer overrides three R&D loop phases via a surgical subclass toggled by a single environment variable. Cases are stored as structured records with executable code snapshots and quality metadata; a five-gate quality filter and a heuristic reuse-detection mechanism assess knowledge transfer by combining embedding similarity, code-fingerprint overlap, and injection provenance. Evaluated on two Kaggle competitions (NOMAD 2018, Spaceship Titanic) with four seeds over eight improvement loops each, CBR achieves directionally higher accuracy than the CBR-disabled baseline on Spaceship Titanic (0.8147 vs. 0.8098, d = -1.41) with substantially lower variance. Heuristic reuse detection across 108 retrieval events shows high semantic relevance (mean embedding similarity 0.882) alongside variable structural proximity (mean code-fingerprint similarity 0.305), consistent with conceptual guidance rather than verbatim code copying.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The accuracy claim rests on a gain of about 0.5 percentage points (0.8147 vs. 0.8098) from four seeds, with no significance test or error bars, so the central result is not yet established.

The paper reports a small directional accuracy lift from adding a CBR layer to an R&D-Agent running Gemma 4 on the Spaceship Titanic task (0.8147 vs. 0.8098), plus lower variance, but that difference comes from only four random seeds and carries no p-value, bootstrap interval, or hypothesis test. The effect size is listed without a formula or per-seed numbers, so it cannot be checked, and with so few seeds the comparison is underpowered.

What is actually new is the first published end-to-end run of Gemma 4 as the agent backbone together with a structured CBR case base that stores executable code snapshots and quality metadata. The five-gate filter and the reuse detector that mixes embedding similarity with code-fingerprint overlap are concrete implementation choices that have not appeared in the cited prior work on agent memory.

The approach itself is reasonable. Overriding three phases of the R&D loop via a single environment variable keeps the change surgical, and tracking 108 retrieval events for semantic and structural similarity gives a practical check on whether cases are being reused for the right reasons.

The main weakness is the evaluation. Two Kaggle competitions and eight loops per seed are a start, but the reported gain is tiny, the variance comparison rests on n=4, and nothing in the abstract shows that the difference exceeds ordinary seed-to-seed fluctuation. Without those checks the directional superiority cannot be treated as reliable.

This work is for people building local data-science agents who already know the R&D-Agent framework and want to see how CBR can be bolted on. A reader looking for reproducible performance gains will find the current evidence preliminary.

I would send it to peer review so the authors can add proper statistical reporting and more seeds, but the manuscript needs that strengthening before the empirical claim can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a CBR-augmented R&D-Agent that integrates a persistent, structured Case-Based Reasoning layer (with quality-controlled case records coupling symbolic metadata and executable code) into Microsoft's R&D-Agent framework. It uses Gemma 4 31B Dense as the locally deployable SLM backbone and evaluates the system on two Kaggle competitions (NOMAD 2018 and Spaceship Titanic) across eight improvement loops with four random seeds. The central claim is that the CBR layer produces directionally higher accuracy (0.8147 vs. 0.8098) and substantially lower variance on Spaceship Titanic, with supporting analysis of 108 retrieval events showing high embedding similarity but variable code-fingerprint overlap.

Significance. If the reported accuracy and variance improvements can be substantiated with adequate statistical controls and larger sample sizes, the work would demonstrate a practical route to persistent, cross-session memory in autonomous data-science agents while using only locally deployable small models. The five-gate quality filter and combined embedding-plus-fingerprint reuse heuristic constitute a concrete, inspectable mechanism for controlled knowledge transfer that could be adopted or extended by other agent frameworks.

major comments (2)

[Abstract] The central empirical claim of directional superiority plus lower variance rests on only four random seeds. No per-seed accuracy values, standard deviations, p-values, bootstrap confidence intervals, or hypothesis tests are reported, so it is impossible to determine whether the 0.49 pp difference (or the cited d = -1.41) exceeds what would be expected from seed-to-seed fluctuation under an otherwise identical agent.
[Abstract] The effect size d = -1.41 is stated without definition, formula, or reference to the underlying per-seed data; this prevents verification of the sign, magnitude, or appropriateness of the statistic for the variance comparison.
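
For orientation, one common reading of d is Cohen's d with a pooled standard deviation. This is an assumption, since the abstract does not define the statistic. The labels A and B for the two conditions are likewise introduced here only for illustration.

```latex
d = \frac{\bar{x}_{A} - \bar{x}_{B}}{s_{p}},
\qquad
s_{p} = \sqrt{\frac{(n_{A}-1)\,s_{A}^{2} + (n_{B}-1)\,s_{B}^{2}}{n_{A}+n_{B}-2}},
\qquad
n_{A} = n_{B} = 4 \;\Rightarrow\; s_{p} = \sqrt{\tfrac{1}{2}\left(s_{A}^{2} + s_{B}^{2}\right)}
```

Under that reading, the reported means give a numerator of magnitude 0.8147 - 0.8098 = 0.0049, so |d| = 1.41 would imply a pooled standard deviation of roughly 0.0049 / 1.41 ≈ 0.0035. The negative sign would then mean that A is the lower-scoring condition, that is, baseline minus CBR. If d instead describes another quantity, none of this arithmetic applies, which is exactly why the definition matters.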

minor comments (1)

[Abstract] The phrase 'Gemma 4 31B Dense' should be clarified with the exact model identifier or citation, as it is not a standard public release name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying gaps in the statistical reporting of the abstract. We address each major comment below and will revise the manuscript to improve transparency and verifiability.

Referee: [Abstract] The central empirical claim of directional superiority plus lower variance rests on only four random seeds. No per-seed accuracy values, standard deviations, p-values, bootstrap confidence intervals, or hypothesis tests are reported, so it is impossible to determine whether the 0.49 pp difference (or the cited d = -1.41) exceeds what would be expected from seed-to-seed fluctuation under an otherwise identical agent.

Authors: We agree that four seeds constitute a small sample and that aggregate means alone do not permit readers to judge whether the 0.49 pp difference exceeds typical seed-to-seed fluctuation. In the revised manuscript we will add a table or explicit listing of the four per-seed accuracies for both the CBR-augmented and baseline conditions on Spaceship Titanic, report the standard deviation across seeds, and include the result of an appropriate paired test (e.g., Wilcoxon signed-rank) together with its p-value and a bootstrap confidence interval. These additions will be placed in both the abstract and a new short results subsection.
Revision: yes

Referee: [Abstract] The effect size d = -1.41 is stated without definition, formula, or reference to the underlying per-seed data; this prevents verification of the sign, magnitude, or appropriateness of the statistic for the variance comparison.

Authors: We will insert an explicit definition of the reported effect size, the formula employed, and the per-seed values used in its calculation. The revised text will also clarify whether the statistic is intended to quantify the accuracy difference or the variance reduction and will cite the standard reference for the chosen effect-size measure.
Revision: yes
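
The checks promised in these responses could be prototyped along the following lines. This is a minimal sketch, not the authors' analysis. The per-seed values are invented placeholders, pairing runs by seed is an assumption, and an exact sign-flip permutation test stands in for the Wilcoxon signed-rank test named in the rebuttal.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

# PLACEHOLDER per-seed accuracies, invented for illustration only.
# They are NOT the paper's results.
cbr_acc = [0.80, 0.82, 0.81, 0.83]
base_acc = [0.78, 0.81, 0.79, 0.80]


def cohens_d(a, b):
    """Cohen's d with pooled standard deviation; sign follows mean(a) - mean(b)."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation test on summed paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    outcomes = [
        abs(sum(s * d for s, d in zip(signs, diffs)))
        for signs in product((1, -1), repeat=len(diffs))
    ]
    # Small tolerance guards against floating-point noise in the comparison.
    extreme = sum(1 for value in outcomes if value >= observed - 1e-12)
    return extreme / len(outcomes)


d = cohens_d(cbr_acc, base_acc)
p = sign_flip_p_value(cbr_acc, base_acc)
print(f"Cohen's d = {d:.2f}, exact two-sided p = {p:.3f}")
```

Two points follow from the sketch. Swapping the argument order of `cohens_d` flips the sign of d, so a reported sign is only interpretable once the order is stated. With four paired seeds there are only 2^4 = 16 sign patterns, so the smallest two-sided p-value an exact paired test of this kind can return is 2/16 = 0.125, and the same floor applies to the exact Wilcoxon signed-rank test. Four seeds therefore cannot reach a 0.05 threshold, however consistent the improvement.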

Circularity Check

0 steps flagged

No circularity; empirical ablation is direct and self-contained

The paper reports a straightforward empirical comparison of CBR-enabled versus CBR-disabled runs of the same R&D-Agent on identical Kaggle tasks (NOMAD 2018, Spaceship Titanic) across four seeds. No equations, derivations, or predictions are presented that reduce reported accuracies (0.8147 vs 0.8098) to fitted parameters defined by the authors. The five-gate filter and reuse-detection mechanism are described as implementation details whose correctness is evaluated externally via observed similarities, not assumed by construction. No self-citation chains or uniqueness theorems are invoked to justify the central claim. The evaluation is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that a small language model can function as a capable autonomous agent backbone and that the custom CBR layer can be surgically inserted without breaking the original loop; no free parameters or invented entities are described in the abstract.

axioms (2)

Domain assumption: Small language models such as Gemma 4 31B can serve as viable backbones for autonomous data-science agents.
The paper positions the model as the first published test case for this role.
Domain assumption: Structured case records with executable code and quality metadata can be reused across sessions via embedding and fingerprint similarity.
This is the core premise of the CBR layer described in the abstract.



Reference graph

Works this paper leans on

44 extracted references · 29 canonical work pages · 11 internal anchors

[1]

AIDE: AI-Driven Exploration in the Space of Code

Z. Jiang, D. Schmidt, D. Srikanth, D. Xu, I. Kaplan, D. Jacenko, and Y . Wu, “AIDE: AI-driven exploration in the space of code,”arXiv preprint arXiv:2502.13138, 2025. [Online]. Available: https://arxiv.org/abs/2502.13138

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

MARS: Modular Agent with Reflective Search for Automated AI Research

J. Chen, B. Dalvi Mishra, J. Nam, R. Meng, T. Pfister, and J. Yoon, “MARS: Modular agent with reflective search for automated AI research,” inProceedings of the 43rd International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research. PMLR, 2026, to appear; proceedings not yet published at time of writing. [Online]. Availabl...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Kolb-based experiential learning for generalist agents with human-level Kaggle data science performance,

A. Grosnit, A. Maraval, Refinath S N, Z. Zhao, J. Doran, G. Paolo, A. Thomas, J. Gonzalez, A. Kumar, K. Khandelwal, A. Benechehab, H. Cherkaoui, Y . Attia El-Hili, K. Shao, J. Hao, J. Yao, B. Kégl, H. Bou-Ammar, and J. Wang, “Kolb-based experiential learning for generalist agents with human-level Kaggle data science performance,”arXiv preprint arXiv:2411....

work page arXiv 2025
[4]

R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science

X. Yang, X. Yang, S. Fang, B. Xian, Y . Li, J. Wang, M. Xu, H. Pan, X. Hong, W. Liu, Y . Shen, W. Chen, and J. Bian, “R&D-Agent: Automating data-driven AI solution building through LLM-powered automated research, development, and evolution,” arXiv preprint arXiv:2505.14738, 2025. [Online]. Available: https: //arxiv.org/abs/2505.14738

work page arXiv 2025
[5]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. M ˛ adry, “MLE-Bench: Evaluating machine learning agents on machine learning engineering,” inProceedings of the 13th International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Ds-agent: Automated data science by empowering large language models with case-based reasoning

S. Guo, C. Deng, Y . Wen, H. Chen, Y . Chang, and J. Wang, “DS- Agent: Automated data science by empowering large language models with case-based reasoning,” inProceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 16 813– 16 848. [Online]. Available: https://ar...

work page arXiv 2024
[7]

Gemma 4 model card,

Google DeepMind, “Gemma 4 model card,” https://ai.google.dev/ gemma/docs/core/model_card_4, 2026, last updated 17 April 2026, accessed 13 May 2026

2026
[8]

Case-based reasoning: Foundational issues, methodological variations, and system approaches,

A. Aamodt and E. Plaza, “Case-based reasoning: Foundational issues, methodological variations, and system approaches,”AI Communications, vol. 7, no. 1, pp. 39–59, 1994. [Online]. Available: https://www.researchgate.net/publication/225070522_Case- Based_Reasoning_Foundational_Issues_Methodological_Variations_ and_System_Approaches

work page arXiv 1994
[9]

Remembering to forget: A competence- preserving case deletion policy for case-based reasoning systems,

B. Smyth and M. T. Keane, “Remembering to forget: A competence- preserving case deletion policy for case-based reasoning systems,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995, pp. 377–382. [Online]. Available: https://www.ijcai.org/Proceedings/95-1/Papers/050.pdf

1995
[10]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474. [Online]. Available: https://arxiv.org/abs/2005.11401

work page internal anchor Pith review Pith/arXiv arXiv 2020
[11]

CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering,

N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch, “CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering,” inProceedings of the 32nd International Conference on Case-Based Reasoning (ICCBR 2024), ser. Lecture Notes in Computer Science, ...

work page arXiv 2024
[12]

Review of case-based reasoning for LLM agents: Theoretical foundations, architectural components, and cognitive integration,

K. Hatalis, D. Christou, and V . Kondapalli, “Review of case-based reasoning for LLM agents: Theoretical foundations, architectural components, and cognitive integration,”arXiv preprint arXiv:2504.06943, 2025. [Online]. Available: https: //arxiv.org/abs/2504.06943

work page arXiv 2025
[13]

Case-based reasoning meets large language models: A research manifesto for open challenges and research directions,

K. Bach, R. Bergmann, F. Brand, M. Caro-Martínez, V . Eisenstadt, M. W. Floyd, L. Jayawardena, D. Leake, M. Lenz, L. Malburg, D. H. Ménager, M. Minor, B. Schack, I. Watson, K. Wilkerson, and N. Wiratunga, “Case-based reasoning meets large language models: A research manifesto for open challenges and research directions,” HAL Science, Tech. Rep. hal-050067...

2025
[14]

Levels of AI memory — and case-based ways for LLMs to ascend them,

M. W. Floyd, D. Leake, D. H. Ménager, I. Watson, and K. Wilkerson, “Levels of AI memory — and case-based ways for LLMs to ascend them,” inCBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 2–14. [Online]. Available: https://ceur-ws.org/V ol-3993/paper1.pdf

2025
[15]

EXAR: A unified experience-grounded agentic reasoning architecture,

R. Bergmann, F. Brand, M. Lenz, and L. Malburg, “EXAR: A unified experience-grounded agentic reasoning architecture,” inProceedings of the 33rd International Conference on Case-Based Reasoning (ICCBR 2025), ser. Lecture Notes in Computer Science, vol. 15662. Springer, 2025, pp. 3–17. [Online]. Available: https://www.wi2.uni- trier.de/shared/publications/2...

2025
[16]

A case-based reasoning approach to dynamic few-shot prompting for code generation,

D. Dannenhauer, Z. Dannenhauer, D. Christou, and K. Hatalis, “A case-based reasoning approach to dynamic few-shot prompting for code generation,” inICML 2024 Workshop on LLMs and Cognition, 2024. [Online]. Available: https://openreview.net/pdf?id= Kt9bM32oDY

2024
[17]

Large language models as knowledge engineers,

F. Brand, L. Malburg, and R. Bergmann, “Large language models as knowledge engineers,” inCBR-LLM Workshop @ ICCBR 2024, ser. CEUR Workshop Proceedings, vol. 3708, 2024, pp. 3–18. [Online]. Available: https://ceur-ws.org/V ol-3708/paper_01.pdf

2024
[18]

Retrieval augmented generation with LLMs for explaining business process models,

M. Minor and E. Kaucher, “Retrieval augmented generation with LLMs for explaining business process models,” inProceedings of the 32nd International Conference on Case-Based Reasoning (ICCBR 2024), ser. Lecture Notes in Computer Science, vol. 14775. Springer, 2024, pp. 175–190. [Online]. Available: http://wi.cs.uni- frankfurt.de/webdav/publications/2024_IC...

2024
[19]

Explainable classification system for hip fractures: A hybrid CBR+LLM surrogate approach,

E. Queipo-de Llano, M. Ciurcau, A. Paz-Olalla, B. Díaz-Agudo, and J. A. Recio-García, “Explainable classification system for hip fractures: A hybrid CBR+LLM surrogate approach,” inXCBR Workshop on CBR for the Explanation of Intelligent Systems @ ICCBR 2024, ser. CEUR Workshop Proceedings, vol. 3708, 2024, pp. 65–80. [Online]. Available: https://ceur-ws.or...

2024
[20]

LLM-driven case-base populating for structuring and integrating restoration experiences,

F. Ghazouani, F. Giustozzi, and F. Le Ber, “LLM-driven case-base populating for structuring and integrating restoration experiences,” inProceedings of the 33rd International Conference on Case- Based Reasoning (ICCBR 2025), ser. Lecture Notes in Computer Science, vol. 15662. Springer, 2025, pp. 67–80. [Online]. Available: https://hal.science/hal-05058570v...

2025
[21]

Agentic CBR in action: Empowering loan approvals through interactive, counterfactual explanations,

P. Salimi, N. Wiratunga, and D. Corsar, “Agentic CBR in action: Empowering loan approvals through interactive, counterfactual explanations,” inCBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 27–42. [Online]. Available: https://ceur-ws.org/V ol-3993/paper3.pdf

2025
[22]

A human-LLM note-taking system with case-based reasoning as framework for scientific discovery,

D. B. Craig, “A human-LLM note-taking system with case-based reasoning as framework for scientific discovery,” inProceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities (AISD @ NAACL 2025), 2025, pp. 22–30. [Online]. Available: https://aclanthology.org/2025.aisd-main.3

2025
[23]

Decision making in LLMs: A first step,

R. O. Weber, C. B. Rauch, and S. Amin, “Decision making in LLMs: A first step,” inCBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 15–26. [Online]. Available: https://ceur-ws.org/V ol-3993/paper2.pdf

2025
[24]

Memento: Fine-tuning llm agents without fine-tuning llms.arXiv preprint arXiv:2508.16153, 2025

H. Zhou, Y . Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y . Lee, G. Zhang, K. Shao, L. Yang, and J. Wang, “Memento: Fine-tuning LLM agents without fine-tuning LLMs,” arXiv preprint arXiv:2508.16153, 2025. [Online]. Available: https: //arxiv.org/abs/2508.16153

work page arXiv 2025
[25]

Hoos, and Kevin Leyton-Brown

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms,” inProceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013, pp. 847–855. [Online]. Available: https://dl.acm.org/doi/10.1145/2487575.2487629

work page doi:10.1145/2487575.2487629 2013
[26]

Auto-sklearn: Automated machine learning,

M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Auto-sklearn: Automated machine learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 2962–2970. [Online]. Available: https://www.researchgate.net/publication/333181102_Auto- sklearn_Efficient_and_Robust_Automated_Machine_Learning

work page arXiv 2015
[27]

Evaluation of a tree-based pipeline optimization tool for automating data science,

R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, “Evaluation of a tree-based pipeline optimization tool for automating data science,” inProceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2016, pp. 485–492. [Online]. Available: https://dl.acm.org/doi/10.1145/2908812.2908918

work page doi:10.1145/2908812.2908918 2016
[29]

Available: https://arxiv.org/abs/2402.18679

[Online]. Available: https://arxiv.org/abs/2402.18679

work page arXiv
[30]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” inProceedings of the 12th International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://arxiv.org/ab...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

Q. Huang, J. V ora, P. Liang, and J. Leskovec, “MLAgentBench: Evaluating language agents on machine learning experimentation,” inProceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 20 271–20 309. [Online]. Available: https://arxiv.org/abs/2310.03302

work page arXiv 2024
[32]

Autokaggle: A multi-agent framework for autonomous data science competitions.arXiv preprint arXiv:2410.20424, 2024

Z. Li, Q. Zang, D. Ma, J. Guo, T. Zheng, M. Liu, X. Niu, Y . Wang, J. Yang, J. Liu, W. Zhong, W. Zhou, W. Huang, and G. Zhang, “AutoKaggle: A multi-agent framework for autonomous data science competitions,”arXiv preprint arXiv:2410.20424, 2024. [Online]. Available: https://arxiv.org/abs/2410.20424

work page arXiv 2024
[33]

MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

J. Nam, J. Yoon, J. Chen, J. Shin, S. Ö. Arık, and T. Pfister, “MLE-STAR: Machine learning engineering agent via search and targeted refinement,”arXiv preprint arXiv:2506.15692, 2025. [Online]. Available: https://arxiv.org/abs/2506.15692

work page arXiv 2025
[34]

OpenHands: An open platform for AI software developers as generalist agents,

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig, “OpenHands: An open platform for AI software developers as generalist agents,” inProceedings of the 13th International Confere...
[35]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

[Online]. Available: https://arxiv.org/abs/2407.16741

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Collaborative evolving strategy for automatic data-centric development,

X. Yang, H. Chen, W. Feng, H. Wang, Z. Ye, X. Shen, X. Yang, S. Sun, W. Liu, and J. Bian, “Collaborative evolving strategy for automatic data-centric development,”arXiv preprint arXiv:2407.18690, 2024. [Online]. Available: https://arxiv.org/abs/2407.18690

work page arXiv 2024
[37]

Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering

X. Zhu, Y . Cai, Z. Liu, B. Zheng, C. Wang, R. Ye, Y . Zhang, L. Zhang, W. E, S. Chen, and Y . Wang, “Toward ultra-long- horizon agentic science: Cognitive accumulation for machine learning engineering,”arXiv preprint arXiv:2601.10402, 2026. [Online]. Available: https://arxiv.org/abs/2601.10402

work page arXiv 2026
[38]

AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Y . Ou, Y . Luo, J. Zheng, L. Wei, Z. Yu, S. Qiao, J. Zhang, D. Zheng, Y . Mao, Y . Gao, H. Chen, and N. Zhang, “AutoMind: Adaptive knowledgeable agent for automated data science,”arXiv preprint arXiv:2506.10974, 2025. [Online]. Available: https://arxiv. org/abs/2506.10974

work page arXiv 2025
[40]

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

[Online]. Available: https://arxiv.org/abs/2505.23723

work page internal anchor Pith review Pith/arXiv arXiv
[41]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023. [Online]. Available: https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7b,”arXiv preprint arXiv:2310.06825, 2023. [Online]. Available: https://arxiv.org/abs/2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y . Wu, Y . K. Li, F. Luo, Y . Xiong, and W. Liang, “DeepSeek-Coder: When the large language model meets programming — the rise of code intelligence,”arXiv preprint arXiv:2401.14196, 2024. [Online]. Available: https://arxiv.org/abs/2401.14196

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks,

S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. Saldyt, and A. Murthy, “Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks,” in Proceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 22 895–22 907. [Online]. Ava...

work page arXiv 2024
[45]

Lost in the middle: How language models use long contexts

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. [Online]. Available: https://arxiv.org/abs/2307.03172

[46]

Welcome Gemma 4: Frontier multimodal intelligence on device

Hugging Face, “Welcome Gemma 4: Frontier multimodal intelligence on device,” https://huggingface.co/blog/gemma4, 2026, published 2 April 2026, accessed 13 May 2026.

APPENDIX

TABLE III. APPENDIX A: PER-RUN PERFORMANCE METRICS. COMP. = COMPETITION. COND. = CONDITION. S = SEED. N = LOOPS COMPLETED. BEST = BEST CUMULATIVE METRIC. L0 = LOOP-0 METRIC. ∆SOTA = ABSOLUTE SOTA GAIN (DIRECTIO...

[1] [1]

AIDE: AI-Driven Exploration in the Space of Code

Z. Jiang, D. Schmidt, D. Srikanth, D. Xu, I. Kaplan, D. Jacenko, and Y. Wu, “AIDE: AI-driven exploration in the space of code,” arXiv preprint arXiv:2502.13138, 2025. [Online]. Available: https://arxiv.org/abs/2502.13138

[2] [2]

MARS: Modular Agent with Reflective Search for Automated AI Research

J. Chen, B. Dalvi Mishra, J. Nam, R. Meng, T. Pfister, and J. Yoon, “MARS: Modular agent with reflective search for automated AI research,” in Proceedings of the 43rd International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research. PMLR, 2026, to appear; proceedings not yet published at time of writing. [Online]. Availabl...

[3] [3]

Kolb-based experiential learning for generalist agents with human-level Kaggle data science performance

A. Grosnit, A. Maraval, Refinath S N, Z. Zhao, J. Doran, G. Paolo, A. Thomas, J. Gonzalez, A. Kumar, K. Khandelwal, A. Benechehab, H. Cherkaoui, Y. Attia El-Hili, K. Shao, J. Hao, J. Yao, B. Kégl, H. Bou-Ammar, and J. Wang, “Kolb-based experiential learning for generalist agents with human-level Kaggle data science performance,” arXiv preprint arXiv:2411....

[4] [4]

R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science

X. Yang, X. Yang, S. Fang, B. Xian, Y. Li, J. Wang, M. Xu, H. Pan, X. Hong, W. Liu, Y. Shen, W. Chen, and J. Bian, “R&D-Agent: Automating data-driven AI solution building through LLM-powered automated research, development, and evolution,” arXiv preprint arXiv:2505.14738, 2025. [Online]. Available: https://arxiv.org/abs/2505.14738

[5] [5]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Mądry, “MLE-Bench: Evaluating machine learning agents on machine learning engineering,” in Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https://arxiv.org/abs/2...

[6] [6]

DS-Agent: Automated data science by empowering large language models with case-based reasoning

S. Guo, C. Deng, Y. Wen, H. Chen, Y. Chang, and J. Wang, “DS-Agent: Automated data science by empowering large language models with case-based reasoning,” in Proceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 16 813–16 848. [Online]. Available: https://ar...

[7] [7]

Gemma 4 model card

Google DeepMind, “Gemma 4 model card,” https://ai.google.dev/gemma/docs/core/model_card_4, 2026, last updated 17 April 2026, accessed 13 May 2026.

[8] [8]

Case-based reasoning: Foundational issues, methodological variations, and system approaches

A. Aamodt and E. Plaza, “Case-based reasoning: Foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994. [Online]. Available: https://www.researchgate.net/publication/225070522_Case-Based_Reasoning_Foundational_Issues_Methodological_Variations_and_System_Approaches

[9] [9]

Remembering to forget: A competence-preserving case deletion policy for case-based reasoning systems

B. Smyth and M. T. Keane, “Remembering to forget: A competence-preserving case deletion policy for case-based reasoning systems,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995, pp. 377–382. [Online]. Available: https://www.ijcai.org/Proceedings/95-1/Papers/050.pdf

[10] [10]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474. [Online]. Available: https://arxiv.org/abs/2005.11401

[11] [11]

CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering

N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch, “CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering,” in Proceedings of the 32nd International Conference on Case-Based Reasoning (ICCBR 2024), ser. Lecture Notes in Computer Science, ...

[12] [12]

Review of case-based reasoning for LLM agents: Theoretical foundations, architectural components, and cognitive integration

K. Hatalis, D. Christou, and V. Kondapalli, “Review of case-based reasoning for LLM agents: Theoretical foundations, architectural components, and cognitive integration,” arXiv preprint arXiv:2504.06943, 2025. [Online]. Available: https://arxiv.org/abs/2504.06943

[13] [13]

Case-based reasoning meets large language models: A research manifesto for open challenges and research directions

K. Bach, R. Bergmann, F. Brand, M. Caro-Martínez, V. Eisenstadt, M. W. Floyd, L. Jayawardena, D. Leake, M. Lenz, L. Malburg, D. H. Ménager, M. Minor, B. Schack, I. Watson, K. Wilkerson, and N. Wiratunga, “Case-based reasoning meets large language models: A research manifesto for open challenges and research directions,” HAL Science, Tech. Rep. hal-050067...

[14] [14]

Levels of AI memory — and case-based ways for LLMs to ascend them

M. W. Floyd, D. Leake, D. H. Ménager, I. Watson, and K. Wilkerson, “Levels of AI memory — and case-based ways for LLMs to ascend them,” in CBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 2–14. [Online]. Available: https://ceur-ws.org/Vol-3993/paper1.pdf

[15] [15]

EXAR: A unified experience-grounded agentic reasoning architecture

R. Bergmann, F. Brand, M. Lenz, and L. Malburg, “EXAR: A unified experience-grounded agentic reasoning architecture,” in Proceedings of the 33rd International Conference on Case-Based Reasoning (ICCBR 2025), ser. Lecture Notes in Computer Science, vol. 15662. Springer, 2025, pp. 3–17. [Online]. Available: https://www.wi2.uni-trier.de/shared/publications/2...

[16] [16]

A case-based reasoning approach to dynamic few-shot prompting for code generation

D. Dannenhauer, Z. Dannenhauer, D. Christou, and K. Hatalis, “A case-based reasoning approach to dynamic few-shot prompting for code generation,” in ICML 2024 Workshop on LLMs and Cognition, 2024. [Online]. Available: https://openreview.net/pdf?id=Kt9bM32oDY

[17] [17]

Large language models as knowledge engineers

F. Brand, L. Malburg, and R. Bergmann, “Large language models as knowledge engineers,” in CBR-LLM Workshop @ ICCBR 2024, ser. CEUR Workshop Proceedings, vol. 3708, 2024, pp. 3–18. [Online]. Available: https://ceur-ws.org/Vol-3708/paper_01.pdf

[18] [18]

Retrieval augmented generation with LLMs for explaining business process models

M. Minor and E. Kaucher, “Retrieval augmented generation with LLMs for explaining business process models,” in Proceedings of the 32nd International Conference on Case-Based Reasoning (ICCBR 2024), ser. Lecture Notes in Computer Science, vol. 14775. Springer, 2024, pp. 175–190. [Online]. Available: http://wi.cs.uni-frankfurt.de/webdav/publications/2024_IC...

[19] [19]

Explainable classification system for hip fractures: A hybrid CBR+LLM surrogate approach

E. Queipo-de Llano, M. Ciurcau, A. Paz-Olalla, B. Díaz-Agudo, and J. A. Recio-García, “Explainable classification system for hip fractures: A hybrid CBR+LLM surrogate approach,” in XCBR Workshop on CBR for the Explanation of Intelligent Systems @ ICCBR 2024, ser. CEUR Workshop Proceedings, vol. 3708, 2024, pp. 65–80. [Online]. Available: https://ceur-ws.or...

[20] [20]

LLM-driven case-base populating for structuring and integrating restoration experiences

F. Ghazouani, F. Giustozzi, and F. Le Ber, “LLM-driven case-base populating for structuring and integrating restoration experiences,” in Proceedings of the 33rd International Conference on Case-Based Reasoning (ICCBR 2025), ser. Lecture Notes in Computer Science, vol. 15662. Springer, 2025, pp. 67–80. [Online]. Available: https://hal.science/hal-05058570v...

[21] [21]

Agentic CBR in action: Empowering loan approvals through interactive, counterfactual explanations

P. Salimi, N. Wiratunga, and D. Corsar, “Agentic CBR in action: Empowering loan approvals through interactive, counterfactual explanations,” in CBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 27–42. [Online]. Available: https://ceur-ws.org/Vol-3993/paper3.pdf

[22] [22]

A human-LLM note-taking system with case-based reasoning as framework for scientific discovery

D. B. Craig, “A human-LLM note-taking system with case-based reasoning as framework for scientific discovery,” in Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities (AISD @ NAACL 2025), 2025, pp. 22–30. [Online]. Available: https://aclanthology.org/2025.aisd-main.3

[23] [23]

Decision making in LLMs: A first step

R. O. Weber, C. B. Rauch, and S. Amin, “Decision making in LLMs: A first step,” in CBR-LLM Workshop @ ICCBR 2025, ser. CEUR Workshop Proceedings, vol. 3993, 2025, pp. 15–26. [Online]. Available: https://ceur-ws.org/Vol-3993/paper2.pdf

[24] [24]

Memento: Fine-tuning LLM agents without fine-tuning LLMs

H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, and J. Wang, “Memento: Fine-tuning LLM agents without fine-tuning LLMs,” arXiv preprint arXiv:2508.16153, 2025. [Online]. Available: https://arxiv.org/abs/2508.16153

[25] [25]

Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013, pp. 847–855. [Online]. Available: https://dl.acm.org/doi/10.1145/2487575.2487629

[26] [26]

Auto-sklearn: Automated machine learning

M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Auto-sklearn: Automated machine learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 2962–2970. [Online]. Available: https://www.researchgate.net/publication/333181102_Auto-sklearn_Efficient_and_Robust_Automated_Machine_Learning

[27] [27]

Evaluation of a tree-based pipeline optimization tool for automating data science

R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, “Evaluation of a tree-based pipeline optimization tool for automating data science,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2016, pp. 485–492. [Online]. Available: https://dl.acm.org/doi/10.1145/2908812.2908918

[28] [29]

[Online]. Available: https://arxiv.org/abs/2402.18679

[29] [30]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” in Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://arxiv.org/ab...

[30] [31]

MLAgentBench: Evaluating language agents on machine learning experimentation

Q. Huang, J. Vora, P. Liang, and J. Leskovec, “MLAgentBench: Evaluating language agents on machine learning experimentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 20 271–20 309. [Online]. Available: https://arxiv.org/abs/2310.03302

[31] [32]

AutoKaggle: A multi-agent framework for autonomous data science competitions

Z. Li, Q. Zang, D. Ma, J. Guo, T. Zheng, M. Liu, X. Niu, Y. Wang, J. Yang, J. Liu, W. Zhong, W. Zhou, W. Huang, and G. Zhang, “AutoKaggle: A multi-agent framework for autonomous data science competitions,” arXiv preprint arXiv:2410.20424, 2024. [Online]. Available: https://arxiv.org/abs/2410.20424

[32] [33]

MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

J. Nam, J. Yoon, J. Chen, J. Shin, S. Ö. Arık, and T. Pfister, “MLE-STAR: Machine learning engineering agent via search and targeted refinement,” arXiv preprint arXiv:2506.15692, 2025. [Online]. Available: https://arxiv.org/abs/2506.15692

[33] [34]

OpenHands: An open platform for AI software developers as generalist agents

X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig, “OpenHands: An open platform for AI software developers as generalist agents,” in Proceedings of the 13th International Confere...

[34] [35]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

[Online]. Available: https://arxiv.org/abs/2407.16741


[35] [36]

Collaborative evolving strategy for automatic data-centric development

X. Yang, H. Chen, W. Feng, H. Wang, Z. Ye, X. Shen, X. Yang, S. Sun, W. Liu, and J. Bian, “Collaborative evolving strategy for automatic data-centric development,” arXiv preprint arXiv:2407.18690, 2024. [Online]. Available: https://arxiv.org/abs/2407.18690

[36] [37]

Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering

X. Zhu, Y. Cai, Z. Liu, B. Zheng, C. Wang, R. Ye, Y. Zhang, L. Zhang, W. E, S. Chen, and Y. Wang, “Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering,” arXiv preprint arXiv:2601.10402, 2026. [Online]. Available: https://arxiv.org/abs/2601.10402

[37] [38]

AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Y. Ou, Y. Luo, J. Zheng, L. Wei, Z. Yu, S. Qiao, J. Zhang, D. Zheng, Y. Mao, Y. Gao, H. Chen, and N. Zhang, “AutoMind: Adaptive knowledgeable agent for automated data science,” arXiv preprint arXiv:2506.10974, 2025. [Online]. Available: https://arxiv.org/abs/2506.10974

[38] [40]

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

[Online]. Available: https://arxiv.org/abs/2505.23723


[39] [41]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [Online]. Available: https://arxiv.org/abs/2302.13971

[40] [42]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023. [Online]. Available: https://arxiv.org/abs/2310.06825

[41] [43]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “DeepSeek-Coder: When the large language model meets programming — the rise of code intelligence,” arXiv preprint arXiv:2401.14196, 2024. [Online]. Available: https://arxiv.org/abs/2401.14196

[42] [44]

Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks

S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. Saldyt, and A. Murthy, “Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks,” in Proceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 22 895–22 907. [Online]. Ava...

[43] [45]

Lost in the middle: How language models use long contexts

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. [Online]. Available: https://arxiv.org/abs/2307.03172

[44] [46]

Welcome Gemma 4: Frontier multimodal intelligence on device

Hugging Face, “Welcome Gemma 4: Frontier multimodal intelligence on device,” https://huggingface.co/blog/gemma4, 2026, published 2 April 2026, accessed 13 May 2026.