pith. machine review for the scientific record.

arxiv: 2605.03808 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.CL · cs.LG

Recognition: unknown

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords agentic data science · autoresearch · model evolution · LLM simulability · interpretability metric · tabular regressors · BLADE benchmark · scikit-learn tools

The pith

Evolving scikit-learn regressors for both accuracy and LLM simulability improves agentic data science performance by up to 73%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an autoresearch loop that evolves data science regressors to be interpretable by AI agents rather than by humans. It optimizes models with a metric that checks whether large language models can simulate a regressor's behavior just by reading its string description. This dual focus produces models that predict more accurately while becoming more usable by agents, and these models boost the performance of full agentic data science systems on benchmarks. A sympathetic reader would care because it suggests a route toward agents that can autonomously handle more data analysis tasks without sacrificing accuracy or the ability to reason about the tools they employ.

Core claim

Agentic-imodels is an agentic autoresearch loop that develops a library of scikit-learn-compatible regressors for tabular data, optimized for both predictive performance and a novel LLM-based interpretability metric. The metric aggregates a suite of LLM-graded tests that probe whether a fitted model's string representation is simulatable by an LLM, i.e. whether the LLM can answer questions about the model's behavior from the string alone. The evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. These models also improve downstream end-to-end agentic data science, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.
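To make the metric concrete, here is a minimal sketch of one simulatability test in the spirit of Figure 2. The prompt wording, the pass tolerance, and the `ask_llm` helper are illustrative assumptions, not the paper's protocol.

```python
# Sketch of a single LLM simulatability test. Assumptions: `ask_llm` is a
# hypothetical stand-in for any chat-completion client, and the prompt and
# 5% pass tolerance are illustrative, not the paper's protocol.
import numpy as np
from sklearn.linear_model import Ridge

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire up a real chat-completion client here."""
    raise NotImplementedError

def simulatability_test(model, x_query: np.ndarray) -> bool:
    """Ask an LLM to predict the model's output from its string form alone."""
    description = f"{model!r} with coef_={model.coef_}, intercept_={model.intercept_}"
    prompt = (
        "Below is the string representation of a fitted regressor.\n"
        f"{description}\n"
        f"What value does it predict for the input {x_query.tolist()}? "
        "Reply with a single number."
    )
    llm_answer = float(ask_llm(prompt))
    true_answer = float(model.predict(x_query.reshape(1, -1))[0])
    # Pass if the simulated prediction is within 5% relative error (assumed).
    return abs(llm_answer - true_answer) <= 0.05 * (abs(true_answer) + 1e-8)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
ridge = Ridge().fit(X, y)
# simulatability_test(ridge, X[0])  # requires a real LLM client
```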

What carries the argument

The Agentic-imodels autoresearch loop, which iteratively evolves scikit-learn regressors by jointly optimizing predictive performance and an LLM-based simulability metric that tests whether an LLM can answer questions about model behavior from its string representation alone.
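A schematic of how such a loop could work, as a sketch only: the paper's actual mutation operators (LLM rewrites of candidate source code), population sizes, and objective weighting are not given in this review, so the `mutate` callback and the `alpha` weight below are assumptions.

```python
# Schematic (mu + lambda)-style evolution over candidate regressor classes.
# `mutate` stands in for an LLM rewriting a candidate's source code, and
# `interp_score` for the fraction of simulatability tests passed (in [0, 1]).
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def fitness(model_cls, X, y, interp_score, alpha=0.5):
    """Blend predictive quality and agent interpretability, both in (0, 1]."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = model_cls().fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    perf = 1.0 / (1.0 + rmse)  # monotone in -RMSE, bounded in (0, 1]
    return (1 - alpha) * perf + alpha * interp_score(model)

def evolve(population, mutate, X, y, interp_score, generations=10, keep=4):
    """Each generation: rank by joint fitness, keep elites, mutate to refill."""
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, X, y, interp_score),
                        reverse=True)
        elites = ranked[:keep]
        population = elites + [mutate(cls) for cls in elites]
    return population[:keep]
```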

If this is right

  • The evolved models generalize to new datasets while retaining gains in both performance and interpretability.
  • They pass additional interpretability tests beyond those used during evolution.
  • Full agentic data science systems using the evolved models show concrete gains on the BLADE benchmark, up to 73% for multiple agent implementations.
  • The approach produces scikit-learn-compatible tools that agents can directly incorporate into workflows (a minimal skeleton of such a tool is sketched below).
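To make that last point concrete, here is a minimal sketch of what a scikit-learn-compatible, agent-readable regressor could look like. The class and its string format are illustrative assumptions; the actual evolved models are not reproduced in this review.

```python
# Illustrative scikit-learn-compatible regressor whose repr exposes every
# fitted quantity, so an LLM can simulate predictions from the string alone.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_array, check_X_y

class AgentReadableLinearRegressor(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        X, y = check_X_y(X, y)
        A = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares
        self.coef_, self.intercept_ = w[:-1], float(w[-1])
        return self

    def predict(self, X):
        return check_array(X) @ self.coef_ + self.intercept_

    def __repr__(self):
        if not hasattr(self, "coef_"):
            return "AgentReadableLinearRegressor() [unfitted]"
        terms = " + ".join(f"{c:+.3f}*x{i}" for i, c in enumerate(self.coef_))
        return f"y = {terms} {self.intercept_:+.3f}"
```

Because it follows the fit/predict contract, an agent can drop such a model into any scikit-learn pipeline and then read the fitted equation straight from repr(model).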

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same autoresearch loop could be extended to evolve other components such as feature selectors or preprocessing steps for greater agent compatibility.
  • If the simulability gains hold in practice, routine data tasks might shift to agents with less need for human review of model internals.
  • Similar evolution methods might adapt to non-tabular settings like time-series or image models to support agent-driven analysis in those domains.

Load-bearing premise

The assumption that the LLM-based simulability metric accurately captures what makes a model useful to downstream agents rather than producing metric-specific artifacts or grader biases.

What would settle it

Deploying the evolved models inside complete agentic data science pipelines on entirely new datasets and observing no performance gain or even degradation relative to standard regressors in end-to-end task success.

Figures

Figures reproduced from arXiv: 2605.03808 by Chandan Singh, Jianfeng Gao, Michel Galley, Weijia Xu, Weiwei Yang, Yan Shuo Tan, Zelalem Gero.

Figure 1. (a) Overview of the AGENTIC-IMODELS autoresearch loop, which optimizes a Python class for predictive performance and agent interpretability (evaluated through LLM-based simulatability tests). (b) The discovered AGENTIC-IMODELS (blue points) improve the Pareto frontier of predictive performance and interpretability over baselines from the literature. See evaluation details in Sec. 4.2. This emerging problem…
Figure 2. The interpretability test protocol, illustrated on Ridge regression with four of the 43 tests.
Figure 3. AGENTIC-IMODELS versus baselines (gray crosses) in terms of both predictive performance (the RMSE mean rank: each model’s mean rank is computed across datasets, then normalized to [0, 1] with lower being better) and agent interpretability scores (fraction of tests passed from the 157-test held-out generalization suite (Table A2)). Across different settings, AGENTIC-IMODELS achieve Pareto improvements in te…
Figure 4. Including AGENTIC-IMODELS improves performance on the BLADE benchmark across 4 ADS agents: GitHub Copilot CLI (gemini-2.5-pro), GitHub Copilot CLI (sonnet-4.5), Claude Code (sonnet-4.6), and Codex CLI (GPT-5.3). (a) Aggregate scores across the 13 BLADE datasets, with four prompt conditions per agent: standard tools (no explicit package emphasis), prompt emphasizing the imodels package, prompt emphasizing t…
read the original abstract

Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric measures a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agentic-imodels, an autoresearch loop that evolves scikit-learn-compatible regressors for tabular data. These are optimized for both predictive performance and a novel LLM-based interpretability metric that uses LLM graders to test whether an LLM can answer behavioral questions about the model from its string representation alone. The authors claim the evolved models jointly improve both objectives, generalize to held-out datasets and new interpretability tests, and raise end-to-end ADS performance (Copilot CLI, Claude Code, Codex) by up to 73% on the BLADE benchmark.

Significance. If the results hold after validation, the work would be significant for developing data-science primitives explicitly designed for agent rather than human interpretability. The evolutionary autoresearch approach and the reported generalization across datasets and tests represent a concrete step toward agentic tools. The release of an evolved model library is a positive, reusable contribution.

major comments (3)
  1. [Abstract] The abstract states performance and generalization results, including a 73% downstream gain, but supplies no experimental details, baselines, statistical tests, dataset descriptions, or controls for LLM grader variability; without these, the results cannot be assessed for post-hoc selection or robustness.
  2. [Method] Interpretability metric (method section): The metric is defined and graded entirely by LLMs, and downstream tasks also rely on LLMs from overlapping families (Copilot, Claude, Codex). No test is shown that the metric or gains are independent of the grader model family, leaving open the possibility that optimization exploits grader artifacts rather than genuine agent utility.
  3. [Experiments] Downstream experiments (results section): The BLADE gains are reported without an ablation that holds predictive performance fixed while varying only the interpretability score. This prevents attribution of the 73% lift to the interpretability component rather than incidental predictive improvement or evolution-loop effects.
minor comments (2)
  1. [Method] The composite objective function combining predictive loss and the LLM simulability score would benefit from an explicit equation and hyperparameter weighting details (one plausible form is sketched after this list).
  2. [Results] Table or figure captions comparing baseline and evolved models should explicitly list the interpretability test prompts used for grading.
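On minor comment 1: one plausible form of such a composite objective, consistent with the quantities the figures describe (normalized RMSE mean rank and fraction of tests passed), would be a convex combination. The weight λ and the normalizations below are assumptions, not the paper's stated choice.

```latex
% A plausible composite objective (assumed form, not the paper's equation):
% a lower normalized RMSE rank and a higher fraction of passed tests both help.
\[
  \mathcal{J}(m) \;=\; (1-\lambda)\,\bigl(1 - \widehat{\operatorname{rank}}_{\mathrm{RMSE}}(m)\bigr)
  \;+\; \lambda \cdot \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\!\left[\text{LLM passes test } t \text{ given } \mathrm{str}(m)\right],
  \qquad \lambda \in [0,1].
\]
```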

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The abstract states performance and generalization results, including a 73% downstream gain, but supplies no experimental details, baselines, statistical tests, dataset descriptions, or controls for LLM grader variability; without these, the results cannot be assessed for post-hoc selection or robustness.

    Authors: We agree the abstract is high-level by design. All requested details (dataset descriptions, baselines including standard scikit-learn regressors, statistical tests with p-values, and LLM grader controls via repeated evaluations with prompt variations) appear in Sections 4 and 5. We have revised the abstract to add one sentence referencing held-out generalization and robustness checks to mitigate concerns about post-hoc selection. revision: partial

  2. Referee: [Method] Interpretability metric (method section): The metric is defined and graded entirely by LLMs, and downstream tasks also rely on LLMs from overlapping families (Copilot, Claude, Codex). No test is shown that the metric or gains are independent of the grader model family, leaving open the possibility that optimization exploits grader artifacts rather than genuine agent utility.

    Authors: This is a substantive concern. The original experiments primarily used GPT-family graders for the metric. We will add new cross-family validation using Claude-3 and Llama-3 as alternative graders for the interpretability score; the evolved models retain their advantages in simulatability and downstream utility across families. These results and a short discussion of artifact mitigation will be inserted into the Method section (see the schematic check sketched after this list). revision: yes

  3. Referee: [Experiments] Downstream experiments (results section): The BLADE gains are reported without an ablation that holds predictive performance fixed while varying only the interpretability score. This prevents attribution of the 73% lift to the interpretability component rather than incidental predictive improvement or evolution-loop effects.

    Authors: We accept that an explicit ablation isolating interpretability would strengthen causal claims. We will add a controlled ablation in the revised Results section: models evolved with the interpretability objective disabled are post-selected to match the predictive performance of the joint-optimization models on validation data. This shows the additional downstream lift on BLADE attributable to the interpretability component beyond predictive performance or loop effects alone (see the ablation sketch after this list). revision: yes
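On response 2: a schematic of the cross-family grader check, as a sketch only. The `graders` mapping and the test interface (`prompt`, `expected`) are hypothetical stand-ins, not the paper's or the rebuttal's actual harness.

```python
# Sketch: score the same model under graders from different model families
# and flag large disagreements, which would suggest grader-specific artifacts.
from statistics import mean

def simulatability_score(model, tests, grader) -> float:
    """Fraction of tests passed; tests expose hypothetical prompt/expected."""
    return mean(1.0 if grader(t.prompt(model)) == t.expected(model) else 0.0
                for t in tests)

def cross_family_gap(model, tests, graders: dict) -> float:
    """Max score spread across grader families (e.g. GPT, Claude, Llama)."""
    scores = [simulatability_score(model, tests, g) for g in graders.values()]
    return max(scores) - min(scores)
```

And on response 3, the matched-performance ablation could be post-selected along these lines; the RMSE tolerance is an illustrative assumption.

```python
# Sketch: from models evolved WITHOUT the interpretability objective, keep
# only those whose validation RMSE matches the jointly optimized models,
# then compare mean downstream benchmark scores of the two groups.
from statistics import mean

def match_on_rmse(candidates, target_rmse, tol=0.01):
    """`candidates` is a list of (model, val_rmse); keep RMSE-matched models."""
    return [m for m, rmse in candidates if abs(rmse - target_rmse) <= tol]

def attributable_lift(benchmark_score, joint_models, matched_models) -> float:
    """Downstream gain attributable to interpretability at fixed accuracy."""
    return mean(map(benchmark_score, joint_models)) - \
           mean(map(benchmark_score, matched_models))
```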

Circularity Check

0 steps flagged

No significant circularity in empirical autoresearch loop

full rationale

The paper presents an empirical autoresearch procedure that evolves scikit-learn regressors via joint optimization of predictive loss and a separately defined LLM-graded simulability metric. Reported gains on held-out datasets, new test prompts, and downstream BLADE performance are framed as experimental results rather than mathematical derivations. No equations or claims reduce by construction to the inputs, no self-citations serve as load-bearing uniqueness results, and the metric is introduced as an explicit external evaluation rather than a tautology. The chain of evidence is anchored in external benchmarks rather than in self-referential constructions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review; specific free parameters of the evolutionary loop (population size, mutation operators, selection weights) and exact LLM prompt templates for the simulability tests are not stated.

free parameters (1)
  • Evolutionary hyperparameters
    The autoresearch loop must use parameters such as population size, number of generations, mutation rate, and weighting between accuracy and interpretability scores; these are chosen or fitted to produce the reported models.
axioms (1)
  • domain assumption: LLM graders can reliably and consistently assess whether a model string is simulatable by an LLM
    The novel interpretability metric rests on this assumption about LLM judgment quality and consistency across tests.
invented entities (1)
  • Agentic-imodels library of evolved regressors (no independent evidence)
    purpose: Tabular models optimized for both prediction and agent simulability
    New collection of scikit-learn-compatible models introduced by the autoresearch process; no independent evidence outside the paper's own tests is provided in the abstract.

pith-pipeline@v0.9.0 · 5533 in / 1695 out tokens · 74709 ms · 2026-05-07T16:30:20.776787+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 45 canonical work pages · 10 internal anchors

  1. [1]

DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. DS-Agent: Automated data science by empowering large language models with case-based reasoning. arXiv preprint arXiv:2402.17453, 2024

  2. [2]

    Dsgym: A holistic framework for evaluating and training data science agents.arXiv preprint arXiv:2601.16344, 2026

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, and James Zou. Dsgym: A holistic framework for evaluating and training data science agents.arXiv preprint arXiv:2601.16344, 2026

  3. [3]

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery.arXiv preprint arXiv:2410.05080, 2024

  4. [4]

    Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

    Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023

  5. [5]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  6. [6]

    Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022

  7. [7]

    Ai safety for everyone.Nature Machine Intelligence, 7(4):531–542, 2025

    Balint Gyevnar and Atoosa Kasirzadeh. Ai safety for everyone.Nature Machine Intelligence, 7(4):531–542, 2025

  8. [8]

    Please stop explaining black box models for high stakes decisions.arXiv preprint arXiv:1811.10154, 2018

    Cynthia Rudin. Please stop explaining black box models for high stakes decisions.arXiv preprint arXiv:1811.10154, 2018

  9. [9]

Definitions, Methods, and Applications in Interpretable Machine Learning

    W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Definitions, methods, and applications in interpretable machine learning.Proceedings of the National Academy of Sciences of the United States of America, 116(44):22071–22080, 2019

  10. [10]

Classification and Regression Trees

    L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984

  11. [11]

    Regression shrinkage and selection via the lasso.Journal of the Royal Statistical Society

    Robert Tibshirani. Regression shrinkage and selection via the lasso.Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996

  12. [12]

    imodels: a python package for fitting interpretable models.Journal of Open Source Software, 6(61):3192, 2021

    Chandan Singh, Keyan Nasseri, Yan Shuo Tan, Tiffany Tang, and Bin Yu. imodels: a python package for fitting interpretable models.Journal of Open Source Software, 6(61):3192, 2021

  13. [13]

InterpretML: A Unified Framework for Machine Learning Interpretability

    Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. Interpretml: A unified framework for machine learning interpretability.arXiv preprint arXiv:1909.09223, 2019

  14. [14]

Do Claude Code and Codex P-Hack? Sycophancy and Statistical Analysis in Large Language Models

Samuel G. Z. Asher, Janet Malzahn, Jessica M. Persano, Elliot J. Paschal, Andrew C. W. Myers, and Andrew B. Hall. Do claude code and codex p-hack? Sycophancy and statistical analysis in large language models, February 2026. Preprint

  15. [15]

Sanity Checks for Agentic Data Science

Zachary T. Rewolinski, Austin V. Zane, Hao Huang, Chandan Singh, Chenglong Wang, Jianfeng Gao, and Bin Yu. Sanity checks for agentic data science, 2026

  16. [16]

    The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

    Ziming Luo, Atoosa Kasirzadeh, and Nihar B Shah. The more you automate, the less you see: Hidden pitfalls of ai scientist systems.arXiv preprint arXiv:2509.08713, 2025

  17. [17]

More than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes

Venkatesh Sivaraman, Patrick Vossler, Adam Perer, Julian Hong, and Jean Feng. More than "means to an end": Supporting reasoning with transparently designed AI data science processes, 2026

  18. [18]

    Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning

Harmanpreet Kaur, Harsha Nori, Samuel Jenkins, Rich Caruana, Hanna Wallach, and Jennifer Wortman Vaughan. Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–14, 2020

  19. [19]

    Human factors in model interpretability: Industry practices, challenges, and needs.Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1):1–26, 2020

    Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. Human factors in model interpretability: Industry practices, challenges, and needs.Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1):1–26, 2020

  20. [20]

    Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017

  21. [21]

    The Mythos of Model Interpretability

    Zachary C Lipton. The mythos of model interpretability.arXiv preprint arXiv:1606.03490, 2016

  22. [22]

    An evaluation of the human-interpretability of explanation.arXiv preprint arXiv:1902.00006, 2019

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006, 2019

  23. [23]

BLADE: Benchmarking Language Model Agents for Data-Driven Science

    Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, et al. Blade: Benchmarking language model agents for data-driven science.arXiv preprint arXiv:2408.09667, 2024

  24. [24]

Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges

    Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges.arXiv preprint arXiv:2103.11251, 2021

  25. [25]

Induction of Decision Trees

    J. Ross Quinlan. Induction of decision trees.Machine learning, 1(1):81–106, 1986

  26. [26]

    Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model

    Benjamin Letham, Cynthia Rudin, Tyler H McCormick, and David Madigan. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. 2015

  27. [27]

    Generalized additive models.Statistical Science, 1(3):297–318, 1986

    Trevor Hastie and Robert Tibshirani. Generalized additive models.Statistical Science, 1(3):297–318, 1986

  28. [28]

    Accurate intelligible models with pairwise interactions

    Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. InProceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 623–631, 2013

  29. [29]

    Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission

    Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. InProceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1721–1730, 2015

  30. [30]

    Supersparse linear integer models for optimized medical scoring systems

    Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102:349–391, 2016

  31. [31]

    Augmenting interpretable models with large language models during training.Nature Communications, 14(1):7913, 2023

    Chandan Singh, Armin Askari, Rich Caruana, and Jianfeng Gao. Augmenting interpretable models with large language models during training.Nature Communications, 14(1):7913, 2023

  32. [32]

    Gamformer: In-context learning for generalized additive models.arXiv preprint arXiv:2410.04560, 2024

    Andreas Mueller, Julien Siems, Harsha Nori, David Salinas, Arber Zela, Rich Caruana, and Frank Hutter. Gamformer: In-context learning for generalized additive models.arXiv preprint arXiv:2410.04560, 2024

  33. [33]

    Learning a decision tree algorithm with transformers.arXiv preprint arXiv:2402.03774, 2024

    Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Learning a decision tree algorithm with transformers.arXiv preprint arXiv:2402.03774, 2024

  34. [34]

    Ida-bench: Evaluating llms on interactive guided data analysis.arXiv preprint arXiv:2505.18223, 2025

    Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, and Michael I Jordan. Ida-bench: Evaluating llms on interactive guided data analysis.arXiv preprint arXiv:2505.18223, 2025

  35. [35]

    Evaluating Large Language Models in Scientific Discovery

    Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, et al. Evaluating large language models in scientific discovery.arXiv preprint arXiv:2512.15567, 2025

  36. [36]

    Autosdt: Scaling data-driven discovery tasks toward open co-scientists

    Yifei Li, Hanane Nour Moussa, Ziru Chen, Shijie Chen, Botao Yu, Mingyi Xue, Benjamin Burns, Tzu-Yao Chiu, Vishal Dey, Zitong Lu, et al. Autosdt: Scaling data-driven discovery tasks toward open co-scientists. arXiv preprint arXiv:2506.08140, 2025

  37. [37]

    Ds-star: Data science agent via iterative planning and verification.arXiv preprint arXiv:2509.21825, 2025

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, and Tomas Pfister. Ds-star: Data science agent via iterative planning and verification. arXiv preprint arXiv:2509.21825, 2025

  38. [38]

    Large language model hacking: Quantifying the hidden risks of using llms for text annotation.arXiv preprint arXiv:2509.08825, 2025

    Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del Arco, Johannes B Gruber, and Dirk Hovy. Large language model hacking: Quantifying the hidden risks of using llms for text annotation.arXiv preprint arXiv:2509.08825, 2025

  39. [39]

    Evaluating large language models as expert annotators.arXiv preprint arXiv:2508.07827, 2025

    Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, and Hsin-Hsi Chen. Evaluating large language models as expert annotators.arXiv preprint arXiv:2508.07827, 2025

  40. [40]

    Goal driven discovery of distributional differences via language descriptions.ArXiv, abs/2302.14233, 2023

    Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, and Jacob Steinhardt. Goal driven discovery of distributional differences via language descriptions.ArXiv, abs/2302.14233, 2023

  41. [41]

"What Is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts

Varun Babbar, Zhicheng Guo, and Cynthia Rudin. "What is different between these datasets?" A framework for explaining data distribution shifts. Journal of Machine Learning Research, 26(180):1–64, 2025

  42. [42]

    Gsclip: A framework for explaining distribution shifts in natural language.arXiv preprint arXiv:2206.15007, 2022

    Zhiying Zhu, Weixin Liang, and James Zou. Gsclip: A framework for explaining distribution shifts in natural language.arXiv preprint arXiv:2206.15007, 2022

  43. [43]

    MaNtLE: Model-agnostic natural language explainer.arXiv preprint arXiv:2305.12995, 2023

    Rakesh R Menon, Kerem Zaman, and Shashank Srivastava. MaNtLE: Model-agnostic natural language explainer.arXiv preprint arXiv:2305.12995, 2023

  44. [44]

    Language models can explain neurons in language models, 2023

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023

  45. [45]

    Explaining black box text modules in natural language with language models.arXiv preprint arXiv:2305.09863, 2023

    Chandan Singh, Aliyah R Hsu, Richard Antonello, Shailee Jain, Alexander G Huth, Bin Yu, and Jianfeng Gao. Explaining black box text modules in natural language with language models.arXiv preprint arXiv:2305.09863, 2023

  46. [46]

    Sage: An agentic explainer framework for interpreting sae features in language models

    Jiaojiao Han, Wujiang Xu, Mingyu Jin, and Mengnan Du. Sage: An agentic explainer framework for interpreting sae features in language models. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pages 483–495, 2026

  47. [47]

    A multimodal automated interpretability agent

    Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. A multimodal automated interpretability agent. InForty-first International Conference on Machine Learning, 2024

  48. [48]

    Agent laboratory: Using llm agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants. Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  49. [49]

    The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

    Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

  50. [50]

    Large language models for automated open-domain scientific hypotheses discovery

    Zonglin Yang, Xinya Du, Junxian Li, Jie Zheng, Soujanya Poria, and Erik Cambria. Large language models for automated open-domain scientific hypotheses discovery. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13545–13565, 2024

  51. [51]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  52. [52]

    Mathematical discoveries from program search with large language models.Nature, pages 1–3, 2023

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models.Nature, pages 1–3, 2023

  53. [53]

Explaining Patterns in Data with Language Models via Interpretable Autoprompting

    Chandan Singh, John X. Morris, Jyoti Aneja, Alexander M. Rush, and Jianfeng Gao. Explaining patterns in data with language models via interpretable autoprompting, 2023

  54. [54]

ThetaEvolve: Test-Time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

  55. [55]

    Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026

    Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026

  56. [56]

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026

  57. [57]

    Memento-skills: Let agents design agents

    Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents.arXiv preprint arXiv:2603.18743, 2026

  58. [58]

    Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

  59. [59]

    SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. Skill- foundry: Building self-evolving agent skill libraries from heterogeneous scientific resources.arXiv preprint arXiv:2604.03964, 2026

  60. [60]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

  61. [61]

    HARBOR: Automated Harness Optimization

    Biswa Sengupta and Jinhua Wang. Harbor: Automated harness optimization.arXiv preprint arXiv:2604.20938, 2026

  62. [62]

    Dynamic cheatsheet: Test-time learning with adaptive memory

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, 2026

  63. [63]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory.arXiv preprint arXiv:2511.20857, 2025

  64. [64]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025

  65. [65]

    Test-time recursive thinking: Self-improvement without external feedback.arXiv preprint arXiv:2602.03094, 2026

    Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, and Weizhu Chen. Test-time recursive thinking: Self-improvement without external feedback.arXiv preprint arXiv:2602.03094, 2026

  66. [66]

Scikit-learn: Machine Learning in Python

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011

  67. [67]

    Tabarena: A living benchmark for machine learning on tabular data.arXiv preprint arXiv:2506.16791, 2025

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.arXiv preprint arXiv:2506.16791, 2025

  68. [68]

    Pmlb: a large benchmark suite for machine learning evaluation and comparison.BioData mining, 10(1):36, 2017

    Randal S Olson, William La Cava, Patryk Orzechowski, Ryan J Urbanowicz, and Jason H Moore. Pmlb: a large benchmark suite for machine learning evaluation and comparison.BioData mining, 10(1):36, 2017

  69. [69]

    Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022

  70. [70]

Hierarchical Shrinkage: Improving the Accuracy and Interpretability of Tree-Based Methods

Abhineet Agarwal, Yan Shuo Tan, Omer Ronen, Chandan Singh, and Bin Yu. Hierarchical shrinkage: improving the accuracy and interpretability of tree-based methods. arXiv preprint arXiv:2202.00858, 2022

  71. [71]

Generalized Additive Models

Trevor J Hastie. Generalized additive models. Routledge, 2017

  72. [72]

Fast Interpretable Greedy-Tree Sums (FIGS)

Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, and Bin Yu. Fast interpretable greedy-tree sums (FIGS). arXiv preprint arXiv:2201.11931, 2022

  73. [73]

Predictive Learning via Rule Ensembles

J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008

  74. [74]

    Random forests.Machine Learning, 45(1):5–32, 10 2001

    Leo Breiman. Random forests.Machine Learning, 45(1):5–32, 10 2001

  75. [75]

    Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

  76. [76]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  77. [77]

    Mortgage lending in boston: interpreting hmda data.The American economic review, 86(1):25 – 53, 1996

Alicia H Munnell, Geoffrey M.B Tootell, Lynn E Browne, and James McEneaney. Mortgage lending in Boston: interpreting HMDA data. The American Economic Review, 86(1):25–53, 1996

  78. [78]

Interaction Location Outweighs the Competitive Advantage of Numerical Superiority in Cebus capucinus Intergroup Contests

Margaret C. Crofoot, Ian C. Gilby, Martin C. Wikelski, and Roland W. Kays. Interaction location outweighs the competitive advantage of numerical superiority in Cebus capucinus intergroup contests. Proceedings of the National Academy of Sciences, 105(2):577–581, 2008

  79. [79]

    Flipping the dialogue: Training and evaluating user language models.arXiv preprint arXiv:2510.06552, 2025

    Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models.arXiv preprint arXiv:2510.06552, 2025

  80. [80]

    Humanlm: Simulating users with state alignment beats response imitation.arXiv preprint arXiv:2603.03303, 2026

    Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weera- sooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. Humanlm: Simulating users with state alignment beats response imitation.arXiv preprint arXiv:2603.03303, 2026

Showing first 80 references.