citation dossier

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, and 1 others · 2024 · arXiv 2410.07095

20Pith papers citing it

20reference links

cs.AItop field · 10 papers

UNVERDICTEDtop verdict bucket · 18 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 20 reviewed papers. Its strongest current cluster is cs.AI (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

AcademiClaw: When Students Set Challenges for AI Agents

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

On Benchmark Hacking in ML Contests: Modeling, Insights and Design

econ.GN · 2026-04-24 · unverdicted · novelty 6.0

In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.

Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

DataMaster: Data-Centric Autonomous AI Research

cs.LG · 2026-05-11

citing papers explorer

Showing 20 of 20 citing papers.

TeamBench: Evaluating Agent Coordination under Enforced Role Separation cs.AI · 2026-05-08 · unverdicted · none · ref 19
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
AcademiClaw: When Students Set Challenges for AI Agents cs.AI · 2026-05-04 · unverdicted · none · ref 2
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences? cs.AI · 2026-04-12 · unverdicted · none · ref 8
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 4
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces cs.CR · 2026-05-12 · unverdicted · none · ref 50
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 142
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 46
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
On Benchmark Hacking in ML Contests: Modeling, Insights and Design econ.GN · 2026-04-24 · unverdicted · none · ref 9
In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 23
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration cs.AI · 2026-04-15 · unverdicted · none · ref 2
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization cs.AI · 2026-04-14 · unverdicted · none · ref 1
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 11
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 9
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation cs.CL · 2026-04-06 · unverdicted · none · ref 4
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents cs.LG · 2026-05-07 · unverdicted · none · ref 6
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale cs.AI · 2026-04-19 · unverdicted · none · ref 2
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks cs.AI · 2026-04-13 · unverdicted · none · ref 3
Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 69
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 11
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
DataMaster: Data-Centric Autonomous AI Research cs.LG · 2026-05-11 · unreviewed · ref 8

Mle-bench: Evaluating machine learning agents on machine learning engineering

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer