Towards Execution-Grounded Automated AI Research

Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto · 2026 · arXiv 2601.14525

7 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

VESTA: Visual Exploration with Statistical Tool Agents

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

VESTA introduces dynamic tool creation for VLMs that outperforms static-tool and no-tool baselines on distribution fitting, time series, and astronomy tasks in the new DAWN benchmark.

Unlocking LLM Creativity in Science through Analogical Reasoning

cs.AI · 2026-05-11 · conditional · novelty 6.0

Analogical reasoning increases LLM solution diversity by 90-173% and raises the novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% baseline and generalizing to other datasets including state-of-the-art on MedMNIST.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

citing papers explorer


AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot · cs.AI · 2026-04-15 · conditional · none · ref 11
GIANTS: Generative Insight Anticipation from Scientific Literature · cs.CL · 2026-04-10 · unverdicted · none · ref 19
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI · cs.LG · 2026-05-09 · unverdicted · none · ref 87 · 2 links
VESTA: Visual Exploration with Statistical Tool Agents · cs.AI · 2026-05-29 · unverdicted · none · ref 40
Unlocking LLM Creativity in Science through Analogical Reasoning · cs.AI · 2026-05-11 · conditional · none · ref 38
Agentic Discovery with Active Hypothesis Exploration for Visual Recognition · cs.CV · 2026-04-14 · unverdicted · none · ref 47
AI for Auto-Research: Roadmap & User Guide · cs.AI · 2026-05-18 · unverdicted · none · ref 186
