AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
hub Canonical reference
CoRR , volume =
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.
Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.
ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.
FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.
Agentic-Ideation uses oracle-guided multi-agent synthesis to generate efficient training trajectories for scientific ideation agents, reporting 11.91% quality gains and over 10x sample efficiency versus workflow baselines.
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
Graph2Idea builds dynamic knowledge graphs from retrieved literature to supply compact, relational contexts that guide LLMs in generating novel, feasible, and high-quality scientific ideas, outperforming flat-text baselines on automatic metrics.
Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.
AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.
ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.
A multi-agent harness autonomously generates functional single-page VIS apps with linked views for scientific data tasks using coordinated skills for analysis, planning, implementation, and evaluation.
AiraXiv is a proposed AI-driven platform for open preprints that supports human and AI authors with interactive UI and MCP-based interactions, validated by serving as the submission system for ICAIS 2025.
citing papers explorer
-
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
-
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
-
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation
EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.