Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu · 2025 · cs.HC · arXiv 2501.04227

Canonical reference. 80% of citing Pith papers cite this work as background.

33 Pith papers citing it

abstract

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
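
The three-stage flow with optional human feedback described in the abstract can be pictured with a minimal sketch. This is not the Agent Laboratory implementation: every name here (`ResearchState`, `run_pipeline`, `run_stage`, `get_feedback`, `fake_stage`) is hypothetical, and the stage logic is a stand-in where a real system would call LLM agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Stage names taken from the abstract; everything else is an illustrative assumption.
STAGES = ("literature review", "experimentation", "report writing")


@dataclass
class ResearchState:
    """Hypothetical container for the idea, per-stage outputs, and human notes."""
    idea: str
    outputs: dict = field(default_factory=dict)
    feedback: dict = field(default_factory=dict)


def run_pipeline(
    idea: str,
    run_stage: Callable[[str, ResearchState, Optional[str]], str],
    get_feedback: Optional[Callable[[str, ResearchState], Optional[str]]] = None,
) -> ResearchState:
    """Run the stages in order, asking for optional human guidance before each one."""
    state = ResearchState(idea=idea)
    for stage in STAGES:
        note = get_feedback(stage, state) if get_feedback else None
        if note:
            state.feedback[stage] = note
        state.outputs[stage] = run_stage(stage, state, note)
    return state


def fake_stage(stage: str, state: ResearchState, note: Optional[str]) -> str:
    """Stand-in for an LLM-driven stage; it only formats a string."""
    suffix = f" (guided by: {note})" if note else ""
    return f"{stage} for '{state.idea}'{suffix}"


result = run_pipeline(
    "toy idea",
    fake_stage,
    lambda stage, state: "keep it small" if stage == "experimentation" else None,
)
for stage, output in result.outputs.items():
    print(f"{stage}: {output}")
```

Passing `get_feedback=None` gives a fully autonomous run under the same assumptions, while supplying a callback models a human providing guidance at a stage.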

citation-role summary

background: 9 · method: 1

citation-polarity summary

background: 8 · unclear: 1 · use method: 1

representative citing papers

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0 · 3 refs

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review

cs.IR · 2026-04-23 · unverdicted · novelty 7.0

IntrAgent uses a two-stage pipeline of section ranking and iterative reading to perform content-grounded literature information retrieval, achieving 13.2% higher accuracy than RAG and agent baselines on the new IntraBench benchmark.

GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

q-bio.QM · 2025-10-14 · unverdicted · novelty 7.0

GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

Proposes agentic framework-based reproduction with a slot-binding interface to turn 16 PHM papers into standardized, assumption-aware benchmark implementations.

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

STRIDE is a self-reflective agent framework that improves accuracy, OOD robustness, and structural recovery in LLM-based symbolic regression by integrating generation, evaluation, repair, and diversity-preserving memory.

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference acceptance rates.

CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

q-bio.NC · 2026-04-30 · unverdicted · novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study

cs.CY · 2026-04-25 · unverdicted · novelty 6.0

A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.

Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations

cs.NI · 2026-03-20 · unverdicted · novelty 6.0

AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.

PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models

cs.CR · 2026-02-09 · unverdicted · novelty 6.0

PRISM-XR adds edge-based sensitive-data filtering and quick registration to MLLM-driven XR collaboration, reporting 90% request accuracy, sub-0.3s registration, and over 90% sensitive-object filtering in a 28-person study.

Co-Constructing Alignment: A Participatory Approach to Situate AI Values

cs.HC · 2026-01-22 · unverdicted · novelty 6.0

Misalignments appear in practice as unexpected responses and task breakdowns, with users proposing roles such as adjusting model output, interpreting behavior, or deliberate non-use to co-construct alignment.

RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

cs.CL · 2026-01-14 · conditional · novelty 6.0

RPC-Bench supplies 15K verified QA pairs and a research-flow taxonomy that shows top foundation models still achieve only 68.2 percent correctness-completeness on academic paper comprehension.

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

cs.AI · 2025-11-30 · conditional · novelty 6.0

CodeDistiller distills 250 materials-science GitHub repositories into vetted code libraries that improve the accuracy and scientific soundness of experiments generated by ASD agents.

Video models are zero-shot learners and reasoners

cs.LG · 2025-09-24 · unverdicted · novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

cs.AI · 2025-07-28 · unverdicted · novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

cs.CL · 2025-06-23 · unverdicted · novelty 6.0

LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

cs.LG · 2025-06-11 · unverdicted · novelty 6.0

Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

A multi-agent system combining contextual bandits, LLM agents, and semantic checkpoints improves convergence and robustness in adaptive method selection for sensitivity analysis and uncertainty quantification.

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

cs.MA · 2026-05-21 · unverdicted · novelty 5.0

Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

cs.AI · 2026-05-20 · unverdicted · novelty 5.0

AiraXiv is a proposed AI-driven platform for open preprints that supports human and AI authors with interactive UI and MCP-based interactions, validated by serving as the submission system for ICAIS 2025.

Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

cs.IR · 2026-05-11 · conditional · novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

citing papers explorer

Showing 33 of 33 citing papers.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations physics.chem-ph · 2026-04-03 · conditional · none · ref 14 · internal anchor
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 26 · 3 links · internal anchor
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review cs.IR · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
IntrAgent uses a two-stage pipeline of section ranking and iterative reading to perform content-grounded literature information retrieval, achieving 13.2% higher accuracy than RAG and agent baselines on the new IntraBench benchmark.
GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents q-bio.QM · 2025-10-14 · unverdicted · none · ref 38 · internal anchor
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement cs.CL · 2026-06-10 · unverdicted · none · ref 149 · internal anchor
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence cs.AI · 2026-05-27 · unverdicted · none · ref 27 · internal anchor
Proposes agentic framework-based reproduction with a slot-binding interface to turn 16 PHM papers into standardized, assumption-aware benchmark implementations.
STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery cs.AI · 2026-05-18 · unverdicted · none · ref 48 · internal anchor
STRIDE is a self-reflective agent framework that improves accuracy, OOD robustness, and structural recovery in LLM-based symbolic regression by integrating generation, evaluation, repair, and diversity-preserving memory.
Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents cs.CL · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference acceptance rates.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness q-bio.NC · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study cs.CY · 2026-04-25 · unverdicted · none · ref 62 · internal anchor
A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.
Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations cs.NI · 2026-03-20 · unverdicted · none · ref 12 · internal anchor
AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.
PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models cs.CR · 2026-02-09 · unverdicted · none · ref 49 · internal anchor
PRISM-XR adds edge-based sensitive-data filtering and quick registration to MLLM-driven XR collaboration, reporting 90% request accuracy, sub-0.3s registration, and over 90% sensitive-object filtering in a 28-person study.
Co-Constructing Alignment: A Participatory Approach to Situate AI Values cs.HC · 2026-01-22 · unverdicted · none · ref 54 · internal anchor
Misalignments appear in practice as unexpected responses and task breakdowns, with users proposing roles such as adjusting model output, interpreting behavior, or deliberate non-use to co-construct alignment.
RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension cs.CL · 2026-01-14 · conditional · none · ref 4 · internal anchor
RPC-Bench supplies 15K verified QA pairs and a research-flow taxonomy that shows top foundation models still achieve only 68.2 percent correctness-completeness on academic paper comprehension.
CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents cs.AI · 2025-11-30 · conditional · none · ref 17 · internal anchor
CodeDistiller distills 250 materials-science GitHub repositories into vetted code libraries that improve the accuracy and scientific soundness of experiments generated by ASD agents.
Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 6 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 98 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning cs.CL · 2025-06-23 · unverdicted · none · ref 29 · internal anchor
LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems cs.LG · 2025-06-11 · unverdicted · none · ref 53 · internal anchor
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 293 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection cs.AI · 2026-05-28 · unverdicted · none · ref 34 · internal anchor
A multi-agent system combining contextual bandits, LLM agents, and semantic checkpoints improves convergence and robustness in adaptive method selection for sensitivity analysis and uncertainty quantification.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 34 · internal anchor
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists cs.AI · 2026-05-20 · unverdicted · none · ref 19 · internal anchor
AiraXiv is a proposed AI-driven platform for open preprints that supports human and AI authors with interactive UI and MCP-based interactions, validated by serving as the submission system for ICAIS 2025.
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery cs.IR · 2026-05-11 · conditional · none · ref 34 · internal anchor
PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research cs.AI · 2026-05-20 · unverdicted · none · ref 7 · internal anchor
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator cs.DL · 2025-07-16 · unverdicted · none · ref 153 · internal anchor
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
URSA: The Universal Research and Scientific Agent cs.AI · 2025-06-27 · unverdicted · none · ref 18 · internal anchor
URSA is a modular agent ecosystem that uses LLMs and scientific tools to accelerate research tasks of varying complexity.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 18 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research cs.AI · 2026-05-27 · unverdicted · none · ref 21 · internal anchor
ResearchLoop defines a protocol and state model for evidence-gated AI-assisted computational research and reports experiments across nine versions including self-hosting and task ablations.
WisPaper: Your AI Scholar Search Engine cs.IR · 2025-12-07 · unverdicted · none · ref 17 · internal anchor
WisPaper integrates semantic search with agent-based validation, library organization, and personalized AI feeds into a closed-loop system that improves academic paper discovery and long-term awareness.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration cs.AI · 2026-05-19 · unreviewed · ref 14 · internal anchor
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning cs.AI · 2026-05-02 · unreviewed · ref 31 · internal anchor
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis cs.CL · 2026-04-27 · unreviewed · ref 47 · internal anchor
