
arXiv: 2502.18864 · v1 · submitted 2025-02-26 · 💻 cs.AI · cs.CL · cs.HC · cs.LG · physics.soc-ph · q-bio.OT

Towards an AI co-scientist

Pith reviewed 2026-05-11 12:54 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.HC · cs.LG · physics.soc-ph · q-bio.OT
keywords AI co-scientist · hypothesis generation · multi-agent system · drug repurposing · target discovery · bacterial evolution · scientific discovery · Gemini 2.0

The pith

A multi-agent AI system on Gemini 2.0 generates biomedical hypotheses that receive experimental validation in leukemia, fibrosis, and bacterial evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an AI co-scientist, a multi-agent system built on Gemini 2.0, that generates novel research hypotheses through a generate-debate-evolve process aligned to scientist goals. The system includes an asynchronous task framework for scaling compute and a tournament evolution method to refine ideas. It demonstrates results in drug repurposing for acute myeloid leukemia with candidates that inhibit tumors in vitro at clinically relevant concentrations, in novel target discovery with epigenetic regulators for liver fibrosis that show anti-fibrotic effects and cell regeneration in human organoids, and in uncovering a gene transfer mechanism in bacterial evolution that matches unpublished experimental data. Automated evaluations indicate that additional test-time compute improves hypothesis quality. A sympathetic reader would see this as a way to accelerate the hypothesis generation step in science by producing starting points that survive initial lab checks.

Core claim

The AI co-scientist employs a multi-agent architecture with asynchronous execution and a tournament evolution process to produce self-improving hypotheses that lead to concrete experimental findings: drug candidates for acute myeloid leukemia showing tumor inhibition, new epigenetic targets for liver fibrosis validated in organoids, and a parallel in silico discovery of a bacterial gene transfer mechanism that aligns with separate unpublished results.

What carries the argument

The generate-debate-evolve cycle in a multi-agent system with asynchronous task execution and tournament-based hypothesis refinement, scaled by test-time compute.
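A minimal sketch of how such a generate-debate-evolve loop with tournament ranking might be organised is given here. It is illustrative only and not the paper's implementation: the Elo-style rating, the `quality` field standing in for hidden merit, the stub judge, and every function name are assumptions made for this example. In a real system each stub would be an LLM agent call.

```python
import random
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    quality: float          # hypothetical hidden merit in [0, 1]; a real system has no such field
    rating: float = 1200.0  # Elo-style rating (an assumed scoring scheme)


def generate(rng, n):
    """Generate step: stub for agents proposing candidate hypotheses."""
    return [Hypothesis(f"h{i}", rng.random()) for i in range(n)]


def debate(a, b, rng):
    """Debate step: noisy stub judge that favours the higher-quality hypothesis."""
    p_a = 0.5 + 0.5 * (a.quality - b.quality)  # stays within [0, 1]
    return a if rng.random() < p_a else b


def update_ratings(winner, loser, k=32.0):
    """Standard Elo update applied to one pairwise comparison."""
    expected = 1.0 / (1.0 + 10 ** ((loser.rating - winner.rating) / 400.0))
    delta = k * (1.0 - expected)
    winner.rating += delta
    loser.rating -= delta


def evolve(parent, rng, round_idx):
    """Evolve step: stub refinement that perturbs the parent's hidden quality."""
    q = min(1.0, max(0.0, parent.quality + rng.gauss(0.0, 0.1)))
    return Hypothesis(f"{parent.text}+v{round_idx}", q)


def run(rounds, pool_size=8, matches_per_round=20, seed=0):
    """Run the loop; pool_size is assumed even so the pool size stays constant."""
    rng = random.Random(seed)
    pool = generate(rng, pool_size)
    for r in range(rounds):
        # Tournament: random pairwise debates update the ratings.
        for _ in range(matches_per_round):
            a, b = rng.sample(pool, 2)
            winner = debate(a, b, rng)
            loser = b if winner is a else a
            update_ratings(winner, loser)
        # Selection: keep the top-rated half, then refine each survivor once.
        pool.sort(key=lambda h: h.rating, reverse=True)
        survivors = pool[: pool_size // 2]
        pool = survivors + [evolve(p, rng, r) for p in survivors]
    return sorted(pool, key=lambda h: h.rating, reverse=True)
```

In this toy setting, `rounds` and `matches_per_round` play the role of test-time compute: raising them spends more comparisons per hypothesis before selection. Whether that improves the final pool in a real system is an empirical question that the sketch does not answer.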

If this is right

  • Increasing test-time compute produces measurable gains in automated hypothesis quality evaluations.
  • Drug repurposing proposals for acute myeloid leukemia include compounds that inhibit tumor growth in vitro at clinically applicable levels.
  • Epigenetic targets for liver fibrosis demonstrate anti-fibrotic activity and support liver cell regeneration in human hepatic organoids.
  • The system identifies a gene transfer mechanism in bacterial evolution that recapitulates results from separate unpublished experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the process reliably avoids simple recombination of known information, the same multi-agent structure could be applied to hypothesis generation in non-biomedical domains by changing only the input objectives.
  • Treating AI proposals and their lab validations as separate reports creates a workflow that could reduce confirmation bias in future uses.
  • The emphasis on scaling test-time compute suggests that future versions might trade additional inference steps for higher rates of usable ideas without retraining the base model.

Load-bearing premise

The generated hypotheses are genuinely novel rather than recombinations of training data, and the experimental validations come from independent studies not influenced by the AI outputs.

What would settle it

Independent lab tests on the proposed acute myeloid leukemia candidates that fail to show tumor inhibition at the stated concentrations would falsify the claim of promising validation findings.

Original abstract

Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an AI co-scientist, a multi-agent system built on Gemini 2.0 that uses a generate-debate-evolve process (inspired by the scientific method and scaled via test-time compute) to generate hypotheses aligned with scientist-provided objectives. It focuses on three biomedical areas—drug repurposing, novel target discovery, and bacterial evolution mechanisms—reporting that the system proposed AML drug candidates showing in vitro tumor inhibition, epigenetic targets for liver fibrosis with anti-fibrotic effects in organoids, and a novel bacterial gene-transfer mechanism that recapitulates unpublished results, with all experimental validations deferred to separate co-timed reports.

Significance. If the reported validations hold and demonstrate genuine novelty independent of training-data recombination or post-hoc alignment, the work would represent a meaningful step toward AI-augmented hypothesis generation with practical biomedical impact. The asynchronous multi-agent architecture and tournament evolution for self-improving hypotheses are concrete engineering contributions that could be adopted more broadly; the emphasis on test-time compute scaling is a timely strength.

major comments (2)
  1. [Abstract] Abstract: The central claim that the AI co-scientist 'uncovers new, original knowledge' and produces 'demonstrably novel research hypotheses' rests entirely on three headline validation outcomes (AML repurposing, liver-fibrosis targets, bacterial gene transfer) whose supporting data, methods, error bars, exclusion criteria, and independence from the AI outputs are deferred to 'separate, co-timed reports' with no quantitative summaries or blinding statements supplied in this manuscript. This prevents evaluation of whether the generate-debate-evolve loop contributed original insights or merely retrieved/recombined existing knowledge.
  2. [Abstract] Abstract and system-description sections: No evidence is presented that the final hypotheses differ from content already latent in the Gemini 2.0 training distribution or from the scientist-provided research objectives; the manuscript supplies neither pre-2024 literature comparisons, model-cutoff checks, nor logs of the tournament evolution that would allow assessment of whether the process avoids circular recombination.
minor comments (1)
  1. The description of the asynchronous task-execution framework would benefit from a diagram or pseudocode showing how agents interact across the tournament rounds.
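As an illustration of the kind of pseudocode this comment asks for, the following is a minimal sketch of an asynchronous task queue in which a supervisor hands agent tasks to a pool of workers. It is not taken from the paper: the agent labels (`generate`, `review`, `rank`), the function names, and the queue-based design are all assumptions, and `agent_task` is a stub where a real system would await a model call.

```python
import asyncio


async def agent_task(name, payload):
    """Stub agent call; a real agent would await an LLM request here."""
    await asyncio.sleep(0)  # yield to the event loop, standing in for I/O latency
    return f"{name}:{payload}"


async def worker(queue, results):
    """Pull (agent, payload) tasks off the shared queue until cancelled."""
    while True:
        name, payload = await queue.get()
        try:
            results.append(await agent_task(name, payload))
        finally:
            queue.task_done()  # lets queue.join() know this task is finished


async def supervisor(tasks, n_workers=3):
    """Enqueue all tasks, run n_workers concurrently, and collect the results."""
    queue = asyncio.Queue()
    results = []
    for task in tasks:
        queue.put_nowait(task)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await queue.join()  # wait until every queued task has been processed
    for w in workers:
        w.cancel()      # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def run_tasks(tasks, n_workers=3):
    """Synchronous entry point for the sketch."""
    return asyncio.run(supervisor(tasks, n_workers))
```

Under this design, `n_workers` is the only knob for scaling compute: adding workers raises concurrency without changing the task definitions. Results arrive in completion order, so callers that need a stable order should sort or tag them.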

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript describing the AI co-scientist. We address the major concerns regarding the abstract claims and evidence for novelty below. We are prepared to make revisions to clarify the scope and add supporting details where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the AI co-scientist 'uncovers new, original knowledge' and produces 'demonstrably novel research hypotheses' rests entirely on three headline validation outcomes (AML repurposing, liver-fibrosis targets, bacterial gene transfer) whose supporting data, methods, error bars, exclusion criteria, and independence from the AI outputs are deferred to 'separate, co-timed reports' with no quantitative summaries or blinding statements supplied in this manuscript. This prevents evaluation of whether the generate-debate-evolve loop contributed original insights or merely retrieved/recombined existing knowledge.

    Authors: The primary focus of this manuscript is the presentation of the AI co-scientist architecture, including the asynchronous multi-agent framework and tournament evolution mechanism. The three biomedical applications are included to demonstrate the system's practical utility in generating hypotheses that align with scientist objectives and lead to experimental outcomes. As noted in the manuscript, the detailed experimental methods, quantitative results, error bars, exclusion criteria, and blinding information are provided in the separate co-timed reports to allow for comprehensive presentation in those venues. This manuscript supplies high-level descriptions of the outcomes. We will revise the abstract to more precisely indicate that the validations are detailed in accompanying reports and to qualify the claims of novelty accordingly, while retaining the description of the system's design. We maintain that the multi-agent debate and evolution process facilitates the generation of hypotheses that extend beyond direct retrieval, as evidenced by the system's ability to propose candidates leading to the reported validations.

    Revision: partial

  2. Referee: [Abstract] Abstract and system-description sections: No evidence is presented that the final hypotheses differ from content already latent in the Gemini 2.0 training distribution or from the scientist-provided research objectives; the manuscript supplies neither pre-2024 literature comparisons, model-cutoff checks, nor logs of the tournament evolution that would allow assessment of whether the process avoids circular recombination.

    Authors: We recognize the value of providing explicit evidence for the originality of the generated hypotheses. The manuscript emphasizes the generate-debate-evolve process, where hypotheses are iteratively refined through agent-based critique and tournament selection, which is intended to promote novelty beyond the initial objectives or training data recombination. For the bacterial evolution case, the AI co-scientist independently identified a gene transfer mechanism that matched unpublished experimental findings, providing a strong indicator of originality. To address this comment, we will expand the system-description sections to include illustrative logs or step-by-step examples from the tournament evolution process for at least one case, showing how hypotheses were debated, critiqued, and evolved. This addition will allow readers to evaluate the contribution of the process. Pre-2024 literature comparisons and model-cutoff checks are not included due to the proprietary nature of the model and space constraints, but the use of scientist-provided objectives combined with the evolution mechanism helps ensure alignment and extension rather than circularity.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive system architecture with external validations

Full rationale

The paper describes a multi-agent generate-debate-evolve system on Gemini 2.0 without any equations, fitted parameters, or first-principles derivations. All reported biomedical outcomes are explicitly deferred to separate co-timed experimental reports rather than derived internally. No self-definitional loops, load-bearing self-citations, or renaming of known results appear in the architecture or claims; the system is presented as aligned to external scientist objectives and prior evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new system without mathematical derivations, fitted parameters, or formal axioms; claims rest on the empirical behavior of the multi-agent architecture and external experimental validations reported separately.

axioms (1)
  • domain assumption: Large language models possess sufficient world knowledge and reasoning to generate useful novel scientific hypotheses when organized in a multi-agent debate-and-evolve loop.
    Invoked throughout the system design and evaluation claims.
invented entities (1)
  • AI co-scientist (no independent evidence)
    purpose: Multi-agent system for generating and evolving novel research hypotheses aligned to scientist objectives.
    The central proposed artifact; no independent falsifiable handle outside the system description is provided in the abstract.

pith-pipeline@v0.9.0 · 5760 in / 1435 out tokens · 117151 ms · 2026-05-11T12:54:18.188349+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

    cs.AI 2026-05 unverdicted novelty 8.0

    Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experimen...

  2. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  3. FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

    physics.chem-ph 2026-04 conditional novelty 8.0

    FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...

  4. Evaluating Large Language Models in Scientific Discovery

    cs.AI 2025-12 unverdicted novelty 8.0

    The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

  5. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  6. Forecasting Scientific Progress with Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and in...

  7. An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies

    cs.SI 2026-05 unverdicted novelty 7.0

    Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.

  8. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  9. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...

  10. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.

  11. End-to-end autonomous scientific discovery on a real optical platform

    cs.AI 2026-04 unverdicted novelty 7.0

    An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.

  12. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  13. RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

    cs.LG 2026-04 unverdicted novelty 7.0

    RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...

  14. VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

  15. AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

    cs.CL 2026-04 unverdicted novelty 7.0

    AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.

  16. Kosmos: An AI Scientist for Autonomous Discovery

    cs.AI 2025-11 unverdicted novelty 7.0

    Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79....

  17. AlphaEvolve: A coding agent for scientific and algorithmic discovery

    cs.AI 2025-06 unverdicted novelty 7.0

    AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...

  18. Code Researcher: Deep Research Agent for Large Systems Code and Commit History

    cs.SE 2025-05 unverdicted novelty 7.0

    Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

  19. Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

    cs.HC 2024-09 unverdicted novelty 7.0

    Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity suppor...

  20. Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

  21. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

    cs.AI 2026-05 unverdicted novelty 6.0

    AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7...

  22. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

    cs.AI 2026-05 unverdicted novelty 6.0

    EngiAI is a multi-agent framework unifying topology optimization, retrieval, HPC orchestration, and manufacturing control, with benchmarks showing proprietary LLMs at 96-97% task completion on Beams2D and lower perfor...

  23. MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

    cs.LG 2026-05 conditional novelty 6.0

    MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...

  24. LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

    cs.LG 2026-05 unverdicted novelty 6.0

    LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...

  25. Unlocking LLM Creativity in Science through Analogical Reasoning

    cs.AI 2026-05 conditional novelty 6.0

    Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

  26. GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new num...

  27. Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

    cs.LG 2026-05 unverdicted novelty 6.0

    SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator ...

  28. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  29. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  30. FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

    cs.LG 2026-05 unverdicted novelty 6.0

    FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

  31. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  32. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 unverdicted novelty 6.0

    An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...

  33. Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

    cs.AI 2026-05 unverdicted novelty 6.0

    Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.

  34. Hypothesis generation and updating in large language models

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

  35. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  36. AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...

  37. TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

    cs.CL 2026-04 unverdicted novelty 6.0

    TSAssistant is a human-in-the-loop multi-agent system that generates citable, evidence-grounded sections for target safety assessment reports by coordinating specialized subagents with interactive user refinement.

  38. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  39. PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

    cs.LG 2026-04 unverdicted novelty 6.0

    PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

  40. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

  41. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  42. AIRA_2: Overcoming Bottlenecks in AI Research Agents

    cs.AI 2026-03 conditional novelty 6.0

    AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...

  43. Bridging the Experimental Last Mile: Digitizing Laboratory Know-How for Safe AI-Assisted Support

    cs.HC 2026-03 conditional novelty 6.0

    A video-plus-RAG human-in-the-loop system digitizes site-specific laboratory know-how and supplies safe, grounded guidance for experiments such as powder X-ray diffraction.

  44. "When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

    cs.HC 2026-03 unverdicted novelty 6.0

    Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.

  45. The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

    cs.CY 2026-02 accept novelty 6.0

    The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.

  46. Glia: A Human-Inspired AI for Automated Systems Design and Optimization

    cs.AI 2025-10 unverdicted novelty 6.0

    Glia deploys a multi-agent LLM workflow with reasoning, experimentation, and analysis agents to generate interpretable algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters, reaching...

  47. Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

    cs.AI 2025-10 unverdicted novelty 6.0

    Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an exper...

  48. An AI system to help scientists write expert-level empirical software

    cs.AI 2025-09 unverdicted novelty 6.0

    ERA is an AI system using LLMs and tree search to produce expert-level empirical software, generating methods that outperformed top human approaches in single-cell data analysis and COVID-19 forecasting tasks.

  49. An AI system to help scientists write expert-level empirical software

    cs.AI 2025-09 unverdicted novelty 6.0

    ERA combines LLMs and tree search to produce expert-level empirical software that outperforms top human methods on single-cell analysis leaderboards and CDC COVID-19 forecasts.

  50. InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

    cs.CL 2025-08 unverdicted novelty 6.0

    InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL...

  51. General Agentic Planning Through Simulative Reasoning with World Models

    cs.AI 2025-07 conditional novelty 6.0

    SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.

  52. RExBench: Can coding agents autonomously implement AI research extensions?

    cs.CL 2025-06 unverdicted novelty 6.0

    RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

  53. XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

    cs.CL 2025-05 conditional novelty 6.0

    XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.

  54. Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

    cs.MA 2026-05 unverdicted novelty 5.0

    Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.

  55. Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    A multi-agent harness autonomously generates functional single-page VIS apps with linked views for scientific data tasks using coordinated skills for analysis, planning, implementation, and evaluation.

  56. Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

    cs.CR 2026-05 unverdicted novelty 5.0

    Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.

  57. CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

    cs.AI 2026-05 unverdicted novelty 5.0

    CVEvolve uses LLM agents with lineage-aware search to autonomously discover algorithms that outperform baselines on scientific image tasks including registration, peak detection, and segmentation.

  58. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  59. From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model

    hep-ex 2026-05 unverdicted novelty 5.0

    HEP-CoPilot is a new multi-agent retrieval framework that retrieves, reconstructs, and compares experimental limits from HEP literature and HEPData to support interpretation of beyond-Standard-Model searches.

  60. TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

    cs.CL 2026-04 unverdicted novelty 5.0

    TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources a...

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 76 Pith papers · 44 internal anchors
