CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
hub
Weld and Doug Downey and Wen
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
A framework for nuanced, time-aware research impact summarization using fine-grained temporal citation intents shows moderate to strong correlation with human judgments on insightfulness.
PaperPilot induces executable DAG workflows for multi-turn literature search and trains via imitation plus preference optimization, raising Hit@5 from 58.0 to 77.0 over a baseline agent.
A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
SPIRE is a multi-agent framework drawing on scholarly primitives to perform evidence-grounded humanities scholarship, outperforming Naive LLM, Text RAG, and GraphRAG on a benchmark of classical Chinese and Greco-Roman Latin papers.
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.
BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Attribution gradients consolidate citation evidence and enable incremental unfolding of secondary sources, leading to deeper engagement in a lab study of critical reading tasks for AI answers.
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
Eliot is a query-time clustering and temporal visualization system for arXiv literature, evaluated via offline metrics on eight domains and a user survey showing 85% meaningful cluster labels.
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
citing papers explorer
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
-
FARS: A Fully Automated Research System Deployed at Scale
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
-
Can AI Agents Synthesize Scientific Conclusions?
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
-
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
-
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.
-
Natural Language Query to Configuration for Retrieval Agents
BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
-
TusoAI: Agentic Optimization for Scientific Methods
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.
-
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.