CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
hub
Weld and Doug Downey and Wen
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
A framework for nuanced, time-aware research impact summarization using fine-grained temporal citation intents shows moderate to strong correlation with human judgments on insightfulness.
SPIRE is a multi-agent framework drawing on scholarly primitives to perform evidence-grounded humanities scholarship, outperforming Naive LLM, Text RAG, and GraphRAG on a benchmark of classical Chinese and Greco-Roman Latin papers.
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Attribution gradients consolidate citation evidence and enable incremental unfolding of secondary sources, leading to deeper engagement in a lab study of critical reading tasks for AI answers.
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
citing papers explorer
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
-
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
-
In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis
A framework for nuanced, time-aware research impact summarization using fine-grained temporal citation intents shows moderate to strong correlation with human judgments on insightfulness.
-
Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship
SPIRE is a multi-agent framework drawing on scholarly primitives to perform evidence-grounded humanities scholarship, outperforming Naive LLM, Text RAG, and GraphRAG on a benchmark of classical Chinese and Greco-Roman Latin papers.
-
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
DecomposeR represents research plans as typed DAGs and uses two-stage planner-then-answerer RL to improve long-form research performance by 5.1-8.0 points over baselines.
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
Stress Testing Factual Consistency Metrics for Long-Document Summarization
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
-
Attribution Gradients: Incrementally Unfolding Citations for Critical Examination of Attributed AI Answers
Attribution gradients consolidate citation evidence and enable incremental unfolding of secondary sources, leading to deeper engagement in a lab study of critical reading tasks for AI answers.
-
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
-
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
-
TusoAI: Agentic Optimization for Scientific Methods
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.
-
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
- Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution