{"total":25,"items":[{"citing_arxiv_id":"2607.01588","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OrchestrXR: A Multi-Agent System for Idea-to-Prototype XR Study Authoring","primary_cat":"cs.HC","submitted_at":"2026-07-02T01:40:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OrchestrXR uses multi-agent orchestration with structured schemas to generate Unity XR study prototypes from ideas, supported by a user study with 12 researchers indicating effective support and intent preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19135","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Technical Taxonomy of LLM Agent Communication Protocols","primary_cat":"cs.MA","submitted_at":"2026-06-17T14:45:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Creates a five-dimension taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) from nine protocols and identifies architectural patterns plus convergence trends.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18356","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-06-16T18:04:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM 
agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07936","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation","primary_cat":"cs.CL","submitted_at":"2026-06-06T01:55:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00644","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment","primary_cat":"cs.AI","submitted_at":"2026-05-30T09:41:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ForeSci is a temporally controlled benchmark with 500 tasks for assessing LLM agents on forward-looking AI research judgments in four domains using cutoff-aligned knowledge bases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29966","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent","primary_cat":"cs.AI","submitted_at":"2026-05-28T14:06:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compass is an expert-guided LLM agent framework that extracts 3,751 marine Pb records from 230k papers to build the largest integrated 
database, achieving 92% accuracy via multi-layered validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27610","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Eliot: Interactively $\\underline{E}$xploring Fast-Changing Scientific $\\underline{Li}$terature Trends with $\\underline{O}$nline Da$\\underline{t}$a and Learning","primary_cat":"cs.IR","submitted_at":"2026-05-26T19:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Eliot is a query-time clustering and temporal visualization system for arXiv literature, evaluated via offline metrics on eight domains and a user survey showing 85% meaningful cluster labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18661","ref_index":128,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Citation- and editor-aware systems close the loop between synthesis and the writing environment. SurveyG [137] constructs a three-layer citation graph (Foundation/Development/Frontier) with hierarchical traversal, Citegeist [11] builds a dynamic RAG pipeline on the arXiv corpus, and CiteLLM [65] embeds hallucination-free reference discovery directly inside a LaTeX editor. 
Open-source systems such as GPT Researcher [38], PaperQA [128], and ChatPaper [124] further illustrate the growing practical adoption of literature synthesis tools beyond controlled research prototypes. However, citation fidelity remains a bottleneck: ScholarCopilot [215] reports only 40.1% top-1 citation accuracy, suggesting that generating plausible related-work text is still easier than grounding each claim in the correct source."},{"citing_arxiv_id":"2605.18144","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:50:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"pArticleMap combines article embeddings, graph-based frontier extraction, and agentic LLMs to map nanomedicine literature and generate hypotheses, achieving 10.8% gold recovery and 61% future-neighborhood rate in retrospective benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10813","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:33:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"searchers in specific subtasks rather than replacing them. 
Even before the LLM era [30, 3, 10, 13], prior work had explored using AI to support scientific research [17, 7, 12, 5, 8], and recent studies further leverage foundation models [29, 4] to enhance assistance at individual research stages [26, 28]. Some efforts focus on literature understanding, like PaperQA [23], which answers scientific questions by retrieving and reasoning over relevant papers. Another line targets novel idea generation, with Nova [11] retrieving external knowledge to enhance novelty and ResearchAgent [2] augmenting LLMs with an entity-centric knowledge store and iterative reviewing agents. Moving from ideation to reproduction, AutoP2C [19] converts papers into code via a multi-agent pipeline, while ResearchCodeAgent [9]"},{"citing_arxiv_id":"2605.09012","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics","primary_cat":"cs.AI","submitted_at":"2026-05-09T15:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07208","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution","primary_cat":"cs.LG","submitted_at":"2026-05-08T03:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a 
dynamic latent space.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dynamics to accurately capture the delayed impact trajectories of highly innovative research, often termed sleeping beauties. 7 Related Works LLM-Driven Scientific Discovery and the Evaluation Bottleneck. Automated scientific discovery is shifting from domain-specific models [5, 6] to generalist AI Scientist agents driven by iterative generation-evaluation loops [19, 28, 35, 4, 12]. While LLMs effectively brainstorm [25], navigate literature [14, 29], and write code [34, 37], their evaluation modules-reliant on the LLM-as-a-judge paradigm [40, 7]-remain a critical bottleneck. Despite their fluency, LLMs act as myopic evaluators [18, 33]; even models fine-tuned for academic comparisons, like SciJudge [30], fail to reliably forecast impact because they ignore the macroscopic evolutionary trajectories of research fields."},{"citing_arxiv_id":"2605.07022","ref_index":39,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale","primary_cat":"cs.LG","submitted_at":"2026-05-07T23:08:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tir, Ellen Thomas, Richard H. Scott, Emma Baple, Arianna Tucci, Helen Brittain, Anna de Burca, Kristina Ibañez, Dalia Kasperaviciute, Damian Smedley, Mark Caulfield, Augusto Rendon, and Ellen M. McDonagh. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nature Genetics, 51(11):1560-1565, 2019. 
doi: 10.1038/s41588-019-0528-2. [39] Ines Filipa Martins, Ana L. Teixeira, Luis Pinheiro, and Andre O. Falcao. A bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6):1686-1697, 2012. doi: 10.1021/ci300124c. URL https://doi.org/10.1021/ci300124c. [40] Fanwang Meng, Yang Xi, Jinfeng Huang, and Paul W. Ayers. A curated diverse molecular"},{"citing_arxiv_id":"2605.04375","ref_index":28,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery","primary_cat":"eess.SY","submitted_at":"2026-05-06T00:50:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces Experiment-as-Code Labs as a declarative stack synthesizing AI agents, systems orchestration, and physical lab control for AI-driven discovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03344","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAG over Thinking Traces Can Improve Reasoning Tasks","primary_cat":"cs.IR","submitted_at":"2026-05-05T04:03:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21304","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal 
LLMs","primary_cat":"cs.IR","submitted_at":"2026-04-23T05:42:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"the model successfully retrieves relevant evidence but fails to integrate it into a coherent and well-justified answer. In these cases, the retrieved observations contain the necessary information, yet the final prediction does not adequately combine, explain, or reason over the evidence, falling short of the depth required by the ground-truth label. (7) Generic-Academic-Assumption characterizes errors where the model relies on generic academic conventions or assumptions rather than concrete evidence from the retrieved materials. This often manifests as reasoning based on common scholarly patterns (e.g., \"the paper does not explicitly state. . . 
\") instead of grounding conclusions in observed content."},{"citing_arxiv_id":"2604.06279","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Plasma GraphRAG: Physics-Grounded Parameter Selection for Gyrokinetic Simulations","primary_cat":"physics.plasm-ph","submitted_at":"2026-04-07T10:04:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Plasma GraphRAG automates physics-grounded parameter selection for gyrokinetic simulations via a domain-specific knowledge graph and LLMs, reporting over 10% better quality and up to 25% fewer hallucinations than standard RAG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04074","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification","primary_cat":"cs.AI","submitted_at":"2026-04-05T11:45:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.15170","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval","primary_cat":"cs.CV","submitted_at":"2026-01-21T16:47:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Large-scale profiling of recent AI literature shows growth in safety, multimodal 
reasoning, and agent studies alongside stabilization in neural machine translation and graph methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21168","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question","primary_cat":"cs.CL","submitted_at":"2025-07-25T15:26:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Question interpretation diversity outperforms model diversity for LLM ensembling on binary QA tasks using majority voting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.11336","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration","primary_cat":"cs.CL","submitted_at":"2025-05-16T15:02:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.13957","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Supervising the search process produces reliable and generalizable information-seeking agents","primary_cat":"cs.CL","submitted_at":"2025-02-19T18:56:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality 
queries on out-of-domain multi-hop tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02871","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T04:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17448","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In Context Learning and Reasoning for Symbolic Regression with Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-10-22T21:50:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GPT-4 models rediscover Langmuir isotherms and produce fits on Nikuradse pipe-flow data via iterative chain-of-thought prompting with scientific context and external code feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.10997","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Generation for Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2023-12-18T07:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive 
tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RAG-Robust [48] Wikipedia Text Chunk Tuning Once RA-Long-Form [49] Dataset-base Text Chunk Tuning Once CoN [50] Wikipedia Text Chunk Tuning Once Self-RAG [25] Wikipedia Text Chunk Tuning Adaptive BGM [26] Wikipedia Text Chunk Inference Once CoQ [51] Wikipedia Text Chunk Inference Iterative Token-Elimination [52] Wikipedia Text Chunk Inference Once PaperQA [53] Arxiv,Online Database,PubMed Text Chunk Inference Iterative NoiseRAG [54] FactoidWiki Text Chunk Inference Once IAG [55] Search Engine,Wikipedia Text Chunk Inference Once NoMIRACL [56] Wikipedia Text Chunk Inference Once ToC [57] Search Engine,Wikipedia Text Chunk Inference Recursive SKR [58] Dataset-base,Wikipedia Text Chunk Inference Adaptive ITRG [59] Wikipedia Text Chunk Inference Iterative"}],"limit":50,"offset":0}