{"total":11,"items":[{"citing_arxiv_id":"2605.18661","ref_index":175,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rewards extracted from existing papers, with RL-optimized plans preferred by human experts 70% of the time. Third, adaptive test-time compute treats reasoning effort as a controllable resource. IRIS [47] uses MCTS in a human-in-the-loop ideation platform to allocate search as ideas converge, while FlowPIE [210] evolves scientific ideas at test time through flow-guided literature exploration. A recent creativity-centered survey [175] further categorizes these methods into knowledge augmentation, prompt steering, inference-time scaling, multi-agent collaboration, and parameter adaptation. 3.1.2 External Signal-Driven Generation Direct LLM generation is limited by the model's parametric knowledge and by its tendency to produce plausible but weakly grounded ideas. 
External signal-driven methods address this limitation by anchoring"},{"citing_arxiv_id":"2605.17613","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VeriCache: Turning Lossy KV Cache into Lossless LLM Inference","primary_cat":"cs.AR","submitted_at":"2026-05-17T19:18:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16902","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery","primary_cat":"cs.LG","submitted_at":"2026-05-16T09:26:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16616","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility","primary_cat":"cs.LG","submitted_at":"2026-05-15T20:35:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review 
exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11154","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction","primary_cat":"astro-ph.IM","submitted_at":"2026-05-11T19:00:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02651","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review","primary_cat":"cs.DL","submitted_at":"2026-05-04T14:34:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARA uses LLMs to build workflow graphs linking sources, methods, and outputs in papers, then scores reproducibility, reaching ~61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"limiting their generalizability across disciplines and their scalability to the rapidly growing volume of scientific publications. 
Agentic & LLM-assisted Paper Reproduction. Recent advances in LLMs and agentic systems have enabled automated pipelines that translate scientific papers into executable computational workflows, particularly in computer science [33, 43], astrophysics [44], machine learning [45], and quantitative science [46], where experiments rely on standardized datasets, statistical models, and software-based implementations [46]. A growing line of work studies document-to-codebase synthesis [43], reconstructing repository-level implementations from textual method descriptions through hierarchical information-flow optimization [43]."},{"citing_arxiv_id":"2604.25256","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery","primary_cat":"cs.AI","submitted_at":"2026-04-28T06:05:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https://github.com/CherYou/AutoResearchBench. 1 Introduction The recent rise of LLM-based AI scientists systems has made autonomous scientific research a concrete target rather than a distant aspiration [1, 2, 3, 4, 5]. Across ideation, design, implementation, and experimentation, one capability is repeatedly indispensable: finding the right scientific literature [6, 7, 8, 9, 10]. Literature discovery serves both to explore existing knowledge around a problem and to gather evidence for verifying assumptions and supporting claims. 
An autonomous researcher must identify what is already known, which assumptions are supported or contradicted, what methods and evaluation protocols are appropriate, and where the strongest evidence for a claim resides."},{"citing_arxiv_id":"2604.19606","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories","primary_cat":"cs.AI","submitted_at":"2026-04-21T15:55:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"tions with the sparse additive mechanism shift variational autoencoder. In Advances in Neural Information Processing Systems, volume 36, pp. 1-12, 2023. Bunne, C., Roohani, Y., Rosen, Y., Gupta, A., Zhang, X., Roed, M., Alexandrov, T., AlQuraishi, M., Brennan, P., Burkhardt, D. B., et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187(25):7045-7063, 2024. Chen, H., Xiong, M., Lu, Y., Han, W., Deng, A., He, Y., Wu, J., Li, Y., Liu, Y., and Hooi, B. Mlr-bench: Evaluating ai agents on open-ended machine learning research. arXiv preprint arXiv:2505.19955, 2025. DeepMind, G. Gemini 3 pro model card, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card."},{"citing_arxiv_id":"2604.17745","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution","primary_cat":"cs.CL","submitted_at":"2026-04-20T02:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.21720","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation","primary_cat":"cs.AI","submitted_at":"2025-08-29T15:36:06+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.22598","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RExBench: Can coding agents autonomously implement AI research extensions?","primary_cat":"cs.CL","submitted_at":"2025-06-27T19:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}