{"total":12,"items":[{"citing_arxiv_id":"2605.20478","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables","primary_cat":"cs.CL","submitted_at":"2026-05-19T20:41:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stage-Audit raises source-frontier precision from 0.356 to 0.505 and F1 from 0.334 to 0.451 on a 51-instance cross-domain set by enforcing disjoint write rights and row-level source gates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15109","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG","primary_cat":"cs.AI","submitted_at":"2026-05-14T17:25:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In Agentic GraphRAG, cited evidence is necessary but not sufficient for accurate answers, as uncited traversal context and graph structure also affect results, requiring evaluation of the full retrieval trajectory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09012","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics","primary_cat":"cs.AI","submitted_at":"2026-05-09T15:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for 
leakage.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"cluster intervals resample citing papers. Model ToolAcc Wilson 95% CI Paper-cluster 95% CI Claude Opus 4.5 7.0% (14/200) [4.2%, 11.4%] [3.6%, 10.7%] Grok 4 3.5% (7/200) [1.7%, 7.0%] [1.0%, 6.6%] Kimi K2 Thinking 3.5% (7/200) [1.7%, 7.0%] [1.0%, 6.4%] GPT-5.2 3.0% (6/200) [1.4%, 6.4%] [1.0%, 5.5%] DeepSeek V3.2 2.5% (5/200) [1.1%, 5.7%] [0.5%, 5.0%] Gemini 3.1 Pro 2.0% (4/200) [0.8%, 5.0%] [0.5%, 4.0%] Qwen3-235B Thinking 1.0% (2/200) [0.3%, 3.6%] [0.0%, 2.6%] Table 9: Domain-level ToolAcc on Eval-200. Each domain has 40 instances. Model Alg./NT Anal./PDE Comb. Geom./Top. Prob./Stat./Ctrl. GPT-5.2 0.0% 5.0% 5.0% 2.5% 2.5% Gemini 3.1 Pro 0.0% 5.0% 0.0% 2.5% 2.5% Claude Opus 4.5 5.0% 2.5% 10.0% 7.5% 10.0% DeepSeek V3.2 0.0% 0.0% 5.0% 0.0% 7."},{"citing_arxiv_id":"2605.00505","ref_index":47,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Oriented Information Retrieval: A Denoising-First Perspective","primary_cat":"cs.IR","submitted_at":"2026-05-01T08:30:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"In-context Autoencoder for Context Compression in a Large Language Model. In The Twelfth International Conference on Learning Representations. [47] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. 2019. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 111-116. 
[48] Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, and Xinyi Chen. 2025. LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation. arXiv preprint arXiv:2510.05138 (2025). [49] Alon Gorenshtein, Kamel Shihada, Moran Sorka, Dvir Aran, and Shahar Shelly. 2025. LITERAS: Biomedical literature review and citation retrieval agents."},{"citing_arxiv_id":"2604.15621","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking","primary_cat":"cs.IR","submitted_at":"2026-04-17T02:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Finally, by integrating the adaptive ranker with passage dropout and instruction distillation, we will get an AdaRankLLM that is capable of selecting necessary passages, while maintaining efficiency and accessible. IV. EXPERIMENTAL SETUP A. Datasets We validate our method on three diverse public datasets: ASQA [31], QAMPARI [32], and ELI5 [33]. Consistent with standard protocols [34], we employ Exact Match (EM) for ASQA, F1 score for QAMPARI, and Claim Recall for ELI5. To evaluate comprehensive capabilities, we also report an Overall* metric computed as the average of these scores. Detailed dataset information are provided in Table I. B. 
Baselines We compare AdaRankLLM with several baselines categorized into three groups: (1) Vanilla-k: Models utilizing re-"},{"citing_arxiv_id":"2604.07012","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DTCRS: Dynamic Tree Construction for Recursive Summarization","primary_cat":"cs.CL","submitted_at":"2026-04-08T12:33:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.19977","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Context Attribution with Multi-Armed Bandit Optimization","primary_cat":"cs.AI","submitted_at":"2025-06-24T19:47:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Formulates context attribution as a combinatorial multi-armed bandit problem solved via Linear Thompson Sampling to reduce LLM queries by up to 30% on QA benchmarks while matching existing attribution quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.04338","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-depth Analysis of Graph-based RAG in a Unified Framework","primary_cat":"cs.IR","submitted_at":"2025-03-06T11:34:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing 
components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14751","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Query pipeline optimization for cancer patient question answering systems","primary_cat":"cs.CL","submitted_at":"2024-12-19T11:30:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.18059","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval","primary_cat":"cs.CL","submitted_at":"2024-01-31T18:30:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.15391","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries","primary_cat":"cs.CL","submitted_at":"2024-01-27T11:41:48+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence 
pieces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.11511","ref_index":131,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection","primary_cat":"cs.CL","submitted_at":"2023-10-17T18:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}