{"total":10,"items":[{"citing_arxiv_id":"2605.13534","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12714","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10146","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing","primary_cat":"cs.AI","submitted_at":"2026-05-11T07:54:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EditRisk-Bench demonstrates that malicious knowledge editing reliably induces incorrect or unsafe reasoning in LLMs while largely preserving general 
capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"than only its correctness, is a key factor influencing the success of malicious knowledge manipulation. 4.3 Risk Stealthiness We evaluate the stealthiness of malicious knowledge editing by measuring its impact on two core aspects of model capability: general knowledge and reasoning capacity. Following prior work [27, 26], we assess general knowledge using BoolQ [3] and NaturalQuestions [19], and reasoning capacity using GSM8K [4] for mathematical reasoning and NLI [6] for semantic reasoning. All evaluations are conducted in a closed-book setting, comparing pre-edit and post-edit model performance. We consider three representative scenarios of malicious knowledge editing, including counterfactual injection (RippleEdits [5]), bias injection (EditAttack [1]), and safety-violating edits (BehaviorBench"},{"citing_arxiv_id":"2605.09287","ref_index":14,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[13] Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning, 2024. 
[14] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019. [15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman"},{"citing_arxiv_id":"2605.07243","ref_index":45,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting","primary_cat":"cs.CL","submitted_at":"2026-05-08T04:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"3 are used throughout, and the rank head is enabled after the first 2,000 update steps to let the drafter trunk reach a stable distribution before bucket supervision. Evaluation tasks. We evaluate on six benchmarks spanning conversation, code, competition math, instruction following, question answering, and translation: MT-Bench [41], HumanEval [42], MATH-500 [43], Alpaca [44], Natural Questions (NQ) [45], and WMT-23 [46]. 1 https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k 2 https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered Table 1: Speedup (Spd) over vanilla decoding and average accepted length τ per benchmark at A100-80GB, batch size 1, under HuggingFace Transformers. \"TD%\" is the per-method drafting-cost share. 
Bold indicates the best speedup within each model group."},{"citing_arxiv_id":"2605.05007","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation","primary_cat":"cs.AI","submitted_at":"2026-05-06T15:07:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27914","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Geometry-Calibrated Conformal Abstention for Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-30T14:20:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometry-calibrated conformal abstention lets language models abstain from uncertain queries with finite-sample guarantees on both participation rate and conditional correctness of answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17866","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Latent Abstraction for Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-04-20T06:26:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with 
better efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16079","ref_index":29,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle","primary_cat":"cs.CL","submitted_at":"2025-10-17T12:03:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.10978","ref_index":62,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Group-in-Group Policy Optimization for LLM Agent Training","primary_cat":"cs.LG","submitted_at":"2025-05-16T08:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"To complete the task, the agent must interact with a simulated HTML-based shopping website to search for, navigate to, and ultimately purchase a suitable item. It contains over 1.1 million products and 12k user instructions, providing a rich and diverse action space. 
In addition, we also evaluate the multi-turn tool calling performance of GiGPO on search-augmented QA tasks, including single-hop QA datasets (NQ [62], TriviaQA [63], and PopQA [64]) and multi-hop QA datasets (HotpotQA [65], 2Wiki [66], MuSiQue [67], and Bamboogle [68]). Baselines. For ALFWorld and WebShop, we compare our approach with a range of competitive baselines: (1) Closed-source LLMs: GPT-4o [1] and Gemini-2.5-Pro [2], which represent state-of-the-art capabilities in general-purpose reasoning and language understanding."}],"limit":50,"offset":0}