{"total":104,"items":[{"citing_arxiv_id":"2606.31247","ref_index":259,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model","primary_cat":"cs.SD","submitted_at":"2026-06-30T07:24:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30814","ref_index":109,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-29T18:37:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29844","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers","primary_cat":"cs.CL","submitted_at":"2026-06-29T06:33:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in 
transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28615","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs","primary_cat":"cs.LG","submitted_at":"2026-06-26T21:14:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28044","ref_index":284,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-26T12:46:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27786","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-06-26T07:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance 
contextual and parametric knowledge during RAG generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27679","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-26T03:23:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27361","ref_index":41,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Autoregressive Boltzmann Generators","primary_cat":"cs.LG","submitted_at":"2026-06-25T17:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ArBG replaces flow-based methods with autoregressive models for Boltzmann sampling, showing gains on peptide benchmarks and a 132M-parameter model Robin cutting zero-shot energy error by over 60% on 8-residue systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23591","ref_index":66,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior","primary_cat":"cs.LG","submitted_at":"2026-06-22T17:00:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Data-similarity and 
data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22645","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation","primary_cat":"cs.IR","submitted_at":"2026-06-21T19:09:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARLtR is a framework for jointly constructing knowledge graphs, embeddings, and grounded QA pairs from text, demonstrated on a Roman Empire dataset with over 19,000 entities and 8,400 QA pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22511","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding","primary_cat":"cs.CL","submitted_at":"2026-06-21T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19667","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG 
Inference","primary_cat":"cs.CL","submitted_at":"2026-06-18T00:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13392","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MiniMax Sparse Attention","primary_cat":"cs.AI","submitted_at":"2026-06-11T14:23:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2x prefill and 7.6x decode speedups via co-designed GPU kernel.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12397","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Redesign Mixture-of-Experts Routers with Manifold Power Iteration","primary_cat":"cs.LG","submitted_at":"2026-06-10T17:57:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12117","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Soft-Prompt Tuning for Fair and 
Efficient LLM Benchmark Evaluation","primary_cat":"cs.CL","submitted_at":"2026-06-10T14:12:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Soft-prompt tuning with 10 vectors improves format compliance on LLM benchmarks and provides a low-cost proxy for comparing base models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12479","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ReCal: Reward Calibration for RL-based LLM Routing","primary_cat":"cs.LG","submitted_at":"2026-06-10T06:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReCal introduces hierarchical reward decomposition and distribution-aware optimization to address ambiguous credit assignment and optimization bias in RL-based LLM routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11470","ref_index":109,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes","primary_cat":"cs.CL","submitted_at":"2026-06-09T21:59:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08804","ref_index":66,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Q-Delta: Beyond Key-Value Associative State 
Evolution","primary_cat":"cs.AI","submitted_at":"2026-06-07T19:49:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08702","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-06-07T15:59:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06087","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-06-04T12:26:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05734","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When AI Says It 
Feels","primary_cat":"cs.AI","submitted_at":"2026-06-04T05:49:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs trained via rubric-based self-rewarding RL with GRPO enhanced feeling expression and sycophancy robustness but degraded truthful QA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05054","ref_index":143,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Boosting Self-Consistency with Ranking","primary_cat":"cs.CL","submitted_at":"2026-06-03T16:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04302","ref_index":76,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding","primary_cat":"cs.CL","submitted_at":"2026-06-03T00:12:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03846","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Clustered Self-Assessment: A 
Simple yet Effective Method for Uncertainty Quantification in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-02T16:25:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Clustered Self-Assessment groups sampled LLM responses into semantic clusters, presents clusters as multiple-choice options, and uses the LLM's assigned probabilities to those options as direct uncertainty estimates, outperforming entropy baselines with as few as two extra samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03197","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MemTrain: Self-Supervised Context Memory Training","primary_cat":"cs.CL","submitted_at":"2026-06-02T05:56:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02544","ref_index":27,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimSD: Simple Speculative Decoding in Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-01T17:46:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02245","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:39:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02093","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Role of Ambiguity in Error Prediction via Uncertainty Quantification","primary_cat":"cs.CL","submitted_at":"2026-06-01T11:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Disentangling input ambiguity from uncertainty quantification improves error prediction for LLMs on QA tasks, yielding over 10 PRR point gains across models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01923","ref_index":56,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time","primary_cat":"cs.CL","submitted_at":"2026-06-01T08:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value 
magnitude.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01033","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection","primary_cat":"cs.AI","submitted_at":"2026-05-31T05:48:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00881","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations","primary_cat":"cs.CL","submitted_at":"2026-05-30T20:32:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical study claiming to be the first broad comparison of chunking methods in RAG, highlighting effectiveness, cost, and generalization limitations across scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00683","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OCC-RAG: Optimal Cognitive Core for Faithful Question Answering","primary_cat":"cs.CL","submitted_at":"2026-05-30T11:42:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OCC-RAG develops task-specialized SLMs (0.6B and 1.7B) via a new synthetic data pipeline for multi-hop reasoning and context faithfulness, claiming to match or exceed 2-6x larger general models on HotpotQA, 
MuSiQue, TAT-QA, ConFiQA, and MuSiQue-Un.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00357","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From \"Weak\" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging","primary_cat":"cs.AI","submitted_at":"2026-05-29T21:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Aggregating preference deltas from several weak-weaker model pairs via LoRA adapters and geometric alignment merging improves strong-model performance on reasoning and search benchmarks beyond any single delta.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07597","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them","primary_cat":"cs.LG","submitted_at":"2026-05-29T06:08:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29727","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting","primary_cat":"cs.LG","submitted_at":"2026-05-28T10:21:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BASTION is a budget-aware speculative decoding 
framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29224","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-28T01:23:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28721","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?","primary_cat":"cs.AI","submitted_at":"2026-05-27T16:39:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27996","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization 
Pressure","primary_cat":"cs.AI","submitted_at":"2026-05-27T05:40:22+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27164","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering","primary_cat":"cs.AI","submitted_at":"2026-05-26T15:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26366","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Automatic Layer Selection for Hallucination Detection","primary_cat":"cs.AI","submitted_at":"2026-05-25T22:28:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27445","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"RAGe: A Retrieval-Augmented Generation Evaluation Framework","primary_cat":"cs.IR","submitted_at":"2026-05-23T17:46:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"RAGe is a modular evaluation framework that correlates retrieval and generation quality with hardware constraints to recommend optimal RAG components for specific datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21413","ref_index":17,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work","primary_cat":"cs.AI","submitted_at":"2026-05-20T17:09:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20490","ref_index":45,"ref_count":3,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems","primary_cat":"cs.AI","submitted_at":"2026-05-19T20:55:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17989","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Predictive Prefetching for Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-05-18T07:45:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14754","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition","primary_cat":"cs.AI","submitted_at":"2026-05-14T12:19:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XDomainBench shows LLMs suffer systematic reasoning collapse as domain composition order increases due to direct difficulty and interaction-amplified failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14449","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition","primary_cat":"cs.LG","submitted_at":"2026-05-14T06:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination 
detection with top in-domain AUROC and up to 21% better OOD transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12227","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:04:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"data-efficient approach to bootstrap long-context behavior and can be viewed as a form of imitation learning (Ross et al., 2011) for language models. Formally, given an expert dataset D of input-output pairs (x, y), we optimize the model parameters θ by minimizing the token-level negative log-likelihood: $L_{\\mathrm{SFT}}(\\theta) = \\mathbb{E}_{(x,y)\\sim D}\\left[ -\\sum_{t=1}^{|y|} \\log \\pi_\\theta(y_t \\mid x, y_{<t}) \\right]$ (2). This objective delivers dense, token-level supervision, making it computationally efficient and effective for instilling task adherence and structural coherence (Raffel et al., 2020; Ouyang et al., 2022). 
However, when a strong teacher model is available, knowledge distillation (Hinton et al., 2015) can further enhance learning by aligning the student policy"},{"citing_arxiv_id":"2605.11608","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head","primary_cat":"cs.CL","submitted_at":"2026-05-12T06:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"7750 248.2226 17.8641 266.0867 0.3658 BnB INT8 138.96 139.09 0.9880 56.3886 056.3886 0.0265 NF4 138.96 144.16 0.9124 155.4506 0 155.4506 0.0750 FP4 138.96 138.10 0.9196 145.1767 0 145.1767 0.1306 GPTQ GPTQ-4bit 138.96 140.37 0.9298 136.7867 0 136.7867 0.1422 Benchmarks and scoring. Five benchmarks: MMLU [28], ARC [29] (multiple-choice knowledge), TriviaQA [30], SQuAD [31] (short-horizon QA), and GSM8K [32] (multi-step reasoning). All risks are computed teacher-forced (prompt c and targets y scored in a single forward pass over the gold span), producing a deterministic per-sample CE loss whose expectation gives the model's risk R_M, and |∆R| is the target-vs-proxy gap we report. 
Calibration and hyperparameters."},{"citing_arxiv_id":"2605.10828","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hard distractors trigger a nonlinear 'First Drop of Ink' performance collapse in long-context LLM reasoning, with most damage from the initial small fraction via disproportionate attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06200","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","submitted_at":"2026-05-07T13:09:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"+ TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, organized into two groups by reasoning depth. Multi-hop benchmarks consist of HotpotQA [28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31]. Single-hop benchmarks consist of Natural Questions (NQ) [32], TriviaQA [33], and PopQA [34]. We train and evaluate on three backbones: Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. 
We report Exact Match (EM) as the primary metric on every benchmark as well as the average accuracy across all evaluation samples. This experiment setting deliberately avoids proprietary APIs and heavyweight tool infrastructure, keeping the evaluation reproducible and concentrating on the progress of the RL algorithm."}],"limit":50,"offset":0}