HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Christopher D. Manning, Peng Qi, Ruslan Salakhutdinov, Saizheng Zhang, William Cohen, Yoshua Bengio, Zhilin Yang · 2018 · Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/d18-1259

101 Pith papers cite this work, alongside 739 external citations (Crossref). Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 2

citation-polarity summary

background: 2 · use dataset: 2

claims ledger

dataset: 88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping (A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, organized into two groups by reasoning depth. Multi-hop benchmarks consist of HotpotQA [28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31]. Single-hop benchmarks consist of Natural Questions (NQ) [32], TriviaQA [33], and PopQ
dataset: requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 4 Benchmarks and Experimental Setup. We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. Together they span bridge and comparison reasoning, cleaner two-hop evidence
background: query involves chaining together multiple related facts across entities, CTI reports, or time (e.g., actor → uses → malware → targets → sector, or comparing campaigns over time). Dense retrieval that returns the top-𝑘 most relevant text chunks [20, 22] can fail when evidence is distributed across distant text fragments, when constraints must be satisfied jointly, or when the answer depends on chaining multiple facts [40]. Equally important, LLM-based CTI assistants must reliably abstain when th
background: Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=R9KnuFlvnU. [69] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045, 2024. [70] Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, and Xua

authors

Christopher D. Manning, Peng Qi, Ruslan Salakhutdinov, Saizheng Zhang, William Cohen, Yoshua Bengio, Zhilin Yang

representative citing papers

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

cs.CL · 2026-06-11 · unverdicted · novelty 8.0

LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

FAPO automates LLM pipeline optimization via iterative diagnosis and prompt-or-structure edits, beating GEPA baseline by +14.1 pp mean across 18 comparisons and +33.8 pp when structural changes occur.

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

MemTrain: Self-Supervised Context Memory Training

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.

GrepSeek: Training Search Agents for Direct Corpus Interaction

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

GrepSeek introduces a two-stage trained agent that uses shell commands for direct corpus search, achieving the strongest token-level F1 and Exact Match on seven open-domain QA benchmarks.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

cs.SE · 2026-05-24 · unverdicted · novelty 7.0

Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.

Latent Cache Flow: Model-to-Model Communication Without Text

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

cs.DC · 2026-05-15 · unverdicted · novelty 7.0

HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

cs.IT · 2026-05-01 · unverdicted · novelty 7.0

SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

cs.CL · 2026-04-25 · conditional · novelty 7.0 · 2 refs

A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.

Evaluating Temporal Consistency in Multi-Turn Language Models

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

cs.IR · 2026-04-22 · unverdicted · novelty 7.0

Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

citing papers explorer

Showing 50 of 101 citing papers.

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling cs.CL · 2026-06-11 · unverdicted · none · ref 1
LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 109
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines cs.SE · 2026-06-17 · unverdicted · none · ref 34
FAPO automates LLM pipeline optimization via iterative diagnosis and prompt-or-structure edits, beating GEPA baseline by +14.1 pp mean across 18 comparisons and +33.8 pp when structural changes occur.
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents cs.CL · 2026-06-04 · unverdicted · none · ref 51
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding cs.CL · 2026-06-03 · unverdicted · none · ref 78
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
MemTrain: Self-Supervised Context Memory Training cs.CL · 2026-06-02 · unverdicted · none · ref 14
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation cs.CL · 2026-06-01 · unverdicted · none · ref 18
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
GrepSeek: Training Search Agents for Direct Corpus Interaction cs.CL · 2026-05-28 · unverdicted · none · ref 2
GrepSeek introduces a two-stage trained agent that uses shell commands for direct corpus search, achieving the strongest token-level F1 and Exact Match on seven open-domain QA benchmarks.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 12
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki cs.CL · 2026-05-25 · unverdicted · none · ref 21
LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.
Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets cs.SE · 2026-05-24 · unverdicted · none · ref 23
Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.
Latent Cache Flow: Model-to-Model Communication Without Text cs.LG · 2026-05-19 · unverdicted · none · ref 2
Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling cs.DC · 2026-05-15 · unverdicted · none · ref 48
HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking cs.LG · 2026-05-13 · unverdicted · none · ref 70
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers cs.LG · 2026-05-06 · unverdicted · none · ref 25
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass cs.IT · 2026-05-01 · unverdicted · none · ref 81
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR · 2026-04-29 · unverdicted · none · ref 40
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models cs.CL · 2026-04-28 · accept · none · ref 28
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective cs.CL · 2026-04-25 · conditional · none · ref 75 · 2 links
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL · 2026-04-24 · unverdicted · none · ref 43
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL · 2026-04-24 · unverdicted · none · ref 45
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation cs.IR · 2026-04-22 · unverdicted · none · ref 5
Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents cs.AI · 2026-04-05 · unverdicted · none · ref 19
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation cs.CL · 2025-10-09 · unverdicted · none · ref 25
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models cs.CL · 2024-04-29 · conditional · none · ref 28
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
RouterBench: A Benchmark for Multi-LLM Routing System cs.LG · 2024-03-18 · unverdicted · none · ref 116
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 97
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation cs.CL · 2026-06-26 · unverdicted · none · ref 28
SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.
Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution cs.CR · 2026-06-24 · unverdicted · none · ref 56
TRACE detects corpus poisoning in RAG via token influence attribution to find recurrent keywords tied to target answers.
All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation cs.IR · 2026-06-21 · unverdicted · none · ref 31
ARLtR is a framework for jointly constructing knowledge graphs, embeddings, and grounded QA pairs from text, demonstrated on a Roman Empire dataset with over 19,000 entities and 8,400 QA pairs.
Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents cs.AI · 2026-06-19 · unverdicted · none · ref 56
Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.
MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval cs.CL · 2026-06-16 · unverdicted · none · ref 39
MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.
SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG cs.CL · 2026-06-16 · unverdicted · none · ref 32
SproutRAG introduces an attention-guided hierarchical framework that constructs a binary chunking tree for multi-granularity retrieval in RAG systems and reports a 6.1% average gain in information efficiency.
ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation cs.CL · 2026-06-16 · unverdicted · none · ref 36
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.
RSRank: Learning Relevance from Representational Shifts cs.IR · 2026-06-16 · unverdicted · none · ref 54
RSRank learns calibrated relevance scores from alignment between representational shifts induced by candidate documents and those from oracle document sets, enabling zero-threshold filtering.
Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models cs.CL · 2026-06-10 · unverdicted · none · ref 75
SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.
HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG cs.IR · 2026-06-05 · unverdicted · none · ref 7
HKVM-RAG uses key-value-separated hypergraphs to organize LLM evidence tuples into answer-path hyperedges, yielding F1 gains over KG-PPR on two multi-hop QA benchmarks and further gains when combined with dense retrievers.
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions cs.CL · 2026-06-04 · unverdicted · none · ref 18
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
Boosting Self-Consistency with Ranking cs.CL · 2026-06-03 · unverdicted · none · ref 162
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain cs.AI · 2026-06-02 · unverdicted · none · ref 14
InfoMem is an answer-conditioned information gain reward for RL training of long-context memory agents that improves performance when applied to successful trajectories and normalized.
RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering cs.AI · 2026-06-01 · unverdicted · none · ref 1
RASER routers built on one-shot RAG features selectively escalate retrieval, matching SOTA F1 scores on multi-hop QA while using 41-49% of the tokens required by always-prune across six LLMs and three benchmarks.
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts cs.CL · 2026-06-01 · unverdicted · none · ref 18
K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.
HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems cs.CL · 2026-06-01 · unverdicted · none · ref 35
HarnessForge co-evolves harness-policy pairs in LLM agents via fault-guided tailoring and alignment, reporting up to 12% gains over single-component baselines on five benchmarks.
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection cs.AI · 2026-05-31 · unverdicted · none · ref 19
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering cs.CL · 2026-05-30 · unverdicted · none · ref 25
SPADER proposes step-wise peer advantage and diversity-aware exploration rewards in RL for multi-answer QA, reporting improved recall and F1 on QAMPARI, Mintaka, WebQSP, and QUEST.
Automatic Layer Selection for Hallucination Detection cs.AI · 2026-05-25 · unverdicted · none · ref 45
FEPoID automatically selects optimal or near-optimal intermediate layers for hallucination detection across LLM architectures and tasks, outperforming prior criteria and baselines, with an added truncation step that further improves performance.
Efficient DP-SGD for LLMs with Randomized Clipping cs.LG · 2026-05-24 · unverdicted · none · ref 47
DP-SGD-RC applies Hutchinson and Hutch++ estimators to approximate per-sample gradient norms for clipping in DP-SGD, claiming competitive privacy noise multipliers and utility on Llama 3.2-1B with reduced memory.
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving cs.AR · 2026-05-20 · unverdicted · none · ref 40
CacheTune delivers 3.72x-4.86x TTFT speedup and 3.93x-6.21x throughput in long-context LLM serving via frequency-guided selective KV recomputation and hardware-aware I/O overlap while keeping output quality near full recompute.
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR cs.LG · 2026-05-19 · unverdicted · none · ref 40
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
Predictive Prefetching for Retrieval-Augmented Generation cs.CL · 2026-05-18 · unverdicted · none · ref 2
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
