LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.
hub
Cohen and Ruslan Salakhutdinov and Christopher D
86 Pith papers cite this work, alongside 739 external citations. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- dataset 88 + TG-Norm 47.24 50.17 22.68 52.40 46.27 + TG-Norm +D t-rescaling 47.94 50.54 22.77 52.00 46.71 + TG-Norm +D t-rescaling + Ada-Clipping(A 2TGPO) 49.42 51.29 25.21 53.60 48.06 both training and evaluation. Seven open-domain question answering benchmarks are used, or- ganized into two groups by reasoning depth.Multi-hopbenchmarks consist of HotpotQA [ 28], 2WikiMultihopQA [29], MuSiQue [30], and Bamboogle [31].Single-hopbenchmarks consist of Natural Questions (NQ) [ 32], TriviaQA [ 33], and PopQ
- dataset requiring no additional compressor or compression-specific training (distinct from latent-compression approaches [11]). We find that small values such as Lp ∈ {3,5,7} substantially reduce MaxSim cost while preserving the shared-representation design. 4 Benchmarks and Experimental Setup We evaluate INTRA on four Wikipedia-based QA benchmarks: HotPotQA [38], 2WikiMultihopQA [12], MuSiQue [34], and Natural Questions [19]. Together they span bridge and comparison reasoning, cleaner two-hop evidence
- background query involves chaining together multiple related facts across entities, CTI reports, or time (e.g., actor → uses → malware → targets → sector, or comparing campaigns over time). Dense retrieval that returns the top-𝑘 most relevant text chunks [20, 22] can fail when evidence is distributed across distant text fragments, when constraints must be satisfied jointly, or when the answer depends on chaining multiple facts [ 40]. Equally important, LLM-based CTI assistants must reliably abstain when th
- background Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.),Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=R9KnuFlvnU. [69] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.URL https://arxiv. org/abs/2406.12045, 2024. [70] Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, and Xua
co-cited works
representative citing papers
FAPO automates LLM pipeline optimization via iterative diagnosis and prompt-or-structure edits, beating GEPA baseline by +14.1 pp mean across 18 comparisons and +33.8 pp when structural changes occur.
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
TRACE detects corpus poisoning in RAG via token influence attribution to find recurrent keywords tied to target answers.
ARLtR is a framework for jointly constructing knowledge graphs, embeddings, and grounded QA pairs from text, demonstrated on a Roman Empire dataset with over 19,000 entities and 8,400 QA pairs.
Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.
citing papers explorer
-
LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.
-
FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines
FAPO automates LLM pipeline optimization via iterative diagnosis and prompt-or-structure edits, beating GEPA baseline by +14.1 pp mean across 18 comparisons and +33.8 pp when structural changes occur.
-
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents
LatentSkill uses a hypernetwork to generate LoRA adapters from textual skills, enabling weight-space storage that cuts prefill tokens and boosts agent success rates on ALFWorld and Search-QA.
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
-
MemTrain: Self-Supervised Context Memory Training
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
-
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
-
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.
-
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
-
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
-
SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
-
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
-
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
-
Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation
Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.
-
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
-
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
-
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
-
RouterBench: A Benchmark for Multi-LLM Routing System
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution
TRACE detects corpus poisoning in RAG via token influence attribution to find recurrent keywords tied to target answers.
-
All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation
ARLtR is a framework for jointly constructing knowledge graphs, embeddings, and grounded QA pairs from text, demonstrated on a Roman Empire dataset with over 19,000 entities and 8,400 QA pairs.
-
Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents
Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.
-
MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval
MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.
-
SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG
SproutRAG introduces an attention-guided hierarchical framework that constructs a binary chunking tree for multi-granularity retrieval in RAG systems and reports a 6.1% average gain in information efficiency.
-
ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.
-
RSRank: Learning Relevance from Representational Shifts
RSRank learns calibrated relevance scores from alignment between representational shifts induced by candidate documents and those from oracle document sets, enabling zero-threshold filtering.
-
Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models
SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.
-
HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG
HKVM-RAG uses key-value-separated hypergraphs to organize LLM evidence tuples into answer-path hyperedges, yielding F1 gains over KG-PPR on two multi-hop QA benchmarks and further gains when combined with dense retrievers.
-
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
-
Boosting Self-Consistency with Ranking
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
-
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
InfoMem is an answer-conditioned information gain reward for RL training of long-context memory agents that improves performance when applied to successful trajectories and normalized.
-
RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
RASER routers built on one-shot RAG features selectively escalate retrieval, matching SOTA F1 scores on multi-hop QA while using 41-49% of the tokens required by always-prune across six LLMs and three benchmarks.
-
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.
-
HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems
HarnessForge co-evolves harness-policy pairs in LLM agents via fault-guided tailoring and alignment, reporting up to 12% gains over single-component baselines on five benchmarks.
-
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
-
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
SPADER proposes step-wise peer advantage and diversity-aware exploration rewards in RL for multi-answer QA, reporting improved recall and F1 on QAMPARI, Mintaka, WebQSP, and QUEST.
-
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
-
Predictive Prefetching for Retrieval-Augmented Generation
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
-
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
-
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
-
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
-
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
-
CL-bench Life: Can Language Models Learn from Real-Life Context?
CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
-
R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
-
Pause or Fabricate? Training Language Models for Grounded Reasoning
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
-
CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search
Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.
-
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.