Title resolution pending

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov + 1 more · 2018 · Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/d18-1259

36 Pith papers cite this work, alongside 739 external citations. Polarity classification is still indexing.

36 Pith papers citing it

739 external citations · Crossref

open at publisher browse 36 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

cs.IT · 2026-05-01 · unverdicted · novelty 7.0

SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.

InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

cs.CL · 2026-04-25 · unverdicted · novelty 7.0

Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

Evaluating Temporal Consistency in Multi-Turn Language Models

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

cs.IR · 2026-04-22 · unverdicted · novelty 7.0

Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

Stochastic Sparse Attention for Memory-Bound Inference

cs.LG · 2026-05-03 · accept · novelty 6.0

SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching baseline accuracy.

Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

A Survey of Reasoning-Intensive Retrieval: Progress and Challenges

cs.IR · 2026-04-30 · unverdicted · novelty 6.0

A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.

CL-bench Life: Can Language Models Learn from Real-Life Context?

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

R$^3$AG: Retriever Routing for Retrieval-Augmented Generation

cs.IR · 2026-04-22 · unverdicted · novelty 6.0

R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.

Pause or Fabricate? Training Language Models for Grounded Reasoning

cs.CL · 2026-04-21 · conditional · novelty 6.0

GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.

CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.

No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.

citing papers explorer

Showing 36 of 36 citing papers.

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking cs.LG · 2026-05-13 · unverdicted · none · ref 70
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 95
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers cs.LG · 2026-05-06 · unverdicted · none · ref 25
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass cs.IT · 2026-05-01 · unverdicted · none · ref 81
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees cs.LG · 2026-05-01 · unverdicted · none · ref 89 · 2 links
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR · 2026-04-29 · unverdicted · none · ref 40
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models cs.CL · 2026-04-28 · accept · none · ref 28
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective cs.CL · 2026-04-25 · unverdicted · none · ref 75
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL · 2026-04-24 · unverdicted · none · ref 43
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL · 2026-04-24 · unverdicted · none · ref 45
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation cs.IR · 2026-04-22 · unverdicted · none · ref 5
Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents cs.AI · 2026-04-05 · unverdicted · none · ref 19
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 97
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping cs.CL · 2026-05-07 · unverdicted · none · ref 28
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models cs.LG · 2026-05-07 · unverdicted · none · ref 38 · 2 links
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Stochastic Sparse Attention for Memory-Bound Inference cs.LG · 2026-05-03 · accept · none · ref 41
SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching baseline accuracy.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL · 2026-05-02 · unverdicted · none · ref 22
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges cs.IR · 2026-04-30 · unverdicted · none · ref 82
A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
CL-bench Life: Can Language Models Learn from Real-Life Context? cs.CL · 2026-04-29 · unverdicted · none · ref 70
CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
R$^3$AG: Retriever Routing for Retrieval-Augmented Generation cs.IR · 2026-04-22 · unverdicted · none · ref 25
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
Pause or Fabricate? Training Language Models for Grounded Reasoning cs.CL · 2026-04-21 · conditional · none · ref 48
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models cs.AI · 2026-04-19 · unverdicted · none · ref 28
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search cs.AI · 2026-04-19 · unverdicted · none · ref 3
Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation cs.CL · 2026-04-17 · unverdicted · none · ref 25
NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval cs.AI · 2026-04-13 · unverdicted · none · ref 40
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems cs.CL · 2026-05-11 · unverdicted · none · ref 27
RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text cs.CV · 2026-05-07 · unverdicted · none · ref 30
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress cs.AI · 2026-04-27 · unverdicted · none · ref 19
A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher entropy.
Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation cs.CL · 2026-04-19 · unverdicted · none · ref 107
QREAM rewrites documents to question-focused style using iterative ICL and distilled FT models, boosting RAG performance by up to 8% relative improvement.
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection cs.CL · 2026-04-07 · unverdicted · none · ref 11
Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and revealing an alignment tax.
Gated Delta Networks: Improving Mamba2 with Delta Rule cs.CL · 2024-12-09 · unverdicted · none · ref 64
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
Multilingual E5 Text Embeddings: A Technical Report cs.CL · 2024-02-08 · unverdicted · none · ref 40
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 31 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG cs.AI · 2026-05-07 · unverdicted · none · ref 9
TGS-RAG adds graph-to-text re-ranking with global voting and text-to-graph orphan path bridging to improve precision and efficiency in multi-hop RAG over prior baselines.
Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding cs.CL · 2026-05-11 · unverdicted · none · ref 63
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval cs.IR · 2026-04-29 · conditional · none · ref 69
Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implementation issues.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer