36 Pith papers cite this work, alongside 739 external citations.
citing papers explorer
- F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
- SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
- The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
- Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
- Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
- Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation
Semantic stratification organizes documents into entity-based clusters to systematically generate queries for missing strata, yielding formal coverage guarantees and interpretable failure mode visibility in retrieval evaluation.
- Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
- A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
- Stochastic Sparse Attention for Memory-Bound Inference
SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching baseline accuracy.
- Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
- A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
- CL-bench Life: Can Language Models Learn from Real-Life Context?
CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
- R³AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
- Pause or Fabricate? Training Language Models for Grounded Reasoning
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
- CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search
Joint RL training of reasoning agent and document ranker via GRPO with semantic grouping and composite rewards yields consistent gains over fixed-retrieval baselines on seven QA benchmarks.
- No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.
- Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
- RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.
- LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
- An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher entropy.
- Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation
QREAM rewrites documents to question-focused style using iterative ICL and distilled FT models, boosting RAG performance by up to 8% relative improvement.
- From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and revealing an alignment tax.
- Gated Delta Networks: Improving Mamba2 with Delta Rule
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
- Multilingual E5 Text Embeddings: A Technical Report
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
- Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG
TGS-RAG adds graph-to-text re-ranking with global voting and text-to-graph orphan path bridging to improve precision and efficiency in multi-hop RAG over prior baselines.
- Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
- Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval
A reproducibility study confirms that Hypencoder's non-linear, query-specific scoring improves retrieval over bi-encoders on standard benchmarks, but standard methods remain faster, and hard-task results are mixed due to implementation issues.