RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10representative citing papers
SEAM learns to generate utility-optimized structured experiences via rollouts to boost frozen LLM performance on mathematical reasoning benchmarks with low overhead.
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
Empirical study finds diversity collapse in multi-agent LLM ideation arises from structural coupling in interactions, not model limitations.
Pi-Serini shows a tuned BM25 lexical retriever with adequate depth, used inside an LLM agentic loop, reaches 83.1% accuracy and 94.7% evidence recall on BrowseComp-Plus while beating released dense-retriever agents.
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
citing papers explorer
-
RWGBench: Evaluating Scholarly Positioning in Related Work Generation
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
-
Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs
SEAM learns to generate utility-optimized structured experiences via rollouts to boost frozen LLM performance on mathematical reasoning benchmarks with low overhead.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
Learning to Retrieve from Agent Trajectories
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
-
Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation
Empirical study finds diversity collapse in multi-agent LLM ideation arises from structural coupling in interactions, not model limitations.
-
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Pi-Serini shows a tuned BM25 lexical retriever with adequate depth, used inside an LLM agentic loop, reaches 83.1% accuracy and 94.7% evidence recall on BrowseComp-Plus while beating released dense-retriever agents.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.