Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
Enabling large language models to generate text with citations
12 Pith papers cite this work. Polarity classification is still indexing.
roles: background (3)
representative citing papers
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
Stage-Audit raises source-frontier precision from 0.356 to 0.505 and F1 from 0.334 to 0.451 on a 51-instance cross-domain set by enforcing disjoint write rights and row-level source gates.
In Agentic GraphRAG, cited evidence is necessary but not sufficient for accurate answers, as uncited traversal context and graph structure also affect results, requiring evaluation of the full retrieval trajectory.
Formulates context attribution as a combinatorial multi-armed bandit problem solved via Linear Thompson Sampling to reduce LLM queries by up to 30% on QA benchmarks while matching existing attribution quality.
A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.
DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
A three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports a 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.
citing papers explorer
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection