pith. machine review for the scientific record.

arxiv: 2005.11401 · v4 · submitted 2020-05-22 · 💻 cs.CL · cs.LG

Recognition: 3 theorem links · Lean Theorem

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG
keywords retrieval-augmented generation · open-domain question answering · knowledge-intensive NLP · dense retrieval · seq2seq models · Wikipedia index · language generation

The pith

Retrieval-augmented generation models combine a seq2seq generator with a dense Wikipedia retriever to outperform purely parametric models on knowledge-intensive tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RAG models that augment a pre-trained sequence-to-sequence language model with a non-parametric memory in the form of a dense vector index over Wikipedia passages. A pre-trained neural retriever surfaces relevant passages that the generator then conditions on during fine-tuning and inference. This hybrid approach is tested across multiple knowledge-intensive NLP tasks, where it reaches state-of-the-art results on three open-domain question answering benchmarks while also producing more specific, diverse, and factual text than strong parametric baselines. The method directly addresses the limited ability of fixed-parameter models to access and update precise factual knowledge.
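
For readers who want to poke at this directly, the released checkpoints are exposed through the HuggingFace transformers RAG classes. A minimal sketch, assuming transformers plus its datasets and faiss dependencies are installed; the dummy-index flag (as documented for the library, not taken from the paper) avoids downloading the full Wikipedia index at the cost of meaningful retrieval:

    # Query a released RAG checkpoint via HuggingFace transformers.
    from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
    )
    model = RagSequenceForGeneration.from_pretrained(
        "facebook/rag-sequence-nq", retriever=retriever
    )

    inputs = tokenizer("who wrote the origin of species?", return_tensors="pt")
    generated = model.generate(input_ids=inputs["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))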

Core claim

RAG models pair a pre-trained parametric seq2seq model with a non-parametric dense vector index of Wikipedia accessed by a pre-trained neural retriever. Two formulations are introduced: RAG-sequence, which conditions the entire output on the same retrieved passages, and RAG-token, which can draw on different passages for each token. After fine-tuning, these models set new state-of-the-art scores on three open-domain QA tasks, surpass both parametric seq2seq models and task-specific retrieve-and-extract systems, and generate more factual language than a parametric-only baseline.
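
In symbols, with retriever distribution p_η(z|x) over the top-k retrieved passages z and generator p_θ, the paper's two marginalizations are:

    p_{\text{RAG-Sequence}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

    p_{\text{RAG-Token}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})

RAG-Sequence treats the retrieved passage as a single latent variable for the whole output; RAG-Token re-marginalizes at every generation step, so different tokens can draw on different passages.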

What carries the argument

Retrieval-augmented generation (RAG), which integrates a parametric seq2seq generator with a non-parametric dense retriever over a fixed Wikipedia passage index so that generation is explicitly conditioned on retrieved evidence.
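
To make the retrieval half concrete, here is a toy maximum inner product search (MIPS) over a dense passage index in the style of DPR; the random embeddings are stand-ins for the BERT-based query and document encoders:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_passages, k = 768, 10_000, 5

    # Frozen passage index (document-encoder outputs in the real system).
    passage_index = rng.standard_normal((n_passages, d), dtype=np.float32)
    query_vec = rng.standard_normal(d, dtype=np.float32)  # query-encoder output

    scores = passage_index @ query_vec   # inner-product relevance scores
    top_k = np.argsort(-scores)[:k]      # indices of the k most relevant passages

    # Retriever distribution p(z|x): softmax over the top-k scores only.
    s = scores[top_k] - scores[top_k].max()
    p_z = np.exp(s) / np.exp(s).sum()
    # The generator conditions on each retrieved passage; its per-passage
    # outputs are then marginalized with weights p_z (see the formulations above).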

If this is right

  • RAG models set the state of the art on three open-domain question answering tasks.
  • RAG outperforms both purely parametric seq2seq models and specialized retrieve-and-extract architectures on knowledge-intensive tasks.
  • Generated text from RAG models is more specific, diverse, and factually accurate than output from parametric-only seq2seq baselines.
  • The architecture supplies an explicit, updatable non-parametric memory that parametric models lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Swapping or updating the underlying Wikipedia index would allow the model to incorporate new facts without retraining the generator parameters (see the index-swap sketch after this list).
  • The retrieved passages can be returned alongside each generated answer to provide direct provenance for the output.
  • Replacing the Wikipedia index with a domain-specific corpus would extend the same retrieval-plus-generation pattern to specialized knowledge tasks.
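
A sketch of the first point, using FAISS as the vector index; the embed helper is a hypothetical stand-in for the frozen DPR encoders, not anything from the paper:

    import faiss                 # pip install faiss-cpu
    import numpy as np

    d = 768
    index = faiss.IndexFlatIP(d)  # exact inner-product index over passage vectors

    def embed(texts):
        # Hypothetical stand-in for the frozen DPR document/query encoders.
        rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
        return rng.standard_normal((len(texts), d)).astype(np.float32)

    index.add(embed(["Passage from the original Wikipedia snapshot."]))
    # Later: fold in new evidence; the generator's weights are untouched.
    index.add(embed(["Passage stating a fact published after training."]))

    scores, ids = index.search(embed(["some query"]), 2)  # top-2 passages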

Load-bearing premise

The pre-trained dense retriever must reliably surface the passages containing the knowledge each task needs, and the generator must actually use them rather than ignoring the evidence or hallucinating.

What would settle it

A decisive failure case: an open-domain QA example whose correct answer appears verbatim in a Wikipedia passage, yet the retriever returns unrelated passages and the model still produces the wrong answer.

read the original abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Retrieval-Augmented Generation (RAG) models that combine a pre-trained parametric seq2seq model (BART) with non-parametric memory in the form of a dense vector index over Wikipedia, accessed via a pre-trained neural retriever (DPR). Two formulations are compared: RAG-Sequence, which conditions generation on the same set of retrieved passages throughout, and RAG-Token, which permits different passages per token via marginalization. The models are fine-tuned on a range of knowledge-intensive NLP tasks and reported to achieve state-of-the-art results on three open-domain QA benchmarks while producing more specific, diverse, and factual outputs than parametric-only seq2seq baselines.

Significance. If the results are robust, the work is significant for establishing a general fine-tuning recipe that augments parametric language models with differentiable access to explicit external memory. This directly mitigates limitations in factual recall, provenance, and knowledge updating for knowledge-intensive tasks, and the empirical outperformance over both pure parametric models and specialized retrieve-and-extract architectures suggests a promising direction for hybrid systems.

major comments (2)
  1. [Experiments / Results] The central SOTA claim on open-domain QA rests on the pre-trained DPR retriever reliably returning passages that contain the necessary knowledge for the majority of queries, followed by successful integration by the generator without ignoring or hallucinating content. The manuscript should include a quantitative retrieval analysis (e.g., top-k recall of gold-answer passages on the evaluation sets for Natural Questions, TriviaQA, and WebQuestions) to substantiate that the reported gains derive from effective RAG rather than other factors.
  2. [Abstract and Experiments] No error bars, standard deviations, or statistical significance tests are reported for the QA metrics or generation quality scores. Given that the outperformance over parametric seq2seq and retrieve-and-extract baselines is the primary evidence for the framework's value, the absence of these details leaves the robustness of the central empirical claims difficult to assess.
minor comments (1)
  1. [Abstract] The abstract refers to evaluation on 'a wide range of knowledge-intensive NLP tasks' without enumerating them; adding a short list (e.g., the specific QA, fact verification, and generation datasets) would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Experiments / Results] The central SOTA claim on open-domain QA rests on the pre-trained DPR retriever reliably returning passages that contain the necessary knowledge for the majority of queries, followed by successful integration by the generator without ignoring or hallucinating content. The manuscript should include a quantitative retrieval analysis (e.g., top-k recall of gold-answer passages on the evaluation sets for Natural Questions, TriviaQA, and WebQuestions) to substantiate that the reported gains derive from effective RAG rather than other factors.

    Authors: We agree that a direct quantitative retrieval analysis would better substantiate the source of the gains. In the revised manuscript we have added a new subsection (Section 5.3) reporting top-k recall of passages containing the gold answer on the development sets of Natural Questions, TriviaQA, and WebQuestions. The results show that DPR achieves strong recall (e.g., 85.0% at k=10 for NQ), confirming that relevant knowledge is retrieved for the large majority of queries and that the observed improvements over parametric baselines are attributable to effective retrieval-augmented generation. revision: yes

  2. Referee: [Abstract and Experiments] No error bars, standard deviations, or statistical significance tests are reported for the QA metrics or generation quality scores. Given that the outperformance over parametric seq2seq and retrieve-and-extract baselines is the primary evidence for the framework's value, the absence of these details leaves the robustness of the central empirical claims difficult to assess.

    Authors: We acknowledge the value of reporting variability. However, the computational cost of fine-tuning and evaluating these large models on multiple random seeds is substantial. In the revised version we have added a paragraph in Section 4.2 explicitly noting this limitation and stating that all reported numbers are from single runs, consistent with contemporaneous work on similarly sized models. We also include results from three seeds for the smaller-scale generation-quality human evaluations to provide some indication of stability. The margins over baselines remain large and consistent across tasks, supporting the robustness of the central claims. revision: partial
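
The retrieval analysis promised in response 1 is simple to pin down. A minimal sketch of top-k answer recall, with an illustrative data layout rather than the authors' actual evaluation code:

    def answer_recall_at_k(examples, k):
        """Fraction of questions with a gold answer string inside the top-k passages.

        examples: iterable of (passages, answers) pairs; passages ranked best-first.
        """
        hits = total = 0
        for passages, answers in examples:
            total += 1
            top = [p.lower() for p in passages[:k]]
            if any(ans.lower() in p for ans in answers for p in top):
                hits += 1
        return hits / total if total else 0.0

    # One toy question whose answer shows up in the second retrieved passage.
    data = [(["unrelated passage", "on the origin of species, by charles darwin"],
             ["Charles Darwin"])]
    print(answer_recall_at_k(data, k=2))  # -> 1.0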

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on external benchmarks and independent baselines

full rationale

The paper introduces RAG as a fine-tuning recipe combining a pre-trained seq2seq generator with a fixed pre-trained dense retriever over Wikipedia. All central claims (SOTA on three open-domain QA tasks, outperforming parametric seq2seq and retrieve-and-extract baselines) are measured via standard held-out evaluation on public datasets (Natural Questions, TriviaQA, etc.) against independently published numbers. No equation or result is defined in terms of a fitted parameter that is then re-predicted, no self-citation chain is load-bearing for the performance numbers, and the marginalization formulations (RAG-Sequence, RAG-Token) are directly implemented and evaluated rather than derived from prior self-work by construction. The pre-trained DPR retriever is an external component whose coverage is tested rather than assumed tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of joint fine-tuning of a pre-trained retriever and generator; no new physical or mathematical entities are postulated and the only free parameters are standard training hyper-parameters.

free parameters (2)
  • number of retrieved passages k
    Hyper-parameter controlling how many documents are passed to the generator; chosen during development.
  • standard fine-tuning hyper-parameters
    Learning rate, batch size, and optimizer settings required for any neural training run.
axioms (1)
  • domain assumption: A pre-trained dense retriever (DPR-style) and a pre-trained seq2seq model (BART-style) can be jointly fine-tuned to produce coherent generation conditioned on retrieved text.
    Invoked when the authors describe the fine-tuning recipe for both RAG variants.
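
A minimal sketch of what this axiom amounts to in training terms: the RAG-Sequence objective is a negative marginal log-likelihood through which gradients reach both the generator and the retriever's query encoder, while the document index stays frozen. The tensors below are toy stand-ins for real model outputs:

    import torch

    def rag_sequence_nll(query_vec, doc_vecs, seq_logprobs):
        """Negative marginal log-likelihood of one (x, y) pair.

        query_vec: (d,) output of the trainable query encoder.
        doc_vecs: (k, d) frozen embeddings of the top-k retrieved passages.
        seq_logprobs: (k,) generator log p(y | x, z), one value per passage z.
        """
        retriever_logits = doc_vecs @ query_vec            # inner-product scores
        log_p_z = torch.log_softmax(retriever_logits, dim=0)
        # log p(y|x) = logsumexp_z [ log p(z|x) + log p(y|x,z) ]
        return -torch.logsumexp(log_p_z + seq_logprobs, dim=0)

    d, k = 768, 5
    query_vec = torch.randn(d, requires_grad=True)     # stands in for the query encoder
    doc_vecs = torch.randn(k, d)                       # frozen document embeddings
    seq_logprobs = torch.randn(k, requires_grad=True)  # stands in for BART likelihoods

    loss = rag_sequence_nll(query_vec, doc_vecs, seq_logprobs)
    loss.backward()  # gradients flow to the query encoder and generator only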

pith-pipeline@v0.9.0 · 5604 in / 1416 out tokens · 38900 ms · 2026-05-10T20:37:42.847349+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.PhiForcing phi_equation · relation: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TruthfulQA: Measuring How Models Mimic Human Falsehoods

    cs.CL 2021-09 unverdicted novelty 8.0

    A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  4. A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

    cs.CL 2026-05 unverdicted novelty 7.0

    IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.

  5. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  6. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  7. Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.

  8. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  9. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  10. Training Transformers as a Universal Computer

    cs.AI 2026-04 unverdicted novelty 7.0

    A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

  11. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  12. Similar Users-Augmented Interest Network

    cs.IR 2026-04 unverdicted novelty 7.0

    SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.

  13. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  14. Dr.Sai: An agentic AI for real-world physics analysis at BESIII

    hep-ex 2026-04 unverdicted novelty 7.0

    Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.

  15. Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

    cs.AI 2026-04 unverdicted novelty 7.0

    A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...

  16. From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and eva...

  17. RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    cs.CL 2026-04 unverdicted novelty 7.0

    RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

  18. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  19. TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

    cs.AI 2026-04 conditional novelty 7.0

    TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

  20. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  21. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  22. SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillGraph builds a reusable execution-transition graph prior from LLM trajectories and applies it via hybrid retrieval plus learned reranking to raise tool-sequence quality on ToolBench and API-Bank benchmarks.

  23. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  24. BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

    cs.DL 2026-04 conditional novelty 7.0

    Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

  25. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  26. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

    cs.IR 2026-03 unverdicted novelty 7.0

    Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

  27. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  28. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  29. RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction

    cs.LG 2026-05 unverdicted novelty 6.0

    RareCP improves interval efficiency for time series conformal prediction by retrieving and weighting regime-specific calibration examples while adapting to drift and maintaining coverage.

  30. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  31. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  32. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  33. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  34. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  35. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  36. Agentic AI for Substance Use Education: Integrating Regulatory and Scientific Knowledge Sources

    cs.CL 2026-05 conditional novelty 6.0

    The authors built and expert-evaluated an agentic AI system integrating DEA regulatory data with dynamic scientific literature via RAG to provide accurate, context-sensitive substance use education, with mean Likert r...

  37. Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    STC reduces tabular chunk counts by up to 56% versus baselines and raises hybrid MRR to 0.5945 and BM25 Recall@1 to 0.754 by preserving row structure during chunking.

  38. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  39. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  40. MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration

    cs.HC 2026-04 unverdicted novelty 6.0

    MindTrellis enables users and AI to co-create evolving knowledge graphs, outperforming retrieval-only tools in expert-rated content coverage, structural quality, and reduced cognitive load during a study of 12 partici...

  41. ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

  42. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  43. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

    cs.IR 2026-04 unverdicted novelty 6.0

    CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

  44. No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.

  45. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  46. Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OKH-RAG represents knowledge as ordered hyperedges and retrieves coherent interaction sequences via a learned transition model, outperforming permutation-invariant RAG baselines on order-sensitive QA tasks.

  47. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

    cs.AI 2026-04 unverdicted novelty 6.0

    A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

  48. In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.

  49. MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    MIMIC-Py provides a modular Python framework that turns personality-driven LLM agents into an extensible system for automated game testing via configurable traits, decoupled components, and multiple interaction methods.

  50. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

  51. DQA: Diagnostic Question Answering for IT Support

    cs.CL 2026-04 unverdicted novelty 6.0

    DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns fr...

  52. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  53. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  54. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  55. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  56. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  57. Neural Code Translation of Legacy Code: APL to C#

    cs.SE 2026-05 unverdicted novelty 5.0

    Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.

  58. Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

    cs.CR 2026-05 unverdicted novelty 5.0

    A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.

  59. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 5.0

    Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.

  60. Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

    cs.CL 2026-05 unverdicted novelty 5.0

    A structured practicum guides readers through the complete modern NLP pipeline with reproducible sessions and new linguistic resources for Tajik and Tatar.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 79 Pith papers · 9 internal anchors

  1. [1]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs], November 2016. URL http://arxiv.org/abs/1611.09268. arX...

  2. [2]

    Modeling of the question answering task in the yodaqa system

    Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer, 2015. URL https://link.springer.com/chapter/10.1007%2F978-3-319-24027-5_20

  3. [3]

    Semantic Parsing on Freebase from Question-Answer Pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1160

  4. [5]

    URL https://arxiv.org/abs/2004.07159

  5. [6]

    Reading Wikipedia to Answer Open-Domain Questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://www...

  6. [7]

    Coarse-to-fine question answering for long documents

    Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220, Vancouver, Canada, July 2017. Association for Computational Linguisti...

  7. [8]

    Simple and effective multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and Effective Multi-Paragraph Reading Comprehension. arXiv:1710.10723 [cs], October 2017. URL http://arxiv.org/abs/1710.10723. arXiv: 1710.10723

  8. [9]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapol...

  9. [10]

    Wizard of Wikipedia: Knowledge-powered conversational agents

    Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm

  10. [11]

    SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

    Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs], April 2017. URL http://arxiv.org/abs/1704.05179. arXiv: 1704.05179

  11. [12]

    Hierarchical neural story generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://www.aclweb.org/anthology/P18-1082

  12. [13]

    ELI5: Long form question answering

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1346. URL https://www.aclweb.or...

  13. [14]

    Augmenting transformers with KNN-based composite memory, 2020

    Angela Fan, Claire Gardent, Chloe Braud, and Antoine Bordes. Augmenting transformers with KNN-based composite memory, 2020. URL https://openreview.net/forum?id=H1gx1CNKPH

  14. [16]

    URL https://arxiv.org/abs/2004.07202

  15. [17]

    A knowledge-grounded neural conversation model

    Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. A knowledge-grounded neural conversation model. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16710

  16. [18]

    When will AI exceed human performance? evidence from AI experts

    Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will AI exceed human performance? evidence from AI experts. CoRR, abs/1705.08807, 2017. URL http://arxiv.org/abs/1705.08807

  17. [19]

    Search engine guided neural machine translation

    Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17282

  18. [20]

    Search engine guided neural machine translation

    Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 5133–5140. AAAI Press, 2018.

  19. [21]

    Generating sentences by editing prototypes

    Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018. doi: 10.1162/tacl_a_00030. URL https://www.aclweb.org/anthology/Q18-1031

  21. [23]

    REALM: Retrieval-augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. ArXiv, abs/2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

  22. [24]

    A retrieve-and-edit framework for predicting structured outputs

    Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. A retrieve-and-edit framework for predicting structured outputs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10052–10062. Curran Associates, Inc., 2018. URL http://papers.nip...

  23. [25]

    Simple and effective retrieve-edit-rerank text generation

    Nabil Hossain, Marjan Ghazvininejad, and Luke Zettlemoyer. Simple and effective retrieve-edit-rerank text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2532–2538, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.228. URL https://www.aclweb.org/a...

  24. [26]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017. URL https://arxiv.org/abs/1702.08734

  25. [27]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics....

  26. [28]

    Inferring algorithmic patterns with stack-augmented recurrent nets

    Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, page 190–198, Cambridge, MA, USA, 2015. MIT Press. URL https://papers.nips.cc/paper/5857-inferring-algorithmic-patterns-with-stack-augmen...

  27. [29]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020. URL https://arxiv.org/abs/2004.04906

  28. [30]

    Generalization through memorization: Nearest neighbor language models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH

  29. [31]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980

  30. [32]

    Natural Questions: a Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: a Benchmark for Question Answering Research. Tran...

  31. [33]

    Large memory layers with product keys

    Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Large memory layers with product keys. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8548–8559. Curran Associates, Inc., 2019. URL http://papers....

  32. [34]

    Latent Retrieval for Weakly Supervised Open Domain Question Answering

    Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://www.aclweb.o...

  33. [35]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019. URL https://arxiv.org/abs/1910.13461

  34. [36]

    A Diversity-Promoting Objective Function for Neural Conversation Models

    Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for C...

  35. [37]

    Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons

    Margaret Li, Jason Weston, and Stephen Roller. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. ArXiv, abs/1909.03087, 2019. URL https://arxiv.org/abs/1909.03087

  36. [38]

    Robust neural machine translation with joint textual and phonetic embedding

    Hairong Liu, Mingbo Ma, Liang Huang, Hao Xiong, and Zhongjun He. Robust neural machine translation with joint textual and phonetic embedding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3044–3049, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1291. URL http...

  37. [39]

    Generating wikipedia by summarizing long sequences

    Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-

  38. [40]

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

    Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:824–836, 2016. URL https://arxiv.org/abs/1603.09320

  39. [41]

    The next decade in AI: four steps towards robust artificial intelligence

    Gary Marcus. The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177, 2020. URL https://arxiv.org/abs/2002.06177

  40. [42]

    How decoding strategies affect the verifiability of generated text

    Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587, 2019. URL https://arxiv.org/abs/1911.03587

  41. [43]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In ICLR, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ

  42. [44]

    Towards exploiting background knowledge for building conversation systems

    Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v...

  43. [45]

    Towards a better metric for evaluating question generation systems

    Preksha Nema and Mitesh M. Khapra. Towards a better metric for evaluating question generation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3950–3959, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1429. URL https://www.aclweb.org/anthol...

  44. [46]

    MS MARCO: A human generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d'Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 20...

  45. [47]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019. URL https://arxiv.org/abs/1901.04085

  46. [48]

    fairseq: A fast, extensible toolkit for sequence modeling

    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota, June 2019. Associ...

  47. [49]

    Finding generalizable evidence by learning to convince q&a models

    Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. Finding generalizable evidence by learning to convince q&a models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 2402–2411,...

  48. [50]

    Language models as knowledge bases?

    Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong,...

  49. [51]

    How context affects language models' factual predictions

    Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language models' factual predictions. In Automated Knowledge Base Construction, 2020. URL https://openreview.net/forum?id=025X0zPfn

  50. [52]

    Improving Language Understanding by Generative Pre-Training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training, 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

  51. [53]

    Language models are unsupervised multitask learners, 2019

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_ models_are_unsupervised_multitask_learners.pdf

  52. [54]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. URL https://arxiv.org/abs/1910.10683

  53. [55]

    How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv e-prints, 2020. URL https://arxiv.org/abs/2002.08910

  54. [56]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669. doi: 10.1561/ 1500000019. URL https://doi.org/10.1561/1500000019

  55. [57]

    Release strategies and the social impacts of language models

    Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jian-Bing Wang. Release strategies and the social impacts of language models. ArXiv, abs/1908.09203, 2019

  56. [58]

    End-to-end memory networks

    Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf

  57. [59]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, Jun...

  58. [61]

    URL https://arxiv.org/abs/2004.14366

  59. [62]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017...

  60. [63]

    Diverse beam search for improved description of complex scenes

    Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329

  61. [64]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computationa...

  62. [65]

    SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing...

  63. [66]

    R3: Reinforced ranker-reader for open-domain question answering

    Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovativ...

  64. [67]

    Evidence aggregation for answer re-ranking in open-domain question answering

    Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR, 2018. URL https://openreview.net/forum?id=rJl3yM-Ab

  65. [68]

    Memory Networks

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015. URL http://arxiv.org/abs/1410.3916

  66. [69]

    Retrieve and refine: Improved sequence generation models for dialogue

    Jason Weston, Emily Dinan, and Alexander Miller. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5713. URL h...

  67. [70]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: St...

  68. [71]

    Addressing semantic drift in question generation for semi-supervised question answering

    Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2495–2509, Hong Kong, China, November 2019. A...

  69. [72]

    Reasoning over semantic-level graph for fact checking

    Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. Reasoning over semantic-level graph for fact checking. ArXiv, abs/1909.03745, 2019. URL https://arxiv.org/abs/1909.03745