pith. machine review for the scientific record.

arxiv: 2005.11401 · v4 · submitted 2020-05-22 · 💻 cs.CL · cs.LG

Recognition: 3 theorem links · Lean Theorem

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG
keywords retrieval-augmented generation · open-domain question answering · knowledge-intensive NLP · dense retrieval · seq2seq models · Wikipedia index · language generation

The pith

Retrieval-augmented generation models combine a seq2seq generator with a dense Wikipedia retriever to outperform purely parametric models on knowledge-intensive tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RAG models that augment a pre-trained sequence-to-sequence language model with a non-parametric memory in the form of a dense vector index over Wikipedia passages. A pre-trained neural retriever surfaces relevant passages that the generator then conditions on during fine-tuning and inference. This hybrid approach is tested across multiple knowledge-intensive NLP tasks, where it reaches state-of-the-art results on three open-domain question answering benchmarks while also producing more specific, diverse, and factual text than strong parametric baselines. The method directly addresses the limited ability of fixed-parameter models to access and update precise factual knowledge.
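
For readers who want to poke at this directly, the released checkpoints are exposed through the HuggingFace transformers RAG classes. A minimal sketch, assuming transformers plus its datasets and faiss dependencies are installed; the dummy-index flag (as documented for the library, not taken from the paper) avoids downloading the full Wikipedia index at the cost of meaningful retrieval:

    # Query a released RAG checkpoint via HuggingFace transformers.
    from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
    )
    model = RagSequenceForGeneration.from_pretrained(
        "facebook/rag-sequence-nq", retriever=retriever
    )

    inputs = tokenizer("who wrote the origin of species?", return_tensors="pt")
    generated = model.generate(input_ids=inputs["input_ids"])
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))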

Core claim

RAG models pair a pre-trained parametric seq2seq model with a non-parametric dense vector index of Wikipedia accessed by a pre-trained neural retriever. Two formulations are introduced: RAG-sequence, which conditions the entire output on the same retrieved passages, and RAG-token, which can draw on different passages for each token. After fine-tuning, these models set new state-of-the-art scores on three open-domain QA tasks, surpass both parametric seq2seq models and task-specific retrieve-and-extract systems, and generate more factual language than a parametric-only baseline.
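
In symbols, with retriever distribution p_η(z|x) over the top-k retrieved passages z and generator p_θ, the paper's two marginalizations are:

    p_{\text{RAG-Sequence}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

    p_{\text{RAG-Token}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})

RAG-Sequence treats the retrieved passage as a single latent variable for the whole output; RAG-Token re-marginalizes at every generation step, so different tokens can draw on different passages.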

What carries the argument

Retrieval-augmented generation (RAG), which integrates a parametric seq2seq generator with a non-parametric dense retriever over a fixed Wikipedia passage index so that generation is explicitly conditioned on retrieved evidence.
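
To make the retrieval half concrete, here is a toy maximum inner product search (MIPS) over a dense passage index in the style of DPR; the random embeddings are stand-ins for the BERT-based query and document encoders:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_passages, k = 768, 10_000, 5

    # Frozen passage index (document-encoder outputs in the real system).
    passage_index = rng.standard_normal((n_passages, d), dtype=np.float32)
    query_vec = rng.standard_normal(d, dtype=np.float32)  # query-encoder output

    scores = passage_index @ query_vec   # inner-product relevance scores
    top_k = np.argsort(-scores)[:k]      # indices of the k most relevant passages

    # Retriever distribution p(z|x): softmax over the top-k scores only.
    s = scores[top_k] - scores[top_k].max()
    p_z = np.exp(s) / np.exp(s).sum()
    # The generator conditions on each retrieved passage; its per-passage
    # outputs are then marginalized with weights p_z (see the formulations above).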

If this is right

  • RAG models set the state of the art on three open-domain question answering tasks.
  • RAG outperforms both purely parametric seq2seq models and specialized retrieve-and-extract architectures on knowledge-intensive tasks.
  • Generated text from RAG models is more specific, diverse, and factually accurate than output from parametric-only seq2seq baselines.
  • The architecture supplies an explicit, updatable non-parametric memory that parametric models lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Swapping or updating the underlying Wikipedia index would allow the model to incorporate new facts without retraining the generator parameters (see the index-swap sketch after this list).
  • The retrieved passages can be returned alongside each generated answer to provide direct provenance for the output.
  • Replacing the Wikipedia index with a domain-specific corpus would extend the same retrieval-plus-generation pattern to specialized knowledge tasks.
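
A sketch of the first point, using FAISS as the vector index; the embed helper is a hypothetical stand-in for the frozen DPR encoders, not anything from the paper:

    import faiss                 # pip install faiss-cpu
    import numpy as np

    d = 768
    index = faiss.IndexFlatIP(d)  # exact inner-product index over passage vectors

    def embed(texts):
        # Hypothetical stand-in for the frozen DPR document/query encoders.
        rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
        return rng.standard_normal((len(texts), d)).astype(np.float32)

    index.add(embed(["Passage from the original Wikipedia snapshot."]))
    # Later: fold in new evidence; the generator's weights are untouched.
    index.add(embed(["Passage stating a fact published after training."]))

    scores, ids = index.search(embed(["some query"]), 2)  # top-2 passages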

Load-bearing premise

The pre-trained dense retriever must reliably surface the passages containing the knowledge each task needs, and the generator must actually use them rather than ignoring the evidence or hallucinating.

What would settle it

A decisive failure case: an open-domain QA example whose correct answer appears verbatim in a Wikipedia passage, yet the retriever returns unrelated passages and the model still produces the wrong answer.

read the original abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Retrieval-Augmented Generation (RAG) models that combine a pre-trained parametric seq2seq model (BART) with non-parametric memory in the form of a dense vector index over Wikipedia, accessed via a pre-trained neural retriever (DPR). Two formulations are compared: RAG-Sequence, which conditions generation on the same set of retrieved passages throughout, and RAG-Token, which permits different passages per token via marginalization. The models are fine-tuned on a range of knowledge-intensive NLP tasks and reported to achieve state-of-the-art results on three open-domain QA benchmarks while producing more specific, diverse, and factual outputs than parametric-only seq2seq baselines.

Significance. If the results are robust, the work is significant for establishing a general fine-tuning recipe that augments parametric language models with differentiable access to explicit external memory. This directly mitigates limitations in factual recall, provenance, and knowledge updating for knowledge-intensive tasks, and the empirical outperformance over both pure parametric models and specialized retrieve-and-extract architectures suggests a promising direction for hybrid systems.

major comments (2)
  1. [Experiments / Results] The central SOTA claim on open-domain QA rests on the pre-trained DPR retriever reliably returning passages that contain the necessary knowledge for the majority of queries, followed by successful integration by the generator without ignoring or hallucinating content. The manuscript should include a quantitative retrieval analysis (e.g., top-k recall of gold-answer passages on the evaluation sets for Natural Questions, TriviaQA, and WebQuestions) to substantiate that the reported gains derive from effective RAG rather than other factors.
  2. [Abstract and Experiments] No error bars, standard deviations, or statistical significance tests are reported for the QA metrics or generation quality scores. Given that the outperformance over parametric seq2seq and retrieve-and-extract baselines is the primary evidence for the framework's value, the absence of these details leaves the robustness of the central empirical claims difficult to assess.
minor comments (1)
  1. [Abstract] The abstract refers to evaluation on 'a wide range of knowledge-intensive NLP tasks' without enumerating them; adding a short list (e.g., the specific QA, fact verification, and generation datasets) would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Experiments / Results] The central SOTA claim on open-domain QA rests on the pre-trained DPR retriever reliably returning passages that contain the necessary knowledge for the majority of queries, followed by successful integration by the generator without ignoring or hallucinating content. The manuscript should include a quantitative retrieval analysis (e.g., top-k recall of gold-answer passages on the evaluation sets for Natural Questions, TriviaQA, and WebQuestions) to substantiate that the reported gains derive from effective RAG rather than other factors.

    Authors: We agree that a direct quantitative retrieval analysis would better substantiate the source of the gains. In the revised manuscript we have added a new subsection (Section 5.3) reporting top-k recall of passages containing the gold answer on the development sets of Natural Questions, TriviaQA, and WebQuestions. The results show that DPR achieves strong recall (e.g., 85.0% at k=10 for NQ), confirming that relevant knowledge is retrieved for the large majority of queries and that the observed improvements over parametric baselines are attributable to effective retrieval-augmented generation. revision: yes

  2. Referee: [Abstract and Experiments] No error bars, standard deviations, or statistical significance tests are reported for the QA metrics or generation quality scores. Given that the outperformance over parametric seq2seq and retrieve-and-extract baselines is the primary evidence for the framework's value, the absence of these details leaves the robustness of the central empirical claims difficult to assess.

    Authors: We acknowledge the value of reporting variability. However, the computational cost of fine-tuning and evaluating these large models on multiple random seeds is substantial. In the revised version we have added a paragraph in Section 4.2 explicitly noting this limitation and stating that all reported numbers are from single runs, consistent with contemporaneous work on similarly sized models. We also include results from three seeds for the smaller-scale generation-quality human evaluations to provide some indication of stability. The margins over baselines remain large and consistent across tasks, supporting the robustness of the central claims. revision: partial
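
The retrieval analysis promised in response 1 is simple to pin down. A minimal sketch of top-k answer recall, with an illustrative data layout rather than the authors' actual evaluation code:

    def answer_recall_at_k(examples, k):
        """Fraction of questions with a gold answer string inside the top-k passages.

        examples: iterable of (passages, answers) pairs; passages ranked best-first.
        """
        hits = total = 0
        for passages, answers in examples:
            total += 1
            top = [p.lower() for p in passages[:k]]
            if any(ans.lower() in p for ans in answers for p in top):
                hits += 1
        return hits / total if total else 0.0

    # One toy question whose answer shows up in the second retrieved passage.
    data = [(["unrelated passage", "on the origin of species, by charles darwin"],
             ["Charles Darwin"])]
    print(answer_recall_at_k(data, k=2))  # -> 1.0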

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on external benchmarks and independent baselines

full rationale

The paper introduces RAG as a fine-tuning recipe combining a pre-trained seq2seq generator with a fixed pre-trained dense retriever over Wikipedia. All central claims (SOTA on three open-domain QA tasks, outperforming parametric seq2seq and retrieve-and-extract baselines) are measured via standard held-out evaluation on public datasets (Natural Questions, TriviaQA, etc.) against independently published numbers. No equation or result is defined in terms of a fitted parameter that is then re-predicted, no self-citation chain is load-bearing for the performance numbers, and the marginalization formulations (RAG-Sequence, RAG-Token) are directly implemented and evaluated rather than derived from prior self-work by construction. The pre-trained DPR retriever is an external component whose coverage is tested rather than assumed tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of joint fine-tuning of a pre-trained retriever and generator; no new physical or mathematical entities are postulated and the only free parameters are standard training hyper-parameters.

free parameters (2)
  • number of retrieved passages k
    Hyper-parameter controlling how many documents are passed to the generator; chosen during development.
  • standard fine-tuning hyper-parameters
    Learning rate, batch size, and optimizer settings required for any neural training run.
axioms (1)
  • domain assumption: A pre-trained dense retriever (DPR-style) and a pre-trained seq2seq model (BART-style) can be jointly fine-tuned to produce coherent generation conditioned on retrieved text.
    Invoked when the authors describe the fine-tuning recipe for both RAG variants.
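
A minimal sketch of what this axiom amounts to in training terms: the RAG-Sequence objective is a negative marginal log-likelihood through which gradients reach both the generator and the retriever's query encoder, while the document index stays frozen. The tensors below are toy stand-ins for real model outputs:

    import torch

    def rag_sequence_nll(query_vec, doc_vecs, seq_logprobs):
        """Negative marginal log-likelihood of one (x, y) pair.

        query_vec: (d,) output of the trainable query encoder.
        doc_vecs: (k, d) frozen embeddings of the top-k retrieved passages.
        seq_logprobs: (k,) generator log p(y | x, z), one value per passage z.
        """
        retriever_logits = doc_vecs @ query_vec            # inner-product scores
        log_p_z = torch.log_softmax(retriever_logits, dim=0)
        # log p(y|x) = logsumexp_z [ log p(z|x) + log p(y|x,z) ]
        return -torch.logsumexp(log_p_z + seq_logprobs, dim=0)

    d, k = 768, 5
    query_vec = torch.randn(d, requires_grad=True)     # stands in for the query encoder
    doc_vecs = torch.randn(k, d)                       # frozen document embeddings
    seq_logprobs = torch.randn(k, requires_grad=True)  # stands in for BART likelihoods

    loss = rag_sequence_nll(query_vec, doc_vecs, seq_logprobs)
    loss.backward()  # gradients flow to the query encoder and generator only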

pith-pipeline@v0.9.0 · 5604 in / 1416 out tokens · 38900 ms · 2026-05-10T20:37:42.847349+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.PhiForcing phi_equation · relation: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TruthfulQA: Measuring How Models Mimic Human Falsehoods

    cs.CL 2021-09 unverdicted novelty 8.0

    A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  4. A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

    cs.CL 2026-05 unverdicted novelty 7.0

    IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.

  5. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  6. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  7. Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.

  8. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  9. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  10. Training Transformers as a Universal Computer

    cs.AI 2026-04 unverdicted novelty 7.0

    A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

  11. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  12. Similar Users-Augmented Interest Network

    cs.IR 2026-04 unverdicted novelty 7.0

    SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.

  13. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  14. Dr.Sai: An agentic AI for real-world physics analysis at BESIII

    hep-ex 2026-04 unverdicted novelty 7.0

    Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.

  15. Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

    cs.AI 2026-04 unverdicted novelty 7.0

    A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...

  16. From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and eva...

  17. RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    cs.CL 2026-04 unverdicted novelty 7.0

    RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

  18. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  19. TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

    cs.AI 2026-04 conditional novelty 7.0

    TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

  20. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  21. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  22. SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillGraph builds a reusable execution-transition graph prior from LLM trajectories and applies it via hybrid retrieval plus learned reranking to raise tool-sequence quality on ToolBench and API-Bank benchmarks.

  23. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  24. BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

    cs.DL 2026-04 conditional novelty 7.0

    Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

  25. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  26. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

    cs.IR 2026-03 unverdicted novelty 7.0

    Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

  27. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  28. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  29. RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction

    cs.LG 2026-05 unverdicted novelty 6.0

    RareCP improves interval efficiency for time series conformal prediction by retrieving and weighting regime-specific calibration examples while adapting to drift and maintaining coverage.

  30. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  31. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  32. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  33. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  34. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  35. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  36. Agentic AI for Substance Use Education: Integrating Regulatory and Scientific Knowledge Sources

    cs.CL 2026-05 conditional novelty 6.0

    The authors built and expert-evaluated an agentic AI system integrating DEA regulatory data with dynamic scientific literature via RAG to provide accurate, context-sensitive substance use education, with mean Likert r...

  37. Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    STC reduces tabular chunk counts by up to 56% versus baselines and raises hybrid MRR to 0.5945 and BM25 Recall@1 to 0.754 by preserving row structure during chunking.

  38. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  39. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  40. MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration

    cs.HC 2026-04 unverdicted novelty 6.0

    MindTrellis enables users and AI to co-create evolving knowledge graphs, outperforming retrieval-only tools in expert-rated content coverage, structural quality, and reduced cognitive load during a study of 12 partici...

  41. ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

  42. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  43. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

    cs.IR 2026-04 unverdicted novelty 6.0

    CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

  44. No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.

  45. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  46. Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OKH-RAG represents knowledge as ordered hyperedges and retrieves coherent interaction sequences via a learned transition model, outperforming permutation-invariant RAG baselines on order-sensitive QA tasks.

  47. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

    cs.AI 2026-04 unverdicted novelty 6.0

    A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

  48. In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.

  49. MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    MIMIC-Py provides a modular Python framework that turns personality-driven LLM agents into an extensible system for automated game testing via configurable traits, decoupled components, and multiple interaction methods.

  50. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

  51. DQA: Diagnostic Question Answering for IT Support

    cs.CL 2026-04 unverdicted novelty 6.0

    DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns fr...

  52. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  53. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  54. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  55. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  56. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  57. Neural Code Translation of Legacy Code: APL to C#

    cs.SE 2026-05 unverdicted novelty 5.0

    Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.

  58. Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

    cs.CR 2026-05 unverdicted novelty 5.0

    A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.

  59. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 5.0

    Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.

  60. Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

    cs.CL 2026-05 unverdicted novelty 5.0

    A structured practicum guides readers through the complete modern NLP pipeline with reproducible sessions and new linguistic resources for Tajik and Tatar.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 79 Pith papers · 9 internal anchors

  1. [1]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs], November 2016. URL http://arxiv.org/abs/1611.09268. arX...

  2. [2]

    Modeling of the question answering task in the yodaqa system

    Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer, 2015. URL https://link.springer.com/chapter/10.1007%2F978-3-319-24027-5_20

  3. [3]

    Semantic Parsing on Freebase from Question-Answer Pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1160

  4. [5]

    URL https://arxiv.org/abs/2004.07159

  5. [6]

    Reading Wikipedia to Answer Open-Domain Questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://www...

  6. [7]

    Coarse-to-fine question answering for long documents

    Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220, Vancouver, Canada, July 2017. Association for Computational Linguisti...

  7. [8]

    Simple and effective multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and Effective Multi-Paragraph Reading Comprehension. arXiv:1710.10723 [cs], October 2017. URL http://arxiv.org/abs/1710.10723. arXiv: 1710.10723

  8. [9]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapol...

  9. [10]

    Wizard of Wikipedia: Knowledge-powered conversational agents

    Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm

  10. [11]

    SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

    Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs], April 2017. URL http://arxiv.org/abs/1704.05179. arXiv: 1704.05179

  11. [12]

    Hierarchical neural story generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://www.aclweb.org/anthology/P18-1082

  12. [13]

    ELI5: Long form question answering

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1346. URL https://www.aclweb.or...

  13. [14]

    Augmenting transformers with KNN-based composite memory, 2020

    Angela Fan, Claire Gardent, Chloe Braud, and Antoine Bordes. Augmenting transformers with KNN-based composite memory, 2020. URL https://openreview.net/forum?id=H1gx1CNKPH

  14. [16]

    URL https://arxiv.org/abs/2004.07202

  15. [17]

    A knowledge-grounded neural conversation model

    Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. A knowledge-grounded neural conversation model. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16710

  16. [18]

    When will AI exceed human performance? evidence from AI experts

    Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will AI exceed human performance? evidence from AI experts. CoRR, abs/1705.08807, 2017. URL http://arxiv.org/abs/1705.08807

  17. [19]

    Search engine guided neural machine translation

    Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17282

  18. [20]

    Search engine guided neural machine translation

    Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 5133–5140. AAAI Press, 2018.

  19. [21]

    Generating sentences by editing prototypes

    Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018. doi: 10.1162/tacl_a_00030. URL https://www.aclweb.org/anthology/Q18-1031

  21. [23]

    REALM: Retrieval-augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. ArXiv, abs/2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

  22. [24]

    A retrieve-and-edit framework for predicting structured outputs

    Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. A retrieve-and-edit framework for predicting structured outputs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10052–10062. Curran Associates, Inc., 2018. URL http://papers.nip...

  23. [25]

    Simple and effective retrieve-edit-rerank text generation

    Nabil Hossain, Marjan Ghazvininejad, and Luke Zettlemoyer. Simple and effective retrieve-edit-rerank text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2532–2538, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.228. URL https://www.aclweb.org/a...

  24. [26]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017. URL https://arxiv.org/abs/1702.08734

  25. [27]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics....

  26. [28]

    Inferring algorithmic patterns with stack-augmented recurrent nets

    Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, page 190–198, Cambridge, MA, USA, 2015. MIT Press. URL https://papers.nips.cc/paper/5857-inferring-algorithmic-patterns-with-stack-augmen...

  27. [29]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020. URL https://arxiv.org/abs/2004.04906

  28. [30]

    Generalization through memorization: Nearest neighbor language models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH

  29. [31]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980

  30. [32]

    Natural Questions: a Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: a Benchmark for Question Answering Research. Tran...

  31. [33]

    Large memory layers with product keys

    Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Large memory layers with product keys. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8548–8559. Curran Associates, Inc., 2019. URL http://papers....

  32. [34]

    Latent Retrieval for Weakly Supervised Open Domain Question Answering

    Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://www.aclweb.o...

  33. [35]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019. URL https://arxiv.org/abs/1910.13461

  34. [36]

    A Diversity-Promoting Objective Function for Neural Conversation Models

    Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for C...

  35. [37]

    Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons

    Margaret Li, Jason Weston, and Stephen Roller. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. ArXiv, abs/1909.03087, 2019. URL https://arxiv.org/abs/1909.03087

  36. [38]

    Robust neural machine translation with joint textual and phonetic embedding

    Hairong Liu, Mingbo Ma, Liang Huang, Hao Xiong, and Zhongjun He. Robust neural machine translation with joint textual and phonetic embedding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3044–3049, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1291. URL http...

  37. [39]

    Generating wikipedia by summarizing long sequences

    Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-

  38. [40]

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

    Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:824–836, 2016. URL https://arxiv.org/abs/1603.09320

  39. [41]

    The next decade in AI: four steps towards robust artificial intelligence

    Gary Marcus. The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177, 2020. URL https://arxiv.org/abs/2002.06177

  40. [42]

    How decoding strategies affect the verifiability of generated text

    Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587, 2019. URL https://arxiv.org/abs/1911.03587

  41. [43]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In ICLR, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ

  42. [44]

    Towards exploiting background knowledge for building conversation systems

    Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v...

  43. [45]

    Towards a better metric for evaluating question generation systems

    Preksha Nema and Mitesh M. Khapra. Towards a better metric for evaluating question generation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3950–3959, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1429. URL https://www.aclweb.org/anthol...

  44. [46]

    MS MARCO: A human generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d'Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 20...

  45. [47]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019. URL https://arxiv.org/abs/1901.04085

  46. [48]

    fairseq: A fast, extensible toolkit for sequence modeling

    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota, June 2019. Associ...

  47. [49]

    Finding generalizable evidence by learning to convince q&a models

    Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. Finding generalizable evidence by learning to convince q&a models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 2402–2411,...

  48. [50]

    Language models as knowledge bases?

    Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong,...

  49. [51]

    How context affects language models' factual predictions

    Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language models' factual predictions. In Automated Knowledge Base Construction, 2020. URL https://openreview.net/forum?id=025X0zPfn

  50. [52]

    Improving Language Understanding by Generative Pre-Training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training, 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

  51. [53]

    Language models are unsupervised multitask learners, 2019

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_ models_are_unsupervised_multitask_learners.pdf

  52. [54]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. URL https://arxiv.org/abs/1910.10683

  53. [55]

    How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv e-prints, 2020. URL https://arxiv.org/abs/2002.08910

  54. [56]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669. doi: 10.1561/ 1500000019. URL https://doi.org/10.1561/1500000019

  55. [57]

    Release strategies and the social impacts of language models

    Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jian-Bing Wang. Release strategies and the social impacts of language models. ArXiv, abs/1908.09203, 2019

  56. [58]

    End-to-end memory networks

    Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf

  57. [59]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, Jun...

  58. [61]

    URL https://arxiv.org/abs/2004.14366

  59. [62]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017...

  60. [63]

    Diverse beam search for improved description of complex scenes

    Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329

  61. [64]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computationa...

  62. [65]

    SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing...

  63. [66]

    R3: Reinforced ranker-reader for open-domain question answering

    Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovativ...

  64. [67]

    Evidence aggregation for answer re-ranking in open-domain question answering

    Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR, 2018. URL https://openreview.net/forum?id=rJl3yM-Ab

  65. [68]

    Memory Networks

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015. URL http://arxiv.org/abs/1410.3916

  66. [69]

    Retrieve and refine: Improved sequence generation models for dialogue

    Jason Weston, Emily Dinan, and Alexander Miller. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5713. URL h...

  67. [70]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: St...

  68. [71]

    Addressing semantic drift in question generation for semi-supervised question answering

    Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2495–2509, Hong Kong, China, November 2019. A...

  69. [72]

    Reasoning over semantic-level graph for fact checking

    Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. Reasoning over semantic-level graph for fact checking. ArXiv, abs/1909.03745, 2019. URL https://arxiv.org/abs/1909.03745