hub Canonical reference

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi · 2022 · cs.CL · arXiv 2212.10511

Canonical reference. 71% of citing Pith papers cite this work as background.

29 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 29 citing papers arXiv PDF

abstract

Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the long tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 2

citation-polarity summary

background 5 use dataset 2

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

cs.IR · 2026-04-01 · unverdicted · novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

cs.LG · 2026-02-19 · unverdicted · novelty 7.0 · 2 refs

CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

Policy and World Modeling Co-Training for Language Agents

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.

Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.

Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

Orthogonalizing task and document subspaces in LoRA-based PRAG improves compositional robustness when merging multiple document adapters.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders

cs.IR · 2026-04-09 · unverdicted · novelty 6.0

KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

cs.CL · 2025-10-17 · unverdicted · novelty 6.0 · 2 refs

EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

cs.CL · 2025-10-01 · unverdicted · novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

cs.LG · 2025-07-28 · unverdicted · novelty 6.0

In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds when enough latent templates are triggered, with a failure regime explaining inauds

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

cs.CL · 2025-05-07 · unverdicted · novelty 6.0 · 2 refs

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.

In-depth Analysis of Graph-based RAG in a Unified Framework

cs.IR · 2025-03-06 · unverdicted · novelty 6.0

A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.

ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

cs.IR · 2025-02-14 · unverdicted · novelty 6.0

ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

cs.CL · 2024-01-31 · unverdicted · novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

REPLUG: Retrieval-Augmented Black-Box Language Models

cs.CL · 2023-01-30 · conditional · novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.

Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning

cs.AI · 2026-06-25 · unverdicted · novelty 5.0

A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

cs.AI · 2026-05-18 · conditional · novelty 5.0

QCEA reformulates entity alignment as a query-conditioned ranking task with semantic encoding, graph learning, and direction-aware transformation to handle context-dependent, asymmetric correspondences in medical knowledge graphs.

citing papers explorer

Showing 29 of 29 citing papers.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 58 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems cs.IR · 2026-04-01 · unverdicted · none · ref 29 · internal anchor
Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training cs.LG · 2026-02-19 · unverdicted · none · ref 32 · 2 links · internal anchor
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning cs.CL · 2025-11-04 · unverdicted · none · ref 15 · internal anchor
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 120 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Group-in-Group Policy Optimization for LLM Agent Training cs.LG · 2025-05-16 · unverdicted · none · ref 64 · internal anchor
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
Policy and World Modeling Co-Training for Language Agents cs.LG · 2026-06-01 · unverdicted · none · ref 51 · internal anchor
PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency cs.CL · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm cs.CL · 2026-05-11 · unverdicted · none · ref 47 · internal anchor
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
Priming: Hybrid State Space Models From Pre-trained Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 56 · internal anchor
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.
Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation cs.CL · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation cs.CL · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
Orthogonalizing task and document subspaces in LoRA-based PRAG improves compositional robustness when merging multiple document adapters.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders cs.IR · 2026-04-09 · unverdicted · none · ref 41 · internal anchor
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle cs.CL · 2025-10-17 · unverdicted · none · ref 32 · 2 links · internal anchor
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL · 2025-10-01 · unverdicted · none · ref 14 · internal anchor
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
Provable Knowledge Acquisition and Extraction in One-Layer Transformers cs.LG · 2025-07-28 · unverdicted · none · ref 25 · internal anchor
In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds when enough latent templates are triggered, with a failure regime explaining inauds
ZeroSearch: Incentivize the Search Capability of LLMs without Searching cs.CL · 2025-05-07 · unverdicted · none · ref 26 · 2 links · internal anchor
ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.
In-depth Analysis of Graph-based RAG in a Unified Framework cs.IR · 2025-03-06 · unverdicted · none · ref 58 · internal anchor
A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.
ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation cs.IR · 2025-02-14 · unverdicted · none · ref 38 · internal anchor
ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval cs.CL · 2024-01-31 · unverdicted · none · ref 134 · internal anchor
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
REPLUG: Retrieval-Augmented Black-Box Language Models cs.CL · 2023-01-30 · conditional · none · ref 72 · internal anchor
REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning cs.AI · 2026-06-25 · unverdicted · none · ref 47 · internal anchor
A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.
Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning cs.AI · 2026-05-18 · conditional · none · ref 20 · internal anchor
QCEA reformulates entity alignment as a query-conditioned ranking task with semantic encoding, graph learning, and direction-aware transformation to handle context-dependent, asymmetric correspondences in medical knowledge graphs.
AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering cs.CL · 2026-05-17 · unverdicted · none · ref 6 · internal anchor
AMATA is an adaptive multi-agent trajectory alignment system that improves factual consistency in knowledge-intensive QA via intra-trajectory preference learning and inter-agent dependency optimization.
Attributing Culture-Conditioned Generations to Pretraining Corpora cs.CL · 2024-12-30 · unverdicted · none · ref 13 · internal anchor
MEMOed framework attributes LLM generations about cultures to pretraining memorization and finds frequency-based biases across 110 cultures for food and clothing.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 56 · internal anchor
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping cs.LG · 2025-10-29 · unverdicted · none · ref 25 · internal anchor
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 116 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer