KARLA: Knowledge-base Augmented Retrieval for Language Models

Fabian M. Suchanek (IP Paris; Francois Crespin; LTCI); Nils Holzenberger

arxiv: 2606.26807 · v1 · pith:IODTOOZInew · submitted 2026-06-25 · 💻 cs.AI · cs.CL

KARLA: Knowledge-base Augmented Retrieval for Language Models

Francois Crespin , Fabian M. Suchanek (IP Paris , LTCI) , Nils Holzenberger This is my paper

Pith reviewed 2026-06-26 05:22 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords knowledge baselanguage modelsfactual groundingretrievalspecial tokensknowledge editingLLM augmentation

0 comments

The pith

LLMs can learn to emit special tokens that query an external knowledge base mid-generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training method so that language models produce designated tokens at chosen points in their output. These tokens cause the system to fetch relevant facts from a separate knowledge base and insert them into the generated text. The approach is meant to let factual content be refreshed by editing the knowledge base alone, without changing model weights. It is also intended to make the origin of each fact traceable and to let smaller models reach the factual performance of larger ones. Experiments reported in the paper test the method on both short factual answers and longer generated passages.

Core claim

Training a language model to insert special tokens that trigger queries to a knowledge base during token generation allows factual knowledge to be supplied externally, revised by editing the base rather than retraining, and traced back to its source.

What carries the argument

Special tokens that the model is trained to emit in order to pause generation and retrieve facts from the knowledge base.

If this is right

Factual content in model output can be corrected by editing the knowledge base instead of retraining parameters.
Each fact in the output can be linked back to its entry in the knowledge base for verification.
Smaller models reach factual accuracy levels previously seen only in larger models.
The same mechanism works for both short factual responses and longer generated passages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed systems could receive live fact corrections without downtime for model updates.
The method might be combined with existing retrieval pipelines to handle cases where the knowledge base itself changes frequently.
Traceability could support auditing requirements in regulated domains such as healthcare or finance.

Load-bearing premise

An LLM can be trained to emit the right special tokens at the right moments without lowering the overall quality of its text or requiring large amounts of extra supervision.

What would settle it

A controlled test in which the trained model either omits the special tokens on factual prompts or inserts them at wrong positions, resulting in no gain or a drop in measured factual accuracy compared with the unmodified baseline.

Figures

Figures reproduced from arXiv: 2606.26807 by Fabian M. Suchanek (IP Paris, Francois Crespin, LTCI), Nils Holzenberger.

**Figure 1.** Figure 1: KARLA in action. Retrieval-augmented generation (RAG) methods (Lewis et al., 2020; Guu et al., 2020; Borgeaud et al., 2022) and tool-use approaches (Yao et al., 2022; Schick et al., 2023) allow pulling in factual information from external textual resources or tools. These methods improve factual grounding and make some knowledge updates possible through changes to the retrieval corpus or tool backend. H… view at source ↗

**Figure 2.** Figure 2: Per-relation sample counts, sorted in ascend [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Parametric fine-tuning update curve. Baseline [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Inline-query exact-match accuracy (%) on the test corpus across models sizes and LoRA configurations. Prediction requires both the subject and relation to be correct. 10 50 100 500 1k T target frequency 0 20 40 60 80 100 120 140 Eval Accuracy (%) Qwen3-0.6B (PRIMEKG) Qwen3-0.6B (YAGO) Qwen3-1.7B (PRIMEKG) Qwen3-1.7B (YAGO) Qwen3-4B (PRIMEKG) Qwen3-4B (YAGO) Qwen3-8B (PRIMEKG) Qwen3-8B (YAGO) [PITH_FULL_I… view at source ↗

**Figure 5.** Figure 5: Inline-query Exact-match accuracy on the test [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Inline-query exact-match accuracy (%) for the special-token (solid) and free-form (dashed) predicate representations across four Qwen3 model sizes. All runs use LoRA with rank r = 64 and α = 128. ⟨KB_query⟩ r ⟨/KB_query⟩⟨subj⟩ s ⟨/subj⟩ block. If r does not match any predicate in R, the executor returns ⟨KB_FAIL⟩ and the model falls back to parametric generation. In-distribution inline-query accuracy [PIT… view at source ↗

read the original abstract

We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1)~factual knowledge in the LLM output can be updated without retraining the LLM, (2)~facts in the LLM output can be traced to the knowledge base for transparency and explainability, and (3)~smaller models can achieve the same factual accuracy as larger models. Our core idea is to train the model to produce special tokens that trigger a query to the knowledge base. Our experiments show that our method improves factual grounding in both short and long-form generation, and allows factual revisions to take effect through KB edits rather than parameter updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The special-token training for mid-generation KB lookups is the actual novelty, but the abstract's lack of metrics or supervision details leaves the claims untested.

read the letter

The main point is a training setup where the LLM learns to emit special tokens during autoregressive generation to trigger KB queries on the fly. This targets updatability without retraining, traceability to the source, and better fact accuracy from smaller models.

What stands out as new is the focus on teaching the model an implicit policy for when to insert those tokens, rather than bolting on a separate retriever. If the supervision turns out to be straightforward and the model stays fluent, it could simplify some RAG pipelines.

The paper states that experiments show gains in factual grounding for both short and long outputs, plus easy revisions via KB edits. That direction makes sense for knowledge-intensive uses. However, the abstract supplies no numbers, baselines, datasets, or ablations, and the stress-test concern holds: there is no description of how positive examples for token placement are created, whether an auxiliary loss is used, or how over- or under-querying is avoided. Without those pieces the central mechanism remains unverified.

The approach relies on standard training plus an external KB, with no circularity or self-referential fitting. The citation pattern is not an issue here since the claim is presented as distinct.

This is for RAG researchers who care about dynamic knowledge. A reader gets value only if the full experiments are solid and the supervision cost is low. It deserves a serious referee because the idea is coherent and the goals are practical, even though the current evidence is thin and the weakest assumption about reliable token emission needs checking.

Referee Report

2 major / 0 minor

Summary. The paper proposes KARLA, a method to augment LLMs with external knowledge bases by training the model to emit special tokens during generation that trigger KB queries. This is claimed to enable (1) updating factual knowledge via KB edits without retraining, (2) tracing facts to the KB for explainability, and (3) smaller models matching larger ones on factual accuracy. Experiments are asserted to show improved factual grounding in short- and long-form generation and effective revisions through KB changes rather than parameter updates.

Significance. If the core mechanism can be shown to work reliably, the approach would allow dynamic, traceable knowledge integration in LLMs without the costs of retraining or the opacity of parametric knowledge, with potential benefits for smaller models and maintenance of factual correctness over time.

major comments (2)

[Abstract] Abstract: The abstract asserts experimental improvements in factual grounding and successful KB-based revisions but supplies no metrics, baselines, dataset details, ablation results, or description of how supervision for special-token emission is generated. This absence makes it impossible to evaluate whether the data support the central claims about reliable token emission without degrading fluency or requiring impractical supervision.
[Abstract] The weakest assumption—that an LLM can be trained to emit the correct special tokens at contextually appropriate moments without auxiliary losses, dense supervision, or output degradation—is load-bearing for all three claimed benefits, yet no information is provided on the training objective, positive/negative example construction, or controls against over-/under-querying.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We agree that it would benefit from greater specificity regarding metrics, datasets, and the training procedure for special-token emission. We will revise the abstract in the next version to address these points directly while preserving its brevity. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract asserts experimental improvements in factual grounding and successful KB-based revisions but supplies no metrics, baselines, dataset details, ablation results, or description of how supervision for special-token emission is generated. This absence makes it impossible to evaluate whether the data support the central claims about reliable token emission without degrading fluency or requiring impractical supervision.

Authors: We accept this criticism of the abstract. The revised abstract will report key quantitative results (e.g., relative gains in factual accuracy on short- and long-form tasks), name the primary evaluation datasets, reference the main baselines, and include a one-sentence description of supervision: special-token targets are derived automatically by aligning factual spans in the training corpus to KB entries. Ablation results on query frequency and fluency impact will be summarized by reference to the corresponding experimental tables. revision: yes
Referee: [Abstract] The weakest assumption—that an LLM can be trained to emit the correct special tokens at contextually appropriate moments without auxiliary losses, dense supervision, or output degradation—is load-bearing for all three claimed benefits, yet no information is provided on the training objective, positive/negative example construction, or controls against over-/under-querying.

Authors: The manuscript (Section 3) specifies that training uses the standard autoregressive language-modeling objective with no auxiliary losses; supervision is obtained by automatically labeling positions where a factual span matches a KB entry, without manual dense annotation. Positive examples are the aligned spans; no explicit negative examples are constructed. Controls for over- and under-querying are evaluated via an ablation that measures both factual accuracy and output fluency (perplexity and human ratings) across different query-rate regimes. We will add a concise summary of this procedure to the abstract. The experimental results in Sections 4 and 5 provide evidence that the assumption holds under the reported conditions. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes training an LLM to emit special tokens that trigger external KB queries, with claimed benefits for factual updates and smaller models. No equations, fitted parameters, or derivation steps are described in the provided text. The method relies on standard supervised training and an external knowledge base rather than any self-referential construction or self-citation chain that reduces the result to its inputs. The central claim is an empirical training procedure whose validity is independent of the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The central claim implicitly rests on the unstated premise that a suitable KB exists and that token-level triggering can be learned reliably.

pith-pipeline@v0.9.1-grok · 5643 in / 1087 out tokens · 26580 ms · 2026-06-26T05:22:02.912015+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 2 linked inside Pith

[1]

In International conference on machine learning, pages 2206–2240

Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR. Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. Rebel: Relation extraction by end-to-end language generation. In Findings of the association for compu- tational linguistics: EMNLP 2021, pages 2370–2381. Payal Chandak...

2021
[2]

Scientific data, 10(1):67

Building a knowledge graph to enable pre- cision medicine. Scientific data, 10(1):67. Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou
[3]

IEEE Transactions on Big Data

The faiss library. IEEE Transactions on Big Data. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- pat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International confer- ence on machine learning, pages 3929–3938. PMLR. Samy Haffoudhi, Fabian M Suchanek, and Nils Holzen- berger. 2026. Lela: an llm-based entity linking ap- ...

arXiv 2020
[4]

Ad- vances in neural information processing systems , 36:45870–45894

Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Ad- vances in neural information processing systems , 36:45870–45894. Sam Houliston, Ambroise Odonnat, Charles Arnal, and Vivien Cabannes. 2025. Provable benefits of in-tool learning for large language models. arXiv preprint arXiv:2508.20755. Edward J Hu, Yelong Shen, Ph...

arXiv 2025
[5]

Preprint, arXiv:2102.01951

Mind the gap: Assessing temporal gen- eralization in neural language models. Preprint, arXiv:2102.01951. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Hein- rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, and 1 others. 2020. Retrieval-augmented gen- eration for knowledge-intensive nlp tasks. Advan...

arXiv 2020
[6]

In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063

Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063. Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuan-Jing Huang, and Xipeng Qiu. 2024. Scaling laws for fact memorization of large language models. In Findings of the Association for Computatio...

Pith/arXiv arXiv 2021
[7]

arXiv preprint arXiv:2411.00204

Restor: knowledge recovery in machine un- learning. arXiv preprint arXiv:2411.00204. Zacchary Sadeddine, Winston Maxwell, Gaël Varo- quaux, and Fabian M. Suchanek. 2025. Large Lan- guage Models as Search Engines: Societal Chal- lenges. In SIGIR Forum invited paper. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, L...

arXiv 2025
[8]

arXiv preprint arXiv:2505.09388

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. 2024. Gliner: Generalist model for named e...

Pith/arXiv arXiv 2022
[9]

Lower bound (coverage): for every r ∈ R, count(r) ≥ T
[10]

Tell me a bio of ⟨name⟩. ⟨name⟩ is

Upper bound (saturation): for every r ∈ R, count(r) ≤ C = αT. Proof. Since R is constructed from the KB, ev- ery relation has at least one supporting triple, so E(r) ̸= ∅ for all r ∈ R. Lower bound. The selection step (line 5) picks r∗ ∈ arg minr∈U count(r), where U = {r ∈ R : count(r) < T }. The uniform sample in line 6 is always well-defined since E(r∗)...

2025
[11]

OK: [REL:schema:director|Giuseppe Tornatore] Avoid:[REL:schema:director|directed by Giuseppe Tornatore]

The surface text must be MINIMAL: only the object value, no verbs, prepositions, or articles. OK: [REL:schema:director|Giuseppe Tornatore] Avoid:[REL:schema:director|directed by Giuseppe Tornatore]
[12]

Do NOT nest tags inside other tags
[13]

If you normalise a KB value, use the normalised form as the surface text
[14]

Only annotate the provided triples; do not annotate bridging world knowledge you added
[15]

born in Paris [REL:schema:birthPlace|Paris]

The tagged text REPLACES the object – do NOT write the object in plain text and then repeat it in a tag. Avoid:“born in Paris [REL:schema:birthPlace|Paris]” OK: “born in [REL:schema:birthPlace|Paris]”
[16]

Acetazolamide is a [REL:rdf:type|drug]

Date surface forms must match the object as provided (do not reformat). Output. Return ONLY the marked-up paragraph in the marked_paragraph field. Example user message Subject entity: Giuseppe Tornatore Relation definitions (use these to understand the triples): - schema:birthPlace: the place where the person was born - schema:birthDate: the date of birth...

1956
[17]

The surface text must be MINIMAL: only the object value, no verbs/prepositions/articles
[18]

If you normalize an object value, the normalized form must be the tagged surface text
[19]

Only annotate the provided triples; do not annotate added bridging text
[20]

targets X [REL:targets|X]

The tagged text REPLACES the object – do NOT write the object in plain text and then repeat it in a tag. Avoid:“targets X [REL:targets|X]” OK: “targets [REL:targets|X]”
[21]

If the same object appears in multiple triples, each triple must still be realized exactly once in a semantically distinguishable way
[22]

{entity} was born in

If a relation is negative (absence, contraindication, negative phenotype association), the surrounding sentence must explicitly preserve that negative meaning. Entity handling. Use terminology appropriate to the entity type: disease, drug, protein, phenotype, anatomy, pathway, biological process, molecular function, cellular component, or exposure. Output...

1956

[1] [1]

In International conference on machine learning, pages 2206–2240

Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR. Pere-Lluís Huguet Cabot and Roberto Navigli. 2021. Rebel: Relation extraction by end-to-end language generation. In Findings of the association for compu- tational linguistics: EMNLP 2021, pages 2370–2381. Payal Chandak...

2021

[2] [2]

Scientific data, 10(1):67

Building a knowledge graph to enable pre- cision medicine. Scientific data, 10(1):67. Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou

[3] [3]

IEEE Transactions on Big Data

The faiss library. IEEE Transactions on Big Data. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- pat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International confer- ence on machine learning, pages 3929–3938. PMLR. Samy Haffoudhi, Fabian M Suchanek, and Nils Holzen- berger. 2026. Lela: an llm-based entity linking ap- ...

arXiv 2020

[4] [4]

Ad- vances in neural information processing systems , 36:45870–45894

Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Ad- vances in neural information processing systems , 36:45870–45894. Sam Houliston, Ambroise Odonnat, Charles Arnal, and Vivien Cabannes. 2025. Provable benefits of in-tool learning for large language models. arXiv preprint arXiv:2508.20755. Edward J Hu, Yelong Shen, Ph...

arXiv 2025

[5] [5]

Preprint, arXiv:2102.01951

Mind the gap: Assessing temporal gen- eralization in neural language models. Preprint, arXiv:2102.01951. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Hein- rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, and 1 others. 2020. Retrieval-augmented gen- eration for knowledge-intensive nlp tasks. Advan...

arXiv 2020

[6] [6]

In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063

Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063. Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuan-Jing Huang, and Xipeng Qiu. 2024. Scaling laws for fact memorization of large language models. In Findings of the Association for Computatio...

Pith/arXiv arXiv 2021

[7] [7]

arXiv preprint arXiv:2411.00204

Restor: knowledge recovery in machine un- learning. arXiv preprint arXiv:2411.00204. Zacchary Sadeddine, Winston Maxwell, Gaël Varo- quaux, and Fabian M. Suchanek. 2025. Large Lan- guage Models as Search Engines: Societal Chal- lenges. In SIGIR Forum invited paper. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, L...

arXiv 2025

[8] [8]

arXiv preprint arXiv:2505.09388

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. 2024. Gliner: Generalist model for named e...

Pith/arXiv arXiv 2022

[9] [9]

Lower bound (coverage): for every r ∈ R, count(r) ≥ T

[10] [10]

Tell me a bio of ⟨name⟩. ⟨name⟩ is

Upper bound (saturation): for every r ∈ R, count(r) ≤ C = αT. Proof. Since R is constructed from the KB, ev- ery relation has at least one supporting triple, so E(r) ̸= ∅ for all r ∈ R. Lower bound. The selection step (line 5) picks r∗ ∈ arg minr∈U count(r), where U = {r ∈ R : count(r) < T }. The uniform sample in line 6 is always well-defined since E(r∗)...

2025

[11] [11]

OK: [REL:schema:director|Giuseppe Tornatore] Avoid:[REL:schema:director|directed by Giuseppe Tornatore]

The surface text must be MINIMAL: only the object value, no verbs, prepositions, or articles. OK: [REL:schema:director|Giuseppe Tornatore] Avoid:[REL:schema:director|directed by Giuseppe Tornatore]

[12] [12]

Do NOT nest tags inside other tags

[13] [13]

If you normalise a KB value, use the normalised form as the surface text

[14] [14]

Only annotate the provided triples; do not annotate bridging world knowledge you added

[15] [15]

born in Paris [REL:schema:birthPlace|Paris]

The tagged text REPLACES the object – do NOT write the object in plain text and then repeat it in a tag. Avoid:“born in Paris [REL:schema:birthPlace|Paris]” OK: “born in [REL:schema:birthPlace|Paris]”

[16] [16]

Acetazolamide is a [REL:rdf:type|drug]

Date surface forms must match the object as provided (do not reformat). Output. Return ONLY the marked-up paragraph in the marked_paragraph field. Example user message Subject entity: Giuseppe Tornatore Relation definitions (use these to understand the triples): - schema:birthPlace: the place where the person was born - schema:birthDate: the date of birth...

1956

[17] [17]

The surface text must be MINIMAL: only the object value, no verbs/prepositions/articles

[18] [18]

If you normalize an object value, the normalized form must be the tagged surface text

[19] [19]

Only annotate the provided triples; do not annotate added bridging text

[20] [20]

targets X [REL:targets|X]

The tagged text REPLACES the object – do NOT write the object in plain text and then repeat it in a tag. Avoid:“targets X [REL:targets|X]” OK: “targets [REL:targets|X]”

[21] [21]

If the same object appears in multiple triples, each triple must still be realized exactly once in a semantically distinguishable way

[22] [22]

{entity} was born in

If a relation is negative (absence, contraindication, negative phenotype association), the surrounding sentence must explicitly preserve that negative meaning. Entity handling. Use terminology appropriate to the entity type: disease, drug, protein, phenotype, anatomy, pathway, biological process, molecular function, cellular component, or exposure. Output...

1956