ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning

Haiyun Jiang; Hang Ding; Jiawei Zhou

arxiv: 2511.16326 · v3 · submitted 2025-11-20 · 💻 cs.IR

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

classification 💻 cs.IR

keywords retrieval-augmented generation · retriever fine-tuning · knowledge graphs · curriculum learning · contrastive learning · long-context retrieval · answer alignment · hard negative mining

The pith

A fine-tuning framework aligns retrievers with answer generation by labeling sufficient chunks and using knowledge-graph curriculum learning to select hard negatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to fine-tune retrievers in retrieval-augmented generation so they prioritize chunks that actually enable an LLM to produce the correct answer rather than relying on surface similarity. It labels positives by testing sufficiency for answer generation and builds a curriculum of increasingly difficult negatives by using LLM-built knowledge graphs to create augmented queries. This trains the model to handle sparse evidence in long documents. A sympathetic reader would care because standard retrievers often overlook critical but low-similarity passages, limiting RAG reliability in extended contexts.

Core claim

The central claim is that answer-centric fine-tuning via a curriculum-based contrastive scheme, where positives are identified by their sufficiency to generate the correct answer and hard negatives are mined from LLM-constructed knowledge graphs, produces a retriever that achieves state-of-the-art results on long-context benchmarks while preserving efficiency and requiring no major architectural changes.

What carries the argument

Answer Alignment through curriculum contrastive learning that uses LLM-constructed knowledge graphs to generate augmented queries for progressively harder negatives, paired with sufficiency checks to select positive chunks.
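
The contrastive part of this mechanism can be illustrated with a small sketch. The page does not reproduce the paper's loss, so the InfoNCE form, the `temperature` value, and the names `info_nce_loss` and `dot` are assumptions made for illustration, not the authors' implementation. The sketch shows why a hard negative (one whose embedding sits close to the query) gives a larger training signal than an easy one.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (assumed form, not taken from the paper).

    The positive is placed first among the logits; the loss is the negative
    log-probability that a softmax over all candidates assigns to it.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy 2-d embeddings (hypothetical values).
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]   # orthogonal to the query
hard_negative = [0.9, 0.1]   # nearly aligned with the query

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
```

With these toy vectors `loss_hard` exceeds `loss_easy`, which is the usual motivation for ordering negatives from easy to hard in a curriculum.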

If this is right

The tuned retriever reaches state-of-the-art performance across 10 datasets from Ultradomain and LongBench.
Performance improves 14.5 percent over the base model without substantial architectural modifications.
The approach maintains strong efficiency when handling long-context retrieval-augmented generation.
Training teaches the retriever to separate answer-sufficient positives from nuanced distractors, improving generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method might transfer to retrieval settings that do not involve generation, such as pure fact-checking pipelines.
If knowledge-graph construction introduces domain-specific biases, performance could vary across specialized corpora not tested here.
Applying the sufficiency-labeling step to multi-turn conversations could extend the framework beyond single-query long documents.

Load-bearing premise

That checking whether a chunk lets an LLM generate the correct answer reliably marks it as a high-quality positive, and that knowledge graphs built by LLMs supply useful augmented queries for finding challenging negatives.
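
The first half of this premise can be made concrete with a minimal sketch of sufficiency labeling. Everything named here is hypothetical: `label_sufficient_chunks`, the `generate` callback standing in for an LLM call, and the substring match used as a crude correctness test. The Figure 3 caption suggests the paper ranks chunks by combined alignment scores rather than applying a binary check, so this is a simplification of the idea, not the paper's procedure.

```python
from typing import Callable, List

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so string comparison is lenient."""
    return " ".join(text.lower().split())

def label_sufficient_chunks(
    question: str,
    gold_answer: str,
    chunks: List[str],
    generate: Callable[[str, str], str],
) -> List[int]:
    """Return indices of chunks from which `generate` produces the gold answer.

    `generate(question, context)` is a placeholder for an LLM call.
    """
    positives = []
    for index, chunk in enumerate(chunks):
        prediction = generate(question, chunk)
        # Crude correctness test: gold answer appears in the prediction.
        if normalize(gold_answer) in normalize(prediction):
            positives.append(index)
    return positives

def toy_generate(question: str, context: str) -> str:
    """Toy stand-in for an LLM that answers only from the given context."""
    if "canberra" in context.lower():
        return "The capital is Canberra."
    return "I don't know."

chunks = [
    "Sydney is the largest city in Australia.",
    "Canberra is the capital of Australia.",
]
positive_ids = label_sufficient_chunks(
    "What is the capital of Australia?", "Canberra", chunks, toy_generate
)
```

A real LLM, unlike `toy_generate`, may answer from its own pre-trained knowledge, which is exactly the failure mode the premise has to rule out.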

What would settle it

Replicating the fine-tuning on the same Ultradomain and LongBench datasets and finding either an improvement well short of 14.5 percent over the base model or a loss of efficiency in long-context settings would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2511.16326 by Haiyun Jiang, Hang Ding, Jiawei Zhou.

**Figure 1.** Our RAG Retriever Finetuning Framework ARK, which consists of two major stages: A (Query Construction): From long documents and their corresponding QA pairs, we extract a query-based subgraph using an LLM-generated KG. The subgraph is reformulated with knowledge injection to produce enriched queries. B (Contrastive Finetuning): Using both the original query and injected variants, we identify positive chunk…

**Figure 2.** Query Construction Phase. The pipeline begins with KG Construction, where we extract entities, relations, and covariates from long documents to construct an LLM-generated KG. Given a corresponding QA pair, relevant entities are extracted and used to construct PPR-based subgraphs from the KG, with varying maximum sizes to control difficulty. Finally, Augmented Queries are formulated with LLM conditioned o…
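
The PPR-based subgraph step in this caption can be sketched as follows, reading PPR as Personalized PageRank. The adjacency-dict graph format, the damping factor, the iteration count, and the names `personalized_pagerank` and `subgraph_nodes` are all assumptions for illustration; the caption does not say how the paper implements or parameterizes this step.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power-iteration Personalized PageRank on an adjacency dict.

    `graph` maps each node to a list of neighbours (every neighbour must
    also be a key). Random-walk restarts land only on the `seeds`.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # Dangling node: hand its mass back to the seed set.
                for s in nodes:
                    nxt[s] += damping * rank[n] * restart[s]
                continue
            share = damping * rank[n] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        rank = nxt
    return rank

def subgraph_nodes(graph, seeds, max_size):
    """Keep the `max_size` nodes with the highest PPR score."""
    rank = personalized_pagerank(graph, seeds)
    ordered = sorted(rank, key=rank.get, reverse=True)
    return ordered[:max_size]

# Toy entity graph (hypothetical), seeded at the entity found in the QA pair.
toy_kg = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
small = subgraph_nodes(toy_kg, {"A"}, max_size=2)
large = subgraph_nodes(toy_kg, {"A"}, max_size=4)
```

Because both calls rank nodes by the same scores, the smaller subgraph is a prefix of the larger one, which is one simple way a size cap could translate into graded difficulty.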

**Figure 3.** Contrastive Finetuning Phase. Our finetuning pipeline comprises two sequential components: Ranking Alignment, in which for each sample, we combine three alignment scores to select the Top-M chunks as positive chunks; followed by Curriculum-based Contrastive Learning, which progressively refines the retriever through (i) in-batch negative sampling, (ii) hard negatives T − hardL mined via query set T − h…
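
A rough sketch of the two components named in this caption follows. The caption is truncated, so the three alignment scores, how they are combined, and how curriculum stages are scheduled are not visible here. The unweighted mean, the cumulative negative pools, and the names `select_top_m` and `curriculum_stages` are placeholders, not the paper's definitions.

```python
def select_top_m(chunk_scores, m):
    """Pick the M chunks with the highest combined score.

    `chunk_scores` maps chunk id -> tuple of three alignment scores.
    Combination by unweighted mean is an assumption.
    """
    combined = {cid: sum(s) / len(s) for cid, s in chunk_scores.items()}
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:m]

def curriculum_stages(in_batch_negatives, hard_negatives_by_level):
    """Yield the negative pool for each stage, easiest first.

    Stage 0 uses in-batch negatives only; each later stage adds the hard
    negatives of the next difficulty level (cumulative pooling is assumed).
    """
    pool = list(in_batch_negatives)
    yield list(pool)
    for level in sorted(hard_negatives_by_level):
        pool += hard_negatives_by_level[level]
        yield list(pool)

# Hypothetical scores for three chunks.
scores = {
    "c1": (0.9, 0.8, 0.7),
    "c2": (0.2, 0.3, 0.1),
    "c3": (0.5, 0.6, 0.7),
}
positives = select_top_m(scores, m=2)
stages = list(curriculum_stages(["n1"], {1: ["h1"], 2: ["h2"]}))
```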

Original abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for knowledge-intensive tasks, yet its effectiveness in long-context scenarios is often bottlenecked by the retriever's inability to distinguish sparse yet crucial evidence. Standard retrievers, optimized for query-document similarity, frequently fail to align with the downstream goal of generating a precise answer. To bridge this gap, we propose a novel fine-tuning framework that optimizes the retriever for Answer Alignment. Specifically, we first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever. This curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives. This process trains the retriever to distinguish the answer-sufficient positive chunks from these nuanced distractors, enhancing its generalization. Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks demonstrate that our fine-tuned retriever achieves state-of-the-art performance, improving 14.5% over the base model without substantial architectural modifications and maintaining strong efficiency for long-context RAG. Our work presents a robust and effective methodology for building truly answer-centric retrievers. Source Code is available on https://github.com/valleysprings/ARK/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARK gives a workable recipe for answer-aligned retriever tuning via sufficiency labels and KG curriculum, but the positive labeling step is vulnerable to parametric knowledge leakage.

The main thing here is that the paper shifts retriever fine-tuning toward chunks that actually let an LLM produce the correct answer, then uses an LLM-built knowledge graph to create a curriculum of harder negatives for contrastive training. This is a concrete synthesis aimed at long-context RAG rather than generic similarity matching. The reported 14.5 percent lift over the base model across ten datasets from Ultradomain and LongBench, with no major architecture changes, is the empirical claim that would matter most to practitioners if it holds up. The approach stays efficient and keeps the focus on existing retriever backbones, which is a practical strength.

The experiments cover a reasonable range of benchmarks, and the code release is noted, which helps with reproducibility checks. The KG augmentation for query variation and progressive hard-negative mining adds a structured way to increase difficulty during training that goes beyond simple random or static negatives. That part feels like a reasonable engineering choice for building generalization.

The soft spot is the positive labeling. Determining sufficiency by whether an LLM can generate the right answer from the chunk alone can easily pick up cases where the model succeeds from its own pre-trained knowledge instead of the chunk content. If that happens at scale, the contrastive signal becomes noisy and the gains could partly reflect artifact rather than genuine retrieval improvement. The abstract gives no details on controls for this, such as forcing the LLM to cite the chunk or using knowledge-limited models for labeling. The concern lands directly on the method as described.

This work is aimed at engineers and researchers who maintain or tune retrievers inside production RAG pipelines, especially for long documents where evidence is sparse. A reader who wants empirical methods to boost answer quality without redesigning the whole stack would find the results relevant, assuming the experimental details check out. I would send it to peer review. The scope is broad enough for referees to test the labeling robustness and clarify how much the curriculum actually contributes.

Referee Report

2 major / 2 minor

Summary. The paper proposes ARK, a fine-tuning framework for retrievers in long-context RAG that optimizes for answer alignment rather than query-document similarity. Positive chunks are labeled by checking whether an LLM can generate the correct answer from the chunk alone; a KG-augmented curriculum then constructs progressively harder queries to mine negatives for contrastive learning. Experiments across 10 datasets from Ultradomain and LongBench are reported to yield state-of-the-art results with a 14.5% gain over the base model, and source code is released.

Significance. If the central performance claims hold after addressing labeling validity, the work would offer a practical route to answer-centric retrievers without architectural overhaul, potentially benefiting long-context RAG pipelines. The curriculum-learning design and public code release are concrete strengths that support reproducibility and extension.

major comments (2)

[§3 (Positive Chunk Identification)] The core labeling step that marks a chunk as a high-quality positive solely when an LLM produces the correct answer from it alone is load-bearing for the answer-alignment objective and all downstream contrastive training. This procedure risks false positives when the LLM succeeds via parametric knowledge rather than chunk content; no control experiment, ablation, or analysis isolating this effect is described, leaving open the possibility that reported gains reflect label noise rather than genuine retrieval improvement.
[§4 (Experiments)] The claim of 14.5% improvement and state-of-the-art performance across 10 datasets is central yet presented without explicit primary metric (e.g., Recall@K or nDCG), full baseline list, run-to-run variance, or statistical significance tests. Because the curriculum and KG components are novel, the absence of targeted ablations on these elements weakens the ability to attribute gains specifically to the proposed answer-centric tuning.

minor comments (2)

[Abstract] The phrase 'improving 14.5% over the base model' should specify both the exact metric and the identity of the base model for immediate clarity.
[Method] Notation and figures: Ensure all KG-augmented query examples and curriculum stage definitions are accompanied by precise pseudocode or equations to avoid ambiguity in replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Referee: §3 (Positive Chunk Identification): The core labeling step that marks a chunk as a high-quality positive solely when an LLM produces the correct answer from it alone is load-bearing for the answer-alignment objective and all downstream contrastive training. This procedure risks false positives when the LLM succeeds via parametric knowledge rather than chunk content; no control experiment, ablation, or analysis isolating this effect is described, leaving open the possibility that reported gains reflect label noise rather than genuine retrieval improvement.

Authors: We acknowledge the risk that some positive labels may arise from the LLM's parametric knowledge rather than chunk content. To address this directly, we will add a control experiment in the revised manuscript: we will prompt the same LLM on a set of clearly insufficient chunks (those lacking the answer information) and measure the false-positive rate due to parametric recall alone. We will also report the fraction of our positive chunks that contain explicit ground-truth answer spans. These additions will allow readers to assess the extent of any label noise and its potential influence on the observed gains.
Revision: yes

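
The two measurements promised in this response could be computed along the following lines. This is a sketch under stated assumptions: the function names, the `generate` callback standing in for an LLM call, the substring test for correctness, and the toy examples are all hypothetical and do not come from the paper or its repository.

```python
def parametric_false_positive_rate(examples, generate):
    """Share of clearly insufficient chunks that still yield the gold answer.

    `examples` is a list of (question, gold_answer, insufficient_chunk).
    A non-zero rate indicates answers recalled from model memory.
    """
    if not examples:
        return 0.0
    hits = 0
    for question, gold, chunk in examples:
        if gold.lower() in generate(question, chunk).lower():
            hits += 1
    return hits / len(examples)

def answer_span_fraction(positives):
    """Share of positive chunks containing the gold answer verbatim.

    `positives` is a list of (gold_answer, chunk_text).
    """
    if not positives:
        return 0.0
    with_span = sum(1 for gold, chunk in positives if gold.lower() in chunk.lower())
    return with_span / len(positives)

def leaky_generate(question, context):
    """Toy model that has 'memorised' one fact and ignores the context for it."""
    if "capital of france" in question.lower():
        return "Paris"
    return "Unknown"

control_set = [
    ("What is the capital of France?", "Paris", "Bread is baked in ovens."),
    ("Who founded Acme Corp?", "Jane Doe", "Acme Corp sells anvils."),
]
fp_rate = parametric_false_positive_rate(control_set, leaky_generate)

labeled_positives = [
    ("Jane Doe", "Acme Corp was founded by Jane Doe."),
    ("Paris", "The city hosts the national parliament."),
]
span_fraction = answer_span_fraction(labeled_positives)
```

In this toy setup half of the insufficient chunks are "answered" anyway, and half of the labeled positives lack a verbatim answer span, which is the kind of label noise the referee is asking the authors to quantify.
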
Referee: §4 (Experiments): The claim of 14.5% improvement and state-of-the-art performance across 10 datasets is central yet presented without explicit primary metric (e.g., Recall@K or nDCG), full baseline list, run-to-run variance, or statistical significance tests. Because the curriculum and KG components are novel, the absence of targeted ablations on these elements weakens the ability to attribute gains specifically to the proposed answer-centric tuning.

Authors: We agree that the experimental reporting should be more explicit and rigorous. In the revision we will (1) state the primary metric (Recall@5) clearly in the main results table and text, (2) provide a complete enumerated list of all baselines with citations, (3) report mean performance together with standard deviations across three random seeds, and (4) include paired t-test p-values for the main comparisons. We will also add two targeted ablations to isolate the contribution of each component to the overall improvement: one that disables the curriculum schedule while keeping KG-augmented negatives, and one that removes KG augmentation while retaining curriculum ordering.
Revision: yes
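
For readers unfamiliar with the reporting items listed in this response, a minimal sketch of Recall@K and a paired t statistic across seeds follows. The function names and all numbers are made up for illustration and are not results from the paper. Turning the statistic into a p-value needs the t distribution, which a library routine such as SciPy's `ttest_rel` provides; that step is omitted here.

```python
import math
from statistics import mean, stdev

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (e.g. same seeds, two systems).

    Degrees of freedom are len(scores_a) - 1. Raises ZeroDivisionError if
    all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# One query: two relevant chunks, one of which is retrieved in the top 5.
recall = recall_at_k(["d3", "d1", "d7", "d2", "d9", "d4"], {"d1", "d4"}, k=5)

# Illustrative Recall@5 over three seeds (invented values).
tuned = [0.62, 0.65, 0.60]
base = [0.55, 0.56, 0.54]
t_stat = paired_t_statistic(tuned, base)
```

With only three seeds the test has two degrees of freedom, so even a large t statistic should be read cautiously.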

Circularity Check

0 steps flagged

No equations or derivations; method is empirical and externally evaluated

The paper presents an empirical fine-tuning framework for retrievers that identifies positive chunks via LLM sufficiency checks and uses KG-augmented queries for curriculum-based hard negative mining, followed by contrastive learning and benchmark evaluation. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain. All performance claims (e.g., 14.5% improvement) rest on downstream task results from Ultradomain and LongBench datasets rather than reducing to inputs defined by the same process, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on two key assumptions about LLM behavior rather than introducing new mathematical objects or free parameters.

axioms (2)

domain assumption: An LLM can accurately judge whether a text chunk is sufficient to generate the correct answer. Used to label positive chunks in the first stage of the pipeline.
domain assumption: LLM-constructed knowledge graphs produce augmented queries that reliably surface useful hard negatives. Central to the curriculum learning stage for progressive difficulty.

pith-pipeline@v0.9.0 · 5532 in / 1229 out tokens · 34563 ms · 2026-05-17T20:59:07.297450+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer... curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
cs.CL · 2026-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
