Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

Fan Guo; Fengyi Chen; Jinjing Shen; Qiuyu Tian; Xin Zhang; Yiding Li; Yingce Xia; Yiyun Luo; Youyong Kong; Yuyao Li

arxiv: 2606.05724 · v1 · pith:OCGUHNAAnew · submitted 2026-06-04 · 💻 cs.CL · cs.AI

Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

Qiuyu Tian , Fengyi Chen , Yiding Li , Youyong Kong , Fan Guo , Yuyao Li , Jinjing Shen , Zhijing Xie

show 4 more authors

Yiyun Luo Xin Zhang Yingce Xia Zequn Liu

This is my paper

Pith reviewed 2026-06-28 02:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords narrative QAretrieval-augmented generationstory understandinggraph-based reasoninglong-form textevidence alignmentconstraint auditingstory-world reasoning

0 comments

The pith

Narrative Knowledge Weaver aligns textual evidence with facts, graphs, profiles, interactions, episodes, and storylines to improve reasoning over evolving story worlds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Narrative Knowledge Weaver to address long-form narrative question answering, where answers depend on changing character states, social relations, causal triggers, and temporal positions rather than isolated passages. It claims that existing retrieval and graph methods fall short because their units do not encode how evidence functions inside a story. NKW instead aligns multiple layers of source-grounded evidence and applies post-retrieval tools to audit actor, scope, polarity, state, and temporal constraints. Experiments across three benchmarks show it performs best on screenplay-level story-world tasks while remaining competitive on passage-focused ones. If the alignment holds, systems gain reliable access to the dynamic structure of narratives.

Core claim

Narrative Knowledge Weaver is a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoni

What carries the argument

The Narrative Knowledge Weaver framework, which aligns multiple evidence layers and applies constraint-auditing tools during retrieval to capture how evidence functions in a narrative.

If this is right

Strongest performance appears on screenplay-level story-world QA tasks that track multiple evolving elements.
Competitive results hold on passage-centered benchmarks that rely less on narrative dynamics.
Ablations indicate complementary gains for character, scene, temporal, causal, and narrative-progression reasoning.
Graph-asset statistics provide measurable support for the aligned evidence structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The layered alignment approach might apply to other domains with evolving evidence, such as legal case files or medical histories.
Constraint auditing could reduce errors when models must track state changes across long documents outside fiction.
Testing the framework on datasets with explicit temporal or causal structures would show whether the benefits generalize beyond the reported benchmarks.

Load-bearing premise

Aligning textual evidence, atomic facts, graph structures, entity profiles, interactions, episodes, and storylines plus auditing actor, scope, polarity, state, and temporal constraints will reliably capture how evidence functions inside a narrative.

What would settle it

A new screenplay-style benchmark where NKW shows no advantage over standard retrieval-augmented methods on questions that require tracking evolving character states, causal chains, and temporal positions.

Figures

Figures reproduced from arXiv: 2606.05724 by Fan Guo, Fengyi Chen, Jinjing Shen, Qiuyu Tian, Xin Zhang, Yiding Li, Yingce Xia, Yiyun Luo, Youyong Kong, Yuyao Li, Zequn Liu, Zhijing Xie.

**Figure 1.** Figure 1: Narrative-centric framework overview. The framework separates construction-time graph building from query-time reasoning. namics: local facts, character profiles, interactions, event traces, episodes, and storylines. 3 The Narrative Knowledge Weaver Framework We propose Narrative Knowledge Weaver, a framework for narrative-centric knowledge modeling and reasoning over long-form text. Let D = {ci} n i=1 de… view at source ↗

**Figure 2.** Figure 2: Fine-grained tool-use distribution across benchmarks. Donut charts summarize query-time tool invocations grouped into text, graph, narrative, and administrative mapping channels. STAGE uses more graph and narrative tracing, FairytaleQA relies more on atomic facts and semantic search, and QuALITY primarily uses text extraction tools. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Continuity chain visualization. Chains link scenes that can plausibly share a production setup in an example from The Wandering Earth II (Gwo, 2023), with scene summaries and source-grounded narrative cues shown for review. grounds entities, relations, and narrative units to source chunks, each asset can be mapped back to the scene in which it appears; because the schema separates characters, objects, loca… view at source ↗

read the original abstract

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver(NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NKW stacks multiple narrative layers and adds constraint auditing to RAG, but the abstract leaves the core mechanisms too vague to judge if the gains are real.

read the letter

The main takeaway is that NKW aligns textual evidence with atomic facts, graphs, entity profiles, interactions, episodes, and storylines, then applies post-retrieval tools to audit actor, scope, polarity, state, and temporal constraints. This is presented as a way to handle evolving story worlds better than standard chunk or relation retrieval.

What the work does is lay out a concrete multi-layer structure and claim it helps on screenplay-level story-world QA across STAGE, FairytaleQA, and QuALITY while staying competitive on passage-focused sets. The mention of ablations, question-type breakdowns, and case studies suggests they tried to show where the different layers add value for character, causal, and temporal reasoning.

The soft spots are straightforward. Only the abstract is here, so there are no methods details, tables, or error bars to check whether the performance edge actually comes from the narrative alignment rather than retrieval volume or dataset quirks. The stress-test point holds: the abstract gives no mechanism for resolving conflicts between layers or for how the auditing step changes the reasoning output. Without that, the central claim rests on an unshown assumption.

This is aimed at people working on long-form narrative QA. A reader looking for new framing ideas on story-world evidence might pick up some useful distinctions, but anyone needing reproducible results or clear implementation will find it thin. It deserves a serious referee to see whether the full paper supplies the missing mechanisms and controls.

Referee Report

2 major / 0 minor

Summary. The paper introduces Narrative Knowledge Weaver (NKW), a source-grounded framework for long-form narrative QA. It aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines, then applies text/graph/narrative tools with post-retrieval auditing of actor/scope/polarity/state/temporal constraints to model evidence function in evolving story worlds. Experiments across STAGE, FairytaleQA, and QuALITY report strongest results on screenplay-level story-world QA while remaining competitive on passage-centered benchmarks, with ablations, question-type analyses, graph statistics, and case studies cited as supporting evidence.

Significance. If the performance differences prove robust and attributable to the narrative-centric alignment and auditing rather than retrieval volume or dataset artifacts, the work could advance RAG methods by providing explicit machinery for story-world dynamics (state changes, causal triggers, temporal position). The inclusion of ablations and case studies would strengthen claims about complementary benefits for character, scene, temporal, causal, and progression reasoning.

major comments (2)

[Abstract, paragraph 2] Abstract, paragraph 2: The claim that multi-layer alignment plus constraint auditing 'reliably capture how evidence functions inside a narrative' is load-bearing for the central contribution, yet the description provides no concrete mechanism for conflict resolution across layers (textual evidence vs. graph vs. profiles vs. episodes) or for how the auditing step alters downstream reasoning outputs; without this, gains on screenplay QA could stem from other factors.
[Abstract] Abstract: No experimental details, ablation tables, dataset statistics, error bars, or graph-asset numbers are visible, preventing verification that reported performance differences support the narrative-centric design over prior chunk/entity/relation methods or rule out dataset bias.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's comments and provide point-by-point responses below. We are prepared to make revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract, paragraph 2] Abstract, paragraph 2: The claim that multi-layer alignment plus constraint auditing 'reliably capture how evidence functions inside a narrative' is load-bearing for the central contribution, yet the description provides no concrete mechanism for conflict resolution across layers (textual evidence vs. graph vs. profiles vs. episodes) or for how the auditing step alters downstream reasoning outputs; without this, gains on screenplay QA could stem from other factors.

Authors: We thank the referee for highlighting this. The abstract condenses the framework description. The full paper details the mechanisms for multi-layer alignment, including conflict resolution across textual, graph, and narrative layers via the tool-based assembly and constraint auditing, as well as how auditing affects reasoning by enforcing consistency checks. To address the concern, we will revise the abstract to provide a more concrete, albeit brief, indication of these processes. revision: yes
Referee: [Abstract] Abstract: No experimental details, ablation tables, dataset statistics, error bars, or graph-asset numbers are visible, preventing verification that reported performance differences support the narrative-centric design over prior chunk/entity/relation methods or rule out dataset bias.

Authors: Abstracts have strict length constraints and standardly focus on high-level claims rather than detailed experimental data, which are instead reported in the body of the paper, including ablation studies, dataset statistics, error bars, and graph asset numbers. These elements allow readers to assess the robustness of the performance differences and the role of the narrative-centric design. We do not plan to expand the abstract with such details. revision: no

Circularity Check

0 steps flagged

No circularity: empirical system description with no load-bearing derivations or self-referential reductions

full rationale

The provided abstract and manuscript excerpt describe an empirical framework (NKW) that aligns multiple narrative layers and applies post-retrieval auditing, then reports benchmark performance. No equations, parameter-fitting steps, uniqueness theorems, or self-citations appear in the text. The central claims rest on system design choices and observed results rather than any quantity that reduces by construction to its own inputs. This is the most common honest finding for a retrieval-augmented QA paper whose contribution is architectural and experimental.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5762 in / 1131 out tokens · 21404 ms · 2026-06-28T02:14:07.396065+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 1 linked inside Pith

[1]

Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang

Constructing inferences during narrative text comprehension.Psychological Review, 101(3):371– 395. Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and fast retrieval- augmented generation. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2025, pages 10746–10761. Association for Computational Linguistic...

arXiv 2025
[2]

Tom Trabasso and Paul van den Broek

STAGE: A full-screenplay benchmark for reasoning over evolving stories.arXiv preprint arXiv:2601.08510. Tom Trabasso and Paul van den Broek. 1985. Causal thinking and the representation of narrative events. Journal of Memory and Language, 24(5):612–630. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieva...

Pith/arXiv arXiv 1985
[3]

InProceedings of the 60th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland

Fantastic questions and where to find them: FairytaleQA – an authentic dataset for narrative com- prehension. InProceedings of the 60th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland. Association for Computational Linguistics. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zheng...

2019
[4]

Query decomposition.The evaluation wrap- per submits the question, document identifier, answer format, and retrieval budget to NKW. The query is decomposed into retrieval intents over canonical entities, relations, atomic facts, and narrative units, so the first evidence pass can address both local factual grounding and higher-level narrative context
[5]

In the query-time tool taxonomy, this step con- sists of: entity_search;entity_lookup; relation_search;atomic_fact_search; narrative_semantic_search

Forced graph–fact and narrative recall.Be- fore the tool agent starts, NKW runs a manda- tory first-pass recall step over the canonical graph, atomic facts, and semantic narrative units. In the query-time tool taxonomy, this step con- sists of: entity_search;entity_lookup; relation_search;atomic_fact_search; narrative_semantic_search. This mandatory step ...
[6]

These assembly and indexing steps are not counted as separate tool calls

Candidate evidence assembly.The system enriches retrieved entities with attributes and query-relevant atomic facts, keeps relation can- didates from both entity- and relation-centered retrieval, attaches available source-text evi- dence through provenance indices, and prepares source chunks for grounding. These assembly and indexing steps are not counted ...
[7]

Each round scores the cur- rent packet, optionally builds a pruned packet, checks answerability, and stops without tool calls when the packet is already answerable

Tool-agent refinement.The tool agent is in- serted after first-pass retrieval and before final prompt construction. Each round scores the cur- rent packet, optionally builds a pruned packet, checks answerability, and stops without tool calls when the packet is already answerable. Otherwise, it plans at most five tool calls for the current evidence gaps. T...
[8]

Dynamic packing then con- structs the final prompt sections for entities, re- lations, narrative items, source text chunks, op- tional tool-text evidence, and references

Evidence-board construction.Returned tool evidence can add entities, relations, narrative items, or compact tool-text evidence, subject to channel-specific edit switches and added- evidence budgets. Dynamic packing then con- structs the final prompt sections for entities, re- lations, narrative items, source text chunks, op- tional tool-text evidence, and...
[9]

is_correct

Reading-skill selection and finalization.Read- ing skills are selected after retrieval and pack- ing, with at most three skills injected into the final prompt. They do not retrieve new evi- dence; they instruct the final reader how to com- pare the assembled packet for option admission, 14 causal/quest reasoning, character mental state, theme or implicati...

1990
[10]

chasing Amy

, and Krippendorff’sαis α= 1− Do De . Because α measures how much the annotators agree beyond what would be expected by random labeling, it is well suited for evaluating the reliabil- ity of repeated stochastic LLM judgments. Table B.1: Internal agreement of the correctness evalua- tor across five independent judgments. Pairwise agree- ment is averaged ov...

arXiv 1988
[11]

is_continuity

(173 scenes), naive all-pairs checking would require 173×172/2 pairwise decisions, whereas our entity-based filtering produces on the order of 1,500–2,000 candidates. The agent can further tighten this set by incorporating additional priors such as semantic similarity between scene summaries and constraints on script-order distance (e.g., only considering...
[12]

Whether every neighboring scene pair has explicit source-grounded support
[13]

Whether the chain depends on a weak bridge scene that should be split
[14]

Whether shared entities, locations, objects, and occasions remain compatible across the chain
[15]

Whether any source-grounded fact contradicts reuse of the same setup
[16]

coherence_score

Whether the chain is redundant, over-extended, or should be merged with another chain. Return JSON fields: { "coherence_score": 0.0-1.0, "confidence": 0.0-1.0, "decision": "keep" | "split" | "drop", "weak_edges": [ ["scene_A", "scene_B"] ], "suggested_splits": [ ["scene_A", "scene_B"], ["scene_C"] ], "rationale": "brief rationale" } D.2 Character State Tr...
[17]

Timeline Swimlane View.A global character–scene matrixshowing all characters (rows) across all scenes (columns). This view of- fers a high-level visualization of appearance fre- quency, co-occurrence structure, and participa- 25 Figure D.1:Timeline Swimlane View.A screenplay-wide character–scene matrix enabling global inspection of appearance patterns. ti...

[1] [1]

Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang

Constructing inferences during narrative text comprehension.Psychological Review, 101(3):371– 395. Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and fast retrieval- augmented generation. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2025, pages 10746–10761. Association for Computational Linguistic...

arXiv 2025

[2] [2]

Tom Trabasso and Paul van den Broek

STAGE: A full-screenplay benchmark for reasoning over evolving stories.arXiv preprint arXiv:2601.08510. Tom Trabasso and Paul van den Broek. 1985. Causal thinking and the representation of narrative events. Journal of Memory and Language, 24(5):612–630. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieva...

Pith/arXiv arXiv 1985

[3] [3]

InProceedings of the 60th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland

Fantastic questions and where to find them: FairytaleQA – an authentic dataset for narrative com- prehension. InProceedings of the 60th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland. Association for Computational Linguistics. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zheng...

2019

[4] [4]

Query decomposition.The evaluation wrap- per submits the question, document identifier, answer format, and retrieval budget to NKW. The query is decomposed into retrieval intents over canonical entities, relations, atomic facts, and narrative units, so the first evidence pass can address both local factual grounding and higher-level narrative context

[5] [5]

In the query-time tool taxonomy, this step con- sists of: entity_search;entity_lookup; relation_search;atomic_fact_search; narrative_semantic_search

Forced graph–fact and narrative recall.Be- fore the tool agent starts, NKW runs a manda- tory first-pass recall step over the canonical graph, atomic facts, and semantic narrative units. In the query-time tool taxonomy, this step con- sists of: entity_search;entity_lookup; relation_search;atomic_fact_search; narrative_semantic_search. This mandatory step ...

[6] [6]

These assembly and indexing steps are not counted as separate tool calls

Candidate evidence assembly.The system enriches retrieved entities with attributes and query-relevant atomic facts, keeps relation can- didates from both entity- and relation-centered retrieval, attaches available source-text evi- dence through provenance indices, and prepares source chunks for grounding. These assembly and indexing steps are not counted ...

[7] [7]

Each round scores the cur- rent packet, optionally builds a pruned packet, checks answerability, and stops without tool calls when the packet is already answerable

Tool-agent refinement.The tool agent is in- serted after first-pass retrieval and before final prompt construction. Each round scores the cur- rent packet, optionally builds a pruned packet, checks answerability, and stops without tool calls when the packet is already answerable. Otherwise, it plans at most five tool calls for the current evidence gaps. T...

[8] [8]

Dynamic packing then con- structs the final prompt sections for entities, re- lations, narrative items, source text chunks, op- tional tool-text evidence, and references

Evidence-board construction.Returned tool evidence can add entities, relations, narrative items, or compact tool-text evidence, subject to channel-specific edit switches and added- evidence budgets. Dynamic packing then con- structs the final prompt sections for entities, re- lations, narrative items, source text chunks, op- tional tool-text evidence, and...

[9] [9]

is_correct

Reading-skill selection and finalization.Read- ing skills are selected after retrieval and pack- ing, with at most three skills injected into the final prompt. They do not retrieve new evi- dence; they instruct the final reader how to com- pare the assembled packet for option admission, 14 causal/quest reasoning, character mental state, theme or implicati...

1990

[10] [10]

chasing Amy

, and Krippendorff’sαis α= 1− Do De . Because α measures how much the annotators agree beyond what would be expected by random labeling, it is well suited for evaluating the reliabil- ity of repeated stochastic LLM judgments. Table B.1: Internal agreement of the correctness evalua- tor across five independent judgments. Pairwise agree- ment is averaged ov...

arXiv 1988

[11] [11]

is_continuity

(173 scenes), naive all-pairs checking would require 173×172/2 pairwise decisions, whereas our entity-based filtering produces on the order of 1,500–2,000 candidates. The agent can further tighten this set by incorporating additional priors such as semantic similarity between scene summaries and constraints on script-order distance (e.g., only considering...

[12] [12]

Whether every neighboring scene pair has explicit source-grounded support

[13] [13]

Whether the chain depends on a weak bridge scene that should be split

[14] [14]

Whether shared entities, locations, objects, and occasions remain compatible across the chain

[15] [15]

Whether any source-grounded fact contradicts reuse of the same setup

[16] [16]

coherence_score

Whether the chain is redundant, over-extended, or should be merged with another chain. Return JSON fields: { "coherence_score": 0.0-1.0, "confidence": 0.0-1.0, "decision": "keep" | "split" | "drop", "weak_edges": [ ["scene_A", "scene_B"] ], "suggested_splits": [ ["scene_A", "scene_B"], ["scene_C"] ], "rationale": "brief rationale" } D.2 Character State Tr...

[17] [17]

Timeline Swimlane View.A global character–scene matrixshowing all characters (rows) across all scenes (columns). This view of- fers a high-level visualization of appearance fre- quency, co-occurrence structure, and participa- 25 Figure D.1:Timeline Swimlane View.A screenplay-wide character–scene matrix enabling global inspection of appearance patterns. ti...