Characterizing Narrative Content in Web-scale LLM Pretraining Data

Andrew Piper; Elliott Ash; Maria Antoniak; Teagan Johnson

arxiv: 2606.19468 · v1 · pith:OOZXSYOBnew · submitted 2026-06-17 · 💻 cs.CL

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Teagan Johnson , Elliott Ash , Andrew Piper , Maria Antoniak This is my paper

Pith reviewed 2026-06-26 20:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords narrative structureLLM pretrainingweb-scale corporanarrative theorytext classificationdata curationDolma corpusnarrative dimensions

0 comments

The pith

Narrative qualities in web text can be measured at scale with an 11-dimension framework, revealing continuous multidimensional structure and unequal distribution across sources and topics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that core narrative elements can be quantified across trillions of tokens of heterogeneous web data by drawing on narrative theory to define 11 dimensions covering agency, setting, and events. They annotate a sample, train a classifier, and apply it to millions of passages to produce a new dataset showing these qualities form a continuous space rather than discrete categories. A sympathetic reader would care because narrative is a basic mode of communication yet current practices for assembling LLM training data neither track nor balance these features. The work supplies both the measurement method and evidence that narrative content varies systematically by source and topic.

Core claim

We present a framework spanning three core narrative elements operationalized as 11 interpretable dimensions, annotate 400 passages, finetune NarraBERT on them, and apply the model to 3 million passages from the Dolma corpus to create NarraDolma. This demonstrates that narrative structure is measurable at scale across extremely heterogeneous data, that a continuous multidimensional narrative structure underlies web text, and that narrative qualities are unequally distributed across pretraining sources and topics in ways current curation practices neither measure nor account for.

What carries the argument

NarraBERT, a RoBERTa-based model fine-tuned to predict 11 narrative dimensions on annotated web passages and then scaled to millions of examples.

If this is right

Narrative structure can be measured reliably at web scale even in highly varied text.
Web text contains a continuous, multidimensional narrative space rather than isolated categories.
Narrative qualities differ systematically across pretraining sources and topics.
Standard data curation for LLM training does not track or correct for these differences.
The released dataset and model enable direct study of how narrative composition influences model performance on narrative reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained on sources with higher narrative density in certain dimensions may develop stronger or weaker story-generation capabilities as a result.
Balancing training data by the 11 dimensions could serve as a testable intervention for improving narrative coherence in downstream applications.
The same measurement approach might reveal whether narrative imbalances correlate with other documented biases in web corpora, such as topic or demographic skew.

Load-bearing premise

The 11 dimensions drawn from narrative theory capture the essential narrative elements present in the full range of web text without systematic gaps or annotation artifacts.

What would settle it

Re-annotating a fresh diverse sample of 400 passages from the same sources with the 11 dimensions produces low inter-annotator agreement or yields dimensions that the trained model cannot predict above chance on held-out data.

Figures

Figures reproduced from arXiv: 2606.19468 by Andrew Piper, Elliott Ash, Maria Antoniak, Teagan Johnson.

**Figure 2.** Figure 2: “I [. . . ] immediately noticed” signals fo [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: “The interview room” and “the panel” provide [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Two randomly selected adjacent event trig [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Mean z-scored narrative features by category, ordered by hierarchical clustering on profile similarity. (∗) denotes original DOLMA sources, all other rows are Common Crawl topics assigned by WEBORGANIZER. of the full corpus), so a category label removes only a small fraction of narrative variance. The homogeneity that does exist is concentrated in the factual and technical categories (Wikipedia (0.68) an… view at source ↗

**Figure 7.** Figure 7: Proportion of each category’s documents in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies NarraDolma and NarraBERT as a first scalable measurement of narrative features in Dolma but leaves the annotation quality and model validation numbers out of the abstract.

read the letter

This paper introduces a framework for tagging narrative features in large web corpora and releases both a model and a labeled dataset from Dolma. The 11 dimensions cover agency, setting, and events, drawn from theory, and they scale the annotation from 400 passages to 3 million using a fine-tuned RoBERTa.

The release is the strongest part. NarraDolma lets others check or extend the work directly, and the application shows clear differences across sources and topics. That matches the claim that current curation does not track these properties.

The weak point is the lack of visible checks on the annotation and model. No inter-annotator agreement, no performance numbers on held-out data, and no sampling description appear in the abstract. That leaves open whether the dimensions work across all web text or embed biases from the small sample. The concern about missing elements like unreliable narration seems worth checking in review.

This paper is for researchers focused on pretraining data composition and its effects on model behavior. Anyone studying narrative reasoning in LLMs could use the dataset.

It deserves peer review because the measurement is new and the release makes it checkable. I'd recommend sending it with a request for the missing quantitative details.

Referee Report

3 major / 2 minor

Summary. The paper claims to deliver the first fine-grained characterization of narrative in the Dolma 3-trillion-token pretraining corpus. It introduces an 11-dimensional framework (operationalizing agency, setting, and events from narrative theory), annotates a diverse sample of 400 passages, fine-tunes and validates the RoBERTa-based NarraBERT model, scales the model to produce NarraDolma on 3M passages, and reports three findings: (i) narrative structure is measurable at scale in heterogeneous web data, (ii) web text exhibits a continuous, multidimensional narrative structure, and (iii) narrative qualities are unequally distributed across sources and topics in ways unaccounted for by current curation.

Significance. If the central measurements hold, the work supplies a concrete, reusable resource (NarraDolma and NarraBERT) for studying how narrative composition in pretraining data influences downstream narrative reasoning. The public release of both the annotated dataset and the fine-tuned model is a clear strength that enables reproducibility and follow-on experiments on data curation effects.

major comments (3)

[Annotation and Model Training] Annotation and validation section: the manuscript reports annotation of 400 passages and subsequent fine-tuning of NarraBERT but supplies no inter-annotator agreement statistics, per-dimension performance metrics on a held-out set, or sampling details that would establish the reliability of the 11-dimensional labels before scaling to 3M passages.
[Framework Design] Framework operationalization (early sections): the 11 dimensions are asserted to provide a sufficient characterization of narrative without any reported coverage analysis, inter-dimension correlation study, or comparison against alternative codings; this directly underpins the claim of uncovering an underlying “continuous, multidimensional narrative structure.”
[Distribution Analysis and Findings] Results on distribution (findings ii and iii): the reported inequalities across sources and topics rest entirely on NarraBERT predictions; absent error analysis or human re-annotation of a subsample from the 3M set, it is unclear whether the observed patterns reflect data properties or framework-specific artifacts.

minor comments (2)

[Abstract] The abstract states that NarraBERT is “finetune[d] and validate[d]” yet contains no numerical validation results; adding at least summary metrics would improve readability.
Notation for the 11 dimensions could be presented in a single table with example passages to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional analyses and details where feasible.

read point-by-point responses

Referee: [Annotation and Model Training] Annotation and validation section: the manuscript reports annotation of 400 passages and subsequent fine-tuning of NarraBERT but supplies no inter-annotator agreement statistics, per-dimension performance metrics on a held-out set, or sampling details that would establish the reliability of the 11-dimensional labels before scaling to 3M passages.

Authors: We agree these statistics strengthen the claims. The revised manuscript will include inter-annotator agreement (Cohen's kappa and percentage agreement) computed from the double-annotated subset, per-dimension precision/recall/F1 on the held-out test set, and expanded sampling details describing the stratified selection across sources and topics. revision: yes
Referee: [Framework Design] Framework operationalization (early sections): the 11 dimensions are asserted to provide a sufficient characterization of narrative without any reported coverage analysis, inter-dimension correlation study, or comparison against alternative codings; this directly underpins the claim of uncovering an underlying “continuous, multidimensional narrative structure.”

Authors: The 11 dimensions are derived directly from established narrative theory (agency, setting, events). In revision we will add (a) pairwise correlation analysis among the 11 dimensions on the annotated data, (b) a brief coverage discussion justifying the selection against narrative theory sources, and (c) a comparison of predictive performance against a reduced 4-dimension baseline to support the multidimensional claim. revision: yes
Referee: [Distribution Analysis and Findings] Results on distribution (findings ii and iii): the reported inequalities across sources and topics rest entirely on NarraBERT predictions; absent error analysis or human re-annotation of a subsample from the 3M set, it is unclear whether the observed patterns reflect data properties or framework-specific artifacts.

Authors: We will add an error analysis section that reports human re-annotation of a 200-passage random subsample drawn from the 3M NarraDolma set, including agreement rates with NarraBERT predictions and breakdown of discrepancies by source and topic. This will be used to qualify the distribution findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical measurement pipeline

full rationale

The paper's chain is: select 11 dimensions from narrative theory, annotate 400 passages, finetune RoBERTa-based NarraBERT, apply to 3M passages to produce NarraDolma, then report distributions. No equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz smuggled via prior work. All steps are externally falsifiable via the released dataset and model; the central claims about unequal distribution rest on the scaled outputs rather than reducing to the annotation inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; main domain assumption is that narrative theory supplies usable dimensions for web text. No free parameters or invented entities are described.

axioms (1)

domain assumption Narrative theory supplies a valid set of 11 dimensions for agency, setting, and events that apply to heterogeneous web text.
Framework is explicitly drawn from narrative theory and operationalized for the corpus.

pith-pipeline@v0.9.1-grok · 5744 in / 1256 out tokens · 34381 ms · 2026-06-26T20:48:03.772643+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · 2 internal anchors

[1]

The Llama 3 Herd of Models

Stories that heal: Characterizing and support- ing narrative for suicide bereavement. InProceed- ings of the International AAAI Conference on Web and Social Media, volume 18, pages 354–366. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Arjun Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Ll...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

InProceedings of the 19th Conference of the European Chapter of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 3786–3801

Narrabench: A comprehensive framework for narrative benchmarking. InProceedings of the 19th Conference of the European Chapter of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 3786–3801. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt
[3]

David Herman

Measuring massive multitask language under- standing. David Herman. 2009a.The Nexus of Narrative and Mind, chapter 6. John Wiley & Sons, Ltd. David Herman. 2009b.The Third Element; or, How to Build a Storyworld, chapter 5. John Wiley & Sons, Ltd. David Herman. 2011.Basic elements of narrative. John Wiley & Sons. Qian Liu, Xiaosen Zheng, Niklas Muennighoff...

2011
[4]

Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jack- son Sargent, and David Jurgens

Automated annotation with generative ai re- quires validation. Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jack- son Sargent, and David Jurgens. 2022. POTATO: The portable text annotation tool. InProceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing: System Demonstrations, pag...

2022
[5]

InProceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic

Narrative theory for computational narrative understanding. InProceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. V . Propp. 1968.Morphology of the Folktale: Second Edition. University of Texas Press. James Pustejovsky...

2021
[6]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

TimeBank 1.2. LDC2006T08. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susan- nah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446. John TE Richardson. 1975. Imagery, concreteness, and l...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

she couldn’t shake the feeling

Focalization How central is a specific character’s/narrator’s perspective to the text? A higher score reflects text in which the details and events are primarily presented through a specific character’s perspective: their perceptions, thoughts, and feelings shape how the story is told. A lower score reflects text that reports details and events from the o...
[8]

Completely paralyzed with fear, he was all of a sudden flooded with relief when he heard her voice

Emotion How central are a character’s emotional states to the text? A higher score reflects text in which a character’s emotional experience is a prominent feature. A lower score reflects text with little or no reference to how a character feels. • Score 5:“Completely paralyzed with fear, he was all of a sudden flooded with relief when he heard her voice”...
[9]

She read the email twice. It wasn’t entirely clear what they were asking for, but she thought she understood the gist. She drafted a reply

Cognition How central are a character’s thoughts, reasoning, or motivations to the text? A higher score reflects text in which a character’s thoughts, beliefs, reasoning, goals, or desires are a prominent feature. A lower score reflects text with little or no reference to what a character thinks, intends, or wants. • Score 5:“she kept turning the problem ...
[10]

by the time she reached the door, something in her had shifted. She wasn’t the same person who had walked in

Change of State How central is a change in a character’s condition or state to the text? A higher score reflects text in which a change in a character’s condition/state is a prominent or organizing feature. A lower score reflects text where characters remain essentially unchanged. The change does not need to be completed within the passage; an in-progress...
[11]

they had been arguing for hours, neither willing to give ground. The people around them started to stare. They were so caught up in the fight that they didn’t even notice

Conflict How central is conflict involving characters to the text? A higher score reflects text in which conflict is a dominant or organizing feature. A lower score reflects text in which conflict is absent or only incidentally present. Conflict can take many forms: tension between characters, internal psychological struggle, or opposition to institutions...
[12]

the chipped blue mug sat on the edge of the sink, handle cracked from years of use

Concreteness How concrete is the language of the text? A higher score reflects text in which most content could be explained by pointing, demonstrating, or showing (objects, physical states, bodily actions, perceptual qualities). A lower score reflects text in which most content can only be explained using other words (categories, principles, relationship...
[13]

It was the summer of 2015 at 9:02am, before the world changed forever

Temporal Grounding How strongly does the text create a sense of being anchored in a particular time? A higher score reflects text that creates a vivid sense of temporal location. A lower score reflects text that feels as if events could be taking place at any time. Two types contribute to the score:historical groundinglocates the reader in a specific, unr...

2015
[14]

In the narrow streets of Naples, the apartment’s tiled floors kept cool even in August heat

Spatial Grounding How strongly does the text create a sense of being anchored in a particular place? A higher score reflects text that creates a vivid sense of spatial location. A lower score reflects text that feels as if events could be taking place anywhere. Two types contribute to the score:geographic groundinglocates the reader on the globe (a countr...
[15]

the bread was still warm, its crust crackling under her fingers, the whole kitchen thick with the smell of it

Sensory How central are sensory details to the text? A higher score reflects text in which one or more senses are central to how the passage is organized, where the sensory experience drives, anchors, or sustains the content. A lower score reflects text where events and ideas are conveyed without meaningful appeal to the senses. What matters is both wheth...

[1] [1]

The Llama 3 Herd of Models

Stories that heal: Characterizing and support- ing narrative for suicide bereavement. InProceed- ings of the International AAAI Conference on Web and Social Media, volume 18, pages 354–366. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Arjun Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Ll...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

InProceedings of the 19th Conference of the European Chapter of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 3786–3801

Narrabench: A comprehensive framework for narrative benchmarking. InProceedings of the 19th Conference of the European Chapter of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 3786–3801. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

[3] [3]

David Herman

Measuring massive multitask language under- standing. David Herman. 2009a.The Nexus of Narrative and Mind, chapter 6. John Wiley & Sons, Ltd. David Herman. 2009b.The Third Element; or, How to Build a Storyworld, chapter 5. John Wiley & Sons, Ltd. David Herman. 2011.Basic elements of narrative. John Wiley & Sons. Qian Liu, Xiaosen Zheng, Niklas Muennighoff...

2011

[4] [4]

Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jack- son Sargent, and David Jurgens

Automated annotation with generative ai re- quires validation. Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jack- son Sargent, and David Jurgens. 2022. POTATO: The portable text annotation tool. InProceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing: System Demonstrations, pag...

2022

[5] [5]

InProceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic

Narrative theory for computational narrative understanding. InProceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. V . Propp. 1968.Morphology of the Folktale: Second Edition. University of Texas Press. James Pustejovsky...

2021

[6] [6]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

TimeBank 1.2. LDC2006T08. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susan- nah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446. John TE Richardson. 1975. Imagery, concreteness, and l...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

she couldn’t shake the feeling

Focalization How central is a specific character’s/narrator’s perspective to the text? A higher score reflects text in which the details and events are primarily presented through a specific character’s perspective: their perceptions, thoughts, and feelings shape how the story is told. A lower score reflects text that reports details and events from the o...

[8] [8]

Completely paralyzed with fear, he was all of a sudden flooded with relief when he heard her voice

Emotion How central are a character’s emotional states to the text? A higher score reflects text in which a character’s emotional experience is a prominent feature. A lower score reflects text with little or no reference to how a character feels. • Score 5:“Completely paralyzed with fear, he was all of a sudden flooded with relief when he heard her voice”...

[9] [9]

She read the email twice. It wasn’t entirely clear what they were asking for, but she thought she understood the gist. She drafted a reply

Cognition How central are a character’s thoughts, reasoning, or motivations to the text? A higher score reflects text in which a character’s thoughts, beliefs, reasoning, goals, or desires are a prominent feature. A lower score reflects text with little or no reference to what a character thinks, intends, or wants. • Score 5:“she kept turning the problem ...

[10] [10]

by the time she reached the door, something in her had shifted. She wasn’t the same person who had walked in

Change of State How central is a change in a character’s condition or state to the text? A higher score reflects text in which a change in a character’s condition/state is a prominent or organizing feature. A lower score reflects text where characters remain essentially unchanged. The change does not need to be completed within the passage; an in-progress...

[11] [11]

they had been arguing for hours, neither willing to give ground. The people around them started to stare. They were so caught up in the fight that they didn’t even notice

Conflict How central is conflict involving characters to the text? A higher score reflects text in which conflict is a dominant or organizing feature. A lower score reflects text in which conflict is absent or only incidentally present. Conflict can take many forms: tension between characters, internal psychological struggle, or opposition to institutions...

[12] [12]

the chipped blue mug sat on the edge of the sink, handle cracked from years of use

Concreteness How concrete is the language of the text? A higher score reflects text in which most content could be explained by pointing, demonstrating, or showing (objects, physical states, bodily actions, perceptual qualities). A lower score reflects text in which most content can only be explained using other words (categories, principles, relationship...

[13] [13]

It was the summer of 2015 at 9:02am, before the world changed forever

Temporal Grounding How strongly does the text create a sense of being anchored in a particular time? A higher score reflects text that creates a vivid sense of temporal location. A lower score reflects text that feels as if events could be taking place at any time. Two types contribute to the score:historical groundinglocates the reader in a specific, unr...

2015

[14] [14]

In the narrow streets of Naples, the apartment’s tiled floors kept cool even in August heat

Spatial Grounding How strongly does the text create a sense of being anchored in a particular place? A higher score reflects text that creates a vivid sense of spatial location. A lower score reflects text that feels as if events could be taking place anywhere. Two types contribute to the score:geographic groundinglocates the reader on the globe (a countr...

[15] [15]

the bread was still warm, its crust crackling under her fingers, the whole kitchen thick with the smell of it

Sensory How central are sensory details to the text? A higher score reflects text in which one or more senses are central to how the passage is organized, where the sensory experience drives, anchors, or sustains the content. A lower score reflects text where events and ideas are conveyed without meaningful appeal to the senses. What matters is both wheth...