Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction
Pith reviewed 2026-05-10 15:22 UTC · model grok-4.3
The pith
BERT embeddings encode extractable information on time, space, causality, and character in fiction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BERT embeddings encode meaningful narrative information about time, space, causality, and character: a linear probe reaches 94 percent accuracy on token-level classification of these dimensions, far above the 47 percent baseline of a probe on variance-matched random embeddings, and attains a macro-average recall of 0.83 under balanced class weighting, although unsupervised clustering aligns with the categories only near-randomly.
What carries the argument
Linear probe classifier applied to BERT token embeddings for predicting narrative dimension labels at the word level.
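The probe design can be sketched in a few lines of scikit-learn (the library the paper itself uses, per its reference list). The embeddings below are synthetic stand-ins with an injected linear class signal, not the paper's data; the dimension names and token count are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
DIM, N_TOKENS = 768, 2000  # BERT-base hidden size; token count is illustrative
LABELS = ["time", "space", "causality", "character", "others"]

# Stand-in for contextual token embeddings: each class is shifted along a random
# direction, so a linear decision boundary exists (mimicking linearly decodable
# structure in the real embeddings).
y = rng.integers(0, len(LABELS), size=N_TOKENS)
class_dirs = rng.normal(size=(len(LABELS), DIM))
X = rng.normal(size=(N_TOKENS, DIM)) + 3.0 * class_dirs[y]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A linear probe is just multinomial logistic regression on frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

The same script rerun with `X` replaced by variance-matched noise is the control condition: any accuracy the probe retains there reflects generic statistics rather than content.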
If this is right
- BERT embeddings contain extractable signals for narrative dimensions that enable token-level classification at high accuracy.
- Narrative categories exhibit boundary leakage, with rare dimensions such as causality and space frequently misclassified into the 'others' category.
- Unsupervised clustering of the embeddings recovers the predefined narrative categories only at near-random levels with an ARI of 0.081.
- Balanced class weighting allows the probe to achieve a macro-average recall of 0.83 across all narrative dimensions.
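The balanced-weighting claim can be sketched as follows, assuming the probe is a multinomial logistic regression (standard for linear probes). The imbalanced toy labels mimic rare categories like causality and space but are not the paper's class distribution, and recall is computed on the training set for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
# Imbalanced toy labels: 'others' dominates, mimicking rare narrative tokens.
counts = {"others": 700, "character": 150, "time": 100, "space": 30, "causality": 20}
y = np.concatenate([np.full(n, i) for i, n in enumerate(counts.values())])
dirs = rng.normal(size=(5, 64))
X = rng.normal(size=(len(y), 64)) + 2.0 * dirs[y]

# class_weight='balanced' reweights each class by n_samples / (n_classes * n_c),
# so rare classes contribute as much to the loss as frequent ones.
probe = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
macro_recall = recall_score(y, probe.predict(X), average="macro")
print(f"macro-average recall: {macro_recall:.2f}")
```

Without `class_weight="balanced"`, the fitted boundary tends to absorb rare classes into the majority class, which is one mechanism behind the boundary leakage the review describes.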
Where Pith is reading between the lines
- Pretraining on large text collections may allow models to acquire implicit representations of storytelling mechanics without explicit supervision.
- Narrative encoding could be partially entangled with syntactic patterns, which would require separate controls such as part-of-speech baselines to isolate.
- Layer-wise probing on the same data could reveal at which depths these narrative features become most accessible.
Load-bearing premise
The token labels produced with LLM assistance accurately capture the narrative dimensions without systematic bias or annotation errors.
What would settle it
If a linear probe trained on the same BERT embeddings but using manually verified labels instead of LLM-generated ones drops to near the 47 percent random-embedding baseline, the claim that the embeddings encode these narrative dimensions would fail.
Original abstract
Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics -- time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus "others." A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals "Boundary Leakage," where rare dimensions are systematically misclassified as "others." Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that BERT embeddings encode narrative dimensions (time, space, causality, character) in fiction. It uses LLM-assisted annotation to create a token-level dataset with five classes (including 'others'), trains a linear probe achieving 94% accuracy (vs. 47% on variance-matched random embeddings), and reports macro-average recall of 0.83 under balanced class weighting. Clustering yields low alignment with the labels (ARI=0.081), which the authors interpret as evidence of encoding without discrete separability. They note boundary leakage into the 'other' class and outline future work including a POS baseline.
Significance. If the central result holds after validation of the labels, the work would demonstrate that pre-trained contextual embeddings capture multidimensional narrative semantics beyond generic statistics, with the variance-matched random control providing a useful baseline to isolate learned representations. This could inform downstream tasks in story understanding and generation. The empirical probe-vs-control design is a clear strength, though the low clustering ARI and planned but absent POS baseline limit immediate claims about narrative-specific encoding independent of syntax.
major comments (2)
- [Abstract] The 94% probe accuracy and claim of 'meaningful narrative information' depend on the LLM-generated token labels being a faithful proxy for the narrative dimensions. No dataset size, inter-annotator agreement, human validation, or error analysis of the automated annotations is reported, leaving open the possibility that the linear probe recovers the LLM annotator's systematic biases or decision boundaries (including the noted 'Boundary Leakage' into 'others') rather than intrinsic properties of BERT embeddings. The random-embedding control rules out generic variance but does not address label noise.
- [Clustering analysis] The reported ARI of 0.081 (near-random) is presented as showing that dimensions are 'encoded but not as discretely separable clusters.' This interpretation is reasonable but weakens the broader significance of the encoding claim, as it suggests the information may be distributed in ways that do not support the narrative categories as coherent, recoverable structures without supervision.
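One plausible construction of the variance-matched random control discussed above (the paper's exact recipe is not specified here) is to sample Gaussian noise whose per-dimension mean and variance match the real embeddings, destroying all content while preserving first- and second-order statistics:

```python
import numpy as np

def variance_matched_random(X, rng):
    """Random embeddings matching X's per-dimension mean and variance,
    with any structure linking rows to labels destroyed."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return rng.normal(loc=mu, scale=sigma, size=X.shape)

rng = np.random.default_rng(0)
# Toy "BERT" embeddings with deliberately varied per-dimension scales.
X = rng.normal(size=(1000, 8)) * np.arange(1, 9)
X_ctrl = variance_matched_random(X, rng)

# Moments match, so any above-chance probe accuracy on X_ctrl would indicate
# the probe exploits generic statistics rather than encoded content.
print(np.allclose(X.std(axis=0), X_ctrl.std(axis=0), rtol=0.1))
```

As the referee notes, this control rules out variance artifacts but says nothing about whether the labels themselves are noisy.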
minor comments (2)
- [Abstract] The abstract omits the total number of tokens or documents in the dataset and the specific LLM/prompting details used for annotation, which are needed to assess the scale and reproducibility of the results.
- A table reporting per-class support (number of tokens) alongside the recall figures would help contextualize performance on rare categories such as causality (recall 0.75) and space (recall 0.66).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, acknowledging limitations where they exist and outlining specific revisions.
Point-by-point responses
-
Referee: [Abstract] The 94% probe accuracy and claim of 'meaningful narrative information' depend on the LLM-generated token labels being a faithful proxy for the narrative dimensions. No dataset size, inter-annotator agreement, human validation, or error analysis of the automated annotations is reported, leaving open the possibility that the linear probe recovers the LLM annotator's systematic biases or decision boundaries (including the noted 'Boundary Leakage' into 'others') rather than intrinsic properties of BERT embeddings. The random-embedding control rules out generic variance but does not address label noise.
Authors: We agree that the validity of the LLM-generated labels is foundational to interpreting the probe results and that the manuscript currently lacks key details on the annotation process. The paper does not report dataset size, inter-annotator agreement, human validation, or a dedicated error analysis, which leaves open the possibility of recovering annotator-specific patterns. We will revise the Methods and Results sections to report the exact dataset size and class distribution, describe the LLM prompting procedure in full, and include an error analysis focused on boundary leakage cases. We will also add a small human validation study on a held-out subset and report agreement statistics. While the variance-matched random embedding baseline rules out generic variance effects, we acknowledge it does not address label noise and will explicitly discuss this limitation along with its implications for the claims. revision: partial
-
Referee: [Clustering analysis] The reported ARI of 0.081 (near-random) is presented as showing that dimensions are 'encoded but not as discretely separable clusters.' This interpretation is reasonable but weakens the broader significance of the encoding claim, as it suggests the information may be distributed in ways that do not support the narrative categories as coherent, recoverable structures without supervision.
Authors: We agree that an ARI of 0.081 indicates near-random alignment and that this finding limits claims about the narrative categories forming coherent, unsupervised structures. Our manuscript already presents this result as evidence that the dimensions are encoded in a distributed rather than discretely clustered manner. This does not contradict the probe results but qualifies the nature of the encoding. We will expand the discussion section to elaborate on this distinction, compare it to other distributed semantic features in embeddings, and clarify the implications for downstream narrative tasks that may require supervision. revision: no
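The clustering check itself is a short computation: k-means on the embeddings, then the adjusted Rand index against the labels. The synthetic data below encodes labels weakly but linearly while a stronger unlabeled structure dominates — one assumed mechanism (not a finding of the paper) by which high probe accuracy can coexist with near-zero ARI.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
y = rng.integers(0, 5, size=1500)                  # narrative-dimension labels
nuisance = rng.integers(0, 3, size=1500)           # dominant non-narrative structure
X = (0.5 * rng.normal(size=(5, 32))[y]             # weak, linearly decodable label signal
     + 4.0 * rng.normal(size=(3, 32))[nuisance]    # strong nuisance clusters
     + rng.normal(size=(1500, 32)))                # per-token noise

# Unsupervised clustering keys on the dominant variation, not the labels.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, clusters)
print(f"ARI vs. narrative labels: {ari:.3f}")      # near 0, as with the paper's 0.081
```

An ARI near zero on such data is compatible with a linear probe still separating the classes, since the probe can ignore the nuisance directions that k-means cannot.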
Circularity Check
No significant circularity; results are measured empirical outcomes
Full rationale
The paper constructs a token-level dataset via LLM-assisted annotation for narrative categories and then trains a linear probe on BERT embeddings to predict those fixed labels, reporting 94% accuracy against a 47% variance-matched random embedding baseline. This performance gap is an observed measurement on held-out data rather than a quantity derived by construction from the inputs. No equations reduce the probe accuracy to the annotation process itself, no self-citations form a load-bearing chain, and the random control is an independent statistical benchmark. The analysis rests on independent controls, with no self-definitional loops and no fitted inputs renamed as predictions.
Axiom & Free-Parameter Ledger
free parameters (1)
- balanced class weights
axioms (2)
- domain assumption: Linear probes can extract semantic information encoded in pre-trained embeddings
- ad hoc to paper: LLM-generated token labels accurately represent narrative dimensions
Reference graph
Works this paper leans on
- [1] Y. Belinkov, Probing classifiers: Promises, shortcomings, and advances, Computational Linguistics 48 (2022) 207–219.
- [2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
- [3] I. Tenney, D. Das, E. Pavlick, What do you learn from context? Probing for sentence structure in contextualized word representations, arXiv preprint arXiv:1905.06316 (2019). URL: https://arxiv.org/abs/1905.06316.
- [4] A. Piper, R. J. So, D. Bamman, Narrative theory for computational narrative understanding, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 298–311.
- [5] G. Genette, Narrative Discourse: An Essay in Method, volume 3, Cornell University Press, 1983.
- [6] P. Törnberg, Best practices for text annotation with large language models, arXiv preprint arXiv:2402.05129 (2024). URL: https://doi.org/10.48550/arXiv.2402.05129.
- [7] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al., OpenAI GPT-5 system card, arXiv preprint arXiv:2601.03267 (2025).
- [8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, 2018.
- [9] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 2733–2743.
- [10] J. Austen, Pride and Prejudice, Project Gutenberg EBook No. 1342, HTML edition, 1813. URL: https://www.gutenberg.org/files/1342/1342-h/1342-h.htm, accessed: 10 Dec 2025.
- [11] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27.
- [12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
- [13] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1989) 503–528.
- [14] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
- [15] K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019.
- [16] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651–3657.