arxiv: 2511.07752 · v3 · submitted 2025-11-11 · 💻 cs.CL

Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production

Shiva Upadhye, Richard Futrell

Pith reviewed 2026-05-18 00:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords contextual predictability · language production · incremental planning · substitution errors · phonetic reduction · information theory · future context · backward predictability


The pith

A new measure of information shared between a word and its future context, given past context, best predicts substitution errors and phonetic reduction in speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that both past and future context predictability shape how speakers choose and encode words during incremental production. It introduces an information-theoretic measure that quantifies the information a word shares with upcoming context while respecting constraints from what has already been produced. Across studies of naturalistic speech, this measure accounts for unique variance in phonetic reduction and emerges as the strongest predictor of which words surface as substitution errors, outperforming simple backward predictability. A sympathetic reader would care because the results suggest speakers actively balance lookahead with prior commitments when planning utterances in real time.

Core claim

The central claim is that past-conditioned predictability increases the likelihood of a word appearing as a substitution error while future-conditioned predictability reduces it. Within a generative framework modeling lexical, contextual, and communicative influences on word choice, the authors' proposed measure of shared information between a word and future context under past constraints subsumes backward predictability and serves as the strongest contextual predictor of error identity. Error-type analysis further reveals graded trade-offs in how speakers prioritize form, meaning, and context information during lexical planning.

What carries the argument

The central object is the proposed information-theoretic measure that quantifies the information shared between a word and its future context under the constraints imposed by the past context.
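As a rough illustration of what this measure computes, the sketch below evaluates the conditional PMI, log p(word | past, future) minus log p(word | past), on a tiny hand-made joint distribution. The distribution `JOINT`, its numbers, and the function name `conditional_pmi` are hypothetical and chosen only for illustration; the paper estimates these probabilities from language models, not from a lookup table.

```python
import math

# Hypothetical toy joint distribution p(past, word, future).
# The numbers are invented for illustration and are not from the paper.
JOINT = {
    ("the", "dog", "barked"): 0.30,
    ("the", "dog", "slept"): 0.10,
    ("the", "cat", "barked"): 0.05,
    ("the", "cat", "slept"): 0.25,
    ("a", "dog", "barked"): 0.10,
    ("a", "dog", "slept"): 0.05,
    ("a", "cat", "barked"): 0.05,
    ("a", "cat", "slept"): 0.10,
}


def conditional_pmi(joint, past, word, future):
    """Return log[p(word | past, future) / p(word | past)] in nats."""
    # Marginals obtained by summing the joint over the unused variables.
    p_past = sum(p for (c, _, _), p in joint.items() if c == past)
    p_past_word = sum(
        p for (c, w, _), p in joint.items() if c == past and w == word
    )
    p_past_future = sum(
        p for (c, _, f), p in joint.items() if c == past and f == future
    )
    p_all = joint[(past, word, future)]

    p_word_given_both = p_all / p_past_future  # p(word | past, future)
    p_word_given_past = p_past_word / p_past   # p(word | past)
    return math.log(p_word_given_both / p_word_given_past)


# The future "barked" makes "dog" more likely than the past alone does,
# so the value is positive (log 1.5, about 0.405).
print(round(conditional_pmi(JOINT, "the", "dog", "barked"), 3))
# The future "slept" makes "dog" less likely, so the value is negative
# (log 0.5, about -0.693).
print(round(conditional_pmi(JOINT, "the", "dog", "slept"), 3))
```

A positive value means the upcoming context raises the word's probability beyond what the past context already supports; a value of zero means the future adds nothing once the past is known.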

If this is right

Past-conditioned predictability increases the likelihood that a word surfaces as a substitution error.
Future-conditioned predictability decreases the likelihood that a word surfaces as a substitution error.
The proposed measure explains additional variance in phonetic reduction beyond what backward predictability alone accounts for.
Speakers exhibit graded trade-offs among form, meaning, and context information when selecting words during incremental planning.
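The opposite signs in the first two points can be pictured with a minimal logistic sketch. This is not the authors' generative framework: the function name `error_probability`, the intercept, and both coefficients are hypothetical placeholders whose only purpose is to carry the reported signs (positive for past-conditioned predictability, negative for future-conditioned predictability).

```python
import math


def error_probability(past_logprob, future_cpmi,
                      b0=-2.0, b_past=0.8, b_future=-0.6):
    """Logistic probability that a candidate word surfaces as the error.

    All coefficients are hypothetical: b_past > 0 and b_future < 0 mirror
    only the direction of the effects described in the text, not their size.
    """
    z = b0 + b_past * past_logprob + b_future * future_cpmi
    return 1.0 / (1.0 + math.exp(-z))


# Same candidate, varying one predictor at a time.
baseline = error_probability(past_logprob=-3.0, future_cpmi=0.0)
more_past = error_probability(past_logprob=-1.0, future_cpmi=0.0)
more_future = error_probability(past_logprob=-3.0, future_cpmi=2.0)

# Higher past predictability raises the error probability;
# higher future-conditioned predictability lowers it.
print(more_past > baseline > more_future)
```

Under these assumed signs, a candidate that fits the preceding words well but is poorly supported by the upcoming words is the most likely to intrude as an error.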

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results imply that production models should explicitly represent conditional future predictability to capture human-like error patterns and reduction.
This approach could extend to predicting other planning phenomena such as pauses or self-repairs in longer utterances.
Testing the measure in written production or across languages would clarify whether the past-future tradeoff is specific to spoken incremental planning.

Load-bearing premise

The generative framework accurately captures the separate influences of lexical, contextual, and communicative factors on word choice without circularity in how error likelihood is modeled.

What would settle it

A new corpus of naturalistic speech errors in which the proposed measure no longer outperforms backward predictability when predicting error identity after controlling for other factors would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.07752 by Richard Futrell, Shiva Upadhye.

**Figure 2.** Effects of communicative intent and context-based information sources on lexical planning

**Figure 3.** An illustration of the information-processing dependencies between the speaker’s message…

**Figure 4.** Overview of the process for estimating contextual predictability variables from a custom…

**Figure 5.** (a) Estimated effect sizes for all probabilistic predictors from models with relative backward predictability and conditional PMI as formulations of future context predictability. Relative backward predictability assumes independence between past and future, whereas conditional PMI assumes conditional dependence. Error bars denote standard error. p < 0.001 (***), p < 0.01 (**), p < 0.05 (*), p > 0.05 (ns)…

**Figure 6.** The modulatory effect of lexical category (function/content) on probabilistic reduction.

**Figure 7.** An illustration of the process for computing: …

**Figure 8.** (a) Log-odds coefficients from logistic regression models with relative backward predictability and conditional PMI as operationalizations of future context predictability. Relative backward predictability assumes independence between past and future, whereas conditional PMI assumes conditional dependence. Error bars denote standard error. Positive coefficients indicate that an increase in the predictor …

**Figure 9.** Log-odds coefficients from substitution models with relative backward predictability and…

**Figure 10.** Pairwise Pearson correlations between forward and backward predictability estimates from…

**Figure 11.** (a) Regression coefficients of models with backward predictability, relative backward pre…

**Figure 12.** Pairwise Pearson correlation coefficients between all probabilistic variables for all words in…

**Figure 13.** Comparisons between models with (i) relative backward predictability only, (ii) conditional…

Original abstract

Contextual predictability shapes how we choose and encode words in production. The effects of a word's predictability given preceding or past context are generally well-understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly-understood yet robust backward predictability effect of a word given only its future context, which may be linked to future planning. Across two studies of naturalistic speech, we revisit backward predictability using improved operationalizations, introducing a conceptually motivated information-theoretic measure that quantifies the information shared between a word and future context under the constraints imposed by the past context. Study 1 shows that this measure produces effects qualitatively similar to backward predictability while explaining unique variance in phonetic reduction. Study 2 examines substitution errors within a generative framework that models lexical, contextual, and communicative influences on word choice to predict the identity of the word that surfaces as an error. Within this framework, we find that past-conditioned predictability increases error likelihood, whereas future-conditioned predictability reduces it. Further, our proposed measure emerges as the strongest contextual predictor of error identity, subsuming backward predictability. Analysis of error types further reveals graded trade offs in how speakers prioritize form-, meaning-, and context-based information during lexical planning. Together, these findings illuminate how past and future context shape word choice and encoding, linking contextual predictability to mechanisms of incremental planning in sentence production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a new past-conditioned mutual information measure between a word and future context, then shows it explains unique variance in reduction and ranks as the top predictor of substitution error identity in a generative model.


The core advance here is a measure that quantifies shared information between a word and its future context after conditioning on the past context. It is not just a re-labeling of backward predictability, and the authors test it directly on naturalistic speech data for both phonetic reduction and substitution errors. Study 1 finds qualitative similarity to prior backward effects plus unique variance explained. Study 2 places the measure inside a generative framework for error identity and reports it as the strongest contextual predictor while subsuming the simpler backward version. They also note graded trade-offs across form, meaning, and context in error types. That combination of a motivated information-theoretic quantity with error data is the part worth paying attention to.

The work sits squarely in the incremental planning literature and gives a concrete way to separate past and future influences without requiring full lookahead. The naturalistic corpus and the attempt to model lexical, contextual, and communicative factors together are clear strengths.

The main soft spot is the generative error model in Study 2. The abstract describes it as capturing separate influences on word choice, but it is not obvious from the summary whether the contextual predictors are estimated independently of the target measure or whether the model structure already embeds similar information-theoretic terms. If the latter, the reported superiority could be partly definitional rather than a strong test. Full methods, exact model specification, and any held-out estimation details would settle this. Data exclusion rules and the precise statistical tests are also missing from the abstract, which limits how far the quantitative claims can be taken at face value.

This is the kind of paper that belongs in a reading group for people who track predictability effects in production. Readers who already work with information theory or error corpora will extract the most value. The central idea is coherent and the application to errors is new enough that it deserves a serious referee rather than a desk reject, even if the model details need tightening in revision.

Referee Report

1 major / 2 minor

Summary. The manuscript presents two studies on naturalistic speech examining how past and future context predictability influence word choice and encoding in language production. It proposes a new information-theoretic measure of the information shared between a word and its future context under past context constraints. Study 1 shows this measure has effects on phonetic reduction similar to backward predictability but explains unique variance. Study 2 employs a generative framework modeling lexical, contextual, and communicative factors to predict substitution error identities, finding that the proposed measure is the strongest contextual predictor, subsuming backward predictability, and identifying graded trade-offs in prioritizing form, meaning, and context information.

Significance. If the results hold, this work significantly advances the field by providing a conceptually motivated measure that better captures the interplay between past and future contexts in incremental planning. It strengthens the link between predictability effects and production mechanisms, with potential to inform computational models of language production and comprehension. The use of naturalistic data and error analysis adds ecological validity.

major comments (1)

Study 2: The generative framework is described as modeling lexical, contextual, and communicative influences on word choice to predict error identity. To establish that the proposed measure is the strongest predictor without circularity, the manuscript should clarify whether the contextual influences in the base generative model are estimated independently of the information-theoretic measures being compared (past-conditioned predictability, future-conditioned predictability, and the proposed shared-information measure). If the base model already incorporates similar quantities, the ranking of predictors would not constitute an independent test of the measure's superiority.

minor comments (2)

Abstract: The abstract reports qualitative similarity, unique variance explained, and strongest predictor status but omits specific statistical details, data exclusion rules, or full model specifications, which limits immediate verifiability of the central claims.
Throughout: Notation for the proposed measure (past-conditioned mutual information) and related quantities should be defined more explicitly with equations early in the methods to aid reader comprehension of how it differs from standard backward predictability.
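One way the requested notation could be laid out is sketched below. The conditional PMI line follows the formula quoted from the paper in the Lean theorems section of this page; writing the other two quantities as log conditional probabilities is an assumption about the paper's conventions, not a quotation.

```latex
\begin{align*}
\text{past-conditioned predictability:}\quad
  & \log p(w_t \mid C_{<t}) \\
\text{backward predictability:}\quad
  & \log p(w_t \mid C_{>t}) \\
\text{conditional PMI:}\quad
  & \mathrm{PMI}(w_t; C_{>t} \mid C_{<t})
    = \log \frac{p(w_t \mid C_{>t}, C_{<t})}{p(w_t \mid C_{<t})}
\end{align*}
```

Set side by side, the difference is visible at a glance: backward predictability conditions on the future context alone, whereas conditional PMI measures how much the future context changes the word's probability once the past context is already taken into account.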

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and positive review of our manuscript. We address the major comment on Study 2 below.


Referee: Study 2: The generative framework is described as modeling lexical, contextual, and communicative influences on word choice to predict error identity. To establish that the proposed measure is the strongest predictor without circularity, the manuscript should clarify whether the contextual influences in the base generative model are estimated independently of the information-theoretic measures being compared (past-conditioned predictability, future-conditioned predictability, and the proposed shared-information measure). If the base model already incorporates similar quantities, the ranking of predictors would not constitute an independent test of the measure's superiority.

Authors: We are grateful to the referee for pointing out the need for clarification on this matter to avoid any perception of circularity. In the generative framework of Study 2, the base model includes predictors for lexical factors (word frequency, length, and phonological properties), communicative factors (such as overall error rates and speaker-specific tendencies), and basic contextual influences captured through corpus-derived n-gram probabilities. These base contextual influences are estimated independently and do not incorporate the specific information-theoretic quantities under comparison, namely the past-conditioned predictability, future-conditioned predictability, or our proposed measure of shared information between the word and future context given the past. The latter measures are derived from more advanced language model-based estimates and are included as additional predictors in the model. We have revised the manuscript to explicitly describe this separation in the Methods section for Study 2, including details on how each component is computed, thereby confirming that the superior performance of the proposed measure represents an independent test.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; measures derived independently from information theory

full rationale

The paper defines its proposed shared-information measure directly from standard information-theoretic quantities on context distributions (past-conditioned mutual information between word and future context). Study 2's generative framework is presented as separately modeling lexical, contextual, and communicative influences on word choice to predict error identity, with the proposed measure then evaluated as a predictor within that framework. No equations or descriptions in the provided text reduce the target measure or its superiority to a fitted input, self-definition, or self-citation chain; the contextual components are not shown to embed the exact proposed quantity by construction. The derivation remains self-contained against external benchmarks of predictability and error modeling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard information-theoretic definitions of shared information and on the validity of a generative model separating lexical, contextual, and communicative factors; no new entities are postulated.

free parameters (1)

parameters in generative error model
The framework models influences on word choice and error likelihood, implying fitted parameters whose exact count and fitting procedure are not detailed in the abstract.

axioms (1)

standard math: Information theory can quantify shared information between a word and future context under past constraints
Invoked when defining the new measure that quantifies information shared between a word and future context under the constraints imposed by the past context.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


we propose a principled alternative to backward predictability, based on the conditional PMI of the current word $w_t$ and the future context $C_{>t}$ given the past context $C_{<t}$: conditional $\mathrm{PMI}(w_t; C_{>t} \mid C_{<t}) = \log \frac{p(w_t \mid C_{>t}, C_{<t})}{p(w_t \mid C_{<t})}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


Study 2 examines substitution errors within a generative framework that models lexical, contextual, and communicative influences on word choice
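For readers checking the conditional PMI definition in the first passage above, the short derivation below rewrites it using only the definition of conditional probability. It is a standard identity and a reader's aid, not a step taken from the paper.

```latex
\begin{align*}
\mathrm{PMI}(w_t; C_{>t} \mid C_{<t})
  &= \log \frac{p(w_t \mid C_{>t}, C_{<t})}{p(w_t \mid C_{<t})} \\
  &= \log \frac{p(w_t, C_{>t} \mid C_{<t})}
               {p(w_t \mid C_{<t})\, p(C_{>t} \mid C_{<t})} \\
  &= \log \frac{p(C_{>t} \mid w_t, C_{<t})}{p(C_{>t} \mid C_{<t})}
\end{align*}
```

The middle line shows that the quantity is symmetric in the word and the future context, and that it equals zero exactly when the two are conditionally independent given the past. The last line gives an equivalent reading: how much more likely the upcoming context becomes once the current word has been chosen.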

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
