pith. machine review for the scientific record.

arxiv: 2604.02699 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: LLM reasoning · vocabulary constraints · E-Prime · output regularization · cognitive restructuring · replication · prompt engineering

The pith

Banning filler words like 'very' and 'just' improves LLM reasoning more than banning the verb 'to be'.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper replicates a claim that E-Prime, by removing the verb 'to be', restructures how language models reason via specific vocabulary-cognition links. With active controls, including a neutral ban on words like 'very' and 'just', every treatment beat the unconstrained baseline across six models and seven tasks. The neutral filler ban produced the biggest gain while E-Prime produced the smallest, ranking the conditions in exact reverse order of their theoretical depth. The pattern points to a simpler account in which any output constraint acts as a regularizer that disrupts the model's default patterns, which are fluent but shallow.

Core claim

The cognitive restructuring hypothesis was disconfirmed. All four constraints outperformed the unconstrained control (83.0%), with the neutral filler-word ban at +6.7 points, followed by No-Have, the metacognitive prompt, and E-Prime at +3.7 points. The cross-model correlation signature failed to replicate. These outcomes support the view that any constraint forcing deviation from default generation paths improves reasoning by reducing shallow responses, with shallower constraints working best because they add monitoring load without deep conceptual interference.

What carries the argument

The output regularizer: any constraint that forces a model off its default generation path, disrupting fluent but shallow response patterns.
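To make the treatment arms concrete, here is a minimal Python sketch of how a vocabulary ban can be imposed and compliance-checked. The word lists and instruction wording are illustrative assumptions, not the paper's materials; the abstract confirms only that E-Prime bans forms of "to be" and that the filler ban covers words like "very" and "just".

```python
# Illustrative sketch of ban conditions and compliance filtering (not the
# paper's code). Word lists beyond "to be", "have", "very", "just" are assumed.
import re

BANS = {
    "e_prime": {"be", "is", "are", "was", "were", "am", "been", "being"},
    "no_have": {"have", "has", "had", "having"},
    "filler": {"very", "just", "really", "basically", "actually"},
}

def condition_prompt(condition: str, base_instruction: str) -> str:
    """Prepend the ban instruction for a treatment arm; control gets the base prompt."""
    if condition == "control":
        return base_instruction
    banned = ", ".join(sorted(BANS[condition]))
    return f"{base_instruction}\nDo not use any of these words in your response: {banned}."

def complies(response: str, condition: str) -> bool:
    """A trial is retained only if the response avoids every banned word."""
    if condition == "control":
        return True  # the control arm is never filtered
    tokens = set(re.findall(r"[a-z]+", response.lower()))
    return tokens.isdisjoint(BANS[condition])
```

Note that complies() is also where the referee's selection-bias worry enters: only treatment arms can lose trials to filtering.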

If this is right

  • Any output constraint improves reasoning by acting as a regularizer that disrupts shallow default patterns.
  • Shallower constraints impose monitoring load with minimal conceptual disruption and therefore produce larger gains.
  • The specific semantic content of the banned words matters less than the mere presence of a constraint.
  • Cognitive restructuring tied to particular vocabulary removals does not explain the observed improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simple word bans could serve as a general, low-effort technique to boost model performance on reasoning tasks.
  • The same regularization logic might apply to other generation tasks where fluent but shallow outputs are common.
  • Testing random or even meaningless word bans would further isolate whether the effect depends on any deviation at all; a minimal construction of such an arm is sketched below.
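A hypothetical construction of that random-ban arm (not an experiment the paper runs; the word pool is invented for illustration):

```python
# Hypothetical "random ban" control: sample task-irrelevant common words so
# the ban carries no semantic theory at all. Pool and sizes are illustrative.
import random

COMMON_WORD_POOL = ["window", "river", "seven", "green", "quietly", "ladder"]

def random_ban(k: int = 3, seed: int = 0) -> set[str]:
    """Draw k banned words deterministically so each arm is reproducible."""
    rng = random.Random(seed)
    return set(rng.sample(COMMON_WORD_POOL, k))
```

One wrinkle: words the model rarely emits force little actual deviation, so a null result from such an arm would still be ambiguous between the regularizer account and vocabulary-specific effects.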

Load-bearing premise

That the banned filler words have no role in logical inference, and that performance gains arise specifically from forcing deviation from default generation paths rather than from unmeasured factors in task design or filtering.

What would settle it

Running the same tasks with a constraint that forces deviation without banning anything, such as requiring the banned filler words in every sentence, and finding no accuracy gain would falsify the output-regularizer account. A compliance check for such an arm is sketched below.
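A minimal compliance check for that proposed "require" arm (hypothetical; the paper ran only ban conditions, and the filler list here is assumed):

```python
# Sketch of the proposed falsification arm: require, rather than ban, fillers.
# REQUIRED is an assumed list; the paper's exact filler vocabulary is not given here.
import re

REQUIRED = {"very", "just", "really"}

def complies_require(response: str) -> bool:
    """Retain a trial only if every sentence contains at least one required word."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return all(REQUIRED & set(re.findall(r"[a-z]+", s.lower())) for s in sentences)
```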

read the original abstract

A previous study reported that E-Prime (English without the verb "to be") selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like "very" and "just" with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript replicates a prior claim that E-Prime selectively alters LLM reasoning via cognitive restructuring, but introduces active controls (No-Have, metacognitive prompt, neutral filler-word ban) across six models and seven tasks (N=15,600 trials). All four treatments outperform the unconstrained control (83.0%), with the filler ban yielding the largest gain (+6.7 pp) and E-Prime the smallest (+3.7 pp), in inverse order of theoretical depth; the cross-model correlation signature fails to replicate (mean r=0.005). The author concludes that any constraint forcing deviation from default generation paths acts as an output regularizer, with shallower constraints performing best.

Significance. If the central interpretation survives scrutiny of the filtering procedure, the work supplies a clear, large-scale disconfirmation of vocabulary-specific cognitive restructuring in LLMs and offers a parsimonious alternative mechanism with direct implications for prompt design. The scale (multiple models, tasks, and pre-specified predictions), explicit active controls, and measured effect sizes constitute genuine strengths.

major comments (2)
  1. [Methods] Methods, compliance filtering paragraph: 23% of trials (3,681/15,600) are dropped exclusively from the four treatment arms while the control retains every generation. If discarded responses in the ban conditions are systematically less accurate than retained ones, the reported gains (+3.7 to +6.7 pp) partly reflect selection of successful avoidances rather than a general regularizing effect. This selection bias is unmeasured and directly threatens the claim that the neutral filler ban's superiority demonstrates a depth-independent mechanism. (A numerical sketch of this concern follows this list.)
  2. [Results] Results, condition ranking and effect sizes: the perfect inverse ordering of theoretical depth is presented as key evidence, yet the manuscript does not report per-condition compliance rates or accuracy on the filtered subset. Without these numbers it is impossible to quantify how much of the observed ordering is attributable to differential filtering rather than the constraints themselves.
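To make major comment 1 concrete: an arm's unfiltered accuracy decomposes over retained and discarded trials, so a retained-only figure overstates the arm whenever discarded generations score worse. A toy calculation with hypothetical numbers (the paper reports only the aggregate 23% drop, not per-arm discarded accuracy):

```python
# Numerical sketch of the selection-bias concern in major comment 1.
# All inputs are hypothetical, chosen to show the mechanism, not measured values.

def unfiltered_accuracy(acc_retained: float, acc_discarded: float, drop_frac: float) -> float:
    """Accuracy the arm would show if its discarded generations were kept."""
    return (1 - drop_frac) * acc_retained + drop_frac * acc_discarded

# Suppose the filler-ban arm's 89.7% (83.0% + 6.7 pp) is computed on retained
# trials only, 23% of its generations were dropped, and those scored 80%:
print(unfiltered_accuracy(0.897, 0.80, 0.23))  # ~0.875, i.e. the gain shrinks to +4.5 pp
```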
minor comments (2)
  1. [Abstract] Abstract: the sentence reporting N=11,919 after filtering should also state the per-condition breakdown or at least the range across arms.
  2. [Methods] Methods: the exact lists of banned words for the filler condition and the full text of all task prompts should be included (or linked) to allow exact replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying a methodological issue that requires additional transparency. We address each major comment below and will revise the manuscript to incorporate the requested analyses.

read point-by-point responses
  1. Referee: [Methods] Methods, compliance filtering paragraph: 23% of trials (3,681/15,600) are dropped exclusively from the four treatment arms while the control retains every generation. If discarded responses in the ban conditions are systematically less accurate than retained ones, the reported gains (+3.7 to +6.7 pp) partly reflect selection of successful avoidances rather than a general regularizing effect. This selection bias is unmeasured and directly threatens the claim that the neutral filler ban's superiority demonstrates a depth-independent mechanism.

    Authors: We agree that the compliance filtering procedure requires explicit quantification to rule out selection bias. In the revised manuscript we will add a table reporting per-condition compliance rates together with the mean accuracy of the discarded generations in each treatment arm. This will allow readers to assess directly how much of the observed gains (+3.7 to +6.7 pp) may be attributable to selective retention rather than the constraints themselves. We maintain that the core pattern—an inverse relationship between theoretical depth and performance gain—reflects a general regularizing effect, but the additional data will make this claim fully testable. revision: yes

  2. Referee: [Results] Results, condition ranking and effect sizes: the perfect inverse ordering of theoretical depth is presented as key evidence, yet the manuscript does not report per-condition compliance rates or accuracy on the filtered subset. Without these numbers it is impossible to quantify how much of the observed ordering is attributable to differential filtering rather than the constraints themselves.

    Authors: We will include the per-condition compliance rates and the accuracy statistics for both retained and discarded responses in the revised Results section (and as a supplementary table). These numbers will permit a quantitative decomposition of the contribution of filtering to the observed inverse ordering. The revised text will explicitly discuss the extent to which the ranking survives this decomposition. revision: yes
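A sketch of the decomposition the rebuttal promises, assuming a per-generation log with hypothetical column names (condition, retained, correct):

```python
# Sketch of the promised per-condition compliance and accuracy table.
# "trials.csv" and its columns are assumptions, not the authors' artifacts.
import pandas as pd

df = pd.read_csv("trials.csv")  # one row per generation

# Share of generations that survived compliance filtering, per arm.
compliance = df.groupby("condition")["retained"].mean()

# Accuracy and counts for retained vs. discarded generations, per arm.
accuracy = (
    df.groupby(["condition", "retained"])["correct"]
      .agg(["mean", "size"])
      .rename(columns={"mean": "accuracy", "size": "n"})
)

print(compliance)
print(accuracy)
```

If discarded generations score about as well as retained ones within each arm, the selection-bias objection dissolves; if they score much lower, the reported gains need the correction sketched under the referee report.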

Circularity Check

0 steps flagged

No significant circularity in experimental reporting

full rationale

The paper reports results from a controlled experiment across five conditions, six models, and seven tasks, with direct measurement of accuracy differences and explicit disconfirmation of the cognitive restructuring hypothesis. The conclusion that constraints act as output regularizers follows from the observed inverse ranking of effect sizes (neutral ban largest, E-Prime smallest) and the non-replication of cross-model correlations. There are no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations that reduce any claim to its inputs by construction. The prior study is cited only as the hypothesis being tested and refuted, not as justification for the mechanism.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurements from controlled experiments rather than on new theoretical constructs or fitted parameters.

axioms (1)
  • domain assumption: The selected filler words have no role in logical inference
    Used to classify the neutral ban as a shallow constraint with no theoretical depth.

pith-pipeline@v0.9.0 · 5550 in / 1226 out tokens · 57844 ms · 2026-05-13T19:52:14.990991+00:00 · methodology

