Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints
Pith reviewed 2026-05-13 19:52 UTC · model grok-4.3
The pith
Banning filler words like 'very' and 'just' improves LLM reasoning more than banning the verb 'to be' (E-Prime).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The cognitive restructuring hypothesis was disconfirmed. All four constraints outperformed the unconstrained control (83.0% accuracy): the neutral filler-word ban gained +6.7 percentage points, followed by No-Have, the metacognitive prompt, and E-Prime at +3.7 points. The cross-model correlation signature failed to replicate. These outcomes support the view that any constraint forcing deviation from default generation paths improves reasoning by reducing shallow responses, with shallower constraints working best because they add monitoring load without deep conceptual interference.
What carries the argument
The output regularizer: any constraint that forces a model off its default generation path, disrupting fluent but shallow response patterns.
If this is right
- Any output constraint improves reasoning by acting as a regularizer that disrupts shallow default patterns.
- Shallower constraints impose monitoring load with minimal conceptual disruption and therefore produce larger gains.
- The specific semantic content of the banned words matters less than the mere presence of a constraint.
- Cognitive restructuring tied to particular vocabulary removals does not explain the observed improvements.
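As a concrete illustration of the setup these bullets describe, a vocabulary-ban condition reduces to a prompt clause plus a compliance check on the generation. The word lists and prompt wording below are illustrative stand-ins, not the paper's exact materials:

```python
import re

# Illustrative ban lists; the paper's exact word lists are not reproduced here.
BANS = {
    "eprime": {"be", "is", "am", "are", "was", "were", "been", "being"},
    "no_have": {"have", "has", "had", "having"},
    "filler": {"very", "just", "really", "quite", "basically", "actually"},
}

def constraint_instruction(condition: str) -> str:
    """Build a hypothetical constraint clause for the system prompt."""
    words = ", ".join(sorted(BANS[condition]))
    return f"Do not use any of the following words in your response: {words}."

def complies(response: str, condition: str) -> bool:
    """Check whether a generation respects the ban (case-insensitive, whole words)."""
    tokens = set(re.findall(r"[a-z']+", response.lower()))
    return not (tokens & BANS[condition])

# A filler-ban-compliant response passes; one containing "very" is filtered out.
print(complies("The answer follows from modus ponens.", "filler"))  # True
print(complies("This is very clearly true.", "filler"))             # False
```

Under the output-regularizer reading, it is the monitoring needed to satisfy `complies`, not which words sit in the ban set, that produces the gain.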
Where Pith is reading between the lines
- Simple word bans could serve as a general, low-effort technique to boost model performance on reasoning tasks.
- The same regularization logic might apply to other generation tasks where fluent but shallow outputs are common.
- Testing random or even meaningless word bans would further isolate whether the effect depends on any deviation at all.
Load-bearing premise
That the neutral filler-word ban has no role in logical inference and that performance gains arise specifically from forcing deviation from default generation paths rather than from unmeasured factors in task design or filtering.
What would settle it
Running the same tasks with a constraint that does not force deviation from default paths, such as requiring the banned filler words in every sentence, and finding no accuracy gain would falsify the output-regularizer account.
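The proposed falsification control can be operationalized with the same machinery as the bans, inverted: the checker passes only when the model stays on its default fluent path by using filler words everywhere. The filler list and sentence-splitting rule below are hypothetical:

```python
import re

FILLERS = {"very", "just", "really", "quite"}  # illustrative list

def requires_fillers(response: str) -> bool:
    """Inverse constraint for the proposed control: every sentence must
    contain at least one filler word, so compliance keeps the model ON
    its default fluent generation path rather than forcing it off."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return all(set(re.findall(r"[a-z']+", s.lower())) & FILLERS
               for s in sentences)

print(requires_fillers("This is very clear. We just apply the rule."))  # True
print(requires_fillers("This is very clear. We apply the rule."))       # False
```

If accuracy under this constraint still rose, mere constraint monitoring would suffice; a null result would support the deviation-from-default account.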
Original abstract
A previous study reported that E-Prime (English without the verb "to be") selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like "very" and "just" with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript replicates a prior claim that E-Prime selectively alters LLM reasoning via cognitive restructuring, but introduces active controls (No-Have, metacognitive prompt, neutral filler-word ban) across six models and seven tasks (N=15,600 trials). All four treatments outperform the unconstrained control (83.0%), with the filler ban yielding the largest gain (+6.7 pp) and E-Prime the smallest (+3.7 pp) in inverse order of theoretical depth; the cross-model correlation signature fails to replicate (mean r=0.005). The authors conclude that any constraint forcing deviation from default generation paths acts as an output regularizer, with shallower constraints performing best.
Significance. If the central interpretation survives scrutiny of the filtering procedure, the work supplies a clear, large-scale disconfirmation of vocabulary-specific cognitive restructuring in LLMs and offers a parsimonious alternative mechanism with direct implications for prompt design. The scale (multiple models, tasks, and pre-specified predictions), explicit active controls, and measured effect sizes constitute genuine strengths.
major comments (2)
- [Methods] Methods, compliance filtering paragraph: 23% of trials (3,681/15,600) are dropped exclusively from the four treatment arms while the control retains every generation. If discarded responses in the ban conditions are systematically less accurate than retained ones, the reported gains (+3.7 to +6.7 pp) partly reflect selection of successful avoidances rather than a general regularizing effect. This selection bias is unmeasured and directly threatens the claim that the neutral filler ban's superiority demonstrates a depth-independent mechanism.
- [Results] Results, condition ranking and effect sizes: the perfect inverse ordering of theoretical depth is presented as key evidence, yet the manuscript does not report per-condition compliance rates or accuracy on the filtered subset. Without these numbers it is impossible to quantify how much of the observed ordering is attributable to differential filtering rather than the constraints themselves.
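The selection-bias worry in the comments above is easy to make quantitative: blending retained and discarded trials back into an intent-to-treat estimate shows how much a reported gain shrinks if non-compliant generations score worse. The discarded-trial accuracies below are hypothetical; the 70.5% compliance rate assumes equal arm sizes (12,480 treatment trials, 3,681 dropped):

```python
def intent_to_treat_accuracy(acc_retained, acc_discarded, compliance_rate):
    """Blend retained (compliant) and discarded (non-compliant) trial
    accuracy back into a single intent-to-treat estimate. If discarded
    generations score worse, retained-only accuracy overstates the gain."""
    return compliance_rate * acc_retained + (1 - compliance_rate) * acc_discarded

control = 0.830    # unconstrained control, all trials kept
retained = 0.897   # hypothetical filler-ban accuracy on retained trials (+6.7 pp)
for acc_discarded in (0.897, 0.830, 0.750):  # hypothetical discarded-trial accuracies
    itt = intent_to_treat_accuracy(retained, acc_discarded, compliance_rate=0.705)
    print(f"discarded acc {acc_discarded:.3f} -> ITT gain {100 * (itt - control):+.1f} pp")
```

On these illustrative numbers, a +6.7 pp retained-subset gain survives intact only if discarded trials score as well as retained ones; it falls to roughly +2.4 pp if discarded trials score 75%, which is exactly the decomposition the referee requests.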
minor comments (2)
- [Abstract] Abstract: the sentence reporting N=11,919 after filtering should also state the per-condition breakdown or at least the range across arms.
- [Methods] Methods: the exact lists of banned words for the filler condition and the full text of all task prompts should be included (or linked) to allow exact replication.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying a methodological issue that requires additional transparency. We address each major comment below and will revise the manuscript to incorporate the requested analyses.
Point-by-point responses
Referee: [Methods] Methods, compliance filtering paragraph: 23% of trials (3,681/15,600) are dropped exclusively from the four treatment arms while the control retains every generation. If discarded responses in the ban conditions are systematically less accurate than retained ones, the reported gains (+3.7 to +6.7 pp) partly reflect selection of successful avoidances rather than a general regularizing effect. This selection bias is unmeasured and directly threatens the claim that the neutral filler ban's superiority demonstrates a depth-independent mechanism.
Authors: We agree that the compliance filtering procedure requires explicit quantification to rule out selection bias. In the revised manuscript we will add a table reporting per-condition compliance rates together with the mean accuracy of the discarded generations in each treatment arm. This will allow readers to assess directly how much of the observed gains (+3.7 to +6.7 pp) may be attributable to selective retention rather than the constraints themselves. We maintain that the core pattern—an inverse relationship between theoretical depth and performance gain—reflects a general regularizing effect, but the additional data will make this claim fully testable. revision: yes
Referee: [Results] Results, condition ranking and effect sizes: the perfect inverse ordering of theoretical depth is presented as key evidence, yet the manuscript does not report per-condition compliance rates or accuracy on the filtered subset. Without these numbers it is impossible to quantify how much of the observed ordering is attributable to differential filtering rather than the constraints themselves.
Authors: We will include the per-condition compliance rates and the accuracy statistics for both retained and discarded responses in the revised Results section (and as a supplementary table). These numbers will permit a quantitative decomposition of the contribution of filtering to the observed inverse ordering. The revised text will explicitly discuss the extent to which the ranking survives this decomposition. revision: yes
Circularity Check
No significant circularity in experimental reporting
Full rationale
The paper reports results from a controlled experiment across five conditions, six models, and seven tasks, with direct measurement of accuracy differences and explicit disconfirmation of the cognitive restructuring hypothesis. The conclusion that constraints act as output regularizers follows from the observed inverse ranking of effect sizes (neutral ban largest, E-Prime smallest) and the non-replication of cross-model correlations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations reduce any claim to its inputs by construction. The prior study is cited only as the hypothesis being tested and refuted, not as justification for the mechanism.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the selected filler words play no role in logical inference.