Pith · machine review for the scientific record

arxiv: 2603.29025 · v2 · submitted 2026-03-30 · 💻 cs.CL · cs.AI

Recognition: no theorem link

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Lu Zhang, Ramayya Krishnan, Rema Padman, Tianchong Jiang, Yubo Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · reasoning · heuristics · constraints · benchmark · car wash problem · heuristic override

The pith

Surface distance cues override implicit feasibility constraints in large language models, causing systematic reasoning failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why large language models fail on tasks where a prominent surface feature conflicts with an unstated practical constraint. It introduces a diagnose-measure-bridge-treat framework, using the car wash problem as a case study. Analysis across multiple models shows that the distance cue exerts far more influence than the actual goal and that decision scores follow sigmoid patterns. A new benchmark, the Heuristic Override Benchmark (HOB), tests combinations of heuristic and constraint families; performance is low across models, and a simple hint recovers much of the gap, pointing to a failure of constraint inference rather than missing knowledge.
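The "influence" compared here is behavioral and causal: score a prompt, occlude one span, re-score, and record how far the model's Walk-versus-Drive decision score moves (Figures 1 and 7 report exactly this kind of Δs). The sketch below shows the mechanics under those assumptions; the function names and the toy scorer are invented for this sketch, and none of it is the paper's implementation.

    from typing import Callable

    def span_occlusion(prompt: str, spans: dict[str, str],
                       decision_score: Callable[[str], float]) -> dict[str, float]:
        """Δs attribution sketch: change in the decision score when one span is removed.

        `decision_score` is any callable returning s(x) = log P(Walk | x) - log P(Drive | x);
        in practice it would come from model log-probabilities (hypothetical here).
        """
        base = decision_score(prompt)
        return {name: decision_score(prompt.replace(text, "")) - base
                for name, text in spans.items()}

    # Toy scorer that reacts only to the distance phrase, purely to show the mechanics.
    def fake_score(p: str) -> float:
        return 5.0 if "100 m" in p else -1.0

    print(span_occlusion(
        "The car wash is 100 m away. I need my car washed. Should I walk or drive?",
        {"distance": "100 m away.", "goal": "I need my car washed."},
        fake_score,
    ))
    # -> {'distance': -6.0, 'goal': 0.0}: removing the distance span moves the score a lot,
    #    removing the goal span barely moves it, mirroring the asymmetry the paper reports.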

Core claim

Large language models exhibit heuristic override where salient surface cues, such as distance in the car wash problem, exert 8.7 to 38 times more influence than the implicit goal constraint, as revealed by causal-behavioral analysis and confirmed across the Heuristic Override Benchmark (HOB) spanning multiple heuristic and constraint families.

What carries the argument

The Heuristic Override Benchmark (HOB): 500 instances with minimal pairs and explicitness gradients across 4 heuristic by 5 constraint families, measuring how strongly surface heuristics override implicit constraints.
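Since the benchmark is the load-bearing artifact, it helps to picture what a single instance has to carry: a heuristic family, a constraint family, a position on the explicitness gradient, and a matched minimal-pair control. The schema below is hypothetical; the family codes are taken from the figure captions, and everything else is an illustrative guess rather than HOB's actual format.

    from dataclasses import dataclass

    @dataclass
    class HOBInstance:
        """Hypothetical schema for one Heuristic Override Benchmark item."""
        heuristic_family: str      # e.g. "H-prox", "H-cost", "H-eff", "H-sem" (codes from the figures)
        constraint_family: str     # e.g. "C-pres", "C-cap", "C-scope"
        explicitness: int          # position on the explicitness gradient (0 = fully implicit, assumed)
        prompt: str                # scenario where the surface cue conflicts with the constraint
        minimal_pair_prompt: str   # matched control with the conflict removed
        correct_answer: str        # answer once the implicit constraint is honoured
        heuristic_answer: str      # answer the surface cue alone would suggest

    # A toy instance in the spirit of the car wash problem (illustrative only):
    example = HOBInstance(
        heuristic_family="H-prox",
        constraint_family="C-pres",
        explicitness=0,
        prompt="The car wash is 100 m away. Should I walk or drive to get my car washed?",
        minimal_pair_prompt="The cafe is 100 m away. Should I walk or drive to get a coffee?",
        correct_answer="Drive",    # the car has to be present to be washed
        heuristic_answer="Walk",   # the short-distance cue
    )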

If this is right

  • Under strict 10/10 evaluation, no model exceeds 75% accuracy on HOB, with presence constraints being the hardest at 44%.
  • Providing a minimal hint emphasizing the key object improves average performance by 15 percentage points.
  • 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, indicating a conservative bias.
  • Goal-decomposition prompting recovers 6 to 9 percentage points by forcing enumeration of preconditions (see the prompt sketch after this list).
  • The sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics via parametric probes.
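The goal-decomposition intervention above is described only at the level of forcing the model to enumerate preconditions before answering. The template below is a hypothetical rendering of that idea for illustration; the paper's actual prompt wording is not reproduced here.

    # Hypothetical goal-decomposition scaffold; the wording is invented for this sketch.
    GOAL_DECOMPOSITION_TEMPLATE = (
        "Before answering, do the following:\n"
        "1. State the user's actual goal in one sentence.\n"
        "2. List every precondition that must hold for that goal to be achievable.\n"
        "3. Check each candidate action against those preconditions.\n"
        "4. Only then give your recommendation.\n"
        "\n"
        "Task: {task}\n"
    )

    def wrap_with_goal_decomposition(task: str) -> str:
        """Wrap a task in a precondition-enumeration scaffold before sending it to a model."""
        return GOAL_DECOMPOSITION_TEMPLATE.format(task=task)

    print(wrap_with_goal_decomposition(
        "The car wash is 100 m away. Should I walk or drive to get my car washed?"
    ))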

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Addressing heuristic override may require new training methods focused on explicit constraint checking rather than pattern matching.
  • This vulnerability could affect applications like planning or decision-making where implicit rules are common.
  • Further tests could apply the benchmark to multimodal models to see if visual cues exacerbate the issue.

Load-bearing premise

The assumption that minimal pairs and explicitness gradients in the HOB benchmark isolate the effects of heuristic override from knowledge gaps or prompt formatting.
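One way to operationalize this premise when scoring results: attribute a miss to heuristic override only when the model solves the minimal-pair control, so that knowledge gaps and formatting failures are at least partly ruled out. The check below is an illustrative sketch of that bookkeeping, not the paper's scoring rule.

    def override_failure(answer_conflict: str, answer_control: str,
                         correct_conflict: str, correct_control: str) -> bool:
        """Count an instance as a heuristic-override failure only if the matched
        minimal-pair control was answered correctly while the conflict variant was not.
        Illustrative logic; the paper's actual scoring rule may differ."""
        ok_control = answer_control.strip().lower() == correct_control.strip().lower()
        ok_conflict = answer_conflict.strip().lower() == correct_conflict.strip().lower()
        return ok_control and not ok_conflict

    # Example: the model walks to the car wash (wrong) but handles the no-conflict control.
    print(override_failure("Walk", "Walk", "Drive", "Walk"))  # -> True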

What would settle it

A model achieving over 90% accuracy on HOB instances under strict evaluation without relying on distance cues would falsify the claim of systematic override.
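Strict evaluation here means per-instance credit only when every one of the 10 trials is correct, which is harsher than mean per-trial accuracy; the helper below illustrates the aggregation under that reading of the protocol and is not the paper's evaluation harness.

    def strict_accuracy(results: dict[str, list[bool]]) -> float:
        """Fraction of instances for which every trial passed (N = 10 in the paper's protocol).
        Illustrative helper, not the authors' evaluation code."""
        if not results:
            return 0.0
        solved = sum(1 for trials in results.values() if trials and all(trials))
        return solved / len(results)

    # One instance passes all 10 trials, the other fails a single trial:
    # strict accuracy is 0.5 even though 19 of 20 individual trials were correct.
    print(strict_accuracy({
        "hob-001": [True] * 10,
        "hob-002": [True] * 9 + [False],
    }))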

Figures

Figures reproduced from arXiv: 2603.29025 by Lu Zhang, Ramayya Krishnan, Rema Padman, Tianchong Jiang, Yubo Li.

Figure 1
Figure 1: Left: base decision scores s(x). All positive (incorrect Walk preference); non-monotonic scaling. Right: span-level occlusion heatmap. Distance columns uniformly blue (Δs < 0, toward Drive); goal columns near-zero or red.
Figure 2
Figure 2: Left: CSI vs. DSI per paraphrase (Qwen3-4B). Goal sensitivity drives HDR variation; distance sensitivity is stable. Right: per-span Δs heatmap (Qwen3-4B). Pattern consistent across all six models.
Figure 3
Figure 3: All six models' conflict curves (solid) are sigmoids tracking the control (dashed).
Figure 4
Figure 4: Mean strict accuracy per H × C cell (14 models). C-pres hardest; C-cap easiest. (Evaluation: 14 models on ~500 HOB instances, N = 10 trials per instance, strict credit only if all 10 pass.)
Figure 5
Figure 5: Probe pattern classification across 6 models.
Figure 6
Figure 6: Goal-decomposition prompting improves weaker models substantially.
Figure 7
Figure 7: Token-level Δs within the goal span (Qwen3-4B). Green bars (negative) weakly favour Drive; red bars (positive) favour Walk. Opposing effects cancel, leaving near-zero net goal influence. No token approaches the magnitude of the distance cue.
Figure 8
Figure 8: Monotonicity analysis: decision score s(d) vs. distance for conflict (orange) and control (blue) conditions across all six models. Every model produces sigmoid conflict curves that track the control curve. (Axes: distance on a log scale from 10 m to 100 km; score s(x) = log P(Walk) − log P(Drive); ideal behavior is flat, Drive at all distances.)
Figure 9
Figure 9: Individual monotonicity curves. Top: Qwen3-4B (left) and Qwen3-32B (right). Bottom: GPT-OSS-20B (left) and Qwen3-14B (right, highest Walk-bias at short distances).
Figure 10
Figure 10: Remaining models: Qwen3-8B (left) and Qwen3.5-27B (right).
Figure 11
Figure 11: Multi-panel diagnostic profile for Qwen3-4B: span heatmap, HDR decomposition, …
Figure 12
Figure 12: Strict accuracy across H × C cells for all 14 models. Cells A1 (H-prox × C-pres) and B1 (H-eff × C-pres) are consistently the hardest. Several models fall below 30% on these cells.
Figure 13
Figure 13: Strict accuracy by constraint family (mean …)
Figure 14
Figure 14: Parametric probes across four H × C combinations (Qwen3-4B). Orange: conflict; blue: control. Top-left: H-cost × C-scope, correct reasoning (curves distinct). Top-right: H-eff × C-cap, sigmoid failure (curves track). Bottom-left: H-prox × C-cap, correct reasoning. Bottom-right: H-sem × C-scope, semantic sigmoid.
Figure 15
Figure 15: H-eff × C-cap conflict curves for all six models. Qwen3-4B stays strongly positive (sigmoid failure); larger models (Qwen3-32B, Qwen3.5-27B) correctly shift negative. GPT-OSS-20B hovers near zero.
Figure 16
Figure 16: H-sem × C-scope conflict curves for all six models. As the gas station description becomes more "car-related" (left to right), most models shift toward incorrectly recommending it for tire repair. Qwen3-4B shows the strongest semantic sigmoid; Qwen3.5-27B and Qwen3-32B remain closer to the decision boundary.
Original abstract

Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the ``car wash problem'' across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) -- 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients -- demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that large language models systematically prioritize salient surface cues over unstated feasibility constraints in reasoning. Through a diagnose-measure-bridge-treat framework and causal-behavioral analysis of the car-wash problem across six models, it identifies approximately context-independent sigmoid heuristics in which the distance cue exerts 8.7–38 times more influence than the goal. The introduced Heuristic Override Benchmark (HOB) spans 500 instances across 4 heuristic families and 5 constraint families with minimal pairs and explicitness gradients; under strict 10/10 evaluation, no model exceeds 75% accuracy and presence constraints are hardest (44%). Minimal hints recover +15 pp on average, goal-decomposition prompting recovers +6–9 pp, and 12/14 models perform worse when the constraint is removed (up to –39 pp), indicating failures in constraint inference rather than knowledge gaps. Parametric probes extend the sigmoid pattern to cost, efficiency, and semantic-similarity heuristics.

Significance. If the quantitative claims hold, the work provides a systematic characterization of heuristic override as a reproducible reasoning vulnerability in LLMs, introduces a reusable benchmark (HOB) for tracking progress, and demonstrates that lightweight interventions (hints, goal decomposition) can measurably mitigate the issue. The cross-model consistency and the recovery effects are concrete strengths that move the discussion beyond isolated failure cases.

major comments (2)
  1. [car-wash analysis] The central claim that the distance cue exerts 8.7–38 times more influence than the goal rests on fitting sigmoid heuristics to behavioral responses. The manuscript provides no details on the fitting procedure, chosen parameterization, confidence intervals, or robustness checks under prompt rephrasing or alternative attribution methods; without these, the reported multiplier range risks being an artifact of the specific functional form rather than a stable property of heuristic override.
  2. [HOB benchmark] HOB evaluation protocol: the strict 10/10 correctness criterion and the reported performance drops when constraints are removed (up to –39 pp) are load-bearing for the claim that failures reflect constraint-inference deficits. The abstract and analysis lack explicit statistical controls, full model-version specifications, and data-exclusion criteria, which are required to support cross-model generality.
minor comments (2)
  1. The manuscript should report exact model versions (including checkpoints) and any response-filtering rules used in the six-model and 14-model evaluations.
  2. Token-level attribution results would benefit from a brief description of the attribution method and any controls for token-frequency confounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional methodological transparency will strengthen the manuscript. We address each point below and will incorporate the requested details in the revision.

Point-by-point responses
  1. Referee: Car-wash analysis: the central claim that the distance cue exerts 8.7–38 times more influence than the goal rests on fitting sigmoid heuristics to behavioral responses. The manuscript provides no details on the fitting procedure, chosen parameterization, confidence intervals, or robustness checks under prompt rephrasing or alternative attribution methods; without these, the reported multiplier range risks being an artifact of the specific functional form rather than a stable property of heuristic override.

    Authors: We agree that the fitting details must be documented explicitly. In the revised manuscript we will add an appendix describing the procedure: responses were fit to a logistic sigmoid P(override) = 1 / (1 + exp(−k · (distance − x0))) via nonlinear least-squares minimization, with k and x0 estimated separately per model. We will report 95% bootstrap confidence intervals (1,000 resamples) and show that the 8.7–38× multiplier range remains stable (7.9–41×) under three prompt rephrasings and when token attribution is replaced by integrated-gradients scores. These additions will demonstrate that the reported range reflects a reproducible behavioral pattern rather than a fitting artifact. revision: yes

  2. Referee: HOB evaluation protocol: the strict 10/10 correctness criterion and the reported performance drops when constraints are removed (up to –39 pp) are load-bearing for the claim that failures reflect constraint-inference deficits. The abstract and analysis lack explicit statistical controls, full model-version specifications, and data-exclusion criteria, which are required to support cross-model generality.

    Authors: We accept that these specifications are necessary. The revision will list every model version and checkpoint used, state that data exclusion was restricted to unparseable outputs (< 2 % of trials), and add paired t-tests (all p < .01) together with linear-regression controls for prompt length and token count. Standard deviations across three independent runs per model will also be reported. These changes will provide the statistical grounding required for the cross-model claims. revision: yes
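To make the fitting recipe in response 1 concrete: under the assumptions stated there (a two-parameter logistic fit per model by nonlinear least squares, with 1,000 bootstrap resamples for 95% intervals), the procedure might look like the sketch below. It is an editorial reconstruction of the described analysis, not the authors' code.

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(d, k, x0):
        """P(override) as a function of distance, per the parameterization quoted above."""
        return 1.0 / (1.0 + np.exp(-k * (d - x0)))

    def fit_with_bootstrap(distances, override_rates, n_boot=1000, seed=0):
        """Nonlinear least-squares fit of (k, x0) plus 95% bootstrap confidence intervals."""
        d = np.asarray(distances, dtype=float)
        y = np.asarray(override_rates, dtype=float)
        popt, _ = curve_fit(logistic, d, y, p0=[0.01, np.median(d)], maxfev=10_000)

        rng = np.random.default_rng(seed)
        boots = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(d), size=len(d))   # resample points with replacement
            try:
                b, _ = curve_fit(logistic, d[idx], y[idx], p0=popt, maxfev=10_000)
                boots.append(b)
            except RuntimeError:
                continue                                  # skip resamples that fail to converge
        ci = np.percentile(np.array(boots), [2.5, 97.5], axis=0)  # rows: low/high; cols: k, x0
        return popt, ci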

Circularity Check

0 steps flagged

Empirical benchmark study with no load-bearing derivations or self-referential reductions

full rationale

The paper conducts direct evaluations of LLMs on the car-wash problem and the Heuristic Override Benchmark (HOB), reporting observed behavioral patterns such as sigmoid-like responses and influence ratios from model outputs. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or prior self-citations. The central claims rest on external model testing across 14 models with minimal pairs and explicitness gradients, which are independent of the reported measurements. No self-citation chains or ansatzes are invoked to justify uniqueness or force results. This is a standard empirical analysis whose quantitative findings (e.g., 8.7–38x influence) are measurements rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that surface cues can be cleanly separated from implicit constraints via minimal pairs and that hint interventions isolate inference failures rather than knowledge gaps.

axioms (1)
  • Domain assumption: Minimal pairs in the benchmark isolate the effect of surface heuristics from other prompt factors.
    Invoked in the construction of the 500-instance benchmark spanning heuristic and constraint families.

pith-pipeline@v0.9.0 · 5546 in / 1152 out tokens · 48489 ms · 2026-05-14T21:03:13.023690+00:00 · methodology

discussion (0)

