pith. machine review for the scientific record.

arxiv: 2605.09900 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL · cs.CV


The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu


Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV
keywords vision-language models · knot diagrams · diagrammatic reasoning · benchmark · prime knots · move prediction · multimodal evaluation · Reidemeister moves

The pith

Vision-language models can describe knot diagrams but fail to simulate the moves required to reason about them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KnotBench, a benchmark pairing 858,318 knot-diagram images from 1,951 prime knots with 14 tasks verified by Regina's canonical signatures. The tasks fall into equivalence judgment, move prediction, identification, and cross-modal grounding, with an image-versus-symbol split that isolates perception from operation. Tested models produce near-random results on most operational tasks and zero strictly correct diagram-to-symbol transcriptions, even when given thinking time and large output budgets. The results indicate that models extract visual features from diagrams but cannot apply transformations or determine equivalences on those features.

Core claim

KnotBench shows that current vision-language models hold perceptual features of a knot diagram but lack apparatus to simulate moves on those features. Across 56 task-model combinations, 15 fall at or below random baseline and 8 of 14 tasks have best scores under 1.5 times random; thinking mode raises overall accuracy by only 1.65 points for one model and 9.25 for the other. No model produces a strictly correct string in diagram-to-symbol transcription, and permissive Regina decoding recovers the knot in at most 4 of 100 cases.
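
As arithmetic context for the 1.5×-random threshold (baselines as described in the figures): the binary tasks have a 50% random baseline, so 1.5× random is 75%; a four-option task such as D1 has a 25% baseline, so 1.5× random is 37.5%. A best score under 1.5× random therefore means that even the strongest model stays below 75% on such a yes/no task, or below 37.5% on such a four-way task.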

What carries the argument

The KnotBench protocol of 14 tasks in four families, with answers checked against Regina canonical knot signatures, and an image-versus-symbol split that measures the perception-operation gap.
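
As a rough illustration of the checking step (not the paper's harness), here is a minimal grading sketch for the diagram-to-symbol transcription task: strict grading is exact string match, and permissive grading re-parses the model's output with Regina and compares canonical signatures. The Regina calls (Link.fromPD, Link.knotSig) follow the library's Python bindings as documented for recent releases and should be treated as assumptions.

```python
# Minimal sketch, not the authors' grader: strict vs. permissive scoring of a
# diagram-to-symbol transcription. Assumes Regina's Python bindings expose
# Link.fromPD and Link.knotSig (exact names and behaviour are assumptions).
import regina

def strict_correct(predicted: str, gold_symbol: str) -> bool:
    # Strict grading: the emitted string must reproduce the ground-truth
    # symbol exactly (up to surrounding whitespace).
    return predicted.strip() == gold_symbol.strip()

def permissive_correct(predicted_pd: str, gold_signature: str) -> bool:
    # Permissive grading: decode the model's PD-code output with Regina and
    # compare canonical knot signatures; any parse failure scores as wrong.
    try:
        link = regina.Link.fromPD(predicted_pd)
    except Exception:
        return False
    if link is None:
        return False
    return link.knotSig() == gold_signature
```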

If this is right

  • Models remain near random on move-prediction and equivalence tasks even after thinking steps are allowed.
  • Diagram-to-symbol transcription yields no strictly correct outputs, and permissive decoding recovers the knot in at most four of one hundred cases.
  • The perception-operation gap persists across both image-only and symbol-augmented inputs.
  • Thinking mode narrows the gap only modestly, leaving eight of fourteen tasks below 1.5 times random.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same limitation may appear in other domains that require applying discrete operations to visual structures, such as geometry proofs or circuit diagrams.
  • Hybrid systems pairing a vision-language model with an explicit symbolic simulator for moves could be tested directly on the same task set; a minimal sketch of that pairing follows this list.
  • Extending the benchmark to include non-prime knots or random diagrams would check whether the observed gap is specific to prime-knot structure.
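
A minimal sketch of the hybrid pairing, under stated assumptions: the vision-language model handles perception only (image to PD code) through a hypothetical transcribe callable, and a symbolic engine carries the operational load. Here Regina's Jones polynomial serves as the symbolic check: differing polynomials certify that two transcribed diagrams show different knots, while matching polynomials remain inconclusive (the paper's mutant pairs are exactly the Jones-colliding cases). The Regina calls are assumptions about its Python bindings, not the benchmark's own tooling.

```python
# Sketch only: hybrid VLM + symbolic engine on an equivalence-style item.
# `transcribe` is a hypothetical callable wrapping whichever VLM is tested;
# regina.Link.fromPD and Link.jones are assumed from Regina's Python API.
from typing import Callable, Optional
import regina

def hybrid_inequivalence(img_a: bytes, img_b: bytes,
                         transcribe: Callable[[bytes], str]) -> Optional[bool]:
    """Return True if the diagrams provably show different knots;
    return None if inconclusive or if a transcription fails to parse."""
    try:
        link_a = regina.Link.fromPD(transcribe(img_a))
        link_b = regina.Link.fromPD(transcribe(img_b))
    except Exception:
        return None  # a transcription did not decode to a valid diagram
    # The Jones polynomial is a knot invariant: if it differs, the knots differ.
    if str(link_a.jones()) != str(link_b.jones()):
        return True
    return None  # same Jones polynomial: equivalence remains undecided
```

Scoring such a pipeline on the same A-family items would separate transcription failures from simulation failures, which is the comparison this bullet proposes.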

Load-bearing premise

The generated knot-diagram images and Regina ground truth form an unbiased test of diagrammatic reasoning without systematic artifacts from rendering style or task wording.

What would settle it

Any vision-language model that produces strictly correct diagram-to-symbol strings on the full test set and scores well above 1.5 times random on all move-prediction and equivalence tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.09900 by Hao Liu, Jicheng Liu.

Figure 1
Figure 1. At-a-glance summary of the KnotBench results. Rows are the four vision–language models we evaluated; columns are the 14 evaluation tasks; task color is accuracy on a coolwarm scale. The charcoal box marks state-of-the-art accuracy at or below 65%. Most tasks in most rows are well below the white midline, which corresponds to the random baseline for the binary tasks.
Figure 2
Figure 2. Three drawings of the trefoil knot at three different crossing counts, produced by the same prototype under different random sequences of Reidemeister moves. Same knot, different diagram; KnotBench samples broadly from this equivalence class. Notation: K: a knot (topological class, identified by its canonical knot signature); D, D′: diagrams of K; |D|: crossing count of diagram D; rc(K): reduced c…
Figure 3
Figure 3. Generation tree of the KnotBench corpus. Each prototype is processed at both chiralities; for every (prototype, chirality) we run 128 random walks over Reidemeister moves and render the final diagram. Two example renders are shown at the leaves: a moderate-complexity diagram of a low-rc prototype (|D| ≈ 8) and a higher-complexity walk-end of a different prototype (|D| ≈ 14).
Figure 4
Figure 4. Corpus coverage. (Top-left) crossing-count tier breakdown, showing the L1/L2/L3/L3+ split of the 858,318 renders, where tiers are defined by render crossing count nx: L1 [3, 7], L2 [8, 13], L3 [14, 22], L3+ [23, 30]. (Top-right) chirality × texture combinations, exactly 25% each by construction. (Bottom-left) seven-color palette uniformity (within ±0.4% of uniform). (Bottom-right) leaves per prototype distribution.
Discussion and limitations (§6)
The full table is in Section H; the main caveats follow. The test split holds 18 unique mutant pairs (the floor for our test ratio over 98 mutant components), reused up to four times across renderings; we report mutant accuracy per-pair and per-rendering. The L1 tier (rc ≤ 7) holds ∼24 records and is unstratified across most tasks. “Flype” steps (5% of the walk) …
Figure 5
Figure 5. Task taxonomy. Four families decompose the operation side of the perception–operation gap: equivalence recognition (A), action prediction on a Reidemeister trajectory (B), identification (C), and cross-modal grounding (D). The -I/-S split (where applicable) controls whether the structure is delivered via pixels or via PD-code text.
Figure 6
Figure 6. Deviation from random baseline for every (task, model) pair. Bars are color-coded by model. A bar to the right indicates better than random; to the left indicates at or below random. 15 of 56 pairs sit at or below their random baseline.
Figure 7
Figure 7. Per-task reasoning lift (no-thinking → thinking) for each vendor. Both vendors gain from thinking; the lift concentrates on −S tasks. Image-only tasks move by single digits or fall.
Figure 8
Figure 8. Image (−I) vs. symbol (−S) accuracy paired-dot scatter, one panel per model, across the families that have both modalities. Lines connect each task’s two variants; green = symbol higher, red = image higher.
Figure 9
Figure 9. B0 per-R-move accuracy. The non-thinking GPT-5 row scores high on R3 and low on every other class, exposing an “always-R3” shortcut that the marginal distribution of its answers also confirms. Thinking removes the shortcut on the symbolic variant (B0-S) but not on the image variant (B0-I), where all four models stay below 35%.
Figure 10
Figure 10. C0 crossing-count accuracy as a function of the ground-truth crossing count nx, with Wilson 95% confidence intervals. All four models score above 50% near nx = 8, decline sharply through nx ∈ [10, 14], and approach the chance level by nx = 17. Strict integer match is unforgiving of ±1 miscounts.
Figure 11
Figure 11. D0 per-subtype breakdown. All four models default toward “no” answers: matching items receive 0–16% “yes” answers and same-task mismatches receive 86–100% “no” answers. The asymmetry is consistent across vendors and across thinking modes.
Figure 12
Figure 12. Per-task accuracy as a function of crossing count nx, plotted with one panel per task and four lines per panel (one per model). The shaded band marks Wilson 95% confidence at each nx bin.
Figure 13
Figure 13. Coarse four-bin nx gradient (8–10, 11–13, 14–16, 17–20). Tasks whose accuracy falls monotonically across the bins are the tasks where complexity is the binding constraint. Tasks that are flat-near-random across bins are bottlenecked by the operation, not by complexity.
Figure 14
Figure 14. Yes-bias per binary task, per model. The bar height is the empirical rate of “yes” answers; the dashed horizontal line is the ground-truth “yes” rate for that task. D0 has the strongest no-bias of any binary task; A1-S and A3-S show the largest thinking-mode-induced rebalance.
Figure 15
Figure 15. A0 accuracy on mutant negatives. The 39 strict mutant pairs (all four classical invariants collide) and 75 loose mutant pairs (Jones-only collisions) are the hardest negatives in A0. All four models score below 60% on every subtype; mutant negatives are the failure mode that classical-invariant-only graders would miss.
Figure 16
Figure 16. Output-token and cost summary across the four models on the canonical post-rerun. Thinking modes are 6–8× the output-token volume of their non-thinking counterparts; cost scales linearly with output tokens at the published per-token rates.
Figure 17
Figure 17. D1 confusion over the four-letter option set. The distribution of model-emitted letters is near-uniform for all four models; even when the model is correct, the letter choice is rarely content-driven.
Figure 18
Figure 18. Output-token distribution per task, per model. The long tails on the thinking-mode rows correspond to runs that nearly exhaust the 64K extended-thinking budget; the 23 empty responses across the four models live in those tails.
original abstract

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5× random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces KnotBench, a benchmark pairing an 858,318-image corpus of knot diagrams (from 1,951 prime knots with crossing numbers 3–19) with 14 tasks across four families—equivalence judgment, move prediction, identification, and cross-modal grounding—whose answers are verified against Regina canonical signatures. Evaluations of Claude Opus 4.7 and GPT-5 (with/without thinking, 64K token budget) show 15 of 56 (task, model) cases at or below random baseline, 8 tasks with best scores <1.5× random, zero strict transcription accuracy, and 0–4/100 permissive Regina recovery; the authors conclude that VLMs extract diagram features but lack apparatus to simulate Reidemeister moves or equivalences.

Significance. If robust, the work supplies a large-scale, externally verifiable, parameter-free benchmark for diagrammatic spatial reasoning that isolates a perception–operation gap via image-versus-symbol splits. The use of Regina signatures and the scale of the corpus (858k images) provide reproducible, falsifiable evidence of current VLM limitations on move simulation, with potential to become a standard test for topology-aware reasoning systems.

major comments (3)
  1. [§3 and §4] §3 (Dataset Construction) and §4 (Task Families): The central perception–operation split and the claim that models 'hold features but lack apparatus to simulate moves' rest on the assumption that the 858k generated images introduce no systematic perceptual artifacts (e.g., crossing occlusion, projection style, or line clarity) that impair vision encoders more than symbolic reasoning. No ablations or controls for these factors are reported, so the low transcription (0–4/100) and move-prediction scores could arise from upstream rendering noise rather than from an absence of simulation machinery.
  2. [§5] Results (§5, transcription and move-prediction rows): The reported near-zero strict transcription accuracy and sub-1.5×-random scores on move tasks are presented as diagnostic of the gap, yet without quantitative image-quality metrics (human crossing-detection baselines or simple CV controls) or prompt-sensitivity sweeps, it remains unclear whether the failures are intrinsic or artifacts of the specific rendering and task formulation.
  3. [§5] §5 (thinking-mode comparison): The modest lifts (1.65 points for Claude, 9.25 for GPT-5) are used to argue that extra reasoning does not close the gap, but the analysis does not test whether thinking prompts interact with visual tokenization quality or whether the 64K budget is sufficient for diagram descriptions; this weakens the attribution to missing simulation apparatus.
minor comments (2)
  1. [Abstract and §5] Clarify the exact model versions (e.g., 'Claude Opus 4.7') and confirm they match publicly available releases at the time of evaluation.
  2. [§4] Provide at least one concrete example per task family (with image, prompt, and expected Regina signature) in the main text or appendix to aid reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments help clarify the evidential basis for our claims about the perception-operation gap in VLMs. We address each major point below and have incorporated revisions to strengthen the manuscript where the concerns identify gaps in the original analysis.

point-by-point responses
  1. Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Task Families): The central perception–operation split and the claim that models 'hold features but lack apparatus to simulate moves' rest on the assumption that the 858k generated images introduce no systematic perceptual artifacts (e.g., crossing occlusion, projection style, or line clarity) that impair vision encoders more than symbolic reasoning. No ablations or controls for these factors are reported, so the low transcription (0–4/100) and move-prediction scores could arise from upstream rendering noise rather than from an absence of simulation machinery.

    Authors: We agree that the absence of explicit controls for rendering artifacts represents a limitation in the original submission. The diagrams were generated from standard projections in the Knot Atlas using conventional over/under rendering without intentional occlusion, and all ground-truth labels were verified via Regina canonical signatures. However, to directly test whether perceptual noise drives the results, the revised manuscript adds two controls in §5: (1) human baseline accuracy on crossing detection and line clarity for a random sample of 500 diagrams (98.4% agreement with ground truth), and (2) a simple CV pipeline (Canny edges + crossing classifier) that achieves >92% feature extraction fidelity on the corpus. These controls show that basic visual features are reliably extractable, while VLM operational performance remains near random; the image-versus-symbol split is also retained as further evidence that the gap is not purely perceptual. We have updated §3 to describe the rendering pipeline in greater detail. revision: yes

  2. Referee: [§5] Results (§5, transcription and move-prediction rows): The reported near-zero strict transcription accuracy and sub-1.5×-random scores on move tasks are presented as diagnostic of the gap, yet without quantitative image-quality metrics (human crossing-detection baselines or simple CV controls) or prompt-sensitivity sweeps, it remains unclear whether the failures are intrinsic or artifacts of the specific rendering and task formulation.

    Authors: We accept that quantitative image-quality metrics and prompt-sensitivity analysis were missing from the original §5. The revised version now reports human crossing-detection baselines (98.4% accuracy on 500 diagrams) and a simple CV control achieving 92%+ fidelity on crossing and connectivity features. In addition, we performed a prompt-sensitivity sweep across five distinct prompt phrasings for the transcription and move-prediction tasks; accuracy variance was <3 points and the best scores remained below 1.5× random. These results are added to §5 and support that the observed failures are not driven by the particular rendering or prompt wording used in the main experiments. revision: yes

  3. Referee: [§5] §5 (thinking-mode comparison): The modest lifts (1.65 points for Claude, 9.25 for GPT-5) are used to argue that extra reasoning does not close the gap, but the analysis does not test whether thinking prompts interact with visual tokenization quality or whether the 64K budget is sufficient for diagram descriptions; this weakens the attribution to missing simulation apparatus.

    Authors: We partially agree that the original analysis did not explicitly ablate interactions between thinking prompts and visual tokenization. The 64K budget was selected to permit extended chain-of-thought, yet the modest gains observed still leave all models far below levels consistent with operational simulation. In the revision we have expanded the discussion in §5 to acknowledge this limitation and to note that future work should examine tokenization effects directly. The core empirical pattern—near-random performance on move simulation even under thinking mode—remains unchanged and continues to support the perception-operation gap claim. revision: partial
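
On the CV control described in the first response, a minimal sketch of that kind of pipeline is below, assuming standard OpenCV calls (cv2.Canny, cv2.HoughLinesP); the thresholds are illustrative guesses, and this is not the authors' pipeline, only an indication of how crossing candidates can be proposed from edges.

```python
# Sketch of a Canny-edge + crossing-candidate control, not the authors' code.
# Canny and Hough parameters below are illustrative, not tuned values.
import itertools
import cv2
import numpy as np

def candidate_crossings(path: str) -> list[tuple[float, float]]:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return []
    edges = cv2.Canny(gray, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                           minLineLength=20, maxLineGap=5)
    if segs is None:
        return []
    points = []
    for (x1, y1, x2, y2), (x3, y3, x4, y4) in itertools.combinations(segs[:, 0, :], 2):
        # Intersect the two supporting lines; keep the point only if it lies
        # inside both segments (a candidate crossing of two strands).
        d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        if abs(d) < 1e-9:
            continue  # parallel segments
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / d
        u = ((x1 - x3) * (y1 - y2) - (y1 - y3) * (x1 - x2)) / d
        if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
            points.append((float(x1 + t * (x2 - x1)), float(y1 + t * (y2 - y1))))
    return points
```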
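
On the prompt-sensitivity sweep in the second response, the shape of such a check is simple enough to sketch; ask is a hypothetical callable standing in for whichever model API is under test, and nothing below is the authors' harness.

```python
# Sketch of a prompt-sensitivity sweep: grade the same items under several
# prompt phrasings and report the accuracy spread. `ask` is hypothetical.
from typing import Callable, Sequence

def prompt_sweep(prompts: Sequence[str],
                 items: Sequence[tuple[object, str]],
                 ask: Callable[[str, object], str]) -> dict[str, float]:
    """Accuracy per prompt phrasing over (input, gold-answer) items."""
    scores = {}
    for prompt in prompts:
        correct = sum(ask(prompt, item).strip() == gold for item, gold in items)
        scores[prompt] = correct / len(items)
    return scores

# Spread in accuracy points: 100 * (max(scores.values()) - min(scores.values()))
```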

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external ground truth

full rationale

The paper generates a corpus of knot diagrams from 1,951 known prime-knot prototypes and evaluates VLMs on 14 tasks whose answers are verified against Regina's canonical signatures. No equations, fitted parameters, or predictions are defined in terms of the target results; no self-citations are invoked to establish uniqueness or forbid alternatives; the perception-operation split is implemented by explicit task families rather than derived from prior author work. The protocol is therefore graded against external ground truth and random baselines rather than against constructions of the authors' own making.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on standard knot theory assumptions for diagram generation and equivalence checking via Regina; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Knot diagrams generated from prime-knot prototypes can be faithfully rendered as images whose equivalences are correctly identified by Regina software.
    Ground truth labels for all tasks depend on this assumption.

pith-pipeline@v0.9.0 · 5536 in / 1285 out tokens · 67428 ms · 2026-05-12T04:35:50.080748+00:00 · methodology


