pith. machine review for the scientific record.

arxiv: 2604.02598 · v2 · submitted 2026-04-03 · 💻 cs.HC · cs.AI · cs.PL

Recognition: 2 theorem links

· Lean Theorem

Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:25 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.PL
keywords explorable theorems · proof comprehension · formal verification · Lean · LLM translation · interactive learning · mathematics education · proof exploration

The pith

Grounding written proofs in executable Lean code lets readers step through them and test examples interactively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that static text explanations of theorems hit a ceiling because they cannot be executed or explored. By using LLMs to translate a theorem and proof into the Lean formal language and then linking the written steps to the code, the system creates interactive features such as stepping through proofs, testing custom examples or counterexamples, and tracing logical dependencies. Each step is produced by actually running the Lean proof on the chosen example and showing its state. In a study of 16 participants performing a proof-reading task, those given access to these features produced better, more correct, and more detailed answers to comprehension questions than those without them.

Core claim

Explorable theorems are created by translating a written theorem and proof into Lean via LLMs, then linking each natural-language step to its formal counterpart. Readers can advance step by step, supply their own examples, and inspect the exact intermediate states produced by executing the Lean code; in a proof-comprehension study, participants with access to these features gave more accurate and detailed answers.
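The paper's running example (the theorem of Figure 2) suggests what such a grounding looks like. The sketch below is illustrative, not the paper's actual Lean code: the theorem is stated in Lean 4 / Mathlib syntax, with `have` steps mirroring the written proof steps the system would link to, and the final compositeness argument left as `sorry`.

```lean
-- Illustrative sketch (assumes a Mathlib import), not the paper's code.
import Mathlib

-- "For all integers x, if x > 2, then x² − 1 is not prime."
theorem sq_sub_one_not_prime (x : ℤ) (hx : x > 2) : ¬ Prime (x ^ 2 - 1) := by
  -- Step 1: factor x² − 1 as (x − 1)(x + 1).
  have hfactor : x ^ 2 - 1 = (x - 1) * (x + 1) := by ring
  -- Step 2: both factors exceed 1 whenever x > 2.
  have h1 : x - 1 > 1 := by linarith
  have h2 : x + 1 > 1 := by linarith
  -- Step 3: a product of two integers each greater than 1 is not prime
  -- (argument omitted in this sketch).
  sorry
```

Each `have` produces an intermediate goal state that Lean can report for a concrete value of `x`, which is exactly the material the interface surfaces to readers.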

What carries the argument

Explorable theorems, the linked pair of written proof and its Lean formalization that supports step-level execution and example testing.
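Concretely, the linked pair can be pictured as a small data structure pairing each prose step with the Lean tactic that implements it and the proof state Lean reports after executing it. A minimal sketch of an assumed representation, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class LinkedStep:
    """One written proof step, linked to its formal counterpart."""
    prose: str            # the sentence from the written proof
    lean_tactic: str      # the Lean tactic(s) implementing this step
    goal_state: str = ""  # proof state after executing the tactic in Lean

@dataclass
class ExplorableTheorem:
    """A written theorem paired with its Lean formalization, step by step."""
    statement: str
    lean_statement: str
    steps: list[LinkedStep] = field(default_factory=list)

    def step_through(self):
        """Yield (prose, state) pairs in order, as a reader advances."""
        for s in self.steps:
            yield s.prose, s.goal_state

# Hypothetical instance for the running example "x > 2 → x² − 1 not prime".
thm = ExplorableTheorem(
    statement="For all integers x, if x > 2, then x^2 - 1 is not prime.",
    lean_statement="theorem t (x : Int) (hx : x > 2) : ¬ Prime (x ^ 2 - 1)",
    steps=[
        LinkedStep("Factor x^2 - 1 as (x - 1)(x + 1).",
                   "have h : x ^ 2 - 1 = (x - 1) * (x + 1) := by ring",
                   "h : x ^ 2 - 1 = (x - 1) * (x + 1)"),
    ],
)
first_prose, first_state = next(thm.step_through())
```

Stepping, example testing, and dependency tracing are then different traversals of the same linked structure.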

If this is right

  • Readers can actively test whether a claimed theorem holds for chosen inputs rather than accepting it on trust.
  • Proof dependencies become visible because each step can be examined in isolation with its exact formal state.
  • The same written proof can support multiple different examples without rewriting the text.
  • Comprehension questions about proofs receive more detailed answers when users have access to the executable version.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to other formal languages or proof assistants beyond Lean if translation quality is sufficient.
  • Integration with online math platforms might allow collaborative exploration of theorems by multiple readers at once.
  • For very long or complex proofs the cost of accurate LLM translation may limit practical use unless automated verification catches errors.
  • The technique might generalize to non-mathematical technical explanations that can be grounded in executable specifications.

Load-bearing premise

LLM translations from natural-language proofs to correct Lean code are reliable enough to support exploration without introducing misleading states.
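One mitigating factor: Lean's type checker rejects ill-formed proofs, so a translation pipeline can loop on compiler feedback until the generated code is accepted. What compilation cannot catch is a well-formed proof of a subtly different statement. A sketch of such a loop, with `llm_translate` and `lean_check` as stand-in interfaces rather than real APIs:

```python
def translate_with_verification(prose_proof, llm_translate, lean_check,
                                max_attempts=3):
    """Retry LLM translation until the Lean checker accepts the output.

    llm_translate(prose, feedback) -> candidate Lean source (stand-in for an LLM call)
    lean_check(source) -> (ok, error_message)  (stand-in for compiling the file)
    Compilation guards against ill-formed output, but NOT against a valid
    proof of the wrong statement -- the residual risk named above.
    """
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm_translate(prose_proof, feedback)
        ok, err = lean_check(candidate)
        if ok:
            return candidate
        feedback = err  # feed the type error back into the next attempt
    return None  # give up; fall back to the static text

# Toy demonstration with stubbed components.
attempts = iter(["theorem bad : 1 = 2 := rfl", "theorem ok : 1 = 1 := rfl"])
result = translate_with_verification(
    "1 equals 1",
    llm_translate=lambda prose, fb: next(attempts),
    lean_check=lambda src: (True, "") if "1 = 1" in src else (False, "type mismatch"),
)
```

In the toy run, the first candidate fails the (stubbed) check and its error is fed back; the second compiles and is returned.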

What would settle it

A follow-up study in which the Lean translations contain errors and participants using the system perform no better, or worse, on comprehension questions than readers given only the original text.

Figures

Figures reproduced from arXiv: 2604.02598 by Andrew Head, Harrison Goldstein, Hita Kambhamettu, Sean Welleck, Will Crichton.

Figure 1
Figure 1: Explorable theorems enlivens a static written proof with explorability affordances derived by analyzing and executing … view at source ↗
Figure 2
Figure 2: The explorable theorems interface for the theorem “For all integers x, if x > 2, then x² − 1 is not prime”. The input slider (1) lets readers set a concrete value of x; color coding indicates where the theorem’s assumption x > 2 holds. Each proof step is instantiated with Lean-verified intermediate values (2): for x = 2, the expression n = x² − 1 evaluates to 3. Clicking a variable such as n reveals the d… view at source ↗
Figure 3
Figure 3: The pipeline for grounding interaction affordances … view at source ↗
Figure 4
Figure 4: Instructor preferences by condition. Box plot shows the distribution of TrueSkill ratings assigned to participant responses through pairwise comparisons by a mathematics instructor, combined across both study tasks. The instructor preferred responses written with explorable theorems in 94.9% of pairwise comparisons. Consider the following example. Participants were asked to describe the proof of this th… view at source ↗
Figure 5
Figure 5: Response quality by condition. Box plots show the distribution of graded correctness scores and percentage of responses referencing explicit proof steps, combined across both theorem tasks. Vertical lines within each box indicate the median. Responses written after using explorable theorems were graded as significantly more correct and referenced specific proof steps more frequently. Beyond summarizing th… view at source ↗
Figure 7
Figure 7: Self-reported task load by condition. Box plots show the distribution of participant responses on a 1–7 agreement scale for four task load items, comparing explorable theorems to the baseline. Participants felt significantly less rushed and more successful with explorable theorems. No significant differences were found for mental demand or desire to use the tool in the future. A common action at the start… view at source ↗
Figure 6
Figure 6: Timeline of proof-reading behavior for each partic… view at source ↗
Figure 6
Figure 6: P2 switched modes 8 times across 17 step visits, while P3 … view at source ↗
read the original abstract

LLM-generated explanations can make technical content more accessible, but there is a ceiling on what they can support interactively. Because LLM outputs are static text, they cannot be executed or stepped through. We argue that grounding explanations in a formalized representation enables interactive affordances beyond what static text supports. We instantiate this idea for mathematical proof comprehension with explorable theorems, a system that uses LLMs to translate a theorem and its written proof into Lean, a programming language for machine-checked proofs, and links the written proof with the Lean code. Readers can work through the proof at a step-level granularity, test custom examples or counterexamples, and trace the logical dependencies bridging each step. Each worked-out step is produced by executing the Lean proof on that example and extracting its intermediate state. A user study ($n = 16$) shows potential advantages of this approach: in a proof-reading task, participants who had access to the provided explorability features gave better, more correct, and more detailed answers to comprehension questions, demonstrating a stronger overall understanding of the underlying mathematics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes explorable theorems, a system that uses LLMs to translate written mathematical theorems and proofs into Lean code and links the natural-language text to the formal representation. This enables interactive affordances including step-level proof execution, testing of custom examples or counterexamples, and tracing of logical dependencies. A user study (n=16) reports that participants with access to these features produced better, more correct, and more detailed answers to comprehension questions than those without.

Significance. If the central claims hold after addressing methodological gaps, the work offers a concrete bridge between static written mathematics and executable formal artifacts. It demonstrates a practical way to augment LLM-generated explanations with verifiable interactivity, which could improve proof comprehension in education and research settings while preserving the original written form.

major comments (2)
  1. [User Study] The user study (n=16) is presented as evidence of comprehension gains, yet the abstract and methods description supply no information on randomization, pre-registration, time-on-task controls, or statistical reporting (e.g., error bars or effect sizes). This leaves open the possibility that observed differences arise from study-design artifacts rather than the explorability features themselves.
  2. [System Architecture] The system relies on LLM-generated Lean translations to support step execution and counterexample testing, but the manuscript provides no manual verification or accuracy metrics for these translations. Without such checks, mismatches between the written proof and the formal artifact would invalidate the claimed interactive affordances.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific theorems or proof domains used in the evaluation to allow readers to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in methodological reporting and system verification that we will address in revision. Below we respond point by point.

read point-by-point responses
  1. Referee: [User Study] The user study (n=16) is presented as evidence of comprehension gains, yet the abstract and methods description supply no information on randomization, pre-registration, time-on-task controls, or statistical reporting (e.g., error bars or effect sizes). This leaves open the possibility that observed differences arise from study-design artifacts rather than the explorability features themselves.

    Authors: We agree that the current reporting is insufficient. The 16 participants were randomly assigned to conditions. The study was not pre-registered because it was exploratory. We recorded but did not previously report time-on-task; average completion times will now be included. In the revision we will expand the Methods section with these details, add error bars to all figures, report effect sizes (Cohen's d), and explicitly discuss the small-sample limitation. These changes will allow readers to better evaluate whether the observed gains in answer quality, correctness, and detail are attributable to the explorability features. revision: yes

  2. Referee: [System Architecture] The system relies on LLM-generated Lean translations to support step execution and counterexample testing, but the manuscript provides no manual verification or accuracy metrics for these translations. Without such checks, mismatches between the written proof and the formal artifact would invalidate the claimed interactive affordances.

    Authors: We accept this criticism. While the paper does not describe it, the Lean translations for all theorems and proofs used in the evaluation were manually inspected by the authors to confirm step-level correspondence and working interactivity. No quantitative accuracy metrics across a broader corpus were collected. In the revised manuscript we will add a dedicated verification subsection that reports the manual checks performed, notes any discrepancies encountered, and provides basic accuracy observations for the presented examples. We will also discuss how the linking mechanism allows readers to detect and correct mismatches themselves. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on system description and empirical user study

full rationale

The paper presents a system for translating written theorems and proofs into Lean via LLMs, linking them for interactive step-level execution, counterexample testing, and dependency tracing, then evaluates the approach via a between-subjects user study (n=16) measuring comprehension gains. No load-bearing derivation chain reduces by construction to its own inputs, fitted parameters, or self-citations. The central claim of improved understanding is supported directly by the reported study outcomes rather than any self-definitional equivalence, renamed empirical pattern, or uniqueness theorem imported from prior author work; the claims are grounded in external checks, namely the described interactive affordances and the measured participant performance.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the assumption that current LLMs can produce usable Lean translations and that interactivity causally improves proof comprehension; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption LLMs can translate natural language mathematical proofs into correct Lean code at a level sufficient for interactive use
    Invoked in the system description for the translation step that enables all explorability features.
  • domain assumption Interactive execution and example testing improve human comprehension of proofs compared to static text
    Underlies the user study interpretation and the claimed advantages.

pith-pipeline@v0.9.0 · 5499 in / 1240 out tokens · 42768 ms · 2026-05-13T19:25:56.632741+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

Lara Alcock, Mark Hodds, Somali Roy, and Matthew Inglis. 2015. Investigating and improving undergraduate proof comprehension. Notices of the AMS 62, 7 (2015), 742–752.

  2. [2]

Lara Alcock and Nicola Wilkinson. 2011. e-Proofs: Design of a resource to support proof comprehension in mathematics. Educational Designer 1, 4 (2011), 1–20.

  3. [3]

Paul Ayres. 2001. Systematic mathematical errors and cognitive load. Contemporary Educational Psychology 26, 2 (2001), 227–248.

  4. [4]

Zhangir Azerbayev, Bartosz Piotrowski, and Jeremy Avigad. 2022. ProofNet: A benchmark for autoformalizing and formally proving undergraduate-level mathematics problems. In Second MATH-AI Workshop.

  5. [5]

Elizabeth L Bjork, Robert A Bjork, et al. 2011. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society 2, 59–68 (2011), 56–64.

  6. [6]

Ann Blandford, Dominic Furniss, and Stephann Makri. 2016. Qualitative HCI Research: Going Behind the Scenes. Morgan & Claypool Publishers.

  7. [7]

Neil Chulpongsatorn, Mille Skovhus Lunding, Nishan Soni, and Ryo Suzuki. 2023. Augmented Math: Authoring AR-Based Explorable Explanations by Augmenting Static Math Textbooks. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). ACM, 1–16.

  8. [8]

Will Crichton. 2021. A New Medium for Communicating Research on Programming Languages. In Proceedings of the 1st Workshop on Human Aspects of Types and Reasoning Assistants.

  9. [9]

Ben Davies, Lara Alcock, and Ian Jones. 2020. Comparative judgement, proof summaries and proof comprehension. Educational Studies in Mathematics 105, 2 (2020), 181–197.

  10. [10]

Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean Theorem Prover (System Description). In Proceedings of the 25th International Conference on Automated Deduction (CADE-25) (Lecture Notes in Computer Science, Vol. 9195). Springer, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26

  11. [11]

Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean theorem prover (system description). In International Conference on Automated Deduction (CADE). Springer, 378–388.

  12. [12]

Anton N. Dragunov and Jonathan L. Herlocker. 2003. Designing Intelligent and Dynamic Interfaces for Communicating Mathematics. In Proceedings of the 8th International Conference on Intelligent User Interfaces. 154–161. https://doi.org/10.1145/604045.604085

  13. [13]

Eddie M Gray and David O Tall. 1994. Duality, ambiguity, and flexibility: A “proceptual” view of simple arithmetic. Journal for Research in Mathematics Education 25, 2 (1994), 116–140.

  14. [14]

    Ziwei Gu, Joyce Zhou, Ning-Er Lei, Jonathan K Kummerfeld, Mahmood Jasim, Narges Mahyar, and Elena L Glassman. 2025. AbstractExplorer: Leveraging Structure-Mapping Theory to Enhance Comparative Close Reading at Scale. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–25

  15. [15]

Aditya Gunturu, Yi Wen, Nandi Zhang, Jarin Thundathil, Rubaiat Habib Kazi, and Ryo Suzuki. 2024. Augmented Physics: Creating Interactive and Embedded Physics Simulations from Static Textbook Diagrams. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST ’24). ACM, 1–12.

  16. [16]

Guershon Harel and Larry Sowder. 2007. Toward comprehensive perspectives on the learning and teaching of proof. Second Handbook of Research on Mathematics Teaching and Learning 2 (2007), 805–842.

  17. [17]

John Harrison, Josef Urban, and Freek Wiedijk. 2014. History of interactive theorem proving. Computational Logic 9 (2014), 135–214.

  18. [18]

Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.

  19. [19]

Andrew Head, Amber Xie, and Marti A. Hearst. 2022. Math Augmentation: How Authors Enhance the Readability of Formulas using Novel Visual Design Practices. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3501932

  20. [20]

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill: A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems, Vol. 19.

  21. [21]

Thomas Hubert, Roshan Mehta, Laurent Sartran, et al. 2025. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature (2025). https://doi.org/10.1038/s41586-025-09833-y

  22. [22]

Albert Qiaochu Jiang, Sean Welleck, Jue Ping Zhou, Timothée Lacroix, Jiacheng Liu, Guillaume Lample, Samuel Lewis, Yuhuai Wu, Xiaokai Hou, Christian Hutter, et al. 2023. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. In International Conference on Learning Representations (ICLR).

  23. [23]

Dongwei Jiang, Marcio Fonseca, and Shay B Cohen. 2024. LeanReasoner: Boosting complex logical reasoning with Lean. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7497–7510.

  24. [24]

Philip N Johnson-Laird. 2005. Mental models and thought. The Cambridge Handbook of Thinking and Reasoning (2005), 185–208.

  25. [25]

Hee-jeong Kim. 2020. Concreteness fading strategy: A promising and sustainable instructional model in mathematics classrooms. Sustainability 12, 6 (2020), 2211.

  26. [26]

Zeger-jan Kock, Ulises Salinas-Hernández, and Birgit Pepin. 2025. Engineering students’ initial use schemes of ChatGPT as an instrument for learning. Digital Experiences in Mathematics Education 11, 1 (2025), 192–218.

  27. [27]

Christine P Lee, David Porfirio, Xinyu Jessica Wang, Kevin Chenkai Zhao, and Bilge Mutlu. 2025. VeriPlan: Integrating formal verification and LLMs into end-user planning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19.

  28. [28]

Stanley Letovsky. 1987. Cognitive processes in program comprehension. Journal of Systems and Software 7, 4 (1987), 325–339.

  29. [29]

Yong Li, Shoaib Kamil, Alec Jacobson, and Yotam Gingold. 2022. Hé: Document Processor for Executable Linear Algebra Papers. In SIGGRAPH Asia 2022 Conference Papers. https://doi.org/10.1145/3550469.3555395

  30. [30]

Raymond Lister, Elizabeth S Adams, Sue Fitzgerald, William Fone, John Hamer, Morten Lindholm, Robert McCartney, Jan Erik Moström, Kate Sanders, Otto Seppälä, et al. 2004. A multi-national study of reading and tracing skills in novice programmers. ACM SIGCSE Bulletin 36, 4 (2004), 119–150.

  31. [31]

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. 2026. Numina-LeanAgent: An Open and General Agentic Reasoning System for Formal Mathematics. arXiv preprint arXiv:2601.14027 (2026).

  32. [32]

Patrick Massot. 2024. Teaching mathematics using Lean and controlled natural language. In 15th International Conference on Interactive Theorem Proving (ITP 2024). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 27–1.

  33. [33]

Juan Pablo Mejía-Ramos, Evan Fuller, Keith Weber, Kathryn Rhoads, and Aron Samkoff. 2012. An assessment model for proof comprehension in undergraduate mathematics. Educational Studies in Mathematics 79, 1 (2012), 3–18.

  34. [34]

Juan P. Mejía-Ramos, Erika Fuller, Keith Weber, Kari Rhoads, and Alexis Samkoff.

  35. [35]

An Assessment Model for Proof Comprehension in Undergraduate Mathematics. Educational Studies in Mathematics 79, 1 (2012), 3–18. https://doi.org/10.1007/s10649-011-9349-7

  36. [36]

Hugo Mercier and Dan Sperber. 2011. Why do humans reason? Arguments for an argumentative theory. Behavioral and Brain Sciences 34, 2 (2011), 57–74.

  37. [37]

Frédéric Tran Minh, Laure Gonnord, and Julien Narboux. 2025. Proof Assistants for Teaching: A Survey. In ThEdu@CADE. https://api.semanticscholar.org/CorpusID:278422822

  38. [38]

Nagham M Mohammad, Matthew Demers, Erin McCubbin, Mitchell Jackson, and Sara M Fulmer. 2025. How college students use ChatGPT. Pedagogical Research 10, 4 (2025), em0250.

  39. [39]

Robert C Moore. 1994. Making the transition to formal proof. Educational Studies in Mathematics 27, 3 (1994), 249–266.

  40. [40]

Leonardo de Moura and Sebastian Ullrich. 2021. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction. Springer, 625–635.

  41. [41]

Chris Olah and Shan Carter. 2017. Research Debt. Distill (2017). https://doi.org/10.23915/distill.00005

  42. [42]

oOo0oOo. 2024. lean-lsp-mcp: Lean Theorem Prover MCP. https://github.com/oOo0oOo/lean-lsp-mcp. MCP server that allows agentic interaction with the Lean theorem prover via the Language Server Protocol. GitHub repository.

  43. [43]

Dale Parsons and Patricia Haden. 2006. Parson’s programming puzzles: a fun and effective learning tool for first programming courses. In Proceedings of the 8th Australasian Conference on Computing Education - Volume 52. 157–163.

  44. [44]

Birgit Pepin, Nils Buchholtz, and Ulises Salinas-Hernández. 2025. A scoping survey of ChatGPT in mathematics education. Digital Experiences in Mathematics Education 11, 1 (2025), 9–41.

  45. [45]

Clément Pit-Claudel. 2020. Untangling mechanized proofs. In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering.

  46. [46]

Nicholas Rougeux and Oliver Byrne. 2018. Byrne’s Euclid: The First Six Books of the Elements of Euclid with Coloured Diagrams and Symbols. https://www.c82.net/euclid/. Online interactive reproduction.

  47. [47]

Alan H Schoenfeld. 1988. When good teaching leads to bad results: The disasters of ‘well-taught’ mathematics courses. Educational Psychologist 23, 2 (1988), 145–166.

  48. [48]

Annie Selden and John Selden. 2003. Validations of proofs considered as texts: Can undergraduates tell whether an argument proves a theorem? Journal for Research in Mathematics Education 34, 1 (2003), 4–36.

  49. [49]

Derar Serhan and Natalie B. Welcome. 2024. Integrating ChatGPT in the Calculus Classroom: Student Perceptions. International Journal of Technology in Education and Science (2024). https://api.semanticscholar.org/CorpusId:269835173

  50. [50]

Despina A Stylianou, Maria L Blanton, and Eric J Knuth. 2009. Teaching and Learning Proof Across the Grades. Routledge, New York.

  51. [51]

Despina A Stylianou, Maria L Blanton, and Ourania Rotou. 2015. Undergraduate students’ understanding of proof: Relationships between proof conceptions, beliefs, and classroom experiences with learning proof. International Journal of Research in Undergraduate Mathematics Education 1, 1 (2015), 91–134.

  52. [52]

Nicole Sultanum and Arjun Srinivasan. 2023. DataTales: Investigating the use of large language models for authoring data-driven articles. In 2023 IEEE Visualization and Visual Analytics (VIS). IEEE, 231–235.

  53. [53]

John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285.

  54. [54]

John Sweller, Paul Ayres, and Slava Kalyuga. 2011. Cognitive Load Theory. Springer Science & Business Media.

  55. [55]

Jeffrey Tao, Litao Yan, Jessica Shi, Mia Ginsberg, and Andrew Head. 2025. FreeForm: Flexibly Augmenting Formulas with Synchronized Markup and Graphical Edits. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–12.

  56. [56]

Oliver Tatton-Brown. 2019. Rigour and Intuition. Erkenntnis 86 (2019), 1757–

  57. [57]

    https://api.semanticscholar.org/CorpusID:254465500

  58. [58]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

  59. [59]

The Lean Developers. 2024. Widgets: The User-Widgets System. https://lean-lang.org/examples/1900-1-1-widgets/. Lean 4 Documentation.

  60. [60]

Bret Victor. 2011. Explorable Explanations. https://worrydream.com/ExplorableExplanations/. Accessed 2026-03-01.

  61. [61]

Bret Victor. 2011. Kill Math. http://worrydream.com/KillMath/. Accessed 2024-05-24.

  62. [62]

Bret Victor. 2011. Tangle: a JavaScript Library for Reactive Documents. https://worrydream.com/Tangle/. Accessed 2026-03-24.

  63. [63]

Yousef Wardat, Mohammad A. Tashtoush, Rommel Alali, and Adeeb M Jarrah.

  64. [64]

ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education (2023). https://pdfs.semanticscholar.org/3174/39900cffedb7f18e623c1b0ab3fa8e54de21.pdf

  65. [65]

Keith Weber. 2008. How mathematicians determine if an argument is a valid proof. Journal for Research in Mathematics Education 39, 4 (2008), 431–459.

  66. [66]

Keith Weber. 2015. Effective Proof Reading Strategies for Comprehending Mathematical Proofs. International Journal of Research in Undergraduate Mathematics Education 1, 3 (2015), 289–314. https://doi.org/10.1007/s40753-015-0011-0

  67. [67]

Keith Weber and Juan Pablo Mejia-Ramos. 2011. Why and how mathematicians read proofs: an exploratory study. Educational Studies in Mathematics 76, 3 (2011), 329–344. https://doi.org/10.1007/s10649-010-9292-z

  68. [68]

Erik Weitnauer, David Landy, and Erin Ottmar. 2016. Graspable Math: Towards Dynamic Algebra Notations that Support Learners Better than Paper. In 2016 Future Technologies Conference. 406–414.

  69. [69]

Jelle Wemmenhove, Dick Arends, Thijs Beurskens, Maitreyee Bhaid, Sean McCarren, Jan Moraal, Diego Rivera Garrido, David Tuin, Malcolm Vassallo, Pieter Wils, et al. 2022. Waterproof: educational software for learning how to write mathematical proofs. arXiv preprint arXiv:2211.13513 (2022).

  70. [70]

Carl E. Wieman, Wendy K. Adams, and Katherine K. Perkins. 2008. PhET: Simulations that enhance learning. Science 322, 5902 (2008), 682–683.

  71. [71]

Michelle Wu, Aadil Gopal, and Andrew Head. 2023. FFL: A Language and Live Runtime for Styling and Labeling Typeset Math Formulas. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

  72. [72]

Hyunkyoung Yoon, Jihye Hwang, Kyungwon Lee, Kyeong Hah Roh, and Oh Nam Kwon. 2024. Students’ use of generative artificial intelligence for proving mathematical statements. ZDM–Mathematics Education 56, 7 (2024), 1531–1551.

  73. [73]

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2022. MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. In International Conference on Learning Representations (ICLR).