pith. machine review for the scientific record.

arxiv: 2604.02598 · v2 · submitted 2026-04-03 · 💻 cs.HC · cs.AI · cs.PL

Recognition: 2 theorem links

· Lean Theorem

Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:25 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.PL
keywords explorable theorems · proof comprehension · formal verification · Lean · LLM translation · interactive learning · mathematics education · proof exploration

The pith

Grounding written proofs in executable Lean code lets readers step through them and test examples interactively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that static text explanations of theorems hit a ceiling because they cannot be executed or explored. By using LLMs to translate a theorem and proof into the Lean formal language and then linking the written steps to the code, the system creates interactive features such as stepping through proofs, testing custom examples or counterexamples, and tracing logical dependencies. Each step is produced by actually running the Lean proof on the chosen example and showing its state. In a study of 16 participants performing a proof-reading task, those given access to these features produced better, more correct, and more detailed answers to comprehension questions than those without them.

Core claim

Explorable theorems are created by translating a written theorem and proof into Lean via LLMs, then linking each natural-language step to its formal counterpart. Readers can advance step by step, supply their own examples, and inspect the exact intermediate states produced by executing the Lean code; in a proof-comprehension study, participants with access to these features gave more accurate and detailed answers.
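The paper's running example (the theorem of Figure 2) suggests what such a grounding looks like. The sketch below is illustrative, not the paper's actual Lean code: the theorem is stated in Lean 4 / Mathlib syntax, with `have` steps mirroring the written proof steps the system would link to, and the final compositeness argument left as `sorry`.

```lean
-- Illustrative sketch (assumes a Mathlib import), not the paper's code.
import Mathlib

-- "For all integers x, if x > 2, then x² − 1 is not prime."
theorem sq_sub_one_not_prime (x : ℤ) (hx : x > 2) : ¬ Prime (x ^ 2 - 1) := by
  -- Step 1: factor x² − 1 as (x − 1)(x + 1).
  have hfactor : x ^ 2 - 1 = (x - 1) * (x + 1) := by ring
  -- Step 2: both factors exceed 1 whenever x > 2.
  have h1 : x - 1 > 1 := by linarith
  have h2 : x + 1 > 1 := by linarith
  -- Step 3: a product of two integers each greater than 1 is not prime
  -- (argument omitted in this sketch).
  sorry
```

Each `have` produces an intermediate goal state that Lean can report for a concrete value of `x`, which is exactly the material the interface surfaces to readers.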

What carries the argument

Explorable theorems, the linked pair of written proof and its Lean formalization that supports step-level execution and example testing.
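Concretely, the linked pair can be pictured as a small data structure pairing each prose step with the Lean tactic that implements it and the proof state Lean reports after executing it. A minimal sketch of an assumed representation, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class LinkedStep:
    """One written proof step, linked to its formal counterpart."""
    prose: str            # the sentence from the written proof
    lean_tactic: str      # the Lean tactic(s) implementing this step
    goal_state: str = ""  # proof state after executing the tactic in Lean

@dataclass
class ExplorableTheorem:
    """A written theorem paired with its Lean formalization, step by step."""
    statement: str
    lean_statement: str
    steps: list[LinkedStep] = field(default_factory=list)

    def step_through(self):
        """Yield (prose, state) pairs in order, as a reader advances."""
        for s in self.steps:
            yield s.prose, s.goal_state

# Hypothetical instance for the running example "x > 2 → x² − 1 not prime".
thm = ExplorableTheorem(
    statement="For all integers x, if x > 2, then x^2 - 1 is not prime.",
    lean_statement="theorem t (x : Int) (hx : x > 2) : ¬ Prime (x ^ 2 - 1)",
    steps=[
        LinkedStep("Factor x^2 - 1 as (x - 1)(x + 1).",
                   "have h : x ^ 2 - 1 = (x - 1) * (x + 1) := by ring",
                   "h : x ^ 2 - 1 = (x - 1) * (x + 1)"),
    ],
)
first_prose, first_state = next(thm.step_through())
```

Stepping, example testing, and dependency tracing are then different traversals of the same linked structure.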

If this is right

  • Readers can actively test whether a claimed theorem holds for chosen inputs rather than accepting it on trust.
  • Proof dependencies become visible because each step can be examined in isolation with its exact formal state.
  • The same written proof can support multiple different examples without rewriting the text.
  • Comprehension questions about proofs receive more detailed answers when users have access to the executable version.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to other formal languages or proof assistants beyond Lean if translation quality is sufficient.
  • Integration with online math platforms might allow collaborative exploration of theorems by multiple readers at once.
  • For very long or complex proofs the cost of accurate LLM translation may limit practical use unless automated verification catches errors.
  • The technique might generalize to non-mathematical technical explanations that can be grounded in executable specifications.

Load-bearing premise

LLM translations from natural-language proofs to correct Lean code are reliable enough to support exploration without introducing misleading states.
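One mitigating factor: Lean's type checker rejects ill-formed proofs, so a translation pipeline can loop on compiler feedback until the generated code is accepted. What compilation cannot catch is a well-formed proof of a subtly different statement. A sketch of such a loop, with `llm_translate` and `lean_check` as stand-in interfaces rather than real APIs:

```python
def translate_with_verification(prose_proof, llm_translate, lean_check,
                                max_attempts=3):
    """Retry LLM translation until the Lean checker accepts the output.

    llm_translate(prose, feedback) -> candidate Lean source (stand-in for an LLM call)
    lean_check(source) -> (ok, error_message)  (stand-in for compiling the file)
    Compilation guards against ill-formed output, but NOT against a valid
    proof of the wrong statement -- the residual risk named above.
    """
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm_translate(prose_proof, feedback)
        ok, err = lean_check(candidate)
        if ok:
            return candidate
        feedback = err  # feed the type error back into the next attempt
    return None  # give up; fall back to the static text

# Toy demonstration with stubbed components.
attempts = iter(["theorem bad : 1 = 2 := rfl", "theorem ok : 1 = 1 := rfl"])
result = translate_with_verification(
    "1 equals 1",
    llm_translate=lambda prose, fb: next(attempts),
    lean_check=lambda src: (True, "") if "1 = 1" in src else (False, "type mismatch"),
)
```

In the toy run, the first candidate fails the (stubbed) check and its error is fed back; the second compiles and is returned.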

What would settle it

A follow-up study in which the Lean translations contain errors and participants using the system perform no better, or worse, on comprehension questions than readers given only the original text.

Figures

Figures reproduced from arXiv: 2604.02598 by Andrew Head, Harrison Goldstein, Hita Kambhamettu, Sean Welleck, Will Crichton.

Figure 1
Figure 1: Explorable theorems enlivens a static written proof with explorability affordances derived by analyzing and executing … view at source ↗
Figure 2
Figure 2: The explorable theorems interface for the theorem “For all integers x, if x > 2, then x² − 1 is not prime”. The input slider (1) lets readers set a concrete value of x; color coding indicates where the theorem’s assumption x > 2 holds. Each proof step is instantiated with Lean-verified intermediate values (2): for x = 2, the expression n = x² − 1 evaluates to 3. Clicking a variable such as n reveals the d… view at source ↗
Figure 3
Figure 3: The pipeline for grounding interaction affordances … view at source ↗
Figure 4
Figure 4: Instructor preferences by condition. Box plot shows the distribution of TrueSkill ratings assigned to participant responses through pairwise comparisons by a mathematics instructor, combined across both study tasks. The instructor preferred responses written with explorable theorems in 94.9% of pairwise comparisons. Consider the following example. Participants were asked to describe the proof of this th… view at source ↗
Figure 5
Figure 5: Response quality by condition. Box plots show the distribution of graded correctness scores and percentage of responses referencing explicit proof steps, combined across both theorem tasks. Vertical lines within each box indicate the median. Responses written after using explorable theorems were graded as significantly more correct and referenced specific proof steps more frequently. Beyond summarizing th… view at source ↗
Figure 7
Figure 7: Self-reported task load by condition. Box plots show the distribution of participant responses on a 1–7 agreement scale for four task load items, comparing explorable theorems to the baseline. Participants felt significantly less rushed and more successful with explorable theorems. No significant differences were found for mental demand or desire to use the tool in the future. A common action at the start… view at source ↗
Figure 6
Figure 6: Timeline of proof-reading behavior for each partic… view at source ↗
Figure 6
Figure 6: P2 switched modes 8 times across 17 step visits, while P3 … view at source ↗
read the original abstract

LLM-generated explanations can make technical content more accessible, but there is a ceiling on what they can support interactively. Because LLM outputs are static text, they cannot be executed or stepped through. We argue that grounding explanations in a formalized representation enables interactive affordances beyond what static text supports. We instantiate this idea for mathematical proof comprehension with explorable theorems, a system that uses LLMs to translate a theorem and its written proof into Lean, a programming language for machine-checked proofs, and links the written proof with the Lean code. Readers can work through the proof at a step-level granularity, test custom examples or counterexamples, and trace the logical dependencies bridging each step. Each worked-out step is produced by executing the Lean proof on that example and extracting its intermediate state. A user study ($n = 16$) shows potential advantages of this approach: in a proof-reading task, participants who had access to the provided explorability features gave better, more correct, and more detailed answers to comprehension questions, demonstrating a stronger overall understanding of the underlying mathematics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes explorable theorems, a system that uses LLMs to translate written mathematical theorems and proofs into Lean code and links the natural-language text to the formal representation. This enables interactive affordances including step-level proof execution, testing of custom examples or counterexamples, and tracing of logical dependencies. A user study (n=16) reports that participants with access to these features produced better, more correct, and more detailed answers to comprehension questions than those without.

Significance. If the central claims hold after addressing methodological gaps, the work offers a concrete bridge between static written mathematics and executable formal artifacts. It demonstrates a practical way to augment LLM-generated explanations with verifiable interactivity, which could improve proof comprehension in education and research settings while preserving the original written form.

major comments (2)
  1. [User Study] The user study (n=16) is presented as evidence of comprehension gains, yet the abstract and methods description supply no information on randomization, pre-registration, time-on-task controls, or statistical reporting (e.g., error bars or effect sizes). This leaves open the possibility that observed differences arise from study-design artifacts rather than the explorability features themselves.
  2. [System Architecture] The system relies on LLM-generated Lean translations to support step execution and counterexample testing, but the manuscript provides no manual verification or accuracy metrics for these translations. Without such checks, mismatches between the written proof and the formal artifact would invalidate the claimed interactive affordances.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific theorems or proof domains used in the evaluation to allow readers to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in methodological reporting and system verification that we will address in revision. Below we respond point by point.

read point-by-point responses
  1. Referee: [User Study] The user study (n=16) is presented as evidence of comprehension gains, yet the abstract and methods description supply no information on randomization, pre-registration, time-on-task controls, or statistical reporting (e.g., error bars or effect sizes). This leaves open the possibility that observed differences arise from study-design artifacts rather than the explorability features themselves.

    Authors: We agree that the current reporting is insufficient. The 16 participants were randomly assigned to conditions. The study was not pre-registered because it was exploratory. We recorded but did not previously report time-on-task; average completion times will now be included. In the revision we will expand the Methods section with these details, add error bars to all figures, report effect sizes (Cohen's d), and explicitly discuss the small-sample limitation. These changes will allow readers to better evaluate whether the observed gains in answer quality, correctness, and detail are attributable to the explorability features. revision: yes

  2. Referee: [System Architecture] The system relies on LLM-generated Lean translations to support step execution and counterexample testing, but the manuscript provides no manual verification or accuracy metrics for these translations. Without such checks, mismatches between the written proof and the formal artifact would invalidate the claimed interactive affordances.

    Authors: We accept this criticism. While the paper does not describe it, the Lean translations for all theorems and proofs used in the evaluation were manually inspected by the authors to confirm step-level correspondence and working interactivity. No quantitative accuracy metrics across a broader corpus were collected. In the revised manuscript we will add a dedicated verification subsection that reports the manual checks performed, notes any discrepancies encountered, and provides basic accuracy observations for the presented examples. We will also discuss how the linking mechanism allows readers to detect and correct mismatches themselves. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on system description and empirical user study

full rationale

The paper presents a system for translating written theorems and proofs into Lean via LLMs, linking them for interactive step-level execution, counterexample testing, and dependency tracing, then evaluates the approach via a between-subjects user study (n=16) measuring comprehension gains. No load-bearing derivation chain reduces by construction to its own inputs, fitted parameters, or self-citations. The central claim of improved understanding is supported directly by the reported study outcomes rather than any self-definitional equivalence, renamed empirical pattern, or uniqueness theorem imported from prior author work; the claims are grounded in external checks, namely the described interactive affordances and the measured participant performance.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the assumption that current LLMs can produce usable Lean translations and that interactivity causally improves proof comprehension; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption LLMs can translate natural language mathematical proofs into correct Lean code at a level sufficient for interactive use
    Invoked in the system description for the translation step that enables all explorability features.
  • domain assumption Interactive execution and example testing improve human comprehension of proofs compared to static text
    Underlies the user study interpretation and the claimed advantages.

pith-pipeline@v0.9.0 · 5499 in / 1240 out tokens · 42768 ms · 2026-05-13T19:25:56.632741+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

Lara Alcock, Mark Hodds, Somali Roy, and Matthew Inglis. 2015. Investigating and improving undergraduate proof comprehension. Notices of the AMS 62, 7 (2015), 742–752.

  2. [2]

Lara Alcock and Nicola Wilkinson. 2011. e-Proofs: Design of a resource to support proof comprehension in mathematics. Educational Designer 1, 4 (2011), 1–20.

  3. [3]

Paul Ayres. 2001. Systematic mathematical errors and cognitive load. Contemporary Educational Psychology 26, 2 (2001), 227–248.

  4. [4]

Zhangir Azerbayev, Bartosz Piotrowski, and Jeremy Avigad. 2022. ProofNet: A benchmark for autoformalizing and formally proving undergraduate-level mathematics problems. In Second MATH-AI Workshop.

  5. [5]

Elizabeth L Bjork, Robert A Bjork, et al. 2011. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society 2, 59–68 (2011), 56–64.

  6. [6]

Ann Blandford, Dominic Furniss, and Stephann Makri. 2016. Qualitative HCI Research: Going Behind the Scenes. Morgan & Claypool Publishers.

  7. [7]

Neil Chulpongsatorn, Mille Skovhus Lunding, Nishan Soni, and Ryo Suzuki. 2023. Augmented Math: Authoring AR-Based Explorable Explanations by Augmenting Static Math Textbooks. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). ACM, 1–16.

  8. [8]

Will Crichton. 2021. A New Medium for Communicating Research on Programming Languages. In Proceedings of the 1st Workshop on Human Aspects of Types and Reasoning Assistants.

  9. [9]

Ben Davies, Lara Alcock, and Ian Jones. 2020. Comparative judgement, proof summaries and proof comprehension. Educational Studies in Mathematics 105, 2 (2020), 181–197.

  10. [10]

Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean Theorem Prover (System Description). In Proceedings of the 25th International Conference on Automated Deduction (CADE-25) (Lecture Notes in Computer Science, Vol. 9195). Springer, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26

  11. [11]

Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean theorem prover (system description). In International Conference on Automated Deduction (CADE). Springer, 378–388.

  12. [12]

Anton N. Dragunov and Jonathan L. Herlocker. 2003. Designing Intelligent and Dynamic Interfaces for Communicating Mathematics. In Proceedings of the 8th International Conference on Intelligent User Interfaces. 154–161. https://doi.org/10.1145/604045.604085

  13. [13]

Eddie M Gray and David O Tall. 1994. Duality, ambiguity, and flexibility: A “proceptual” view of simple arithmetic. Journal for Research in Mathematics Education 25, 2 (1994), 116–140.

  14. [14]

    Ziwei Gu, Joyce Zhou, Ning-Er Lei, Jonathan K Kummerfeld, Mahmood Jasim, Narges Mahyar, and Elena L Glassman. 2025. AbstractExplorer: Leveraging Structure-Mapping Theory to Enhance Comparative Close Reading at Scale. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–25

  15. [15]

Aditya Gunturu, Yi Wen, Nandi Zhang, Jarin Thundathil, Rubaiat Habib Kazi, and Ryo Suzuki. 2024. Augmented Physics: Creating Interactive and Embedded Physics Simulations from Static Textbook Diagrams. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST ’24). ACM, 1–12.

  16. [16]

Guershon Harel and Larry Sowder. 2007. Toward comprehensive perspectives on the learning and teaching of proof. Second Handbook of Research on Mathematics Teaching and Learning 2 (2007), 805–842.

  17. [17]

John Harrison, Josef Urban, and Freek Wiedijk. 2014. History of interactive theorem proving. Computational Logic 9 (2014), 135–214.

  18. [18]

Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.

  19. [19]

Andrew Head, Amber Xie, and Marti A. Hearst. 2022. Math Augmentation: How Authors Enhance the Readability of Formulas using Novel Visual Design Practices. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3501932

  20. [20]

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill: A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems, Vol. 19.

  21. [21]

Thomas Hubert, Roshan Mehta, Laurent Sartran, et al. 2025. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature (2025). https://doi.org/10.1038/s41586-025-09833-y

  22. [22]

Albert Qiaochu Jiang, Sean Welleck, Jue Ping Zhou, Timothée Lacroix, Jiacheng Liu, Guillaume Lample, Samuel Lewis, Yuhuai Wu, Xiaokai Hou, Christian Hutter, et al. 2023. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. In International Conference on Learning Representations (ICLR).

  23. [23]

Dongwei Jiang, Marcio Fonseca, and Shay B Cohen. 2024. LeanReasoner: Boosting complex logical reasoning with Lean. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7497–7510.

  24. [24]

Philip N Johnson-Laird. 2005. Mental models and thought. The Cambridge Handbook of Thinking and Reasoning (2005), 185–208.

  25. [25]

Hee-jeong Kim. 2020. Concreteness fading strategy: A promising and sustainable instructional model in mathematics classrooms. Sustainability 12, 6 (2020), 2211.

  26. [26]

Zeger-jan Kock, Ulises Salinas-Hernández, and Birgit Pepin. 2025. Engineering students’ initial use schemes of ChatGPT as an instrument for learning. Digital Experiences in Mathematics Education 11, 1 (2025), 192–218.

  27. [27]

Christine P Lee, David Porfirio, Xinyu Jessica Wang, Kevin Chenkai Zhao, and Bilge Mutlu. 2025. VeriPlan: Integrating formal verification and LLMs into end-user planning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19.

  28. [28]

Stanley Letovsky. 1987. Cognitive processes in program comprehension. Journal of Systems and Software 7, 4 (1987), 325–339.

  29. [29]

Yong Li, Shoaib Kamil, Alec Jacobson, and Yotam Gingold. 2022. Hé: Document Processor for Executable Linear Algebra Papers. In SIGGRAPH Asia 2022 Conference Papers. https://doi.org/10.1145/3550469.3555395

  30. [30]

Raymond Lister, Elizabeth S Adams, Sue Fitzgerald, William Fone, John Hamer, Morten Lindholm, Robert McCartney, Jan Erik Moström, Kate Sanders, Otto Seppälä, et al. 2004. A multi-national study of reading and tracing skills in novice programmers. ACM SIGCSE Bulletin 36, 4 (2004), 119–150.

  31. [31]

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. 2026. Numina-LeanAgent: An Open and General Agentic Reasoning System for Formal Mathematics. arXiv preprint arXiv:2601.14027 (2026).

  32. [32]

Patrick Massot. 2024. Teaching mathematics using Lean and controlled natural language. In 15th International Conference on Interactive Theorem Proving (ITP 2024). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 27–1.

  33. [33]

Juan Pablo Mejía-Ramos, Evan Fuller, Keith Weber, Kathryn Rhoads, and Aron Samkoff. 2012. An assessment model for proof comprehension in undergraduate mathematics. Educational Studies in Mathematics 79, 1 (2012), 3–18.

  34. [34]

Juan P. Mejía-Ramos, Erika Fuller, Keith Weber, Kari Rhoads, and Alexis Samkoff.

  35. [35]

An Assessment Model for Proof Comprehension in Undergraduate Mathematics. Educational Studies in Mathematics 79, 1 (2012), 3–18. https://doi.org/10.1007/s10649-011-9349-7

  36. [36]

Hugo Mercier and Dan Sperber. 2011. Why do humans reason? Arguments for an argumentative theory. Behavioral and Brain Sciences 34, 2 (2011), 57–74.

  37. [37]

Frédéric Tran Minh, Laure Gonnord, and Julien Narboux. 2025. Proof Assistants for Teaching: A Survey. In ThEdu@CADE. https://api.semanticscholar.org/CorpusID:278422822

  38. [38]

Nagham M Mohammad, Matthew Demers, Erin McCubbin, Mitchell Jackson, and Sara M Fulmer. 2025. How college students use ChatGPT. Pedagogical Research 10, 4 (2025), em0250.

  39. [39]

Robert C Moore. 1994. Making the transition to formal proof. Educational Studies in Mathematics 27, 3 (1994), 249–266.

  40. [40]

Leonardo de Moura and Sebastian Ullrich. 2021. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction. Springer, 625–635.

  41. [41]

Chris Olah and Shan Carter. 2017. Research Debt. Distill (2017). https://doi.org/10.23915/distill.00005

  42. [42]

oOo0oOo. 2024. lean-lsp-mcp: Lean Theorem Prover MCP. https://github.com/oOo0oOo/lean-lsp-mcp. MCP server that allows agentic interaction with the Lean theorem prover via the Language Server Protocol. GitHub repository.

  43. [43]

Dale Parsons and Patricia Haden. 2006. Parson’s programming puzzles: a fun and effective learning tool for first programming courses. In Proceedings of the 8th Australasian Conference on Computing Education - Volume 52. 157–163.

  44. [44]

Birgit Pepin, Nils Buchholtz, and Ulises Salinas-Hernández. 2025. A scoping survey of ChatGPT in mathematics education. Digital Experiences in Mathematics Education 11, 1 (2025), 9–41.

  45. [45]

Clément Pit-Claudel. 2020. Untangling mechanized proofs. In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering.

  46. [46]

Nicholas Rougeux and Oliver Byrne. 2018. Byrne’s Euclid: The First Six Books of the Elements of Euclid with Coloured Diagrams and Symbols. https://www.c82.net/euclid/. Online interactive reproduction.

  47. [47]

Alan H Schoenfeld. 1988. When good teaching leads to bad results: The disasters of ‘well-taught’ mathematics courses. Educational Psychologist 23, 2 (1988), 145–166.

  48. [48]

Annie Selden and John Selden. 2003. Validations of proofs considered as texts: Can undergraduates tell whether an argument proves a theorem? Journal for Research in Mathematics Education 34, 1 (2003), 4–36.

  49. [49]

Derar Serhan and Natalie B. Welcome. 2024. Integrating ChatGPT in the Calculus Classroom: Student Perceptions. International Journal of Technology in Education and Science (2024). https://api.semanticscholar.org/CorpusId:269835173

  50. [50]

Despina A Stylianou, Maria L Blanton, and Eric J Knuth. 2009. Teaching and Learning Proof Across the Grades. Routledge, New York.

  51. [51]

Despina A Stylianou, Maria L Blanton, and Ourania Rotou. 2015. Undergraduate students’ understanding of proof: Relationships between proof conceptions, beliefs, and classroom experiences with learning proof. International Journal of Research in Undergraduate Mathematics Education 1, 1 (2015), 91–134.

  52. [52]

Nicole Sultanum and Arjun Srinivasan. 2023. DataTales: Investigating the use of large language models for authoring data-driven articles. In 2023 IEEE Visualization and Visual Analytics (VIS). IEEE, 231–235.

  53. [53]

John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285.

  54. [54]

John Sweller, Paul Ayres, and Slava Kalyuga. 2011. Cognitive Load Theory. Springer Science & Business Media.

  55. [55]

Jeffrey Tao, Litao Yan, Jessica Shi, Mia Ginsberg, and Andrew Head. 2025. FreeForm: Flexibly Augmenting Formulas with Synchronized Markup and Graphical Edits. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–12.

  56. [56]

Oliver Tatton-Brown. 2019. Rigour and Intuition. Erkenntnis 86 (2019), 1757–

  57. [57]

    https://api.semanticscholar.org/CorpusID:254465500

  58. [58]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

  59. [59]

The Lean Developers. 2024. Widgets: The User-Widgets System. https://lean-lang.org/examples/1900-1-1-widgets/. Lean 4 Documentation.

  60. [60]

Bret Victor. 2011. Explorable Explanations. https://worrydream.com/ExplorableExplanations/. Accessed 2026-03-01.

  61. [61]

Bret Victor. 2011. Kill Math. http://worrydream.com/KillMath/. Accessed 2024-05-24.

  62. [62]

Bret Victor. 2011. Tangle: a JavaScript Library for Reactive Documents. https://worrydream.com/Tangle/. Accessed 2026-03-24.

  63. [63]

Yousef Wardat, Mohammad A. Tashtoush, Rommel Alali, and Adeeb M Jarrah.

  64. [64]

ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education (2023). https://pdfs.semanticscholar.org/3174/39900cffedb7f18e623c1b0ab3fa8e54de21.pdf

  65. [65]

Keith Weber. 2008. How mathematicians determine if an argument is a valid proof. Journal for Research in Mathematics Education 39, 4 (2008), 431–459.

  66. [66]

Keith Weber. 2015. Effective Proof Reading Strategies for Comprehending Mathematical Proofs. International Journal of Research in Undergraduate Mathematics Education 1, 3 (2015), 289–314. https://doi.org/10.1007/s40753-015-0011-0

  67. [67]

Keith Weber and Juan Pablo Mejia-Ramos. 2011. Why and how mathematicians read proofs: an exploratory study. Educational Studies in Mathematics 76, 3 (2011), 329–344. https://doi.org/10.1007/s10649-010-9292-z

  68. [68]

Erik Weitnauer, David Landy, and Erin Ottmar. 2016. Graspable Math: Towards Dynamic Algebra Notations that Support Learners Better than Paper. In 2016 Future Technologies Conference. 406–414.

  69. [69]

Jelle Wemmenhove, Dick Arends, Thijs Beurskens, Maitreyee Bhaid, Sean McCarren, Jan Moraal, Diego Rivera Garrido, David Tuin, Malcolm Vassallo, Pieter Wils, et al. 2022. Waterproof: educational software for learning how to write mathematical proofs. arXiv preprint arXiv:2211.13513 (2022).

  70. [70]

Carl E. Wieman, Wendy K. Adams, and Katherine K. Perkins. 2008. PhET: Simulations that enhance learning. Science 322, 5902 (2008), 682–683.

  71. [71]

Michelle Wu, Aadil Gopal, and Andrew Head. 2023. FFL: A Language and Live Runtime for Styling and Labeling Typeset Math Formulas. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

  72. [72]

Hyunkyoung Yoon, Jihye Hwang, Kyungwon Lee, Kyeong Hah Roh, and Oh Nam Kwon. 2024. Students’ use of generative artificial intelligence for proving mathematical statements. ZDM–Mathematics Education 56, 7 (2024), 1531–1551.

  73. [73]

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2022. MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. In International Conference on Learning Representations (ICLR).