Characterizing initial human-AI proof formalization workflows

Adrian Weller; Cameron E. Freer; Fabian Zaiser; Ilia Sucholutsky; Jacob Loader; Jeck Lim; Jonas Bayer; Jose Hernandez-Orallo; Joshua B. Tenenbaum; Katherine M. Collins

arxiv: 2606.04273 · v1 · pith:7THSSLQ6new · submitted 2026-06-02 · 💻 cs.AI

Characterizing initial human-AI proof formalization workflows

Katherine M. Collins , Simon Frieder , Jonas Bayer , Jacob Loader , Jeck Lim , Peiyang Song , Fabian Zaiser , Lexin Zhou

show 9 more authors

Shanda Li Sam Looi Joshua B. Tenenbaum Umang Bhatt Adrian Weller Jose Hernandez-Orallo Cameron E. Freer Valerie Chen Ilia Sucholutsky

This is my paper

Pith reviewed 2026-06-28 09:28 UTC · model grok-4.3

classification 💻 cs.AI

keywords human-AI collaborationproof formalizationautoformalizationuser studymathematical proofsAI assistanceworkflow analysisformal verification

0 comments

The pith

Participants achieve higher accuracy formalizing math proofs with AI tool access than without it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how people first incorporate AI into writing formal proofs from informal mathematical statements. A survey finds varied preferences but a shared wish for AI support that leaves humans in charge of the overall proof strategy. A controlled user study then tests actual behavior: across problems of different difficulty and domains, people reached better formalization accuracy when AI tools were available, and most switched flexibly among several tools. The work maps the early human-AI interplay in proof workflows under current tool limits.

Core claim

A qualitative survey shows diverse user preferences centered on AI assistance that preserves high-level human control over proof discovery. In the accompanying controlled user study, participants formalizing informal math problems and proofs attained higher accuracy with access to AI tools than when working alone, even though the tools had known autoformalization limits at the time; most participants chose to use multiple different AI tools in flexible combinations.

What carries the argument

The mixed-methods design that pairs a preference survey with a controlled user study comparing AI-assisted versus unaided formalization accuracy across mathematical domains and difficulty levels.

If this is right

AI tools can raise formalization accuracy even before they reach full autoformalization capability.
Users adapt workflows by combining several AI tools rather than relying on one.
Desired AI support keeps humans directing the high-level proof structure.
Early AI integration in formalization already shows close human-AI alternation rather than full replacement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interfaces that make switching among tools seamless may improve adoption more than single-tool optimization.
The accuracy benefit could compound as AI reasoning improves, altering how formal verification is taught or practiced.
Similar mixed human-AI patterns may appear in related tasks such as code verification or hardware specification.
Survey preferences for retained human control suggest design targets for future tools that avoid over-automation.

Load-bearing premise

The AI-assisted and unaided study conditions are comparable without hidden differences in task setup or participant effort, and self-reported tool choices match what participants actually did.

What would settle it

A follow-up experiment that logs actual tool usage and finds no accuracy gain (or a loss) in the AI-access condition, or that shows reported multi-tool usage does not match recorded behavior.

Figures

Figures reproduced from arXiv: 2606.04273 by Adrian Weller, Cameron E. Freer, Fabian Zaiser, Ilia Sucholutsky, Jacob Loader, Jeck Lim, Jonas Bayer, Jose Hernandez-Orallo, Joshua B. Tenenbaum, Katherine M. Collins, Lexin Zhou, Peiyang Song, Sam Looi, Shanda Li, Simon Frieder, Umang Bhatt, Valerie Chen.

**Figure 2.** Figure 2: Survey responses. Full set of coded survey responses. Each column represents one respondent. A cell is colored if the respondent mentioned that factor in their response. Labels above the columns indicate the self-reported AI use of the respondent (e.g., that they “Rarely” used AI in their work). At the bottom, respondents’ self-reported mathematics experience, formalization experience, and whether or not t… view at source ↗

**Figure 3.** Figure 3: Aggregate problem solving descriptive analyses, with and without tool access, and formal [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Self-reported AI tool use from qualitative survey. [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Survey results, decomposed by self-reported AI usage. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Survey results, decomposed by whether respondent indicated they had previously contributed to mathlib. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Survey results, decomposed by self-reported respondent general mathematics experience. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Survey results, decomposed by self-reported respondent formalization experience. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: Time taken per problem. Time to formalize each problem. Error bars depict standard error. We code tool use based on a mixture of participants’ self-reported tool use (see [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Formalization performance by Lean experience. a, [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Formalization performance by mathematics experience. a, [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

**Figure 12.** Figure 12: Accuracy per problem, for participants with and without tool use. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

**Figure 13.** Figure 13: Formalized proof length. a, Average code length (not including comments); b, Average code length (not including comments) per problem. text, one group of participants predominantly formalized the proofs themselves, only lightly engaging AI tools (often, GitHub Copilot in-line). For instance, two participants who could be classified as “human formalizers with AI assistance” each sparingly used GitHub Copil… view at source ↗

read the original abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper runs an early controlled user study showing accuracy gains from AI access during proof formalization and documents flexible multi-tool usage.

read the letter

The core observation is that participants produced more accurate formalizations when given AI tools than when working alone, and most switched between several tools rather than committing to one. The work pairs a qualitative survey on desired workflows with a within-subjects experiment that counterbalances conditions, scores outputs via Lean compilation plus human review, and logs actual tool calls instead of relying only on self-report.

The study design directly tackles the usual threats to a fair comparison, so the directional accuracy result looks credible on the evidence presented. The survey portion also gives a grounded picture of what users say they want—high-level control retained by the human—before showing how behavior deviates in practice.

The main limitations are the modest number of participants and the use of relatively simple problems with the tools available at the time; both are acknowledged and limit how far the numbers can be generalized. No load-bearing circularity or post-hoc selection issues appear in the reported methods.

This is the kind of paper that matters for groups building interactive formalization systems or studying human-AI math workflows. It supplies concrete usage data where most prior work only speculates. A serious referee should see it; the empirical base is thin but the execution is careful enough to be worth the time.

Referee Report

0 major / 3 minor

Summary. The manuscript describes a mixed-methods investigation into how humans use AI tools for formalizing mathematical proofs. It reports results from a qualitative survey on desired AI assistance features and a controlled within-subjects user study comparing formalization accuracy on informal math problems with and without access to AI tools. The central empirical claim is that participants achieve higher formalization accuracy when using AI tools, with most opting to use multiple tools in a flexible manner.

Significance. Should the findings prove robust, the work offers timely empirical insights into the nascent integration of AI into mathematical formalization practices. It underscores the preference for AI that supports rather than replaces human proof discovery and documents practical usage patterns despite current tool limitations. This contributes to the growing literature on human-AI collaboration in formal methods and can inform the design of more effective proof assistants.

minor comments (3)

[Abstract] Abstract: the claim of 'higher formalization accuracy' would benefit from a brief parenthetical note on how accuracy was operationalized (e.g., Lean compilation success plus human review) to allow readers to gauge the strength of the directional result without consulting the methods section.
[User study] User study section: while counterbalancing and logged usage are noted, a short table summarizing per-participant tool combinations and accuracy deltas would make the 'flexibly choosing multiple tools' observation more concrete and easier to verify.
[Discussion] Discussion: the limitations paragraph could explicitly address the within-subjects design's potential carry-over effects and the modest number of problems per domain, even if these do not undermine the main directional finding.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and significance assessment, as well as the recommendation for minor revision. We appreciate the recognition that the work provides timely empirical insights into human-AI collaboration for mathematical formalization.

Circularity Check

0 steps flagged

Empirical user study; no derivation chain or circular steps present

full rationale

The paper reports results from a qualitative survey and a controlled within-subjects user study measuring formalization accuracy with/without AI tools. Central claims rest on direct empirical observations (accuracy scores via Lean compilation and human review, logged tool usage) rather than any first-principles derivation, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes appear; the work is self-contained against external benchmarks of participant performance and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical human-subjects study with no mathematical derivations, fitted constants, or postulated entities; ledger is empty by design.

pith-pipeline@v0.9.1-grok · 5850 in / 919 out tokens · 15264 ms · 2026-06-28T09:28:07.658335+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

299 extracted references · 25 canonical work pages · 1 internal anchor

[1]

Bulletin of the American Mathematical Society , year =

Commelin, Johan and Topaz, Adam , title =. Bulletin of the American Mathematical Society , year =
[2]

arXiv preprint arXiv:2510.09801 , year=

How can we assess human-agent interactions? Case studies in software agent design , author=. arXiv preprint arXiv:2510.09801 , year=

Pith/arXiv arXiv
[3]

2507.23726 , archivePrefix=

Luoxin Chen and Jinming Gu and Liankai Huang and Wenhao Huang and Zhicheng Jiang and Allan Jie and Xiaoran Jin and Xing Jin and Chenggang Li and Kaijing Ma and Cheng Ren and Jiawei Shen and Wenlei Shi and Tong Sun and He Sun and Jiahui Wang and Siran Wang and Zhihong Wang and Chenrui Wei and Shufa Wei and Yonghui Wu and Yuchen Wu and Yihang Xia and Huajia...

arXiv
[4]

Current Directions in Psychological Science , pages=

Meaningful Long-Term Thought Partnerships of Minds and Machines , author=. Current Directions in Psychological Science , pages=
[5]

2026 , eprint=

On Talagrand's Convexity Conjecture , author=. 2026 , eprint=

2026
[6]

For the learning of mathematics , volume=

On proof and progress in mathematics , author=. For the learning of mathematics , volume=. 1995 , publisher=

1995
[7]

2025 , month = July, note =

Advanced version of. 2025 , month = July, note =

2025
[8]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. 2110.14168 , archivePrefix =

Pith/arXiv arXiv
[9]

2025 , month = feb, howpublished =

Claude 3.7. 2025 , month = feb, howpublished =

2025
[10]

2025 , month = may, howpublished =

Introducing. 2025 , month = may, howpublished =

2025
[12]

Communications of the ACM , volume=

Formal Reasoning Meets LLMs: Toward AI for Mathematics and Verification , author=. Communications of the ACM , volume=. 2026 , publisher=

2026
[13]

Proceedings of the National Academy of Sciences , volume=

Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

2020
[14]

Documenting Deployment with Fabric: A Repository of Real-World

Jorgensen, Mackenzie and Brogle, Kendall and Collins, Katherine M and Ibrahim, Lujain and Shah, Arina and Ivanovic, Petra and Broestl, Noah and Piles, Gabriel and Dongha, Paul and Abdulhussein, Hatim and others , journal=. Documenting Deployment with Fabric: A Repository of Real-World
[15]

Yang, Kaiyu and Swope, Aidan and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan J and Anandkumar, Animashree , booktitle =
[16]

Z. Z. Ren and Zhihong Shao and Junxiao Song and Huajian Xin and Haocheng Wang and Wanjia Zhao and Liyue Zhang and Zhe Fu and Qihao Zhu and Dejian Yang and Z. F. Wu and Zhibin Gou and Shirong Ma and Hongxuan Tang and Yuxuan Liu and Wenjun Gao and Daya Guo and Chong Ruan , year=. 2504.21801 , archivePrefix=

Pith/arXiv arXiv
[17]

arXiv preprint arXiv:2605.06651 , year=

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI , author=. arXiv preprint arXiv:2605.06651 , year=

Pith/arXiv arXiv
[18]

Interactive Theorem Proving-Modelling the User in the Proof Process , author=
[19]

Design, Specification and Verification of Interactive Systems’ 96: Proceedings of the Eurographics Workshop in Namur, Belgium, June 5--7, 1996 , pages=

Evaluating the interfaces of three theorem proving assistants , author=. Design, Specification and Verification of Interactive Systems’ 96: Proceedings of the Eurographics Workshop in Namur, Belgium, June 5--7, 1996 , pages=. 1996 , organization=

1996
[20]

Aitken and P

J.S. Aitken and P. Gray and T. Melham and M. Thomas , abstract =. Interactive Theorem Proving: An Empirical Study of User Activity , journal =. 1998 , issn =. doi:https://doi.org/10.1006/jsco.1997.0175 , url =

work page doi:10.1006/jsco.1997.0175 1998
[21]

Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

Are NLP models really able to solve simple math word problems? , author=. Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

2021
[22]

Nature , pages=

Olympiad-level formal mathematical reasoning with reinforcement learning , author=. Nature , pages=. 2025 , publisher=

2025
[23]

Nature , volume=

Mathematical discoveries from program search with large language models , author=. Nature , volume=. 2024 , publisher=

2024
[24]

arXiv preprint arXiv:2503.11061 , year=

Generative modeling for mathematical discovery , author=. arXiv preprint arXiv:2503.11061 , year=

arXiv
[25]

Lessons for interactive theorem proving researchers from a survey of

de Almeida Borges, Ana and Casanueva Art. Lessons for interactive theorem proving researchers from a survey of. Journal of Automated Reasoning , volume=. 2025 , publisher=

2025
[26]

Proceedings of the 2nd International Workshop on User Interface Design for Theorem Proving Systems , pages=

Two modelling approaches applied to user interfaces to theorem proving assistants , author=. Proceedings of the 2nd International Workshop on User Interface Design for Theorem Proving Systems , pages=
[27]

12th Workshop of the Psychology of Programming Interest Group (PPIG) , pages=

A Cognitive Dimensions view of the differences between designers and users of theorem proving assistants , author=. 12th Workshop of the Psychology of Programming Interest Group (PPIG) , pages=
[28]

Communications Psychology , volume=

Effortful leisure is a source of meaning in everyday life , author=. Communications Psychology , volume=. 2025 , publisher=

2025
[29]

Trends in cognitive sciences , volume=

The effort paradox: Effort is both costly and valued , author=. Trends in cognitive sciences , volume=. 2018 , publisher=

2018
[30]

Calibrating Histopathology Image Classifiers using Label Smoothing , publisher =

Wei, Jerry and Torresani, Lorenzo and Wei, Jason and Hassanpour, Saeed , keywords =. Calibrating Histopathology Image Classifiers using Label Smoothing , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2201.11866 , url =

work page doi:10.48550/arxiv.2201.11866 2022
[31]

International Statistical Review , volume =

Fortuin, Vincent , title =. International Statistical Review , volume =. doi:https://doi.org/10.1111/insr.12502 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/insr.12502 , abstract =

work page doi:10.1111/insr.12502
[32]

2022 , copyright =

Welleck, Sean and Liu, Jiacheng and Lu, Ximing and Hajishirzi, Hannaneh and Choi, Yejin , keywords =. 2022 , copyright =. doi:10.48550/ARXIV.2205.12910 , url =

work page doi:10.48550/arxiv.2205.12910 2022
[33]

arXiv preprint arXiv:2403.04814 , year=

Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks , author=. arXiv preprint arXiv:2403.04814 , year=

arXiv
[34]

Advances in Neural Information Processing Systems , volume=

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation , author=. Advances in Neural Information Processing Systems , volume=
[35]

International conference on machine learning , pages=

Do imagenet classifiers generalize to imagenet? , author=. International conference on machine learning , pages=. 2019 , organization=

2019
[36]

arXiv preprint arXiv:2403.04132 , year=

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference , author=. arXiv preprint arXiv:2403.04132 , year=

Pith/arXiv arXiv
[37]

Simulating Iterative Human-

Mozannar, Hussein and Chen, Valerie and Wei, Dennis and Sattigeri, Prasanna and Nagireddy, Manish and Das, Subhro and Talwalkar, Ameet and Sontag, David , booktitle=. Simulating Iterative Human-
[38]

IEEE Transactions on Software Engineering , year=

MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation , author=. IEEE Transactions on Software Engineering , year=
[39]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

2023
[40]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Large language models meet NL2Code: A survey , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[41]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Aligning Offline Metrics and Human Judgments of Value for Code Generation Models , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

2023
[42]

arXiv preprint arXiv:2303.03004 , year=

xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval , author=. arXiv preprint arXiv:2303.03004 , year=

arXiv
[43]

Vera and Wortman Vaughan, Jennifer , title =

Vasconcelos, Helena and Bansal, Gagan and Fourney, Adam and Liao, Q. Vera and Wortman Vaughan, Jennifer , title =. ACM Trans. Comput.-Hum. Interact. , month = apr, articleno =. 2025 , issue_date =. doi:10.1145/3702320 , abstract =

work page doi:10.1145/3702320 2025
[44]

, year =

Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K. , year =. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence , url =. 2206.08474 , archivePrefix=

arXiv
[45]

arXiv preprint arXiv:2311.08588 , year=

Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation , author=. arXiv preprint arXiv:2311.08588 , year=

arXiv
[46]

Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and Alexey Svyatkovskiy and Ambrosio Blanco and Colin Clement and Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and MING GONG and Ming Zhou and Nan Duan and Neel Sundaresan and Shao Kun Deng and Shengyu Fu and Shujie LIU , booktitle=. ...

2021
[47]

Queue , volume=

Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools , author=. Queue , volume=. 2022 , publisher=

2022
[48]

IEEE Transactions on Software Engineering , volume=

Automatically assessing code understandability , author=. IEEE Transactions on Software Engineering , volume=. 2019 , publisher=

2019
[49]

Transactions on Machine Learning Research , issn=

Evaluating Human-Language Model Interaction , author=. Transactions on Machine Learning Research , issn=. 2023 , url=

2023
[50]

arXiv preprint arXiv:2306.01694 , year=

Evaluating language models for mathematics through interactions , author=. arXiv preprint arXiv:2306.01694 , year=

arXiv
[51]

Human-AI pAIr Programming , author=

Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming , author=. arXiv preprint arXiv:2306.05153 , year=

arXiv
[52]

arXiv preprint arXiv:2309.03914 , year=

DevGPT: Studying Developer-ChatGPT Conversations , author=. arXiv preprint arXiv:2309.03914 , year=

arXiv
[53]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

ReCode: Robustness Evaluation of Code Generation Models , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[54]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Python Code Generation by Asking Clarification Questions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[55]

The impact of

Peng, Sida and Kalliamvakou, Eirini and Cihon, Peter and Demirer, Mert , journal=. The impact of
[56]

arXiv preprint arXiv:2303.08774 , year=

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

Pith/arXiv arXiv
[57]

arXiv preprint arXiv:2410.21276 , year =

GPT-4o System Card , author =. arXiv preprint arXiv:2410.21276 , year =

Pith/arXiv arXiv
[58]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Language Models of Code are Few-Shot Commonsense Learners , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022
[59]

Gonzalez and Ion Stoica , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=

2023
[60]

The Twelfth International Conference on Learning Representations , year=

OctoPack: Instruction Tuning Code Large Language Models , author=. The Twelfth International Conference on Learning Representations , year=
[61]

ICLR , year=

SWE-bench: Can Language Models Resolve Real-world Github Issues? , author=. ICLR , year=
[62]

Advances in Neural Information Processing Systems , volume=

Large language models of code fail at completing code with potential bugs , author=. Advances in Neural Information Processing Systems , volume=
[63]

arXiv preprint arXiv:2401.03065 , year=

Cruxeval: A benchmark for code reasoning, understanding and execution , author=. arXiv preprint arXiv:2401.03065 , year=

Pith/arXiv arXiv
[64]

arXiv preprint arXiv:2308.10620 , year=

Large language models for software engineering: A systematic literature review , author=. arXiv preprint arXiv:2308.10620 , year=

arXiv
[65]

ChatGPT: Introducing ChatGPT , author=
[66]

Available at SSRN 4375283 , year=

Experimental evidence on the productivity effects of generative artificial intelligence , author=. Available at SSRN 4375283 , year=
[67]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[68]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Contrastive Code Representation Learning , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021
[69]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

CodeBERT: A Pre-Trained Model for Programming and Natural Languages , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020
[70]

arXiv preprint arXiv:2303.10130 , year=

Gpts are gpts: An early look at the labor market impact potential of large language models , author=. arXiv preprint arXiv:2303.10130 , year=

arXiv
[71]

arXiv preprint arXiv:2301.08653 , year=

An analysis of the automatic bug fixing performance of chatgpt , author=. arXiv preprint arXiv:2301.08653 , year=

arXiv
[72]

arXiv preprint arXiv:2311.05584 , year=

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations , author=. arXiv preprint arXiv:2311.05584 , year=

arXiv
[73]

2023 , institution=

Generative AI at work , author=. 2023 , institution=

2023
[74]

International archives of occupational and environmental health , volume=

A systematic review on the effect of work-related stressors on mental health of young workers , author=. International archives of occupational and environmental health , volume=. 2020 , publisher=

2020
[75]

SAGE Open , volume=

Examining the Impact of Technology Overload at the Workplace: A Systematic Review , author=. SAGE Open , volume=. 2022 , publisher=

2022
[76]

International Journal of Environmental Research and Public Health , volume=

Technostress dark side of technology in the workplace: A scientometric analysis , author=. International Journal of Environmental Research and Public Health , volume=. 2020 , publisher=

2020
[77]

IEEE MultiMedia , volume=

Automating the recognition of stress and emotion: From lab to real-world impact , author=. IEEE MultiMedia , volume=. 2016 , publisher=

2016
[78]

Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems , pages=

Cogcam: Contact-free measurement of cognitive stress during computer tasks with a digital camera , author=. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems , pages=

2016
[79]

2014 36th annual international conference of the IEEE engineering in medicine and biology society , pages=

Remote measurement of cognitive stress via heart rate variability , author=. 2014 36th annual international conference of the IEEE engineering in medicine and biology society , pages=. 2014 , organization=

2014
[80]

IEEE Transactions on intelligent transportation systems , volume=

Detecting stress during real-world driving tasks using physiological sensors , author=. IEEE Transactions on intelligent transportation systems , volume=. 2005 , publisher=

2005
[81]

Engineering Reports , volume=

Text-based emotion detection: Advances, challenges, and opportunities , author=. Engineering Reports , volume=. 2020 , publisher=

2020

Showing first 80 references.

[1] [1]

Bulletin of the American Mathematical Society , year =

Commelin, Johan and Topaz, Adam , title =. Bulletin of the American Mathematical Society , year =

[2] [2]

arXiv preprint arXiv:2510.09801 , year=

How can we assess human-agent interactions? Case studies in software agent design , author=. arXiv preprint arXiv:2510.09801 , year=

Pith/arXiv arXiv

[3] [3]

2507.23726 , archivePrefix=

Luoxin Chen and Jinming Gu and Liankai Huang and Wenhao Huang and Zhicheng Jiang and Allan Jie and Xiaoran Jin and Xing Jin and Chenggang Li and Kaijing Ma and Cheng Ren and Jiawei Shen and Wenlei Shi and Tong Sun and He Sun and Jiahui Wang and Siran Wang and Zhihong Wang and Chenrui Wei and Shufa Wei and Yonghui Wu and Yuchen Wu and Yihang Xia and Huajia...

arXiv

[4] [4]

Current Directions in Psychological Science , pages=

Meaningful Long-Term Thought Partnerships of Minds and Machines , author=. Current Directions in Psychological Science , pages=

[5] [5]

2026 , eprint=

On Talagrand's Convexity Conjecture , author=. 2026 , eprint=

2026

[6] [6]

For the learning of mathematics , volume=

On proof and progress in mathematics , author=. For the learning of mathematics , volume=. 1995 , publisher=

1995

[7] [7]

2025 , month = July, note =

Advanced version of. 2025 , month = July, note =

2025

[8] [8]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. 2110.14168 , archivePrefix =

Pith/arXiv arXiv

[9] [9]

2025 , month = feb, howpublished =

Claude 3.7. 2025 , month = feb, howpublished =

2025

[10] [10]

2025 , month = may, howpublished =

Introducing. 2025 , month = may, howpublished =

2025

[11] [12]

Communications of the ACM , volume=

Formal Reasoning Meets LLMs: Toward AI for Mathematics and Verification , author=. Communications of the ACM , volume=. 2026 , publisher=

2026

[12] [13]

Proceedings of the National Academy of Sciences , volume=

Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

2020

[13] [14]

Documenting Deployment with Fabric: A Repository of Real-World

Jorgensen, Mackenzie and Brogle, Kendall and Collins, Katherine M and Ibrahim, Lujain and Shah, Arina and Ivanovic, Petra and Broestl, Noah and Piles, Gabriel and Dongha, Paul and Abdulhussein, Hatim and others , journal=. Documenting Deployment with Fabric: A Repository of Real-World

[14] [15]

Yang, Kaiyu and Swope, Aidan and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan J and Anandkumar, Animashree , booktitle =

[15] [16]

Z. Z. Ren and Zhihong Shao and Junxiao Song and Huajian Xin and Haocheng Wang and Wanjia Zhao and Liyue Zhang and Zhe Fu and Qihao Zhu and Dejian Yang and Z. F. Wu and Zhibin Gou and Shirong Ma and Hongxuan Tang and Yuxuan Liu and Wenjun Gao and Daya Guo and Chong Ruan , year=. 2504.21801 , archivePrefix=

Pith/arXiv arXiv

[16] [17]

arXiv preprint arXiv:2605.06651 , year=

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI , author=. arXiv preprint arXiv:2605.06651 , year=

Pith/arXiv arXiv

[17] [18]

Interactive Theorem Proving-Modelling the User in the Proof Process , author=

[18] [19]

Design, Specification and Verification of Interactive Systems’ 96: Proceedings of the Eurographics Workshop in Namur, Belgium, June 5--7, 1996 , pages=

Evaluating the interfaces of three theorem proving assistants , author=. Design, Specification and Verification of Interactive Systems’ 96: Proceedings of the Eurographics Workshop in Namur, Belgium, June 5--7, 1996 , pages=. 1996 , organization=

1996

[19] [20]

Aitken and P

J.S. Aitken and P. Gray and T. Melham and M. Thomas , abstract =. Interactive Theorem Proving: An Empirical Study of User Activity , journal =. 1998 , issn =. doi:https://doi.org/10.1006/jsco.1997.0175 , url =

work page doi:10.1006/jsco.1997.0175 1998

[20] [21]

Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

Are NLP models really able to solve simple math word problems? , author=. Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

2021

[21] [22]

Nature , pages=

Olympiad-level formal mathematical reasoning with reinforcement learning , author=. Nature , pages=. 2025 , publisher=

2025

[22] [23]

Nature , volume=

Mathematical discoveries from program search with large language models , author=. Nature , volume=. 2024 , publisher=

2024

[23] [24]

arXiv preprint arXiv:2503.11061 , year=

Generative modeling for mathematical discovery , author=. arXiv preprint arXiv:2503.11061 , year=

arXiv

[24] [25]

Lessons for interactive theorem proving researchers from a survey of

de Almeida Borges, Ana and Casanueva Art. Lessons for interactive theorem proving researchers from a survey of. Journal of Automated Reasoning , volume=. 2025 , publisher=

2025

[25] [26]

Proceedings of the 2nd International Workshop on User Interface Design for Theorem Proving Systems , pages=

Two modelling approaches applied to user interfaces to theorem proving assistants , author=. Proceedings of the 2nd International Workshop on User Interface Design for Theorem Proving Systems , pages=

[26] [27]

12th Workshop of the Psychology of Programming Interest Group (PPIG) , pages=

A Cognitive Dimensions view of the differences between designers and users of theorem proving assistants , author=. 12th Workshop of the Psychology of Programming Interest Group (PPIG) , pages=

[27] [28]

Communications Psychology , volume=

Effortful leisure is a source of meaning in everyday life , author=. Communications Psychology , volume=. 2025 , publisher=

2025

[28] [29]

Trends in cognitive sciences , volume=

The effort paradox: Effort is both costly and valued , author=. Trends in cognitive sciences , volume=. 2018 , publisher=

2018

[29] [30]

Calibrating Histopathology Image Classifiers using Label Smoothing , publisher =

Wei, Jerry and Torresani, Lorenzo and Wei, Jason and Hassanpour, Saeed , keywords =. Calibrating Histopathology Image Classifiers using Label Smoothing , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2201.11866 , url =

work page doi:10.48550/arxiv.2201.11866 2022

[30] [31]

International Statistical Review , volume =

Fortuin, Vincent , title =. International Statistical Review , volume =. doi:https://doi.org/10.1111/insr.12502 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/insr.12502 , abstract =

work page doi:10.1111/insr.12502

[31] [32]

2022 , copyright =

Welleck, Sean and Liu, Jiacheng and Lu, Ximing and Hajishirzi, Hannaneh and Choi, Yejin , keywords =. 2022 , copyright =. doi:10.48550/ARXIV.2205.12910 , url =

work page doi:10.48550/arxiv.2205.12910 2022

[32] [33]

arXiv preprint arXiv:2403.04814 , year=

Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks , author=. arXiv preprint arXiv:2403.04814 , year=

arXiv

[33] [34]

Advances in Neural Information Processing Systems , volume=

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation , author=. Advances in Neural Information Processing Systems , volume=

[34] [35]

International conference on machine learning , pages=

Do imagenet classifiers generalize to imagenet? , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[35] [36]

arXiv preprint arXiv:2403.04132 , year=

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference , author=. arXiv preprint arXiv:2403.04132 , year=

Pith/arXiv arXiv

[36] [37]

Simulating Iterative Human-

Mozannar, Hussein and Chen, Valerie and Wei, Dennis and Sattigeri, Prasanna and Nagireddy, Manish and Das, Subhro and Talwalkar, Ameet and Sontag, David , booktitle=. Simulating Iterative Human-

[37] [38]

IEEE Transactions on Software Engineering , year=

MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation , author=. IEEE Transactions on Software Engineering , year=

[38] [39]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

2023

[39] [40]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Large language models meet NL2Code: A survey , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[40] [41]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Aligning Offline Metrics and Human Judgments of Value for Code Generation Models , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

2023

[41] [42]

arXiv preprint arXiv:2303.03004 , year=

xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval , author=. arXiv preprint arXiv:2303.03004 , year=

arXiv

[42] [43]

Vera and Wortman Vaughan, Jennifer , title =

Vasconcelos, Helena and Bansal, Gagan and Fourney, Adam and Liao, Q. Vera and Wortman Vaughan, Jennifer , title =. ACM Trans. Comput.-Hum. Interact. , month = apr, articleno =. 2025 , issue_date =. doi:10.1145/3702320 , abstract =

work page doi:10.1145/3702320 2025

[43] [44]

, year =

Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K. , year =. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence , url =. 2206.08474 , archivePrefix=

arXiv

[44] [45]

arXiv preprint arXiv:2311.08588 , year=

Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation , author=. arXiv preprint arXiv:2311.08588 , year=

arXiv

[45] [46]

Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and Alexey Svyatkovskiy and Ambrosio Blanco and Colin Clement and Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and MING GONG and Ming Zhou and Nan Duan and Neel Sundaresan and Shao Kun Deng and Shengyu Fu and Shujie LIU , booktitle=. ...

2021

[46] [47]

Queue , volume=

Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools , author=. Queue , volume=. 2022 , publisher=

2022

[47] [48]

IEEE Transactions on Software Engineering , volume=

Automatically assessing code understandability , author=. IEEE Transactions on Software Engineering , volume=. 2019 , publisher=

2019

[48] [49]

Transactions on Machine Learning Research , issn=

Evaluating Human-Language Model Interaction , author=. Transactions on Machine Learning Research , issn=. 2023 , url=

2023

[49] [50]

arXiv preprint arXiv:2306.01694 , year=

Evaluating language models for mathematics through interactions , author=. arXiv preprint arXiv:2306.01694 , year=

arXiv

[50] [51]

Human-AI pAIr Programming , author=

Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming , author=. arXiv preprint arXiv:2306.05153 , year=

arXiv

[51] [52]

arXiv preprint arXiv:2309.03914 , year=

DevGPT: Studying Developer-ChatGPT Conversations , author=. arXiv preprint arXiv:2309.03914 , year=

arXiv

[52] [53]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

ReCode: Robustness Evaluation of Code Generation Models , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[53] [54]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Python Code Generation by Asking Clarification Questions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[54] [55]

The impact of

Peng, Sida and Kalliamvakou, Eirini and Cihon, Peter and Demirer, Mert , journal=. The impact of

[55] [56]

arXiv preprint arXiv:2303.08774 , year=

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

Pith/arXiv arXiv

[56] [57]

arXiv preprint arXiv:2410.21276 , year =

GPT-4o System Card , author =. arXiv preprint arXiv:2410.21276 , year =

Pith/arXiv arXiv

[57] [58]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Language Models of Code are Few-Shot Commonsense Learners , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022

[58] [59]

Gonzalez and Ion Stoica , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=

2023

[59] [60]

The Twelfth International Conference on Learning Representations , year=

OctoPack: Instruction Tuning Code Large Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[60] [61]

ICLR , year=

SWE-bench: Can Language Models Resolve Real-world Github Issues? , author=. ICLR , year=

[61] [62]

Advances in Neural Information Processing Systems , volume=

Large language models of code fail at completing code with potential bugs , author=. Advances in Neural Information Processing Systems , volume=

[62] [63]

arXiv preprint arXiv:2401.03065 , year=

Cruxeval: A benchmark for code reasoning, understanding and execution , author=. arXiv preprint arXiv:2401.03065 , year=

Pith/arXiv arXiv

[63] [64]

arXiv preprint arXiv:2308.10620 , year=

Large language models for software engineering: A systematic literature review , author=. arXiv preprint arXiv:2308.10620 , year=

arXiv

[64] [65]

ChatGPT: Introducing ChatGPT , author=

[65] [66]

Available at SSRN 4375283 , year=

Experimental evidence on the productivity effects of generative artificial intelligence , author=. Available at SSRN 4375283 , year=

[66] [67]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[67] [68]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Contrastive Code Representation Learning , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021

[68] [69]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

CodeBERT: A Pre-Trained Model for Programming and Natural Languages , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020

[69] [70]

arXiv preprint arXiv:2303.10130 , year=

Gpts are gpts: An early look at the labor market impact potential of large language models , author=. arXiv preprint arXiv:2303.10130 , year=

arXiv

[70] [71]

arXiv preprint arXiv:2301.08653 , year=

An analysis of the automatic bug fixing performance of chatgpt , author=. arXiv preprint arXiv:2301.08653 , year=

arXiv

[71] [72]

arXiv preprint arXiv:2311.05584 , year=

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations , author=. arXiv preprint arXiv:2311.05584 , year=

arXiv

[72] [73]

2023 , institution=

Generative AI at work , author=. 2023 , institution=

2023

[73] [74]

International archives of occupational and environmental health , volume=

A systematic review on the effect of work-related stressors on mental health of young workers , author=. International archives of occupational and environmental health , volume=. 2020 , publisher=

2020

[74] [75]

SAGE Open , volume=

Examining the Impact of Technology Overload at the Workplace: A Systematic Review , author=. SAGE Open , volume=. 2022 , publisher=

2022

[75] [76]

International Journal of Environmental Research and Public Health , volume=

Technostress dark side of technology in the workplace: A scientometric analysis , author=. International Journal of Environmental Research and Public Health , volume=. 2020 , publisher=

2020

[76] [77]

IEEE MultiMedia , volume=

Automating the recognition of stress and emotion: From lab to real-world impact , author=. IEEE MultiMedia , volume=. 2016 , publisher=

2016

[77] [78]

Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems , pages=

Cogcam: Contact-free measurement of cognitive stress during computer tasks with a digital camera , author=. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems , pages=

2016

[78] [79]

2014 36th annual international conference of the IEEE engineering in medicine and biology society , pages=

Remote measurement of cognitive stress via heart rate variability , author=. 2014 36th annual international conference of the IEEE engineering in medicine and biology society , pages=. 2014 , organization=

2014

[79] [80]

IEEE Transactions on intelligent transportation systems , volume=

Detecting stress during real-world driving tasks using physiological sensors , author=. IEEE Transactions on intelligent transportation systems , volume=. 2005 , publisher=

2005

[80] [81]

Engineering Reports , volume=

Text-based emotion detection: Advances, challenges, and opportunities , author=. Engineering Reports , volume=. 2020 , publisher=

2020