
arxiv: 2605.24228 · v1 · pith:6XGNQWCE · submitted 2026-05-22 · 💻 cs.HC

Sketch Bug: Using Sketch-Based Input for Interactive Code Debugging

Pith reviewed 2026-06-30 14:30 UTC · model grok-4.3

classification 💻 cs.HC
keywords: sketch-based input, interactive debugging, pen input, gesture recognition, execution control, programmer study, Python debugging

The pith

Sketch-like pen input supports execution control tasks in debugging but introduces challenges in precision, recognition, and gesture recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates sketch-based pen input as an alternative to mouse and keyboard for controlling program execution during debugging. In the prototype, programmers draw marks to set breakpoints, use symbolic strokes to step through code, and extend strokes into spirals to repeat actions. Gesture recognition is integrated with Python execution tracing. In a controlled study, 24 programmers compared the sketch interface with conventional mouse-and-keyboard input on tasks involving breakpoint placement, step-wise execution, and runtime state inspection. Sketch input supported these tasks. It also raised problems with input precision, gesture recognition accuracy, and recalling the gestures. The approach appears best suited to debugging interactions that benefit from spatial positioning or continuous movement, rather than as a replacement for all conventional controls.

Core claim

The results show that sketch-like input can support these execution-control tasks, while also introducing challenges in precision, recognition, and gesture recall. Our findings suggest that pen input is most promising where debugger interactions benefit from spatial grounding or continuous movement, rather than as a wholesale replacement for conventional debugging controls.

What carries the argument

A sketch interface that combines gesture recognition with Python execution tracing in an editor. Lightweight marks set breakpoints, symbolic strokes control execution, and strokes extended into spirals repeat traversal actions.
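The paper does not publish its tracing code. The sketch below assumes a record-then-replay design, which is our assumption and not necessarily how the prototype works. Under that design, Python's standard `sys.settrace` hook records each executed line of a function together with a snapshot of its locals. A drawn breakpoint mark then only needs to supply a line number, and "continue" and "step" become cursor moves over the recorded trace. `TraceEvent`, `record_trace`, and `TraceCursor` are hypothetical names, and line numbers are counted relative to the function's `def` line.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    """One recorded 'line' event: the line about to run and a copy of the frame's locals."""
    lineno: int  # relative to the function's `def` line (hypothetical convention)
    local_vars: dict = field(default_factory=dict)


def record_trace(fn, *args):
    """Run fn(*args) under sys.settrace, recording every line event inside fn itself."""
    target = fn.__code__
    events = []

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # skip frames of other functions (e.g. callees)
        if event == "line":
            rel = frame.f_lineno - target.co_firstlineno
            events.append(TraceEvent(rel, dict(frame.f_locals)))
        return tracer  # keep receiving line events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(previous)
    return events, result


class TraceCursor:
    """Replays a recorded trace: step() moves one event, cont() runs to the next breakpoint."""

    def __init__(self, events, breakpoints=()):
        self.events = events
        self.breakpoints = set(breakpoints)  # relative line numbers, e.g. from drawn marks
        self.pos = -1  # before the first recorded event

    def step(self):
        """Advance one line event (stays on the last event at the end of the trace)."""
        if self.pos + 1 < len(self.events):
            self.pos += 1
        return self.events[self.pos] if self.pos >= 0 else None

    def cont(self):
        """Advance to the next event on a breakpoint line; return None if the run ends first."""
        for i in range(self.pos + 1, len(self.events)):
            if self.events[i].lineno in self.breakpoints:
                self.pos = i
                return self.events[i]
        self.pos = len(self.events) - 1
        return None
```

For example, with a breakpoint on a loop body, each `cont()` call lands on one recorded iteration. The stored locals are what a state-inspection task would read.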

If this is right

  • Sketch input enables programmers to set breakpoints and control execution steps via drawing.
  • Pen-based methods are viable for spatial or continuous debugging actions.
  • Precision, recognition accuracy, and gesture recall remain key hurdles to overcome.
  • Conventional mouse-and-keyboard controls remain necessary for many debugging interactions; pen input complements rather than replaces them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deploying the prototype in actual development environments could test its utility in complex, real projects.
  • Gesture sets might be standardized across tools to improve recall.
  • Combining sketch input with other modalities could address precision issues.
  • Similar techniques could extend to debugging in visual programming languages.

Load-bearing premise

The specific debugging tasks and the prototype used in the controlled study are representative of everyday debugging practice.

What would settle it

Observing that professional developers using the sketch interface in their daily work show no measurable improvement in debugging efficiency or preference over standard interfaces.

Figures

Figures reproduced from arXiv: 2605.24228 by Daniel Vogel, Helen Weixu Chen.

Figure 1: Sketch-like pen gestures: (a) a programmer draws a continue symbol on the canvas; (b) after a 300ms dwell, they add …
Figure 2: Session control interactions: (a) set and remove …
Figure 3: Execution flow strokes: (a) step into with ‘L’ shape; …
Figure 4: Repeating spiral: draw an execution stroke, pause …
Figure 5: Simulated VS Code interactive debugging user interface: (a) debug information panels; (b) code editor with overlaid …
Figure 6: Standard debug control buttons for mouse and key…
Figure 7: Actions by interface technique. From the accompanying text: "… repeating spiral, can compress multiple repeated commands into a single stroke, this measure should be interpreted as the number of discrete interaction units rather than as a direct measure of effort. Participants produced fewer such interaction units with sketch than with wimp (see …"
Original abstract

We investigate sketch-like pen input as an alternative way to support execution control in interactive debugging. In our interface, programmers draw lightweight marks to set breakpoints, use symbolic strokes to control execution, and extend strokes into spirals to repeat traversal actions. The prototype combines gesture recognition with Python execution tracing in a conventional editor interface. In a controlled study with 24 programmers, we compared the sketch interface with conventional mouse-and-keyboard input on debugging tasks that required breakpoint placement, step-wise execution, and runtime state inspection. The results show that sketch-like input can support these execution-control tasks, while also introducing challenges in precision, recognition, and gesture recall. Our findings suggest that pen input is most promising where debugger interactions benefit from spatial grounding or continuous movement, rather than as a wholesale replacement for conventional debugging controls.
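The abstract's repeating spiral and two figure captions suggest a simple stroke-expansion pipeline. Figure 1 mentions a 300ms dwell, and Figure 4 says to draw an execution stroke, pause, then continue into a spiral. The paper's recognizer and thresholds other than 300 ms are not given here, so this is a minimal sketch under stated assumptions:

- A dwell is the pen staying within a hypothetical 5 px radius for at least 300 ms.
- The spiral's turn count is the unwrapped angle swept around the centroid of the post-dwell samples, divided by 2π.
- Each full turn adds one repetition of the recognized command.

`split_at_dwell`, `spiral_turns`, `expand_stroke`, and the `COMMANDS` table are hypothetical. Only the 'L' to step-into mapping comes from Figure 3.

```python
import math

# Hypothetical table from recognized stroke labels to debugger commands.
# Only "L" -> step into is taken from Figure 3; the "continue" label name is assumed.
COMMANDS = {"L": "step_into", "continue": "continue"}


def split_at_dwell(samples, dwell_ms=300, radius=5.0):
    """Split (x, y, t_ms) samples at the first pause where the pen stays within
    `radius` px (hypothetical threshold) for at least `dwell_ms`.
    Returns (stroke_before_pause, samples_after_pause); the second part is empty if no pause."""
    start = 0  # index where the current stationary window began
    for i in range(1, len(samples)):
        x0, y0, t0 = samples[start]
        x, y, t = samples[i]
        if math.hypot(x - x0, y - y0) > radius:
            start = i  # pen moved: restart the stationary window here
        elif t - t0 >= dwell_ms:
            return samples[:start + 1], samples[i + 1:]
    return samples, []


def spiral_turns(points):
    """Count complete turns around the centroid by summing unwrapped angle changes."""
    if len(points) < 3:
        return 0
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    total, prev = 0.0, None
    for x, y in points:
        angle = math.atan2(y - cy, x - cx)
        if prev is not None:
            d = angle - prev
            # keep each step in (-pi, pi] so crossing the +-pi boundary is not a jump
            if d > math.pi:
                d -= 2 * math.pi
            elif d <= -math.pi:
                d += 2 * math.pi
            total += d
        prev = angle
    return int(abs(total) // (2 * math.pi))


def expand_stroke(label, samples):
    """Turn one recognized stroke (label from a recognizer not shown here) into commands:
    the stroke issues its command once, and each full spiral turn after a dwell repeats it."""
    _, tail = split_at_dwell(samples)
    turns = spiral_turns([(x, y) for x, y, _ in tail])
    return [COMMANDS[label]] * (1 + turns)
```

Counting turns from the unwrapped angle, rather than from corners or self-intersections, keeps the count stable when the spiral drifts outward as it is drawn.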

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents 'Sketch Bug', an interface using sketch-based pen input for debugging tasks: drawing marks to set breakpoints, symbolic strokes for execution control (e.g., stepping), and extending strokes into spirals to repeat actions. The prototype integrates gesture recognition with Python execution tracing in a standard editor. It reports a controlled study with 24 programmers comparing the sketch interface to conventional mouse-and-keyboard input on tasks requiring breakpoint placement, step-wise execution, and runtime state inspection. Results indicate sketch input can support these tasks but introduces challenges in precision, recognition, and gesture recall; the authors conclude it is most promising for interactions benefiting from spatial grounding or continuous movement rather than as a full replacement.

Significance. If the empirical results hold, this contributes to HCI research on programming tools by providing evidence for an alternative input modality in debugging, highlighting scenarios where pen input may offer advantages over discrete controls. The tempered conclusions (not claiming wholesale replacement) and focus on specific benefits strengthen the work's utility for guiding future interface designs in spatially-oriented debugging contexts.

major comments (1)
  1. [Methods] Methods section (study design): The description of the controlled tasks does not specify code complexity details such as presence of loops, nested conditionals, repeated state inspections, or multi-file navigation. This is load-bearing for assessing whether the observed support for breakpoint, stepping, and inspection tasks generalizes beyond short single-file snippets, as precision and recall challenges could compound in realistic sessions.
minor comments (2)
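  1. [Abstract] Abstract and results: The abstract gives no statistical details (e.g., means, p-values, error bars, or effect sizes) for the comparison between interfaces, making it difficult to evaluate the strength of the 'can support' claim. The workload comparisons in the appendix (Table 1: 95% CIs and Wilcoxon tests) should be summarized in the main results.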
  2. [Results] The paper should include a table or figure summarizing task performance metrics across conditions to allow direct comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and positive assessment of the work's contribution. We address the major comment below.

Point-by-point responses
  1. Referee: [Methods] Methods section (study design): The description of the controlled tasks does not specify code complexity details such as presence of loops, nested conditionals, repeated state inspections, or multi-file navigation. This is load-bearing for assessing whether the observed support for breakpoint, stepping, and inspection tasks generalizes beyond short single-file snippets, as precision and recall challenges could compound in realistic sessions.

    Authors: We agree that explicit details on task code complexity are important for evaluating generalizability. Our study tasks used short single-file Python programs (see the task variations in the appendix) that contain loops and require repeated stepping and state inspection, but no multi-file navigation. We will revise the Methods section to include quantitative metrics (e.g., LOC, control-flow nesting depth, number of inspection points) and example code snippets so readers can assess how precision and recall issues might scale. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical user study grounded in external participant data

full rationale

The paper reports a controlled study with 24 programmers measuring task performance on breakpoint placement, step-wise execution, and state inspection using sketch input versus mouse/keyboard. All claims rest on observed participant outcomes against external benchmarks rather than any derivation, fitted parameters, equations, or self-citation chains. No self-definitional steps, predictions that reduce to inputs, or load-bearing self-citations appear in the abstract or study description. This matches the default expectation for non-circular empirical HCI work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the user-study design, the accuracy of the gesture recognizer, and the assumption that the chosen tasks capture the relevant aspects of debugging; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption Standard assumptions of controlled user studies in HCI (task representativeness, participant pool validity, absence of major learning effects between conditions)
    Invoked implicitly when generalizing from the 24-participant lab study to broader claims about sketch input utility.



Appendix A. Task Variations

A.1 Variation 1

```python
def accumulate(combiner, base, n, term): ...
```

  1. During the first loop iteration, which functions are called for term(i) and combiner(...)? What are their input values and return values?
  2. Set a breakpoint at total = combiner(...).
  3. What is the value of total before the first iteration?
  4. What is the value of total after the first iteration?
  5. Let the program run to completion. What is the final return value?
  6. Use the debugger to record the value of total:
     • What is total when i = 9?
     • What is total when i = 13?
     • What is total when i = 22?

A.2 Variation 2

```python
def apply_until(stop_fn, update_fn, initial):
    value = initial
    while not stop_fn(value):
        value = update_fn(value)
    return value


def greater_than_100(x):
    return x > 100


def double_plus_one(x):
    return 2 * x + 1


apply_until(gr...
```

  1. Set a breakpoint at the first line inside apply_until(): value = initial
  2. Run the program until it hits the breakpoint. Then answer:
     • What is the value of initial?
     • What functions were passed as stop_fn and update_fn?
     • What is the initial value of value?
  3. Restart the debugger. Step Over until you hit the loop guard, i.e., while not stop_fn(value):, for the second time.
     • What is the new value of value?
  4. When value = 63, step into the function call.
     • What is the function name?
     • What is the input?
     • What is the return value?
  5. What is the return value of the apply_until call?

Appendix B. Interview Questions

  1. How did using sketching compare to how you typically interact with a debugger?
  2. Were there moments when using sketches felt especially helpful or intuitive?
  3. Were there moments when using sketches felt especially challenging?
  4. How did using a pen or drawing gestures affect your experience?
  5. If you could change or add new functionalities for sketches, what would you most like to have?
  6. In what scenarios do you think this sketch-based debugging approach has the most potential for widespread use?
  7. Is there anything you’d like to share?

Appendix C. Statistical Results

Table 1: Workload comparisons between sketch and wimp. Mean differences are reported as wimp − sketch.

| Measure | 95% CI | Wilcoxon (W, p) |
|---|---|---|
| Mental Demand | [-2.898, 0.880] | W = 102.500, p = 0.4341 |
| Physical Demand | [-2.657, 3.206] | W = 113.500, p = 0.9445 |
| Effort | [-3.218, 2.078] | W = 116.000, p = 0.7325 |
| Performance | [-0.002, 2.509] | ... |
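Table 1 reports, per workload measure, a 95% CI of the mean paired difference (wimp − sketch) and a Wilcoxon statistic. This excerpt does not say which bootstrap variant or which form of W was used. The standard-library sketch below makes three assumptions:

- The CI is a percentile bootstrap over per-participant differences.
- W is the smaller of the two signed-rank sums.
- Zero differences are dropped and tied absolute differences get average ranks.

The p-value is omitted because it needs the exact null distribution or a normal approximation. The function names are hypothetical, and this is not a reproduction of the paper's analysis.

```python
import random
from statistics import mean


def paired_differences(wimp, sketch):
    """Per-participant differences, reported as wimp - sketch as in Table 1."""
    if len(wimp) != len(sketch):
        raise ValueError("need one wimp and one sketch score per participant")
    return [w - s for w, s in zip(wimp, sketch)]


def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean difference (one variant among several)."""
    rng = random.Random(seed)
    n = len(diffs)
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot[round(alpha / 2 * n_resamples)]
    hi = boot[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def wilcoxon_w(diffs):
    """Signed-rank statistic: zeros dropped, tied |d| get average ranks,
    W = min(sum of positive ranks, sum of negative ranks)."""
    nz = [d for d in diffs if d != 0]
    order = sorted(range(len(nz)), key=lambda i: abs(nz[i]))
    ranks = [0.0] * len(nz)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal absolute differences
        while j + 1 < len(order) and abs(nz[order[j + 1]]) == abs(nz[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nz) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nz) if d < 0)
    return min(w_plus, w_minus)
```

Average ranks for ties are what make half-integer statistics such as W = 102.500 possible. Under this convention, a CI that spans zero, as for Mental Demand, Physical Demand, and Effort, is consistent with the non-significant p-values in the table.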