pith. machine review for the scientific record.

arxiv: 2605.11149 · v1 · submitted 2026-05-11 · 💻 cs.HC

Recognition: no theorem link

Conversational Customization of Productivity Systems: A Design Probe of Malleable AI Interfaces

Aryan Kaul, Karthik Sreedhar, Lydia B. Chilton

Pith reviewed 2026-05-13 02:35 UTC · model grok-4.3

classification 💻 cs.HC
keywords conversational customization · malleable interfaces · productivity systems · email workflows · end-user tailoring · generative AI · design probe · iterative refinement

The pith

Users turn their email inbox into a flexible data layer by authoring custom behaviors through natural language conversations rather than treating it as a fixed interface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how people actually use malleable AI systems by giving participants a conversationally customizable email tool and tracking their activity over several days. Participants mostly adapted familiar email patterns into specialized features instead of inventing wholly new ones, which changed how they viewed and worked with their inbox. This shift brought new risks such as mis-specified rules and unintended filtering, which users handled by watching outcomes and refining instructions over time. The work shows that conversational customization can become part of daily tool use when systems support ongoing changes and visibility into what the AI is doing.

Core claim

In a design probe of a conversationally customizable email system, users created and refined inbox categories, interface elements, and workflow behaviors through natural language. These customizations were typically grounded in existing patterns that participants adapted to their own needs. The inbox therefore became a flexible data layer shaped by user-authored features rather than a fixed interface. Customization also introduced risks such as mis-specified behavior, unintended filtering, and uncertainty about outcomes, which participants managed through ongoing oversight and iterative refinement.

What carries the argument

Conversationally customizable email interface that lets users iteratively restructure categories, add interface elements, and author new workflow behaviors directly in natural language.

If this is right

  • Users adapt familiar patterns to create specialized functionality instead of generating entirely novel features.
  • The inbox shifts from a fixed interface to a data layer that users actively shape through their own instructions.
  • Customization brings risks including mis-specified behavior and unintended filtering that require user oversight.
  • Systems must provide support for iterative refinement, visibility into AI behavior, and safe experimentation.
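
To make the "unintended filtering" risk in the list above concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a conversational request such as "hide promotional offers" ends up as a simple keyword rule. The `Email` type, the `hide_if_contains` and `apply_rule` helpers, and the sample messages are hypothetical and are not taken from the paper's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    sender: str
    subject: str


def hide_if_contains(keyword: str):
    """Hypothetical rule: hide any email whose subject contains the keyword."""
    needle = keyword.lower()
    return lambda email: needle in email.subject.lower()


def apply_rule(emails, rule):
    """Split the inbox into (visible, hidden); hidden mail is kept for review."""
    visible = [e for e in emails if not rule(e)]
    hidden = [e for e in emails if rule(e)]
    return visible, hidden


inbox = [
    Email("deals@shop.example", "Special offer: 20% off today"),
    Email("hr@company.example", "Your job offer from Company"),
    Email("advisor@university.example", "Meeting notes"),
]

# The user meant promotions, but the keyword also matches a job offer.
rule = hide_if_contains("offer")
visible, hidden = apply_rule(inbox, rule)

for email in hidden:
    print(f"hidden: {email.subject}")
```

Listing what a rule hid, rather than silently dropping it, is one simple form of the visibility the last bullet calls for: the job-offer email appears in the hidden list, where the user can notice the mis-specification and refine the instruction.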

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conversational approach could let users reshape other productivity tools such as calendars or task lists by layering custom behaviors on top of existing data.
  • Over longer periods, users might develop more complex layered customizations that depend on the AI correctly interpreting successive instructions.
  • Designers of future tools could add explicit safeguards such as preview modes or rollback options to reduce the cost of unintended filtering.
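
The preview and rollback safeguards mentioned in the last bullet could look roughly like the following Python sketch. It is an illustration under stated assumptions, not a description of the paper's system: `RuleStore`, its methods, and the sample subject lines are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Rule = Callable[[str], bool]  # takes a subject line, returns True to hide it


@dataclass
class RuleStore:
    """Hypothetical store of user-authored hide rules with an undo history."""
    rules: List[Rule] = field(default_factory=list)
    history: List[List[Rule]] = field(default_factory=list)

    def preview(self, rule: Rule, subjects: List[str]) -> List[str]:
        """Dry run: list the subjects the rule would hide, changing nothing."""
        return [s for s in subjects if rule(s)]

    def commit(self, rule: Rule) -> None:
        """Activate the rule, saving the previous rule set for rollback."""
        self.history.append(list(self.rules))
        self.rules.append(rule)

    def rollback(self) -> None:
        """Restore the rule set as it was before the most recent commit."""
        if self.history:
            self.rules = self.history.pop()

    def visible(self, subjects: List[str]) -> List[str]:
        """Subjects that no active rule hides."""
        return [s for s in subjects if not any(r(s) for r in self.rules)]


subjects = [
    "Weekly newsletter",
    "Deadline: camera-ready due Friday",
    "Newsletter: robotics digest",
]
store = RuleStore()
hide_newsletters = lambda s: "newsletter" in s.lower()

would_hide = store.preview(hide_newsletters, subjects)  # inspect before applying
store.commit(hide_newsletters)
after_commit = store.visible(subjects)
store.rollback()                                        # undo the last change
after_rollback = store.visible(subjects)
```

The design choice in this sketch is that `preview` never mutates state and `commit` always snapshots the prior rule set, so an experiment that filters too much costs the user one `rollback` call rather than lost mail.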

Load-bearing premise

Short-term use of one specific email-based design probe captures how conversational customization would play out in long-term, real-world productivity work across different tools and users.

What would settle it

A follow-up study in which participants use the system for weeks across multiple productivity tools and show no increase in ongoing oversight or refinement of customizations would indicate the observed pattern of risk management does not generalize.

Figures

Figures reproduced from arXiv: 2605.11149 by Aryan Kaul, Karthik Sreedhar, Lydia B. Chilton.

Figure 1: Inbox interface of the email assistant. Emails are organized into user-defined categories in the left sidebar, while the ...
Figure 2: Conversational restructuring interface in chat mode. The agent analyzes emails in ambiguous categories (e.g., ...
Figure 3: Feature generation interface. Users request new functionality in natural language, inspect generated feature files ...
Figure 4: P0’s feature after Day 1: a “Bloomberg terminal ...
Figure 5: P0’s feature after Day 2: the newsletter interface ...
Figure 6: The conversational customization loop. Users identify friction in their daily workflow, propose a change through the natural language meta-interface, and the system generates a standalone feature bundle. ...
Figure 7: System architecture. The Gmail Assistant (left) manages authentication, inbox sync, categorization, and everyday ...
Figure 8: Feature manager interface. Deployed features are listed with descriptions, ownership metadata, and controls for visi...
Figure 9: Conversational restructuring interface for inbox organization. In chat mode, the agent analyzes emails in ambiguous ...
Figure 10: Category-scoped quick reply interface for email response generation. Within a specific category (e.g., WeTransfer), ...
Figure 11: Inline TODO extraction and augmentation within the inbox. The system analyzes email content to identify actionable ...
Figure 12: Category refinement for surfacing PhD-specific emails. The system enables users to define more granular filters ...
Figure 13: Reply prioritization interface for surfacing emails that require responses. The system analyzes incoming emails to ...
Figure 14: Derived briefing interface for Financial Times newsletters. The system extracts articles from Financial Times digest ...
Figure 15: P0’s feature after Day 1: a “Bloomberg terminal-style” newsletter interface that aggregates emails into a standalone ...
Figure 16: P0’s feature after Day 2: the newsletter interface extended with extracted financial tickers and geographic regions, ...
Figure 17: Participant P1’s Day 1 feature: an apartment listings ranking interface generated from email data. The interface ...
Figure 18: Participant P1’s Day 2 iteration of the apartment rankings feature. The system augments listings with inferred ...
Figure 19: Participant P2’s Day 1 job application tracker generated from their inbox. The system aggregates emails in the ...
Figure 20: Participant P2’s Day 2 iteration of the job application tracker introducing task awareness. In response to use, the ...
Figure 21: Further iteration of P2’s feature introducing inbox-level prioritization behavior. Beyond the tracker interface, the ...
Figure 22: Participant P3’s Day 1 newsletter article tracker derived from inbox emails. The system aggregates emails from the ...
Figure 23: Participant P3’s Day 2 iteration aligning the tracker with inbox conventions. Following initial use, the interface is ...
Figure 24: Participant P4’s Day 1 feature introducing category-specific quick replies. P4 adds a “Quick Reply” button directly ...
Figure 25: Participant P4’s Day 2 iteration introducing filtering for group-related requests. Building on the quick reply ...
Figure 26: Further iteration surfacing coordination information for group formation. From the filtered set of “Need Group” ...
Figure 27: Participant P5’s Day 1 feature for filtering and prioritizing robotics-related emails. P5 creates a filter that identifies ...
Figure 28: Participant P6’s Day 1 feature surfacing deadline-related emails. P6 creates a feature that identifies emails containing ...
Figure 29: Participant P6’s Day 2 iteration improving deadline accuracy and adding time estimates. Following initial use, P6 ...
Figure 30: Participant P7’s Day 1 feature constructing a JIRA-style interface from email data. P7 creates a separate interface ...
Figure 31: Participant P7’s Day 2 iteration introducing categorical organization of advisor emails. Building on the initial ...

Original abstract

Customization has long been a central goal in interactive systems, yet prior work shows that end-user tailoring occurs infrequently and is often confined to initial setup or moments of breakdown. Recent advances in generative AI suggest that highly malleable systems, where users can modify system behavior through natural language, are now technically feasible. However, it remains unclear how such malleability is used in practice: What kinds of customizations do users create, when do they choose to customize, and how do these modifications shape their experience of everyday tools? We present a design probe that uses a conversationally customizable email system as an instrument to study how users create and refine functionality within everyday tools. The system allows users to iteratively modify their inbox by restructuring categories, introducing interface elements, and authoring new workflow behaviors directly through natural language interaction. We study how participants create, refine, and use these features over several days within their own email workflows. We find that users' customizations are often grounded in existing patterns, which they adapt and specialize to fit their needs, rather than generating entirely novel functionality. Malleability changes how users engage with their inbox, shifting it from a fixed interface to a flexible data layer shaped through user-authored features. At the same time, customization introduces new forms of risk, including mis-specified behavior, unintended filtering, and uncertainty around outcomes, which users manage through ongoing oversight and refinement. These findings highlight how conversational customization becomes embedded within everyday interaction, and point toward the need for systems that support iterative refinement, visibility into behavior, and safe experimentation as users shape their own tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents a design probe of a conversationally customizable email system that lets users restructure categories, add interface elements, and author workflow behaviors through natural language. In a study where participants used the system within their own email workflows for several days, the authors report that customizations are typically adaptations of existing patterns rather than entirely novel functionality. They claim that malleability transforms the inbox from a fixed interface into a flexible data layer shaped by user-authored features, while also introducing risks such as mis-specified behavior, unintended filtering, and outcome uncertainty that users address through ongoing oversight and iterative refinement.

Significance. If the results hold, this work is significant for HCI research on end-user customization and malleable AI interfaces. By embedding a functional probe in participants' real workflows, it supplies concrete qualitative evidence of how conversational customization integrates into daily use, revealing adaptive behaviors alongside emergent oversight challenges. These insights can guide the design of future systems that better support iterative refinement, behavioral visibility, and safe experimentation in productivity tools.

major comments (2)
  1. Abstract and Findings: The central claim that malleability produces a durable shift 'from a fixed interface to a flexible data layer shaped through user-authored features' rests on observations from a probe lasting only 'several days.' This duration is insufficient to distinguish sustained changes in engagement from novelty effects or temporary adaptation, directly weakening the generalization to everyday productivity workflows stated in the abstract.
  2. Method and Discussion: The probe is confined to a single email system, yet the claims extend to 'productivity systems' broadly and to how users manage risks like mis-specification across tools. Without data from additional domains or a cross-tool comparison, the observed risk-management strategies (ongoing oversight and refinement) may be domain-specific rather than generalizable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, clarifying the scope of our design probe while revising the manuscript to better reflect its exploratory nature and limitations.

Point-by-point responses
  1. Referee: Abstract and Findings: The central claim that malleability produces a durable shift 'from a fixed interface to a flexible data layer shaped through user-authored features' rests on observations from a probe lasting only 'several days.' This duration is insufficient to distinguish sustained changes in engagement from novelty effects or temporary adaptation, directly weakening the generalization to everyday productivity workflows stated in the abstract.

    Authors: We agree that the short duration of the probe precludes strong claims about long-term durability or sustained shifts in engagement. The study was explicitly framed as a design probe to surface initial patterns of conversational customization within participants' real workflows, not as a longitudinal evaluation. We have revised the abstract to remove references to a 'durable shift' and to emphasize observed behaviors during the multi-day period. We have also expanded the discussion and added a dedicated limitations paragraph that acknowledges potential novelty effects and calls for future longitudinal work to assess persistence. These changes align our claims more precisely with the data collected.
    revision: yes

  2. Referee: Method and Discussion: The probe is confined to a single email system, yet the claims extend to 'productivity systems' broadly and to how users manage risks like mis-specification across tools. Without data from additional domains or a cross-tool comparison, the observed risk-management strategies (ongoing oversight and refinement) may be domain-specific rather than generalizable.

    Authors: We acknowledge that confining the probe to email limits direct evidence for generalizability across productivity systems. The email domain was chosen because it provides a rich, authentic workflow for observing customization and risk management in situ. We have revised the discussion to explicitly qualify that the specific oversight strategies observed may be influenced by email characteristics and to recommend cross-domain studies. At the same time, the core mechanisms (adapting existing patterns through conversation and managing uncertainty via iterative refinement) are positioned as transferable insights for malleable AI interfaces, supported by connections to prior end-user programming literature. We do not claim domain-independence but maintain that the probe yields actionable HCI implications.
    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical qualitative design probe with no derivations or self-referential fitting

full rationale

The paper is a design-probe study that deploys a conversational email system with participants for several days and reports observed customization patterns, risks, and engagement shifts. No equations, parameters, predictions, or derivations appear in the abstract or described claims. Central findings (customizations grounded in existing patterns; inbox as flexible data layer; risk management via oversight) are presented as direct outcomes of the user study rather than reductions to fitted inputs, self-definitions, or self-citation chains. The work is self-contained against external benchmarks as an exploratory qualitative instrument; no load-bearing step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a qualitative design probe study with no quantitative models. No free parameters or invented entities are introduced. Domain assumptions include that natural language suffices for meaningful customization and that short-term probe use reveals stable patterns.

axioms (1)
  • domain assumption: Natural language interaction enables users to meaningfully modify system behavior in productivity tools
    Invoked in the description of the probe system allowing iterative modification through conversation.

pith-pipeline@v0.9.0 · 5590 in / 1283 out tokens · 56902 ms · 2026-05-13T02:35:03.750152+00:00 · methodology
