pith · machine review for the scientific record

arxiv: 2604.10925 · v1 · submitted 2026-04-13 · 💻 cs.HC

Recognition: no theorem link

From Words to Widgets for Controllable LLM Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:27 UTC · model grok-4.3

classification 💻 cs.HC
keywords controllable LLM generation · malleable prompting · GUI widgets · user interfaces · natural language prompting · decoding algorithm · user study

The pith

Malleable Prompting converts natural language preferences into configurable GUI widgets for more precise control over LLM generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users often find it hard to express exact preferences like tone or style when prompting LLMs with words alone. Malleable Prompting solves this by turning those preferences into directable widgets such as sliders and dropdowns. An accompanying decoding algorithm adjusts the model's token choices based on the widget settings. A user study found that this method lets people hit their target preferences more accurately and feels more transparent than standard prompting.

Core claim

Malleable Prompting reifies preference expressions from natural language prompts into GUI widgets that users can adjust to steer LLM outputs. A supporting decoding algorithm modulates the token probability distribution according to the widget values, yielding more precise and controllable generation, as shown in a user study.

What carries the argument

Malleable Prompting, which transforms prompt-based preferences into interactive widgets and uses a custom LLM decoding algorithm to influence generation based on those widget settings.
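The paper's decoding algorithm is not specified in the text quoted here, but the general mechanism it describes — modulating the token probability distribution according to widget values — can be sketched as logit biasing. Everything below is hypothetical illustration: the toy vocabulary, the `slider` and `strength` values, and the token-to-bias mapping are assumptions, not the authors' method.

```python
import math

def modulate_logits(logits, widget_biases):
    """Shift raw token logits by widget-controlled additive biases.

    logits: dict token -> raw logit from the base model
    widget_biases: dict token -> additive bias derived from a widget
    value (e.g. a slider in [-1, 1]) times a strength constant.
    """
    return {tok: logit + widget_biases.get(tok, 0.0)
            for tok, logit in logits.items()}

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Toy vocabulary: a "formality" slider up-weights formal wording and
# down-weights casual wording. Values are invented for illustration.
raw = {"hello": 1.0, "greetings": 0.2, "hey": 1.1}
slider = 0.8      # hypothetical widget value in [-1, 1]
strength = 2.0    # hypothetical scaling constant
biases = {"greetings": slider * strength, "hey": -slider * strength}

probs = softmax(modulate_logits(raw, biases))
best = max(probs, key=probs.get)  # "greetings" wins with these toy numbers
```

In a real decoder this bias would be applied at every generation step (e.g. as a logits processor), and the mapping from a preference phrase to a token set is itself a nontrivial part of the technique.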

If this is right

  • Participants achieved target preferences more precisely with the widget interface.
  • The approach was perceived as more controllable and transparent.
  • Visualizations help users understand each control's effect on the output.
  • It enables better comparison across different generation iterations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could reduce the trial-and-error involved in crafting effective prompts.
  • Extending it to other modalities like image or code generation might offer similar benefits.
  • It highlights the value of hybrid text-GUI interactions for AI systems.

Load-bearing premise

The decoding algorithm must consistently alter token probabilities to reflect widget adjustments in a way that produces outputs matching the intended preferences without unwanted side effects.

What would settle it

Run a controlled experiment where users specify exact target outputs for subjective attributes and measure the distance between generated text and targets with and without the widgets.
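One way to operationalize "distance between generated text and targets" is a mean absolute error over normalized attribute intensities, scored per condition. The function is generic; the attribute names and ratings below are hypothetical stand-ins (the 50%-vs-20% compression figure echoes a participant quote in Figure 5), not study data.

```python
def preference_error(targets, achieved):
    """Mean absolute error between target and achieved attribute
    intensities, each normalized to [0, 1]. Lower is better."""
    assert targets.keys() == achieved.keys()
    return sum(abs(targets[k] - achieved[k]) for k in targets) / len(targets)

# Hypothetical ratings: the user wanted 50% compression; the widget
# condition landed near it, the baseline prompt overshot.
targets  = {"concision": 0.50, "formality": 0.70}
widget   = {"concision": 0.45, "formality": 0.65}
baseline = {"concision": 0.20, "formality": 0.40}

err_widget = preference_error(targets, widget)    # ≈ 0.05
err_base = preference_error(targets, baseline)    # ≈ 0.30
```

Averaging this error across participants and attributes, with and without widgets, would give exactly the comparison proposed above.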

Figures

Figures reproduced from arXiv: 2604.10925 by Chao Zhang, Jeffrey M. Rzeszotarski, Lunyiu Nie, Tal August, Yiren Liu, Yun Huang.

Figure 1
Figure 1: Malleable Prompting. We propose a novel interactive prompting technique that allows users to reify preference expressions in their prompts into tangible GUI controls. With Malleable Prompting, users can directly manipulate preferences in their prompts through widgets to steer generation (A) and inspect how each preference shapes the output (B).
Figure 2
Figure 2: Making prompts malleable. (A) Users can optionally trigger the system to enrich an initial prompt with suggested preferences. (B) The system parses the enriched prompt to detect controllable attributes. (C) These attributes are reified as GUI widgets, with optional manual binding of text spans to widget types. (D) For categorical attributes, the system suggests alternative values that users can use for i…
Figure 3
Figure 3: Inspecting widget effects on generated text. Clicking a widget reveals its localized influence on the output: affected sentences are underlined, positively induced phrases are highlighted in green, and discouraged wording is highlighted in red.
Figure 4
Figure 4: Tracking prompt iterations with a node-link graph. Versions generated with the same prompts and widgets appear along the same horizontal branch. Adding new preferences or widgets creates a new vertical branch. Visual encodings map widget values to node color for categorical attributes, node fill for binary attributes, and node size for continuous or numerical attributes.
Figure 5
Figure 5: Perceived support for design goals. Participants' responses to a 7-point Likert-scale questionnaire, comparing our system against the baseline condition across both execution-related (Q1–Q4) and evaluation-related (Q5–Q6) measures.
Figure 6
Figure 6: An example screenshot of the baseline interface.
read the original abstract

Natural language remains the predominant way people interact with large language models (LLMs). However, users often struggle to precisely express and control subjective preferences (e.g., tone, style, and emphasis) through prompting. We propose Malleable Prompting, a new interactive prompting technique for controllable LLM generation. It reifies preference expressions in natural language prompts into GUI widgets (e.g., sliders, dropdowns, and toggles) that users can directly configure to steer generation, while visualizing each control's influence on the output to support attribution and comparison across iterations. To enable this interaction, we introduce an LLM decoding algorithm that modulates the token probability distribution during generation based on preference expressions and their widget values. Through a user study, we show that Malleable Prompting helps participants achieve target preferences more precisely and is perceived as more controllable and transparent than natural language prompting alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Malleable Prompting, an interactive technique that reifies natural-language preference expressions (e.g., tone, style) into GUI widgets such as sliders and toggles. Users configure these widgets to steer LLM output while the system visualizes each control's influence. The approach is enabled by a new decoding algorithm that modulates the token probability distribution during generation according to the preference expressions and current widget values. A user study is reported to show that the method yields more precise achievement of target preferences and is rated higher in controllability and transparency than natural-language prompting alone.

Significance. If the decoding algorithm's modulation is shown to be faithful and the user-study results hold under standard scrutiny, the work would offer a concrete advance in controllable LLM interfaces by combining direct manipulation with attribution visualizations. This could influence both HCI research on prompt engineering and practical tools for subjective attribute control, provided the algorithmic contribution is isolated from the GUI benefits.

major comments (2)
  1. Abstract and method description: the central claim that the new decoding algorithm enables precise controllability rests on its ability to modulate token probabilities according to widget values without introducing fluency or coherence artifacts, yet no quantitative fidelity metrics (e.g., attribute-matching accuracy, perplexity deltas, or side-effect measurements) are provided independent of the user study.
  2. User-study section: the reported gains in precision, controllability, and transparency are load-bearing for the contribution, but the abstract and available text supply no details on study design, sample size, statistical tests, or exact comparison conditions, preventing assessment of whether outcomes are attributable to the algorithm versus the interface alone.
minor comments (2)
  1. Clarify the exact form of the preference expressions that are reified into widgets and how conflicts between multiple expressions are resolved in the decoding step.
  2. Add a figure or table that isolates the algorithmic contribution (e.g., outputs with widgets fixed vs. natural-language baseline) before presenting the full user-study results.
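The fidelity metric the report asks for in major comment 1 — a perplexity delta between modulated and unmodulated decoding — is straightforward to compute from per-token log-probabilities. The log-probability values below are invented for illustration; a real check would score the same texts under both decoding modes.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probs for the same text scored with and
# without widget-driven logit modulation.
lp_unmodulated = [-1.2, -0.8, -1.5, -0.9]
lp_modulated   = [-1.3, -0.9, -1.6, -1.0]

# A small positive delta would suggest modulation costs little fluency;
# a large one would flag coherence artifacts.
delta = perplexity(lp_modulated) - perplexity(lp_unmodulated)
```

Reporting this delta alongside attribute-matching accuracy, independent of the user study, would address the referee's concern directly.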

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [—] Abstract and method description: the central claim that the new decoding algorithm enables precise controllability rests on its ability to modulate token probabilities according to widget values without introducing fluency or coherence artifacts, yet no quantitative fidelity metrics (e.g., attribute-matching accuracy, perplexity deltas, or side-effect measurements) are provided independent of the user study.

    Authors: We agree that the abstract and method description do not include quantitative fidelity metrics for the decoding algorithm's modulation effects independent of the user study. The current manuscript presents the algorithm in Section 4 and relies primarily on the user study for evidence of controllability. In the revision, we will add a dedicated evaluation subsection reporting quantitative metrics such as attribute-matching accuracy on a held-out preference set and perplexity deltas relative to unmodulated decoding to demonstrate that the modulation preserves fluency and coherence without introducing artifacts. These metrics will be presented separately from the user study. revision: yes

  2. Referee: [—] User-study section: the reported gains in precision, controllability, and transparency are load-bearing for the contribution, but the abstract and available text supply no details on study design, sample size, statistical tests, or exact comparison conditions, preventing assessment of whether outcomes are attributable to the algorithm versus the interface alone.

    Authors: We agree that the abstract lacks these details and that the user study section should more explicitly support assessment of the results. In the revision, we will update the abstract to include a concise summary of the study design (within-subjects comparison), sample size, statistical tests, and comparison conditions. We will also expand the user study section to clarify the contributions of the decoding algorithm versus the GUI widgets, including additional discussion or analysis to help isolate their effects. This will allow readers to better evaluate attribution. revision: yes

Circularity Check

0 steps flagged

No circularity: central claim rests on independent user study evaluation

full rationale

The paper introduces Malleable Prompting and a new decoding algorithm but validates its benefits through a user study measuring precision, controllability, and transparency against natural language prompting. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains are described in the provided text. The algorithm is presented as an enabling mechanism whose performance is assessed empirically rather than by construction from its own inputs. This is a standard self-contained empirical contribution with no load-bearing reductions to definitions or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces a new interaction technique and decoding method without explicit free parameters, mathematical axioms, or invented physical entities; it relies on standard assumptions about LLM token sampling and user study validity.

pith-pipeline@v0.9.0 · 5456 in / 1169 out tokens · 26066 ms · 2026-05-10T16:27:15.830854+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

    cs.HC 2026-05 unverdicted novelty 7.0

    Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited explorati...

Reference graph

Works this paper leans on

66 extracted references · 49 canonical work pages · cited by 1 Pith paper · 3 internal anchors
