pith. machine review for the scientific record.

arXiv:2604.07652 · v1 · submitted 2026-04-08 · 💻 cs.AI · cs.HC


Bridging Natural Language and Interactive What-If Interfaces via LLM-Generated Declarative Specification


Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords what-if analysis · natural language interfaces · large language models · declarative specification · interactive visualization · error taxonomy · data exploration

The pith

An intermediate declarative specification lets large language models turn natural-language what-if questions into reliable interactive visual interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage process in which large language models first translate a user's natural-language description of a hypothetical data scenario into a formal specification. This specification encodes the intended parameters, constraints, and logic, allowing errors to be detected and corrected before the specification is compiled into an interactive interface containing controls and linked visualizations. Experiments with 405 questions across five datasets and three models show that roughly half the specifications are produced correctly on the first try, with targeted repairs raising the rate above 80 percent. The work demonstrates that functional errors that go undetected produce interfaces that compile and appear usable yet systematically misrepresent the original question. This matters because it offers a practical path to making exploratory data analysis accessible through plain language while avoiding the inconsistencies of direct chatbot responses.
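Read operationally, the workflow has the shape sketched below. This is a minimal illustration, not the authors' code: the helper names (generate_psl, validate, repair, compile_interface) and their stubbed bodies are assumptions, and a real implementation would call an LLM inside generate_psl and a PSL compiler inside compile_interface.

```python
# Minimal sketch of the two-stage shape described above. All helper names and
# bodies are hypothetical stand-ins, not the paper's actual API.

def generate_psl(question: str) -> dict:
    """Stage 1a: an LLM call mapping an NL what-if question to a PSL-style
    spec. Stubbed with a fixed spec for illustration."""
    return {"outputVariable": "Exited", "objective": "minimize",
            "model": "churn_predictor"}

def validate(spec: dict) -> list[str]:
    """Detect errors before compilation; an empty list means the spec passes."""
    required = ("outputVariable", "objective", "model")
    return [f"missing field: {f}" for f in required if f not in spec]

def repair(spec: dict, errors: list[str]) -> dict:
    """Stage 1b: targeted repair keyed to the detected errors (few-shot
    prompts in the paper; a no-op stub here)."""
    return spec

def compile_interface(spec: dict) -> str:
    """Stage 2: compile the validated spec into an interactive interface
    with parameter controls and linked views (stubbed as a description)."""
    return f"interface(target={spec['outputVariable']}, goal={spec['objective']})"

question = "What happens to churn if the estimated salary is doubled?"
spec = generate_psl(question)
errors = validate(spec)
if errors:                       # only failing specs get the repair pass
    spec = repair(spec, errors)
print(compile_interface(spec))
```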

Core claim

Large language models can generate declarative specifications from natural-language what-if questions that capture analytical intent and logic; these specifications can be validated and repaired, then compiled into interactive visual interfaces with parameter controls and linked views; when this intermediate layer is omitted, errors propagate and produce plausible but misleading interfaces.

What carries the argument

The declarative specification that encodes user intent, parameters, constraints, and analysis logic, which is generated by the model, repaired as needed, and then compiled into the interactive interface.
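For concreteness, a specification of this kind might look like the following Python dict. The field names outputVariable, objective, and model are taken from the Figure 2 caption; the surrounding structure and values are assumptions, not the authors' published PSL syntax.

```python
# Illustrative PSL-style spec for the Figure 2 point-sensitivity example.
# Field names follow the caption; everything else is assumed.
spec = {
    "outputVariable": "Exited",   # dataset column standing in for 'churn'
    "objective": "minimize",      # other goals named in the caption: maximize, setTarget
    "model": "churn_predictor",   # predictive model connecting inputs to the output
}
```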

If this is right

  • Roughly half of LLM-generated specifications require no intervention across tested models.
  • Few-shot repair prompts can raise the overall success rate to more than 80 percent.
  • Functional errors that survive compilation produce interfaces that appear correct but answer the wrong question.
  • An error taxonomy separating compilation failures from intent mismatches guides repair strategies.
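A quick sanity check on the first two numbers: moving from 52.42% to 80.42% overall implies the targeted repairs resolved roughly 59 percent of the specifications that initially failed. The arithmetic, with counts inferred from the abstract's rates rather than taken from the paper's tables:

```python
# Back-of-envelope check on the reported success rates.
total = 405
first_try = 0.5242       # correct without intervention
after_repair = 0.8042    # correct after targeted few-shot repairs

initial_failures = total * (1 - first_try)                # ~192.7 specs
newly_fixed = total * (after_repair - first_try)          # ~113.4 specs
conditional_repair_rate = newly_fixed / initial_failures  # ~0.588

print(f"repairs resolve ~{conditional_repair_rate:.1%} of initially failing specs")
```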

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same specification layer could be reused to support editing or reuse of previously created what-if scenarios.
  • Displaying the generated specification alongside the interface would let users verify or adjust intent directly.
  • The error taxonomy could inform fine-tuning of models specifically for analytical specification tasks.

Load-bearing premise

Large language models can produce specifications that correctly represent user intent in the majority of cases, so that only targeted repairs are required for the remainder.

What would settle it

A collection of natural-language what-if questions for which a generated specification compiles successfully yet the resulting interface sets parameter values or displays visualizations that contradict the original question.
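One way such a collection could be scored, sketched under heavy assumptions: treat a case as a functional failure when the compiled interface's parameter bindings contradict the values the question fixed. The binding format and helper name below are hypothetical, since the paper's interface representation is not shown here.

```python
# Hypothetical functional-failure check: the spec compiled, but the interface
# binds a parameter in a way that contradicts the original question.

def is_functional_failure(interface_params: dict, expected: dict) -> bool:
    """True when any parameter the question pinned down disagrees with the
    value the compiled interface actually exposes."""
    return any(interface_params.get(k) != v for k, v in expected.items())

# Example: the question doubled EstimatedSalary, but the interface halved it.
interface_params = {"EstimatedSalary": 0.5}   # multiplier the UI applies
expected = {"EstimatedSalary": 2.0}           # what the question asked for
print(is_functional_failure(interface_params, expected))  # True -> misleading UI
```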

Figures

Figures reproduced from arXiv:2604.07652 by Cagatay Demiralp, Dan Marshall, Diya Patel, Ryan Rossi, Sirui Zeng, Sneha Gathani, Steven Drucker, Zhicheng Liu.

Figure 1: Our two-stage workflow: (1) translating NL WIA questions into a structured intermediate representation like …
Figure 2: Example PSL specification for a point sensitivity question, with properties mapping to PRAXA primitives [11]. outputVariable encodes the target variable of interest from the dataset, such as mapping 'churn' to the Exited parameter. objective encodes the intended goal for the outputVariable, e.g., minimize churn. Other goals include maximize or setTarget. model encodes the predictive model connecting input …
Figure 3: Eleven WIA subtypes in our benchmark grouped under three broad categories, with definitions and example questions from …
Figure 4: Distribution of the 405-question benchmark by gen…
Figure 5: Strategy adopted for generating PSL for NL WIA questions in benchmark.
Figure 6: Number of erroneous LLM-generated specifications.
Figure 7: We identify 2 classes of errors observed in the LLM-generated specifications: (1) …
Figure 8: Example of a targeting prompt contents for correcting …
Figure 9: Error distribution after targeted repair across error categories (EC1–EC10) and models. For each error, we report the count …
Figure 10: Example of an error across all functional errors (EC5–EC10) and their impact on the visual interface. We show interfaces …
Figure 11: Six WIA scenarios spanning different analysis types …
Figure 12: Summary of the author's formative exploration of existing tools (columns) across major WIA tasks (rows). Green checkmarks …
Figure 13: Benchmark construction analysis. (A) SentenceTransformer similarity between seed and hand-authored questions across …
Figure 14: Number of erroneous LLM-generated specifications compared against the ground-truth. (A) Before intervention by dataset …
Figure 15: Findings from Experiment A; binary detection of any …
Figure 16: Per-category calibration and agreement between human annotators and the LLMs. For each error category (EC1–EC9) …
Figure 17: Common visuals and controls observed in existing BI tools and research systems to illustrate the outputs of different WIA …
Original abstract

What-if analysis (WIA) is an iterative, multi-step process where users explore and compare hypothetical scenarios by adjusting parameters, applying constraints, and scoping data through interactive interfaces. Current tools fall short of supporting effective interactive WIA: spreadsheet and BI tools require time-consuming and laborious setup, while LLM-based chatbot interfaces are semantically fragile, frequently misinterpret intent, and produce inconsistent results as conversations progress. To address these limitations, we present a two-stage workflow that translates natural language (NL) WIA questions into interactive visual interfaces via an intermediate representation, powered by the Praxa Specification Language (PSL): first, LLMs generate PSL specifications from NL questions capturing analytical intent and logic, enabling validation and repair of erroneous specifications; and second, the specifications are compiled into interactive visual interfaces with parameter controls and linked visualizations. We benchmark this workflow with 405 WIA questions spanning 11 WIA types, 5 datasets, and 3 state-of-the-art LLMs. The results show that across models, half of specifications (52.42%) are generated correctly without intervention. We perform an analysis of the failure cases and derive an error taxonomy spanning non-functional errors (specifications fail to compile) and functional errors (specifications compile but misrepresent intent). Based on the taxonomy, we apply targeted repairs on the failure cases using few-shot prompts and improve the success rate to 80.42%. Finally, we show how undetected functional errors propagate through compilation into plausible but misleading interfaces, demonstrating that the intermediate specification is critical for reliably bridging NL and interactive WIA interface in LLM-powered WIA systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a two-stage workflow that uses LLMs to generate specifications in the newly introduced Praxa Specification Language (PSL) from natural-language what-if analysis (WIA) questions; these specifications are then validated, repaired, and compiled into interactive visual interfaces with parameter controls and linked visualizations. The approach is evaluated on a benchmark of 405 WIA questions spanning 11 WIA types, 5 datasets, and 3 LLMs, with a baseline success rate of 52.42% rising to 80.42% after targeted few-shot repairs derived from an error taxonomy of non-functional and functional errors. The paper argues that the intermediate PSL representation is essential to prevent intent-misrepresentation errors from propagating into misleading interfaces.

Significance. If the central claims hold, the work offers a concrete, empirically supported method for making LLM-driven WIA more reliable than direct chatbot interfaces while avoiding the setup overhead of traditional BI tools. The large-scale benchmark, explicit error taxonomy, and demonstration of how undetected functional errors produce plausible but incorrect compiled interfaces are notable strengths that directly support the value of the intermediate declarative layer.

major comments (1)
  1. [results and error-analysis sections] The evaluation of the repair stage: the few-shot repair prompts are derived from the error taxonomy obtained on the entire 405-question benchmark. No held-out test set, cross-validation, or separate generalization experiment is reported for the repair prompts themselves. This leaves open whether the jump from 52.42% to 80.42% reflects general repair strategies or patterns specific to the benchmark's 11 WIA types and 5 datasets, directly affecting the claim that the workflow reliably captures user intent on new questions.
minor comments (2)
  1. [Abstract] The abstract states that 'half of specifications (52.42%) are generated correctly'; the parenthetical figure is slightly above 50% and should be phrased consistently with the exact reported value.
  2. PSL is presented as an invented entity; a self-contained syntax definition, grammar, or example set should appear early (ideally before the benchmark) to allow readers to assess the compilation step without external lookup.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review. The concern about the repair stage evaluation is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: [results and error-analysis sections] The evaluation of the repair stage: the few-shot repair prompts are derived from the error taxonomy obtained on the entire 405-question benchmark. No held-out test set, cross-validation, or separate generalization experiment is reported for the repair prompts themselves. This leaves open whether the jump from 52.42% to 80.42% reflects general repair strategies or patterns specific to the benchmark's 11 WIA types and 5 datasets, directly affecting the claim that the workflow reliably captures user intent on new questions.

    Authors: We agree that the current evaluation leaves the generalizability of the specific few-shot repair prompts untested in a strict sense. The error taxonomy was intentionally derived from the full 405-question benchmark to ensure broad coverage of non-functional and functional errors across all 11 WIA types, 5 datasets, and 3 LLMs. The repair prompts were then constructed from representative examples in that taxonomy and applied to the observed failures. While this yields a clear demonstration that targeted repairs can substantially improve success rates (52.42% to 80.42%), it does not constitute an independent test of whether the same prompts would succeed on entirely new questions outside the benchmark distribution. To strengthen the manuscript, we will revise the results and error-analysis sections to explicitly acknowledge this limitation and add a held-out evaluation: we will randomly partition the benchmark into an 80/20 development/test split, re-derive the taxonomy and few-shot prompts solely from the development portion, apply the resulting prompts to failures in the held-out test portion, and report the repair success rate on unseen questions. This addition will directly address the concern about whether the observed improvement reflects general repair strategies. revision: yes
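The promised protocol is straightforward to state precisely. A sketch of the held-out evaluation under stated assumptions: the function names, the benchmark item format, and the boolean repair outcome are all hypothetical.

```python
# Sketch of the 80/20 held-out repair evaluation the rebuttal commits to.
# derive_prompts and apply_repair stand in for taxonomy construction and
# targeted few-shot repair; both are assumed interfaces, not the authors' code.
import random

def heldout_repair_eval(benchmark: list, derive_prompts, apply_repair, seed=0):
    rng = random.Random(seed)
    items = benchmark[:]
    rng.shuffle(items)
    cut = int(0.8 * len(items))
    dev, test = items[:cut], items[cut:]

    prompts = derive_prompts(dev)  # taxonomy + prompts from dev failures only
    test_failures = [q for q in test if not q["correct"]]
    fixed = sum(apply_repair(q, prompts) for q in test_failures)
    return fixed / len(test_failures) if test_failures else 1.0

# Toy usage with stub callables:
bench = [{"correct": i % 2 == 0} for i in range(20)]
rate = heldout_repair_eval(bench, lambda dev: {}, lambda q, p: True)
print(f"held-out repair success: {rate:.0%}")
```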

Circularity Check

1 steps flagged

Repair prompts and taxonomy derived from and applied to the same 405-question benchmark

specific steps
  1. fitted result presented as prediction [Abstract]
    "We perform an analysis of the failure cases and derive an error taxonomy spanning non-functional errors (specifications fail to compile) and functional errors (specifications compile but misrepresent intent). Based on the taxonomy, we apply targeted repairs on the failure cases using few-shot prompts and improve the success rate to 80.42%."

    The error taxonomy is extracted from failures on the 405-question benchmark. Few-shot repair prompts are built from that taxonomy. Applying the resulting repairs to the identical failure cases and reporting the 80.42% success rate means the final metric is obtained by construction from the data used to define the repair strategy, rather than measuring performance on held-out questions.

full rationale

The paper measures initial LLM generation success (52.42%) directly on the 405-question benchmark. It then derives an error taxonomy from the observed failures on that same benchmark and constructs few-shot repair prompts from the taxonomy. These repairs are applied back to the failure cases, yielding the improved 80.42% figure. This makes the headline performance metric a fitted result on the evaluation data rather than an independent test of generalization. The demonstration that undetected functional errors produce misleading interfaces remains an independent observation, and no self-citations or definitional loops appear in the core workflow.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's claims depend on the LLM's ability to produce usable PSL specs and on the compilation process working as intended from those specs.

axioms (1)
  • domain assumption Current large language models are capable of generating syntactically and semantically correct PSL specifications from natural language descriptions of what-if analyses when given appropriate prompts.
    This is central to the first stage of the workflow and is tested empirically but assumed to hold for the approach to be viable.
invented entities (1)
  • Praxa Specification Language (PSL) no independent evidence
    purpose: To serve as a declarative intermediate representation that captures analytical intent for validation, repair, and compilation into interactive interfaces.
    PSL is introduced in the paper as the key bridge; no external validation of its design is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5619 in / 1358 out tokens · 76315 ms · 2026-05-10T17:01:52.186523+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 3 canonical work pages · 1 internal anchor

  [1] Anonymous. 2025. Supplementary Materials. OSF. https://osf.io/hsu6t/overview?view_only=ec9b0bd336c845e1a45408ae25349d4d
  [2] Anthropic. 2025. Claude (Opus 4.1) [Large language model]. https://claude.ai/
  [3] Aditya Bhattacharya, Jeroen Ooge, Gregor Stiglic, and Katrien Verbert. 2023. Directive explanations for monitoring the risk of diabetes onset: introducing directive data-centric explanations and combinations to support what-if explorations. In Proceedings of the 28th International Conference on Intelligent User Interfaces. 204–219.
  [4] Microsoft Corporation. 2025. Microsoft Excel. https://office.microsoft.com/excel
  [5] Microsoft Corporation. 2025. Microsoft Power BI. https://powerbi.microsoft.com
  [6] Kedar Dhamdhere, Kevin S McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan. 2017. Analyza: Exploring data with conversation. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. 493–504.
  [7] Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein. 2018. Iris: A conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
  [8] Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, and Jianling Sun. 2023. CatSQL: Towards real world natural language to SQL applications. Proceedings of the VLDB Endowment 16, 6 (2023), 1534–1547.
  [9–10] Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, and Karrie G Karahalios. 2015. DataTone: Managing ambiguity in natural language interfaces for data visualization. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology. 489–500.
  [11] Sneha Gathani, Madelon Hulsebos, James Gale, Peter J Haas, and Çağatay Demiralp. 2021. Augmenting decision making via interactive what-if analysis. CIDR (2021).
  [12] Sneha Gathani, Kevin Li, Raghav Thind, Sirui Zeng, Matthew Xu, Peter J. Haas, Cagatay Demiralp, and Zhicheng Liu. 2025. PRAXA: A Grammar for What-If Analysis. arXiv:2510.09791 [cs.HC]. https://arxiv.org/abs/2510.09791
  [13] Sneha Gathani, Zhicheng Liu, Peter J Haas, and Çağatay Demiralp. 2025. What-if Analysis for Business Professionals: Current Practices and Future Opportunities. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–17.
  [14] Gokagglers. 2024. Email Campaign Management Dataset for Predicting Whether Email is Opened or Not. https://www.kaggle.com/datasets/loveall/email-campaign-management-for-sme?select=Train_psolI3n.csv
  [15] Pat Hanrahan. 2006. VizQL: a language for query, analysis and visualization. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 721–721.
  [16] Subhashis Hazarika, Haruki Hirasawa, Sookyung Kim, Kalai Ramea, Salva R Cachay, Peetak Mitra, Dipti Hingmire, Hansi Singh, and Phil J Rasch. 2023. HAiVA: Hybrid AI-assisted Visual Analysis Framework to Study the Effects of Cloud Properties on Climate Patterns. In 2023 IEEE Visualization and Visual Analytics (VIS). IEEE, 226–230.
  [17] Jeffrey Heer, Matthew Conlen, Vishal Devireddy, Tu Nguyen, and Joshua Horowitz. 2023. Living Papers: A language toolkit for augmented scholarly communication. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–13.
  [18] Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, and Steven M Drucker. 2019. Gamut: A design probe to understand how data scientists understand machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
  [19] Kaggle. 2024. Marketing Analytics Dataset for Predicting Accepted Campaigns. https://www.kaggle.com/code/karltonkxb/marketing-analysis-classification-and-clustering/input
  [20] Kaggle. 2024. Spotify Revenue, Expenses and Its Premium. https://www.kaggle.com/code/rohithreddy0920/spotify-revenue-expenses-and-its-premium/input
  [21] Hyeok Kim, Ryan Rossi, Fan Du, Eunyee Koh, Shunan Guo, Jessica Hullman, and Jane Hoffswell. 2022. Cicero: A declarative grammar for responsive visualization. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–15.
  [22] Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. 2012. Proton: multitouch gestures as regular expressions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2885–2894.
  [23] Hyung-Kwon Ko, Hyeon Jeon, Gwanmo Park, Dae Hyun Kim, Nam Wook Kim, Juho Kim, and Jinwook Seo. 2024. Natural language dataset generation framework for visualizations powered by large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–22.
  [24] Sonia Laguna, Julian N Heidenreich, Jiugeng Sun, Nilüfer Cetin, Ibrahim Al-Hazwani, Udo Schlegel, Furui Cheng, and Mennatallah El-Assady. 2023. ExpLIMEable: A Visual Analytics Approach for Exploring LIME. In 2023 Workshop on Visual Analytics in Healthcare (VAHC). IEEE, 27–33.
  [25] Jungeun Lee, Sungnam Kim, Minki Cheon, Hyojin Ju, JaeEun Lee, and Inseok Hwang. 2022. SleepGuru: Personalized Sleep Planning System for Real-life Actionability and Negotiability. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–16.
  [26] Wei Luo, Michael Steptoe, Zheng Chang, Robert Link, Leon Clarke, and Ross Maciejewski. 2017. Impact of spatial scales on the intercomparison of climate scenarios. IEEE Computer Graphics and Applications 37, 5 (2017), 40–49.
  [27–28] Yuyu Luo, Nan Tang, Guoliang Li, Chengliang Chai, Wenbo Li, and Xuedi Qin. 2021. Synthesizing natural language to visualization (NL2VIS) benchmarks from NL2SQL benchmarks. In Proceedings of the 2021 International Conference on Management of Data. 1235–1247.
  [29–30] Yuyu Luo, Nan Tang, Guoliang Li, Jiawei Tang, Chengliang Chai, and Xuedi Qin. 2021. Natural language to visualization by neural machine translation. IEEE Transactions on Visualization and Computer Graphics 28, 1 (2021), 217–226.
  [31] Sagar Maru. 2024. Bank Customer Dataset for Predicting Customer Churn. https://www.kaggle.com/datasets/marusagar/bank-customer-attrition-insights
  [32] Microsoft. 2025. Chartifact. https://microsoft.github.io/chartifact/
  [33] Dominik Moritz, Chenglong Wang, Greg L Nelson, Halden Lin, Adam M Smith, Bill Howe, and Jeffrey Heer. 2018. Formalizing visualization design knowledge as constraints: Actionable and extensible models in Draco. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 438–448.
  [34] Arpit Narechania, Arjun Srinivasan, and John Stasko. 2020. NL4DV: A toolkit for generating analytic specifications for data visualization from natural language queries. IEEE Transactions on Visualization and Computer Graphics 27, 2 (2020), 369–379.
  [35] OpenAI. 2024. Data Analyst GPT. https://chatgpt.com/g/g-HMNcP6w7d-data-analyst?model=gpt-4o. Custom GPT available via ChatGPT.
  [36] Xiaoying Pu and Matthew Kay. 2023. How Data Analysts Use a Visualization Grammar in Practice. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 840, 22 pages. doi:10.1145/3544548.3580837
  [37] Salesforce, Inc. [n. d.]. Salesforce Einstein Analytics. https://www.salesforce.com/products/einstein/overview
  [38] Arvind Satyanarayan and Jeffrey Heer. 2014. Lyra: An interactive visualization design environment. In Computer Graphics Forum, Vol. 33. Wiley Online Library, 351–360.
  [39–40] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2016. Vega-Lite: A grammar of interactive graphics. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2016), 341–350.
  [41] Arvind Satyanarayan, Kanit Wongsuphasawat, and Jeffrey Heer. 2014. Declarative interaction design for data visualization. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. 669–678.
  [42–43] Hans-Jörg Schulz, Thomas Nocke, Magnus Heitzler, and Heidrun Schumann. 2013. A design space of visualization tasks. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 2366–2375.
  [44] Vidya Setlur, Sarah E Battersby, Melanie Tory, Rich Gossweiler, and Angel X Chang. 2016. Eviza: A natural language interface for visual analysis. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. 365–377.
  [45] Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao, and Di Jiang. 2022. VoiceQuerySystem: A voice-driven database querying system using natural language questions. In Proceedings of the 2022 International Conference on Management of Data. 2385–2388.
  [46] Yuanfeng Song, Xuefang Zhao, Raymond Chi-Wing Wong, and Di Jiang. 2022. RGVisNet: A hybrid retrieval-generation neural framework towards automatic data visualization generation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1646–1655.
  [47] Arjun Srinivasan, Bongshin Lee, Nathalie Henry Riche, Steven M Drucker, and Ken Hinckley. 2020. InChorus: Designing consistent multimodal interactions for data visualization on tablet devices. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
  [48] Arjun Srinivasan and John T. Stasko. 2017. Natural Language Interfaces for Data Analysis with Visualization: Considering What Has and Could Be Asked. In EuroVis 2017 - Short Papers, Barbora Kozlikova, Tobias Schreck, and Thomas Wischgoll (Eds.). The Eurographics Association. doi:10.2312/eurovisshort.20171133
  [49] Tableau Software, a Salesforce Company. 2025. Tableau. https://www.tableau.com
  [50] Zain Tariq, Miro Mannino, Mai Le Xuan Anh, Whitney Bagge, Azza Abouzied, and Dennis Shasha. 2021. Planning epidemic interventions with EpiPolicy. In The 34th Annual ACM Symposium on User Interface Software and Technology. 894–909.
  [51] Yuan Tian, Weiwei Cui, Dazhen Deng, Xinjing Yi, Yurun Yang, Haidong Zhang, and Yingcai Wu. 2024. ChartGPT: Leveraging LLMs to generate charts from abstract natural language. IEEE Transactions on Visualization and Computer Graphics 31, 3 (2024), 1731–1745.
  [52] Yuan Tian, Jonathan K Kummerfeld, Toby Jia-Jun Li, and Tianyi Zhang. 2024. SQLucid: Grounding natural language database queries with interactive explanations. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–20.
  [53] Yuan Tian, Daniel Lee, Fei Wu, Tung Mai, Kun Qian, Siddhartha Sahai, Tianyi Zhang, and Yunyao Li. 2025. Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation. In Proceedings of the 30th International Conference on Intelligent User Interfaces. 1398–1425.
  [54–55] Priyan Vaithilingam, Elena L Glassman, Jeevana Priya Inala, and Chenglong Wang. 2024. DynaVis: Dynamically synthesized UI widgets for visualization editing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–17.
  [56] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viégas, and Jimbo Wilson. 2019. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 56–65.
  [57] Hadley Wickham. 2011. ggplot2. Wiley Interdisciplinary Reviews: Computational Statistics 3, 2 (2011), 180–185.
  [58] Kanit Wongsuphasawat, Zening Qu, Dominik Moritz, Riley Chang, Felix Ouk, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2017. Voyager 2: Augmenting visual analysis with partial view specifications. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2648–2659.
  [59] Yingcai Wu, Shixia Liu, Kai Yan, Mengchen Liu, and Fangzhao Wu. 2014. OpinionFlow: Visual analysis of opinion diffusion on social media. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1763–1772.
  [60] Yogesh Agrawal. 2024. Media Spends Dataset for Predicting Sales. https://www.kaggle.com/datasets/yugagrawal95/sample-media-spends-data
  [61] Bowen Yu and Cláudio T Silva. 2019. FlowSense: A natural language interface for visual data exploration within a dataflow system. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 1–11.
  [62] Yuheng Zhao, Yixing Zhang, Yu Zhang, Xinyi Zhao, Junjie Wang, Zekai Shao, Cagatay Turkay, and Siming Chen. 2024. Leva: Using large language models to enhance visual analytics. IEEE Transactions on Visualization and Computer Graphics 31, 3 (2024), 1830–1847.
    Yuheng Zhao, Yixing Zhang, Yu Zhang, Xinyi Zhao, Junjie Wang, Zekai Shao, Cagatay Turkay, and Siming Chen. 2024. Leva: Using large language models to enhance visual analytics.IEEE transactions on visualization and computer graphics31, 3 (2024), 1830–1847. Figure 11: Six WIA scenarios spanning different analysis types used in our formative exploration of e...