
arxiv: 2505.11604 · v5 · submitted 2025-05-16 · 💻 cs.CL

Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation

Pith reviewed 2026-05-22 14:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords: slide editing, structured data manipulation, language interfaces, multimodal LLMs, text-centric tasks, presentation software, efficiency benchmarks, TSBench

The pith

Language-driven agent edits slides by manipulating structured data models instead of visuals

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Talk-to-Your-Slides, a system that edits presentation slides using language to manipulate their underlying object model rather than relying on images or GUI actions. This targets text-centric and formatting tasks where multimodal LLM agents suffer from high latency and cost due to visual processing. A hierarchical architecture translates high-level instructions into precise low-level codes on the slide data structure. Experiments show the method runs 34 percent faster, achieves 34 percent better instruction fidelity, and costs 87 percent less than GUI-based baselines for these tasks. The work also releases TSBench, a human-verified set of 379 instructions with a hard subset for complex queries.

Core claim

Talk-to-Your-Slides operates via language-driven structured data manipulation on the slide object model, using a hierarchical architecture to bridge high-level user instructions with low-level execution codes, thereby enabling precise content changes and style preservation without visual perception or OCR.
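
As a rough illustration of what editing the object model instead of pixels can look like, the sketch below replaces text inside a text box while leaving run-level formatting untouched. The `Run` and `TextBox` classes and the `replace_text` helper are hypothetical stand-ins chosen for this example; they are not the paper's data structures or the PowerPoint API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """A span of text with its own formatting (hypothetical model)."""
    text: str
    bold: bool = False
    color: str = "000000"  # hex RGB string, an assumed representation


@dataclass
class TextBox:
    """A text box holding one or more differently formatted runs."""
    name: str
    runs: List[Run] = field(default_factory=list)


def replace_text(box: TextBox, old: str, new: str) -> int:
    """Replace `old` with `new` inside each run without touching formatting.

    Returns the number of runs that changed. Matches that straddle a run
    boundary are not handled in this sketch.
    """
    changed = 0
    for run in box.runs:
        if old in run.text:
            run.text = run.text.replace(old, new)
            changed += 1
    return changed


box = TextBox(
    "Title 1",
    [Run("Quarterly ", bold=True), Run("report 2024", color="FF0000")],
)
n_changed = replace_text(box, "2024", "2025")
```

Because only the `text` field is written, the bold flag and color of every run survive the edit, which is the kind of style preservation the core claim describes.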

What carries the argument

Hierarchical architecture that translates high-level instructions into execution codes by operating directly on the slide's underlying object model instead of image pixels.
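
The two layers can be pictured with a small sketch: a high-level plan (written by hand here, where the described system would presumably have a language model produce it) is dispatched to low-level operations on a slide model. The `Deck` layout, the task fields, and the operation names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical slide model: slide number -> {shape name -> properties}.
Deck = Dict[int, Dict[str, Dict[str, object]]]


def set_text(shape: Dict[str, object], value: str) -> None:
    shape["text"] = value


def set_font_color(shape: Dict[str, object], value: str) -> None:
    shape["font_color"] = value


# Low-level layer: action names mapped to operations on a single shape.
OPERATIONS: Dict[str, Callable[[Dict[str, object], str], None]] = {
    "set_text": set_text,
    "set_font_color": set_font_color,
}


def execute_plan(deck: Deck, plan: List[Dict[str, object]]) -> List[str]:
    """Apply each planned task; return messages for tasks that could not run."""
    errors: List[str] = []
    for task in plan:
        slide = deck.get(task["slide"])
        shape = slide.get(task["target"]) if slide else None
        operation = OPERATIONS.get(task["action"])
        if shape is None or operation is None:
            errors.append(f"cannot apply {task}")
            continue
        operation(shape, task["value"])
    return errors


deck: Deck = {1: {"Title": {"text": "안녕하세요", "font_color": "333333"}}}

# High-level layer: a hand-written plan standing in for model output.
plan = [
    {"slide": 1, "target": "Title", "action": "set_text", "value": "Hello"},
    {"slide": 1, "target": "Title", "action": "set_font_color", "value": "FFFFFF"},
    {"slide": 9, "target": "Title", "action": "set_text", "value": "unused"},
]
errors = execute_plan(deck, plan)
```

The third task targets a slide that does not exist, so it is reported in `errors` instead of raising, which leaves room for a retry or refinement step.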

Load-bearing premise

The system assumes reliable access to an accurate underlying object model of the slides that fully captures both content and style details.

What would settle it

A scenario in which the provided object model is incomplete or inaccurate, causing edits to deviate from the intended style or content while visual methods succeed.

Figures

Figures reproduced from arXiv: 2505.11604 by Hojun Cho, Jaegul Choo, Jaehyeok Jang, Jooyeol Yun, Kyudan Jung, Soyoung Yang.

Figure 1. Comparison of slide editing methods on translating 50-page lecture slides from Korean to English. (a) Manual translation requires day(s) and consumes graduate-student labor. (b) A GUI-based agent incurs high cost. (c) Our approach runs in a low cost and in a relatively short time.
Figure 2. Overview of the TALK-TO-YOUR-SLIDES framework. The system consists of four modules: instruction understanding, document understanding, document editing, and code generator.
Figure 3. Results of TALK-TO-YOUR-SLIDES across four instruction categories. Modifications are highlighted with yellow boxes. TextEditing: Korean text has been translated into English according to the instruction. VisualFormatting: the original background and text colors were too similar, reducing readability; the revised version uses white text for improved contrast and clarity. LayoutAdjustment: the widths of the…
Figure 4. A real-world example of the self-reflection mechanism handling a data type error during execution.
Figure 5. Example output generated by the instruction…
Figure 6. Example output of document understanding. The yellow sections contain information about the parsed object’s name, type, location, size, and other details. The runs highlighted in green demonstrate that different text formatting styles can exist within a single text box.
Figure 7. Examples of challenging instructions in…
Figure 8. Examples of instructions across four cate…
Figure 9. Example from the TSBench dataset. Some data points consist of a single slide, while others contain…
Figure 10. Example of the original lecture slide before processing. This is a part of the batch editing process…
Figure 11. Example of the original lecture slide after processing. This is a part of the batch editing process (editing…
Figure 12. An example of a layout error in the slide…
Figure 13. A prompt used in instruction understanding.
Figure 14. A prompt used in Document Editing.
Figure 16. A prompt used in baseline system
Figure 17. The original slide from which the example…
Figure 18. A prompt used in LLM judge
Figure 19. A prompt used in LLM judge which evaluate text, image, layout, color.
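
The self-reflection example in Figure 4 concerns a color assignment that fails when an (R, G, B) tuple is passed, because the PowerPoint COM interface expects a single integer. A minimal sketch of that conversion follows. `rgb_to_ole` is an illustrative helper name, not something taken from the paper, and the byte order follows the convention of the VBA `RGB` function (red in the lowest byte).

```python
def rgb_to_ole(red: int, green: int, blue: int) -> int:
    """Pack three 0-255 channels into one integer, red in the lowest byte."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
    return red | (green << 8) | (blue << 16)


red_value = rgb_to_ole(255, 0, 0)
blue_value = rgb_to_ole(0, 0, 255)
white_value = rgb_to_ole(255, 255, 255)
```

Under this packing pure red is the integer 255, which matches the corrected assignment in the figure's second attempt.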
Original abstract

Editing presentation slides is a frequent yet tedious task, ranging from creative layout design to repetitive text maintenance. While recent GUI-based agents powered by Multimodal LLMs (MLLMs) excel at tasks requiring visual perception, such as spatial layout adjustments, they often incur high computational costs and latency when handling structured, text-centric, or batch processing tasks. In this paper, we propose Talk-to-Your-Slides, a high-efficiency slide editing agent that operates via language-driven structured data manipulation rather than relying on the image modality. By leveraging the underlying object model instead of screen pixels, our approach ensures precise content modification while preserving style fidelity, addressing the limitations of OCR-based visual agents. Our system features a hierarchical architecture that effectively bridges high-level user instructions with low-level execution codes. Experiments demonstrate that for text-centric and formatting tasks, our method enables 34% faster processing, achieves 34% better instruction fidelity, and operates at an 87% lower cost compared to GUI-based baselines. Furthermore, we introduce TSBench, a human-verified benchmark dataset comprising 379 instructions, including a Hard subset designed to evaluate robustness against complex and visually dependent queries. Our code and benchmark are available at https://github.com/KyuDan1/Talk-to-Your-Slides.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Talk-to-Your-Slides, a hierarchical agent for editing presentation slides via language-driven structured data manipulation on the underlying object model rather than visual or GUI methods. It claims 34% faster processing, 34% better instruction fidelity, and 87% lower cost than GUI baselines for text-centric and formatting tasks, while introducing TSBench, a human-verified benchmark of 379 instructions with a Hard subset for complex queries.

Significance. If the results hold under realistic conditions, the work offers a promising direction for low-cost, high-precision automation of structured document tasks by avoiding the overhead of multimodal visual agents. The open-sourced code and benchmark provide a concrete resource for follow-up research on language agents for office documents.

major comments (3)
  1. [§4] The reported gains (34% faster, 34% better fidelity, 87% lower cost) are stated without details on baseline implementations, hardware, statistical tests, or variance across runs, making it impossible to determine whether the improvements are attributable to the structured manipulation or to unstated experimental choices.
  2. [§3] The central premise that an accurate object model fully captures content and style (allowing language-driven manipulation without OCR or pixels) is load-bearing for all efficiency claims, yet the manuscript provides no experiments measuring degradation under realistic parsing errors, missing elements, or incomplete hierarchies from PPTX files, especially on the Hard subset of TSBench.
  3. [§4.1] The construction, sampling, and human-verification protocol for the 379 instructions (and the definition of the Hard subset) are not described in sufficient detail to support claims that TSBench reliably evaluates robustness against complex or visually dependent queries.
minor comments (2)
  1. [Abstract] The abstract and §1 could more clearly delimit the method's scope (text-centric tasks only) versus its limitations on spatial layout changes.
  2. [§3] Figure 1 or the architecture diagram in §3 would benefit from explicit pseudocode or data-flow arrows showing how high-level instructions map to low-level codes.
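
Major comment 1 asks for variance and statistical tests. One way such a report could be produced is a paired sign-flip permutation test over per-instruction differences, sketched below. The latency numbers are made up purely for illustration and are not measurements from the paper; the function name and defaults are likewise assumptions.

```python
import random
import statistics


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired samples.

    Returns (mean difference a - b, estimated p-value).
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = statistics.fmean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally
        # likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Illustrative per-instruction latencies in seconds; NOT data from the paper.
baseline = [12.1, 10.4, 15.3, 11.8, 13.0, 14.2, 12.7, 11.1]
proposed = [8.0, 7.9, 9.6, 8.4, 8.8, 9.1, 8.2, 7.5]

mean_diff, p_value = paired_permutation_test(baseline, proposed)
spread = statistics.stdev([x - y for x, y in zip(baseline, proposed)])
```

Reporting `mean_diff`, `spread`, and `p_value` together per task category would address the variance and significance parts of the comment; hardware and baseline configuration still need to be documented separately.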

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [§4] The reported gains (34% faster, 34% better fidelity, 87% lower cost) are stated without details on baseline implementations, hardware, statistical tests, or variance across runs, making it impossible to determine whether the improvements are attributable to the structured manipulation or to unstated experimental choices.

    Authors: We agree with this observation. The current manuscript lacks sufficient experimental details to fully support the claims. In the revised version, we will add a new subsection in §4 detailing the baseline implementations (including how GUI-based agents were set up), the hardware configuration, the number of experimental runs, standard deviations, and p-values from statistical tests to demonstrate the significance of the improvements. revision: yes

  2. Referee: [§3] The central premise that an accurate object model fully captures content and style (allowing language-driven manipulation without OCR or pixels) is load-bearing for all efficiency claims, yet the manuscript provides no experiments measuring degradation under realistic parsing errors, missing elements, or incomplete hierarchies from PPTX files, especially on the Hard subset of TSBench.

    Authors: This is a fair point, as the robustness to parsing inaccuracies is important for real-world applicability. Although the paper focuses on the benefits assuming a correct object model, we will include additional experiments in the revision that introduce controlled parsing errors and evaluate performance degradation, with particular emphasis on the Hard subset of TSBench. revision: yes

  3. Referee: [§4.1] The construction, sampling, and human-verification protocol for the 379 instructions (and the definition of the Hard subset) are not described in sufficient detail to support claims that TSBench reliably evaluates robustness against complex or visually dependent queries.

    Authors: We acknowledge the need for more transparency in the benchmark construction. We will revise §4.1 to include detailed descriptions of how the 379 instructions were collected and sampled, the specific criteria used to define the Hard subset for complex and visually dependent queries, and the full human-verification protocol including the number of annotators and agreement statistics. revision: yes
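
Response 2 promises experiments with controlled parsing errors. One possible perturbation, sketched here under assumed names, is to drop attributes from a parsed object model at a chosen rate so that downstream success can be measured as a function of that rate. The dictionary layout and the `corrupt_model` function are hypothetical and not taken from the paper.

```python
import copy
import random


def corrupt_model(slides, drop_rate, seed=0, protected=("name",)):
    """Return a deep copy of `slides` with shape attributes randomly removed.

    Each non-protected attribute is deleted with probability `drop_rate`.
    Also returns the number of attributes that were dropped.
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = copy.deepcopy(slides)
    dropped = 0
    for slide in noisy:
        for shape in slide["shapes"]:
            for key in list(shape):  # copy keys so deletion is safe
                if key not in protected and rng.random() < drop_rate:
                    del shape[key]
                    dropped += 1
    return noisy, dropped


slides = [
    {
        "shapes": [
            {"name": "Title", "text": "Intro", "font_size": 32, "color": "FFFFFF"},
            {"name": "Body", "text": "Details", "font_size": 18, "color": "000000"},
        ]
    }
]
clean, n_clean = corrupt_model(slides, drop_rate=0.0)
broken, n_broken = corrupt_model(slides, drop_rate=1.0)
```

Sweeping `drop_rate` between these two extremes with a fixed seed would give a reproducible degradation curve, and the original `slides` list is never modified.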

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with direct baseline comparisons

full rationale

The paper describes a hierarchical agent architecture for slide editing via structured data manipulation and reports empirical results (34% faster, 34% better fidelity, 87% lower cost) from direct comparisons to GUI-based MLLM baselines on TSBench. No equations, fitted parameters, predictions, or derivations are present in the provided text. Central claims rest on experimental measurements rather than any self-referential reduction or self-citation chain. The assumption of an accurate object model is an engineering premise, not a circular derivation. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the availability of an editable structured object model inside presentation software and on the hierarchical pipeline successfully translating natural language into precise edits without style loss.

axioms (1)
  • [domain assumption] Presentation software exposes an accurate, manipulable object model that captures both content and style independently of pixel rendering.
    Invoked when the paper states the system operates via language-driven structured data manipulation rather than the image modality and preserves style fidelity.
invented entities (1)
  • Hierarchical architecture bridging high-level user instructions with low-level execution codes [no independent evidence]
    purpose: To translate natural language commands into precise structured edits
    Introduced as the core system feature that connects user intent to execution.

pith-pipeline@v0.9.0 · 5779 in / 1301 out tokens · 59603 ms · 2026-05-22T14:20:52.702039+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PresentAgent-2: Towards Generalist Multimodal Presentation Agents

    cs.CV 2026-05 unverdicted novelty 6.0

    PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.

  2. AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

    cs.CV 2026-04 unverdicted novelty 6.0

    AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

  3. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
