Narrative-Driven Paper-to-Slide Generation via ArcDeck
Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3
The pith
ArcDeck reconstructs a paper's logical flow into slides by first building a discourse tree and a global commitment document, then refining the resulting outline through coordinated agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ArcDeck formulates paper-to-slide generation as structured narrative reconstruction. It parses the input paper into a discourse tree and a global commitment document that preserve the paper's high-level intent. These priors then guide an iterative multi-agent process in which specialized agents critique and revise the presentation outline before the final visual layouts are rendered. Evaluation on ArcBench shows that this explicit modeling and coordination significantly improves narrative flow and logical coherence over direct summarization methods.
What carries the argument
The discourse tree paired with the global commitment document, which act as structural priors that coordinate role-specific agents during iterative outline refinement.
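The coordination pattern this describes can be made concrete. The sketch below is an illustrative reading, not the authors' implementation: the real agents are LLM calls, and the data layout (slides holding `edus` lists keyed to the discourse tree's leaf order) is an assumption made for the example.

```python
# Hedged sketch of prior-guided critique-and-revise outline refinement.
# "commenter" flags slides whose EDUs violate the discourse tree's leaf
# order; "reviser" repairs them; the loop stops when nothing is flagged.

def commenter(outline, leaf_order):
    """Flag indices of slides whose EDUs appear out of discourse order."""
    return [i for i, slide in enumerate(outline)
            if [leaf_order[e] for e in slide["edus"]]
            != sorted(leaf_order[e] for e in slide["edus"])]

def reviser(outline, flagged, leaf_order):
    """Reorder EDUs within each flagged slide to follow the discourse tree."""
    for i in flagged:
        outline[i]["edus"].sort(key=leaf_order.__getitem__)
    return outline

def refine_outline(outline, discourse_leaves, max_rounds=3):
    leaf_order = {edu: i for i, edu in enumerate(discourse_leaves)}
    for _ in range(max_rounds):
        flagged = commenter(outline, leaf_order)
        if not flagged:                # judge agent says "pass"
            break
        outline = reviser(outline, flagged, leaf_order)
    return outline

outline = [{"title": "Method", "edus": ["p3", "p2"]},
           {"title": "Results", "edus": ["p4", "p5"]}]
outline = refine_outline(outline, ["p1", "p2", "p3", "p4", "p5"])
print(outline[0]["edus"])  # → ['p2', 'p3']
```

The point of the sketch is only the loop structure: structural priors give the critic something objective to check, so revision can terminate on a verifiable condition rather than on agent self-assessment.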
If this is right
- Slides produced this way retain the sequence of arguments and evidence from the source paper rather than presenting disconnected points.
- The multi-agent refinement step catches and corrects breaks in logical progression that single-pass summarizers often miss.
- The same parsing and coordination steps can be reused for other long-document-to-structured-output tasks that require preserving narrative intent.
- ArcBench supplies a reusable test set for measuring how well any generation method maintains discourse structure across academic domains.
Where Pith is reading between the lines
- If the discourse parsing step generalizes, the same pipeline could convert grant proposals or technical reports into executive briefings without manual restructuring.
- Combining the outline refinement agents with existing layout generators might allow end-to-end creation of complete presentation decks from a manuscript draft.
- The benchmark could serve as a starting point for studying how narrative quality changes when the same paper is presented to audiences with different levels of domain expertise.
Load-bearing premise
That parsing a discourse tree and global commitment document from raw paper text reliably captures the author's high-level intent without loss or bias.
What would settle it
A side-by-side human evaluation on ArcBench papers in which raters score the narrative coherence and logical flow of slides made by the framework against direct text-to-slide baselines; a failure to detect a statistically significant improvement would refute the core claim.
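Such an evaluation reduces to a paired comparison over papers. A minimal, stdlib-only paired permutation test is sketched below; the per-paper coherence ratings are invented for illustration and are not taken from the paper.

```python
import random

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference a[i] - b[i],
    estimated by randomly flipping the sign of each difference."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) / len(diffs) >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical 1-5 coherence ratings per paper, framework vs. baseline.
framework = [4, 5, 4, 4, 3, 5, 4, 4]
baseline  = [3, 4, 4, 3, 3, 4, 3, 4]
p = paired_permutation_test(framework, baseline)
print(f"p = {p:.3f}")  # a small p means the gap is unlikely under chance
```

A permutation test is chosen here because Likert-style ratings rarely satisfy the normality assumptions of a paired t-test; with real data one would also report effect size, not only the p-value.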
Original abstract
We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ArcDeck, a multi-agent framework for paper-to-slide generation that treats the task as narrative reconstruction. It parses the input paper into a discourse tree and global commitment document to preserve high-level intent and logical flow, then uses specialized agents for iterative critique and refinement of the presentation outline before final rendering. The authors also contribute ArcBench, a new benchmark of paper-slide pairs, and claim that explicit discourse modeling plus role-specific agent coordination yields significantly better narrative flow and logical coherence than direct summarization baselines.
Significance. If the central claims hold after addressing the validation gaps, the work would advance automated academic presentation generation by demonstrating the value of explicit discourse structures and multi-agent coordination for coherence. The introduction of ArcBench provides a reusable resource for future benchmarking in this domain, which is a clear positive contribution.
major comments (2)
- [Method / Approach] The central claim that discourse modeling and agent coordination improve narrative flow (abstract and method description) rests on the assumption that the initial parsing step reliably extracts a discourse tree and global commitment document without loss, bias, or misrepresentation of cross-section arguments. The manuscript supplies no parser implementation details, accuracy metrics, error analysis, or ablation isolating this step, which is load-bearing because downstream refinement and ArcBench results cannot be attributed to the modeling if the input representation is flawed.
- [Experiments / Evaluation] The experimental results section asserts significant improvements in narrative flow and coherence but, consistent with the abstract, provides insufficient detail on baselines, quantitative metrics (e.g., specific scores or statistical tests), ablation studies (with vs. without discourse tree), or inter-annotator agreement on ArcBench. This weakens the ability to evaluate whether the multi-agent refinement is the causal factor.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result or metric to support the 'significantly improves' claim, rather than leaving all evidence to the full text.
- [Method] Notation for the discourse tree and commitment document could be formalized earlier (e.g., with a small example or diagram) to improve readability for readers unfamiliar with discourse parsing.
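Some of the statistics the report requests are inexpensive to compute. As one example, inter-annotator agreement between two raters over categorical ArcBench labels can be reported as Cohen's kappa; the sketch below is stdlib-only, and the labels ("c" coherent, "i" incoherent) are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(c1[l] * c2[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments for eight generated slide decks.
r1 = ["c", "c", "i", "c", "c", "i", "c", "c"]
r2 = ["c", "c", "i", "c", "i", "i", "c", "c"]
print(round(cohens_kappa(r1, r2), 3))  # → 0.714
```

Reporting kappa alongside raw agreement would directly address the reviewer's point that ArcBench's human labels need a reliability estimate.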
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where appropriate.
point-by-point responses
-
Referee: [Method / Approach] The central claim that discourse modeling and agent coordination improve narrative flow (abstract and method description) rests on the assumption that the initial parsing step reliably extracts a discourse tree and global commitment document without loss, bias, or misrepresentation of cross-section arguments. The manuscript supplies no parser implementation details, accuracy metrics, error analysis, or ablation isolating this step, which is load-bearing because downstream refinement and ArcBench results cannot be attributed to the modeling if the input representation is flawed.
Authors: We agree that the reliability of the discourse parsing step is foundational to our claims and that the original manuscript provided insufficient implementation details. In the revised version, we have added a new subsection in the Method section that specifies the parser implementation (including the underlying discourse parsing model and any fine-tuning procedures), reports accuracy metrics on a held-out set of academic papers, includes an error analysis of common parsing failures (such as cross-section argument misidentification), and presents an ablation study that isolates the contribution of the discourse tree and global commitment document. These changes allow readers to evaluate the input representation quality and attribute downstream results appropriately.
Revision: yes
-
Referee: [Experiments / Evaluation] The experimental results section asserts significant improvements in narrative flow and coherence but, consistent with the abstract, provides insufficient detail on baselines, quantitative metrics (e.g., specific scores or statistical tests), ablation studies (with vs. without discourse tree), or inter-annotator agreement on ArcBench. This weakens the ability to evaluate whether the multi-agent refinement is the causal factor.
Authors: We acknowledge that the experimental section required more rigorous documentation to support the claims. The revised manuscript now includes expanded descriptions of all baselines with their specific configurations, full reporting of quantitative metric scores accompanied by statistical significance tests, comprehensive ablation studies that isolate the discourse tree and multi-agent refinement components, and inter-annotator agreement statistics for the ArcBench annotations. These additions provide clearer evidence regarding the causal role of the proposed components.
Revision: yes
Circularity Check
No circularity: framework and benchmark are independently introduced without self-referential reductions
full rationale
The paper describes ArcDeck as a novel multi-agent system that parses input papers into a discourse tree and global commitment document to guide slide generation, then evaluates on a newly introduced ArcBench benchmark of paper-slide pairs. No equations, fitted parameters, or derivations are present in the abstract or described process. The central claim rests on experimental results comparing to existing methods, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The parsing step is presented as an input processing stage rather than a self-defined or fitted output, and the benchmark is explicitly new rather than a renamed or internally validated pattern. This satisfies the criteria for a self-contained contribution with no reduction of claims to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Academic papers contain extractable discourse structures that accurately represent logical flow and high-level intent.
- Domain assumption: Iterative multi-agent critique and revision produces objectively superior narrative coherence compared to direct methods.
Reference graph
Works this paper leans on
- [1] Garr Reynolds. Presentation Zen: Simple Ideas on Presentation Design and Delivery. New Riders, 2019.
- [2] Robert A. Bartsch and Kristi M. Cobern. Effectiveness of PowerPoint presentations in lectures. Computers & Education, 41(1):77–86, 2003.
- [3] Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy X. R. Wang. D2S: Document-to-slide generation via query-based text summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418, 2021.
- [4] Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, and Apoorv Saxena. Enhancing presentation slide generation by LLMs with a multi-staged end-to-end approach. In Proceedings of the 17th International Natural Language Generation Conference, pages 222–229, 2024.
- [5] Isabel Cachola, Silviu Cucerzan, Allen Herring, Vuksan Mijovic, Erik Oveson, and Sujay Kumar Jauhar. Knowledge-centric templatic views of documents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15460–15476, 2024.
- [6] Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2Poster: Towards multimodal poster automation from scientific papers. arXiv preprint arXiv:2505.21497, 2025.
- [7] Zhilin Zhang, Xiang Zhang, Jiaqi Wei, Yiwei Xu, and Chenyu You. PosterGen: Aesthetic-aware multi-modal paper-to-poster generation via multi-agent LLMs. arXiv preprint arXiv:2508.17188, 2025.
- [8] Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, et al. AutoPresent: Designing structured visuals from scratch. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2902–2911, 2025.
- [9] Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. PPTAgent: Generating and evaluating presentations beyond text-to-slides. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025.
- [10] Xiaojie Xu, Xinli Xu, Sirui Chen, Haoyu Chen, Fan Zhang, and Ying-Cong Chen. PreGenie: An agentic framework for high-quality visual presentation generation. arXiv preprint arXiv:2505.21660, 2025.
- [11] Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, and Chenyu You. SlideGen: Collaborative multimodal agents for scientific slide generation. arXiv preprint arXiv:2512.04529, 2025.
- [12] William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: A theory of text organization. Technical report, University of Southern California, Information Sciences Institute, Los Angeles, 1987.
- [13] Yue Hu and Xiaojun Wan. PPSGen: Learning to generate presentation slides for academic papers. In IJCAI, pages 2099–2105, 2013.
- [14] Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale Song. DOC2PPT: Automatic presentation slides generation from scientific documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 634–642, 2022.
- [15] Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, and Nan Duan. PPTC benchmark: Evaluating large language models for PowerPoint task completion. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8682–8701, 2024.
- [16] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [17] Daniel Marcu. The Theory and Practice of Discourse Parsing and Summarization. MIT Press, 2000.
- [18] Zhengyuan Liu and Nancy Chen. Exploiting discourse-level segmentation for extractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 116–121, 2019.
- [19] Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C. H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, 2020.
- [20] Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218, 2015.
- [21] Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. Top-down RST parsing utilizing granularity levels in documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8099–8106, 2020.
- [22] Thanh-Tung Nguyen, Xuan-Phi Nguyen, Shafiq Joty, and Xiaoli Li. RST parsing from scratch. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1613–1625, 2021.
- [23] Vanessa Wei Feng and Graeme Hirst. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 60–68, 2012.
- [24] Helmut Prendinger et al. A novel discourse parser based on support vector machine classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 665–673, 2009.
- [25] Jiwei Li, Rumeng Li, and Eduard Hovy. Recursive deep models for discourse parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2061–2069, 2014.
- [26] Naoki Kobayashi, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. A simple and strong baseline for end-to-end neural RST-style discourse parsing. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6725–6737, 2022.
- [27] Nan Yu, Meishan Zhang, and Guohong Fu. Transition-based neural RST parsing with implicit syntax features. In Proceedings of the 27th International Conference on Computational Linguistics, pages 559–570, 2018.
- [28] Yangfeng Ji and Jacob Eisenstein. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24, 2014.
- [29] Chloé Braud, Barbara Plank, and Anders Søgaard. Multi-view and multi-task training of RST discourse parsers. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1903–1913, 2016.
- [30] Xinyu Hu and Xiaojun Wan. RST discourse parsing as text-to-text generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:3278–3289, 2023.
- [31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [32] Aru Maekawa, Tsutomu Hirao, Hidetaka Kamigaito, and Manabu Okumura. Can we obtain significant success in RST discourse parsing by using large language models? In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2803–2815, 2024.
- [33] Kate Thompson, Akshay Chaturvedi, Julie Hunter, and Nicholas Asher. Llamipa: An incremental discourse parser. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6418–6430, 2024.
- [34] Zhengyuan Liu, Ke Shi, and Nancy Chen. DMRST: A joint framework for document-level multilingual RST discourse segmentation and parsing. In Proceedings of the 2nd Workshop on Computational Approaches to Discourse, pages 154–164, 2021.
- [35] Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, and Dongyeop Kang. Align to structure: Aligning large language models with structural information. arXiv preprint arXiv:2504.03622, 2025.
- [36] Rilwan Adewoyin, Ritabrata Dutta, and Yulan He. RSTGen: Imbuing fine-grained interpretable control into long-form text generators. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1822–1835, 2022.
- [37] Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv preprint arXiv:2501.17887, 2025.
- [38] Vik Paruchuri. Marker: Convert PDF to markdown + JSON quickly with high accuracy, 2023.
- [39] OpenAI. Hello GPT-4o, 2024.
- [40] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [41] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... arXiv, 2025.
- [42] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Appendix excerpts
- Figure 16 (ArcBench topic distribution): papers span research topics including contrastive learning, vision-language models, self-supervised learning, multimodal large language models, representation learning, conformal prediction, benchmark development, natural language processing, graph neural networks, probabilistic diffusion models, optimal covariance matching, and time series analysis. Table 10 places ArcBench alongside prior paper-to-slide datasets: earlier academic sets such as DOC2PPT [14] and SciDuet [3] cover general scientific or NLP/ML papers.
- Figure 33 (Discourse Parser prompt): the parser must emit a JSON tree whose leaves are EDU ids exactly equal to paragraph names from the input, with unique, increasing group ids ("g1", "g2", ...), a single root group that transitively covers all EDUs, every EDU used exactly once as a leaf, every non-root group referenced exactly once as a child (so the tree is connected and acyclic), relations drawn only from an allowed inventory, and no duplicate edges.
- Global Commitment Builder prompt: given the full paper converted to Markdown and optional talk constraints (target audience, presentation length in minutes, desired slide count, page limit, style preferences), produce a single Markdown file, commitment.md, capturing global intent, constraints, narrative spine, and top evidence items grounded in the paper, for downstream slide agents to follow.
- Planning prompts: "section_planning" builds slides for one section of the paper, prioritizing local coherence while keeping the sequence stitchable into a full-deck narrative; "global_revision" revises the merged full-deck plan across all sections, prioritizing global narrative flow, coherence, pacing, and feedback incorporation. Hard rules include: use only the provided paragraph IDs and never invent new ones; every paragraph ID appears exactly once across all output slides; keep slide density balanced for the target presentation length; prefer at most four paragraphs per slide unless coherence justifies more; give each slide a descriptive title reflecting its rhetorical role. RST relation guidance steers grouping: "elaboration" and "explanation" spans are usually grouped together, "joint" (parallel) points often share a slide, "purpose" spans are often introductory.
- Critique and judge prompts: a commenter module critiques the merged plan's structure and coherence against a good research-talk flow and the commitment's thesis, priorities, must-include beats, and constraints, grounded in paper metadata, the commitment, the current merged slide plan (JSON), the section inventory, and paragraph text snippets. A judge then decides "pass" or "revise" based on whether the plan forms a clear, logically progressing story suitable for an academic talk and matches the commitment's intent.
- Rendering inputs: JSON content of the paper outline (each section's title and a brief description) plus a list of images (image_information) with captions and size constraints.
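The structural rules the discourse-parser prompt imposes (root exists in groups, every EDU exactly once as a leaf, every non-root group referenced exactly once, relations from an allowed inventory) are mechanically checkable. The sketch below assumes a JSON layout of the form `{"root": gid, "groups": {gid: {"relation": r, "children": [ids]}}}`, which the prompt excerpts only partially specify.

```python
# Hedged validator for the discourse-tree constraints named in the parser
# prompt; the exact JSON schema is an assumption, not quoted from the paper.

def validate_tree(doc, edu_ids, relations):
    groups = doc["groups"]
    assert doc["root"] in groups, "root must exist in groups"
    seen_edus, seen_groups = [], set()

    def walk(gid):
        # A group reached twice means it is referenced more than once.
        assert gid not in seen_groups, "group referenced more than once"
        seen_groups.add(gid)
        node = groups[gid]
        assert node["relation"] in relations, "relation outside inventory"
        for child in node["children"]:
            if child in groups:
                walk(child)          # internal node
            else:
                seen_edus.append(child)  # leaf EDU

    walk(doc["root"])
    # Connected: every declared group was reached from the root.
    assert seen_groups == set(groups), "unreachable group"
    # Coverage: every EDU appears exactly once as a leaf, no inventions.
    assert sorted(seen_edus) == sorted(edu_ids), "EDU coverage violated"
    return True

doc = {"root": "g1",
       "groups": {"g1": {"relation": "joint", "children": ["p1", "g2"]},
                  "g2": {"relation": "elaboration", "children": ["p2", "p3"]}}}
print(validate_tree(doc, ["p1", "p2", "p3"], {"joint", "elaboration"}))  # → True
```

A check like this is what would let the referee's concern about parser reliability be separated into two parts: structural validity (decidable, as above) versus semantic fidelity to the author's intent (which still needs human evaluation).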