pith. machine review for the scientific record.

arxiv: 2604.22840 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.CL · cs.MM

Recognition: unknown

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 02:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.MM
keywords slide generation · LLM · reinforcement learning · aesthetic layout · verifiable rewards · GRPO · layout optimization · multimodal alignment

The pith

Reinforcement learning with verifiable rewards enables LLMs to produce aesthetically superior slide layouts using minimal training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLMs excel at text but often produce visually poor slides because generation is text-centric while slide quality is judged visually. AeSlides addresses this by defining verifiable metrics that score layout issues such as distorted aspect ratios, excessive whitespace, colliding elements, and visual imbalance, then using these as rewards in a GRPO reinforcement-learning setup to train the model directly. The approach needs only 5,000 training prompts yet yields large gains on automatic metrics and better human ratings than costly reflection methods or larger models, providing an efficient alternative to large-scale fine-tuning or agentic reflection for aligning generated slides with human visual preferences.

Core claim

The central discovery is that a GRPO-based reinforcement learning framework, guided by a suite of verifiable metrics for slide layout quality, can directly optimize LLM slide generation for aesthetic coherence. These metrics enable accurate, low-cost supervision of key issues including aspect ratio compliance, whitespace, element collisions, and visual imbalance. Training on GLM-4.7-Flash with 5K prompts improves aspect ratio compliance from 36% to 85%, reduces whitespace by 44%, element collisions by 43%, and visual imbalance by 28%, while increasing human quality scores from 3.31 to 3.56 and outperforming baselines and even Claude-Sonnet-4.5.
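The group-relative normalization at the heart of GRPO is simple to state: sample several slides per prompt, score each with the verifiable metrics, and normalize rewards within that group. A minimal sketch of the general technique (Shao et al., 2024) — not the authors' implementation, and with invented reward values:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group
    (same prompt, several sampled slides), as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four slides sampled for one prompt, scored by the verifiable metrics:
adv = grpo_advantages([0.2, 0.8, 0.5, 0.5])
# Above-mean rollouts get positive advantage; below-mean, negative.
```

Because advantages are computed within each prompt's own group of rollouts, no learned value model is needed; the rule-based metrics alone supply the reward signal.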

What carries the argument

Verifiable metrics quantifying layout aesthetics, employed as rewards within GRPO reinforcement learning to supervise slide generation models.
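To make the carrier concrete, here is a toy version of such verifiable layout checks. The element representation (axis-aligned boxes), thresholds, and weighting are all hypothetical; the paper's actual metric definitions are more elaborate:

```python
def layout_reward(slide_w, slide_h, boxes):
    """Toy composite of the four verifiable checks: aspect ratio,
    whitespace, collisions, imbalance. Boxes are (x, y, w, h) tuples;
    thresholds and weights are illustrative, not the paper's."""
    # 1. Aspect-ratio compliance: within ~5% of 16:9 (rubric: 1.69-1.87).
    ratio_ok = 1.0 if 1.69 <= slide_w / slide_h <= 1.87 else 0.0
    # 2. Whitespace: fraction of the canvas covered by no element
    #    (summed areas; overlap is ignored, so this is a crude estimate).
    covered = sum(w * h for (_, _, w, h) in boxes)
    whitespace = max(0.0, 1.0 - covered / (slide_w * slide_h))
    # 3. Collisions: number of overlapping box pairs.
    def overlaps(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
    collisions = sum(
        overlaps(boxes[i], boxes[j])
        for i in range(len(boxes)) for j in range(i + 1, len(boxes))
    )
    # 4. Imbalance: horizontal offset of the area-weighted centroid.
    total_area = covered or 1.0
    cx = sum((x + w / 2) * w * h for (x, y, w, h) in boxes) / total_area
    imbalance = abs(cx - slide_w / 2) / (slide_w / 2)
    return ratio_ok - whitespace - 0.5 * collisions - imbalance

# A single centered text block on a 1280x720 canvas:
r_good = layout_reward(1280, 720, [(240, 160, 800, 400)])
# Two colliding, off-center blocks score strictly lower:
r_bad = layout_reward(1280, 720, [(0, 0, 700, 400), (600, 300, 700, 400)])
```

The point of such rules is that they are cheap and deterministic, so they can be evaluated on every rollout during RL without a VLM judge in the loop.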

Load-bearing premise

The verifiable metrics capture the full range of what humans find aesthetically pleasing in slide layouts, without significant bias or omission.

What would settle it

Blind human preference tests showing that slides generated by the trained model are not preferred over those from baseline methods, or a low correlation between the metric scores and human ratings.
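The second test is mechanical to run once per-slide metric scores and human ratings are collected side by side. A self-contained Spearman rank correlation (assuming no tied values; the data below is invented for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-slide metric rewards vs. 1-5 human scores:
rho = spearman([0.9, 0.4, 0.7, 0.1, 0.6], [4.5, 3.0, 4.0, 2.0, 3.5])
# Perfectly monotone toy data gives rho == 1.0.
```

A rho near zero on real data would undercut the load-bearing premise above; a consistently high rho would support it.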

Figures

Figures reproduced from arXiv: 2604.22840 by Aohan Zeng, Can Huang, Chengwei Hu, Linmei Hu, Mingming Zhao, Xiaohan Zhang, Xuancheng Huang, Yiming Pan, Yuean Bi.

Figure 1
Figure 1: Overview of the AeSlides workflow. Left: Four categories of aesthetic deficiencies commonly observed in LLM-based slide generation. Center: A suite of verifiable aesthetic metrics is introduced and integrated into reinforcement learning to guide the model toward producing visually coherent slide layouts. Right: Representative slides generated by GLM-4.7-AeSlides.
Figure 2
Figure 2: Visualization of our excessive whitespace detection.
Figure 3
Figure 3: Case studies of slides generated by AeSlides, GPT-5.2, Claude-Sonnet-4.5, and DeepPresenter. Corresponding aesthetic …
Figure 4
Figure 4: Distribution of filtered structurally simple pages in …
Figure 5
Figure 5: Distribution of languages in AeSlides-7k.
Figure 6
Figure 6: Distribution of page indices in AeSlides-7k.
Figure 7
Figure 7: Distribution of prefix token counts in AeSlides-7k.
Figure 8
Figure 8: Word cloud of visible text in AeSlides-7k.
Figure 9
Figure 9: Scatter plot of the normalized advantage against the dominant standard deviation in Monte Carlo simulations.
Figure 10
Figure 10: Correlation between the normalized advantage and …
Figure 11
Figure 11: Reward dynamics during AeSlides training.
Figure 12
Figure 12: KL divergence dynamics during AeSlides training, …
Figure 13
Figure 13: Entropy dynamics during AeSlides training, com…
Figure 14
Figure 14: Rollout time during AeSlides training.
Figure 15
Figure 15: Pairwise win rates of all variants based on human evaluation scores.
Figure 16
Figure 16: Mean score differences between AeSlides and other …
Figure 17
Figure 17: Agreement (Bland-Altman plot) between human …
Figure 19
Figure 19: Failure cases of KL ablation. The generated slides collapse into a limited space of conservative design patterns.
Figure 20
Figure 20: Failure cases of AeSlides …
Figure 21
Figure 21: End-to-end generation case of AeSlides (i).
Figure 22
Figure 22: End-to-end generation case of AeSlides (ii).
Figure 23
Figure 23: Case of prefix-conditioned generation prompt in AeSlides-7k. The …
Figure 24
Figure 24: Prompt for VLM-based slide aesthetic overall evaluation.
Figure 25
Figure 25: Prompt for VLM-based slide aesthetic issue detection.
read the original abstract

Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AeSlides, a GRPO-based reinforcement learning framework that optimizes LLM slide generation using a suite of verifiable, rule-based metrics for layout quality (aspect-ratio compliance, whitespace area, element collisions, and visual imbalance). Trained on only 5K prompts with GLM-4.7-Flash, it reports large gains on these proxies (aspect-ratio compliance 36% to 85%, whitespace reduced 44%, collisions 43%, imbalance 28%) plus a human quality score increase from 3.31 to 3.56 (+7.6%), outperforming model-based reward optimization, reflection-based agents, and even Claude-Sonnet-4.5.

Significance. If the four metrics are faithful low-bias proxies for human aesthetic judgment, the work demonstrates an efficient, low-cost alternative to visual reflection loops or large-scale fine-tuning for aligning text-centric LLMs with visual layout preferences. The public repository supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: the human evaluation reports only a 0.25-point gain on an apparent 5-point scale (+7.6%). Without reported details on the rating protocol, number of raters, inter-rater agreement, statistical significance, or whether raters were instructed to penalize the exact failure modes captured by the four metrics, the modest delta provides weak evidence that the verifiable metrics align with human aesthetic preferences rather than proxy overfitting.
  2. [Results] Results (quantitative and human evaluation sections): large metric improvements are shown, but no ablation or failure-case analysis demonstrates that optimizing the four metrics cannot produce degenerate layouts (e.g., overly sparse or rigidly gridded slides) that score well on the proxies yet receive low human ratings. Such analysis is load-bearing for the central claim that the verifiable-reward paradigm improves true aesthetic quality.
minor comments (1)
  1. [Abstract] Abstract: the baselines ('model-based reward optimization' and 'reflection-based agentic approaches') are mentioned without even one-sentence characterizations; a brief parenthetical description would clarify the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us improve the clarity and rigor of our presentation. Below, we provide point-by-point responses to the major comments and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the human evaluation reports only a 0.25-point gain on an apparent 5-point scale (+7.6%). Without reported details on the rating protocol, number of raters, inter-rater agreement, statistical significance, or whether raters were instructed to penalize the exact failure modes captured by the four metrics, the modest delta provides weak evidence that the verifiable metrics align with human aesthetic preferences rather than proxy overfitting.

    Authors: We agree that the abstract and results section would benefit from more details on the human evaluation to substantiate the alignment between our metrics and human preferences. In the revised manuscript, we have expanded the description of the human evaluation protocol, including the rating scale, number of raters, inter-rater agreement, and statistical tests. We have also clarified the instructions given to raters, which emphasized evaluating layout quality in terms of the aspects our metrics target. Although the gain is modest, it is positive and accompanies substantial proxy improvements, providing supporting evidence for our approach. The full details are now included in the main text and appendix. revision: yes

  2. Referee: [Results] Results (quantitative and human evaluation sections): large metric improvements are shown, but no ablation or failure-case analysis demonstrates that optimizing the four metrics cannot produce degenerate layouts (e.g., overly sparse or rigidly gridded slides) that score well on the proxies yet receive low human ratings. Such analysis is load-bearing for the central claim that the verifiable-reward paradigm improves true aesthetic quality.

    Authors: We recognize that demonstrating the absence of degenerate layouts is important to validate that the proxy metrics truly capture aesthetic quality. The original manuscript did not include a dedicated failure-case analysis. We have now added such an analysis in the revised version, where we examine slides with high metric scores but lower human ratings. Our review indicates that such cases are rare and typically stem from content issues rather than the layout degeneracies described. We also include an ablation on metric combinations to show that optimizing all four metrics together produces the best human-evaluated results without introducing sparsity or rigidity problems. This addition directly addresses the concern and bolsters the central claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity: independent metrics and external human validation

full rationale

The paper designs a suite of rule-based verifiable metrics (aspect ratio compliance, whitespace, collisions, visual imbalance) as explicit aesthetic proxies and applies GRPO to optimize the base LLM directly against them. Reported metric improvements are the direct, expected outcome of reward maximization rather than any claimed independent prediction or derivation. The central aesthetic claim is supported by a separate human evaluation (scores rising from 3.31 to 3.56) that is not reducible to the metrics themselves. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner; the derivation chain from metric definition through RL training to human-rated quality remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract, the approach relies on the assumption that custom metrics can serve as reliable rewards for RL. No free parameters are explicitly mentioned, but the metrics themselves may involve design choices equivalent to parameters.

axioms (1)
  • domain assumption The designed verifiable metrics provide accurate and efficient quantification of slide layout quality that aligns with human preferences.
    This is central to the method but not validated in the provided abstract.

pith-pipeline@v0.9.0 · 5643 in / 1341 out tokens · 44712 ms · 2026-05-10T02:14:21.795841+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1]

    Isabel Cachola, Silviu Cucerzan, Allen Herring, Vuksan Mijovic, Erik Oveson, and Sujay Kumar Jauhar. 2024. Knowledge-Centric Templatic Views of Documents. In Findings of the Association for Computational Linguistics: EMNLP 2024. 15460–15476

  2. [2]

    Dan Friedman and Adji Bousso Dieng. 2023. The Vendi Score: A Diversity Evaluation Metric for Machine Learning. Transactions on Machine Learning Research (2023)

  3. [3]

    Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale Song. 2022. DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 634–642

  4. [4]

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. BLINK: Multimodal Large Language Models Can See but Not Perceive. In European Conference on Computer Vision. Springer, 148–166

  5. [5]

    Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. 2025. Soft Adaptive Policy Optimization. arXiv preprint arXiv:2511.20347 (2025)

  6. [6]

    Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, et al. 2025. AutoPresent: Designing Structured Visuals from Scratch. In Proceedings of the Computer Vision and Pattern Recognition Conference. 2902–2911

  7. [7]

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv preprint arXiv:2507.01006 (2025)

  8. [8]

    Juyong Jiang, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li, Yue Wang, et al. 2026. WebGen-R1: Incentivizing LLMs to Generate Functional and Aesthetic Websites with Reinforcement Learning. (2026). https://openreview.net/forum?id=Zzf6ExJZXj

  9. [9]

    Kyudan Jung, Hojun Cho, Jooyeol Yun, Soyoung Yang, Jaehyeok Jang, and Jaegul Choo. 2025. Talk to Your Slides: Language-Driven Agents for Efficient Slide Editing. arXiv preprint arXiv:2505.11604 (2025)

  10. [10]

    Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. 2026. Preference Leakage: A Contamination Problem in LLM-as-a-judge. In The Fourteenth International Conference on Learning Representations

  11. [11]

    Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, and Chenyu You. 2025. SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation. arXiv preprint arXiv:2512.04529 (2025)

  12. [12]

    Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. 2026. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. arXiv preprint arXiv:2601.05242 (2026)

  14. [14]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522

  15. [15]

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. Understanding R1-Zero-Like Training: A Critical Perspective. In Second Conference on Language Modeling

  16. [16]

    Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c146...

  17. [17]

    Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. 2025. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. arXiv preprint arXiv:2510.11370 (2025)

  18. [18]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744

  19. [19]

    Arjun Panickssery, Samuel R Bowman, and Shi Feng. 2024. LLM Evaluators Recognize and Favor Their Own Generations. Advances in Neural Information Processing Systems 37 (2024), 68772–68802

  20. [20]

    Sohan Patnaik, Rishabh Jain, Balaji Krishnamurthy, and Mausoom Sarkar. 2025. AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 23701–23711

  21. [21]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741

  22. [22]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024)

  23. [23]

    Patrick E Shrout and Joseph L Fleiss. 1979. Intraclass Correlations: Uses in Assessing Rater Reliability. Psychological Bulletin 86, 2 (1979), 420

  24. [24]

    Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, et al. 2025. SlideCoder: Layout-Aware RAG-Enhanced Hierarchical Slide Generation from Design. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 9026–9050

  25. [25]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. 2026. Kimi K2.5: Visual Agentic Intelligence. arXiv preprint arXiv:2602.02276 (2026)

  26. [26]

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9568–9578

  27. [27]

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. 2024. Self-Preference Bias in LLM-as-a-Judge. In NeurIPS Safe Generative AI Workshop 2024

  28. [28]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837

  29. [29]

    Xiaojie Xu, Xinli Xu, Sirui Chen, Haoyu Chen, Fan Zhang, and Ying-Cong Chen. 2025. PreGenie: An Agentic Framework for High-quality Visual Presentation Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025. 3045–3063

  31. [31]

    Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, et al. 2025. UI-UG: A Unified MLLM for UI Understanding and Generation. arXiv preprint arXiv:2509.24361 (2025)

  32. [32]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652

  33. [33]

    Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. 2025. Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. https://fengyao.notion.site/off-policy-rl

  34. [34]

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. 2025. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. In Advances in Neural Information Processing Systems

  35. [35]

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. 2026. GLM-5: from Vibe Coding to Agentic Engineering. arXiv preprint arXiv:2602.15763 (2026)

  36. [36]

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. arXiv preprint arXiv:2508.06471 (2025)

  37. [37]

    Xin Zhao, Yongkang Liu, Kuan Xu, Jia Guo, Zihao Wang, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. 2025. Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop! https://ringtech.notion.site/icepop

  38. [38]

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. 2025. Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071 (2025)

  39. [39]

    Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2025. PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 14413–14429

  40. [40]

    Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2026. DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation. arXiv preprint arXiv:2602.22839 (2026)

  41. [41]

    Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. 2025. slime: An LLM post-training framework for RL Scaling. https://github.com/THUDM/slime. GitHub repository. Corresponding author: Xin Lv
