AI-Generated Slides: Are They Good? Can Students Tell?
Pith reviewed 2026-05-14 18:53 UTC · model grok-4.3
The pith
Coding-assistant tools create slides from course notes that students rate as comparable in quality to instructor-made ones and fail to identify as AI-generated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative AI tools, especially coding assistants, produce slides from course notes that are accurate, complete, and pedagogically sound. In a live classroom test, students rate these slides as comparable in quality to instructor-created slides and cannot reliably identify their AI origin.
What carries the argument
Side-by-side educator narrative assessment of slides from five GenAI tools, followed by student quality ratings and origin-identification surveys in an actual course.
Load-bearing premise
The light modifications made to the best AI slides before classroom use did not systematically favor the AI versions in the student comparison.
What would settle it
A blind test using completely unmodified AI slides where students identify the AI origin at rates significantly above 50 percent.
read the original abstract
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper evaluates generative AI tools (NotebookLM, Claude, M365 Copilot, Cursor, Claude Code) for creating educational slides from instructor notes. Educator narrative assessments identify coding assistants as producing the most accurate, complete, and pedagogically sound outputs. The top slides undergo light modifications before deployment in a live course, where students rate quality and attempt to identify AI vs. instructor slides. Results show students rate GenAI slides equivalently to human-created ones, cannot reliably distinguish sources, and exhibit a negative correlation between high quality ratings and high AI-attribution ratings.
Significance. If the findings hold, this provides direct empirical support for integrating coding-assistant GenAI into instructional workflows, with ecological validity from the live-course component. The student indistinguishability result and quality-AI correlation offer actionable insights for educators on perception biases, potentially accelerating responsible adoption of AI in pedagogy while highlighting needs for further validation studies.
major comments (3)
- [§4.3] §4.3 (Student Deployment): The description states that the best AI slides were used 'with some modification' before classroom deployment, but provides no quantification of change volume, type (e.g., factual corrections, flow edits), or rationale. This is load-bearing for the equivalence and identification claims, as unmeasured edits could have systematically addressed raw GenAI weaknesses, meaning results apply only to post-edit versions rather than unmodified outputs.
- [§5.2] §5.2 (Identification Task): The claim that students 'cannot reliably identify' AI-generated slides lacks specification of the statistical test (e.g., proportion test against chance, chi-square), sample size per condition, and effect size or power analysis. Without these, it is impossible to distinguish true indistinguishability from low statistical power, undermining the central student-perception result.
- [§3.2] §3.2 (Educator Assessment): The narrative selection of coding assistants as superior relies on educator judgments of 'pedagogically sound' without reported inter-rater reliability, explicit scoring rubric, or example slide excerpts illustrating differences. This reduces transparency for the tool-ranking claim that drives the subsequent student study.
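The power concern in the second major comment can be made concrete. A minimal sketch of the kind of analysis the report asks for: an exact one-sided binomial test of identification accuracy against the 50 percent chance level. All numbers here (60 students, 33 or 40 correct calls) are illustrative assumptions, not values from the paper.

```python
from math import comb

def binom_pvalue_one_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(k, n + 1))

# Hypothetical illustration: 60 students, 33 correct AI-vs-human calls (55%).
p_observed = binom_pvalue_one_sided(33, 60)   # well above 0.05: cannot reject chance

# The same test with 40/60 correct (67%) would reject the chance hypothesis.
p_larger = binom_pvalue_one_sided(40, 60)     # below 0.05
```

The contrast between the two p-values is the referee's point: at a sample of this size, "not significantly above chance" is compatible with sizable true detection ability, so a power analysis or effect-size report is needed before reading the result as indistinguishability.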
minor comments (3)
- [Abstract] Abstract: Report the exact number of slides generated per tool, the specific course topic, and total student sample size to support replicability and generalizability claims.
- [§5.1] §5.1 (Quality Ratings): The negative correlation between quality and AI-generated ratings should include the Pearson/Spearman coefficient, p-value, and confidence interval rather than a qualitative description only.
- [Results] Figure 2 or equivalent: Ensure axis labels and legends clearly distinguish the five tools and human baseline for the educator assessment results.
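The correlation statistic requested in the second minor comment can be computed directly from paired ratings. A self-contained sketch of Spearman's rank correlation (Pearson correlation of average ranks, handling ties), applied to invented per-deck quality and AI-attribution ratings; the data values are illustrative, not from the paper.

```python
def average_ranks(xs):
    """Ranks (1-based), assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Illustrative paired ratings per slide deck (invented, not from the paper):
quality   = [5, 4, 4, 3, 2, 1]   # 1-5 perceived-quality rating
ai_rating = [1, 2, 2, 3, 4, 5]   # 1-5 "this looks AI-generated" rating
rho = spearman_rho(quality, ai_rating)   # negative, matching the reported direction
```

Reporting rho alongside a p-value and confidence interval, as the comment asks, would let readers judge whether the quality-AI association is strong or marginal.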
Circularity Check
No circularity: empirical study rests on independent human judgments
full rationale
The paper conducts an empirical evaluation: GenAI tools generate slides from course notes, educators perform narrative assessments to select the best outputs, light modifications are applied, and the resulting slides are deployed in a real course for student perception and identification surveys. No equations, fitted parameters, predictions, or derivations appear anywhere in the workflow. Claims about accuracy, completeness, pedagogical soundness, quality ratings, and identification rates are grounded directly in the collected human data rather than any self-referential reduction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The central results therefore remain independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Educator narrative assessments and student Likert ratings accurately reflect pedagogical soundness and perceived quality.
- domain assumption The selected course and student cohort are representative of typical higher-education settings.
Reference graph
Works this paper leans on
- [1]
- [2] James M. Clark and Allan Paivio. 1991. Dual coding theory and education. Educational Psychology Review 3, 3 (1991), 149–210. doi:10.1007/BF01320076
- [3] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Zhen Ming (Jack) Jiang. 2023. GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734
- [4] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. ACM, 1136–1142
- [5] James Finnie-Ansley, Paul Denny, Andrew Luxton-Reilly, Eddie Antonio Santos, James Prather, and Brett A. Becker. 2023. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI's Codex on CS2 Programming Exercises. In Proceedings of the 25th Australasian Computing Education Conference. ACM, 97–104
- [6] Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, and Trevor Darrell. 2025. AutoPresent: Designing Structured Visuals from Scratch. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)
- [7] Svetoslav Georgiev and Joseph Tinsley. 2024. Exploring Student Acceptance and Perceptions of AI-Assisted PowerPoint Creation. African Journal of Inter/Multidisciplinary Studies 6, 1 (2024), 1–13
- [8] Quan Connie Gu, Daniel Hickey, and Kimiko Ryokai. 2025. When AI Tells Their Story: Researchers' Reactions to AI-Generated Podcasts as a Tool for Communicating Research. In Extended Abstracts of the CHI Conf. on Human Factors in Computing Systems (CHI EA '25). ACM
- [9] Michael Henderson, Margaret Bearman, Jennifer Chung, Tim Fawns, Simon Buckingham Shum, Kelly E Matthews, and Jimena de Mello Heredia. 2025. Comparing Generative AI and teacher feedback: student perceptions of usefulness and trustworthiness. Assessment & Evaluation in Higher Education (2025), 1–16
- [10] Mollie Jordan, Kevin Ly, and Adalbert Gerald Soosai Raj. 2024. Need a Programming Exercise Generated in Your Native Language? ChatGPT's Got Your Back: Automatic Generation of Non-English Programming Exercises Using OpenAI GPT-3.5. In Proc. of the 55th ACM Technical Symposium on Computer Science Education V. 1. Association for Computing Machinery
- [11] Ranim Khojah, Mazen Mohamad, Philipp Leitner, and Francisco Gomes de Oliveira Neto. 2024. Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1819–1840
- [12] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing Code Explanations Created by Students and Large Language Models. In Proc. of the 2023 Conf. on Innovation and Technology in Computer Science Education V. 1. ACM
- [13] Zhuoyan Li, Chen Liang, Jing Peng, and Ming Yin. 2024. How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 4849–4868
- [14] Evanfiya Logacheva, Arto Hellas, James Prather, Sami Sarsa, and Juho Leinonen. 2024. Evaluating Contextually Personalized Programming Exercises Created with Generative AI. In Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1. 95–113
- [16] Richard E. Mayer. 2005. Cognitive theory of multimedia learning. In The Cambridge Handbook of Multimedia Learning, Richard E. Mayer (Ed.). Cambridge University Press, New York, 31–48
- [17] Richard E. Mayer. 2017. Using multimedia for e-learning. Journal of Computer Assisted Learning 33, 5 (2017), 403–423
- [18] Shakked Noy and Whitney Zhang. 2023. Experimental evidence on the productivity effects of generative artificial intelligence. Science 381, 6654 (2023), 187–192
- [19] OpenAI. 2025. GPT-5 System Card. https://openai.com/index/gpt-5-system-card/ Accessed on 2025-11-03
- [20] Vinay Patel. 2024. Fake Or Real? Audio Captures AI Podcast Hosts Realising 'We're Not Human... What Happens When They Turn Us Off?'. International Business Times (UK). https://www.ibtimes.co.uk/fake-real-audio-captures-ai-podcast-hosts-realising-were-not-human-what-happens-when-they-1728290 Accessed: 20 April 2026
- [21] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023)
- [22] James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. In Proc. of the 202...
- [23] James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, David H. Smith IV, Sven Strickroth, and Daniel Zingaro. 2024. Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools. In 2024 Working Group Reports on Innovation and Technology in Computer Science Education. ACM, 300–338
- [25] Kunal Rao, Giuseppe Coviello, Murugan Sankaradas, Ciro Giuseppe De Vita, Gennaro Mellone, and Srimat Chakradhar. 2025. SlideCraft: Context-aware Slides Generation Agent. In 2025 IEEE Conference on Pervasive and Intelligent Computing (PICom). IEEE, 165–172
- [26] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1. ACM
- [27] Jaromir Savelka, Arav Agarwal, Marshall An, Christopher Bogart, and Majd Sakr. 2023. Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses. In Proc. of the 2023 ACM Conf. on Int. Computing Education Research - Volume 1. ACM
- [29] Athar Sefid, Jian Wu, Prasenjit Mitra, and C. Lee Giles. 2019. Automatic slide generation for scientific papers. In Third International Workshop on Capturing Scientific Knowledge co-located with the 10th International Conference on Knowledge Capture (K-CAP 2019), SciKnow@K-CAP 2019
- [30]
- [31] John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285. doi:10.1207/s15516709cog1202_4
- [32] Ismael Villegas Molina, Audria Montalvo, Shera Zhong, Mollie Jordan, and Adalbert Gerald Soosai Raj. 2024. Generation and Evaluation of a Culturally-Relevant CS1 Textbook for Latines using Large Language Models. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. 325–331
- [33] Biao Wang. 2024. NotebookLM now lets you listen to a conversation about your sources. Google Blog. September 11 (2024)
- [34] Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S. Yu, and Qingsong Wen. 2025. Large Language Models for Education: A survey and outlook. IEEE Signal Processing Magazine 42, 6 (2025), 51–63
- [35] Leon E. Winslow. 1996. Programming pedagogy—a psychological overview. ACM SIGCSE Bulletin 28, 3 (1996), 17–22. doi:10.1145/234867.234872
- [36] Eric Xie, Danielle Waterfield, Michael Kennedy, and Aidong Zhang. 2026. SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 40907–40915
- [37] Weicheng Xing, Tianqing Zhu, Jenny Wang, and Bo Liu. 2024. A Survey on MLLMs in Education: Application and Future Directions. Future Internet (2024)