AI-Generated Slides: Are They Good? Can Students Tell?
Pith reviewed 2026-05-14 18:53 UTC · model grok-4.3
The pith
Coding-assistant tools create slides from course notes that students rate as comparable in quality to instructor-made ones and fail to identify as AI-generated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative AI tools, especially coding assistants, produce slides from course notes that are accurate, complete, and pedagogically sound. In a live classroom test, students rate these slides as comparable in quality to instructor-created slides and cannot reliably identify their AI origin.
What carries the argument
Side-by-side educator narrative assessment of slides from five GenAI tools, followed by student quality ratings and origin-identification surveys in an actual course.
Load-bearing premise
The light modifications made to the best AI slides before classroom use did not systematically favor the AI versions in the student comparison.
What would settle it
A blind test using completely unmodified AI slides where students identify the AI origin at rates significantly above 50 percent.
read the original abstract
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper evaluates generative AI tools (NotebookLM, Claude, M365 Copilot, Cursor, Claude Code) for creating educational slides from instructor notes. Educator narrative assessments identify coding assistants as producing the most accurate, complete, and pedagogically sound outputs. The top slides undergo light modifications before deployment in a live course, where students rate quality and attempt to identify AI vs. instructor slides. Results show students rate GenAI slides equivalently to human-created ones, cannot reliably distinguish sources, and exhibit a negative correlation between high quality ratings and high AI-attribution ratings.
Significance. If the findings hold, this provides direct empirical support for integrating coding-assistant GenAI into instructional workflows, with ecological validity from the live-course component. The student indistinguishability result and quality-AI correlation offer actionable insights for educators on perception biases, potentially accelerating responsible adoption of AI in pedagogy while highlighting needs for further validation studies.
major comments (3)
- [§4.3] §4.3 (Student Deployment): The description states that the best AI slides were used 'with some modification' before classroom deployment, but provides no quantification of change volume, type (e.g., factual corrections, flow edits), or rationale. This is load-bearing for the equivalence and identification claims, as unmeasured edits could have systematically addressed raw GenAI weaknesses, meaning results apply only to post-edit versions rather than unmodified outputs.
- [§5.2] §5.2 (Identification Task): The claim that students 'cannot reliably identify' AI-generated slides lacks specification of the statistical test (e.g., proportion test against chance, chi-square), sample size per condition, and effect size or power analysis. Without these, it is impossible to distinguish true indistinguishability from low statistical power, undermining the central student-perception result.
- [§3.2] §3.2 (Educator Assessment): The narrative selection of coding assistants as superior relies on educator judgments of 'pedagogically sound' without reported inter-rater reliability, explicit scoring rubric, or example slide excerpts illustrating differences. This reduces transparency for the tool-ranking claim that drives the subsequent student study.
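The power concern in the second major comment can be made concrete. A minimal sketch of the kind of analysis the report asks for: an exact one-sided binomial test of identification accuracy against the 50 percent chance level. All numbers here (60 students, 33 or 40 correct calls) are illustrative assumptions, not values from the paper.

```python
from math import comb

def binom_pvalue_one_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(k, n + 1))

# Hypothetical illustration: 60 students, 33 correct AI-vs-human calls (55%).
p_observed = binom_pvalue_one_sided(33, 60)   # well above 0.05: cannot reject chance

# The same test with 40/60 correct (67%) would reject the chance hypothesis.
p_larger = binom_pvalue_one_sided(40, 60)     # below 0.05
```

The contrast between the two p-values is the referee's point: at a sample of this size, "not significantly above chance" is compatible with sizable true detection ability, so a power analysis or effect-size report is needed before reading the result as indistinguishability.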
minor comments (3)
- [Abstract] Abstract: Report the exact number of slides generated per tool, the specific course topic, and total student sample size to support replicability and generalizability claims.
- [§5.1] §5.1 (Quality Ratings): The negative correlation between quality and AI-generated ratings should include the Pearson/Spearman coefficient, p-value, and confidence interval rather than a qualitative description only.
- [Results] Figure 2 or equivalent: Ensure axis labels and legends clearly distinguish the five tools and human baseline for the educator assessment results.
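The correlation statistic requested in the second minor comment can be computed directly from paired ratings. A self-contained sketch of Spearman's rank correlation (Pearson correlation of average ranks, handling ties), applied to invented per-deck quality and AI-attribution ratings; the data values are illustrative, not from the paper.

```python
def average_ranks(xs):
    """Ranks (1-based), assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Illustrative paired ratings per slide deck (invented, not from the paper):
quality   = [5, 4, 4, 3, 2, 1]   # 1-5 perceived-quality rating
ai_rating = [1, 2, 2, 3, 4, 5]   # 1-5 "this looks AI-generated" rating
rho = spearman_rho(quality, ai_rating)   # negative, matching the reported direction
```

Reporting rho alongside a p-value and confidence interval, as the comment asks, would let readers judge whether the quality-AI association is strong or marginal.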
Circularity Check
No circularity: empirical study rests on independent human judgments
full rationale
The paper conducts an empirical evaluation: GenAI tools generate slides from course notes, educators perform narrative assessments to select the best outputs, light modifications are applied, and the resulting slides are deployed in a real course for student perception and identification surveys. No equations, fitted parameters, predictions, or derivations appear anywhere in the workflow. Claims about accuracy, completeness, pedagogical soundness, quality ratings, and identification rates are grounded directly in the collected human data rather than any self-referential reduction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The central results therefore remain independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Educator narrative assessments and student Likert ratings accurately reflect pedagogical soundness and perceived quality.
- domain assumption The selected course and student cohort are representative of typical higher-education settings.
Reference graph
Works this paper leans on
- [1]
- [2] James M. Clark and Allan Paivio. 1991. Dual coding theory and education. Educational Psychology Review 3, 3 (1991), 149–210. doi:10.1007/BF01320076
- [3] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Zhen Ming (Jack) Jiang. 2023. GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734
- [4] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. ACM, 1136–1142
- [5] James Finnie-Ansley, Paul Denny, Andrew Luxton-Reilly, Eddie Antonio Santos, James Prather, and Brett A. Becker. 2023. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI's Codex on CS2 Programming Exercises. In Proceedings of the 25th Australasian Computing Education Conference. ACM, 97–104
- [6] Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, and Trevor Darrell. 2025. AutoPresent: Designing Structured Visuals from Scratch. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)
- [7] Svetoslav Georgiev and Joseph Tinsley. 2024. Exploring Student Acceptance and Perceptions of AI-Assisted PowerPoint Creation. African Journal of Inter/Multidisciplinary Studies 6, 1 (2024), 1–13
- [8] Quan Connie Gu, Daniel Hickey, and Kimiko Ryokai. 2025. When AI Tells Their Story: Researchers' Reactions to AI-Generated Podcasts as a Tool for Communicating Research. In Extended Abstracts of the CHI Conf. on Human Factors in Computing Systems (CHI EA '25). ACM
- [9] Michael Henderson, Margaret Bearman, Jennifer Chung, Tim Fawns, Simon Buckingham Shum, Kelly E Matthews, and Jimena de Mello Heredia. 2025. Comparing Generative AI and teacher feedback: student perceptions of usefulness and trustworthiness. Assessment & Evaluation in Higher Education (2025), 1–16
- [10] Mollie Jordan, Kevin Ly, and Adalbert Gerald Soosai Raj. 2024. Need a Programming Exercise Generated in Your Native Language? ChatGPT's Got Your Back: Automatic Generation of Non-English Programming Exercises Using OpenAI GPT-3.5. In Proc. of the 55th ACM Technical Symposium on Computer Science Education V. 1. Association for Computing Machinery
- [11] Ranim Khojah, Mazen Mohamad, Philipp Leitner, and Francisco Gomes de Oliveira Neto. 2024. Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1819–1840
- [12] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing Code Explanations Created by Students and Large Language Models. In Proc. of the 2023 Conf. on Innovation and Technology in Computer Science Education V. 1. ACM
- [13] Zhuoyan Li, Chen Liang, Jing Peng, and Ming Yin. 2024. How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 4849–4868
- [14] Evanfiya Logacheva, Arto Hellas, James Prather, Sami Sarsa, and Juho Leinonen. 2024. Evaluating Contextually Personalized Programming Exercises Created with Generative AI. In Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1. 95–113
- [16] Richard E. Mayer. 2005. Cognitive theory of multimedia learning. In The Cambridge Handbook of Multimedia Learning, Richard E. Mayer (Ed.). Cambridge University Press, New York, 31–48
- [17] Richard E. Mayer. 2017. Using multimedia for e-learning. Journal of Computer Assisted Learning 33, 5 (2017), 403–423
- [18] Shakked Noy and Whitney Zhang. 2023. Experimental evidence on the productivity effects of generative artificial intelligence. Science 381, 6654 (2023), 187–192
- [19] OpenAI. 2025. GPT-5 System Card. https://openai.com/index/gpt-5-system-card/ Accessed on 2025-11-03
- [20] Vinay Patel. 2024. Fake Or Real? Audio Captures AI Podcast Hosts Realising 'We're Not Human... What Happens When They Turn Us Off?'. International Business Times (UK). https://www.ibtimes.co.uk/fake-real-audio-captures-ai-podcast-hosts-realising-were-not-human-what-happens-when-they-1728290 Accessed: 20 April 2026
- [21] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023)
- [22] James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. In Proc. of the 202...
- [23] James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, David H. Smith IV, Sven Strickroth, and Daniel Zingaro. 2024. Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools. In 2024 Working Group Reports on Innovation and Technology in Computer Science Education. ACM, 300–338
- [25] Kunal Rao, Giuseppe Coviello, Murugan Sankaradas, Ciro Giuseppe De Vita, Gennaro Mellone, and Srimat Chakradhar. 2025. SlideCraft: Context-aware Slides Generation Agent. In 2025 IEEE Conference on Pervasive and Intelligent Computing (PICom). IEEE, 165–172
- [26] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1. ACM
- [27] Jaromir Savelka, Arav Agarwal, Marshall An, Christopher Bogart, and Majd Sakr. 2023. Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses. In Proc. of the 2023 ACM Conf. on Int. Computing Education Research - Volume 1. ACM
- [29] Athar Sefid, Jian Wu, Prasenjit Mitra, and C. Lee Giles. 2019. Automatic slide generation for scientific papers. In Third International Workshop on Capturing Scientific Knowledge co-located with the 10th International Conference on Knowledge Capture (K-CAP 2019), SciKnow@K-CAP 2019
- [30]
- [31] John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285. doi:10.1207/s15516709cog1202_4
- [32] Ismael Villegas Molina, Audria Montalvo, Shera Zhong, Mollie Jordan, and Adalbert Gerald Soosai Raj. 2024. Generation and Evaluation of a Culturally-Relevant CS1 Textbook for Latines using Large Language Models. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. 325–331
- [33] Biao Wang. 2024. NotebookLM now lets you listen to a conversation about your sources. Google Blog. September 11 (2024)
- [34] Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S. Yu, and Qingsong Wen. 2025. Large Language Models for Education: A survey and outlook. IEEE Signal Processing Magazine 42, 6 (2025), 51–63
- [35] Leon E. Winslow. 1996. Programming pedagogy—a psychological overview. ACM SIGCSE Bulletin 28, 3 (1996), 17–22. doi:10.1145/234867.234872
- [36] Eric Xie, Danielle Waterfield, Michael Kennedy, and Aidong Zhang. 2026. SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 40907–40915
- [37] Weicheng Xing, Tianqing Zhu, Jenny Wang, and Bo Liu. 2024. A Survey on MLLMs in Education: Application and Future Directions. Future Internet (2024)