arXiv: 2506.06921 · v2 · submitted 2025-06-07 · ⚛️ physics.ed-ph · astro-ph.CO · astro-ph.GA · astro-ph.IM · astro-ph.SR

Teaching Astronomy with Large Language Models

Pith reviewed 2026-05-19 10:11 UTC · model grok-4.3

classification: ⚛️ physics.ed-ph · astro-ph.CO · astro-ph.GA · astro-ph.IM · astro-ph.SR
keywords: astronomy education · large language models · AI literacy · domain-specific tools · student assessment · LLM grading · undergraduate teaching

The pith

Structured LLM integration in an astronomy course reduced students' reliance on LLMs while building critical AI skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how final-year undergraduate astronomy students interact with large language models under structured guidance. The authors developed AstroTutor, a specialized tutoring system incorporating curated arXiv papers, and required students to document their AI usage in homework reflections and surveys. Over the semester, students reported shifting from seeking basic assistance to using LLMs for verification and strategic tasks, with decreased overall reliance. LLM grading correlated strongly with human evaluations while giving more consistent and detailed feedback, and LLM-assisted interviews were piloted for scalable assessment. The findings suggest that transparency requirements and domain-specific tools can improve both astronomy learning and essential AI skills.

Core claim

By integrating general-purpose and domain-specific LLMs with requirements for students to document their interactions, the study shows that students evolve their AI strategies from basic help-seeking to advanced verification and cross-checking workflows. This structured approach leads to decreased reliance on LLMs rather than increased dependence, while fostering metacognitive awareness and effective prompting techniques. Experimental comparisons confirm that LLM-based grading provides feedback comparable to human grading in quality but with greater detail and consistency, and interview-based exams offer a scalable alternative for individualized evaluation.

What carries the argument

AstroTutor, a domain-specific astronomy tutoring system enhanced with curated arXiv content, combined with mandatory documentation of AI usage through homework reflections and surveys.

If this is right

  • Students develop critical evaluation skills and strategic tool selection over the course of the semester.
  • LLM grading shows strong correlation with human evaluation while delivering more detailed and consistent feedback.
  • LLM-facilitated interview-based examinations provide a scalable alternative for individualized student assessment.
  • Documentation requirements foster metacognitive awareness and evolution from basic assistance to verification workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structured documentation approach could transfer to other STEM fields to build general AI literacy without increasing dependence.
  • Making the AstroTutor repository openly available enables other instructors to test and adapt the system for different course contexts.
  • Decreased LLM reliance may correlate with improved retention of astronomy concepts through greater student engagement in verification.

Load-bearing premise

Student self-documentation through homework reflections and post-course surveys accurately reflects their actual AI interaction strategies without significant social desirability bias or incomplete reporting.

What would settle it

Direct comparison of student self-reported AI usage patterns against actual logged interactions with the LLM tools to measure discrepancies in reported strategies and skill evolution.

Figures

Figures reproduced from arXiv: 2506.06921 by Teaghan O'Briain, Yuan-Sen Ting.

Figure 1: Multi-agent architecture of AstroTutor. The system employs a Retrieval Augmented Generation (RAG) approach with three specialized agents accessing distinct knowledge domains: course materials and lecture notes adapted from the instructor’s textbook, trusted reference materials, and a curated database of arXiv papers from the astro-ph section. All knowledge sources are stored in ChromaDB vector storage for…
Figure 2: User interface of the AstroTutor system. The interface provides organized access to course materials including lectures, tutorials, and reference textbooks. The main chat interface facilitates pedagogical interactions, offering assistance with course concepts, data analysis and coding support, and paper recommendations for assignments and projects. Students can download chat histories and reset conversatio…
Figure 3: Distribution of LLM tool usage among students throughout the semester. The chart displays usage percentages for chat-based LLMs (solid bars) and IDE-integrated tools (hatched bars). ChatGPT was the dominant tool with 90% adoption, followed by AstroTutor at 80%. Students typically used AstroTutor for theoretical understanding and ChatGPT for coding assistance, demonstrating complementary roles. For IDE-in…
Figure 4: [caption not extracted]
Figure 5: Student self-assessment of learning outcomes and LLM proficiency development across six key dimensions (1-10 Likert scale). Each panel shows kernel density estimation of student responses with different color gradients to distinguish topics. The survey assessed: (1) awareness of LLM strengths and limitations through documentation, (2) LLM effectiveness for concept understanding, (3) maintenance of problem-…
Figure 6: Student evaluation of course implementation and future implications (1-10 Likert scale). Each panel shows kernel density estimation of student responses with different color gradients to distinguish topics. The survey assessed: (8) perceived value of the specialized AstroTutor system compared to general-purpose LLMs, (9) impact of documentation requirements on learning experience, (10) anticipated value of…
Figure 7: Example of LLM-generated grading feedback showing detailed error identification and constructive guidance for a student’s analytical approach to calculating distribution moments.
… visualizations, and appropriate result interpretation. The system generated structured JSON responses with specific fields including earned points, detailed error descriptions with point deductions, and feedback for each question.…
Figure 8: Comparison of LLM-assisted grading versus human grader scores across four homework assignments for two different models: Claude-3.7-Sonnet (left) and Gemini-2.5-Flash (right). Individual homework scores are shown as colored points, with black circles representing student averages used for linear regression analysis. The dashed line shows the linear fit based on student averages, while the solid gray line i…
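
The text captured with the Figure 7 caption describes structured JSON grading responses with earned points, error descriptions with point deductions, and per-question feedback. A minimal sketch of how such a response might be parsed and sanity-checked follows. The field names (`question`, `max_points`, `earned_points`, `errors`, `deduction`, `feedback`) and all example values are assumptions made for illustration, not the authors' actual schema.

```python
import json

# Hypothetical grading response. Field names and values are illustrative
# assumptions, not the schema used in the paper.
RAW_RESPONSE = """
[
  {"question": "Q1", "max_points": 10, "earned_points": 7,
   "errors": [{"description": "Sign error in the second moment", "deduction": 2},
              {"description": "Missing axis labels", "deduction": 1}],
   "feedback": "Recheck the integration limits before normalizing."},
  {"question": "Q2", "max_points": 5, "earned_points": 5,
   "errors": [],
   "feedback": "Correct and clearly presented."}
]
"""


def validate_grades(raw: str) -> list[dict]:
    """Parse a grading response and check that deductions explain each score."""
    records = json.loads(raw)
    for rec in records:
        deducted = sum(err["deduction"] for err in rec["errors"])
        if not 0 <= rec["earned_points"] <= rec["max_points"]:
            raise ValueError(f"{rec['question']}: score out of range")
        if rec["earned_points"] != rec["max_points"] - deducted:
            raise ValueError(f"{rec['question']}: deductions do not match score")
    return records


def total_score(records: list[dict]) -> tuple[int, int]:
    """Return (earned, possible) points summed over all questions."""
    earned = sum(rec["earned_points"] for rec in records)
    possible = sum(rec["max_points"] for rec in records)
    return earned, possible


records = validate_grades(RAW_RESPONSE)
print(total_score(records))
```

Checking that the listed deductions add up to the awarded score is one simple way to catch an inconsistent model response before it reaches a student; whether the authors applied such a check is not stated.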
Original abstract

We present a study of LLM integration in final-year undergraduate astronomy education, examining how students develop AI literacy through structured guidance and documentation requirements. We developed AstroTutor, a domain-specific astronomy tutoring system enhanced with curated arXiv content, and deployed it alongside general-purpose LLMs in the course. Students documented their AI usage through homework reflections and post-course surveys. We analyzed student evolution in AI interaction strategies and conducted experimental comparisons of LLM-assisted versus traditional grading methods. LLM grading showed strong correlation with human evaluation while providing more detailed and consistent feedback. We also piloted LLM-facilitated interview-based examinations as a scalable alternative to traditional assessments, demonstrating potential for individualized evaluation that addresses common testing limitations. Students experienced decreased rather than increased reliance on LLMs over the semester, developing critical evaluation skills and strategic tool selection. They evolved from basic assistance-seeking to verification workflows, with documentation requirements fostering metacognitive awareness. Students developed effective prompting strategies, contextual enrichment techniques, and cross-verification practices. Our findings suggest that structured LLM integration with transparency requirements and domain-specific tools can enhance astronomy education while building essential AI literacy skills. We provide implementation guidelines for educators and make our AstroTutor repository freely available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical study of structured LLM integration in a final-year undergraduate astronomy course. The authors developed AstroTutor, a domain-specific tutoring system incorporating curated arXiv content, and deployed it alongside general-purpose LLMs. Students documented AI usage via required homework reflections and post-course surveys. Key claims include that students exhibited decreased (rather than increased) reliance on LLMs over the semester, evolving toward verification workflows, critical evaluation, and strategic tool selection; that LLM grading correlates strongly with human grading while providing more detailed feedback; and that LLM-facilitated interview-based exams offer a scalable assessment alternative. The paper concludes that transparency requirements and domain-specific tools enhance astronomy education while building AI literacy, and it supplies implementation guidelines plus an open AstroTutor repository.

Significance. If the central observations hold under more rigorous scrutiny, the work could usefully inform astronomy educators seeking to incorporate LLMs without fostering dependence. The emphasis on documentation requirements, the open release of AstroTutor, and the practical guidelines constitute concrete contributions that other instructors could adapt. The observational design and focus on student self-reports, however, limit the strength of claims about skill evolution and literacy gains.

major comments (3)
  1. [Abstract / Results] Abstract and Results sections: the headline finding that students showed decreased rather than increased LLM reliance and developed verification workflows rests entirely on analysis of homework reflections and post-course surveys, yet no sample size, quantitative metrics (e.g., frequency counts or change scores), coding protocol, or inter-rater reliability is reported.
  2. [Methods / Results] Methods / Experimental comparisons: the claim of strong correlation between LLM and human grading is presented without the actual correlation coefficient, number of assignments or students involved, or controls for confounding variables such as assignment difficulty or grader familiarity with the material.
  3. [Discussion] Discussion: the interpretation that documentation requirements fostered metacognitive awareness and reduced dependence assumes self-reported reflections accurately capture actual interaction strategies; the manuscript provides no validation against usage logs from AstroTutor, no pre/post objective prompting tasks, and no control cohort to rule out social-desirability bias or course-specific effects.
minor comments (2)
  1. [Abstract] The abstract would benefit from an explicit statement of the number of participating students and the duration of the course.
  2. [Results] Figure or table captions describing LLM grading comparisons should include the precise statistical measure used (Pearson r, Spearman rho, etc.) rather than the qualitative phrase 'strong correlation.'
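
Minor comment 2 asks for a named statistic in place of the phrase 'strong correlation'. As a minimal sketch, assuming per-student average scores from the human grader and from the LLM are available as two equal-length lists, Pearson r and Spearman rho could be computed as below. The scores are made-up placeholders, not data from the paper, and the helper names are illustrative.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Placeholder per-student averages (hypothetical, not from the paper).
human_avg = [72.5, 88.0, 64.0, 91.5, 79.0, 85.5]
llm_avg = [70.0, 90.5, 66.0, 89.0, 77.5, 88.0]

print(f"Pearson r    = {pearson_r(human_avg, llm_avg):.3f}")
print(f"Spearman rho = {spearman_rho(human_avg, llm_avg):.3f}")
```

Reporting both is a cheap robustness check: Pearson r measures linear agreement, while Spearman rho only asks whether the two graders rank students in the same order.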

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our reporting. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results sections: the headline finding that students showed decreased rather than increased LLM reliance and developed verification workflows rests entirely on analysis of homework reflections and post-course surveys, yet no sample size, quantitative metrics (e.g., frequency counts or change scores), coding protocol, or inter-rater reliability is reported.

    Authors: We agree that these methodological details were insufficiently reported in the original submission. The analysis drew on reflections submitted by the full class cohort. In the revised manuscript we will explicitly state the sample size, provide quantitative metrics including the proportion of students exhibiting shifts toward verification workflows and frequency counts of key themes across the semester, describe the thematic coding protocol, and report inter-rater reliability for the qualitative analysis. These elements were part of our internal process but omitted from the text. revision: yes

  2. Referee: [Methods / Results] Methods / Experimental comparisons: the claim of strong correlation between LLM and human grading is presented without the actual correlation coefficient, number of assignments or students involved, or controls for confounding variables such as assignment difficulty or grader familiarity with the material.

    Authors: We accept that the quantitative details supporting the grading comparison were not included. The revised manuscript will report the correlation coefficient, the number of assignments and students in the comparison, and describe the grading protocol, including steps taken to minimize effects of assignment difficulty and grader familiarity. This will allow readers to evaluate the strength of the observed agreement directly. revision: yes

  3. Referee: [Discussion] Discussion: the interpretation that documentation requirements fostered metacognitive awareness and reduced dependence assumes self-reported reflections accurately capture actual interaction strategies; the manuscript provides no validation against usage logs from AstroTutor, no pre/post objective prompting tasks, and no control cohort to rule out social-desirability bias or course-specific effects.

    Authors: We acknowledge the limitations of relying on self-reported data without additional validation. The study was observational and did not collect usage logs, conduct pre/post objective tasks, or include a control cohort. In the revised Discussion we will explicitly state these constraints, discuss the possibility of social-desirability bias and course-specific effects, and frame the findings as initial evidence rather than definitive causal claims. We will also outline directions for future work that could incorporate objective measures. revision: partial

standing simulated objections not resolved
  • Direct validation against AstroTutor usage logs cannot be added because such logs were not collected during the study.
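
The first response above promises inter-rater reliability for the thematic coding of student reflections. One common statistic for two coders is Cohen's kappa; the manuscript does not say which measure was used, so this is only a sketch of one option. The theme labels and both coders' assignments below are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each coder's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical theme codes assigned to eight reflections by two coders.
coder_1 = ["basic_help", "verification", "verification", "strategic",
           "basic_help", "verification", "strategic", "basic_help"]
coder_2 = ["basic_help", "verification", "strategic", "strategic",
           "basic_help", "verification", "strategic", "verification"]

print(f"kappa = {cohens_kappa(coder_1, coder_2):.3f}")
```

Kappa corrects raw percent agreement for the agreement two coders would reach by chance, which matters when one theme dominates the reflections.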

Circularity Check

0 steps flagged

No circularity: empirical observational study with no derivations or self-referential reductions

Full rationale

The paper reports an empirical study of LLM integration in an astronomy course, including development of AstroTutor, student self-documentation via homework reflections and surveys, analysis of strategy evolution, and comparisons of LLM-assisted grading versus traditional methods. No mathematical derivations, equations, fitted parameters, or first-principles predictions are present that could reduce to inputs by construction. Claims about decreased LLM reliance and skill development rest on direct observational data rather than any self-definitional, fitted-input, or self-citation load-bearing chain. The study is self-contained against its own reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central findings depend on the validity of student self-reporting and the representativeness of the single course studied. There are no free parameters in the mathematical sense, but the interpretation of 'decreased reliance' assumes that usage was measured accurately.

axioms (1)
  • domain assumption: Student self-reports via reflections and surveys validly capture changes in AI usage strategies.
    The main conclusions about decreased reliance and skill development rely on this.
invented entities (1)
  • AstroTutor (no independent evidence)
    purpose: Domain-specific astronomy tutoring system enhanced with curated arXiv content.
    It is a new tool developed for this study.

pith-pipeline@v0.9.0 · 5745 in / 1335 out tokens · 41894 ms · 2026-05-19T10:11:36.638617+00:00 · methodology


