arxiv: 2605.13709 · v1 · pith:F4G3NPNQnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.LG

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Pith reviewed 2026-05-14 19:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords children's stories, LLM fine-tuning, reading difficulty, controllable generation, educational AI, safety evaluation, compact models, supervised fine-tuning

The pith

Fine-tuned 8B LLMs generate children's reading stories that better match target difficulty levels than zero-shot outputs from GPT-4o or Llama 3.3 70B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that supervised fine-tuning of three different 8B-parameter LLMs on stories produced from an expert-designed children's reading curriculum lets these compact models control reading difficulty more precisely than large models used without fine-tuning. This approach keeps safety issues minimal while cutting the high operational costs of relying on frontier-scale LLMs for everyday educational use. Educators, parents, and children could therefore generate customized English stories at home or in classrooms with targeted readability and without needing expensive API calls.

Core claim

Using stories generated by GPT-4o and Llama 3.3 70B from an existing expert curriculum as training data, the authors fine-tune 8B LLMs so that the resulting stories score better on quantitative difficulty metrics than the zero-shot large-model baselines, while qualitative safety reviews find almost no discernible problems. The method keeps the focus on controllability rather than model scale.
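
This page does not say which quantitative difficulty metrics the paper uses. As a stand-in only, the sketch below computes the Flesch-Kincaid grade level, a widely used readability formula, to show what "scoring a story on a difficulty metric" can look like. The syllable counter is a rough heuristic and the two sample texts are invented for illustration; neither comes from the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: higher values indicate harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Invented sample texts, not stories from the paper.
easy = "The cat sat on the mat. The dog ran to the cat."
hard = "The inquisitive feline deliberately positioned itself upon the decorative carpet."

print(f"easy: {flesch_kincaid_grade(easy):.2f}")
print(f"hard: {flesch_kincaid_grade(hard):.2f}")
```

A formula like this rewards short sentences and short words only, which is one reason the "Load-bearing premise" section below matters: a low score is a proxy, not evidence that a child can read the story.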

What carries the argument

Supervised fine-tuning of compact 8B LLMs on curriculum-derived story pairs that encode specific reading levels and error patterns, allowing the small models to reproduce controllable difficulty and safety.
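
To make "curriculum-derived story pairs" concrete, here is a minimal sketch of how one training record might be assembled: a prompt that encodes the lesson's reading level and target pattern, paired with a story from a larger teacher model. The `Lesson` fields, the prompt wording, and the `prompt`/`completion` keys are all assumptions for illustration; the paper's actual data format is not described on this page.

```python
import json
from dataclasses import dataclass


@dataclass
class Lesson:
    # All field names are hypothetical, chosen for this sketch only.
    lesson_id: str
    grade_band: str          # e.g. "K-2"
    target_pattern: str      # pattern the lesson practises
    target_words: list[str]  # words the story should include


def build_sft_record(lesson: Lesson, teacher_story: str, topic: str) -> dict:
    """Pair a difficulty-conditioned prompt with a teacher-model story."""
    prompt = (
        f"Write a short story for a {lesson.grade_band} reader about {topic}.\n"
        f"Practise this pattern: {lesson.target_pattern}.\n"
        f"Use these words: {', '.join(lesson.target_words)}."
    )
    return {
        "lesson_id": lesson.lesson_id,
        "prompt": prompt,
        "completion": teacher_story.strip(),
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records one JSON object per line, a common SFT input layout."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Invented example lesson and story, not taken from the curriculum.
lesson = Lesson("L-001", "K-2", "short a", ["cat", "mat", "sat"])
record = build_sft_record(lesson, "  The cat sat on the mat.  ", "a sleepy cat")
print(to_jsonl([record]))
```

Keeping the lesson identifier on each record makes it possible to hold out whole lessons for evaluation rather than individual stories.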

If this is right

  • Teachers can generate new stories at any chosen reading level without paying per-token costs for large models.
  • The same fine-tuning pipeline can be reused whenever a new curriculum or set of target error patterns becomes available.
  • Local or low-cost deployment of the 8B models becomes practical for classrooms and homes.
  • Safety filtering can be baked into the fine-tuning data rather than added as a separate post-processing step.
  • Story generation can be iterated quickly to match individual student progress within the curriculum framework.
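
The point about baking safety into the fine-tuning data can be sketched as a filtering pass that runs before training. The keyword blocklist below is a deliberately crude, hypothetical stand-in; the paper's qualitative safety review is not specified on this page and would almost certainly be more nuanced than word matching.

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKED_TERMS = frozenset({"gun", "blood", "kill"})


def is_safe(story: str, blocked: frozenset = BLOCKED_TERMS) -> bool:
    """Return True if no blocked term appears as a whole word in the story."""
    tokens = set(re.findall(r"[a-z']+", story.lower()))
    return tokens.isdisjoint(blocked)


def filter_training_set(records: list[dict], blocked: frozenset = BLOCKED_TERMS):
    """Split records into (kept, dropped) based on the completion text."""
    kept, dropped = [], []
    for record in records:
        target = kept if is_safe(record["completion"], blocked) else dropped
        target.append(record)
    return kept, dropped


# Invented records, not stories from the paper.
records = [
    {"prompt": "p1", "completion": "Sam has a new skill. He can hop."},
    {"prompt": "p2", "completion": "The man had a gun."},
]
kept, dropped = filter_training_set(records)
print(len(kept), "kept;", len(dropped), "dropped")
```

Matching on whole tokens keeps "skill" from being flagged because it contains "kill". Dropped records are returned rather than discarded so that a human reviewer can audit what the filter removed.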

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on other languages by substituting equivalent expert curricula and measuring the same difficulty metrics.
  • Integration into classroom software might allow real-time adjustment of story difficulty based on a student's recent reading performance.
  • If the fine-tuned models retain engagement while controlling difficulty, they could reduce the need for human-authored leveled readers in some settings.
  • Privacy improves because story generation can stay on-device instead of sending prompts to cloud APIs.
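
The real-time adjustment idea could be prototyped with a simple rule that moves a student up or down one curriculum level based on recent reading accuracy. Everything here is an assumption made for illustration: the level range, the accuracy thresholds, and the idea that accuracy is measured per story. None of it comes from the paper.

```python
def next_level(
    current: int,
    recent_accuracy: list[float],
    *,
    min_level: int = 1,
    max_level: int = 10,
    step_up_at: float = 0.95,
    step_down_at: float = 0.85,
) -> int:
    """Pick the level for the next story from recent per-story accuracy.

    Thresholds and the level range are hypothetical defaults.
    """
    if not recent_accuracy:
        return current  # no evidence yet, so stay at the current level
    mean_accuracy = sum(recent_accuracy) / len(recent_accuracy)
    if mean_accuracy >= step_up_at:
        proposed = current + 1
    elif mean_accuracy < step_down_at:
        proposed = current - 1
    else:
        proposed = current
    # Clamp to the range of levels the curriculum defines.
    return max(min_level, min(max_level, proposed))


print(next_level(3, [0.97, 0.96]))
print(next_level(3, [0.70, 0.80]))
```

The chosen level would then be written into the generation prompt, which is where a difficulty-controllable model earns its keep.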

Load-bearing premise

The chosen quantitative difficulty metrics and qualitative safety checks accurately reflect what real children experience as readable and safe.

What would settle it

A blind test in which children or teachers rate stories from the fine-tuned 8B models as harder to read or less engaging than zero-shot GPT-4o stories on the same curriculum topics.

Figures

Figures reproduced from arXiv: 2605.13709 by Bonnie J. Dorr, Fanghua Cao, Min Yao, Qian Shen, Shlok Gilda, and Walter L. Leite (University of Florida, Gainesville, USA).

Figure 1: System architecture and experimental workflow for generating children’s English reading stories via …
Figure 2: A qualitative example of a bad case and a good case in our stories.
Figure 3: Evaluation metrics across 8B models and SFT strategies. We compare our four fine-tuning methods …
Figure 4: An example of the lessons in the K–2 English reading curriculum we used.

Original abstract

Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories with children's interests, controllable difficulty and safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper describes fine-tuning three 8B-parameter LLMs on children's English reading stories generated by GPT-4o and Llama 3.3 70B from an expert-designed curriculum. It claims that appropriately fine-tuned compact models produce stories with better difficulty-related metrics than zero-shot larger models, while offering controllability over reading levels and safety with negligible safety issues.

Significance. If the reported difficulty metrics prove to be reliable proxies for actual child readability and the safety claims hold under independent scrutiny, the work could support affordable, controllable story generation for classroom and home use. The emphasis on compact models and curriculum-driven fine-tuning is a practical strength, but the lack of external validation against child outcomes or educator judgment reduces immediate applicability.

major comments (2)
  1. [Abstract] Abstract: The central claim that fine-tuned 8B models 'perform better on difficulty-related metrics' than zero-shot GPT-4o and Llama 3.3 70B is asserted without any reported numerical values, baselines, sample sizes, or statistical tests, preventing assessment of effect size or significance.
  2. [Evaluation] Evaluation section (inferred from abstract description): No correlation is reported between the chosen quantitative difficulty metrics and real-world child comprehension measures such as reading fluency scores or comprehension quizzes, leaving open the possibility that metric improvements reflect stylistic mimicry of the GPT-generated training data rather than genuine simplification.
minor comments (1)
  1. [Abstract] The abstract refers to 'quantitative and qualitative evaluation' but does not specify the exact metrics or the protocol for the qualitative safety checks.
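
Major comment 1 asks for effect sizes and statistical tests. One way such a comparison could be reported is sketched below: Cohen's d for the effect size and a two-sided permutation test for the p-value, both on per-story difficulty scores. The scores are made-up placeholders so the code runs; they are not results from the paper, and the choice of test is an assumption rather than the authors' method.

```python
import random
from statistics import fmean, stdev


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / pooled_var ** 0.5


def permutation_p_value(
    a: list[float], b: list[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[: len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Placeholder grade-level scores, invented for this sketch.
fine_tuned_8b = [1.2, 1.5, 0.9, 1.1, 1.4, 1.0]
zero_shot_large = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0]

print("Cohen's d:", round(cohens_d(fine_tuned_8b, zero_shot_large), 2))
print("p-value:", permutation_p_value(fine_tuned_8b, zero_shot_large))
```

A permutation test avoids assuming that the scores are normally distributed, which is a reasonable default when the number of stories per condition is small.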

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation. We address each major comment below and indicate where revisions will be made to improve clarity and transparency.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that fine-tuned 8B models 'perform better on difficulty-related metrics' than zero-shot GPT-4o and Llama 3.3 70B is asserted without any reported numerical values, baselines, sample sizes, or statistical tests, preventing assessment of effect size or significance.

    Authors: We agree that the abstract would be strengthened by including concrete numerical support for the central claim. In the revised version, we will add a summary sentence reporting key results, including specific difficulty metric values (e.g., average scores on the chosen proxies), the number of stories evaluated per condition, and any statistical comparisons against the zero-shot baselines. Full tables, baselines, and test details will continue to appear in the Evaluation section. revision: yes

  2. Referee: [Evaluation] Evaluation section (inferred from abstract description): No correlation is reported between the chosen quantitative difficulty metrics and real-world child comprehension measures such as reading fluency scores or comprehension quizzes, leaving open the possibility that metric improvements reflect stylistic mimicry of the GPT-generated training data rather than genuine simplification.

    Authors: We acknowledge that our evaluation relies on quantitative proxies derived from the expert curriculum together with qualitative educator review rather than direct child outcome measures. These proxies follow established readability research and were chosen to enable controllable generation aligned with the curriculum; we also show that the fine-tuned models outperform the teacher models on the same metrics while improving safety. We will revise the Evaluation and Limitations sections to explicitly discuss this choice, address the risk of stylistic mimicry, and note that direct correlation with child fluency or quiz scores would require separate human-subject studies outside the present scope. We believe the current evidence supports genuine controllability gains, but we accept that stronger external validation would further strengthen the claims. revision: partial
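
The external validation discussed in this exchange would typically come down to a rank correlation between a difficulty metric and a child outcome measure. The sketch below implements Spearman's rho with average ranks for ties. The paired values are invented placeholders standing in for "metric score per story" and "mean comprehension quiz score per story"; no such data appears on this page.

```python
from statistics import fmean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented placeholder pairs: (difficulty metric, quiz score) per story.
difficulty = [1.0, 1.8, 2.5, 3.1, 4.0]
quiz_score = [0.95, 0.90, 0.80, 0.82, 0.60]

print("Spearman rho:", round(spearman(difficulty, quiz_score), 2))
```

A strongly negative rho on real data would support the premise that the metric tracks what children find hard; a value near zero would support the referee's worry about stylistic mimicry.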

Circularity Check

0 steps flagged

No circularity: derivation uses external expert curriculum and zero-shot baselines without self-referential reduction

Full rationale

The paper trains compact 8B models via supervised fine-tuning on stories that GPT-4o and Llama 3.3 70B generated from an existing expert-designed children's reading curriculum, then evaluates the outputs against zero-shot generations from the same large models using quantitative difficulty metrics and qualitative safety checks. No equations, parameter fits, or claims reduce by construction to the inputs; the central performance claim rests on external curriculum data and direct comparison to zero-shot baselines rather than any self-definition, fitted-input renaming, or self-citation load-bearing step. The chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that supervised fine-tuning on stories generated by GPT-4o and Llama 3.3 70B transfers controllability without loss of quality. The abstract explicitly introduces no free parameters, axioms, or invented entities; the single axiom listed below is invoked implicitly.

axiom (1)
  • Domain assumption: Supervised fine-tuning on high-quality generated data improves controllability of output properties such as reading difficulty.
    Invoked implicitly when claiming that fine-tuning yields better difficulty metrics than zero-shot generation.

pith-pipeline@v0.9.0 · 5548 in / 1107 out tokens · 36223 ms · 2026-05-14T19:28:28.671467+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 2 internal anchors
