pith. sign in

arxiv: 2606.11744 · v1 · pith:HAYDHACXnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

Pith reviewed 2026-06-27 09:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Socratic dialoguecurriculum sequencingknowledge graphreinforcement learninglarge language modelstutoring systemsstudent progress modeling
0
0 comments X

The pith

Structuring Socratic tutoring around a prerequisite graph and sequencing policy outperforms scaling large language models alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models struggle with extended tutoring because they must sequence a curriculum, run Socratic dialogue, and infer knowledge state all at once. The paper shows this limitation is not fixed by using larger or specialized models. Instead, the system builds a prerequisite knowledge graph from the query, lets a lightweight policy decide which node to teach next and how many turns to spend, and restricts the language model to dialogue and progress reporting. Across held-out topics the structured approach reaches full mastery at higher rates and with fewer turns than baselines or frontier models. A reader would care because everyday chatbot learning is unstructured and this separation of tasks delivers measurable efficiency gains.

Core claim

By constructing a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, then using a PPO policy to select the next node and the number of dialogue turns to allocate before advancing, while an LLM handles only the Socratic exchange and returns a progress signal, the tutor reaches full curriculum mastery faster and with fewer turns than heuristic baselines, frontier general-purpose models, or models specialized for Socratic dialogue, on both STEM and non-STEM held-out topics.

What carries the argument

A prerequisite knowledge graph whose nodes are subtopics and edges are dependencies, paired with a PPO policy that chooses the next node and the turn budget for it.

If this is right

  • Students reach full curriculum mastery at higher rates than with unstructured or heuristic tutoring.
  • Fewer dialogue turns are required to complete the curriculum.
  • The gains appear on both STEM and non-STEM topics.
  • Explicit structure produces improvements that larger or specialized language models do not match on their own.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of graph construction, sequencing, and dialogue could be tested in other interactive teaching formats such as worked examples or project-based tasks.
  • If graph accuracy improves with better extraction methods, the same policy could support longer multi-session learning paths.
  • The approach invites direct comparison against human tutor sequencing decisions on the same topic graphs.

Load-bearing premise

The automatically constructed prerequisite graph correctly captures topic dependencies and the language model produces a usable progress signal that the policy can rely on for sequencing.

What would settle it

Replace the learned policy with random node selection on the same graph and measure whether mastery rate and turn count remain better than the unstructured frontier models.

Figures

Figures reproduced from arXiv: 2606.11744 by Arunesh Sinha, Pradeep Varakantham, Sidney Tio.

Figure 1
Figure 1. Figure 1: Overview of the RL-based tutoring system. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Node traversal behavior across an episode. Nodes form a tree with the original query [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: illustrates the sequencing behavior of PPO across a sample episode. PPO adopts a localised, BFS-like traversal: it focuses on a single branch, exhausts the leaf nodes of a subtopic, and only then advances to the subtopic node itself before moving on. This yields a structured progression in which prerequisite concepts are consistently covered before their parent topic. Figure 2a shows the depth of nodes vis… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of three models tutoring the same student opening on the Pythagorean Theorem. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that separating curriculum sequencing (via automatic construction of a prerequisite knowledge graph and a lightweight PPO policy deciding which node to teach next and for how many turns) from Socratic dialogue execution (handled by an LLM that also emits a scalar progress signal) yields better outcomes than scaling LLMs or using heuristic/Socratic-specialized baselines. Specifically, the PPO-paired tutor achieves higher rates of students reaching full curriculum mastery and requires fewer turns, across held-out STEM and non-STEM topics; the central thesis is that explicit structure provides gains that model scale alone does not.

Significance. If the reported gains are shown to rest on externally validated components rather than internal LLM signals, the work would be significant for LLM tutoring design: it offers a concrete, modular architecture that addresses the multi-task burden on a single model and demonstrates measurable benefits of curriculum structure over unstructured chat. The separation of concerns is a clear strength and could inform follow-on systems.

major comments (3)
  1. [Evaluation] Evaluation section (and abstract): the progress signal produced by the same LLM that conducts the Socratic turns is used both to train the PPO policy and to declare 'full curriculum mastery' for the headline metrics. No independent validation of this signal against human expert judgments, pre/post knowledge tests, or external rubrics is reported; without it the performance deltas cannot be distinguished from artifacts of the signal's own biases.
  2. [Method] Prerequisite graph construction (and method): the automatically built graph supplies the legal sequences and dependencies that the PPO policy optimizes over. The manuscript supplies no comparison of these edges to expert-curated graphs, human-validated prerequisite relations, or ablation studies that remove or perturb the graph; this component is load-bearing for the claim that explicit structure, rather than the LLM alone, drives the gains.
  3. [Results] Experimental design (abstract and results): comparative outperformance is asserted on mastery rate and turns, yet the abstract and summary provide no details on participant pool size or demographics, exact operationalization of 'mastery,' statistical tests, confidence intervals, or controls for confounds such as topic selection bias or varying student priors. These omissions prevent assessment of whether the data support the central claim.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'held-out STEM and non-STEM topics' is used without enumerating the topics or the hold-out procedure, which would improve clarity and reproducibility.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, with honest indications of where revisions will be made and where limitations must be acknowledged.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (and abstract): the progress signal produced by the same LLM that conducts the Socratic turns is used both to train the PPO policy and to declare 'full curriculum mastery' for the headline metrics. No independent validation of this signal against human expert judgments, pre/post knowledge tests, or external rubrics is reported; without it the performance deltas cannot be distinguished from artifacts of the signal's own biases.

    Authors: We agree this is a substantive limitation. Because the identical LLM-generated signal is used for training, evaluation, and all baselines, relative gains can still be attributed to the PPO sequencing component. However, we cannot rule out that the signal's biases inflate absolute mastery rates. In revision we will add an explicit limitations paragraph discussing this issue and the practical difficulties of external validation in wild, unstructured settings. No new validation data can be supplied from the existing experiments. revision: partial

  2. Referee: [Method] Prerequisite graph construction (and method): the automatically built graph supplies the legal sequences and dependencies that the PPO policy optimizes over. The manuscript supplies no comparison of these edges to expert-curated graphs, human-validated prerequisite relations, or ablation studies that remove or perturb the graph; this component is load-bearing for the claim that explicit structure, rather than the LLM alone, drives the gains.

    Authors: The central claim is that automatic construction of an explicit graph enables better outcomes than unstructured LLM chat; the primary evidence is the performance gap versus the no-graph baselines. We will add a new ablation that removes the graph (allowing the policy to select any node without dependency constraints) to isolate its contribution. Direct comparison to expert-curated graphs is outside the scope of the current work, which deliberately targets automatic, scalable construction without expert input. revision: yes

  3. Referee: [Results] Experimental design (abstract and results): comparative outperformance is asserted on mastery rate and turns, yet the abstract and summary provide no details on participant pool size or demographics, exact operationalization of 'mastery,' statistical tests, confidence intervals, or controls for confounds such as topic selection bias or varying student priors. These omissions prevent assessment of whether the data support the central claim.

    Authors: We will revise the abstract, results section, and add a summary table to include all requested details: participant pool size and demographics, precise mastery definition, statistical tests with p-values and confidence intervals, and controls for topic selection and prior knowledge via randomization. These elements appear in the full experimental protocol and will now be foregrounded. revision: yes

standing simulated objections not resolved
  • Independent external validation of the LLM progress signal against human experts or pre/post tests (would require new data collection).
  • Direct comparison of the automatically constructed graph against expert-curated prerequisite relations (would require additional human annotation not performed in this study).

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external baseline comparisons

full rationale

The paper proposes separating curriculum sequencing (via PPO) from Socratic dialogue and progress inference (via LLM), then reports empirical gains on held-out topics versus heuristic baselines, frontier LLMs, and specialized models. No derivation chain reduces a claimed result to a fitted parameter or self-referential definition; the progress signal and graph are components whose reliability is an assumption, not a quantity defined in terms of the policy output. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided abstract and description. The evaluation metrics are measured against independent baselines, satisfying the self-contained benchmark criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the knowledge graph and PPO policy are described at the level of high-level system components only.

pith-pipeline@v0.9.1-grok · 5778 in / 1189 out tokens · 22784 ms · 2026-06-27T09:54:27.580559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1]

    doi: 10.1080/00461520.2014. 965823. David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, and Mrinmaya Sachan. From problem-solving to teaching problem-solving: Aligning llms with pedagogy using reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–292,

  2. [2]

    If the sources could talk: Evaluating large language models for research assistance in history.arXiv preprint arXiv:2310.10808,

    Giselle Gonzalez Garcia and Christian Weilbach. If the sources could talk: Evaluating large language models for research assistance in history.arXiv preprint arXiv:2310.10808,

  3. [3]

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville

    URL https://www.anthropic.com/news/anthropic-education-report-how-universit y-students-use-claude. Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation. InICLR 2026, May

  4. [4]

    arXiv preprint arXiv:2412.16429 , year=

    URL https://www.microsoft.com/en-us/resear ch/publication/llms-get-lost-in-multi-turn-conversation/. LearnLM Team. Learnlm: Improving gemini for learning.arXiv preprint arXiv:2412.16429,

  5. [5]

    Evaluating gemini in an arena for learning.arXiv preprint arXiv:2505.24477,

    LearnLM Team. Evaluating gemini in an arena for learning.arXiv preprint arXiv:2505.24477,

  6. [6]

    A training-free large reasoning model-based knowledge tracing framework for unified prediction and prescription.arXiv preprint arXiv:2601.01708,

    Unggi Lee, Joo Young Kim, Ran Ju, Minyoung Jung, and Jeyeon Eo. A training-free large reasoning model-based knowledge tracing framework for unified prediction and prescription.arXiv preprint arXiv:2601.01708,

  7. [7]

    Ef- ficient rl for optimizing conversation level outcomes with an llm-based tutor.arXiv preprint arXiv:2507.16252,

    Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, and Lyle Ungar. Ef- ficient rl for optimizing conversation level outcomes with an llm-based tutor.arXiv preprint arXiv:2507.16252,

  8. [8]

    Personal- ized exercise recommendation with semantically-grounded knowledge tracing.arXiv preprint arXiv:2507.11060,

    Yilmazcan Ozyurt, Tunaberk Almaci, Stefan Feuerriegel, and Mrinmaya Sachan. Personal- ized exercise recommendation with semantically-grounded knowledge tracing.arXiv preprint arXiv:2507.11060,

  9. [9]

    Qwen2.5 Technical Report

    URLhttps://arxiv.org/abs/2412.15115. Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8,

  10. [10]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  11. [11]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    URLhttps://arxiv.org/abs/2402.03300. Kumar Shridhar, Jakub Macina, Mennatallah El-Assady, Tanmay Sinha, Manu Kapur, and Mrinmaya Sachan. Automatic generation of socratic subquestions for teaching math word problems. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Proce...

  12. [12]

    doi: 10.18653/v1/2022.e mnlp-main.277

    Association for Computational Linguistics. doi: 10.18653/v1/2022.e mnlp-main.277. URLhttps://aclanthology.org/2022.emnlp-main.277/. Olivia Sidoti, Eugenie Park, and Jeffrey Gottfried. About a quarter of U.S. teens have used ChatGPT for schoolwork – double the share in 2023,

  13. [13]

    Anni Siren and Vassilios Tzerpos

    URL https://www.pewresearch.org/shor t-reads/2025/01/15/about-a-quarter-of-us-teens-have-used-chatgpt-for-sch oolwork-double-the-share-in-2023/. Anni Siren and Vassilios Tzerpos. Automatic learning path creation using oer: a systematic literature mapping.IEEE Transactions on Learning Technologies,

  14. [14]

    doi: 10.1080/00461520.2011. 611369. Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Kun Kuang, and Zhongxiang Dai. Uco: A multi-turn interactive reinforcement learning method for adaptive teaching with large language models,

  15. [15]

    Eamon Worden, Cristina Heffernan, Neil Heffernan, and Shashank Sonkar

    URLhttps://arxiv.org/abs/2511.08873. Eamon Worden, Cristina Heffernan, Neil Heffernan, and Shashank Sonkar. Foundationalassist: An educational dataset for foundational knowledge tracing and pedagogical grounding of llms.arXiv preprint arXiv:2602.00070,

  16. [16]

    A comprehensive review of ai-based intelligent tutoring systems: applications and challenges.arXiv preprint arXiv:2507.18882,

    Meriem Zerkouk, Miloud Mihoubi, and Belkacem Chikhaoui. A comprehensive review of ai-based intelligent tutoring systems: applications and challenges.arXiv preprint arXiv:2507.18882,

  17. [17]

    13 A End-to-End RL for Socratic Qualities Model: finetuned Teacher That’s great to hear! You’ve told me that you’re interested in learning about Pythagorean Theorem. [. . . ] Can you tell me, do you know how it relates to triangles? Student The theorem can help us find the perimeter of any triangle if we know two sides. Teacher That’s a great insight! You...

  18. [18]

    That’s great to hear! You mentioned that you’re

    Looking at these conversations side by side makes it easy to see how each model handles the same student query and how their styles diverge. We discover that the finetuned Qwen2.5-7B has collapsed into reward hacking behavior. In particular, the model leans heavily on a small set of repeated openers such as “That’s great to hear! You mentioned that you’re...

  19. [19]

    Limitations

    Rows shaded green are our proposed methods; rows shaded grey are LLM-based comparisons. Steps is∆% relative to Oracle (lower is better; Oracle raw mean =31.00±0.71 steps). Mastery is reported as a percentage. All CIs are 95% (t-distribution). †Methods using ground-truth mastery signals are not deployable. Scope Obs. Method n Success (%) Steps∆%Mastery (%)...