
arxiv: 2605.30651 · v1 · pith:MUY5UOQ7new · submitted 2026-05-28 · 💻 cs.LG · cs.AI

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

Pith reviewed 2026-06-29 08:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reasoning distillation · trajectory selection · learnability · data selection · supervised fine-tuning · policy regularization · theoretical guarantees

The pith

LARK selects reasoning trajectories by estimating how quickly a student model's loss decreases, using a proxy and chi-squared regularization to balance learnability with full distribution coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LARK as a method for choosing which teacher-generated reasoning trajectories to use when distilling knowledge into a student model. Existing approaches pick trajectories based on quality scores or model confidence, but LARK instead focuses on whether the student can actually learn from them efficiently. It introduces a learnability factor rho that measures the rate of loss decrease during student training and estimates this factor with a proxy that avoids running full training on every option. A chi-squared regularized policy then selects a subset that favors high learnability while still covering the original data distribution, all backed by error bounds. Experiments show the selected trajectories speed up fine-tuning and improve results on reasoning tasks.

Core claim

LARK defines the learnability factor rho as the rate at which student training loss decreases on a given trajectory. It estimates rho via a learnability proxy whose estimation error is theoretically bounded, then applies a chi-squared regularized selection policy that trades off this learnability against distributional coverage, again with error guarantees. The resulting trajectories produce faster supervised fine-tuning loss reduction while preserving generalization, outperforming heuristic baselines across base models and tasks.
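As a rough illustration of what "rate at which student training loss decreases" can mean in practice, the sketch below fits a straight line to a recorded loss curve and returns the negated slope. This is an assumption made for illustration only: the summary does not give the paper's exact definition of ρ, and the function name `loss_decrease_rate` and the example curves are hypothetical.

```python
def loss_decrease_rate(losses):
    """Negated least-squares slope of loss against step index.

    Illustrative stand-in for a learnability-style rate; NOT the
    paper's definition of rho, which this summary does not spell out.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two loss values")
    mean_t = (n - 1) / 2.0          # mean of step indices 0..n-1
    mean_l = sum(losses) / n
    cov = sum((t - mean_t) * (l - mean_l) for t, l in enumerate(losses))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return -cov / var               # positive when the loss is falling


# Hypothetical loss curves for two candidate trajectories.
fast = [2.0, 1.5, 1.0, 0.5]
slow = [2.0, 1.9, 1.8, 1.7]
print(loss_decrease_rate(fast), loss_decrease_rate(slow))
```

Under this stand-in, a trajectory whose loss falls faster gets a larger rate. Computing it this way requires a training run per candidate, which is the cost the paper's proxy is meant to avoid.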

What carries the argument

The learnability factor rho, estimated by a learnability proxy and combined with a chi-squared regularized selection policy that enforces coverage of the full training distribution.
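To make the trade-off between learnability and coverage concrete, here is a minimal sketch of one standard χ²-regularized weighting. It assumes the objective "maximize Σ q_k g_k − (λ/2) Σ (q_k − p_k)²/p_k over the probability simplex", where g_k are per-trajectory scores, p is a base distribution with strictly positive entries summing to 1, and λ > 0. The objective, the function name `chi2_regularized_weights`, and the parameter `lam` are assumptions for illustration; the paper's actual policy may differ.

```python
def chi2_regularized_weights(scores, base, lam, iters=100):
    """Weights q maximizing <q, scores> - (lam/2) * chi2(q || base).

    The KKT conditions give q_k = p_k * max(0, 1 + (g_k - nu) / lam),
    with nu chosen so that the weights sum to 1 (found by bisection).
    Assumed form for illustration; not taken from the paper.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")

    def total(nu):
        return sum(p * max(0.0, 1.0 + (g - nu) / lam)
                   for g, p in zip(scores, base))

    # total(nu) is non-increasing; the root lies in [min g, max g].
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    q = [p * max(0.0, 1.0 + (g - nu) / lam) for g, p in zip(scores, base)]
    z = sum(q)                      # remove residual bisection error
    return [w / z for w in q]


# Hypothetical scores for three candidates under a uniform base.
scores = [0.1, 0.2, 0.3]
base = [1 / 3, 1 / 3, 1 / 3]
q_covering = chi2_regularized_weights(scores, base, lam=1000.0)
q_greedy = chi2_regularized_weights(scores, base, lam=1e-3)
```

With a large `lam` the weights stay close to the base distribution (coverage dominates); with a small `lam` they concentrate on the highest-scoring candidate (learnability dominates).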

If this is right

  • LARK-selected trajectories induce faster loss reduction during the student's supervised fine-tuning phase.
  • The LARK score serves as a predictor of a trajectory's downstream training utility.
  • The method yields consistent gains over data selection baselines on multiple base models and reasoning tasks.
  • Both the proxy estimation and the regularized policy come with explicit bounds on approximation error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy idea might allow skipping low-learnability examples in other supervised stages such as instruction tuning.
  • If the chi-squared term successfully prevents collapse to narrow subsets, similar regularization could stabilize selection in reinforcement learning from AI feedback loops.
  • Diagnostic checks that correlate the proxy score with actual loss curves could become a lightweight way to audit training data quality before full runs.
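The diagnostic mentioned in the last bullet could be sketched as a rank correlation between proxy scores and measured loss reductions. The code below is a plain-Python Spearman correlation; the variable names and the example numbers are hypothetical and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical audit data: proxy score vs. measured loss reduction.
proxy_scores = [0.1, 0.4, 0.2, 0.9]
loss_drops = [0.05, 0.30, 0.12, 0.80]
print(spearman(proxy_scores, loss_drops))
```

A value near 1 would suggest the proxy orders trajectories the same way the measured loss curves do; a value near 0 would suggest the proxy is uninformative for that data.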

Load-bearing premise

The learnability proxy can accurately estimate the true rate of student loss decrease on each trajectory without requiring complete training runs on every candidate.

What would settle it

Run full supervised fine-tuning on trajectories chosen by LARK and on trajectories chosen by quality or confidence heuristics. If the LARK set produces measurably slower loss decrease or worse final accuracy on held-out reasoning problems, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.30651 by Amanda Hughes, Chih-Chun Chen, Fenglong Ma, Kaixiang Zhao, Porter Jenkins, Taylor W. Killian, Tianrun Yu, Weitong Zhang.

Figure 1: Overview of the LARK pipeline. For each question, multiple teacher-generated reasoning …
Figure 2: Ablation of the two LARK components on Qwen-2.5-7B ($B = 3$). Values are Acc@5 percentages. (a) Score ablation. (b) Weighting ablation.
Figure 3: Three pieces of evidence linking $\hat{g}_k$ to learnability. (a) $\Delta\rho(q)$ landscape on the simplex (PCA); black dots are $\hat{q}(B)$ for $B = 1, \dots, 32$. LARK selections trace the $\Delta\rho \ge 0$ region (green) and avoid $\Delta\rho < 0$ (red). (b) SFT loss reduction $\Delta L_t = L_t - L_0$ on Qwen-2.5-7B ($B = 3$); LARK induces the largest SFT loss decay. (c) Per-teacher mean of $\hat{g}_k$ vs. AMC Acc@5.
… strongly correlated with downstream AMC accuracy (Figure …
Figure 4: Token-level behavior on Qwen-2.5-7B. LARK-selected trajectories have lower correct-token probability $\bar{p}(y_t)$ (top) but higher wrong-token concentration $\sum_{v \neq y_t} p(v)^2$ (bottom).
Mechanism: confidently wrong, in a structured way. The three results above establish that maximizing $\hat{g}_k$ raises $\rho$, accelerates SFT loss decay, and predicts utility, but do not explain why the selected trajectories carry strong signa…
Figure 5: Empirical verification of Condition 1 on Qwen-2.5-7B for $B \in \{1, 3, 5, 10, 20\}$. Blue: $\rho_t$ (left axis); red dashed: $\hat{\kappa}_t = \rho_t/\rho_0$ (right axis). Panel titles report $\hat{\kappa}(B) = \min_t \hat{\kappa}_t$ and the step at which it is attained.
… and the anchor-relative ratio $\hat{\kappa}_t \triangleq \rho_t/\rho_0$, which is the discrete-time analog of $\rho(\phi_{Q(B)}(s); Q(B)) / \rho(\theta_{\mathrm{ref}}; Q(B))$ in Condition 1. After training, for each $B$ we report the trajectory-wise minim…
Figure 6: Empirical correlation between $\hat{g}$ and $g^*$ on Qwen-2.5-7B. (a) Per-question affine-rescaled $\hat{g}_k$ versus exact $g^*_k$ across $50 \times 33$ candidates; pooled Pearson $r = 0.748$ and Spearman $\rho = 0.566$. (b) Recall@$B$ between the top-$B$ rankings of $\hat{g}$ and $g^*$ as a function of selection budget $B$; the dashed gray line is the random baseline $B/K$. (c) Histogram of per-question relative $p$-weighted RMS error $\|\hat{g} - g^*\|_p / \|g^*\|_p$ ac…
Figure 7: Per-sample wall-clock cost of trajectory scoring on 8 A100 GPUs. The number in parentheses is the slowdown relative to the fastest method. LARK is essentially as fast as GRAPE, 1.6× faster than LLM-judged Quality, 2.7× faster than RSR, and 4.0× faster than Local Naturalness. Methods that require no forward pass through the student (Random, Token Length, Rule-based Quality) are omitted because their cost is…
Figure 8: Selection budget scaling of LARK on Qwen-2.5-7B, evaluated on AMC after fine-tuning on the 500-problem subset. LARK achieves strong performance across $B \in \{1, 3, 5, 10, 20\}$, with the best result obtained at $B = 10$.
… behavior of the selection rule. We walk through (i) the problem and its candidate pool, (ii) the LARK score landscape, (iii) the top-1 trajectory selected by LARK, and (iv) the closed-form $\chi^2$-B …
Figure 9: $\hat{g}_k$ for all 33 candidate trajectories on problem pid = 839, grouped by teacher. Each teacher contributes 3 rollouts (indexed .1/.2/.3 along the x-axis). The top-1 trajectory selected by LARK (phi4-reason-plus rollout 2, $\hat{g} = 0.01782$) is highlighted with a red border. Within-teacher variation is comparable to between-teacher variation.
Thus final answer: 5/2. We'll produce answer in box: \boxed{5/2}. </thin…
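The Recall@B diagnostic named in the Figure 6 caption can be sketched as the overlap between the top-B candidates under the proxy ranking and under the exact ranking, divided by B. The caption gives the random baseline as B/K for K candidates. The function names, the tie-breaking by index, and the example scores below are assumptions for illustration, not the paper's code or data.

```python
def top_b(scores, b):
    """Indices of the b highest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:b])


def recall_at_b(proxy, exact, b):
    """Fraction of the exact top-b set recovered by the proxy top-b set."""
    return len(top_b(proxy, b) & top_b(exact, b)) / b


# Hypothetical proxy and exact scores for K = 5 candidates.
proxy = [0.9, 0.1, 0.8, 0.3, 0.2]
exact = [0.7, 0.2, 0.9, 0.1, 0.6]
for b in (2, 3):
    print(b, recall_at_b(proxy, exact, b), b / len(proxy))
```

Comparing each recall value with B/K shows whether the proxy ranking beats a random selection of the same size.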
Original abstract

We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $\rho$, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $\chi^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction. Our code is available at https://github.com/Tianrun-Yu/LARK.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces LARK, a method for selecting teacher-generated reasoning trajectories for efficient distillation into a student model. It defines a learnability factor ρ characterizing the rate of decrease in the student's training loss, estimates ρ via a learnability proxy, and employs a χ²-regularized selection policy to balance learnability against distributional coverage. The paper claims strong theoretical guarantees on the estimation error of both the proxy and the policy, and reports that LARK consistently outperforms heuristic baselines across base models and reasoning tasks, with diagnostics showing that the LARK score predicts downstream utility and induces faster SFT loss reduction.

Significance. If the proxy is shown to be a reliable low-bias surrogate for true ρ obtained from full SFT and the concentration bounds apply under the non-convex dynamics of LLM fine-tuning, the work would supply a principled, theoretically grounded alternative to quality- or confidence-based selection heuristics. This could reduce the computational cost of reasoning distillation while preserving generalization, and the public code release supports direct reproducibility checks.

major comments (1)
  1. [Abstract] Abstract (and the theory section deriving the proxy): the central claim that the learnability proxy yields an estimate of ρ whose error is controlled tightly enough for the χ²-regularized policy to both prefer high-ρ trajectories and preserve generalization rests on the unverified assumption that the proxy functional form is an unbiased or low-bias surrogate for the true ρ obtained by running full supervised fine-tuning on each trajectory. The provided concentration bounds are derived under this surrogate relationship; if the short-horizon or linearized proxy fails to track the actual non-convex loss trajectory, the selection policy can systematically favor trajectories whose apparent ρ is high while actual downstream loss reduction is not, directly undermining both the efficiency and generalization guarantees.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for isolating the key assumption underlying the proxy and its theoretical guarantees. We respond point-by-point below.

  1. Referee: [Abstract] Abstract (and the theory section deriving the proxy): the central claim that the learnability proxy yields an estimate of ρ whose error is controlled tightly enough for the χ²-regularized policy to both prefer high-ρ trajectories and preserve generalization rests on the unverified assumption that the proxy functional form is an unbiased or low-bias surrogate for the true ρ obtained by running full supervised fine-tuning on each trajectory. The provided concentration bounds are derived under this surrogate relationship; if the short-horizon or linearized proxy fails to track the actual non-convex loss trajectory, the selection policy can systematically favor trajectories whose apparent ρ is high while actual downstream loss reduction is not, directly undermining both the efficiency and generalization guarantees.

    Authors: We agree that the concentration bounds are formally conditional on the proxy being a reasonable surrogate for the true ρ obtained from full SFT. The manuscript does not claim a general theoretical proof that the short-horizon proxy is unbiased under arbitrary non-convex LLM dynamics. Instead, it supplies (i) finite-sample error bounds on the proxy estimator itself (given the surrogate relationship) and (ii) a χ²-regularized policy whose guarantees hold once the proxy estimates are in hand. The practical validity of the surrogate is addressed empirically: Section 5.3 and the associated diagnostics show that the LARK score (derived from the proxy) strongly predicts downstream utility and that trajectories selected by LARK produce measurably faster SFT loss reduction than heuristic baselines across multiple models and tasks. These results indicate that, at least for the reasoning-distillation regimes studied, the proxy tracks actual loss trajectories sufficiently well for the selection policy to improve both efficiency and generalization. A complete non-convex analysis of the proxy bias remains an open question and is outside the scope of the present work.

    revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines learnability factor ρ directly from the observed rate of student loss decrease and introduces a separate proxy estimator plus χ²-regularized policy, each accompanied by claimed concentration bounds on estimation error. No equations or steps are shown that reduce the final selection policy or performance claims back to the definition of ρ by construction, nor is there load-bearing self-citation or an ansatz smuggled via prior work. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies no explicit free parameters, axioms, or invented entities; the learnability factor ρ and the χ²-regularized policy are introduced as part of the method, but their precise parameterization is not detailed.

pith-pipeline@v0.9.1-grok · 5750 in / 1132 out tokens · 29856 ms · 2026-06-29T08:14:10.422406+00:00 · methodology

