pith. machine review for the scientific record.

arxiv: 2604.25907 · v2 · submitted 2026-04-28 · 💻 cs.LG · cs.AI


How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum


Pith reviewed 2026-05-08 03:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Tsallis q-logarithm · SFT-then-RLVR · cold start · label noise robustness · gradient flow · reasoning models · GARL estimator · PAFT estimator

The pith

The Tsallis q-logarithm loss family unifies SFT and RLVR by trading cold-start speed for noise robustness in reasoning model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper supplies a single-parameter family of losses that accounts for the observed success of running supervised fine-tuning before reinforcement learning with verifiable rewards. The family, called J_Q, uses the Tsallis q-logarithm and runs continuously from the RLVR exploitation pole at q=0 to a density-estimation pole at q=1. Gradient-flow calculations show that the q=1 pole leaves the cold-start regime in time logarithmic in 1/p0 (the initial success probability) yet copies label noise, while the q=0 pole needs time linear in 1/p0 yet stays robust to noise. The standard training order therefore first uses q=1 to exit cold start and then switches to q=0 for stability. Two new Monte Carlo estimators, GARL and PAFT, let practitioners optimize any fixed q without annotated rationales and improve results on FinQA, HotPotQA, and MuSiQue.
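For orientation, a minimal reconstruction of the family from the abstract's description; the paper's exact normalization and expectation structure may differ:

```latex
% Tsallis q-logarithm (standard definition, assumed to match the paper's):
\ln_q(x) = \frac{x^{1-q} - 1}{1-q} \quad (q \neq 1), \qquad
\ln_1(x) = \lim_{q \to 1} \ln_q(x) = \log x.

% Loss family over the per-example success probability P_\theta = P_\theta(y^\ast \mid x^\ast):
J_q(\theta) = -\,\mathbb{E}\big[\ln_q P_\theta\big].

% Poles: J_1 = -\,\mathbb{E}[\log P_\theta]  (log-marginal-likelihood density estimation);
%        J_0 = \mathbb{E}[1 - P_\theta]      (expected failure rate, the RLVR exploitation objective).
```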

Core claim

Every loss in the J_Q family shares the same per-example gradient direction and differs only by the independent amplification factor P_θ^{-q}. Consequently the exploitation pole at q=0 requires Ω(1/p0) steps to escape cold start but resists label noise, whereas the density-estimation pole at q=1 escapes in Θ(log(1/p0)) steps yet memorizes noise. This separation directly accounts for the effectiveness of the stepwise schedule that first applies q=1 and later switches to q=0.
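Under the reconstruction above, the shared-direction property is a one-line chain rule. Since d ln_q(x)/dx = x^{-q},

```latex
\nabla_\theta J_q = -\,\mathbb{E}\big[\, P_\theta^{-q} \, \nabla_\theta P_\theta \,\big],
```

so every member pushes along the same per-example direction ∇_θ P_θ, rescaled by the scalar amplification P_θ^{-q}.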

What carries the argument

The single-parameter Tsallis q-logarithm loss family J_Q that interpolates between RLVR (q=0) and log-marginal-likelihood density estimation (q=1) solely through per-example reweighting by P_θ^{-q}.

If this is right

  • The SFT-then-RLVR ordering follows directly from the differing escape times and noise-robustness properties of the q=1 and q=0 poles.
  • Fixed-q training becomes practical through the Monte Carlo estimators GARL and PAFT, which require no annotated rationales (see the sketch after this list).
  • GARL at high q escapes cold start on FinQA, HotPotQA, and MuSiQue where standard GRPO fails entirely.
  • In stable training regimes low-q GARL improves FinQA while PAFT at q=0.75 remains stable on HotPotQA and MuSiQue.
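A minimal numpy sketch of the two estimators, reconstructed from the pole identities the paper states (GARL at q=0 is Rao-Blackwellized REINFORCE, at q=1 the self-normalized IWAE gradient; PAFT is posterior-resampled SFT scaled by an instance weight that is w̄_M at q=0 and 1 at q=1). The general-q forms below are our interpolation between those poles, not the paper's verbatim equations; here w_m = p_θ(y*|x*, z^(m)) for prior-sampled trajectories z^(m):

```python
import numpy as np

def garl_grad(weights, grad_logp, q):
    """Gradient-Amplified RL (GARL) sketch.
    weights:   (M,) array, w_m = p_theta(y* | x*, z_m) for prior samples z_m.
    grad_logp: (M, D) array, rows are grad_theta log p_theta(z_m, y* | x*).
    q=0 reduces to Rao-Blackwellized REINFORCE; q=1 to the SNIS/IWAE gradient."""
    g_bar = -(weights[:, None] * grad_logp).mean(axis=0)  # (1/M) sum of -w_m g_m
    w_bar = weights.mean()
    return g_bar / w_bar**q                               # amplification ~ P_theta^{-q}

def paft_grad(weights, grad_logp, q, K, rng):
    """Posterior-Attenuated Fine-Tuning (PAFT) sketch.
    Resamples K trajectories in proportion to their weights (approximate
    posterior samples), then applies the instance weight (w_bar)^{1-q}:
    q=0 is posterior-resampled SFT scaled by w_bar ~ P_theta; q=1 is
    uniform SFT on posterior samples (the EM E-step)."""
    w_bar = weights.mean()
    idx = rng.choice(len(weights), size=K, p=weights / weights.sum())
    return -(w_bar ** (1.0 - q)) * grad_logp[idx].mean(axis=0)

# Toy usage: M prior trajectories, D parameters (all values synthetic).
rng = np.random.default_rng(0)
M, D = 16, 4
w = rng.uniform(0.001, 0.1, size=M)   # small success probabilities: cold start
g = rng.normal(size=(M, D))           # stand-in score vectors
print(garl_grad(w, g, q=0.75))
print(paft_grad(w, g, q=0.75, K=8, rng=rng))
```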

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradual rather than abrupt changes in q during training may further improve the speed-robustness balance on new tasks (a toy schedule is sketched after this list).
  • The shared-gradient property could support similar escape-time analyses for other loss interpolations used in language-model post-training.
  • Early training phases that use high q should avoid data with label noise to prevent memorization.
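As an illustration of the first point only: a hypothetical annealing schedule that moves q smoothly from the density-estimation pole toward the exploitation pole. The cosine shape and endpoint values are our assumptions, not the paper's:

```python
import math

def q_schedule(step, total_steps, q_start=1.0, q_end=0.0):
    """Cosine anneal from q_start (fast cold-start escape) to q_end
    (noise-robust exploitation). Purely illustrative."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return q_end + (q_start - q_end) * 0.5 * (1.0 + math.cos(math.pi * t))

# q_schedule(0, 1000) -> 1.0 (SFT-like); q_schedule(1000, 1000) -> 0.0 (RLVR-like)
```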

Load-bearing premise

All members of the J_Q family share the exact same per-example gradient direction and differ only through the independent per-instance reweighting term P_θ^{-q}.

What would settle it

Train models from the same cold-start distribution with controlled label noise, measure steps until accuracy rises and final accuracy after noise injection, and check whether escape time scales as Ω(1/p0) for q near zero and Θ(log(1/p0)) for q near one.
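A toy version of that experiment, under strong simplifying assumptions of our own (a two-outcome softmax "model" in place of a reasoning model, exact gradients in place of Monte Carlo estimates). With success probability p = sigmoid(θ), the chain rule gives ∇_θ J_q = -p^{-q} · p(1-p) = -p^{1-q}(1-p); for small p the dynamics are dp/dt ≈ p^{2-q}, which integrates to roughly 1/p0 escape time at q=0 and log(1/p0) at q=1:

```python
import math

def escape_steps(p0, q, lr=0.05, max_steps=2_000_000):
    """Gradient-descent steps on J_q for a two-outcome softmax 'model'
    (success probability p = sigmoid(theta)) until p exceeds 0.5.
    A toy stand-in for the proposed experiment, not the paper's setup."""
    theta = math.log(p0 / (1.0 - p0))             # logit with sigmoid(theta) = p0
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p > 0.5:
            return step
        theta += lr * p ** (1.0 - q) * (1.0 - p)  # descend J_q via the chain rule
    return max_steps

for p0 in (1e-2, 1e-3, 1e-4):
    print(f"p0={p0:.0e}  q=1: {escape_steps(p0, 1.0):>7}  q=0: {escape_steps(p0, 0.0):>7}")
# Expected pattern: the q=1 column grows like log(1/p0), the q=0 column like 1/p0.
```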

Figures

Figures reproduced from arXiv: 2604.25907 by Chu-Cheng Lin, Eugene Ie.

Figure 1: The J_Q loss family is a continuum between exploitation (q = 0) and density estimation (q = 1) losses (poles at either end of the axis below); correspondingly, commitment is the induced gradient amplification (P_θ^{-q}; top arrow). High q resolves ambiguity (fast cold-start escape) but also memorizes noise; low q resolves noise (robust filtering) but cannot escape cold start. p0 denotes initial success probability.
Figure 2: Two estimators from one gradient identity. …
Figure 3: Cold-start training dynamics on FinQA: maximum amplified advantage …
Figure 4: Warm-start validation maj@16 on HotPotQA at q = 0.25: GARL peaks at step 50 (30.6) and collapses to zero by step 100; PAFT remains stable throughout training and reaches 53.6. At fixed q, the contrast isolates the estimator (prior-sampled, all-M vs. posterior-resampled). PAFT at low q is slow, not collapsed. PAFT at q = 0.25 underperforms on MuSiQue (9.0 vs. 15.8), but validation accuracy is still rising a…
Original abstract

SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_\theta^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Tsallis loss continuum J_Q, a one-parameter family interpolating between RLVR (q=0, exploitation pole) and log-marginal-likelihood density estimation (q=1, density-estimation pole). It claims all J_Q members share identical per-example gradient directions, differing only by per-instance amplification P_θ^{-q}. Gradient-flow analysis derives cold-start escape times Ω(1/p0) for q=0 (noise-robust) versus Θ(log(1/p0)) for q=1 (noise-memorizing), explaining the SFT-then-RLVR ordering. It proposes GARL and PAFT Monte Carlo estimators for fixed-q optimization (shared bias O(q/(M P_θ^q))) and reports empirical gains on FinQA, HotPotQA, and MuSiQue, with GARL mitigating cold-start stalling where GRPO fails.

Significance. If the shared-gradient-direction assumption holds and the escape-time derivations are rigorous, the work supplies a principled unifying account for post-training reasoning models, justifying the standard SFT-then-RLVR pipeline and introducing a tunable loss continuum with rationale-free estimators. The empirical results on three QA benchmarks indicate practical utility for cold-start mitigation and stability tuning.

major comments (2)
  1. [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.
  2. [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.
minor comments (1)
  1. [Experiments] The empirical section would be strengthened by reporting the number of random seeds, standard deviations, and precise GRPO baseline configurations to substantiate claims such as 'escapes cold start where GRPO fails entirely' and the +13.9 m@16 gain on HotPotQA.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment below and commit to providing the requested explicit derivations and verifications in the revised version to strengthen the theoretical contributions.

Point-by-point responses
  1. Referee: [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.

    Authors: We acknowledge that the explicit computation of the gradient ∇J_Q was not presented in sufficient detail in the original submission. The property follows directly from the definition of the Tsallis q-loss, where the gradient with respect to the policy parameters θ decomposes as the expectation over the same importance-weighted direction for all q, scaled by the instance-specific factor P_θ^{-q}. We will include a dedicated subsection in the revised manuscript deriving ∇J_Q explicitly for general q, confirming that no q-dependent directional changes occur beyond the scalar reweighting. This will provide the rigorous verification needed to support the escape-time bounds and the explanation of the SFT-then-RLVR pipeline. revision: yes

  2. Referee: [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.

    Authors: We agree that the full derivations of the time-to-escape expressions and the Monte Carlo bias/variance analysis should be made explicit for verifiability. The bias term O(q/(M P_θ^q)) arises from the finite-sample approximation of the posterior over latent trajectories in the Tsallis framework, and the variance differences stem from the amplification factor in GARL versus the attenuation in PAFT. In the revision, we will expand the appendix with complete step-by-step derivations of the gradient flow dynamics for the escape times and the bias/variance bounds for both estimators, ensuring all central claims are fully supported. revision: yes
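One plausible route to the stated bias order, offered as our reconstruction rather than the paper's derivation: both estimators push the Monte Carlo mean w̄_M (with E[w̄_M] = P_θ for prior-sampled trajectories) through the nonlinearity w̄^{-q} or w̄^{1-q}, and a second-order delta-method expansion gives

```latex
\mathbb{E}\big[\bar{w}_M^{\,-q}\big] \;\approx\; P_\theta^{-q}
\;+\; \frac{q(q+1)}{2}\, P_\theta^{-q-2}\, \frac{\operatorname{Var}(w)}{M},
```

a bias that vanishes at q = 0 and shrinks as 1/M, consistent with the stated O(q/(M P_θ^q)) whenever Var(w) = O(P_θ^2).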

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines the J_Q family via the Tsallis q-logarithm interpolating RLVR (q=0) and log-marginal likelihood (q=1), states the shared per-example gradient direction property as following from that definition, and performs gradient-flow analysis to obtain the Ω(1/p0) vs Θ(log(1/p0)) escape times. No load-bearing step reduces by construction to a fitted quantity, self-citation chain, or renaming; the separation is a direct mathematical consequence of the posited family and its learning-rate-independent reweighting, and there is no sign that the argument's conclusions merely restate its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The claim rests on the Tsallis q-logarithm interpolation, the shared per-example gradient direction, and gradient-flow assumptions; q is the sole explicit free parameter.

free parameters (1)
  • q
    Interpolation parameter between exploitation pole (q=0) and density-estimation pole (q=1); controls amplification P_θ^{-q}.
axioms (2)
  • domain assumption All J_Q members share identical per-example gradient direction, differing only by instance-wise amplification P_θ^{-q}
    Invoked to separate time-to-escape and noise robustness across the continuum.
  • domain assumption Gradient flow analysis governs discrete training dynamics
    Used to derive Ω(1/p0) and Θ(log(1/p0)) escape times.
invented entities (1)
  • Tsallis loss family J_Q no independent evidence
    purpose: Provide continuous interpolation between RLVR and log-marginal-likelihood over latent trajectories
    New single-parameter family introduced to unify the two poles.

pith-pipeline@v0.9.0 · 5742 in / 1626 out tokens · 44958 ms · 2026-05-08T03:05:41.602221+00:00 · methodology

