How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Pith reviewed 2026-05-08 03:05 UTC · model grok-4.3
The pith
The Tsallis q-logarithm loss family unifies SFT and RLVR by trading cold-start speed for noise robustness in reasoning model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Every loss in the J_Q family shares the same per-example gradient direction and differs only by the per-instance amplification factor P_θ^{-q}, applied independently of the learning rate. Consequently, the exploitation pole at q=0 requires Ω(1/p0) steps to escape cold start but resists label noise, whereas the density-estimation pole at q=1 escapes in Θ(log(1/p0)) steps yet memorizes noise. This separation directly accounts for the effectiveness of the stepwise schedule that applies q=1 first and q=0 afterward.
What carries the argument
The single-parameter Tsallis q-logarithm loss family J_Q, which interpolates between RLVR (q=0) and log-marginal-likelihood density estimation (q=1) solely through per-example reweighting by P_θ^{-q}.
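This reweighting is concrete enough to check numerically. A minimal sketch, assuming the standard Tsallis definition ln_q(p) = (p^{1−q} − 1)/(1−q) and writing P for the model's marginal probability of the gold answer (the endpoint interpretations follow the abstract; the finite-difference check is ours):

```python
import math

def ln_q(p, q):
    """Tsallis q-logarithm: (p^(1-q) - 1) / (1 - q), with the q -> 1 limit log p."""
    if abs(1.0 - q) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

P = 0.3  # marginal probability of the correct answer, standing in for P_theta
assert abs(ln_q(P, 0.0) - (P - 1.0)) < 1e-12    # q=0: affine in P, so maximizing it maximizes P (RLVR-style)
assert abs(ln_q(P, 1.0) - math.log(P)) < 1e-12  # q=1: log-likelihood (density estimation)

# Shared direction, q-dependent amplification: d/dP ln_q(P) = P^(-q),
# so grad_theta ln_q(P_theta) = P_theta^(-q) * grad_theta P_theta for every q.
for q in (0.0, 0.5, 1.0):
    eps = 1e-6
    numeric = (ln_q(P + eps, q) - ln_q(P - eps, q)) / (2 * eps)
    assert abs(numeric - P ** (-q)) < 1e-4
```

For every q the gradient keeps the direction of ∇P; only the scalar magnitude P^{-q} changes, which is the reweighting the core claim rests on.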
If this is right
- The SFT-then-RLVR ordering follows directly from the differing escape times and noise-robustness properties of the q=1 and q=0 poles.
- Fixed-q training becomes practical through the Monte Carlo estimators GARL and PAFT that require no annotated rationales.
- GARL at high q escapes cold start on FinQA, HotPotQA, and MuSiQue where standard GRPO fails entirely.
- In stable training regimes low-q GARL improves FinQA while PAFT at q=0.75 remains stable on HotPotQA and MuSiQue.
Where Pith is reading between the lines
- Gradual rather than abrupt changes in q during training may further improve the speed-robustness balance on new tasks.
- The shared-gradient property could support similar escape-time analyses for other loss interpolations used in language-model post-training.
- Early training phases that use high q should avoid data with label noise to prevent memorization.
Load-bearing premise
All members of the J_Q family share the exact same per-example gradient direction and differ only through the independent per-instance reweighting term P_θ^{-q}.
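Assuming the per-example loss is ℓ_q = −ln_q P_θ with the standard Tsallis q-logarithm (an assumption about the paper's exact definition, consistent with the abstract), the premise is a single chain-rule step:

```latex
\nabla_\theta\,\ell_q
  \;=\; -\nabla_\theta \ln_q P_\theta
  \;=\; -P_\theta^{-q}\,\nabla_\theta P_\theta
  \;=\; P_\theta^{-q}\,\bigl(-P_\theta\,\nabla_\theta \log P_\theta\bigr),
\qquad \ln_q(x) \;=\; \frac{x^{1-q}-1}{1-q}.
```

The bracketed direction −P_θ ∇_θ log P_θ is identical for every q; only the scalar P_θ^{-q} changes, which is why it multiplies each update independently of the learning rate.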
What would settle it
Train models from the same cold-start distribution with controlled label noise, measure steps until accuracy rises and final accuracy after noise injection, and check whether escape time scales as Ω(1/p0) for q near 0 and Θ(log(1/p0)) for q near 1.
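A toy version of this experiment fits in a few lines. The sketch below is our own construction, not the paper's setup: a single Bernoulli "task" with success probability p = sigmoid(θ), Euler-integrated gradient flow on ln_q(p), and escape measured as steps until p reaches 0.5 from p0 = 10^-3. Under these assumptions the q=0 run takes on the order of 1/p0 steps and the q=1 run on the order of log(1/p0):

```python
import math

def escape_steps(q, p0=1e-3, target=0.5, dt=0.01, max_steps=10_000_000):
    """Euler-integrated gradient flow on a 1-D toy: p = sigmoid(theta),
    objective ln_q(p), so dtheta/dt = p^(-q) * dp/dtheta = p^(1-q) * (1 - p).
    Returns the number of steps until p first reaches `target`."""
    theta = math.log(p0 / (1 - p0))  # initialize so that sigmoid(theta) = p0
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return step
        theta += dt * p ** (1.0 - q) * (1.0 - p)
    return max_steps

t_density = escape_steps(q=1.0)   # density-estimation pole (SFT-like)
t_exploit = escape_steps(q=0.0)   # exploitation pole (RLVR-like)
print(t_density, t_exploit)
assert t_exploit > 20 * t_density  # ~1/p0 vs ~log(1/p0) separation
```

The noise-robustness half of the proposed test would additionally flip a fraction of labels and compare final accuracy across q, which this 1-D sketch does not model.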
original abstract
SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_\theta^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Tsallis loss continuum J_Q, a one-parameter family interpolating between RLVR (q=0, exploitation pole) and log-marginal-likelihood density estimation (q=1, density-estimation pole). It claims all J_Q members share identical per-example gradient directions, differing only by per-instance amplification P_θ^{-q}. Gradient-flow analysis derives cold-start escape times Ω(1/p0) for q=0 (noise-robust) versus Θ(log(1/p0)) for q=1 (noise-memorizing), explaining the SFT-then-RLVR ordering. It proposes GARL and PAFT Monte Carlo estimators for fixed-q optimization (shared bias O(q/(M P_θ^q))) and reports empirical gains on FinQA, HotPotQA, and MuSiQue, with GARL mitigating cold-start stalling where GRPO fails.
Significance. If the shared-gradient-direction assumption holds and the escape-time derivations are rigorous, the work supplies a principled unifying account for post-training reasoning models, justifying the standard SFT-then-RLVR pipeline and introducing a tunable loss continuum with rationale-free estimators. The empirical results on three QA benchmarks indicate practical utility for cold-start mitigation and stability tuning.
major comments (2)
- [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.
- [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.
minor comments (1)
- [Experiments] The empirical section would be strengthened by reporting the number of random seeds, standard deviations, and precise GRPO baseline configurations to substantiate claims such as 'escapes cold start where GRPO fails entirely' and the +13.9 m@16 gain on HotPotQA.
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment below and commit to providing the requested explicit derivations and verifications in the revised version to strengthen the theoretical contributions.
point-by-point responses
-
Referee: [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.
Authors: We acknowledge that the explicit computation of the gradient ∇J_Q was not presented in sufficient detail in the original submission. The property follows directly from the definition of the Tsallis q-loss, where the gradient with respect to the policy parameters θ decomposes as the expectation over the same importance-weighted direction for all q, scaled by the instance-specific factor P_θ^{-q}. We will include a dedicated subsection in the revised manuscript deriving ∇J_Q explicitly for general q, confirming that no q-dependent directional changes occur beyond the scalar reweighting. This will provide the rigorous verification needed to support the escape-time bounds and the explanation of the SFT-then-RLVR pipeline. revision: yes
-
Referee: [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.
Authors: We agree that the full derivations of the time-to-escape expressions and the Monte Carlo bias/variance analysis should be made explicit for verifiability. The bias term O(q/(M P_θ^q)) arises from the finite-sample approximation of the posterior over latent trajectories in the Tsallis framework, and the variance differences stem from the amplification factor in GARL versus the attenuation in PAFT. In the revision, we will expand the appendix with complete step-by-step derivations of the gradient flow dynamics for the escape times and the bias/variance bounds for both estimators, ensuring all central claims are fully supported. revision: yes
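The q=0 endpoint of GARL can be sanity-checked without the full derivation. The sketch below uses a toy enumerable model of our own construction (three latent rationales with fixed answer probabilities w); only the estimator form −w_m ∇_θ log p_θ(z^(m), y* | x*) is taken from the paper. Computing the estimator's exact expectation and comparing it against a finite difference of the q=0 loss 1 − P_θ confirms unbiasedness on this example:

```python
import math

# Toy latent-trajectory model (illustrative): rationale z ~ softmax(theta)
# over 3 options, each with a fixed chance w[z] of yielding the gold answer y*.
w = [0.05, 0.6, 0.2]  # p(y* | x*, z) -- independent of theta in this toy

def softmax(theta):
    e = [math.exp(t) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def P(theta):
    """Marginal success probability P_theta = sum_z p_theta(z) * w[z]."""
    return sum(pz * wz for pz, wz in zip(softmax(theta), w))

def garl_q0_expected_grad(theta):
    """Exact expectation over z ~ p_theta of the quoted q=0 estimator
    g = -w[z] * grad_theta log p_theta(z); for a softmax,
    grad_{theta_i} log p(z) = 1{z=i} - p_i."""
    p = softmax(theta)
    n = len(theta)
    return [sum(p[z] * (-w[z]) * ((1.0 if z == i else 0.0) - p[i])
                for z in range(n)) for i in range(n)]

theta = [0.2, -0.5, 1.0]
g = garl_q0_expected_grad(theta)
for i in range(3):
    tp, tm = list(theta), list(theta)
    tp[i] += 1e-6
    tm[i] -= 1e-6
    d_loss = -(P(tp) - P(tm)) / 2e-6  # gradient of the q=0 loss 1 - P_theta
    assert abs(g[i] - d_loss) < 1e-8  # estimator is unbiased at q=0
```

The same scaffold extends to the q=1 SNIS form, where the normalization by w̄_M introduces the finite-sample bias of the kind the authors bound by O(q/(M P_θ^q)).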
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines the J_Q family via the Tsallis q-logarithm interpolating RLVR (q=0) and log-marginal likelihood (q=1), states the shared per-example gradient direction property as following from that definition, and performs gradient-flow analysis to obtain the Ω(1/p0) vs Θ(log(1/p0)) escape times. No load-bearing step reduces by construction to a fitted quantity, self-citation chain, or renaming; the separation is a direct mathematical consequence of the posited family and the independent-of-LR reweighting, with no evidence that outputs are equivalent to inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- q
axioms (2)
- domain assumption All J_Q members share identical per-example gradient direction, differing only by instance-wise amplification P_θ^{-q}
- domain assumption Gradient flow analysis governs discrete training dynamics
invented entities (1)
- Tsallis loss family J_Q (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
A. P. Dempster, N. M. Laird, and D. B. Rubin
URL https://arxiv.org/abs/2501.12948. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977. Davide Ferrari and Yuhong Yang. Maximum Lq-likelihood estimation. The Annals of Statistics, 38(2):753–783, 2010. Kelvin Guu, Panupong P...
-
[2]
URL https://aclanthology.org/2021.naacl-main.405/. William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NjNGlPh8Wh. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajis...
-
[3]
ISSN 1532-4435. Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. 2026. URL https://arxiv.org/abs/2602.02710. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions v...
-
[4]
The model exactly recovers the data distribution
Density-estimation pole (q=1): θ*_j(1) = α_j. The model exactly recovers the data distribution
-
[5]
The model concentrates all mass on the most frequent output
Exploitation pole (q → 0+): assuming a unique mode j* = argmax_k α_k, θ*_j(q) → I(j = j*). The model concentrates all mass on the most frequent output. 3. Monotone sharpening: for 0 < q′ < q ≤ 1 and α_j > α_k, θ*_j(q′)/θ*_k(q′) > θ*_j(q)/θ*_k(q). Proof. Part (1): 1/q = 1. Part (2): (α_j/α_{j*})^{1/q} → 0 for j ≠ j*. Part (3): θ*_j/θ*_k = (α_j/α_k)^{1/q}, increasing in 1/q. C...
-
[6]
Each g_m marginalizes out the output y given z^(m) analytically via w_m = p_θ(y* | x*, z^(m)), rather than relying on a sampled output and binary reward
GARL at q=0 recovers Rao–Blackwellized REINFORCE [Williams, 1992, Zhou et al., 2026]: ∇_θ ℓ_q |_{q=0} = ḡ_M = (1/M) Σ_{m=1}^{M} (−w_m ∇_θ log p_θ(z^(m), y* | x*)), which is unbiased for ∇_θ ℓ_0 by Equation (8). Each g_m marginalizes out the output y given z^(m) analytically via w_m = p_θ(y* | x*, z^(m)), rather than relying on a sampled output and binary reward
1992
-
[7]
PAFT at q=0 reduces to posterior-resampled SFT scaled by P_θ: ∇̂^{PAFT}_{q=0} = −w̄_M · (1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*)
GARL at q=1 recovers the IWAE gradient estimator [Burda et al., 2015], a self-normalized importance sampling (SNIS) estimator for ∇_θ log P_θ: ∇_θ ℓ_q |_{q=1} = ḡ_M / w̄_M = (Σ_m w_m (−∇_θ log p_θ(z^(m), y* | x*))) / (Σ_m w_m). 3. PAFT at q=0 reduces to posterior-resampled SFT scaled by P_θ: ∇̂^{PAFT}_{q=0} = −w̄_M · (1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*). The factor w̄_M ≈ P_θ downweig...
2015
-
[8]
The instance weight (w̄_M)^{1−1} = 1 vanishes: all instances contribute equally, and the gradient is uniform SFT on approximate posterior samples
PAFT at q=1 recovers the E-step of EM [Dempster et al., 1977] / TRICE [Phan et al., 2023]: ∇̂^{PAFT}_{q=1} = −(1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*). The instance weight (w̄_M)^{1−1} = 1 vanishes: all instances contribute equally, and the gradient is uniform SFT on approximate posterior samples. Proof. Each case follows by substituting q=0 or q=1 into the GARL e...
1977
discussion (0)