How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Pith reviewed 2026-05-08 03:05 UTC · model grok-4.3
The pith
The Tsallis q-logarithm loss family unifies SFT and RLVR by trading cold-start speed for noise robustness in reasoning model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Every loss in the J_Q family shares the same per-example gradient direction and differs only by the per-instance amplification factor P_θ^{-q}, applied independently of the learning rate. Consequently, the exploitation pole at q=0 requires Ω(1/p0) steps to escape cold start but resists label noise, whereas the density-estimation pole at q=1 escapes in Θ(log(1/p0)) steps yet memorizes noise. This separation directly accounts for the effectiveness of the stepwise schedule that applies q=1 first and q=0 afterward.
What carries the argument
The single-parameter Tsallis q-logarithm loss family J_Q, which interpolates between RLVR (q=0) and log-marginal-likelihood density estimation (q=1) solely through per-example reweighting by P_θ^{-q}.
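This reweighting is concrete enough to check numerically. A minimal sketch, assuming the standard Tsallis definition ln_q(p) = (p^{1−q} − 1)/(1−q) and writing P for the model's marginal probability of the gold answer (the endpoint interpretations follow the abstract; the finite-difference check is ours):

```python
import math

def ln_q(p, q):
    """Tsallis q-logarithm: (p^(1-q) - 1) / (1 - q), with the q -> 1 limit log p."""
    if abs(1.0 - q) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

P = 0.3  # marginal probability of the correct answer, standing in for P_theta
assert abs(ln_q(P, 0.0) - (P - 1.0)) < 1e-12    # q=0: affine in P, so maximizing it maximizes P (RLVR-style)
assert abs(ln_q(P, 1.0) - math.log(P)) < 1e-12  # q=1: log-likelihood (density estimation)

# Shared direction, q-dependent amplification: d/dP ln_q(P) = P^(-q),
# so grad_theta ln_q(P_theta) = P_theta^(-q) * grad_theta P_theta for every q.
for q in (0.0, 0.5, 1.0):
    eps = 1e-6
    numeric = (ln_q(P + eps, q) - ln_q(P - eps, q)) / (2 * eps)
    assert abs(numeric - P ** (-q)) < 1e-4
```

For every q the gradient keeps the direction of ∇P; only the scalar magnitude P^{-q} changes, which is the reweighting the core claim rests on.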
If this is right
- The SFT-then-RLVR ordering follows directly from the differing escape times and noise-robustness properties of the q=1 and q=0 poles.
- Fixed-q training becomes practical through the Monte Carlo estimators GARL and PAFT that require no annotated rationales.
- GARL at high q escapes cold start on FinQA, HotPotQA, and MuSiQue where standard GRPO fails entirely.
- In stable training regimes low-q GARL improves FinQA while PAFT at q=0.75 remains stable on HotPotQA and MuSiQue.
Where Pith is reading between the lines
- Gradual rather than abrupt changes in q during training may further improve the speed-robustness balance on new tasks.
- The shared-gradient property could support similar escape-time analyses for other loss interpolations used in language-model post-training.
- Early training phases that use high q should avoid data with label noise to prevent memorization.
Load-bearing premise
All members of the J_Q family share the exact same per-example gradient direction and differ only through the independent per-instance reweighting term P_θ^{-q}.
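Assuming the per-example loss is ℓ_q = −ln_q P_θ with the standard Tsallis q-logarithm (an assumption about the paper's exact definition, consistent with the abstract), the premise is a single chain-rule step:

```latex
\nabla_\theta\,\ell_q
  \;=\; -\nabla_\theta \ln_q P_\theta
  \;=\; -P_\theta^{-q}\,\nabla_\theta P_\theta
  \;=\; P_\theta^{-q}\,\bigl(-P_\theta\,\nabla_\theta \log P_\theta\bigr),
\qquad \ln_q(x) \;=\; \frac{x^{1-q}-1}{1-q}.
```

The bracketed direction −P_θ ∇_θ log P_θ is identical for every q; only the scalar P_θ^{-q} changes, which is why it multiplies each update independently of the learning rate.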
What would settle it
Train models from the same cold-start distribution with controlled label noise, measure steps until accuracy rises and final accuracy after noise injection, and check whether escape time scales as Ω(1/p0) for q near 0 and Θ(log(1/p0)) for q near 1.
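A toy version of this experiment fits in a few lines. The sketch below is our own construction, not the paper's setup: a single Bernoulli "task" with success probability p = sigmoid(θ), Euler-integrated gradient flow on ln_q(p), and escape measured as steps until p reaches 0.5 from p0 = 10^-3. Under these assumptions the q=0 run takes on the order of 1/p0 steps and the q=1 run on the order of log(1/p0):

```python
import math

def escape_steps(q, p0=1e-3, target=0.5, dt=0.01, max_steps=10_000_000):
    """Euler-integrated gradient flow on a 1-D toy: p = sigmoid(theta),
    objective ln_q(p), so dtheta/dt = p^(-q) * dp/dtheta = p^(1-q) * (1 - p).
    Returns the number of steps until p first reaches `target`."""
    theta = math.log(p0 / (1 - p0))  # initialize so that sigmoid(theta) = p0
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return step
        theta += dt * p ** (1.0 - q) * (1.0 - p)
    return max_steps

t_density = escape_steps(q=1.0)   # density-estimation pole (SFT-like)
t_exploit = escape_steps(q=0.0)   # exploitation pole (RLVR-like)
print(t_density, t_exploit)
assert t_exploit > 20 * t_density  # ~1/p0 vs ~log(1/p0) separation
```

The noise-robustness half of the proposed test would additionally flip a fraction of labels and compare final accuracy across q, which this 1-D sketch does not model.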
original abstract
SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_\theta^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Tsallis loss continuum J_Q, a one-parameter family interpolating between RLVR (q=0, exploitation pole) and log-marginal-likelihood density estimation (q=1, density-estimation pole). It claims all J_Q members share identical per-example gradient directions, differing only by per-instance amplification P_θ^{-q}. Gradient-flow analysis derives cold-start escape times Ω(1/p0) for q=0 (noise-robust) versus Θ(log(1/p0)) for q=1 (noise-memorizing), explaining the SFT-then-RLVR ordering. It proposes GARL and PAFT Monte Carlo estimators for fixed-q optimization (shared bias O(q/(M P_θ^q))) and reports empirical gains on FinQA, HotPotQA, and MuSiQue, with GARL mitigating cold-start stalling where GRPO fails.
Significance. If the shared-gradient-direction assumption holds and the escape-time derivations are rigorous, the work supplies a principled unifying account for post-training reasoning models, justifying the standard SFT-then-RLVR pipeline and introducing a tunable loss continuum with rationale-free estimators. The empirical results on three QA benchmarks indicate practical utility for cold-start mitigation and stability tuning.
major comments (2)
- [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.
- [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.
minor comments (1)
- [Experiments] The empirical section would be strengthened by reporting the number of random seeds, standard deviations, and precise GRPO baseline configurations to substantiate claims such as 'escapes cold start where GRPO fails entirely' and the +13.9 m@16 gain on HotPotQA.
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment below and commit to providing the requested explicit derivations and verifications in the revised version to strengthen the theoretical contributions.
point-by-point responses
-
Referee: [Gradient Flow Analysis] The load-bearing claim that all J_Q members share the same per-example gradient direction (differing only by scalar P_θ^{-q} reweighting) is invoked to derive the Ω(1/p0) vs Θ(log(1/p0)) escape-time separation and noise-robustness distinction. The manuscript asserts this property but does not provide the explicit computation of ∇J_Q for general q confirming that the Tsallis q-logarithm and latent-trajectory marginal at q=1 introduce no q-dependent directional changes; without this verification the theoretical grounding for the SFT-then-RLVR explanation does not follow.
Authors: We acknowledge that the explicit computation of the gradient ∇J_Q was not presented in sufficient detail in the original submission. The property follows directly from the definition of the Tsallis q-loss, where the gradient with respect to the policy parameters θ decomposes as the expectation over the same importance-weighted direction for all q, scaled by the instance-specific factor P_θ^{-q}. We will include a dedicated subsection in the revised manuscript deriving ∇J_Q explicitly for general q, confirming that no q-dependent directional changes occur beyond the scalar reweighting. This will provide the rigorous verification needed to support the escape-time bounds and the explanation of the SFT-then-RLVR pipeline. revision: yes
-
Referee: [Estimator Derivations] The bias O(q/(M P_θ^q)) and differing variance/stability properties for the GARL and PAFT estimators are stated in the abstract, but the full derivations of the time-to-escape expressions under gradient flow and the Monte Carlo bias/variance analysis are not verifiable from the given description. These derivations are required to support the central separation claims.
Authors: We agree that the full derivations of the time-to-escape expressions and the Monte Carlo bias/variance analysis should be made explicit for verifiability. The bias term O(q/(M P_θ^q)) arises from the finite-sample approximation of the posterior over latent trajectories in the Tsallis framework, and the variance differences stem from the amplification factor in GARL versus the attenuation in PAFT. In the revision, we will expand the appendix with complete step-by-step derivations of the gradient flow dynamics for the escape times and the bias/variance bounds for both estimators, ensuring all central claims are fully supported. revision: yes
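The q=0 endpoint of GARL can be sanity-checked without the full derivation. The sketch below uses a toy enumerable model of our own construction (three latent rationales with fixed answer probabilities w); only the estimator form −w_m ∇_θ log p_θ(z^(m), y* | x*) is taken from the paper. Computing the estimator's exact expectation and comparing it against a finite difference of the q=0 loss 1 − P_θ confirms unbiasedness on this example:

```python
import math

# Toy latent-trajectory model (illustrative): rationale z ~ softmax(theta)
# over 3 options, each with a fixed chance w[z] of yielding the gold answer y*.
w = [0.05, 0.6, 0.2]  # p(y* | x*, z) -- independent of theta in this toy

def softmax(theta):
    e = [math.exp(t) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def P(theta):
    """Marginal success probability P_theta = sum_z p_theta(z) * w[z]."""
    return sum(pz * wz for pz, wz in zip(softmax(theta), w))

def garl_q0_expected_grad(theta):
    """Exact expectation over z ~ p_theta of the quoted q=0 estimator
    g = -w[z] * grad_theta log p_theta(z); for a softmax,
    grad_{theta_i} log p(z) = 1{z=i} - p_i."""
    p = softmax(theta)
    n = len(theta)
    return [sum(p[z] * (-w[z]) * ((1.0 if z == i else 0.0) - p[i])
                for z in range(n)) for i in range(n)]

theta = [0.2, -0.5, 1.0]
g = garl_q0_expected_grad(theta)
for i in range(3):
    tp, tm = list(theta), list(theta)
    tp[i] += 1e-6
    tm[i] -= 1e-6
    d_loss = -(P(tp) - P(tm)) / 2e-6  # gradient of the q=0 loss 1 - P_theta
    assert abs(g[i] - d_loss) < 1e-8  # estimator is unbiased at q=0
```

The same scaffold extends to the q=1 SNIS form, where the normalization by w̄_M introduces the finite-sample bias of the kind the authors bound by O(q/(M P_θ^q)).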
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines the J_Q family via the Tsallis q-logarithm interpolating RLVR (q=0) and log-marginal likelihood (q=1), states the shared per-example gradient direction property as following from that definition, and performs gradient-flow analysis to obtain the Ω(1/p0) vs Θ(log(1/p0)) escape times. No load-bearing step reduces by construction to a fitted quantity, self-citation chain, or renaming; the separation is a direct mathematical consequence of the posited family and the independent-of-LR reweighting, with no evidence that outputs are equivalent to inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- q
axioms (2)
- domain assumption All J_Q members share identical per-example gradient direction, differing only by instance-wise amplification P_θ^{-q}
- domain assumption Gradient flow analysis governs discrete training dynamics
invented entities (1)
- Tsallis loss family J_Q (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
A. P. Dempster, N. M. Laird, and D. B. Rubin
URL https://arxiv.org/abs/2501.12948. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977. Davide Ferrari and Yuhong Yang. Maximum Lq-likelihood estimation. The Annals of Statistics, 38(2):753–783, 2010. Kelvin Guu, Panupong P...
-
[2]
URL https://aclanthology.org/2021.naacl-main.405/. William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NjNGlPh8Wh. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajis...
-
[3]
ISSN 1532-4435. Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. 2026. URL https://arxiv.org/abs/2602.02710. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions v...
-
[4]
The model exactly recovers the data distribution
Density-estimation pole (q=1): θ*_j(1) = α_j. The model exactly recovers the data distribution
-
[5]
The model concentrates all mass on the most frequent output
Exploitation pole (q → 0+): assuming a unique mode j* = argmax_k α_k, θ*_j(q) → I(j = j*). The model concentrates all mass on the most frequent output. 3. Monotone sharpening: for 0 < q′ < q ≤ 1 and α_j > α_k, θ*_j(q′)/θ*_k(q′) > θ*_j(q)/θ*_k(q). Proof. Part (1): 1/q = 1. Part (2): (α_j/α_{j*})^{1/q} → 0 for j ≠ j*. Part (3): θ*_j/θ*_k = (α_j/α_k)^{1/q}, increasing in 1/q. C...
-
[6]
Each g_m marginalizes out the output y given z^(m) analytically via w_m = p_θ(y* | x*, z^(m)), rather than relying on a sampled output and binary reward
GARL at q=0 recovers Rao–Blackwellized REINFORCE [Williams, 1992, Zhou et al., 2026]: ∇_θ ℓ_q |_{q=0} = ḡ_M = (1/M) Σ_{m=1}^{M} (−w_m ∇_θ log p_θ(z^(m), y* | x*)), which is unbiased for ∇_θ ℓ_0 by Equation (8). Each g_m marginalizes out the output y given z^(m) analytically via w_m = p_θ(y* | x*, z^(m)), rather than relying on a sampled output and binary reward
1992
-
[7]
PAFT at q=0 reduces to posterior-resampled SFT scaled by P_θ: ∇̂^{PAFT}_{q=0} = −w̄_M · (1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*)
GARL at q=1 recovers the IWAE gradient estimator [Burda et al., 2015], a self-normalized importance sampling (SNIS) estimator for ∇_θ log P_θ: ∇_θ ℓ_q |_{q=1} = ḡ_M / w̄_M = (Σ_m w_m (−∇_θ log p_θ(z^(m), y* | x*))) / (Σ_m w_m). 3. PAFT at q=0 reduces to posterior-resampled SFT scaled by P_θ: ∇̂^{PAFT}_{q=0} = −w̄_M · (1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*). The factor w̄_M ≈ P_θ downweig...
2015
-
[8]
The instance weight (w̄_M)^{1−1} = 1 vanishes: all instances contribute equally, and the gradient is uniform SFT on approximate posterior samples
PAFT at q=1 recovers the E-step of EM [Dempster et al., 1977] / TRICE [Phan et al., 2023]: ∇̂^{PAFT}_{q=1} = −(1/K) Σ_{k=1}^{K} ∇_θ log p_θ(z^(r_k), y* | x*). The instance weight (w̄_M)^{1−1} = 1 vanishes: all instances contribute equally, and the gradient is uniform SFT on approximate posterior samples. Proof. Each case follows by substituting q=0 or q=1 into the GARL e...
1977
discussion (0)