pith. machine review for the scientific record.

arxiv: 2604.17228 · v1 · submitted 2026-04-19 · 💻 cs.LG


Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study


Pith reviewed 2026-05-10 07:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords conditional depth routing · auxiliary losses · gate training · off-policy oracle · transformer efficiency · language modeling · routing gate

The pith

Removing utility and rank auxiliary losses improves performance in conditional depth routing by fixing an off-policy label mismatch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conditional depth execution routes some tokens through cheap FFN layers and others through full ones at selected points in a transformer. Gate training is difficult because the decision's effect on the language modeling loss is delayed and the gradients are weak. The study compares an MLP gate that scores utility from the current state against a JEPA-guided gate that adds a predictor of full versus cheap outcomes in latent space. With the standard set of auxiliary losses including oracle utility regression and pairwise rank supervision, the JEPA gate shows faster early progress. Jointly dropping those two auxiliary losses raises best and average LM performance and speeds threshold hits for both gates while cutting the compute overhead.
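The routing mechanism described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the weight shapes, the linear gate, and the top-k selection at a 50% budget are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_cheap, n_tokens, budget = 16, 4, 8, 0.5

# Hypothetical weights: a standard full FFN, a low-rank cheap FFN, a linear gate.
W_full1, W_full2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
W_cheap1, W_cheap2 = rng.normal(size=(d, d_cheap)), rng.normal(size=(d_cheap, d))
w_gate = rng.normal(size=d)

def route(h):
    """Send the top-`budget` fraction of tokens (by gate score) through the
    full FFN; the rest take the cheap FFN. Residual connection on both paths."""
    scores = h @ w_gate                           # one utility score per token
    k = int(np.ceil(budget * len(h)))             # tokens allowed on the full path
    mask = np.zeros(len(h), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True          # highest-scoring tokens go full
    out = np.empty_like(h)
    out[mask] = np.maximum(h[mask] @ W_full1, 0) @ W_full2
    out[~mask] = np.maximum(h[~mask] @ W_cheap1, 0) @ W_cheap2
    return h + out, mask

h = rng.normal(size=(n_tokens, d))
out, mask = route(h)
```

The delayed-credit problem is visible even here: the gate's hard top-k decision enters the LM loss only through many subsequent layers, which is why auxiliary losses are stacked on in the first place.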

Core claim

Under the standard recipe with oracle-style utility regression and pairwise rank supervision, the JEPA-guided gate improves early-to-mid optimization over the MLP gate. Jointly removing util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates, and the early-to-mid advantage of the JEPA gate disappears. This occurs because the off-policy oracle label assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full, making util/rank net-negative under the current recipe. Removing util/rank also cuts the training FLOPs proxy from ~1.53x to ~1.07x full-only.

What carries the argument

The off-policy oracle label for utility regression and pairwise rank supervision, which assumes full subsequent execution despite actual gated routing.
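A toy calculation (ours, not the paper's) makes the mismatch concrete. Model the downstream LM loss as a stand-in function of accumulated "signal", where each later controlled layer contributes more when it executes the full FFN:

```python
# Toy numbers, not the paper's labels: downstream loss falls as more signal
# accumulates; later layers add full_gain if they run the full FFN, else cheap_gain.
def downstream_loss(signal):
    return 1.0 / (1.0 + signal)

full_gain, cheap_gain, n_later = 1.0, 0.3, 4

def utility(later_gain):
    # loss-if-cheap-now minus loss-if-full-now, with every later layer
    # contributing `later_gain`
    cheap_now = downstream_loss(cheap_gain + n_later * later_gain)
    full_now = downstream_loss(full_gain + n_later * later_gain)
    return cheap_now - full_now

oracle = utility(full_gain)                             # assumes later layers all full
realized = utility(0.5 * full_gain + 0.5 * cheap_gain)  # ~50% actually run full
# oracle ≈ 0.022 but realized ≈ 0.039: in this toy, the off-policy label
# understates the utility under gated execution, so regressing the gate
# toward it pulls scores toward the wrong targets.
```

The direction of the error depends on the stand-in loss; the point is only that the two labels diverge whenever the gate routes less than 100% of later tokens through the full path.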

If this is right

  • Joint removal of util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates.
  • The early-to-mid advantage of the JEPA-guided gate over the MLP gate disappears without util/rank.
  • Training FLOPs proxy drops from ~1.53x to ~1.07x full-only.
  • Endpoint LM performance stays within 0.005 of the heuristic reference.
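The compute numbers above can be cross-checked from the quoted figures alone (our arithmetic, not the paper's code). Note that the ~39% matches the wall-clock reduction, while the FLOPs-proxy ratio implies a somewhat smaller ~30% cut:

```python
# Sanity-check arithmetic on the reported savings.
wallclock_saving = 1 - 1.75 / 2.87      # 2.87h -> 1.75h on a V100-32GB
flops_proxy_saving = 1 - 1.07 / 1.53    # ~1.53x -> ~1.07x full-only
print(round(wallclock_saving * 100))    # ~39: matches the paper's ~39% figure
print(round(flops_proxy_saving * 100))  # ~30: the proxy falls less than wall-clock
```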

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designing auxiliary losses that use on-policy estimates matching the actual gated execution fraction could make utility and rank supervision beneficial again.
  • The off-policy mismatch may affect other conditional computation methods that rely on full-path oracles for supervision.
  • Simplifying gate training by dropping these losses could reduce training time across different routing budgets.

Load-bearing premise

The off-policy oracle label assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full, making util/rank net-negative under the current recipe.

What would settle it

Replace the off-policy oracle with on-policy utility estimates that condition on the actual fraction of full executions under the current gate and check whether re-adding util/rank then improves rather than degrades LM loss.
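One way to set up that control, sketched with deterministic toy weights and a stand-in loss (everything here is hypothetical, not the paper's counterfactual fork): compute the utility label once with all later layers forced full (the current oracle) and once with later layers routed by the gate, and compare.

```python
import numpy as np

d, n_later = 4, 3
W_full, W_cheap = 0.2 * np.eye(d), 0.05 * np.eye(d)   # full path amplifies more
w_gate = np.array([1.0, -1.0, 0.5, -0.5])
target = np.ones(d)

def run_later(h, policy):
    # "full": the oracle's assumption that every later layer executes full.
    # "gated": later layers routed by the current gate, as during training.
    for _ in range(n_later):
        use_full = True if policy == "full" else bool(h @ w_gate > 0)
        h = h + h @ (W_full if use_full else W_cheap)
    return h

def loss(h):
    return float(np.mean((h - target) ** 2))          # stand-in for LM loss

def utility_label(h_full, h_cheap, policy):
    # counterfactual fork at the controlled layer; later layers under `policy`
    return loss(run_later(h_cheap, policy)) - loss(run_later(h_full, policy))

h = np.array([0.2, 0.5, 0.1, 0.3])
h_full, h_cheap = h + h @ W_full, h + h @ W_cheap     # the two fork outcomes

off_policy = utility_label(h_full, h_cheap, "full")   # current oracle label
on_policy = utility_label(h_full, h_cheap, "gated")   # proposed on-policy label
# The two labels disagree; the proposed experiment is whether training with
# the on-policy version makes re-added util/rank help rather than hurt.
```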

Figures

Figures reproduced from arXiv: 2604.17228 by Qingwei Lin.

Figure 1. Architecture of the JEPA-guided gate (G3). Context projection
Figure 2. Counterfactual fork used to construct the utility label.
Figure 3. Validation eval lm_loss learning curves for four core configurations at 50% budget (3 seeds; solid line = mean, shaded region = ±1 std). The gray horizontal band marks ±0.005 around the G3 endpoint mean as a heuristic reference (for endpoint eval lm_loss only; not a formal non-inferiority test). Key reading points: (a) G3 consistently outperforms G1 in the early-to-mid training phase under the standard rec…
Figure 4. Training grad_norm curves for four core configurations (log-y; solid = mean, shaded = ±1 std; rolling-mean window = 200 steps). Three tiers are clearly separated: G1 ∼ 80, G3 ∼ 8, A3-G3/A3-G1 ∼ 0.5. A1 is excluded from this figure because its low gradient norm (∼ 2.3) arises from predictor collapse (§5.2), qualitatively different from the “lower norms with better LM and non-collapsed diagnostics” shown her…
Original abstract

Conditional depth execution routes a subset of tokens through a lightweight cheap FFN while the remainder execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them -- particularly between a predictive auxiliary and explicit score supervision -- have not been systematically compared under controlled conditions. We evaluate two gate designs under a 157.5M-parameter decoder-only model with controller-only training, 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor that forecasts, in a low-dimensional latent space, the outcome of executing full vs. cheap per token, aligned against a fixed target head. Under the standard recipe with oracle-style utility regression and pairwise rank supervision (util/rank), G3 improves early-to-mid optimisation over G1 in 3/3 seeds (lower avg LM, faster threshold hits, ~10.3x lower grad norms), with 20k-step endpoint LM within a 0.005 heuristic reference. A key finding (ablation A3): jointly removing util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates, and the early-to-mid advantage of G3 over G1 disappears. We trace this to an off-policy oracle label that assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full -- making util/rank net-negative under the current recipe. Removing util/rank also cuts the training FLOPs proxy from ~1.53x to ~1.07x full-only (2.87h to 1.75h on a V100-32GB, ~39%). Conclusions are scoped to the studied regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a controlled empirical study of auxiliary losses for gate training in conditional depth routing. Using a 157.5M decoder-only model with 50% full-path budget and 3-seed runs, it compares an MLP gate (G1) against a JEPA-guided gate (G3) and reports that jointly removing utility-regression and pairwise-rank supervision (util/rank) yields lower best/average LM loss and faster threshold hits for both gates, while also reducing a training-FLOPs proxy by ~39%. The improvement is traced to an off-policy mismatch between the oracle utility labels (which assume all subsequent layers execute the full FFN) and the actual gated trajectories.

Significance. If the empirical result holds, the work demonstrates that conventional auxiliary supervision can be counterproductive under gated execution and that simpler recipes can simultaneously improve optimization speed and cut training cost. The multi-seed, controlled-budget design provides a reproducible baseline for future conditional-computation studies.

major comments (2)
  1. [Ablation A3] Ablation A3: the claim that util/rank is net-negative because of off-policy oracle mismatch is not supported by direct evidence. No per-token discrepancy between oracle and realized utility is measured, and no on-policy re-labeling control is performed to isolate the mismatch from incidental effects such as changed gradient scale or reduced regularization.
  2. [Experimental setup] Experimental setup and conclusions: all reported gains and the disappearance of the G3-vs-G1 advantage are demonstrated only at the specific 50% budget and 157.5M model size. Because the central interpretation hinges on the interaction between budget and label mismatch, sensitivity to these choices should be quantified before the recommendation to drop util/rank can be generalized.
minor comments (2)
  1. [Abstract] The ~10.3x lower gradient-norm figure for G3 is stated without accompanying analysis or plot; a short explanation or supplementary figure would clarify whether this is a cause or consequence of the observed optimization behavior.
  2. [Methods] Notation for the two gates (G1, G3) and the auxiliary losses (util/rank) is introduced clearly but could be reinforced with a small summary table in the methods section for readers who skip the ablation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and agree that our claims require additional qualification. We will make partial revisions to the manuscript to reflect this.

Point-by-point responses
  1. Referee: [Ablation A3] Ablation A3: the claim that util/rank is net-negative because of off-policy oracle mismatch is not supported by direct evidence. No per-token discrepancy between oracle and realized utility is measured, and no on-policy re-labeling control is performed to isolate the mismatch from incidental effects such as changed gradient scale or reduced regularization.

    Authors: We acknowledge that the interpretation of off-policy mismatch is supported only indirectly by the consistent LM loss and threshold-hit improvements upon removal of util/rank across seeds and gates. The manuscript contains no per-token oracle-vs-realized utility measurements or on-policy re-labeling ablations that would isolate mismatch from other factors such as gradient scale or regularization strength. We will revise the discussion of Ablation A3 to present the off-policy hypothesis as a plausible explanation rather than a conclusively demonstrated mechanism and will explicitly note the lack of direct controls as a limitation. revision: partial

  2. Referee: [Experimental setup] Experimental setup and conclusions: all reported gains and the disappearance of the G3-vs-G1 advantage are demonstrated only at the specific 50% budget and 157.5M model size. Because the central interpretation hinges on the interaction between budget and label mismatch, sensitivity to these choices should be quantified before the recommendation to drop util/rank can be generalized.

    Authors: All results are obtained at the 50% budget and 157.5M scale, as described in the experimental setup. The manuscript already scopes its conclusions to this regime. We do not have the computational resources to run the required sensitivity sweeps at additional budgets or model sizes. We will strengthen the discussion and conclusion sections to more explicitly caution against generalizing the recommendation to drop util/rank and to highlight the potential dependence on budget and scale. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study with independent runs

full rationale

The manuscript reports controlled experiments on two gate architectures (MLP G1 and JEPA-guided G3) under fixed budgets, seeds, and data subsets. All claims rest on direct measurements of LM loss, threshold-hit speed, gradient norms, and training FLOPs from fresh training runs. No equations, fitted parameters, or self-citations are invoked to derive or predict the reported outcomes; the off-policy oracle explanation is presented as a post-hoc hypothesis rather than a deductive step. The analysis is therefore self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard transformer training assumptions and the specific experimental conditions; no new free parameters or invented entities are introduced beyond the two gate designs and auxiliary losses tested.

axioms (2)
  • domain assumption The gate decision must propagate through many layers before it influences the language modeling loss, resulting in weak and noisy gradients.
    Stated as the central difficulty of gate training in the abstract.
  • domain assumption Oracle-style utility regression and pairwise rank supervision form the standard recipe for auxiliary losses.
    Used as the baseline recipe against which ablations are compared.

pith-pipeline@v0.9.0 · 5674 in / 1256 out tokens · 75136 ms · 2026-05-10T07:02:23.246642+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 4 internal anchors

  1. Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. V-JEPA: Video joint-embedding predictive architecture. arXiv preprint arXiv:2404.16930.
  2. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  3. Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
  4. Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  5. David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter C Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.
  6. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664.
  7. Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.