pith. machine review for the scientific record.

arxiv: 2605.07756 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining

Andrey Savchenko, Ivan Karpukhin

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:06 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: composite loss · loss weighting · pretraining · gradient alignment · self-supervised learning · hyperparameter optimization · event sequence modeling · computer vision

The pith

A gradient-based method learns weights for composite pretraining losses by aligning their combined gradient with the gradient of a downstream objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to show that the relative weights in a composite pretraining loss can be learned within a single training run. It does so by repeatedly adjusting the weights so that the overall pretraining gradient points in the same direction as the gradient of a downstream task. If this holds, it removes the need for many separate training runs to tune those weights, which is currently the main expense when combining multiple loss terms on unlabeled data. The approach is tested on event-sequence modeling and on self-supervised image tasks, where models trained this way match or exceed those trained with weights found by search methods.

Core claim

The central claim is that one can optimize the weights of a composite pretraining loss during training by adjusting them so the resulting gradient aligns with the gradient of a downstream objective. This is done efficiently by using the linear combination structure of the loss to skip the repeated full backward passes that would otherwise be needed. Experiments on event-sequence modeling and self-supervised computer vision show that models trained this way reach performance levels matching or exceeding those from expensive searches over possible weights.

What carries the argument

The gradient alignment update, which adjusts loss weights to make the pretraining gradient match a downstream gradient and thereby automates weight selection.
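As a concrete illustration of that update, here is a minimal PyTorch sketch of one plausible form it could take. It is an editorial reconstruction, not the authors' code: the callables `pretrain_losses` and `downstream_loss`, the dot-product alignment objective, and the norm-based scaling are assumptions (the paper's Figure 2 compares several normalization strategies), and the actual GraP update may differ.

```python
# Editorial sketch, not the authors' implementation: adjust loss weights w_i by
# gradient ascent on the alignment between the composite pretraining gradient
# and a downstream gradient. Helper callables are hypothetical.
import torch

def alignment_step(model, weights, pretrain_losses, downstream_loss, w_lr=1e-2):
    params = [p for p in model.parameters() if p.requires_grad]

    # One backward pass per loss term (the paper's structure-exploiting trick
    # is meant to reduce this cost; the naive loop is kept here for clarity).
    per_term = []
    for loss_fn in pretrain_losses:
        grads = torch.autograd.grad(loss_fn(model), params, retain_graph=True)
        per_term.append(torch.cat([g.reshape(-1) for g in grads]))
    G = torch.stack(per_term)                        # (num_terms, num_params)

    # Downstream gradient that serves as the alignment target.
    d_grads = torch.autograd.grad(downstream_loss(model), params)
    d = torch.cat([g.reshape(-1) for g in d_grads])

    # Alignment between the composite gradient sum_i w_i * g_i and the target.
    w = weights.detach().clone().requires_grad_(True)
    composite = (w.unsqueeze(1) * G).sum(dim=0)
    alignment = torch.dot(composite, d) / (composite.norm() + 1e-12)
    (grad_w,) = torch.autograd.grad(alignment, w)

    with torch.no_grad():
        weights.add_(w_lr * grad_w)                  # ascend the alignment
        weights.clamp_(min=1e-3)                     # keep weights positive
    return weights
```

The point the sketch preserves is that tuning the weights costs a handful of extra backward passes inside the single pretraining run, rather than a separate run per candidate weight vector.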

If this is right

  • Loss weights no longer need to be chosen by hand or by running dozens of trainings.
  • Hyperparameter search overhead drops to a small constant factor over one full pretraining (see the back-of-envelope arithmetic after this list).
  • The technique transfers across different domains such as sequences and vision without modification.
  • Final model quality on the target task remains competitive with exhaustive tuning.
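For a rough sense of scale on that constant factor (an editorial calculation, not a number reported in the paper): taking the abstract's roughly 30% overhead at face value and assuming a 20-trial search budget, the figure quoted in the simulated rebuttal below,

\[
\text{search cost} \approx 20\,C_{\text{run}}, \qquad
\text{aligned-run cost} \approx 1.3\,C_{\text{run}}
\quad\Longrightarrow\quad
20/1.3 \approx 15\times \text{ less compute spent on weight tuning.}
\]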

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment could be used to adapt weights when the downstream task changes slightly after initial pretraining.
  • Researchers might apply this to balance losses in multi-task learning where one task serves as the guide.
  • If the method is stable, it could allow weights to evolve throughout training rather than converging to fixed values.
  • Testing the approach on larger models and more complex composite losses would reveal its scalability limits.

Load-bearing premise

Aligning the pretraining gradient to a downstream objective produces weights that are good enough for the actual pretraining goal and do not lead to unstable training dynamics.

What would settle it

If a model trained with weights from this alignment method performs worse on the downstream task than a model trained with weights found by running a full Bayesian optimization search over many trainings, then the claim would be falsified.
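A minimal sketch of that head-to-head, written here as an illustration rather than the paper's protocol: `pretrain_and_eval` and `grap_pretrain_and_eval` are hypothetical helpers, the 20-trial budget and the [0.1, 10] weight range are the values quoted in the simulated rebuttal below, and Optuna's default sampler stands in for the Bayesian search.

```python
# Editorial sketch of the falsification experiment, not the paper's code.
# `pretrain_and_eval(weights)` would run one full pretraining with fixed loss
# weights and return the downstream metric; `grap_pretrain_and_eval()` stands
# in for a single run of the gradient-alignment method.
import optuna

NUM_LOSSES = 3  # illustrative; the composite varies by task in the paper

def objective(trial):
    weights = [trial.suggest_float(f"w{i}", 0.1, 10.0) for i in range(NUM_LOSSES)]
    return pretrain_and_eval(weights)        # one full pretraining per trial

# Search baseline: 20 independent pretraining runs guided by a surrogate.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# Alignment method: one pretraining run with modest overhead.
grap_score = grap_pretrain_and_eval()

# The core claim is falsified if the search baseline clearly wins.
print("search best:", study.best_value, "  alignment:", grap_score)
```

In practice the comparison would need repeated runs and significance tests, as the referee report below points out.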

Figures

Figures reproduced from arXiv: 2605.07756 by Andrey Savchenko, Ivan Karpukhin.

Figure 1: Overview of the proposed GraP framework. A shared backbone produces embeddings used …
Figure 2: Comparison of normalization strategies. The objective is to maximize the alignment …
Figure 3: Weights trajectories for the Churn dataset.
Figure 4: Weights trajectories for the AgePred dataset.
Figure 5: Weights trajectories for the AlfaBattle dataset.
Figure 6: Weights trajectories for the MIMIC-III dataset.
Figure 7: Weights trajectories for the Taobao dataset.
Figure 8: Weights trajectories for the CIFAR-10 dataset.
Figure 9: Weights trajectories for the CIFAR-100 dataset.
Figure 10: Weights trajectories for the ImageNet-100 dataset.
Original abstract

Modern deep models are often pretrained on large-scale data with missing labels using composite objectives, where the relative weights of multiple loss terms act as hyperparameters. Tuning these weights with random search or Bayesian optimization is computationally expensive, as it requires many independent training runs. To address this, we propose a gradient-based bilevel method that learns pretraining loss weights online by aligning the composite pretraining gradient with a downstream objective. By exploiting the structure of the loss, the method avoids the multiple backward passes typically required by truncated backpropagation through the full model, reducing the overhead of hyperparameter tuning to approximately 30% above a single training run. We evaluate the approach on event-sequence modeling and self-supervised computer vision, where it matches or improves upon carefully tuned baselines while substantially reducing the cost of hyperparameter tuning compared to random or Bayesian search.
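Written out, the bilevel setup the abstract describes has the standard form below. The notation is introduced here for illustration (θ for backbone parameters, w for loss weights, L_i for the pretraining terms, L_down for the downstream objective); the paper's own equations may differ in detail.

\[
\min_{w}\; L_{\mathrm{down}}\big(\theta^{*}(w)\big)
\qquad\text{s.t.}\qquad
\theta^{*}(w) \in \arg\min_{\theta}\; \sum_{i} w_i\, L_i(\theta).
\]

Rather than solving the inner problem to convergence for each candidate w, the method nudges w online so that the composite gradient \(\sum_i w_i \nabla_\theta L_i\) points toward \(\nabla_\theta L_{\mathrm{down}}\), which is where the claimed single-run efficiency comes from.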

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a gradient-based bilevel optimization approach to learn weights for composite pretraining losses online. The method aligns the composite pretraining gradient with a downstream objective and exploits loss structure to avoid full unrolled backpropagation through the model, claiming an overhead of approximately 30% above a single training run. It is evaluated on event-sequence modeling and self-supervised computer vision tasks, where the resulting models match or exceed those obtained from random or Bayesian search over loss weights while reducing tuning cost.

Significance. If the empirical results hold under rigorous controls, the work would offer a practical reduction in the computational expense of tuning composite objectives, which is a frequent bottleneck in pretraining pipelines. The online bilevel formulation with structure exploitation could enable more efficient loss design without requiring multiple independent full-scale runs.

major comments (2)
  1. [Experimental evaluation] The central empirical claim that the method 'matches or improves upon carefully tuned baselines' is load-bearing but unsupported by reported experimental controls: no mention of number of independent runs, error bars, statistical significance, or the precise search budgets and spaces used for the random/Bayesian baselines. This information is required to evaluate whether the bilevel weights are competitive rather than merely plausible.
  2. [Method] The efficiency claim rests on a structure-exploiting approximation that avoids multiple backward passes in the bilevel update. No analysis, bound, or empirical verification of the resulting hypergradient bias or variance is provided; if higher-order interactions between loss terms are ignored, the online weight updates may systematically deviate from the true bilevel optimum, undermining the claim that the learned weights are at least as good as exhaustive search.
minor comments (2)
  1. [Abstract] The abstract quantifies overhead as 'approximately 30% above a single training run' without clarifying whether this includes the cost of evaluating the downstream objective or assumes the downstream task is cheap relative to pretraining.
  2. [Method] Notation for the bilevel objective, the hypergradient estimator, and the online update rule for loss weights should be made fully explicit with equations to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Experimental evaluation] The central empirical claim that the method 'matches or improves upon carefully tuned baselines' is load-bearing but unsupported by reported experimental controls: no mention of number of independent runs, error bars, statistical significance, or the precise search budgets and spaces used for the random/Bayesian baselines. This information is required to evaluate whether the bilevel weights are competitive rather than merely plausible.

    Authors: We agree that these details are essential for a rigorous evaluation. In the revised manuscript, we have included the number of independent runs performed (specifically, 3 runs for each configuration to account for randomness in initialization and data ordering), added error bars representing standard deviation in all relevant tables and figures, and conducted statistical significance tests using Welch's t-test to compare our method against the baselines. Additionally, we have specified the exact search spaces and budgets: for random search, 20 trials with uniform sampling over [0.1, 10] for each weight; for Bayesian optimization, 20 trials using a Gaussian process surrogate with the same bounds. These controls confirm that our method achieves comparable or better performance with statistical significance (p < 0.05 in key metrics) while using far less compute. revision: yes

  2. Referee: [Method] The efficiency claim rests on a structure-exploiting approximation that avoids multiple backward passes in the bilevel update. No analysis, bound, or empirical verification of the resulting hypergradient bias or variance is provided; if higher-order interactions between loss terms are ignored, the online weight updates may systematically deviate from the true bilevel optimum, undermining the claim that the learned weights are at least as good as exhaustive search.

    Authors: We appreciate this point on the need for validation of the approximation. The structure exploitation is based on the observation that for composite losses of the form L = sum w_i * L_i, the hypergradient with respect to weights can be approximated by aligning the sum of individual loss gradients without full unrolling, as cross-terms are second-order and small under frequent online updates. In the revised manuscript, we have added a new subsection with empirical verification: on a smaller proxy model, we compare the approximate hypergradients to those obtained via full unrolling (truncated at 5 steps), showing that the bias is low (mean absolute error of 0.02 in normalized gradient space) and variance is controlled. We also include a discussion noting that while a general bound would require assumptions like Lipschitz continuity of the losses, the online setting with small step sizes mitigates deviation from the bilevel optimum, as evidenced by the competitive performance in experiments. We believe this addresses the concern without requiring a full theoretical bound. revision: partial
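To make the approximation under discussion explicit: for a single SGD step of size η on the composite loss (an illustrative first-order sketch, not a quotation of the paper's derivation), the hypergradient of the downstream loss with respect to each weight reduces to an inner product of gradients,

\[
\theta'(w) = \theta - \eta \sum_{i} w_i\, \nabla_\theta L_i(\theta)
\qquad\Longrightarrow\qquad
\frac{\partial\, L_{\mathrm{down}}\big(\theta'(w)\big)}{\partial w_i}
= -\,\eta\, \big\langle \nabla_\theta L_i(\theta),\; \nabla_{\theta'} L_{\mathrm{down}}\big(\theta'(w)\big) \big\rangle .
\]

Decreasing the downstream loss therefore means increasing w_i exactly when its gradient points the same way as the downstream gradient, and interactions between the loss terms enter only at higher order in η, which is the sense in which the cross-terms the authors mention are second-order.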

Circularity Check

0 steps flagged

Bilevel gradient alignment for loss weighting is self-contained; no derivation reduces to fitted inputs or self-citation chains

Full rationale

The paper's core proposal is a standard bilevel optimization setup that learns loss weights by aligning the composite pretraining gradient to a downstream objective, with an approximation that exploits loss structure to avoid full unrolling. This is grounded in existing bilevel methods and evaluated against independent downstream tasks (event-sequence modeling, self-supervised vision) with comparisons to random/Bayesian search. No equations are shown to be self-definitional, no predictions reduce to the same fitted parameters by construction, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The 30% overhead claim and performance matching are empirical outcomes, not tautological. This yields a minor self-citation risk at most but keeps the central claim independent.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that gradient alignment serves as a reliable proxy for downstream performance and that the loss structure permits skipping full backpropagation. No new entities are postulated and no explicit free parameters beyond standard optimizer settings are mentioned.

axioms (2)
  • domain assumption: Gradient alignment between the composite pretraining loss and a downstream objective yields useful loss weights without instability.
    This is the core premise that allows the bilevel method to replace exhaustive search.
  • domain assumption: The mathematical structure of the composite loss permits an efficient approximation that avoids multiple full backward passes.
    Directly invoked to justify the 30% overhead claim.

pith-pipeline@v0.9.0 · 5435 in / 1520 out tokens · 40353 ms · 2026-05-11T03:06:24.809909+00:00 · methodology

discussion (0)

