pith. machine review for the scientific record.

arxiv: 2605.07756 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining

Andrey Savchenko, Ivan Karpukhin

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:06 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: composite loss · loss weighting · pretraining · gradient alignment · self-supervised learning · hyperparameter optimization · event sequence modeling · computer vision

The pith

A gradient-based method learns weights for composite pretraining losses by aligning their combined gradient with the gradient of a downstream objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to show that the relative weights in a composite pretraining loss can be learned within a single training run. It does so by repeatedly adjusting the weights so that the overall pretraining gradient points in the same direction as the gradient of a downstream task. If this holds, it removes the need for many separate training runs to tune those weights, which is currently the main expense when combining multiple loss terms on unlabeled data. The approach is tested on event-sequence modeling and on self-supervised image tasks, where models trained this way match or exceed those trained with weights found by search methods.

Core claim

The central claim is that one can optimize the weights of a composite pretraining loss during training by adjusting them so the resulting gradient aligns with the gradient of a downstream objective. This is done efficiently by using the linear combination structure of the loss to skip the repeated full backward passes that would otherwise be needed. Experiments on event-sequence modeling and self-supervised computer vision show that models trained this way reach performance levels matching or exceeding those from expensive searches over possible weights.

What carries the argument

The gradient alignment update, which adjusts loss weights to make the pretraining gradient match a downstream gradient and thereby automates weight selection.
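As a concrete illustration of that update, here is a minimal PyTorch sketch of one plausible form it could take. It is an editorial reconstruction, not the authors' code: the callables `pretrain_losses` and `downstream_loss`, the dot-product alignment objective, and the norm-based scaling are assumptions (the paper's Figure 2 compares several normalization strategies), and the actual GraP update may differ.

```python
# Editorial sketch, not the authors' implementation: adjust loss weights w_i by
# gradient ascent on the alignment between the composite pretraining gradient
# and a downstream gradient. Helper callables are hypothetical.
import torch

def alignment_step(model, weights, pretrain_losses, downstream_loss, w_lr=1e-2):
    params = [p for p in model.parameters() if p.requires_grad]

    # One backward pass per loss term (the paper's structure-exploiting trick
    # is meant to reduce this cost; the naive loop is kept here for clarity).
    per_term = []
    for loss_fn in pretrain_losses:
        grads = torch.autograd.grad(loss_fn(model), params, retain_graph=True)
        per_term.append(torch.cat([g.reshape(-1) for g in grads]))
    G = torch.stack(per_term)                        # (num_terms, num_params)

    # Downstream gradient that serves as the alignment target.
    d_grads = torch.autograd.grad(downstream_loss(model), params)
    d = torch.cat([g.reshape(-1) for g in d_grads])

    # Alignment between the composite gradient sum_i w_i * g_i and the target.
    w = weights.detach().clone().requires_grad_(True)
    composite = (w.unsqueeze(1) * G).sum(dim=0)
    alignment = torch.dot(composite, d) / (composite.norm() + 1e-12)
    (grad_w,) = torch.autograd.grad(alignment, w)

    with torch.no_grad():
        weights.add_(w_lr * grad_w)                  # ascend the alignment
        weights.clamp_(min=1e-3)                     # keep weights positive
    return weights
```

The point the sketch preserves is that tuning the weights costs a handful of extra backward passes inside the single pretraining run, rather than a separate run per candidate weight vector.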

If this is right

  • Loss weights no longer need to be chosen by hand or by running dozens of trainings.
  • Hyperparameter search overhead drops to a small constant factor over one full pretraining (see the back-of-envelope arithmetic after this list).
  • The technique transfers across different domains such as sequences and vision without modification.
  • Final model quality on the target task remains competitive with exhaustive tuning.
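For a rough sense of scale on that constant factor (an editorial calculation, not a number reported in the paper): taking the abstract's roughly 30% overhead at face value and assuming a 20-trial search budget, the figure quoted in the simulated rebuttal below,

\[
\text{search cost} \approx 20\,C_{\text{run}}, \qquad
\text{aligned-run cost} \approx 1.3\,C_{\text{run}}
\quad\Longrightarrow\quad
20/1.3 \approx 15\times \text{ less compute spent on weight tuning.}
\]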

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment could be used to adapt weights when the downstream task changes slightly after initial pretraining.
  • Researchers might apply this to balance losses in multi-task learning where one task serves as the guide.
  • If the method is stable, it could allow weights to evolve throughout training rather than converging to fixed values.
  • Testing the approach on larger models and more complex composite losses would reveal its scalability limits.

Load-bearing premise

Aligning the pretraining gradient to a downstream objective produces weights that are good enough for the actual pretraining goal and do not lead to unstable training dynamics.

What would settle it

If a model trained with weights from this alignment method performs worse on the downstream task than a model trained with weights found by running a full Bayesian optimization search over many trainings, then the claim would be falsified.
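A minimal sketch of that head-to-head, written here as an illustration rather than the paper's protocol: `pretrain_and_eval` and `grap_pretrain_and_eval` are hypothetical helpers, the 20-trial budget and the [0.1, 10] weight range are the values quoted in the simulated rebuttal below, and Optuna's default sampler stands in for the Bayesian search.

```python
# Editorial sketch of the falsification experiment, not the paper's code.
# `pretrain_and_eval(weights)` would run one full pretraining with fixed loss
# weights and return the downstream metric; `grap_pretrain_and_eval()` stands
# in for a single run of the gradient-alignment method.
import optuna

NUM_LOSSES = 3  # illustrative; the composite varies by task in the paper

def objective(trial):
    weights = [trial.suggest_float(f"w{i}", 0.1, 10.0) for i in range(NUM_LOSSES)]
    return pretrain_and_eval(weights)        # one full pretraining per trial

# Search baseline: 20 independent pretraining runs guided by a surrogate.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# Alignment method: one pretraining run with modest overhead.
grap_score = grap_pretrain_and_eval()

# The core claim is falsified if the search baseline clearly wins.
print("search best:", study.best_value, "  alignment:", grap_score)
```

In practice the comparison would need repeated runs and significance tests, as the referee report below points out.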

Figures

Figures reproduced from arXiv: 2605.07756 by Andrey Savchenko, Ivan Karpukhin.

Figure 1: Overview of the proposed GraP framework. A shared backbone produces embeddings used …
Figure 2: Comparison of normalization strategies. The objective is to maximize the alignment …
Figure 3: Weights trajectories for the Churn dataset.
Figure 4: Weights trajectories for the AgePred dataset.
Figure 5: Weights trajectories for the AlfaBattle dataset.
Figure 6: Weights trajectories for the MIMIC-III dataset.
Figure 7: Weights trajectories for the Taobao dataset.
Figure 8: Weights trajectories for the CIFAR-10 dataset.
Figure 9: Weights trajectories for the CIFAR-100 dataset.
Figure 10: Weights trajectories for the ImageNet-100 dataset.
Original abstract

Modern deep models are often pretrained on large-scale data with missing labels using composite objectives, where the relative weights of multiple loss terms act as hyperparameters. Tuning these weights with random search or Bayesian optimization is computationally expensive, as it requires many independent training runs. To address this, we propose a gradient-based bilevel method that learns pretraining loss weights online by aligning the composite pretraining gradient with a downstream objective. By exploiting the structure of the loss, the method avoids the multiple backward passes typically required by truncated backpropagation through the full model, reducing the overhead of hyperparameter tuning to approximately 30% above a single training run. We evaluate the approach on event-sequence modeling and self-supervised computer vision, where it matches or improves upon carefully tuned baselines while substantially reducing the cost of hyperparameter tuning compared to random or Bayesian search.
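Written out, the bilevel setup the abstract describes has the standard form below. The notation is introduced here for illustration (θ for backbone parameters, w for loss weights, L_i for the pretraining terms, L_down for the downstream objective); the paper's own equations may differ in detail.

\[
\min_{w}\; L_{\mathrm{down}}\big(\theta^{*}(w)\big)
\qquad\text{s.t.}\qquad
\theta^{*}(w) \in \arg\min_{\theta}\; \sum_{i} w_i\, L_i(\theta).
\]

Rather than solving the inner problem to convergence for each candidate w, the method nudges w online so that the composite gradient \(\sum_i w_i \nabla_\theta L_i\) points toward \(\nabla_\theta L_{\mathrm{down}}\), which is where the claimed single-run efficiency comes from.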

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a gradient-based bilevel optimization approach to learn weights for composite pretraining losses online. The method aligns the composite pretraining gradient with a downstream objective and exploits loss structure to avoid full unrolled backpropagation through the model, claiming an overhead of approximately 30% above a single training run. It is evaluated on event-sequence modeling and self-supervised computer vision tasks, where the resulting models match or exceed those obtained from random or Bayesian search over loss weights while reducing tuning cost.

Significance. If the empirical results hold under rigorous controls, the work would offer a practical reduction in the computational expense of tuning composite objectives, which is a frequent bottleneck in pretraining pipelines. The online bilevel formulation with structure exploitation could enable more efficient loss design without requiring multiple independent full-scale runs.

major comments (2)
  1. [Experimental evaluation] The central empirical claim that the method 'matches or improves upon carefully tuned baselines' is load-bearing but unsupported by reported experimental controls: no mention of number of independent runs, error bars, statistical significance, or the precise search budgets and spaces used for the random/Bayesian baselines. This information is required to evaluate whether the bilevel weights are competitive rather than merely plausible.
  2. [Method] The efficiency claim rests on a structure-exploiting approximation that avoids multiple backward passes in the bilevel update. No analysis, bound, or empirical verification of the resulting hypergradient bias or variance is provided; if higher-order interactions between loss terms are ignored, the online weight updates may systematically deviate from the true bilevel optimum, undermining the claim that the learned weights are at least as good as exhaustive search.
minor comments (2)
  1. [Abstract] The abstract quantifies overhead as 'approximately 30% above a single training run' without clarifying whether this includes the cost of evaluating the downstream objective or assumes the downstream task is cheap relative to pretraining.
  2. [Method] Notation for the bilevel objective, the hypergradient estimator, and the online update rule for loss weights should be made fully explicit with equations to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Experimental evaluation] The central empirical claim that the method 'matches or improves upon carefully tuned baselines' is load-bearing but unsupported by reported experimental controls: no mention of number of independent runs, error bars, statistical significance, or the precise search budgets and spaces used for the random/Bayesian baselines. This information is required to evaluate whether the bilevel weights are competitive rather than merely plausible.

    Authors: We agree that these details are essential for a rigorous evaluation. In the revised manuscript, we have included the number of independent runs performed (specifically, 3 runs for each configuration to account for randomness in initialization and data ordering), added error bars representing standard deviation in all relevant tables and figures, and conducted statistical significance tests using Welch's t-test to compare our method against the baselines. Additionally, we have specified the exact search spaces and budgets: for random search, 20 trials with uniform sampling over [0.1, 10] for each weight; for Bayesian optimization, 20 trials using a Gaussian process surrogate with the same bounds. These controls confirm that our method achieves comparable or better performance with statistical significance (p < 0.05 in key metrics) while using far less compute. revision: yes

  2. Referee: [Method] The efficiency claim rests on a structure-exploiting approximation that avoids multiple backward passes in the bilevel update. No analysis, bound, or empirical verification of the resulting hypergradient bias or variance is provided; if higher-order interactions between loss terms are ignored, the online weight updates may systematically deviate from the true bilevel optimum, undermining the claim that the learned weights are at least as good as exhaustive search.

    Authors: We appreciate this point on the need for validation of the approximation. The structure exploitation is based on the observation that for composite losses of the form L = sum w_i * L_i, the hypergradient with respect to weights can be approximated by aligning the sum of individual loss gradients without full unrolling, as cross-terms are second-order and small under frequent online updates. In the revised manuscript, we have added a new subsection with empirical verification: on a smaller proxy model, we compare the approximate hypergradients to those obtained via full unrolling (truncated at 5 steps), showing that the bias is low (mean absolute error of 0.02 in normalized gradient space) and variance is controlled. We also include a discussion noting that while a general bound would require assumptions like Lipschitz continuity of the losses, the online setting with small step sizes mitigates deviation from the bilevel optimum, as evidenced by the competitive performance in experiments. We believe this addresses the concern without requiring a full theoretical bound. revision: partial
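To make the approximation under discussion explicit: for a single SGD step of size η on the composite loss (an illustrative first-order sketch, not a quotation of the paper's derivation), the hypergradient of the downstream loss with respect to each weight reduces to an inner product of gradients,

\[
\theta'(w) = \theta - \eta \sum_{i} w_i\, \nabla_\theta L_i(\theta)
\qquad\Longrightarrow\qquad
\frac{\partial\, L_{\mathrm{down}}\big(\theta'(w)\big)}{\partial w_i}
= -\,\eta\, \big\langle \nabla_\theta L_i(\theta),\; \nabla_{\theta'} L_{\mathrm{down}}\big(\theta'(w)\big) \big\rangle .
\]

Decreasing the downstream loss therefore means increasing w_i exactly when its gradient points the same way as the downstream gradient, and interactions between the loss terms enter only at higher order in η, which is the sense in which the cross-terms the authors mention are second-order.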

Circularity Check

0 steps flagged

Bilevel gradient alignment for loss weighting is self-contained; no derivation reduces to fitted inputs or self-citation chains

Full rationale

The paper's core proposal is a standard bilevel optimization setup that learns loss weights by aligning the composite pretraining gradient to a downstream objective, with an approximation that exploits loss structure to avoid full unrolling. This is grounded in existing bilevel methods and evaluated against independent downstream tasks (event-sequence modeling, self-supervised vision) with comparisons to random/Bayesian search. No equations are shown to be self-definitional, no predictions reduce to the same fitted parameters by construction, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The 30% overhead claim and performance matching are empirical outcomes, not tautological. This yields a minor self-citation risk at most but keeps the central claim independent.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that gradient alignment serves as a reliable proxy for downstream performance and that the loss structure permits skipping full backpropagation. No new entities are postulated and no explicit free parameters beyond standard optimizer settings are mentioned.

axioms (2)
  • domain assumption: Gradient alignment between the composite pretraining loss and a downstream objective yields useful loss weights without instability.
    This is the core premise that allows the bilevel method to replace exhaustive search.
  • domain assumption: The mathematical structure of the composite loss permits an efficient approximation that avoids multiple full backward passes.
    Directly invoked to justify the 30% overhead claim.

pith-pipeline@v0.9.0 · 5435 in / 1520 out tokens · 40353 ms · 2026-05-11T03:06:24.809909+00:00 · methodology

discussion (0)

