Learning Gradient-based Mixup with Extrapolation toward Flatter Minima for Domain Generalization
Pith reviewed 2026-05-24 11:14 UTC · model grok-4.3
The pith
FGMix weights mixup instances by gradient compatibility to extrapolate toward flatter minima and improve domain generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By generating instance weights via gradient-based compatibilities, FGMix assigns greater weight to instances that carry more invariant information and steers the mixup policy (including extrapolation) toward flatter minima, thereby covering potentially unseen regions while avoiding the harms of unconstrained extrapolation and yielding improved generalization on unseen domains.
What carries the argument
Flatness-aware Gradient-based Mixup (FGMix), a policy that derives instance weights from gradient compatibilities to favor invariant information and direct extrapolation toward flatter minima.
If this is right
- Mixup that includes extrapolation can cover feature-space regions outside the source domains.
- Weighting by gradient compatibility produces a mixup policy that favors instances with invariant information.
- Steering the mixup objective toward flatter minima improves generalization to unseen domains.
- Multiple design choices inside FGMix each contribute measurable gains on the DomainBed benchmark.
Where Pith is reading between the lines
- The same gradient-compatibility weighting could be applied to other data-augmentation schemes beyond mixup.
- Flatter-minima guidance might reduce sensitivity to hyper-parameter choices in the extrapolation ratio.
- If the method truly isolates invariant information, it could be tested on tasks with known causal structure such as synthetic shifts with explicit confounders.
Load-bearing premise
Gradient-based compatibilities can reliably identify which instances carry more invariant information across domains.
What would settle it
Running the reported FGMix variants on DomainBed and finding no accuracy gain over standard mixup or existing DG baselines on the held-out target domains would falsify the central claim.
Figures
read the original abstract
To address distribution shifts between training and test data, domain generalization (DG) leverages multiple source domains to learn a model that generalizes well to unseen domains. However, existing DG methods often overfit to the source domains, partly due to the limited coverage of the expected region in feature space. Motivated by this, we propose performing mixup with data interpolation and extrapolation to cover potentially unseen regions. To prevent the detrimental effects of unconstrained extrapolation, we carefully design a policy to generate the instance weights, named Flatness-aware Gradient-based Mixup (FGMix). The policy relies on gradient-based compatibilities to assign greater weights to instances that carry more invariant information and learn the mixup policy towards flatter minima for better generalization. On the DomainBed benchmark, we validate the efficacy of various designs of FGMix and demonstrate its superiority over other DG algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Flatness-aware Gradient-based Mixup (FGMix) to improve domain generalization (DG). It performs mixup via both interpolation and extrapolation in feature space to increase coverage of unseen regions, but constrains extrapolation by deriving instance weights from gradient-based compatibilities. These weights are intended to emphasize instances carrying more invariant information and to steer optimization toward flatter minima. The central empirical claim is that various designs of FGMix outperform prior DG algorithms on the DomainBed benchmark.
Significance. If the central claim holds after proper validation, the work would offer a concrete mechanism for guiding mixup policies in DG using gradient signals to target flat minima, addressing a recognized limitation of unconstrained extrapolation. The approach is technically novel in its combination of gradient compatibility weighting with explicit extrapolation, and the attempt to tie weighting to invariance is a positive direction. However, the current manuscript supplies no reproducible experimental protocol, ablation isolating the weighting rule, or statistical analysis, so the significance cannot yet be assessed.
major comments (3)
- [§4] §4 (Experiments) and abstract: the claim that FGMix demonstrates superiority on DomainBed is unsupported because the manuscript provides no experimental protocol, number of runs, statistical tests, hyperparameter search details, or description of how the gradient-compatibility measure is exactly computed and normalized. Without these, the empirical support for the central claim cannot be evaluated.
- [§3.2] §3.2 (FGMix policy): the weighting rule is defined in terms of gradients produced by the model under training. The manuscript does not supply a derivation or pseudocode showing that these compatibilities are invariant across domains rather than encoding source-specific signals, leaving open the possibility that extrapolation amplifies domain-specific features. This is load-bearing for the claim that the policy selects instances with more invariant information.
- [§4.3] §4.3 (Ablations): no ablation is reported that isolates the gradient-based weighting from plain mixup plus extrapolation. Without this comparison, it is impossible to attribute any gains to the flatness-aware component rather than to the extrapolation itself.
minor comments (2)
- [§3] Notation for the compatibility measure (e.g., symbols for gradient inner products) is introduced without a consolidated table of definitions, making the equations in §3 difficult to follow on first reading.
- [Figure 2] Figure 2 (mixup illustration) would benefit from explicit annotation of the extrapolation direction and the weighting values used in each panel.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The concerns regarding experimental reproducibility, the invariance properties of the weighting rule, and the need for targeted ablations are valid and will be addressed in a revised manuscript. We respond to each major comment below.
read point-by-point responses
-
Referee: [§4] §4 (Experiments) and abstract: the claim that FGMix demonstrates superiority on DomainBed is unsupported because the manuscript provides no experimental protocol, number of runs, statistical tests, hyperparameter search details, or description of how the gradient-compatibility measure is exactly computed and normalized. Without these, the empirical support for the central claim cannot be evaluated.
Authors: We agree that the current version lacks sufficient detail on the experimental protocol. In the revision we will expand Section 4 to include: (i) the exact DomainBed evaluation protocol with the standard train/validation/test splits, (ii) the number of independent runs (three random seeds per method), (iii) mean and standard deviation reporting together with paired t-tests against the strongest baseline, (iv) the hyperparameter search ranges and selection procedure, and (v) the precise formula and normalization used for the gradient-compatibility weights. These additions will make the superiority claim fully reproducible and statistically supported. revision: yes
-
Referee: [§3.2] §3.2 (FGMix policy): the weighting rule is defined in terms of gradients produced by the model under training. The manuscript does not supply a derivation or pseudocode showing that these compatibilities are invariant across domains rather than encoding source-specific signals, leaving open the possibility that extrapolation amplifies domain-specific features. This is load-bearing for the claim that the policy selects instances with more invariant information.
Authors: We acknowledge that the manuscript currently provides neither a formal derivation nor pseudocode demonstrating domain-invariance of the gradient compatibilities. In the revision we will add both: a short derivation showing that the compatibility score measures alignment of per-domain gradients on features that are predictive across sources, and Algorithm 1 that explicitly computes and normalizes the weights. We will also discuss the assumption that source-specific signals are down-weighted because they produce inconsistent gradient directions across domains, thereby addressing the concern that extrapolation might amplify domain-specific features. revision: yes
-
Referee: [§4.3] §4.3 (Ablations): no ablation is reported that isolates the gradient-based weighting from plain mixup plus extrapolation. Without this comparison, it is impossible to attribute any gains to the flatness-aware component rather than to the extrapolation itself.
Authors: We agree that an ablation isolating the contribution of the gradient-based weighting is essential. We will add a new table in Section 4.3 that compares three variants on DomainBed: (1) standard mixup, (2) mixup with unconstrained extrapolation, and (3) FGMix (extrapolation with gradient-compatibility weighting). This will allow readers to quantify the incremental benefit of the flatness-aware weighting rule over plain extrapolation. revision: yes
Circularity Check
No significant circularity; empirical method validated on external benchmark
full rationale
The paper introduces FGMix as a proposed algorithm that computes instance weights from gradient-based compatibilities during training and validates its designs empirically on the DomainBed benchmark. No load-bearing derivation step reduces by construction to its own inputs, fitted parameters renamed as predictions, or self-citation chains. The central claim of superiority is an empirical result against external baselines rather than a first-principles derivation that collapses into the method's own definitions. This is the most common honest finding for algorithmic papers with benchmark validation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Gradient-based compatibilities identify instances with more invariant information across domains.
- domain assumption Optimization toward flatter minima improves generalization under distribution shift.
Reference graph
Works this paper leans on
-
[1]
Ensemble of averages: Im- proving model selection and boosting performance in domain generalization
Devansh Arpit, Huan Wang, Yingbo Zhou, and Caiming Xiong. Ensemble of averages: Im- proving model selection and boosting performance in domain generalization. arXiv preprint arXiv:2110.10832, 2021
-
[2]
Metareg: towards domain generalization using meta-regularization
Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: towards domain generalization using meta-regularization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1006–1016, 2018
work page 2018
-
[3]
Recognition in terra incognita
Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pages 456–473, 2018
work page 2018
-
[4]
Understanding and improving interpolation in autoencoders via an adversarial regularizer
David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In International Conference on Learning Representations, 2018
work page 2018
-
[5]
Generalizing from several related classification tasks to a new unlabeled sample
Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. Advances in neural information processing systems, 24, 2011
work page 2011
-
[6]
Swad: Domain generalization by seeking flat minima
Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[7]
Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal risk minimization. In NIPS, 2000
work page 2000
-
[8]
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys (international conference on learning representations, iclr 2017). In 5th International Conference on Learning Representations, ICLR 2017, 2019
work page 2017
-
[9]
Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, and Da-Cheng Juan. Remix: rebal- anced mixup. In European Conference on Computer Vision, pages 95–110. Springer, 2020
work page 2020
-
[10]
Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias
Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pages 1657–1664, 2013
work page 2013
-
[11]
Sharpness-aware min- imization for efficiently improving generalization
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware min- imization for efficiently improving generalization. In International Conference on Learning Representations, 2020
work page 2020
-
[12]
Loss surfaces, mode connectivity, and fast ensembling of dnns
T Garipov, P Izmailov, AG Wilson, D Podoprikhin, and D Vetrov. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018
work page 2018
-
[13]
In search of lost domain generalization
Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020
-
[14]
Mixup as locally linear out-of-manifold regularization
Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3714–3722, 2019
work page 2019
-
[15]
Stochastic weight averaging revisited
Hao Guo, Jiyong Jin, and Bin Liu. Stochastic weight averaging revisited. arXiv preprint arXiv:2201.00519, 2022
-
[16]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[17]
Sepp Hochreiter, J urgen Schmidhuber, and Corso Elvezia. Flat minima. Neural Computation, 9(1):1–42, 1997. 11
work page 1997
-
[18]
Averaging weights leads to wider optima and better generalization
P Izmailov, AG Wilson, D Podoprikhin, D Vetrov, and T Garipov. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pages 876–885, 2018
work page 2018
-
[19]
Fantastic generalization measures and where to find them
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2019
work page 2019
-
[20]
On large-batch training for deep learning: Generalization gap and sharp minima
Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, 2017
work page 2017
-
[21]
Selfreg: Self- supervised contrastive regularization for domain generalization
Daehee Kim, Youngjun Yoo, Seunghyun Park, Jinkyu Kim, and Jaekoo Lee. Selfreg: Self- supervised contrastive regularization for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9619–9628, 2021
work page 2021
-
[22]
Deeper, broader and artier domain generalization
Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550, 2017
work page 2017
-
[23]
Learning to generalize: Meta- learning for domain generalization
Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta- learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018
work page 2018
-
[24]
Domain generalization with adversarial feature learning
Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018
work page 2018
-
[25]
Episodic training for domain generalization
Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1446–1455, 2019
work page 2019
-
[26]
A practical bayesian framework for backpropagation networks
David JC MacKay. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992
work page 1992
-
[27]
Metamixup: Learning adaptive interpolation policy of mixup with metalearning
Zhijun Mai, Guosheng Hu, Dexiong Chen, Fumin Shen, and Heng Tao Shen. Metamixup: Learning adaptive interpolation policy of mixup with metalearning. IEEE Transactions on Neural Networks and Learning Systems, 2021
work page 2021
-
[28]
Domain generalization via gradient surgery
Lucas Mansilla, Rodrigo Echeveste, Diego H Milone, and Enzo Ferrante. Domain generalization via gradient surgery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6630–6638, 2021
work page 2021
-
[29]
Domain generalization via invariant feature representation
Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013
work page 2013
-
[30]
Reducing domain gap by reducing style bias
Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8690–8699, 2021
work page 2021
-
[31]
Learning explanations that are hard to vary
Giambattista Parascandolo, Alexander Neitz, ANTONIO ORVIETO, Luigi Gresele, and Bern- hard Schölkopf. Learning explanations that are hard to vary. In International Conference on Learning Representations, 2020
work page 2020
-
[32]
Moment matching for multi-source domain adaptation
Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1406–1415, 2019
work page 2019
-
[33]
Relative flatness and generalization
Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[34]
Fishr: Invariant gradient variances for out-of-distribution generalization
Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. arXiv preprint arXiv:2109.02934, 2021. 12
-
[35]
Learning to learn without forgetting by maximizing transfer and minimizing interference
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2018
work page 2018
-
[36]
Soroosh Shahtalebi, Jean-Christophe Gagnon-Audet, Touraj Laleh, Mojtaba Faramarzi, Kartik Ahuja, and Irina Rish. Sand-mask: An enhanced gradient masking strategy for the discovery of invariances in domain generalization. arXiv preprint arXiv:2106.02266, 2021
-
[37]
Gradient matching for domain generalization.arXiv preprint arXiv:2104.09937, 2021
Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization.arXiv preprint arXiv:2104.09937, 2021
-
[38]
Deep coral: Correlation alignment for deep domain adaptation
Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450. Springer, 2016
work page 2016
-
[39]
Crossnorm and selfnorm for generalization under distribution shifts
Zhiqiang Tang, Yunhe Gao, Yi Zhu, Zhi Zhang, Mu Li, and Dimitris N Metaxas. Crossnorm and selfnorm for generalization under distribution shifts. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 52–61, 2021
work page 2021
-
[40]
Vladimir N Vapnick. Statistical learning theory. Wiley, New York, 1998
work page 1998
-
[41]
Deep hashing network for unsupervised domain adaptation
Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5018–5027, 2017
work page 2017
-
[42]
Manifold mixup: Better representations by interpolating hidden states
Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez- Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019
work page 2019
-
[43]
Heterogeneous domain generalization via domain mixup
Yufei Wang, Haoliang Li, and Alex C Kot. Heterogeneous domain generalization via domain mixup. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3622–3626. IEEE, 2020
work page 2020
-
[44]
Gradient surgery for multi-task learning
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020
work page 2020
-
[45]
Cutmix: Regularization strategy to train strong classifiers with localizable features
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019
work page 2019
-
[46]
mixup: Beyond empirical risk minimization
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018
work page 2018
-
[47]
Towards principled disentanglement for domain generalization
Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf, and Eric P Xing. Towards principled disentanglement for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8024–8034, 2022
work page 2022
-
[48]
Learning to generate novel domains for domain generalization
Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang. Learning to generate novel domains for domain generalization. In European conference on computer vision, pages 561–578. Springer, 2020
work page 2020
-
[49]
Domain generalization with mixstyle
Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In International Conference on Learning Representations, 2020. 13 A Additional Experimental Results A.1 Full Results for Overall Comparison Table 3: Overall comparison of selected algorithms on PACS. Algorithm A C P S Avg. ERM [40] 84.7±0.4 80.8±0.6 97.2±0.3 79.3±1.0 ...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.