Pith · machine review for the scientific record

arxiv: 2512.05226 · v2 · submitted 2025-12-04 · 💻 cs.LG

Recognition: 2 Lean theorem links

Variance Matters: Improving Domain Adaptation via Stratified Sampling

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 00:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords unsupervised domain adaptation · stratified sampling · variance reduction · maximum mean discrepancy · correlation alignment · domain shift

The pith

Stratified sampling reduces variance in discrepancy estimates for unsupervised domain adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the high variance of stochastic domain-discrepancy estimates, which limits the benefits of unsupervised domain adaptation. It uses stratified sampling to derive dedicated objectives for correlation alignment and for the maximum mean discrepancy (MMD). These objectives come with expected and worst-case error bounds, together with a proof that the MMD objective minimizes variance under certain assumptions. A k-means style algorithm optimizes the stratification in practice, and experiments on four domain shift datasets show gains in estimation accuracy and target performance.
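To make the mechanism concrete, here is a minimal NumPy sketch on toy data with oracle strata; the features, cluster structure, and batch size are hypothetical, and this is not the paper's estimator or code. Sampling stratum by stratum removes the between-cluster contribution to the variance of the minibatch mean, loosely analogous to the surrogate quantity Var(μ̂_φ,s) shown in Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "source features": three well-separated clusters standing in for
# embeddings phi(x). Data, cluster count, and batch size are all hypothetical.
strata = [rng.normal(loc=m, scale=0.3, size=(400, 8)) for m in (-2.0, 0.0, 2.0)]
X = np.concatenate(strata)                 # (1200, 8) source features
labels = np.repeat(np.arange(3), 400)      # oracle stratum labels for the sketch
batch = 30

def uniform_batch_mean():
    idx = rng.choice(len(X), size=batch, replace=False)
    return X[idx].mean(axis=0)

def stratified_batch_mean():
    # Proportional allocation: equal-sized strata here, so batch/3 from each.
    parts = [X[labels == s][rng.choice(400, size=batch // 3, replace=False)]
             for s in range(3)]
    return np.concatenate(parts).mean(axis=0)

def estimator_variance(draw, trials=2000):
    draws = np.stack([draw() for _ in range(trials)])
    return np.trace(np.cov(draws, rowvar=False))  # sum of per-dimension variances

print("uniform   :", estimator_variance(uniform_batch_mean))
print("stratified:", estimator_variance(stratified_batch_mean))
```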

Core claim

The central claim is that ad hoc stratification objectives for correlation alignment and MMD reduce the variance of stochastic discrepancy estimates, that the MMD objective is provably variance-optimal under the paper's assumptions, and that this yields improved error bounds and empirical gains through a practical optimization procedure.

What carries the argument

The stratification objectives for MMD and correlation alignment that partition samples to achieve lower variance in the discrepancy calculations.
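The textbook stratified-sampling identity (standard variance decomposition, not the paper's specific objective) is the reason a good partition helps: with strata weights w_h, within-stratum means μ_h, within-stratum variances σ_h², and proportional allocation n_h = n·w_h,

```latex
\[
\operatorname{Var}\bigl(\hat{\mu}_{\mathrm{strat}}\bigr)
  = \sum_h \frac{w_h^{2}\,\sigma_h^{2}}{n_h}
  = \frac{1}{n}\sum_h w_h\,\sigma_h^{2},
\qquad
\operatorname{Var}\bigl(\hat{\mu}_{\mathrm{SRS}}\bigr)
  \approx \frac{1}{n}\Bigl(\sum_h w_h\,\sigma_h^{2}
  + \sum_h w_h\,(\mu_h-\mu)^{2}\Bigr).
\]
```

Stratification thus removes the between-strata term Σ_h w_h(μ_h − μ)², so choosing the partition (the role of the paper's k-means step) comes down to making the within-stratum variances small.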

If this is right

  • Reduced variance leads to more reliable discrepancy estimates during training.
  • Improved performance on the target domain across multiple datasets.
  • Theoretical guarantees in the form of error bounds for the adapted models.
  • The k-means optimization provides an efficient way to apply the method in practice (a rough stand-in sketch follows this list).
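A rough stand-in, assuming ordinary Lloyd's k-means on source features plus proportional-allocation sampling; the paper's Algorithm 1 is a dynamically weighted kernel k-means, which this sketch does not reproduce, and all names and shapes here are illustrative.

```python
import numpy as np

def lloyd_strata(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means as a stand-in for the paper's stratification step
    (Algorithm 1 in the paper is a dynamically weighted *kernel* k-means)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def stratified_minibatch(X, labels, batch, rng):
    """Draw a minibatch with (roughly) proportional allocation across strata;
    rounding means the batch size is approximate."""
    idx = []
    for j in range(labels.max() + 1):
        members = np.flatnonzero(labels == j)
        m = max(1, int(round(batch * len(members) / len(X))))
        idx.append(rng.choice(members, size=min(m, len(members)), replace=False))
    return X[np.concatenate(idx)]

# Example usage on random features (hypothetical shapes).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))
labels = lloyd_strata(X, k=8)
minibatch = stratified_minibatch(X, labels, batch=64, rng=rng)
```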

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar stratification techniques might benefit other variance-sensitive methods in machine learning.
  • Extending this to additional discrepancy measures could broaden the applicability.
  • Validating the assumptions through varied experimental setups would strengthen the optimality claim.

Load-bearing premise

That the stated assumptions under which the MMD objective is variance-minimizing actually hold in practice, and that the ad hoc objectives remain effective beyond the tested cases.

What would settle it

Demonstrating that the proposed MMD stratification does not achieve lower variance than standard sampling in a controlled stochastic setting, or that target domain accuracy does not improve with the method.

Figures

Figures reproduced from arXiv: 2512.05226 by Andrea Napoli, Paul White.

Figure 1: The relationship between the target (Var(L̂_MMD)) and surrogate (Var(μ̂_φ,s)) partitioning objectives, for a range of feature dimensionalities d.
Figure 2: The relationship between the target (Var(R̂_s)) and surrogate (Var(R̂′_s)) partitioning objectives, for a range of feature dimensionalities d, and two distributions.
Figure 3: The performance characteristics of Algorithm 2.
Figure 4: Estimator variance vs minibatch size for Algorithm 1 (purple) vs three ablations.
Figure 5: Estimator variance (Var(μ̂_φ,s)) throughout training with VaRDASS for different values of T, also compared to uniform random sampling.
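The extraction around Figure 4 also shows the paper's Algorithm 2, a greedy incremental construction for dynamically weighted assignments. A minimal Python transcription of that pseudocode, assuming D is the n_s × k cost matrix it takes as input and taking the processing order of points as given:

```python
import numpy as np

def greedy_weighted_assignment(D):
    """Algorithm 2 (as shown alongside Figure 4): assign each point i to the
    cluster j minimising D[i, j] * (n_j + 1), where n_j is the interim size
    of cluster j, producing a one-hot assignment matrix U."""
    ns, k = D.shape
    U = np.zeros((ns, k), dtype=int)    # U[i, j] = 1 iff point i is in cluster j
    n = np.zeros(k, dtype=int)          # interim cluster sizes
    for i in range(ns):
        j = np.argmin(D[i] * (n + 1))   # cost weighted by prospective cluster size
        U[i, j] = 1
        n[j] += 1
    return U
```

The (n_j + 1) factor penalizes clusters as they fill up, which appears to be what "dynamically weighted" refers to.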
original abstract

Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures -- correlation alignment and the maximum mean discrepancy (MMD) -- and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on four domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VaRDASS, a stochastic variance reduction technique for unsupervised domain adaptation (UDA) based on stratified sampling. It derives ad hoc stratification objectives for correlation alignment and maximum mean discrepancy (MMD), provides expected and worst-case error bounds, proves that the MMD objective is theoretically optimal (minimizing variance) under certain assumptions, presents a practical k-means-style optimization algorithm, and reports empirical gains in discrepancy estimation accuracy and target-domain performance on four domain-shift datasets.

Significance. If the optimality result and error bounds hold under realistic conditions, the work offers a targeted variance-reduction approach for discrepancy-based UDA methods, which could stabilize training in mini-batch settings and improve generalization. The explicit optimality proof for the MMD stratification objective (when assumptions are met) and the provision of both theoretical bounds and a practical algorithm are notable strengths.

major comments (2)
  1. [§4, Theorem 3] Optimality proof: The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed/oracle strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.
  2. [§3.2] Ad hoc stratification objectives: The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.
minor comments (2)
  1. [§5] The k-means-style algorithm in §5 is described at a high level but lacks pseudocode or explicit complexity analysis, which would aid reproducibility.
  2. Notation for the strata and sampling weights is introduced inconsistently between the theoretical sections and the experimental setup; a single consolidated table of symbols would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions have been made to the manuscript.

point-by-point responses
  1. Referee: [§4, Theorem 3] Optimality proof: The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed/oracle strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.

    Authors: We thank the referee for this precise observation. Theorem 3 proves optimality of the proposed MMD objective under the explicit assumptions of oracle strata, kernel properties, and independence; these are stated in the theorem. The expected and worst-case error bounds derived earlier in §4 apply to stratified estimators for arbitrary strata and do not rely on optimality. The k-means procedure is presented as a practical heuristic to approximate good strata from data. In the revised manuscript we have added a dedicated paragraph after Theorem 3 that (i) reiterates the oracle-strata assumption, (ii) notes that k-means provides an empirical approximation whose quality depends on sample size and domain shift, and (iii) clarifies that the reported empirical gains are observed directly and do not rest on the theoretical optimality transferring exactly. We believe this makes the scope of the guarantee transparent. revision: yes

  2. Referee: [§3.2] Ad hoc stratification objectives: The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.

    Authors: We agree that the presentation is asymmetric. The CORAL stratification objective was obtained by minimising an upper bound on the variance of the correlation-alignment estimator, analogous to the derivation for MMD, but we did not establish a formal optimality result. The paper’s unified claim is that stratified sampling can be instantiated for both discrepancy measures and yields measurable variance reduction in practice; it does not assert identical theoretical optimality for both. In the revision we have (i) rephrased the introductory paragraph of §3.2 to avoid any implication of symmetric optimality and (ii) inserted a short paragraph providing an explicit variance expression for the CORAL stratified estimator. A complete optimality proof for CORAL remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity: optimality proof and bounds are independent of fitted inputs

full rationale

The paper derives stratification objectives for MMD and correlation alignment, states expected/worst-case error bounds, and proves the MMD objective minimizes variance under explicit assumptions. These steps are presented as first-principles derivations rather than reductions to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. The subsequent k-means algorithm is a practical implementation analyzed separately from the theoretical optimality result. Based on the abstract and available context, no load-bearing step reduces by construction to its own inputs; the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions for the theoretical optimality result and introduces ad hoc stratification objectives as part of the method development.

axioms (1)
  • domain assumption: certain assumptions under which the proposed MMD objective minimises the variance.
    Invoked in the abstract to support the theoretical optimality proof for the MMD stratification objective.

pith-pipeline@v0.9.0 · 5444 in / 1278 out tokens · 65132 ms · 2026-05-17T00:52:35.848031+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 14 internal anchors
