Pith · machine review for the scientific record

arxiv: 2605.14280 · v1 · submitted 2026-05-14 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

TILT: Target-induced loss tilting under covariate shift

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: domain adaptation · covariate shift · importance weighting · unsupervised domain adaptation · ReLU networks · oracle inequality

The pith

The target-side penalty on an auxiliary predictor component induces implicit relative importance weighting that stays bounded even with disjoint supports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TILT decomposes the predictor into a main part f and an auxiliary part b. It fits the sum f + b on labeled source data while penalizing b alone on unlabeled target data, and the resulting f is then deployed as the target-domain predictor. This setup is shown to implicitly perform a form of importance weighting localized to the current error, without requiring overlapping supports or density estimates. A finite-sample oracle inequality and an end-to-end guarantee for sparse ReLU networks follow from the same analysis.
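
A minimal sketch of how that objective could look in code, assuming squared loss and a squared penalty on b over target inputs; the two-layer networks, optimizer, and the helper names mlp and tilt_loss are illustrative choices, not the paper's implementation.

```python
# Sketch of a TILT-style objective (assumed squared loss and squared target penalty):
# fit f + b on labeled source data, penalize b alone on unlabeled target inputs,
# then deploy f as the target predictor. Architectures and hyperparameters are illustrative.
import torch
import torch.nn as nn

def mlp(d_in, width=64):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))

def tilt_loss(f, b, x_src, y_src, x_tgt, lam):
    source_fit = ((f(x_src) + b(x_src) - y_src) ** 2).mean()  # fit f + b on source
    target_penalty = (b(x_tgt) ** 2).mean()                   # penalize b on target inputs
    return source_fit + lam * target_penalty

# toy usage with synthetic shapes
d = 5
f, b = mlp(d), mlp(d)
x_src, y_src = torch.randn(128, d), torch.randn(128, 1)
x_tgt = torch.randn(256, d) + 2.0            # shifted, unlabeled target inputs
opt = torch.optim.Adam(list(f.parameters()) + list(b.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = tilt_loss(f, b, x_src, y_src, x_tgt, lam=0.5)
    loss.backward()
    opt.step()
# f alone is the deployed target predictor; b is discarded after training.
```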

Core claim

At the population level, the target-side penalty on b implicitly induces relative importance weighting in terms of an estimand b*_f that is self-localized to the current error and remains uniformly bounded for any source-target pair, even those with disjoint supports. A general finite-sample oracle inequality holds and yields an end-to-end guarantee for sparse ReLU networks.
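
One way to make that claim concrete, assuming squared loss for both the source fit and the target penalty (the exact objective is an assumption here; the weight v_λ matches the quantity named in the Figure 1 caption below):

```latex
b^*_f \;=\; \arg\min_{b}\;\; \mathbb{E}_{P}\!\left[(Y - f(X) - b(X))^{2}\right] \;+\; \lambda\,\mathbb{E}_{Q}\!\left[b(X)^{2}\right]
\quad\Longrightarrow\quad
b^*_f(x) \;=\; \frac{p(x)}{p(x) + \lambda q(x)}\,\bigl(f^{*}(x) - f(x)\bigr) \;=\; v_\lambda(x)\,\bigl(f^{*}(x) - f(x)\bigr)
```

Here f*(x) = E[Y | X = x] is the true regression function, p and q are the source and target input densities, and the second expression comes from minimizing pointwise in x. Under this reading, v_λ(x) = p(x)/(p(x) + λq(x)) lies in [0, 1] for every source-target pair, including ones with disjoint supports, so |b*_f(x)| ≤ |f*(x) − f(x)|: the weight is bounded uniformly over distribution pairs, while the offset itself is localized to, and scales with, the current error of f.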

What carries the argument

Decomposition of the source predictor as f + b with a penalty applied to b on target inputs.

If this is right

  • The fitted f serves as an effective target predictor, improving over source-only training.
  • Performance gains hold over exact importance weighting and density-ratio baselines.
  • The approach gives guarantees for training sparse ReLU networks under covariate shift.
  • A finite-sample oracle inequality on the excess risk is established.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could apply to other base learners beyond neural networks.
  • Self-localization of the weighting might allow better handling of varying shift severity.
  • Extensions to semi-supervised settings where some target labels are available could be explored.

Load-bearing premise

The analysis assumes that penalizing the auxiliary component b on target data produces a useful implicit weighting for the main predictor f without needing support overlap or explicit density estimation.

What would settle it

Observing that the induced weighting becomes unbounded or that target performance does not improve in settings with disjoint supports would challenge the central claim.
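
A minimal numerical probe of the boundedness half of that test, assuming the squared-loss population form sketched under the core claim above; the Gaussian densities, the amount of shift, the illustrative f and f*, and λ = 0.5 are all assumptions, not the paper's experiment.

```python
# Evaluate the closed-form population estimand b*_f(x) = v_lambda(x) * (f*(x) - f(x))
# on a grid, for nearly disjoint Gaussian source/target supports, and check whether
# the induced weight and the offset stay bounded. All choices here are illustrative.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 10, 2001)
p = norm.pdf(x, loc=0.0, scale=1.0)       # source input density
q = norm.pdf(x, loc=6.0, scale=1.0)       # target input density, nearly disjoint support
lam = 0.5

f_star = np.sin(x)                        # true regression function (illustrative)
f_hat = np.zeros_like(x)                  # a crude current fit
v = p / (p + lam * q)                     # TILT-style weight, always in [0, 1]
w = q / (p + lam * q)                     # lambda-smoothed density ratio (spiky for small lam)
b_star = v * (f_star - f_hat)             # self-localized offset

print("max v:", v.max(), " max w:", w.max())
print("max |b*_f|:", np.abs(b_star).max(), " max |f*-f|:", np.abs(f_star - f_hat).max())
# If max |b*_f| exceeded the current error max |f* - f|, or v left [0, 1],
# the bounded-weighting reading sketched above would be in trouble.
```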

Figures

Figures reproduced from arXiv: 2605.14280 by Kakei Yamamoto, Martin J. Wainwright.

Figure 1
Left panel: the λ-smoothed density ratio w_λ(x) = q(x)/(p(x) + λq(x)) is very spiky, compared to the relatively flat and well-behaved TILT weight v_λ(x) = p(x)/(p(x) + λq(x)). Right panel: plots of the optimal offset function b*_f for three different choices of f: linear, quadratic, and degree-six polynomial approximations to f*. The optimal offset is much smaller and smoother than w_λ; note the different scale of…
Figure 2
Compares the resulting target-test MSEs after target-validation tuning. At zero shift, source ERM, exact IW, and TILT essentially coincide, as they should in the matched distribution. As the shift increases, exact IW deteriorates rapidly because the ordinary density ratio becomes increasingly variable. TILT remains the most stable method across the positive shift levels and gives the clearest gains as the…
Figure 3
Point-mass nonparametric rate. Left: oracle-tuned TILT under P_L = L⁻¹·Unif[0, 1] + (1 − L⁻¹)·δ₀ for a β = 2 sine-series regression function. Right: source ERM on the same n/L axis ascends as L increases. The dotted line has slope (n/L)^(−4/5).
Figure 4
Plots the target-test cross-entropy of KL-TILT as a function of λ at small and large target shifts. The curves are relatively flat across a broad intermediate range of λ in both regimes, showing that the method is not sensitive to precise tuning within that range. At the same time, performance degrades or becomes numerically unstable when λ is made extremely small or extremely large, matching the…
Figure 5
Regularization sensitivity in the one-dimensional synthetic regression experiment. Each panel fixes a source corruption level and plots target-test MSE as a function of λ. Curves show means over trials and shaded bands show interquartile ranges. The TILT and exact RuLSIF curves have visibly different λ dependence, and the favorable range for TILT changes with the source corruption level…
Figure 6
Auxiliary-component diagnostics for the synthetic regression experiment, for the same one-dimensional synthetic problem as in…
Figure 7
Finite-linear λ sweep under bounded density ratio. The four panels reproduce the first four diagnostics from the dimension sweep, except that the second panel reports (1 + λ)E_λ² rather than the raw E_λ². When df = 20, the task is effectively well specified for the deployed class. In this case, a smaller auxiliary class keeps the small-λ solution closer to source ERM, whereas a rich auxiliary class can…
Figure 8
Well-specified beta-product neural control. Target MSE is reported for a 16-dimensional beta-product covariate-shift problem with a fixed ReLU-teacher regression function. Unlike the high-dimensional weak-class experiment in…
Figure 9
Top-5 accuracy on the CIFAR-100 covariate-shift experiment. This figure complements…
Figure 10
Target-test cross-entropy under single-corruption CIFAR-100 shifts. This figure repeats the CIFAR-100 distillation experiment while varying the target corruption type one at a time. The top grid visualizes the same CIFAR-100 test image under Gaussian blur, defocus blur, contrast, and pixelate corruptions; columns correspond to the severity values used in the sweep. The bottom panels report target-test cross-entropy…
read the original abstract

We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as $f+b$, fits $f+b$ on labeled source data while simultaneously penalizing the auxiliary component $b$ on unlabeled target inputs. The resulting fit $f$ is deployed as the final target predictor. At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand $b^*_f$ that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. The method decomposes the predictor as f + b, fits f + b to labeled source data while applying a penalty to the auxiliary b on unlabeled target inputs, and deploys the resulting f as the target predictor. At the population level, the target penalty is shown to induce relative importance weighting via a self-localized estimand b*_f that is claimed to remain uniformly bounded for arbitrary source-target pairs (including disjoint supports). A general finite-sample oracle inequality is proved and specialized to yield end-to-end excess-risk guarantees for sparse ReLU networks. Experiments on synthetic regression and shifted CIFAR-100 distillation demonstrate improved target performance relative to source-only training, exact importance weighting, and density-ratio baselines, with stable behavior in the regularization parameter.

Significance. If the uniform boundedness of b*_f and the oracle inequality hold, TILT supplies a theoretically grounded alternative to explicit density-ratio estimation that does not require overlapping supports. The end-to-end sparse-ReLU guarantee would be a concrete advance for neural-network domain adaptation, and the empirical gains over standard baselines indicate practical utility when the regularization parameter is chosen reasonably.

major comments (1)
  1. Abstract and population-level analysis (the claim that b*_f is uniformly bounded for any source-target pair, including disjoint supports): the finite-sample oracle inequality and the subsequent sparse-ReLU guarantee rely on constants controlled by this bound. When supports are disjoint the population objective decouples; for any function class rich enough to approximate the residual (including the sparse ReLU networks used in the end-to-end result), the minimizer can drive the source term to zero by setting b ≈ y − f on the source support while b ≈ 0 on the target support. Consequently |b*_f| scales with the size of the residual of f and is not uniformly bounded independently of f. This appears to threaten the claimed bound and the validity of the oracle inequality constants. Please supply the precise derivation establishing uniform boundedness or state any additional assumptions that prevent this decoupling.
minor comments (2)
  1. Notation: the distinction between the population b*_f and its finite-sample estimator is not always explicit in the experimental section; adding a short clarifying sentence would improve readability.
  2. Experiments: the CIFAR-100 distillation protocol would benefit from an explicit statement of the number of random seeds and whether the reported improvements are statistically significant under a paired test.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's detailed feedback on the population-level analysis. We address the concern regarding the uniform boundedness of b*_f below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [—] Abstract and population-level analysis (the claim that b*_f is uniformly bounded for any source-target pair, including disjoint supports): the finite-sample oracle inequality and the subsequent sparse-ReLU guarantee rely on constants controlled by this bound. When supports are disjoint the population objective decouples; for any function class rich enough to approximate the residual (including the sparse ReLU networks used in the end-to-end result), the minimizer can drive the source term to zero by setting b ≈ y − f on the source support while b ≈ 0 on the target support. Consequently |b*_f| scales with the size of the residual of f and is not uniformly bounded independently of f. This appears to threaten the claimed bound and the validity of the oracle inequality constants. Please supply the precise derivation establishing uniform boundedness or state any additional assumptions that prevent this decoupling.

    Authors: We thank the referee for this insightful observation. Upon closer examination, the decoupling does occur for disjoint supports when the function class is rich enough to fit the residual on the source support independently. The self-localized nature of b*_f means it approximates the residual only where needed, but the magnitude is indeed tied to the current error of f. The original claim of uniform boundedness independent of f was overstated. We will revise the manuscript to remove the claim of uniform boundedness for arbitrary f and instead derive a bound that depends on the excess risk of f or assume bounded residuals (e.g., via bounded labels and Lipschitz losses). This will adjust the constants in the oracle inequality to be explicit in terms of the approximation quality. The end-to-end guarantee for sparse ReLU networks will be updated to reflect this dependence, which is common in oracle inequalities. We believe this clarifies the analysis without invalidating the core contribution. Revision will be made in the population analysis section and the abstract. revision: yes
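
Under the same squared-loss population form sketched above, the decoupling that the referee and the rebuttal both describe can be written out explicitly; this is a hedged reconstruction, not a passage from the paper. With disjoint supports, the pointwise minimizer splits by region:

```latex
x \in \operatorname{supp}(P):\;\; q(x) = 0 \;\Rightarrow\; b^*_f(x) = f^{*}(x) - f(x),
\qquad
x \in \operatorname{supp}(Q):\;\; p(x) = 0 \;\Rightarrow\; b^*_f(x) = 0
```

So the supremum of |b*_f| over the source support equals the supremum of the residual f* − f there: bounded whenever that residual is bounded, but not by a constant free of f, which is exactly the dependence the rebuttal proposes to make explicit in the revised constants.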

Circularity Check

1 step flagged

Self-referential b*_f makes induced weighting definitional by construction

specific steps
  1. self definitional [Abstract]
    "At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand b^*_f that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports)."

    The claimed induction occurs 'in terms of' b*_f, where b*_f is the optimal b for the fixed f in the objective that decomposes the predictor as f + b and penalizes b on target data. The weighting effect is therefore equivalent to the definition of b*_f as the argmin over b, rendering the population result self-definitional rather than a non-tautological first-principles derivation.

full rationale

The paper's population-level claim identifies the target penalty as inducing relative importance weighting via the estimand b*_f. However, b*_f is defined directly as the auxiliary minimizer for fixed f in the decomposed objective, so the induction reduces to the objective's own construction rather than an independent derivation. The subsequent oracle inequality and sparse ReLU guarantees build on this with additional analysis and do not collapse entirely, yielding moderate circularity. No fitted-input predictions, self-citation chains, or uniqueness theorems are load-bearing.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard covariate-shift assumptions plus the existence of a useful decomposition into f and b; the regularization parameter that controls the strength of the b penalty is a free parameter whose value affects performance.

free parameters (1)
  • regularization parameter lambda
    Controls the strength of the penalty on b evaluated on target inputs; its value must be chosen and affects the induced weighting.
axioms (1)
  • domain assumption Covariate shift: the conditional distribution of labels given inputs is the same in source and target, only the marginal input distribution changes.
    Invoked implicitly when claiming the method works under covariate shift.
invented entities (1)
  • auxiliary component b (no independent evidence)
    purpose: Temporary part of the predictor that is penalized on target data to induce weighting on f.
    Newly introduced decomposition; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5484 in / 1550 out tokens · 130282 ms · 2026-05-15T02:25:24.880973+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

144 extracted references · 144 canonical work pages · 4 internal anchors
