Pith · machine review for the scientific record

arxiv: 2605.14280 · v1 · submitted 2026-05-14 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

TILT: Target-induced loss tilting under covariate shift

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: domain adaptation · covariate shift · importance weighting · unsupervised domain adaptation · ReLU networks · oracle inequality

The pith

The target-side penalty on an auxiliary predictor component induces implicit relative importance weighting that stays bounded even with disjoint supports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TILT decomposes the predictor into a main part f and an auxiliary part b. It fits the sum f + b on labeled source data while penalizing b alone on unlabeled target data, and the resulting f is then deployed as the target-domain predictor. This setup is shown to implicitly perform a form of importance weighting localized to the current error, without requiring overlapping supports or density estimates. A finite-sample oracle inequality and an end-to-end guarantee for sparse ReLU networks follow from the same analysis.
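
A minimal sketch of how that objective could look in code, assuming squared loss and a squared penalty on b over target inputs; the two-layer networks, optimizer, and the helper names mlp and tilt_loss are illustrative choices, not the paper's implementation.

```python
# Sketch of a TILT-style objective (assumed squared loss and squared target penalty):
# fit f + b on labeled source data, penalize b alone on unlabeled target inputs,
# then deploy f as the target predictor. Architectures and hyperparameters are illustrative.
import torch
import torch.nn as nn

def mlp(d_in, width=64):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))

def tilt_loss(f, b, x_src, y_src, x_tgt, lam):
    source_fit = ((f(x_src) + b(x_src) - y_src) ** 2).mean()  # fit f + b on source
    target_penalty = (b(x_tgt) ** 2).mean()                   # penalize b on target inputs
    return source_fit + lam * target_penalty

# toy usage with synthetic shapes
d = 5
f, b = mlp(d), mlp(d)
x_src, y_src = torch.randn(128, d), torch.randn(128, 1)
x_tgt = torch.randn(256, d) + 2.0            # shifted, unlabeled target inputs
opt = torch.optim.Adam(list(f.parameters()) + list(b.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = tilt_loss(f, b, x_src, y_src, x_tgt, lam=0.5)
    loss.backward()
    opt.step()
# f alone is the deployed target predictor; b is discarded after training.
```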

Core claim

At the population level, the target-side penalty on b implicitly induces relative importance weighting in terms of an estimand b*_f that is self-localized to the current error and remains uniformly bounded for any source-target pair, even those with disjoint supports. A general finite-sample oracle inequality holds and yields an end-to-end guarantee for sparse ReLU networks.
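
One way to make that claim concrete, assuming squared loss for both the source fit and the target penalty (the exact objective is an assumption here; the weight v_λ matches the quantity named in the Figure 1 caption below):

```latex
b^*_f \;=\; \arg\min_{b}\;\; \mathbb{E}_{P}\!\left[(Y - f(X) - b(X))^{2}\right] \;+\; \lambda\,\mathbb{E}_{Q}\!\left[b(X)^{2}\right]
\quad\Longrightarrow\quad
b^*_f(x) \;=\; \frac{p(x)}{p(x) + \lambda q(x)}\,\bigl(f^{*}(x) - f(x)\bigr) \;=\; v_\lambda(x)\,\bigl(f^{*}(x) - f(x)\bigr)
```

Here f*(x) = E[Y | X = x] is the true regression function, p and q are the source and target input densities, and the second expression comes from minimizing pointwise in x. Under this reading, v_λ(x) = p(x)/(p(x) + λq(x)) lies in [0, 1] for every source-target pair, including ones with disjoint supports, so |b*_f(x)| ≤ |f*(x) − f(x)|: the weight is bounded uniformly over distribution pairs, while the offset itself is localized to, and scales with, the current error of f.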

What carries the argument

Decomposition of the source predictor as f + b with a penalty applied to b on target inputs.

If this is right

  • The fitted f serves as an effective target predictor, improving over source-only training.
  • Performance gains hold over exact importance weighting and density-ratio baselines.
  • The approach gives guarantees for training sparse ReLU networks under covariate shift.
  • A finite-sample oracle inequality on the excess risk is established.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could apply to other base learners beyond neural networks.
  • Self-localization of the weighting might allow better handling of varying shift severity.
  • Extensions to semi-supervised settings where some target labels are available could be explored.

Load-bearing premise

The analysis assumes that penalizing the auxiliary component b on target data produces a useful implicit weighting for the main predictor f without needing support overlap or explicit density estimation.

What would settle it

Observing that the induced weighting becomes unbounded or that target performance does not improve in settings with disjoint supports would challenge the central claim.
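
A minimal numerical probe of the boundedness half of that test, assuming the squared-loss population form sketched under the core claim above; the Gaussian densities, the amount of shift, the illustrative f and f*, and λ = 0.5 are all assumptions, not the paper's experiment.

```python
# Evaluate the closed-form population estimand b*_f(x) = v_lambda(x) * (f*(x) - f(x))
# on a grid, for nearly disjoint Gaussian source/target supports, and check whether
# the induced weight and the offset stay bounded. All choices here are illustrative.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 10, 2001)
p = norm.pdf(x, loc=0.0, scale=1.0)       # source input density
q = norm.pdf(x, loc=6.0, scale=1.0)       # target input density, nearly disjoint support
lam = 0.5

f_star = np.sin(x)                        # true regression function (illustrative)
f_hat = np.zeros_like(x)                  # a crude current fit
v = p / (p + lam * q)                     # TILT-style weight, always in [0, 1]
w = q / (p + lam * q)                     # lambda-smoothed density ratio (spiky for small lam)
b_star = v * (f_star - f_hat)             # self-localized offset

print("max v:", v.max(), " max w:", w.max())
print("max |b*_f|:", np.abs(b_star).max(), " max |f*-f|:", np.abs(f_star - f_hat).max())
# If max |b*_f| exceeded the current error max |f* - f|, or v left [0, 1],
# the bounded-weighting reading sketched above would be in trouble.
```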

Figures

Figures reproduced from arXiv: 2605.14280 by Kakei Yamamoto, Martin J. Wainwright.

Figure 1
Left panel: the λ-smoothed density ratio w_λ(x) = q(x)/(p(x) + λq(x)) is very spiky, compared to the relatively flat and well-behaved TILT weight v_λ(x) = p(x)/(p(x) + λq(x)). Right panel: plots of the optimal offset function b*_f for three different choices of f: linear, quadratic, and degree-six polynomial approximations to f*. The optimal offset is much smaller and smoother than w_λ; note the different scale of…
Figure 2
Compares the resulting target-test MSEs after target-validation tuning. At zero shift, source ERM, exact IW, and TILT essentially coincide, as they should in the matched distribution. As the shift increases, exact IW deteriorates rapidly because the ordinary density ratio becomes increasingly variable. TILT remains the most stable method across the positive shift levels and gives the clearest gains as the…
Figure 3
Point-mass nonparametric rate. Left: oracle-tuned TILT under P_L = L⁻¹·Unif[0, 1] + (1 − L⁻¹)·δ₀ for a β = 2 sine-series regression function. Right: source ERM on the same n/L axis ascends as L increases. The dotted line has slope (n/L)^(−4/5).
Figure 4
Plots the target-test cross-entropy of KL-TILT as a function of λ at small and large target shifts. The curves are relatively flat across a broad intermediate range of λ in both regimes, showing that the method is not sensitive to precise tuning within that range. At the same time, performance degrades or becomes numerically unstable when λ is made extremely small or extremely large, matching the…
Figure 5
Regularization sensitivity in the one-dimensional synthetic regression experiment. Each panel fixes a source corruption level and plots target-test MSE as a function of λ. Curves show means over trials and shaded bands show interquartile ranges. The TILT and exact RuLSIF curves have visibly different λ dependence, and the favorable range for TILT changes with the source corruption level…
Figure 6
Auxiliary-component diagnostics for the synthetic regression experiment, for the same one-dimensional synthetic problem as in…
Figure 7
Finite-linear λ sweep under bounded density ratio. The four panels reproduce the first four diagnostics from the dimension sweep, except that the second panel reports (1 + λ)E_λ² rather than the raw E_λ². When df = 20, the task is effectively well specified for the deployed class. In this case, a smaller auxiliary class keeps the small-λ solution closer to source ERM, whereas a rich auxiliary class can…
Figure 8
Well-specified beta-product neural control. Target MSE is reported for a 16-dimensional beta-product covariate-shift problem with a fixed ReLU-teacher regression function. Unlike the high-dimensional weak-class experiment in…
Figure 9
Top-5 accuracy on the CIFAR-100 covariate-shift experiment. This figure complements…
Figure 10
Target-test cross-entropy under single-corruption CIFAR-100 shifts. This figure repeats the CIFAR-100 distillation experiment while varying the target corruption type one at a time. The top grid visualizes the same CIFAR-100 test image under Gaussian blur, defocus blur, contrast, and pixelate corruptions; columns correspond to the severity values used in the sweep. The bottom panels report target-test cross-entropy…
read the original abstract

We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as $f+b$, fits $f+b$ on labeled source data while simultaneously penalizing the auxiliary component $b$ on unlabeled target inputs. The resulting fit $f$ is deployed as the final target predictor. At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand $b^*_f$ that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. The method decomposes the predictor as f + b, fits f + b to labeled source data while applying a penalty to the auxiliary b on unlabeled target inputs, and deploys the resulting f as the target predictor. At the population level, the target penalty is shown to induce relative importance weighting via a self-localized estimand b*_f that is claimed to remain uniformly bounded for arbitrary source-target pairs (including disjoint supports). A general finite-sample oracle inequality is proved and specialized to yield end-to-end excess-risk guarantees for sparse ReLU networks. Experiments on synthetic regression and shifted CIFAR-100 distillation demonstrate improved target performance relative to source-only training, exact importance weighting, and density-ratio baselines, with stable behavior in the regularization parameter.

Significance. If the uniform boundedness of b*_f and the oracle inequality hold, TILT supplies a theoretically grounded alternative to explicit density-ratio estimation that does not require overlapping supports. The end-to-end sparse-ReLU guarantee would be a concrete advance for neural-network domain adaptation, and the empirical gains over standard baselines indicate practical utility when the regularization parameter is chosen reasonably.

major comments (1)
  1. Abstract and population-level analysis (the claim that b*_f is uniformly bounded for any source-target pair, including disjoint supports): the finite-sample oracle inequality and the subsequent sparse-ReLU guarantee rely on constants controlled by this bound. When supports are disjoint the population objective decouples; for any function class rich enough to approximate the residual (including the sparse ReLU networks used in the end-to-end result), the minimizer can drive the source term to zero by setting b ≈ y − f on the source support while b ≈ 0 on the target support. Consequently |b*_f| scales with the size of the residual of f and is not uniformly bounded independently of f. This appears to threaten the claimed bound and the validity of the oracle inequality constants. Please supply the precise derivation establishing uniform boundedness or state any additional assumptions that prevent this decoupling.
minor comments (2)
  1. Notation: the distinction between the population b*_f and its finite-sample estimator is not always explicit in the experimental section; adding a short clarifying sentence would improve readability.
  2. Experiments: the CIFAR-100 distillation protocol would benefit from an explicit statement of the number of random seeds and whether the reported improvements are statistically significant under a paired test.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's detailed feedback on the population-level analysis. We address the concern regarding the uniform boundedness of b*_f below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [—] Abstract and population-level analysis (the claim that b*_f is uniformly bounded for any source-target pair, including disjoint supports): the finite-sample oracle inequality and the subsequent sparse-ReLU guarantee rely on constants controlled by this bound. When supports are disjoint the population objective decouples; for any function class rich enough to approximate the residual (including the sparse ReLU networks used in the end-to-end result), the minimizer can drive the source term to zero by setting b ≈ y − f on the source support while b ≈ 0 on the target support. Consequently |b*_f| scales with the size of the residual of f and is not uniformly bounded independently of f. This appears to threaten the claimed bound and the validity of the oracle inequality constants. Please supply the precise derivation establishing uniform boundedness or state any additional assumptions that prevent this decoupling.

    Authors: We thank the referee for this insightful observation. Upon closer examination, the decoupling does occur for disjoint supports when the function class is rich enough to fit the residual on the source support independently. The self-localized nature of b*_f means it approximates the residual only where needed, but the magnitude is indeed tied to the current error of f. The original claim of uniform boundedness independent of f was overstated. We will revise the manuscript to remove the claim of uniform boundedness for arbitrary f and instead derive a bound that depends on the excess risk of f or assume bounded residuals (e.g., via bounded labels and Lipschitz losses). This will adjust the constants in the oracle inequality to be explicit in terms of the approximation quality. The end-to-end guarantee for sparse ReLU networks will be updated to reflect this dependence, which is common in oracle inequalities. We believe this clarifies the analysis without invalidating the core contribution. Revision will be made in the population analysis section and the abstract. revision: yes
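
Under the same squared-loss population form sketched above, the decoupling that the referee and the rebuttal both describe can be written out explicitly; this is a hedged reconstruction, not a passage from the paper. With disjoint supports, the pointwise minimizer splits by region:

```latex
x \in \operatorname{supp}(P):\;\; q(x) = 0 \;\Rightarrow\; b^*_f(x) = f^{*}(x) - f(x),
\qquad
x \in \operatorname{supp}(Q):\;\; p(x) = 0 \;\Rightarrow\; b^*_f(x) = 0
```

So the supremum of |b*_f| over the source support equals the supremum of the residual f* − f there: bounded whenever that residual is bounded, but not by a constant free of f, which is exactly the dependence the rebuttal proposes to make explicit in the revised constants.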

Circularity Check

1 step flagged

Self-referential b*_f makes induced weighting definitional by construction

specific steps
  1. self definitional [Abstract]
    "At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand b^*_f that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports)."

    The claimed induction occurs 'in terms of' b*_f, where b*_f is the optimal b for the fixed f in the objective that decomposes the predictor as f + b and penalizes b on target data. The weighting effect is therefore equivalent to the definition of b*_f as the argmin over b, rendering the population result self-definitional rather than a non-tautological first-principles derivation.

full rationale

The paper's population-level claim identifies the target penalty as inducing relative importance weighting via the estimand b*_f. However, b*_f is defined directly as the auxiliary minimizer for fixed f in the decomposed objective, so the induction reduces to the objective's own construction rather than an independent derivation. The subsequent oracle inequality and sparse ReLU guarantees build on this with additional analysis and do not collapse entirely, yielding moderate circularity. No fitted-input predictions, self-citation chains, or uniqueness theorems are load-bearing.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard covariate-shift assumptions plus the existence of a useful decomposition into f and b; the regularization parameter that controls the strength of the b penalty is a free parameter whose value affects performance.

free parameters (1)
  • regularization parameter lambda
    Controls the strength of the penalty on b evaluated on target inputs; its value must be chosen and affects the induced weighting.
axioms (1)
  • domain assumption Covariate shift: the conditional distribution of labels given inputs is the same in source and target, only the marginal input distribution changes.
    Invoked implicitly when claiming the method works under covariate shift.
invented entities (1)
  • auxiliary component b (no independent evidence)
    purpose: Temporary part of the predictor that is penalized on target data to induce weighting on f.
    Newly introduced decomposition; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5484 in / 1550 out tokens · 130282 ms · 2026-05-15T02:25:24.880973+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

144 extracted references · 144 canonical work pages · 4 internal anchors
