Anytime PAC-Bayes for Constrained Density-Ratio Networks under Covariate Shift

Paulo Akira F. Enabe

arxiv: 2605.17212 · v2 · pith:PZXJZY6Nnew · submitted 2026-05-17 · 💻 cs.LG

Anytime PAC-Bayes for Constrained Density-Ratio Networks under Covariate Shift

Paulo Akira F. Enabe This is my paper

Pith reviewed 2026-05-25 05:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords covariate shiftdensity ratio estimationPAC-Bayes boundsimportance weightingRadon-Nikodym derivativeaugmented Lagrangiananytime generalization

0 comments

The pith

A constrained density-ratio network approximates dP/dQ to feed anytime PAC-Bayes certificates under covariate shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified method for learning under covariate shift by training a density-ratio network subject to normalization and moment-matching constraints that approximate the true Radon-Nikodym derivative r* = dP/dQ. This learned ratio is used to importance-weight the source risk, after which an anytime PAC-Bayes bound supplies a uniform generalization certificate on the target risk. Experiments on analytic ground truth and real data show the network yields calibrated ratios, lowers target 0/1 loss relative to unweighted and baseline estimators, and produces coverage failures only when label shift appears.

Core claim

A constrained density-ratio network approximates the Radon-Nikodym derivative r* = dP/dQ and supplies the weights for an anytime PAC-Bayes generalization certificate that decomposes target risk into a ratio-bias term controlled by L2(Q) error and a generalization-gap term controlled by weighted-loss variability.

What carries the argument

The constrained density-ratio network that enforces normalization and moment-matching identities as hard integral constraints through an augmented-Lagrangian scheme, combined with geometric peeling to strengthen fixed-time Bernoulli-KL PAC-Bayes bounds into an anytime certificate uniform in t.

If this is right

The learned ratio reduces target 0/1 loss compared with unweighted ERM and classical direct ratio estimators.
The certificate remains valid uniformly for all t greater than or equal to a minimum time and quantifies stability under L2(Q) perturbations of the ratio.
The network-weighted Gibbs posterior is the unique KL-regularized minimizer of the importance-weighted risk.
A single fixed-time coverage failure aligns one-to-one with label-shift magnitude, confirming the covariate-only modeling choice is operationally tight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the L2(Q) approximation error of the ratio can be bounded independently, the risk decomposition cleanly isolates estimation bias from generalization error.
The same constrained-network construction could be tested on problems where mild label shift is known to be present to measure the degradation of the certificate.
Replacing the augmented-Lagrangian constraint enforcement with a softer penalty might trade calibration for faster training while preserving the anytime guarantee.

Load-bearing premise

The distribution shift is purely covariate-based with no label shift.

What would settle it

A PAC-Bayes coverage failure occurs precisely when label shift is introduced while covariate shift remains controlled.

Figures

Figures reproduced from arXiv: 2605.17212 by Paulo Akira F. Enabe.

**Figure 1.** Figure 1: Pointwise ratio-fit error ∥rθ − r ∗ µ∥L2(Q) across the constrained DRN configurations of S1, S2, and S3 at µ = 0.5. The dashed line marks the pre-registered threshold τL2 = 0.05. Each added constraint reduces the error monotonically, from 0.127 at S1 to 0.094 at S2 and 0.080 at S3. The remaining gap at S3 is a paper-level training-recipe limitation rather than a method-level defect. the failure direction a… view at source ↗

**Figure 2.** Figure 2: Empirical ESS fraction ess/n as a function of the shift magnitude µ ∈ {0.5, 1.5, 2.0}, for the three tail-control variants of S4. The dashed line marks the pre-registered floor 0.2 · e −µ 2 used to define the stress-regime ESS criterion. Clipping recovers ESS most cleanly under stress, while tempering on the deployed raw ratio does not improve ESS once constraints are restricted to the raw ratio in accorda… view at source ↗

**Figure 3.** Figure 3: Empirical weighted risk Rbθ t (ha) against the analytic target risk RPµ (ha) = (1 − a) 2 (1 + µ 2 ) over 100 runs (10 seeds × 2 shifts × 5 predictors a ∈ {−1, −0.5, 0, 0.5, 1}). The reference line y = x is shown. Points are colored by shift magnitude. The pre-registered band |Rbθ t (ha) − RPµ (ha)| ≤ k(µ) σMC with k(0.5) = 3 and k(1.5) = 4 is met by all 100 runs. that capacity, training horizon, and parame… view at source ↗

**Figure 4.** Figure 4: PAC-Bayes certificate diagnostics on the oracle-ratio sanity case. [PITH_FULL_IMAGE:figures/full_fig_p039_4.png] view at source ↗

**Figure 5.** Figure 5: Per-pair relative improvement of DRN-weighted empirical risk minimization over the unweighted baseline on [PITH_FULL_IMAGE:figures/full_fig_p041_5.png] view at source ↗

**Figure 6.** Figure 6: Median target-test 0/1 loss over S = 10 seeds on each of the six splits, comparing the best-per-split DRN variant against KLIEP, uLSIF, and the discriminator-based ratio. Error bars are 25-75% inter-quartile ranges over seeds. The DRN clears the 1.02× proportional threshold against KLIEP and uLSIF on five of six splits. The fixed-time PAC-Bayes layer at T5 is the only stage of the campaign where a pre-regi… view at source ↗

**Figure 7.** Figure 7: Per-split fixed-time PAC-Bayes Bernoulli-KL bound (blue, median and inter-quartile range over [PITH_FULL_IMAGE:figures/full_fig_p043_7.png] view at source ↗

**Figure 8.** Figure 8: Anytime PAC-Bayes bound under geometric peeling (blue, median over [PITH_FULL_IMAGE:figures/full_fig_p044_8.png] view at source ↗

read the original abstract

A unified framework for learning under covariate shift is presented, in which a constrained density-ratio network approximates the Radon-Nikodym derivative $r^\star = dP/dQ$ and feeds an anytime PAC-Bayes generalization certificate. A change-of-measure identity decomposes the gap between target risk and importance-weighted source risk into a ratio-bias term governed by $\|r_\theta - r^\star\|_{L^2(Q)}$ and a generalization-gap term governed by the variability of the weighted loss. Normalization and moment-matching identities are enforced as hard integral constraints through an augmented-Lagrangian scheme, with a second-moment penalty controlling the effective sample size. PAC-Bayes is instantiated on the weighted risk in a fixed-time regime that yields Bernoulli-KL bounds, identifies the network-weighted Gibbs posterior as the unique KL-regularized minimizer, and quantifies stability under $L^2(Q)$ perturbations of the learned ratio, and is then strengthened by geometric peeling to an anytime certificate uniform in $t \geq t_{\min}$. A pre-registered two-campaign protocol combining a patch test against analytic ground truth with a real-data deployment validates the framework: the network produces calibrated ratios, reduces target $0/1$ loss against unweighted ERM and classical direct ratio-estimation baselines, and attains the anytime certificate. A single fixed-time coverage failure is recorded, with per-split coverage aligning one-to-one with the magnitude of the label shift, confirming that the covariate-only assumption is operationally tight rather than a defect of the certificate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper integrates constrained density-ratio networks with anytime PAC-Bayes for covariate shift, but the Bernoulli-KL application to weighted losses needs explicit range justification.

read the letter

The one thing to know is that the paper builds a framework linking constrained density ratio networks to anytime PAC-Bayes bounds for covariate shift settings. It enforces the ratio constraints hard via augmented Lagrangian and uses geometric peeling to get uniform-in-time certificates. The new part is the specific combination of hard integral constraints, the L2 bias decomposition, and the extension to anytime via peeling, along with the network-weighted Gibbs posterior. The experiments are a plus because they are pre-registered and include both analytic validation and real data, with the coverage failure used to confirm the assumption rather than hide it. The soft spot is the one raised in the stress test. Bernoulli-KL bounds require the loss to be in [0,1], but the importance weighted version is r_θ times the original loss, and the constraints on expectation and second moment do not prevent r_θ from being larger than 1 on some points. Without an additional argument or modification mentioned in the abstract, this could invalidate the direct application. If the full paper has a fix, that would be good to see, but it looks like a gap that needs checking. This paper is for specialists in PAC-Bayes and domain adaptation who care about certificates. It is worth a serious referee to examine the derivations and whether the bound holds. Recommendation is to send it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims a unified framework for learning under covariate shift: a constrained density-ratio network approximates r^*=dP/dQ via augmented-Lagrangian enforcement of E_Q[r_θ]=1 and moment-matching plus a second-moment penalty; a change-of-measure identity decomposes target risk minus weighted source risk into an L²(Q) ratio-bias term and a generalization gap; PAC-Bayes is applied to the weighted risk to obtain Bernoulli-KL bounds on a network-weighted Gibbs posterior, with stability under L² perturbations and geometric peeling to an anytime certificate uniform in t; pre-registered experiments (patch test on analytic ground truth plus real-data deployment) show calibrated ratios, reduced 0/1 loss versus baselines, and coverage that aligns with label-shift magnitude.

Significance. If the derivations hold, the work would supply a practical route to anytime-valid certificates under pure covariate shift by tightly coupling constrained ratio estimation with PAC-Bayes, including an explicit decomposition of bias and generalization terms and a pre-registered validation protocol. The geometric-peeling extension to anytime bounds and the explicit stability quantification under L²(Q) perturbations of the learned ratio are potentially useful technical contributions.

major comments (2)

[Abstract] Abstract (PAC-Bayes instantiation paragraph): the claim that the framework 'yields Bernoulli-KL bounds' on the weighted risk presupposes that the importance-weighted loss r_θ·ℓ (ℓ the 0/1 loss) takes values in [0,1]. The only controls stated are the hard integral constraints E_Q[r_θ]=1, moment matching, and a second-moment penalty; these do not imply an almost-sure upper bound on r_θ, so r_θ·ℓ ranges in [0,r_θ] and the standard Bernoulli-KL derivation does not apply without an additional range argument or rescaling.
[Abstract] Abstract (change-of-measure identity and weighted Gibbs posterior): the decomposition of target risk minus weighted source risk into an L²(Q) ratio-bias term plus generalization gap, followed by identification of the network-weighted Gibbs posterior as the unique KL-regularized minimizer, is load-bearing for the certificate; without the range control above, the subsequent stability claim under L²(Q) perturbations of the learned ratio cannot be instantiated with the stated Bernoulli-KL form.

minor comments (2)

[Abstract] The abstract refers to 'geometric peeling' for the anytime certificate but does not indicate the precise peeling schedule or the dependence on t_min; a short clarifying sentence would help readers locate the argument in the main text.
[Abstract] The single coverage failure is attributed to label shift, but the manuscript should explicitly state whether any diagnostic for detecting label shift (beyond post-hoc alignment) is provided to users of the certificate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful identification of the range condition needed to instantiate the Bernoulli-KL PAC-Bayes bounds. We respond point-by-point to the major comments below.

read point-by-point responses

Referee: [Abstract] Abstract (PAC-Bayes instantiation paragraph): the claim that the framework 'yields Bernoulli-KL bounds' on the weighted risk presupposes that the importance-weighted loss r_θ·ℓ (ℓ the 0/1 loss) takes values in [0,1]. The only controls stated are the hard integral constraints E_Q[r_θ]=1, moment matching, and a second-moment penalty; these do not imply an almost-sure upper bound on r_θ, so r_θ·ℓ ranges in [0,r_θ] and the standard Bernoulli-KL derivation does not apply without an additional range argument or rescaling.

Authors: We agree that the standard Bernoulli-KL form requires the (weighted) loss to lie in [0,1] almost surely. The stated integral constraints and second-moment penalty control expectations and effective sample size but do not enforce an a.s. upper bound on r_θ. We will revise the manuscript by adding an explicit almost-sure bound on r_θ (via an additional hard constraint or clipping term in the augmented Lagrangian) and will rescale the weighted loss accordingly so that the Bernoulli-KL derivation applies directly. The abstract and the PAC-Bayes instantiation section will be updated to state this range control explicitly. revision: yes
Referee: [Abstract] Abstract (change-of-measure identity and weighted Gibbs posterior): the decomposition of target risk minus weighted source risk into an L²(Q) ratio-bias term plus generalization gap, followed by identification of the network-weighted Gibbs posterior as the unique KL-regularized minimizer, is load-bearing for the certificate; without the range control above, the subsequent stability claim under L²(Q) perturbations of the learned ratio cannot be instantiated with the stated Bernoulli-KL form.

Authors: We concur that the change-of-measure decomposition, the identification of the network-weighted Gibbs posterior, and the L²(Q)-stability claim all presuppose a valid Bernoulli-KL bound on the weighted risk. Once the range control is added as described above, these elements will be instantiated with the corrected Bernoulli-KL form. We will revise the abstract to make the dependence on the range condition and its resolution explicit, and we will verify that the stability quantification remains valid under the added constraint. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies external PAC-Bayes to learned weighted risk

full rationale

The abstract describes a change-of-measure decomposition into ratio-bias and generalization terms, hard constraints on the density-ratio network via augmented Lagrangian, a second-moment penalty, and instantiation of standard Bernoulli-KL PAC-Bayes bounds on the importance-weighted risk to obtain an anytime certificate via geometric peeling. These steps invoke general theorems (change-of-measure identities and PAC-Bayes) rather than reducing any claimed prediction or certificate to a fitted quantity or self-citation by construction. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors are referenced. The framework remains self-contained against external benchmarks such as PAC-Bayes theory and moment-matching optimization.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No concrete free parameters, axioms, or invented entities can be extracted from the abstract alone; the Radon-Nikodym derivative and PAC-Bayes posterior are standard concepts.

pith-pipeline@v0.9.0 · 5805 in / 1237 out tokens · 25362 ms · 2026-05-25T05:52:24.839588+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

Shimodaira

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function.Jour- nal of Statistical Planning and Inference, 90(2):227–244, 2000. https://doi.org/10.1016/S0378-3758(00) 00115-4

work page doi:10.1016/s0378-3758(00 2000
[2]

Sugiyama, M

M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007

work page 2007
[3]

J. A. Anderson. Multivariate logistic compounds.Biometrika, 66(1):17–26, 1979. https://doi.org/10.1093/ biomet/66.1.17

work page 1979
[4]

S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing.Decision Support Systems, 62:22–31, 2014.https://doi.org/10.1016/j.dss.2014.03.001

work page doi:10.1016/j.dss.2014.03.001 2014
[5]

F. Ding, M. Hardt, J. Miller, and L. Schmidt. Retiring adult: New datasets for fair machine learning. InAdvances in Neural Information Processing Systems 34, pages 6478–6490, 2021. https://arxiv.org/abs/2108.04884

work page arXiv 2021
[6]

Sugiyama, S

M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. InAdvances in Neural Information Processing Systems 20, pages 1433–1440, 2008

work page 2008
[7]

Sugiyama, T

M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation.Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008. https://doi.org/10.1007/s10463-008-0197-x

work page doi:10.1007/s10463-008-0197-x 2008
[8]

Kanamori, S

T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation.Journal of Machine Learning Research, 10:1391–1445, 2009

work page 2009
[9]

Nguyen, M

X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization.IEEE Transactions on Information Theory, 56(11):5847–5861, 2010. https: //doi.org/10.1109/TIT.2010.2068870

work page doi:10.1109/tit.2010.2068870 2010
[10]

A. G. Zhang and J. Chen. Density ratio model with data-adaptive basis function.Journal of Multivariate Analysis, 191:105043, 2022.https://doi.org/10.1016/j.jmva.2022.105043

work page doi:10.1016/j.jmva.2022.105043 2022
[11]

J. H. McVittie and A. G. Zhang. Density ratio model for multiple types of survival data with empirical likelihood. arXiv preprint arXiv:2511.09398, 2025.https://arxiv.org/abs/2511.09398

work page arXiv 2025
[12]

Rhodes, K

B. Rhodes, K. Xu, and M. U. Gutmann. Telescoping density-ratio estimation. InAdvances in Neural Information Processing Systems 33, pages 4905–4916, 2020

work page 2020
[13]

S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density- ratio estimation.Neural Networks, 43:72–83, 2013.https://doi.org/10.1016/j.neunet.2013.01.012

work page doi:10.1016/j.neunet.2013.01.012 2013
[14]

Sugiyama, T

M. Sugiyama, T. Suzuki, and T. Kanamori.Density Ratio Estimation in Machine Learning. Cambridge University Press, Cambridge, 2012.https://doi.org/10.1017/CBO9781139035613

work page doi:10.1017/cbo9781139035613 2012
[15]

D. A. McAllester. Some PAC-Bayesian theorems.Machine Learning, 37(3):355–363, 1999. https://doi.org/ 10.1023/A:1007618624809

work page doi:10.1023/a:1007618624809 1999
[16]

D. A. McAllester. PAC-Bayesian model averaging. InProceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), pages 164–170, 1999. https://doi.org/10.1145/307400.307435

work page doi:10.1145/307400.307435 1999
[17]

D. A. McAllester. PAC-Bayesian stochastic model selection.Machine Learning, 51(1):5–21, 2003. https: //doi.org/10.1023/A:1021840411064. 47 MAY25, 2026

work page doi:10.1023/a:1021840411064 2003
[18]

D. A. McAllester. Simplified PAC-Bayesian margin bounds. InProceedings of the 16th Annual Con- ference on Computational Learning Theory (COLT), pages 203–215, 2003. https://doi.org/10.1007/ 978-3-540-45167-9_16

work page 2003
[19]

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification.Journal of Machine Learning Research, 3:233–269, 2002

work page 2002
[20]

Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

O. Catoni.PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 ofIMS Lecture Notes Monograph Series. Institute of Mathematical Statistics, Beachwood, OH, 2007. https: //arxiv.org/abs/0712.0248

work page internal anchor Pith review Pith/arXiv arXiv 2007
[21]

Tolstikhin and Y

I. Tolstikhin and Y . Seldin. PAC-Bayes-empirical-Bernstein inequality. InAdvances in Neural Information Processing Systems 26, pages 109–117, 2013

work page 2013
[22]

Thiemann, C

N. Thiemann, C. Igel, O. Wintenberger, and Y . Seldin. A strongly quasiconvex PAC-Bayesian bound. In Proceedings of the 28th International Conference on Algorithmic Learning Theory (ALT), volume 76 ofProceedings of Machine Learning Research, pages 1–26, 2017

work page 2017
[23]

Germain, A

P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and S. Shanian. From PAC-Bayes bounds to KL regularization. InAdvances in Neural Information Processing Systems 22, pages 603–610, 2009

work page 2009
[24]

Alquier, J

P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(236):1–41, 2016

work page 2016
[25]

Keshet, D

J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2224–2227, 2011.https://doi.org/10.1109/ICASSP.2011.5946923

work page doi:10.1109/icassp.2011.5946923 2011
[26]

Chugg, H

B. Chugg, H. Wang, and A. Ramdas. A unified recipe for deriving (time-uniform) PAC-Bayes bounds.Journal of Machine Learning Research, 24(372):1–61, 2023

work page 2023
[27]

Rodríguez-Gálvez, R

B. Rodríguez-Gálvez, R. Thobaben, and M. Skoglund. More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity.Journal of Machine Learning Research, 25(110):1–43, 2024

work page 2024
[28]

D. A. Levin and Y . Peres.Markov Chains and Mixing Times. American Mathematical Society, second edition, 2017.https://bookstore.ams.org/mbk-107

work page 2017
[29]

T. M. Cover and J. A. Thomas.Elements of Information Theory. John Wiley & Sons, second edition, 2005. https://doi.org/10.1002/047174882X

work page doi:10.1002/047174882x 2005
[30]

Billingsley.Probability and Measure

P. Billingsley.Probability and Measure. Wiley Series in Probability and Statistics. John Wiley & Sons, anniversary edition, 2012. ISBN 978-1-118-34191-9

work page 2012
[31]

Çınlar.Probability and Stochastics

E. Çınlar.Probability and Stochastics. Graduate Texts in Mathematics, vol. 261. Springer, 2011. https: //doi.org/10.1007/978-0-387-87859-1

work page doi:10.1007/978-0-387-87859-1 2011
[32]

L. C. Evans.Partial Differential Equations. Graduate Studies in Mathematics, vol. 19. American Mathematical Society, second edition, 2010.https://bookstore.ams.org/gsm-19-r

work page 2010
[33]

Ledoux and M

M. Ledoux and M. Talagrand.Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics. Springer-Verlag Berlin Heidelberg, 1991, reprinted 2011.https://doi.org/10.1007/978-3-642-20212-4. 48 MAY25, 2026 A Measure theory and the Radon-Nikodym theorem The entire framework developed in this paper rests on a single object, the Radon-Nikod...

work page doi:10.1007/978-3-642-20212-4 1991

[1] [1]

Shimodaira

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function.Jour- nal of Statistical Planning and Inference, 90(2):227–244, 2000. https://doi.org/10.1016/S0378-3758(00) 00115-4

work page doi:10.1016/s0378-3758(00 2000

[2] [2]

Sugiyama, M

M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007

work page 2007

[3] [3]

J. A. Anderson. Multivariate logistic compounds.Biometrika, 66(1):17–26, 1979. https://doi.org/10.1093/ biomet/66.1.17

work page 1979

[4] [4]

S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing.Decision Support Systems, 62:22–31, 2014.https://doi.org/10.1016/j.dss.2014.03.001

work page doi:10.1016/j.dss.2014.03.001 2014

[5] [5]

F. Ding, M. Hardt, J. Miller, and L. Schmidt. Retiring adult: New datasets for fair machine learning. InAdvances in Neural Information Processing Systems 34, pages 6478–6490, 2021. https://arxiv.org/abs/2108.04884

work page arXiv 2021

[6] [6]

Sugiyama, S

M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. InAdvances in Neural Information Processing Systems 20, pages 1433–1440, 2008

work page 2008

[7] [7]

Sugiyama, T

M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation.Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008. https://doi.org/10.1007/s10463-008-0197-x

work page doi:10.1007/s10463-008-0197-x 2008

[8] [8]

Kanamori, S

T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation.Journal of Machine Learning Research, 10:1391–1445, 2009

work page 2009

[9] [9]

Nguyen, M

X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization.IEEE Transactions on Information Theory, 56(11):5847–5861, 2010. https: //doi.org/10.1109/TIT.2010.2068870

work page doi:10.1109/tit.2010.2068870 2010

[10] [10]

A. G. Zhang and J. Chen. Density ratio model with data-adaptive basis function.Journal of Multivariate Analysis, 191:105043, 2022.https://doi.org/10.1016/j.jmva.2022.105043

work page doi:10.1016/j.jmva.2022.105043 2022

[11] [11]

J. H. McVittie and A. G. Zhang. Density ratio model for multiple types of survival data with empirical likelihood. arXiv preprint arXiv:2511.09398, 2025.https://arxiv.org/abs/2511.09398

work page arXiv 2025

[12] [12]

Rhodes, K

B. Rhodes, K. Xu, and M. U. Gutmann. Telescoping density-ratio estimation. InAdvances in Neural Information Processing Systems 33, pages 4905–4916, 2020

work page 2020

[13] [13]

S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density- ratio estimation.Neural Networks, 43:72–83, 2013.https://doi.org/10.1016/j.neunet.2013.01.012

work page doi:10.1016/j.neunet.2013.01.012 2013

[14] [14]

Sugiyama, T

M. Sugiyama, T. Suzuki, and T. Kanamori.Density Ratio Estimation in Machine Learning. Cambridge University Press, Cambridge, 2012.https://doi.org/10.1017/CBO9781139035613

work page doi:10.1017/cbo9781139035613 2012

[15] [15]

D. A. McAllester. Some PAC-Bayesian theorems.Machine Learning, 37(3):355–363, 1999. https://doi.org/ 10.1023/A:1007618624809

work page doi:10.1023/a:1007618624809 1999

[16] [16]

D. A. McAllester. PAC-Bayesian model averaging. InProceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), pages 164–170, 1999. https://doi.org/10.1145/307400.307435

work page doi:10.1145/307400.307435 1999

[17] [17]

D. A. McAllester. PAC-Bayesian stochastic model selection.Machine Learning, 51(1):5–21, 2003. https: //doi.org/10.1023/A:1021840411064. 47 MAY25, 2026

work page doi:10.1023/a:1021840411064 2003

[18] [18]

D. A. McAllester. Simplified PAC-Bayesian margin bounds. InProceedings of the 16th Annual Con- ference on Computational Learning Theory (COLT), pages 203–215, 2003. https://doi.org/10.1007/ 978-3-540-45167-9_16

work page 2003

[19] [19]

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification.Journal of Machine Learning Research, 3:233–269, 2002

work page 2002

[20] [20]

Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

O. Catoni.PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 ofIMS Lecture Notes Monograph Series. Institute of Mathematical Statistics, Beachwood, OH, 2007. https: //arxiv.org/abs/0712.0248

work page internal anchor Pith review Pith/arXiv arXiv 2007

[21] [21]

Tolstikhin and Y

I. Tolstikhin and Y . Seldin. PAC-Bayes-empirical-Bernstein inequality. InAdvances in Neural Information Processing Systems 26, pages 109–117, 2013

work page 2013

[22] [22]

Thiemann, C

N. Thiemann, C. Igel, O. Wintenberger, and Y . Seldin. A strongly quasiconvex PAC-Bayesian bound. In Proceedings of the 28th International Conference on Algorithmic Learning Theory (ALT), volume 76 ofProceedings of Machine Learning Research, pages 1–26, 2017

work page 2017

[23] [23]

Germain, A

P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and S. Shanian. From PAC-Bayes bounds to KL regularization. InAdvances in Neural Information Processing Systems 22, pages 603–610, 2009

work page 2009

[24] [24]

Alquier, J

P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(236):1–41, 2016

work page 2016

[25] [25]

Keshet, D

J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2224–2227, 2011.https://doi.org/10.1109/ICASSP.2011.5946923

work page doi:10.1109/icassp.2011.5946923 2011

[26] [26]

Chugg, H

B. Chugg, H. Wang, and A. Ramdas. A unified recipe for deriving (time-uniform) PAC-Bayes bounds.Journal of Machine Learning Research, 24(372):1–61, 2023

work page 2023

[27] [27]

Rodríguez-Gálvez, R

B. Rodríguez-Gálvez, R. Thobaben, and M. Skoglund. More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity.Journal of Machine Learning Research, 25(110):1–43, 2024

work page 2024

[28] [28]

D. A. Levin and Y . Peres.Markov Chains and Mixing Times. American Mathematical Society, second edition, 2017.https://bookstore.ams.org/mbk-107

work page 2017

[29] [29]

T. M. Cover and J. A. Thomas.Elements of Information Theory. John Wiley & Sons, second edition, 2005. https://doi.org/10.1002/047174882X

work page doi:10.1002/047174882x 2005

[30] [30]

Billingsley.Probability and Measure

P. Billingsley.Probability and Measure. Wiley Series in Probability and Statistics. John Wiley & Sons, anniversary edition, 2012. ISBN 978-1-118-34191-9

work page 2012

[31] [31]

Çınlar.Probability and Stochastics

E. Çınlar.Probability and Stochastics. Graduate Texts in Mathematics, vol. 261. Springer, 2011. https: //doi.org/10.1007/978-0-387-87859-1

work page doi:10.1007/978-0-387-87859-1 2011

[32] [32]

L. C. Evans.Partial Differential Equations. Graduate Studies in Mathematics, vol. 19. American Mathematical Society, second edition, 2010.https://bookstore.ams.org/gsm-19-r

work page 2010

[33] [33]

Ledoux and M

M. Ledoux and M. Talagrand.Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics. Springer-Verlag Berlin Heidelberg, 1991, reprinted 2011.https://doi.org/10.1007/978-3-642-20212-4. 48 MAY25, 2026 A Measure theory and the Radon-Nikodym theorem The entire framework developed in this paper rests on a single object, the Radon-Nikod...

work page doi:10.1007/978-3-642-20212-4 1991