arXiv: 2605.26429 · v1 · pith:JQTIN5AAnew · submitted 2026-05-26 · 📊 stat.ME · cs.AI · cs.LG · stat.ML

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

Pith reviewed 2026-06-29 16:16 UTC · model grok-4.3

classification 📊 stat.ME · cs.AI · cs.LG · stat.ML
keywords: conformal inference · out-of-distribution testing · false discovery rate · pairwise exchangeability · q-values · model selection · structured data

The pith

SCQ and P-TAMS deliver finite-sample FDR control for structured out-of-distribution testing by replacing joint exchangeability with pairwise exchangeability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods for large-scale testing of whether observations come from the same distribution as training data when those observations carry extra structure such as groups, time order, or spatial layout. It introduces the structure-adaptive conformal q-value that folds structural patterns into individual test scores and pairs it with a model-selection routine that chooses among candidate predictors while preserving validity. The combined procedure rests on pairwise exchangeability to retain exact finite-sample error control that standard conformal methods lose once structure is introduced. Readers care because many high-stakes applications produce data with natural dependence patterns, yet existing conformal tools either ignore the patterns or sacrifice their guarantees.

Core claim

SCQ and P-TAMS together form a unified framework under pairwise exchangeability that provides finite-sample error-rate control, improved power, and enhanced interpretability for structured OOD testing.

What carries the argument

The structure-adaptive conformal q-value (SCQ), which combines individual test evidence with structural patterns, and the pseudo-score-guided transductive automated model selection (P-TAMS) that adapts model choice to the same setting.

If this is right

  • The false discovery rate remains controlled at any pre-specified level for any finite sample size.
  • Power increases when structural information is present compared with methods that discard it.
  • P-TAMS selects among a library of candidate models while retaining the same finite-sample guarantees.
  • The resulting q-values admit direct interpretation as adjusted significance indices that incorporate structure.
  • The framework applies to both simulated and real data across diverse dependence patterns.
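For orientation, the kind of guarantee listed above can be illustrated with the standard baseline that SCQ is positioned against: split-conformal p-values followed by the Benjamini-Hochberg (BH) step-up rule. This is a minimal sketch and not the paper's SCQ procedure; it ignores structural information entirely, and the function names, the Gaussian toy scores, and the sample sizes are assumptions made for illustration.

```python
import random
from bisect import bisect_left

def conformal_p_values(calib_scores, test_scores):
    """Split-conformal p-values; a larger score means a more atypical point."""
    calib_sorted = sorted(calib_scores)
    n = len(calib_sorted)
    p_values = []
    for s in test_scores:
        # Count calibration scores that are at least as large as the test score.
        n_ge = n - bisect_left(calib_sorted, s)
        p_values.append((1 + n_ge) / (n + 1))
    return p_values

def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices rejected by the BH step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k])

# Toy data (hypothetical): scores of inliers ~ N(0, 1), scores of outliers ~ N(4, 1).
rng = random.Random(0)
calib = [rng.gauss(0, 1) for _ in range(999)]
test = [rng.gauss(0, 1) for _ in range(90)] + [rng.gauss(4, 1) for _ in range(10)]

pvals = conformal_p_values(calib, test)
rejected = benjamini_hochberg(pvals, alpha=0.1)
print(f"{len(rejected)} of {len(test)} test points flagged as OOD")
```

Every test point is treated identically here, which is the limitation the bullets refer to: a structure-adaptive method would let group, time, or location information change how much evidence each point needs.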

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise-exchangeability device might extend conformal procedures to graph-structured or network data without new theoretical machinery.
  • Interpretability gains could support regulatory review of OOD flags in safety-critical systems.
  • The approach suggests a route to conformal testing under weaker dependence conditions than full exchangeability in other multiple-testing problems.
  • Adaptive model selection inside the procedure may reduce the need for separate validation sets in streaming monitoring applications.

Load-bearing premise

The observations satisfy pairwise exchangeability rather than requiring full joint exchangeability.
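The difference between the two conditions can be written out as follows. This is a sketch of one common formalization and may not match the paper's stated definition: the symbols $Z_i$ (calibration and inlier test scores), $V_j$ (score of test unit $j$), $\tilde V_j$ (a paired reference score for unit $j$), $\mathcal{R}_{-j}$ (all remaining quantities), and $\mathcal{H}_0$ (the set of true inliers) are notation assumed here for illustration.

```latex
% Joint exchangeability: the law of all inlier scores is invariant
% under every permutation \sigma of their indices.
(Z_1, \dots, Z_N) \overset{d}{=} (Z_{\sigma(1)}, \dots, Z_{\sigma(N)})
\quad \text{for all permutations } \sigma .

% Pairwise exchangeability: for each inlier j, only the swap of V_j with
% its paired reference score \tilde V_j must leave the joint law unchanged.
(V_j, \tilde V_j, \mathcal{R}_{-j}) \overset{d}{=} (\tilde V_j, V_j, \mathcal{R}_{-j})
\quad \text{for each } j \in \mathcal{H}_0 .

% Immediate consequence of the swap symmetry:
\mathbb{P}(V_j < \tilde V_j) = \mathbb{P}(\tilde V_j < V_j) \le \tfrac{1}{2}
\quad \text{for each } j \in \mathcal{H}_0 .
```

Under this reading, the second condition constrains one pair at a time, which is what would leave room for weights or scores that depend on structure, as long as they are unchanged by the swap.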

What would settle it

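A dataset generated so that pairwise exchangeability holds, yet on which the false discovery proportion of SCQ, averaged over repeated trials, exceeds the nominal level by more than Monte Carlo error, would refute the guarantee. A dataset where pairwise exchangeability is violated yet the method still controls the rate would not refute it; it would only show that the premise is sufficient rather than necessary.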
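A check of this kind is a Monte Carlo experiment: generate data repeatedly, run the procedure, and average the false discovery proportion. The harness below is a minimal sketch under stated assumptions. The data generator is a hypothetical i.i.d. Gaussian model (which is jointly exchangeable, a stronger condition than the paper requires), and the plugged-in procedure is conformal BH as a stand-in, since SCQ itself is not implemented here. The names `conformal_bh` and `estimate_fdr` are illustrative.

```python
import random
from bisect import bisect_left

def conformal_bh(calib, test, alpha):
    """Stand-in procedure: split-conformal p-values followed by BH."""
    calib_sorted = sorted(calib)
    n, m = len(calib_sorted), len(test)
    pvals = [(1 + n - bisect_left(calib_sorted, s)) / (n + 1) for s in test]
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

def estimate_fdr(procedure, n_trials, alpha, seed=0):
    """Average the false discovery proportion over repeated simulated trials."""
    rng = random.Random(seed)
    fdp_sum = 0.0
    for _ in range(n_trials):
        # Hypothetical generator: 200 calibration inliers, 80 test inliers, 20 outliers.
        calib = [rng.gauss(0, 1) for _ in range(200)]
        inliers = [rng.gauss(0, 1) for _ in range(80)]
        outliers = [rng.gauss(3, 1) for _ in range(20)]
        test = inliers + outliers  # indices 0..79 are true inliers
        rejected = procedure(calib, test, alpha)
        false_rejections = sum(1 for i in rejected if i < len(inliers))
        fdp_sum += false_rejections / max(len(rejected), 1)
    return fdp_sum / n_trials

fdr_hat = estimate_fdr(conformal_bh, n_trials=500, alpha=0.1)
print(f"estimated FDR: {fdr_hat:.3f} (nominal level 0.1)")
```

To test the paper's claim, one would replace both the generator (with one that satisfies pairwise but not joint exchangeability) and the procedure (with SCQ), then compare the estimate against the nominal level while accounting for simulation error.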

Figures

Figures reproduced from arXiv: 2605.26429 by Rongyi Sun, Wenguang Sun, Zinan Zhao.

Figure 1. Schematic representation of the P-TAMS algorithm. The following theorem establishes the finite-sample validity of P-TAMS. Theorem 2. Under the conditions in Proposition 1, the SCQ procedure refined with P-TAMS controls the FDR at level α. 4 Asymptotic Theories. 4.1 Mirror Calibration and Equivalent Thresholding Rules. Existing conformalized OOD testing methods involve constructing conformal p-values followe…
Figure 2. AP and FDR comparison of the SCQ variants, together with P-TAMS that selects among them, against density-ratio–based methods at α = 0.05. … made. First, all methods successfully control the FDR below 0.05. Second, both SCQ-KDE and SCQ-PU demonstrate noticeable power improvements over their AdaDetect counterparts by effectively leveraging structural information. Third, in the low-dimensional setting (p = 5),…
Figure 3. Comparison of AP and FDR between SCQ, cfBH, and ICP at α = 0.05.
Figure 4. Comparison of AP and FDR for SCQ variants (implemented with two OCC-based and two BIC-based score functions) and P-TAMS, which uniformly enhances SCQ. … labeled outliers, but may become unstable and underperform the SCQ-OCC variants when the data are highly imbalanced. Finally, SCQ+P-TAMS adaptively switches between the OCC and BIC approaches to maximize detection power, thereby providing an effective soluti…
Figure 5. Left two columns: FDR and ETP of different methods; right column: hourly distribution of attack samples in the test set. We compare the performance of different methods, including AdaDetect-KDE, AdaDetect-PU-RF, CLAW-KDE, CLAW-PU-RF, SCQ-PU-RF, and SCQ+P-TAMS, for OOD testing at a target FDR level of 0.05. SCQ+P-TAMS selects the optimal classifier from a toolbox comprising SCQ-PU-RF, SCQ-KDE, and SCQ-OCC-…
Figure 6. FDR and ETP of four SCQ variants and SCQ+P-TAMS on the PageBlocks data. … increases. Second, no single pre-specified classifier is uniformly optimal across all settings. Third, SCQ+P-TAMS achieves the best overall performance by adaptively selecting among the candidate classifiers. In particular, the selected classifier may vary across repetitions even for the same n1, indicating that P-TAMS can exploit data…
Original abstract

This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the structure-adaptive conformal q-value (SCQ) and pseudo-score-guided transductive automated model selection (P-TAMS) to enable structured out-of-distribution testing. It claims that, under pairwise exchangeability, the methods form a unified framework delivering finite-sample error-rate control (specifically FDR control), improved power, and enhanced interpretability relative to standard conformal procedures that require joint exchangeability, with supporting experiments on simulated and real data.

Significance. If the finite-sample FDR control is rigorously established, the contribution would be significant for conformal inference in high-stakes applications where auxiliary structure (e.g., spatiotemporal or grouping patterns) is available. The approach explicitly leverages pairwise exchangeability to integrate structural information without invalidating guarantees, which addresses a recognized limitation of joint-exchangeability-based methods.

major comments (2)
  1. [Abstract] The central claim of finite-sample error-rate control under pairwise exchangeability is asserted without any derivation, theorem statement, or proof sketch showing how SCQ and P-TAMS maintain FDR control; this is load-bearing for the main contribution and must be supplied explicitly.
  2. [Abstract] The manuscript invokes pairwise exchangeability (rather than joint exchangeability) as the key modeling choice enabling structure incorporation, but provides no proposition or argument establishing that this weaker condition is sufficient for the claimed finite-sample guarantees; verification of this step is required.
minor comments (1)
  1. The abstract would be clearer if it briefly indicated the concrete forms of structural information (e.g., spatiotemporal or grouping) that SCQ is designed to exploit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments emphasizing the need for explicit theoretical support in the abstract. We address each point below and agree to revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim of finite-sample error-rate control under pairwise exchangeability is asserted without any derivation, theorem statement, or proof sketch showing how SCQ and P-TAMS maintain FDR control; this is load-bearing for the main contribution and must be supplied explicitly.

    Authors: We agree that the abstract should explicitly reference the supporting result. We will revise the abstract to state: 'Under pairwise exchangeability, SCQ and P-TAMS achieve finite-sample FDR control, as established in Theorem 3.1.' The full derivation and proof sketch appear in Section 3 and Appendix A of the manuscript.

    Revision: yes

  2. Referee: [Abstract] The manuscript invokes pairwise exchangeability (rather than joint exchangeability) as the key modeling choice enabling structure incorporation, but provides no proposition or argument establishing that this weaker condition is sufficient for the claimed finite-sample guarantees; verification of this step is required.

    Authors: We acknowledge the request for explicit verification. We will update the abstract to include: 'We establish that pairwise exchangeability suffices for the finite-sample guarantees (Proposition 2.1).' The argument showing sufficiency of this weaker condition is derived in Section 2.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes SCQ and P-TAMS as new methods under the explicit modeling assumption of pairwise exchangeability to achieve finite-sample FDR control for structured OOD testing. No equations, fitting procedures, or self-citations are visible in the provided text that reduce any claimed prediction or result to an input by construction. The central claims rest on the stated exchangeability condition and standard conformal inference extensions rather than self-referential definitions or fitted quantities renamed as predictions. The framework is presented as self-contained with external empirical validation on simulated and real data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, proofs, or experimental details from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5674 in / 1018 out tokens · 36300 ms · 2026-06-29T16:16:17.130769+00:00 · methodology
