pith. sign in

arxiv: 2605.17251 · v1 · pith:I52ZNCZOnew · submitted 2026-05-17 · 💻 cs.DS · cs.LG

Iterative Chow Filtering for Learning with Distribution Shift

Pith reviewed 2026-05-19 23:15 UTC · model grok-4.3

classification 💻 cs.DS cs.LG
keywords PQ learningdistribution shiftDNF learningChow parametersL1 sandwichingiterative filteringPAC learningabstention
0
0 comments X p. Extension
pith:I52ZNCZO Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{I52ZNCZO}

Prints a linked pith:I52ZNCZO badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

L1 sandwiching suffices for efficient PQ learning of DNFs under distribution shift

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the weaker L1 sandwiching approximation is enough to achieve efficient PQ learning, where a learner receives labeled training examples and unlabeled test examples and must predict correctly on test points that match the training distribution while abstaining elsewhere. Prior methods required stronger L2 sandwiching, which produced weak bounds for basic classes such as DNF formulas. The new result yields the first quasipolynomial-time PQ algorithm for DNFs under the uniform distribution that essentially matches ordinary PAC learning guarantees, plus exponential improvements for constant-depth circuits and constant-degree polynomial threshold functions.

Core claim

We show that the weaker notion of L1 sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.

What carries the argument

Iterative Chow Filtering: a procedure that repeatedly uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution

If this is right

  • Yields the first quasipolynomial-time PQ algorithm for learning DNF formulas under the uniform distribution.
  • Gives exponential improvements for learning constant-depth circuits in the PQ model.
  • Extends efficient PQ learning to constant-degree polynomial threshold functions.
  • Matches the best known ordinary PAC learning bounds for these classes even when test points may be out of distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The filtering step may extend naturally to other function classes that admit low-degree Chow parameter approximations.
  • Similar iterative removal could improve robustness in related settings such as learning with partial labels or noisy test data.
  • Empirical tests on small DNFs could check whether the low-degree filter remains effective when the shift is mild rather than adversarial.

Load-bearing premise

Low-degree Chow parameters can reliably identify and remove test points incompatible with the training distribution without harming learning guarantees for the target class.

What would settle it

A concrete DNF example and distribution shift where low-degree Chow parameters fail to separate compatible from incompatible test points, causing the PQ learner to either abstain too often or output incorrect predictions on in-distribution points.

read the original abstract

Recent work due to Goel et al. gave the first efficient algorithms for learning with distribution shift in the challenging PQ framework. In this setting, a learner receives labeled training examples, unlabeled test examples, and must make correct predictions on the test set but is allowed to abstain from predicting on out-of-distribution points. Their results rely on ${\cal L}_2$ sandwiching approximations, a strong requirement that leads to poor bounds for several basic function classes such as DNF formulas. Here, we show that the weaker notion of ${\cal L}_1$ sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. More broadly, our bounds provide exponential improvements for several classes including constant depth circuits and constant degree polynomial threshold functions. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the weaker L1 sandwiching notion suffices for efficient PQ learning under distribution shift, in contrast to prior L2 sandwiching results. As a consequence, it presents Iterative Chow Filtering—a procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution—and obtains the first quasipolynomial-time PQ algorithm for DNFs under the uniform distribution, essentially matching known PAC guarantees, together with exponential improvements for constant-depth circuits and constant-degree polynomial threshold functions.

Significance. If the central claims hold, the work is significant: it relaxes a strong approximation requirement to a weaker one that is known to be achievable with better parameters for several natural classes, directly yielding improved PQ learning bounds that were previously out of reach. The introduction of Iterative Chow Filtering as a general-purpose tool for handling distribution shift via Chow parameters is a reusable technical contribution.

major comments (2)
  1. [§4] §4 (Iterative Chow Filtering procedure): the argument that repeated point removal changes the effective test measure by at most a negligible additive factor in L1 sandwiching error is not shown explicitly for the quasipolynomial regime. For DNFs the base L1 sandwiching error is already only quasipolynomial; even a small multiplicative blow-up in the error or sample complexity after filtering would push the overall runtime outside quasipolynomial, and no explicit lemma bounding the total error increase across iterations is cited.
  2. [Theorem 1.1] Theorem 1.1 (main PQ learning result for DNFs): the reduction from L1 sandwiching to PQ learning is stated to follow from the filtering procedure, yet the proof sketch does not address how the low-degree Chow parameters remain reliable estimators after the distribution has been altered by filtering; a concrete bound relating the degree, the number of iterations, and the final L1 error is needed to confirm the claimed quasipolynomial runtime.
minor comments (2)
  1. [Introduction] The abstract and introduction use “essentially match the guarantees known for ordinary PAC learning” without a side-by-side comparison table of sample and time complexities; adding such a table would clarify the improvement.
  2. [Preliminaries] Notation for the L1 sandwiching error (Definition 2.3) is introduced after the main theorem statement; moving the definition earlier would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting these points about explicitness in the quasipolynomial regime. We agree that additional detail will strengthen the presentation and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Iterative Chow Filtering procedure): the argument that repeated point removal changes the effective test measure by at most a negligible additive factor in L1 sandwiching error is not shown explicitly for the quasipolynomial regime. For DNFs the base L1 sandwiching error is already only quasipolynomial; even a small multiplicative blow-up in the error or sample complexity after filtering would push the overall runtime outside quasipolynomial, and no explicit lemma bounding the total error increase across iterations is cited.

    Authors: We agree that the current write-up of Section 4 presents the error accumulation argument at a high level and does not include an explicit lemma tailored to the quasipolynomial parameter setting. In the revised manuscript we will add a new lemma (Lemma 4.5) that bounds the total additive L1 error increase over all iterations. The lemma shows that when each filtering step contributes an additive δ = ε / polylog(n) and the number of iterations is at most quasipolynomial, the overall L1 sandwiching error remains quasipolynomial; consequently the sample complexity and runtime stay inside the claimed bound with no multiplicative blow-up. revision: yes

  2. Referee: [Theorem 1.1] Theorem 1.1 (main PQ learning result for DNFs): the reduction from L1 sandwiching to PQ learning is stated to follow from the filtering procedure, yet the proof sketch does not address how the low-degree Chow parameters remain reliable estimators after the distribution has been altered by filtering; a concrete bound relating the degree, the number of iterations, and the final L1 error is needed to confirm the claimed quasipolynomial runtime.

    Authors: The referee is correct that the proof sketch of Theorem 1.1 is brief and does not explicitly track the effect of filtering on the reliability of the low-degree Chow estimators. While the full argument in Section 4 already shows that filtering removes only points whose Chow parameters deviate by more than the target error, we will expand the proof sketch to include a concrete bound: after k iterations the estimation error on degree-d Chow parameters is at most the original L1 sandwiching error plus an additive term O(k · ε / d), which remains quasipolynomial when d and k are quasipolynomial. This will make the reduction from L1 sandwiching to PQ learning fully explicit for the DNF case. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core contribution is showing that L1 sandwiching suffices for efficient PQ learning (improving on prior L2-based results from Goel et al.), with Iterative Chow Filtering presented as a new procedure that leverages standard low-degree Chow parameters to filter test points. No step reduces a claimed prediction or theorem to a fitted quantity, self-defined input, or load-bearing self-citation chain by construction. The abstract and described technical ingredient treat Chow parameters as an external primitive from learning theory, and the quasipolynomial bounds for DNFs follow from the new L1 sufficiency argument without tautological equivalence to the inputs. This is the common case of an independent algorithmic result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that L1 sandwiching approximations exist for the target classes and that low-degree Chow parameters suffice to detect distribution mismatch. No explicit free parameters or invented entities are named.

axioms (2)
  • domain assumption L1 sandwiching approximations exist and suffice to control the PQ error for the function classes considered
    The abstract states that the weaker L1 notion replaces the stronger L2 requirement used in prior work.
  • domain assumption The data is drawn from the uniform distribution
    The quasipolynomial bound for DNFs is stated to hold under the uniform distribution.
invented entities (1)
  • Iterative Chow Filtering procedure no independent evidence
    purpose: Identify and remove test points incompatible with the training distribution using low-degree Chow parameters
    New algorithmic component introduced to achieve the L1-based PQ learning result.

pith-pipeline@v0.9.0 · 5727 in / 1496 out tokens · 68246 ms · 2026-05-19T23:15:55.177753+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    [VW09] Aad Vaart and Jon Wellner

    Schloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik. [VW09] Aad Vaart and Jon Wellner. A note on bounds for vc dimensions.Institute of Mathematical Statistics collections, 5:103–107, 01 2009. [Win71] R. O. Winder. Chow parameters in threshold logic.Journal of the ACM, 18:265–289, 1971. 16 A Extended Preliminaries A.1 Notation Let N ={0, 1,...}denote the s...

  2. [2]

    Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

    Let any p1,...,pk∈Pd,ℓsuch that maxi∈[k]Ex∼D[pi(x)2]≤β. Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

  3. [3]

    Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2

    With probability at least1−δoverSandS ′, it holds Pr x∼D [s(x) = 0]≤min {1 +ϵ R ,dTV(D,D′) +ϵ R−1 } . Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2. First, we show that Algorithm 2 is efficient. For any f∈F, the setP(f) can be described by O(|S|...

  4. [4]

    examples fromD train

    It draws a setS train of poly ( (d+ 1)ℓ,1/ϵ,1/η,log(1/δ) ) i.i.d. examples fromD train

  5. [6]

    examples from D andD′ respectively

    It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,1/η,log(1/δ)ℓ) i.i.d. examples from D andD′ respectively

  6. [7]

    It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R′= 1/η+ϵ/96,β= 4(2A)2ℓ,ϵ′=ϵη/96 to get a selectors:X→{0,1}

  7. [8]

    First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2

    It returns ( ˆf,s). First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2. Now, let any f⋆∈Csuch that errtrain(f⋆) =λtrain and errtest(f⋆) =λtest and let pup,p down be ϵ′′-approximateL1 sandwiching polynomials for f⋆underD of degree at most ℓ, where ϵ...

  8. [9]

    examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d

    It draws a set Strain of poly ( (d+ 1)ℓ,R/ϵ,log(1/δ) ) i.i.d. examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d. examples fromD ′

  9. [10]

    It runs the degree-ℓL1 polynomial regression algorithm of Theorem E.1 with input Strain to get a classifier ˆf:R d→{0,1}

  10. [11]

    examples from D and D′respectively

    It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,R(R−1)−1,log (1/δ)ℓ) i.i.d. examples from D and D′respectively

  11. [12]

    It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R,β= 4(2A)2ℓ,ϵ′= (R−1)ϵ/(128R2) to get a selectors:X→{0,1}

  12. [13]

    Proof of Soundness.We assume that the algorithm accepts

    If Prx∼U(Xtest)[s(x) = 0] ≥R(R−1)−1θ+ϵ/4 it returns REJECT, otherwise it returns (ACCEPT, ˆf). Proof of Soundness.We assume that the algorithm accepts. So, from Hoeffding’s inequality, since|Xtest|≳log(1/δ)/ϵ2, with probability at least 1−δ/3, it holds Pr x∼D′ [s(x) = 0]≤Pr x∼U(Xtest) [s(x) = 0] +ϵ/4≤Rθ R−1+ϵ/2. Moreover, since|Strain|is sufficiently larg...