Iterative Chow Filtering for Learning with Distribution Shift

arxiv: 2605.17251 · v1 · pith:I52ZNCZOnew · submitted 2026-05-17 · 💻 cs.DS · cs.LG

Iterative Chow Filtering for Learning with Distribution Shift

Gautam Chandrasekaran , Georgios Gkrinias , Adam R. Klivans , Konstantinos Stavropoulos , Arsen Vasilyan This is my paper

Pith reviewed 2026-05-19 23:15 UTC · model grok-4.3

classification 💻 cs.DS cs.LG

keywords PQ learningdistribution shiftDNF learningChow parametersL1 sandwichingiterative filteringPAC learningabstention

0 comments p. Extension

pith:I52ZNCZO Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{I52ZNCZO}

Prints a linked pith:I52ZNCZO badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

L1 sandwiching suffices for efficient PQ learning of DNFs under distribution shift

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the weaker L1 sandwiching approximation is enough to achieve efficient PQ learning, where a learner receives labeled training examples and unlabeled test examples and must predict correctly on test points that match the training distribution while abstaining elsewhere. Prior methods required stronger L2 sandwiching, which produced weak bounds for basic classes such as DNF formulas. The new result yields the first quasipolynomial-time PQ algorithm for DNFs under the uniform distribution that essentially matches ordinary PAC learning guarantees, plus exponential improvements for constant-depth circuits and constant-degree polynomial threshold functions.

Core claim

We show that the weaker notion of L1 sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.

What carries the argument

Iterative Chow Filtering: a procedure that repeatedly uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution

If this is right

Yields the first quasipolynomial-time PQ algorithm for learning DNF formulas under the uniform distribution.
Gives exponential improvements for learning constant-depth circuits in the PQ model.
Extends efficient PQ learning to constant-degree polynomial threshold functions.
Matches the best known ordinary PAC learning bounds for these classes even when test points may be out of distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The filtering step may extend naturally to other function classes that admit low-degree Chow parameter approximations.
Similar iterative removal could improve robustness in related settings such as learning with partial labels or noisy test data.
Empirical tests on small DNFs could check whether the low-degree filter remains effective when the shift is mild rather than adversarial.

Load-bearing premise

Low-degree Chow parameters can reliably identify and remove test points incompatible with the training distribution without harming learning guarantees for the target class.

What would settle it

A concrete DNF example and distribution shift where low-degree Chow parameters fail to separate compatible from incompatible test points, causing the PQ learner to either abstain too often or output incorrect predictions on in-distribution points.

read the original abstract

Recent work due to Goel et al. gave the first efficient algorithms for learning with distribution shift in the challenging PQ framework. In this setting, a learner receives labeled training examples, unlabeled test examples, and must make correct predictions on the test set but is allowed to abstain from predicting on out-of-distribution points. Their results rely on ${\cal L}_2$ sandwiching approximations, a strong requirement that leads to poor bounds for several basic function classes such as DNF formulas. Here, we show that the weaker notion of ${\cal L}_1$ sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. More broadly, our bounds provide exponential improvements for several classes including constant depth circuits and constant degree polynomial threshold functions. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

L1 sandwiching plus Iterative Chow Filtering gives the first quasipolynomial PQ algorithm for DNFs that matches PAC bounds, though the error change after filtering needs a close check.

read the letter

Hi, the main point is that this paper shows the weaker L1 sandwiching condition is enough for efficient PQ learning under distribution shift. That yields the first quasipolynomial-time PQ algorithm for DNFs on the uniform distribution and essentially matches the known PAC guarantees. They also report exponential improvements for constant-depth circuits and constant-degree polynomial threshold functions. The new piece is Iterative Chow Filtering, which uses low-degree Chow parameters to identify and drop test points that are incompatible with the training distribution. This replaces the stronger L2 sandwiching from Goel et al. and appears to be a direct, useful simplification that still preserves the learning guarantees for these classes. The abstract frames the procedure as the central technical step and builds cleanly on standard Chow parameters without obvious circularity or invented quantities. The soft spot is whether repeated filtering keeps the L1 approximation error from growing by more than negligible additive factors. For DNFs the base sandwiching is already only quasipolynomial, so any extra multiplicative blow-up in error or sample size from changing the effective test measure could push the runtime outside the target bound. The stress-test concern is reasonable and worth verifying in the full proof; if the paper bounds the change tightly, the result holds up. This is aimed at learning theorists who work on Boolean functions and robust learning with distribution shift. Readers who know Chow parameters and sandwiching approximations will extract the most value. The paper shows clear thinking and honest engagement with the prior literature, so it deserves a serious referee even if the error analysis needs tightening in revision. I would bring it to a reading group to walk through the filtering details and might cite it once the bounds are confirmed.

Referee Report

2 major / 2 minor

Summary. The paper claims that the weaker L1 sandwiching notion suffices for efficient PQ learning under distribution shift, in contrast to prior L2 sandwiching results. As a consequence, it presents Iterative Chow Filtering—a procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution—and obtains the first quasipolynomial-time PQ algorithm for DNFs under the uniform distribution, essentially matching known PAC guarantees, together with exponential improvements for constant-depth circuits and constant-degree polynomial threshold functions.

Significance. If the central claims hold, the work is significant: it relaxes a strong approximation requirement to a weaker one that is known to be achievable with better parameters for several natural classes, directly yielding improved PQ learning bounds that were previously out of reach. The introduction of Iterative Chow Filtering as a general-purpose tool for handling distribution shift via Chow parameters is a reusable technical contribution.

major comments (2)

[§4] §4 (Iterative Chow Filtering procedure): the argument that repeated point removal changes the effective test measure by at most a negligible additive factor in L1 sandwiching error is not shown explicitly for the quasipolynomial regime. For DNFs the base L1 sandwiching error is already only quasipolynomial; even a small multiplicative blow-up in the error or sample complexity after filtering would push the overall runtime outside quasipolynomial, and no explicit lemma bounding the total error increase across iterations is cited.
[Theorem 1.1] Theorem 1.1 (main PQ learning result for DNFs): the reduction from L1 sandwiching to PQ learning is stated to follow from the filtering procedure, yet the proof sketch does not address how the low-degree Chow parameters remain reliable estimators after the distribution has been altered by filtering; a concrete bound relating the degree, the number of iterations, and the final L1 error is needed to confirm the claimed quasipolynomial runtime.

minor comments (2)

[Introduction] The abstract and introduction use “essentially match the guarantees known for ordinary PAC learning” without a side-by-side comparison table of sample and time complexities; adding such a table would clarify the improvement.
[Preliminaries] Notation for the L1 sandwiching error (Definition 2.3) is introduced after the main theorem statement; moving the definition earlier would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting these points about explicitness in the quasipolynomial regime. We agree that additional detail will strengthen the presentation and will revise the paper accordingly.

read point-by-point responses

Referee: [§4] §4 (Iterative Chow Filtering procedure): the argument that repeated point removal changes the effective test measure by at most a negligible additive factor in L1 sandwiching error is not shown explicitly for the quasipolynomial regime. For DNFs the base L1 sandwiching error is already only quasipolynomial; even a small multiplicative blow-up in the error or sample complexity after filtering would push the overall runtime outside quasipolynomial, and no explicit lemma bounding the total error increase across iterations is cited.

Authors: We agree that the current write-up of Section 4 presents the error accumulation argument at a high level and does not include an explicit lemma tailored to the quasipolynomial parameter setting. In the revised manuscript we will add a new lemma (Lemma 4.5) that bounds the total additive L1 error increase over all iterations. The lemma shows that when each filtering step contributes an additive δ = ε / polylog(n) and the number of iterations is at most quasipolynomial, the overall L1 sandwiching error remains quasipolynomial; consequently the sample complexity and runtime stay inside the claimed bound with no multiplicative blow-up. revision: yes
Referee: [Theorem 1.1] Theorem 1.1 (main PQ learning result for DNFs): the reduction from L1 sandwiching to PQ learning is stated to follow from the filtering procedure, yet the proof sketch does not address how the low-degree Chow parameters remain reliable estimators after the distribution has been altered by filtering; a concrete bound relating the degree, the number of iterations, and the final L1 error is needed to confirm the claimed quasipolynomial runtime.

Authors: The referee is correct that the proof sketch of Theorem 1.1 is brief and does not explicitly track the effect of filtering on the reliability of the low-degree Chow estimators. While the full argument in Section 4 already shows that filtering removes only points whose Chow parameters deviate by more than the target error, we will expand the proof sketch to include a concrete bound: after k iterations the estimation error on degree-d Chow parameters is at most the original L1 sandwiching error plus an additive term O(k · ε / d), which remains quasipolynomial when d and k are quasipolynomial. This will make the reduction from L1 sandwiching to PQ learning fully explicit for the DNF case. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core contribution is showing that L1 sandwiching suffices for efficient PQ learning (improving on prior L2-based results from Goel et al.), with Iterative Chow Filtering presented as a new procedure that leverages standard low-degree Chow parameters to filter test points. No step reduces a claimed prediction or theorem to a fitted quantity, self-defined input, or load-bearing self-citation chain by construction. The abstract and described technical ingredient treat Chow parameters as an external primitive from learning theory, and the quasipolynomial bounds for DNFs follow from the new L1 sufficiency argument without tautological equivalence to the inputs. This is the common case of an independent algorithmic result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that L1 sandwiching approximations exist for the target classes and that low-degree Chow parameters suffice to detect distribution mismatch. No explicit free parameters or invented entities are named.

axioms (2)

domain assumption L1 sandwiching approximations exist and suffice to control the PQ error for the function classes considered
The abstract states that the weaker L1 notion replaces the stronger L2 requirement used in prior work.
domain assumption The data is drawn from the uniform distribution
The quasipolynomial bound for DNFs is stated to hold under the uniform distribution.

invented entities (1)

Iterative Chow Filtering procedure no independent evidence
purpose: Identify and remove test points incompatible with the training distribution using low-degree Chow parameters
New algorithmic component introduced to achieve the L1-based PQ learning result.

pith-pipeline@v0.9.0 · 5727 in / 1496 out tokens · 68246 ms · 2026-05-19T23:15:55.177753+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we show that the weaker notion of L1 sandwiching suffices for efficient PQ learning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

[VW09] Aad Vaart and Jon Wellner

Schloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik. [VW09] Aad Vaart and Jon Wellner. A note on bounds for vc dimensions.Institute of Mathematical Statistics collections, 5:103–107, 01 2009. [Win71] R. O. Winder. Chow parameters in threshold logic.Journal of the ACM, 18:265–289, 1971. 16 A Extended Preliminaries A.1 Notation Let N ={0, 1,...}denote the s...

work page 2009
[2]

Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

Let any p1,...,pk∈Pd,ℓsuch that maxi∈[k]Ex∼D[pi(x)2]≤β. Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

work page
[3]

Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2

With probability at least1−δoverSandS ′, it holds Pr x∼D [s(x) = 0]≤min {1 +ϵ R ,dTV(D,D′) +ϵ R−1 } . Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2. First, we show that Algorithm 2 is efficient. For any f∈F, the setP(f) can be described by O(|S|...

work page
[4]

examples fromD train

It draws a setS train of poly ( (d+ 1)ℓ,1/ϵ,1/η,log(1/δ) ) i.i.d. examples fromD train

work page
[6]

examples from D andD′ respectively

It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,1/η,log(1/δ)ℓ) i.i.d. examples from D andD′ respectively

work page
[7]

It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R′= 1/η+ϵ/96,β= 4(2A)2ℓ,ϵ′=ϵη/96 to get a selectors:X→{0,1}

work page
[8]

First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2

It returns ( ˆf,s). First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2. Now, let any f⋆∈Csuch that errtrain(f⋆) =λtrain and errtest(f⋆) =λtest and let pup,p down be ϵ′′-approximateL1 sandwiching polynomials for f⋆underD of degree at most ℓ, where ϵ...

work page
[9]

examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d

It draws a set Strain of poly ( (d+ 1)ℓ,R/ϵ,log(1/δ) ) i.i.d. examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d. examples fromD ′

work page
[10]

It runs the degree-ℓL1 polynomial regression algorithm of Theorem E.1 with input Strain to get a classifier ˆf:R d→{0,1}

work page
[11]

examples from D and D′respectively

It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,R(R−1)−1,log (1/δ)ℓ) i.i.d. examples from D and D′respectively

work page
[12]

It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R,β= 4(2A)2ℓ,ϵ′= (R−1)ϵ/(128R2) to get a selectors:X→{0,1}

work page
[13]

Proof of Soundness.We assume that the algorithm accepts

If Prx∼U(Xtest)[s(x) = 0] ≥R(R−1)−1θ+ϵ/4 it returns REJECT, otherwise it returns (ACCEPT, ˆf). Proof of Soundness.We assume that the algorithm accepts. So, from Hoeffding’s inequality, since|Xtest|≳log(1/δ)/ϵ2, with probability at least 1−δ/3, it holds Pr x∼D′ [s(x) = 0]≤Pr x∼U(Xtest) [s(x) = 0] +ϵ/4≤Rθ R−1+ϵ/2. Moreover, since|Strain|is sufficiently larg...

work page

[1] [1]

[VW09] Aad Vaart and Jon Wellner

Schloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik. [VW09] Aad Vaart and Jon Wellner. A note on bounds for vc dimensions.Institute of Mathematical Statistics collections, 5:103–107, 01 2009. [Win71] R. O. Winder. Chow parameters in threshold logic.Journal of the ACM, 18:265–289, 1971. 16 A Extended Preliminaries A.1 Notation Let N ={0, 1,...}denote the s...

work page 2009

[2] [2]

Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

Let any p1,...,pk∈Pd,ℓsuch that maxi∈[k]Ex∼D[pi(x)2]≤β. Then, with probability at least 1−δoverSandS′, we have that for anyi∈[k], it holds E x∼D′ [fi(x)pi(x)s(x)]≤(R+ϵ)(1 +ϵ)E x∼D [fi(x)|pi(x)|] + 2ϵ

work page

[3] [3]

Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2

With probability at least1−δoverSandS ′, it holds Pr x∼D [s(x) = 0]≤min {1 +ϵ R ,dTV(D,D′) +ϵ R−1 } . Proof.Let s,B,∆,t,S 0,...,St,µ⋆ 1,...,µ⋆ t+1,f ⋆ 1,...,f ⋆ t+1,p⋆ 1,...,p⋆ t+1,τ⋆ 1,...,τ⋆ t,P(f 1),...,P(fk) be defined as in the description of Algorithm 2. First, we show that Algorithm 2 is efficient. For any f∈F, the setP(f) can be described by O(|S|...

work page

[4] [4]

examples fromD train

It draws a setS train of poly ( (d+ 1)ℓ,1/ϵ,1/η,log(1/δ) ) i.i.d. examples fromD train

work page

[5] [6]

examples from D andD′ respectively

It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,1/η,log(1/δ)ℓ) i.i.d. examples from D andD′ respectively

work page

[6] [7]

It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R′= 1/η+ϵ/96,β= 4(2A)2ℓ,ϵ′=ϵη/96 to get a selectors:X→{0,1}

work page

[7] [8]

First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2

It returns ( ˆf,s). First, since|Strain|is sufficiently large, by Theorem E.1, we can ensure that with probability at least 1−δ/2, it holdsPr(x,y)∼Dtrain[ ˆf(x)̸=y]≤opttrain +ϵη/2. Now, let any f⋆∈Csuch that errtrain(f⋆) =λtrain and errtest(f⋆) =λtest and let pup,p down be ϵ′′-approximateL1 sandwiching polynomials for f⋆underD of degree at most ℓ, where ϵ...

work page

[8] [9]

examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d

It draws a set Strain of poly ( (d+ 1)ℓ,R/ϵ,log(1/δ) ) i.i.d. examples from Dtrain and a set Xtest ofO(log(1/δ)/ϵ2) i.i.d. examples fromD ′

work page

[9] [10]

It runs the degree-ℓL1 polynomial regression algorithm of Theorem E.1 with input Strain to get a classifier ˆf:R d→{0,1}

work page

[10] [11]

examples from D and D′respectively

It draws sets S,S′of poly((d + 1)ℓ,Aℓ, 1/ϵ,R(R−1)−1,log (1/δ)ℓ) i.i.d. examples from D and D′respectively

work page

[11] [12]

It runs Algorithm 1 with input ˆf,S,S ′,ℓ,R,β= 4(2A)2ℓ,ϵ′= (R−1)ϵ/(128R2) to get a selectors:X→{0,1}

work page

[12] [13]

Proof of Soundness.We assume that the algorithm accepts

If Prx∼U(Xtest)[s(x) = 0] ≥R(R−1)−1θ+ϵ/4 it returns REJECT, otherwise it returns (ACCEPT, ˆf). Proof of Soundness.We assume that the algorithm accepts. So, from Hoeffding’s inequality, since|Xtest|≳log(1/δ)/ϵ2, with probability at least 1−δ/3, it holds Pr x∼D′ [s(x) = 0]≤Pr x∼U(Xtest) [s(x) = 0] +ϵ/4≤Rθ R−1+ϵ/2. Moreover, since|Strain|is sufficiently larg...

work page