pith. sign in

arxiv: 2604.02555 · v1 · submitted 2026-04-02 · 💻 cs.DS · cs.LG

Robust Learning with Optimal Error

Pith reviewed 2026-05-13 20:13 UTC · model grok-4.3

classification 💻 cs.DS cs.LG
keywords robust learningadversarial noiserandomized hypothesesoptimal errormalicious noisenasty noiseVC-dimensionagnostic learning
0
0 comments X

The pith

Randomized hypotheses achieve optimal error rates under adversarial noise that improve on deterministic hypotheses by up to a factor of two.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs algorithms that output randomized hypotheses to reach the lowest possible error when labels are corrupted by different adversarial noise processes. For malicious noise at rate η the best error drops to half of η over one minus η, resolving an open question on how much randomness helps. For nasty noise the error reaches 3/2 η in the distribution-independent case and η when the distribution is fixed, while agnostic noise and nasty classification noise both reach error η. All learners use sample complexity linear in the VC-dimension and are time-efficient given an empirical risk minimization oracle.

Core claim

Randomized hypotheses achieve an optimal error of ½ ⋅ η/(1-η) for η-rate malicious noise, 3/2 ⋅ η for distribution-independent η-rate nasty noise, η for fixed-distribution η-rate nasty noise, and η for η-rate agnostic and nasty classification noise. Each bound improves on the corresponding 2η error attainable by any deterministic hypothesis, and the algorithms attain these rates with sample size linear in the VC-dimension of the concept class.

What carries the argument

Randomized hypotheses, which return a label drawn from a distribution rather than a fixed label, paired with access to an empirical risk minimization oracle.

If this is right

  • Sample complexity scales linearly with VC-dimension and polynomially with inverse excess error.
  • All but one of the learners run in polynomial time given an empirical risk minimization oracle.
  • The malicious-noise result improves the previous best factor of 6/7 shown by Cesa-Bianchi et al.
  • The nasty-noise results close the gap left open by Bshouty et al. and later works.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In settings where outputting a distribution over labels is acceptable, practitioners could obtain noticeably lower error on noisy data without collecting more samples.
  • The same randomization technique may extend to other noise models or to hypothesis classes where VC-dimension is replaced by other complexity measures.
  • The linear sample-complexity dependence suggests that similar optimal-error results could be obtained for parametric families used in modern machine learning.

Load-bearing premise

The concept class has finite VC-dimension and an empirical risk minimization oracle is available.

What would settle it

A concrete concept class of finite VC-dimension together with an η-rate malicious noise process for which every randomized hypothesis incurs error strictly larger than ½ ⋅ η/(1-η).

Figures

Figures reproduced from arXiv: 2604.02555 by Guy Blanc.

Figure 1
Figure 1. Figure 1: Our malicious noise learner. A priori, it is not even clear that there should exist a single choice of µ ∗ for which ℓ(c, g ◦ µ ∗ ) is small for all c ∈ C: The minimax lemma in the preceding step only establishes there exists, roughly speaking, a distribution over µ so that h := Eµ[g ◦µ] that has small loss. Yet, the learner in [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Our algorithm meeting the requirements of [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Algorithm 2 in [Jag13]. Remark 2 (Use of AI tools). We originally had a more ad-hoc analysis of the convergence of the algorithm in [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The shaded region depicts the constraints of [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of our construction used in the proof of [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The shaded region depicts the constraints of [PITH_FULL_IMAGE:figures/full_fig_p048_6.png] view at source ↗
read the original abstract

We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses. - For $\eta$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot \eta/(1-\eta)$, improving on the optimal error of deterministic hypotheses by a factor of $1/2$. This answers an open question of Cesa-Bianchi et al. (JACM 1999) who showed randomness can improve error by a factor of $6/7$. - For $\eta$-rate nasty noise, we show the optimal error is $\frac{3}{2} \cdot \eta$ for distribution-independent learners and $\eta$ for fixed-distribution learners, both improving upon the optimal $2 \eta$ error of deterministic hypotheses. This closes a gap first noted by Bshouty et al. (Theoretical Computer Science 2002) when they introduced nasty noise and reiterated in the recent works of Klivans et al. (NeurIPS 2025) and Blanc et al. (SODA 2026). - For $\eta$-rate agnostic noise and the closely related nasty classification noise model, we show the optimal error is $\eta$, improving upon the optimal $2\eta$ error of deterministic hypotheses. All of our learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error. All except for the fixed-distribution nasty noise learner are time efficient given access to an oracle for empirical risk minimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript constructs algorithms for robust learning under adversarial noise models that achieve optimal error rates by using randomized hypotheses. For η-rate malicious noise the optimal error is ½⋅η/(1-η), improving on deterministic hypotheses by a factor of 1/2 and resolving an open question of Cesa-Bianchi et al. (JACM 1999). For η-rate nasty noise the optimal error is 3/2⋅η for distribution-independent learners and η for fixed-distribution learners, both improving on the 2η bound for deterministic hypotheses and closing a gap noted by Bshouty et al. (TCS 2002). For η-rate agnostic noise the optimal error is η, again improving on the deterministic 2η bound. All learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error; all but the fixed-distribution nasty-noise learner are time-efficient given an ERM oracle.

Significance. If the stated upper and lower bounds hold, the work substantially advances robust learning theory by showing that randomization yields strictly better optimal error rates across malicious, nasty, and agnostic settings, with explicit constants that match both directions. The linear dependence on VC-dimension together with the ERM-oracle efficiency results provide concrete algorithmic improvements over prior deterministic analyses, directly answering long-standing open questions and supplying matching bounds that were previously unavailable.

minor comments (2)
  1. [Abstract] The abstract states that all learners except the fixed-distribution nasty-noise learner are time-efficient given an ERM oracle; a brief remark in §4 or §5 clarifying why the fixed-distribution case requires super-polynomial time would help readers understand the distinction.
  2. The sample-complexity bound is described as linear in VC-dimension and polynomial in 1/ε; an explicit statement of the dependence on η (or confirmation that it is absorbed into the polynomial) would make the complexity claim fully precise.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We are pleased that the contributions on randomized hypotheses achieving optimal error rates across malicious, nasty, and agnostic noise models have been recognized as resolving long-standing open questions.

Circularity Check

0 steps flagged

No significant circularity; minor self-citation not load-bearing

full rationale

The paper's central claims derive optimal error rates (½⋅η/(1-η) for malicious noise, 3/2⋅η and η for nasty noise variants, η for agnostic) via explicit algorithmic constructions for randomized hypotheses and matching information-theoretic lower bounds. These rely on standard finite VC-dimension and ERM-oracle assumptions that are declared upfront and do not reduce to fitted parameters or self-referential definitions. The citation to Blanc et al. (SODA 2026) merely notes a previously identified gap and is not invoked as a uniqueness theorem or load-bearing premise; the current work supplies the new constructions that close it. No self-definitional, fitted-input, or ansatz-smuggling steps appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard statistical learning assumptions; no free parameters or invented entities introduced.

axioms (2)
  • domain assumption Concept class has finite VC-dimension
    Invoked to obtain sample complexity linear in VC-dimension.
  • domain assumption Existence of empirical risk minimization oracle
    Required for time-efficiency of all but one learner.

pith-pipeline@v0.9.0 · 5588 in / 1236 out tokens · 22672 ms · 2026-05-13T20:13:04.649402+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

  1. [1]

    Rademacher and gaussian complexities: Risk bounds and structural results

    1 40 [BM02] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of machine learning research , 3(Nov):463–482,

  2. [2]

    Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs

    4 [BS12] Aharon Birnbaum and Shai Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems , pages 935–943,

  3. [3]

    errorD(ˆhs, c⋆) pmax ≥ errorD(h, c⋆)(1 + ∆) pmax # ≤ max δ, e−Ω(∆2µ), e−Ω(∆µ) , where µ := errorD(h,c⋆) pmax . We can instead write this as Pr s

    1 [Bsh98] Nader H. Bshouty. A new composition theorem for learning algorithms. In Proc. 30th Annual ACM Symposium on Theory of Computing (STOC) , pages 583–589, 1998. 1 [BV25] Guy Blanc and Gregory Valiant. Adaptive and oblivious statistical adversaries are equivalent. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 2031–2042...