Robust Learning with Optimal Error
Pith reviewed 2026-05-13 20:13 UTC · model grok-4.3
The pith
Randomized hypotheses achieve optimal error rates under adversarial noise that improve on deterministic hypotheses by up to a factor of two.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Randomized hypotheses achieve an optimal error of ½ ⋅ η/(1-η) for η-rate malicious noise, 3/2 ⋅ η for distribution-independent η-rate nasty noise, η for fixed-distribution η-rate nasty noise, and η for η-rate agnostic and nasty classification noise. Each bound improves on the corresponding 2η error attainable by any deterministic hypothesis, and the algorithms attain these rates with sample size linear in the VC-dimension of the concept class.
What carries the argument
Randomized hypotheses, which return a label drawn from a distribution rather than a fixed label, paired with access to an empirical risk minimization oracle.
If this is right
- Sample complexity scales linearly with VC-dimension and polynomially with inverse excess error.
- All but one of the learners run in polynomial time given an empirical risk minimization oracle.
- The malicious-noise result improves the previous best factor of 6/7 shown by Cesa-Bianchi et al.
- The nasty-noise results close the gap left open by Bshouty et al. and later works.
Where Pith is reading between the lines
- In settings where outputting a distribution over labels is acceptable, practitioners could obtain noticeably lower error on noisy data without collecting more samples.
- The same randomization technique may extend to other noise models or to hypothesis classes where VC-dimension is replaced by other complexity measures.
- The linear sample-complexity dependence suggests that similar optimal-error results could be obtained for parametric families used in modern machine learning.
Load-bearing premise
The concept class has finite VC-dimension and an empirical risk minimization oracle is available.
What would settle it
A concrete concept class of finite VC-dimension together with an η-rate malicious noise process for which every randomized hypothesis incurs error strictly larger than ½ ⋅ η/(1-η).
Figures
read the original abstract
We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses. - For $\eta$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot \eta/(1-\eta)$, improving on the optimal error of deterministic hypotheses by a factor of $1/2$. This answers an open question of Cesa-Bianchi et al. (JACM 1999) who showed randomness can improve error by a factor of $6/7$. - For $\eta$-rate nasty noise, we show the optimal error is $\frac{3}{2} \cdot \eta$ for distribution-independent learners and $\eta$ for fixed-distribution learners, both improving upon the optimal $2 \eta$ error of deterministic hypotheses. This closes a gap first noted by Bshouty et al. (Theoretical Computer Science 2002) when they introduced nasty noise and reiterated in the recent works of Klivans et al. (NeurIPS 2025) and Blanc et al. (SODA 2026). - For $\eta$-rate agnostic noise and the closely related nasty classification noise model, we show the optimal error is $\eta$, improving upon the optimal $2\eta$ error of deterministic hypotheses. All of our learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error. All except for the fixed-distribution nasty noise learner are time efficient given access to an oracle for empirical risk minimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs algorithms for robust learning under adversarial noise models that achieve optimal error rates by using randomized hypotheses. For η-rate malicious noise the optimal error is ½⋅η/(1-η), improving on deterministic hypotheses by a factor of 1/2 and resolving an open question of Cesa-Bianchi et al. (JACM 1999). For η-rate nasty noise the optimal error is 3/2⋅η for distribution-independent learners and η for fixed-distribution learners, both improving on the 2η bound for deterministic hypotheses and closing a gap noted by Bshouty et al. (TCS 2002). For η-rate agnostic noise the optimal error is η, again improving on the deterministic 2η bound. All learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error; all but the fixed-distribution nasty-noise learner are time-efficient given an ERM oracle.
Significance. If the stated upper and lower bounds hold, the work substantially advances robust learning theory by showing that randomization yields strictly better optimal error rates across malicious, nasty, and agnostic settings, with explicit constants that match both directions. The linear dependence on VC-dimension together with the ERM-oracle efficiency results provide concrete algorithmic improvements over prior deterministic analyses, directly answering long-standing open questions and supplying matching bounds that were previously unavailable.
minor comments (2)
- [Abstract] The abstract states that all learners except the fixed-distribution nasty-noise learner are time-efficient given an ERM oracle; a brief remark in §4 or §5 clarifying why the fixed-distribution case requires super-polynomial time would help readers understand the distinction.
- The sample-complexity bound is described as linear in VC-dimension and polynomial in 1/ε; an explicit statement of the dependence on η (or confirmation that it is absorbed into the polynomial) would make the complexity claim fully precise.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. We are pleased that the contributions on randomized hypotheses achieving optimal error rates across malicious, nasty, and agnostic noise models have been recognized as resolving long-standing open questions.
Circularity Check
No significant circularity; minor self-citation not load-bearing
full rationale
The paper's central claims derive optimal error rates (½⋅η/(1-η) for malicious noise, 3/2⋅η and η for nasty noise variants, η for agnostic) via explicit algorithmic constructions for randomized hypotheses and matching information-theoretic lower bounds. These rely on standard finite VC-dimension and ERM-oracle assumptions that are declared upfront and do not reduce to fitted parameters or self-referential definitions. The citation to Blanc et al. (SODA 2026) merely notes a previously identified gap and is not invoked as a uniqueness theorem or load-bearing premise; the current work supplies the new constructions that close it. No self-definitional, fitted-input, or ansatz-smuggling steps appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Concept class has finite VC-dimension
- domain assumption Existence of empirical risk minimization oracle
Reference graph
Works this paper leans on
-
[1]
Rademacher and gaussian complexities: Risk bounds and structural results
1 40 [BM02] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of machine learning research , 3(Nov):463–482,
-
[2]
Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs
4 [BS12] Aharon Birnbaum and Shai Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems , pages 935–943,
-
[3]
1 [Bsh98] Nader H. Bshouty. A new composition theorem for learning algorithms. In Proc. 30th Annual ACM Symposium on Theory of Computing (STOC) , pages 583–589, 1998. 1 [BV25] Guy Blanc and Gregory Valiant. Adaptive and oblivious statistical adversaries are equivalent. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 2031–2042...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.