Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks

Max A. Little; Xi He; Yi Miao

arxiv: 2505.05740 · v3 · submitted 2025-05-09 · 💻 cs.LG

Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks

Xi He , Yi Miao , Max A. Little This is my paper

Pith reviewed 2026-05-22 17:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords globally optimal trainingtwo-layer ReLU networks0-1 lossempirical risk minimizationactivation patternsmaxout networkscoreset selectionexact optimization

0 comments

The pith

The paper presents the first algorithm that is guaranteed to find the two-layer ReLU or maxout network minimizing the number of training misclassifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work develops an exact solver for the non-convex problem of training two-layer networks to minimize 0-1 loss. It proceeds by enumerating every possible pattern of hidden-unit activations across the training points, then solving a linear program for each pattern. The resulting procedure runs in O(N^{D K + 1}) time, works for any computable loss, and is shown to produce provably optimal networks on small data. A coreset reduction extends the method to larger sets while still outperforming gradient descent and linear models on the same architecture.

Core claim

By exhaustively enumerating all feasible activation patterns of the K hidden units on N points, the algorithm solves the empirical risk minimization problem for two-layer maxout and ReLU networks to global optimality, returning the network that misclassifies the fewest training examples.

What carries the argument

Exhaustive enumeration of all possible activation patterns of the K hidden neurons on the N training points, each pattern inducing a linear classification problem solved to optimality.

If this is right

The algorithm returns a provably optimal network for any two-layer ReLU or maxout architecture on small data.
The same procedure applies unchanged to any computable loss function.
A coreset selection step allows the method to scale while still producing networks with 20-30 percent fewer errors than gradient descent on the same model.
The approach supplies an exact baseline against which approximate training methods can be compared.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Exact enumeration may become practical for moderate K once better pruning or branch-and-bound techniques are added.
The same pattern-enumeration idea could be tested on other shallow architectures whose decision regions admit a finite combinatorial description.
Coreset quality directly limits how large a dataset can be handled before the optimality guarantee is lost.

Load-bearing premise

The enumeration of activation patterns is assumed to include every feasible network without omissions or duplicates.

What would settle it

A small dataset and network size where a hand-checked or brute-force search finds a network with strictly fewer misclassifications than the algorithm returns.

read the original abstract

This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30\% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models). The artifacts of the Deep-ICE algorithm can be found in https://github.com/XiHegrt/DeepICE-algorithm-artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a claimed first exact solver for 0-1 loss on two-layer ReLU and maxout nets by enumerating activation patterns, plus a coreset for scale, but the covering argument for completeness of the patterns is the main soft spot.

read the letter

The main point of this paper is a new algorithm that finds the globally optimal two-layer ReLU or maxout network for minimizing 0-1 loss on a dataset. It does so by enumerating possible activation patterns in O(N^{DK+1}) time and then picking the best output layer weights for each. This is new because prior work either used approximations or convex relaxations, and this one aims for exactness on the classification error directly. The coreset idea to reduce the data size for larger problems is a practical addition, and the reported 20-30% drop in misclassifications compared to gradient descent and SVMs on the same model size is a solid experimental result. The code release helps too, as it lets others check the small cases where exact solutions are feasible. The potential issue is whether the enumeration really captures every possible activation pattern without missing some or counting duplicates. The approach of picking D+1 points to define each hidden unit's hyperplane is described, but I would want to see a more explicit proof that all realizable patterns from any weight choice are hit exactly once. If that's solid in the full text, then the optimality holds; otherwise it might only be optimal over the enumerated subset. The coreset method also raises the question of how much it preserves the true minimum, though the experiments suggest it works well in practice. This kind of work is useful for anyone studying the limits of what small networks can do exactly or looking for benchmarks against which to test heuristics. It deserves a serious referee because the problem is well-posed and the results are falsifiable on small data. I would recommend sending it out for review rather than desk rejecting it.

Referee Report

1 major / 2 minor

Summary. The paper introduces Deep-ICE as the first globally optimal algorithm for minimizing 0-1 loss (empirical risk minimization via number of misclassifications) in two-layer ReLU and maxout networks. It achieves this via exhaustive enumeration of activation patterns with worst-case complexity O(N^{DK+1}), generalizes to arbitrary computable losses, and introduces a coreset method to scale to larger datasets, reporting 20-30% reductions in misclassifications versus gradient descent and SVM baselines on the same two-layer architectures. Code artifacts are provided on GitHub.

Significance. If the optimality guarantee holds, this supplies a rare exact solver for a non-convex, NP-hard problem on small instances and a practical coreset-based extension that demonstrably improves training and test error over standard methods. The provision of reproducible code artifacts is a clear strength that enables direct verification and follow-on work.

major comments (1)

[Algorithm description and complexity analysis] The central global-optimality claim rests on the assertion that enumerating, for each of the K hidden units, all combinations of D+1 data points to define candidate hyperplanes produces a set that contains every realizable activation pattern exactly once. No explicit covering argument, bijection, or proof is supplied showing that (a) every weight vector inducing a distinct pattern on the N points is captured by at least one such (D+1)-tuple and (b) duplicate patterns are not generated. This completeness step is load-bearing for both the O(N^{DK+1}) complexity bound and the claim that the minimum over the enumerated set equals the true global minimum 0-1 loss.

minor comments (2)

[Experiments] The experimental section reports 20-30% error reductions but provides no error bars, number of independent runs, or detailed controls for the gradient-descent baselines (e.g., initialization, learning-rate schedules, early stopping).
[Algorithm description] Notation for the output-layer optimization step after pattern enumeration is introduced without a clear equation reference; a short derivation or pseudocode block would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for recognizing the significance of providing an exact solver along with reproducible code. We respond to the single major comment below.

read point-by-point responses

Referee: The central global-optimality claim rests on the assertion that enumerating, for each of the K hidden units, all combinations of D+1 data points to define candidate hyperplanes produces a set that contains every realizable activation pattern exactly once. No explicit covering argument, bijection, or proof is supplied showing that (a) every weight vector inducing a distinct pattern on the N points is captured by at least one such (D+1)-tuple and (b) duplicate patterns are not generated. This completeness step is load-bearing for both the O(N^{DK+1}) complexity bound and the claim that the minimum over the enumerated set equals the true global minimum 0-1 loss.

Authors: We agree that an explicit completeness argument is required to rigorously support the global-optimality and complexity claims. The current manuscript describes the enumeration of (D+1)-tuples per hidden unit and asserts that the resulting set contains every realizable activation pattern, but does not supply a formal covering lemma or bijection. In the revised version we will insert a new lemma (with proof) in Section 3 establishing both (a) and (b). The argument relies on a standard general-position perturbation: any weight vector realizing a given sign pattern on the N points can be infinitesimally perturbed so that its hyperplane passes through exactly D+1 affinely independent points while leaving the pattern unchanged; each such minimal supporting tuple determines a unique pattern, and duplicate patterns generated by different tuples are removed by canonical hashing of the activation vector. The added proof does not alter the algorithm, its implementation, or the reported experimental results, but makes the load-bearing completeness step fully explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; optimality follows from explicit enumeration strategy without self-referential definitions or fitted predictions

full rationale

The paper claims global optimality for 0-1 loss minimization in two-layer ReLU/maxout networks via exhaustive enumeration of activation patterns across N points for K hidden units, yielding the stated O(N^{DK+1}) complexity bound directly from the enumeration size. No equations define a derived quantity in terms of itself, no parameters are fitted to data and then relabeled as predictions, and no self-citations are used to import uniqueness theorems or ansatzes. The derivation chain is self-contained as a combinatorial construction whose correctness (completeness and lack of duplicates) is a separate verification obligation rather than a reduction to the input claim by construction. Experiments on small datasets provide external validation independent of the enumeration logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the correctness of an exhaustive enumeration over activation patterns whose completeness is asserted but not derived in the abstract; no numerical parameters are fitted inside the algorithm itself.

axioms (1)

domain assumption All feasible two-layer networks are captured by enumerating the possible activation patterns of the K hidden units on the N data points.
This assumption underpins both the optimality guarantee and the O(N^{DK+1}) complexity bound stated in the abstract.

pith-pipeline@v0.9.0 · 5757 in / 1427 out tokens · 97821 ms · 2026-05-22T17:00:09.965030+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DeepICE(D, K) = min 0-1 (K) ◦ eval(K) ◦ cp(basgns(K)) ◦ nestedCombs(D, K)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3 forcing) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4. The DeepICE algorithm has a time complexity of O(N^{DK+1})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimal hypersurface decision trees
cs.LG 2025-09 unverdicted novelty 8.0

Proposes the first algorithm for optimal hypersurface decision trees with time complexity O(K! × N^{DG+G}) plus pruning and incremental checks that reduce practical cost.