pith. sign in

arxiv: 2505.05740 · v3 · submitted 2025-05-09 · 💻 cs.LG

Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks

Pith reviewed 2026-05-22 17:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords globally optimal trainingtwo-layer ReLU networks0-1 lossempirical risk minimizationactivation patternsmaxout networkscoreset selectionexact optimization
0
0 comments X

The pith

The paper presents the first algorithm that is guaranteed to find the two-layer ReLU or maxout network minimizing the number of training misclassifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work develops an exact solver for the non-convex problem of training two-layer networks to minimize 0-1 loss. It proceeds by enumerating every possible pattern of hidden-unit activations across the training points, then solving a linear program for each pattern. The resulting procedure runs in O(N^{D K + 1}) time, works for any computable loss, and is shown to produce provably optimal networks on small data. A coreset reduction extends the method to larger sets while still outperforming gradient descent and linear models on the same architecture.

Core claim

By exhaustively enumerating all feasible activation patterns of the K hidden units on N points, the algorithm solves the empirical risk minimization problem for two-layer maxout and ReLU networks to global optimality, returning the network that misclassifies the fewest training examples.

What carries the argument

Exhaustive enumeration of all possible activation patterns of the K hidden neurons on the N training points, each pattern inducing a linear classification problem solved to optimality.

If this is right

  • The algorithm returns a provably optimal network for any two-layer ReLU or maxout architecture on small data.
  • The same procedure applies unchanged to any computable loss function.
  • A coreset selection step allows the method to scale while still producing networks with 20-30 percent fewer errors than gradient descent on the same model.
  • The approach supplies an exact baseline against which approximate training methods can be compared.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Exact enumeration may become practical for moderate K once better pruning or branch-and-bound techniques are added.
  • The same pattern-enumeration idea could be tested on other shallow architectures whose decision regions admit a finite combinatorial description.
  • Coreset quality directly limits how large a dataset can be handled before the optimality guarantee is lost.

Load-bearing premise

The enumeration of activation patterns is assumed to include every feasible network without omissions or duplicates.

What would settle it

A small dataset and network size where a hand-checked or brute-force search finds a network with strictly fewer misclassifications than the algorithm returns.

read the original abstract

This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30\% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models). The artifacts of the Deep-ICE algorithm can be found in https://github.com/XiHegrt/DeepICE-algorithm-artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Deep-ICE as the first globally optimal algorithm for minimizing 0-1 loss (empirical risk minimization via number of misclassifications) in two-layer ReLU and maxout networks. It achieves this via exhaustive enumeration of activation patterns with worst-case complexity O(N^{DK+1}), generalizes to arbitrary computable losses, and introduces a coreset method to scale to larger datasets, reporting 20-30% reductions in misclassifications versus gradient descent and SVM baselines on the same two-layer architectures. Code artifacts are provided on GitHub.

Significance. If the optimality guarantee holds, this supplies a rare exact solver for a non-convex, NP-hard problem on small instances and a practical coreset-based extension that demonstrably improves training and test error over standard methods. The provision of reproducible code artifacts is a clear strength that enables direct verification and follow-on work.

major comments (1)
  1. [Algorithm description and complexity analysis] The central global-optimality claim rests on the assertion that enumerating, for each of the K hidden units, all combinations of D+1 data points to define candidate hyperplanes produces a set that contains every realizable activation pattern exactly once. No explicit covering argument, bijection, or proof is supplied showing that (a) every weight vector inducing a distinct pattern on the N points is captured by at least one such (D+1)-tuple and (b) duplicate patterns are not generated. This completeness step is load-bearing for both the O(N^{DK+1}) complexity bound and the claim that the minimum over the enumerated set equals the true global minimum 0-1 loss.
minor comments (2)
  1. [Experiments] The experimental section reports 20-30% error reductions but provides no error bars, number of independent runs, or detailed controls for the gradient-descent baselines (e.g., initialization, learning-rate schedules, early stopping).
  2. [Algorithm description] Notation for the output-layer optimization step after pattern enumeration is introduced without a clear equation reference; a short derivation or pseudocode block would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for recognizing the significance of providing an exact solver along with reproducible code. We respond to the single major comment below.

read point-by-point responses
  1. Referee: The central global-optimality claim rests on the assertion that enumerating, for each of the K hidden units, all combinations of D+1 data points to define candidate hyperplanes produces a set that contains every realizable activation pattern exactly once. No explicit covering argument, bijection, or proof is supplied showing that (a) every weight vector inducing a distinct pattern on the N points is captured by at least one such (D+1)-tuple and (b) duplicate patterns are not generated. This completeness step is load-bearing for both the O(N^{DK+1}) complexity bound and the claim that the minimum over the enumerated set equals the true global minimum 0-1 loss.

    Authors: We agree that an explicit completeness argument is required to rigorously support the global-optimality and complexity claims. The current manuscript describes the enumeration of (D+1)-tuples per hidden unit and asserts that the resulting set contains every realizable activation pattern, but does not supply a formal covering lemma or bijection. In the revised version we will insert a new lemma (with proof) in Section 3 establishing both (a) and (b). The argument relies on a standard general-position perturbation: any weight vector realizing a given sign pattern on the N points can be infinitesimally perturbed so that its hyperplane passes through exactly D+1 affinely independent points while leaving the pattern unchanged; each such minimal supporting tuple determines a unique pattern, and duplicate patterns generated by different tuples are removed by canonical hashing of the activation vector. The added proof does not alter the algorithm, its implementation, or the reported experimental results, but makes the load-bearing completeness step fully explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; optimality follows from explicit enumeration strategy without self-referential definitions or fitted predictions

full rationale

The paper claims global optimality for 0-1 loss minimization in two-layer ReLU/maxout networks via exhaustive enumeration of activation patterns across N points for K hidden units, yielding the stated O(N^{DK+1}) complexity bound directly from the enumeration size. No equations define a derived quantity in terms of itself, no parameters are fitted to data and then relabeled as predictions, and no self-citations are used to import uniqueness theorems or ansatzes. The derivation chain is self-contained as a combinatorial construction whose correctness (completeness and lack of duplicates) is a separate verification obligation rather than a reduction to the input claim by construction. Experiments on small datasets provide external validation independent of the enumeration logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the correctness of an exhaustive enumeration over activation patterns whose completeness is asserted but not derived in the abstract; no numerical parameters are fitted inside the algorithm itself.

axioms (1)
  • domain assumption All feasible two-layer networks are captured by enumerating the possible activation patterns of the K hidden units on the N data points.
    This assumption underpins both the optimality guarantee and the O(N^{DK+1}) complexity bound stated in the abstract.

pith-pipeline@v0.9.0 · 5757 in / 1427 out tokens · 97821 ms · 2026-05-22T17:00:09.965030+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimal hypersurface decision trees

    cs.LG 2025-09 unverdicted novelty 8.0

    Proposes the first algorithm for optimal hypersurface decision trees with time complexity O(K! × N^{DG+G}) plus pruning and incremental checks that reduce practical cost.