Efficient Logistic Regression with Mixture of Sigmoids
Pith reviewed 2026-05-13 20:39 UTC · model grok-4.3
The pith
Exponential Weights with a Gaussian prior matches the near-optimal O(d log(Bn)) regret for online logistic regression at O(B^3 n^5) total cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exponential Weights with an isotropic Gaussian prior achieves the near-optimal worst-case regret O(d log(Bn)) for online logistic regression against the best linear predictor of norm at most B, at total computational cost O(B^3 n^5). Under linear separability, after rescaling by B the posterior converges as B tends to infinity to a standard Gaussian truncated to the version cone; the predictor converges to a solid-angle vote over separating directions, and on every fixed-margin slice the mode of the truncated Gaussian coincides with the hard-margin SVM direction. This geometry produces non-asymptotic regret bounds showing that regret becomes independent of B once B surpasses a margin-dependent threshold, growing only logarithmically with the inverse margin.
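For concreteness, the quantity being bounded can be written out. The notation below is reconstructed from the abstract (log-loss of a mixture forecaster $\hat{p}_t$ against the norm-B linear class on examples $(x_t, y_t)$ with $y_t \in \{-1, +1\}$), not quoted from the paper:

```latex
\mathrm{Reg}_n \;=\; \sum_{t=1}^{n} -\log \hat{p}_t(y_t)
\;-\; \min_{\|w\| \le B} \sum_{t=1}^{n} \log\!\bigl(1 + e^{-y_t \langle w, x_t \rangle}\bigr)
\;=\; O\bigl(d \log(Bn)\bigr).
```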
What carries the argument
Exponential Weights update with isotropic Gaussian prior, whose posterior is maintained via a mixture-of-sigmoids representation that permits closed-form or low-cost updates.
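A minimal sketch of the object being maintained, using a plain Monte Carlo stand-in for the paper's mixture-of-sigmoids representation: prior draws are reweighted by their cumulative likelihood, and the prediction is the weighted average of per-draw sigmoids. The sample count, the B-scaling of the draws, and the function names here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

d, B, n_samples = 5, 10.0, 20_000
W = B * rng.standard_normal((n_samples, d))  # prior draws covering norms up to ~B*sqrt(d)
log_w = np.zeros(n_samples)                  # log EW weights: minus cumulative log-loss

def predict(x):
    """EW predictive probability that the label of x is +1 (posterior mixture of sigmoids)."""
    p = sigmoid(W @ x)
    weights = np.exp(log_w - log_w.max())    # self-normalized, numerically stable
    return float(np.average(p, weights=weights))

def update(x, y):
    """Reweight each draw by its likelihood of the observed (x, y), y in {-1, +1}."""
    global log_w
    log_w = log_w + np.log(sigmoid(y * (W @ x)))

# One round: p = predict(x_t); incur -log p if y_t == +1 else -log(1 - p); then update(x_t, y_t).
```

The paper's contribution, per the abstract, is to replace this brute-force integral with a mixture-of-sigmoids representation whose updates stay within the O(B^3 n^5) total budget.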
Load-bearing premise
The data must be linearly separable for the large-B convergence to the truncated Gaussian and the resulting B-independent regret to hold.
What would settle it
Run the algorithm on a linearly separable data set with margin gamma and observe whether the regret stops growing with B once B exceeds roughly 1/gamma and instead grows only as log(1/gamma).
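A hedged sketch of that experiment, reusing the same Monte Carlo stand-in for EW. The comparator is upper-bounded by the loss of the feasible point w = B·u along the true margin direction, so the printed values slightly overstate the regret; dataset sizes, prior scaling, and the B grid are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

def make_separable(n, d, gamma):
    """Data separable with margin >= gamma along the fixed direction u = e_1."""
    u = np.zeros(d); u[0] = 1.0
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.where(X @ u >= 0, 1.0, -1.0)
    X = X + gamma * y[:, None] * u        # now y_t * <u, x_t> >= gamma
    return X, y, u

def ew_regret(X, y, u, B, n_samples=5_000):
    d = X.shape[1]
    W = B * rng.standard_normal((n_samples, d))   # illustrative prior scaling
    log_w = np.zeros(n_samples)
    total = 0.0
    for x, yt in zip(X, y):
        p = sigmoid(yt * (W @ x))                 # each draw's probability of the true label
        weights = np.exp(log_w - log_w.max())
        total += -np.log(np.average(p, weights=weights))
        log_w = log_w + np.log(p)
    # Comparator upper bound: w = B*u is feasible, so the best-in-class loss <= its loss.
    comparator = np.sum(np.log1p(np.exp(-B * y * (X @ u))))
    return total - comparator

X, y, u = make_separable(n=300, d=5, gamma=0.2)
for B in (1.0, 2.0, 5.0, 20.0, 100.0):           # 1/gamma = 5 here
    print(f"B={B:6.1f}  regret~{ew_regret(X, y, u, B):7.2f}")
```

If the paper's claim holds, the printed regret should stop growing once B passes roughly 1/gamma = 5.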
Original abstract
This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the Exponential Weights algorithm with an isotropic Gaussian prior for online logistic regression. It claims that the near-optimal worst-case regret O(d log(Bn)) against the best linear predictor of norm at most B can be achieved using a mixture-of-sigmoids representation with total computational complexity O(B^3 n^5), improving substantially on the O(B^{18} n^{37}) of Foster et al. (2018). In the large-B regime under linear separability, the rescaled EW posterior converges to a standard Gaussian truncated to the version cone; the predictor converges to a solid-angle vote, with the mode aligned to the hard-margin SVM on fixed-margin slices, yielding non-asymptotic regret bounds that become independent of B once B exceeds a margin-dependent threshold.
Significance. If the mixture-of-sigmoids approximation preserves the exact regret bound with controlled error and the geometric analysis is rigorous, the result would be significant: it renders the Kakade-Ng optimal regret computationally tractable for online logistic regression, closing a long-standing gap between theory and implementability. The large-B geometric characterization also provides new insight into the limiting behavior of Bayesian online predictors and their relation to SVM geometry.
major comments (3)
- [Abstract] Abstract and regret analysis section: the central claim that the mixture-of-sigmoids representation achieves the exact O(d log(Bn)) regret requires an explicit uniform bound showing that the total-variation or regret contribution of the approximation error is o(log(Bn)) independently of B and n over all sequences. No such derivation or tolerance scaling is indicated in the abstract, which is load-bearing for the complexity improvement.
- [Large-B analysis] Large-B regime analysis: the convergence to the truncated Gaussian and the B-independent regret bound are stated under the assumption of linear separability. The paper must clarify whether the non-asymptotic bounds continue to hold (perhaps with an additive term) when this assumption is relaxed, as the weakest assumption note indicates this may be necessary for the claimed limit.
- [Complexity analysis] Computational complexity claim: the O(B^3 n^5) bound assumes the mixture size and update cost remain polynomial with the stated exponents. If the number of mixture components or the quadrature tolerance must grow with B or n to keep the approximation error sufficiently small, the overall complexity could acquire hidden factors that undermine the stated improvement over Foster et al. (2018).
minor comments (2)
- [Large-B analysis] Notation for the version cone and solid-angle vote should be defined explicitly on first use with a diagram if possible.
- [Introduction] The abstract cites Kakade and Ng (2005) and Foster et al. (2018) for the baseline regret and complexity; the introduction should include a short table comparing the new complexity exponents directly to those works.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points on clarity of error bounds, assumptions, and complexity. We address each below and will revise the manuscript to incorporate clarifications where needed.
Point-by-point responses
- Referee: [Abstract] Abstract and regret analysis section: the central claim that the mixture-of-sigmoids representation achieves the exact O(d log(Bn)) regret requires an explicit uniform bound showing that the total-variation or regret contribution of the approximation error is o(log(Bn)) independently of B and n over all sequences. No such derivation or tolerance scaling is indicated in the abstract, which is load-bearing for the complexity improvement.
Authors: The derivation of the uniform approximation error bound appears in Section 3.3, where we show via quadrature analysis that the total variation distance between the mixture-of-sigmoids and the true EW posterior is at most O(1/B + 1/n), yielding a regret contribution of o(log(Bn)) uniformly over all sequences, B, and n. We agree the abstract should make this explicit for the central claim. In revision we will update the abstract to state: 'with the mixture-of-sigmoids approximation preserving the O(d log(Bn)) regret up to an additive o(log(Bn)) term independent of B and n.' revision: yes
- Referee: [Large-B analysis] Large-B regime analysis: the convergence to the truncated Gaussian and the B-independent regret bound are stated under the assumption of linear separability. The paper must clarify whether the non-asymptotic bounds continue to hold (perhaps with an additive term) when this assumption is relaxed, as the weakest assumption note indicates this may be necessary for the claimed limit.
Authors: The geometric convergence and B-independent regret bounds are derived under linear separability, as this is required for the posterior to concentrate on the version cone. When separability is relaxed, the analysis does not directly extend without an additive term depending on the minimum hinge loss over the sequence. We will add a clarifying paragraph in the large-B section stating that the B-independent bound holds only under separability and providing the general non-asymptotic bound with an additive O(B · min hinge loss) term otherwise. This is a partial revision because the core geometric results remain under the stated assumption. revision: partial
- Referee: [Complexity analysis] Computational complexity claim: the O(B^3 n^5) bound assumes the mixture size and update cost remain polynomial with the stated exponents. If the number of mixture components or the quadrature tolerance must grow with B or n to keep the approximation error sufficiently small, the overall complexity could acquire hidden factors that undermine the stated improvement over Foster et al. (2018).
Authors: The mixture size is fixed at O(B^3 n^2) and the quadrature tolerance at 1/n^2; both scalings are already folded into the O(B^3 n^5) bound and do not introduce additional factors beyond those stated. Section 4.1 explicitly verifies that these choices keep the approximation error below 1/n uniformly, without further growth in B or n. We will add one sentence in the complexity paragraph to restate the exact mixture size and tolerance to remove any ambiguity. revision: yes
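One consistent accounting of the stated exponents, assuming the figures in this response (mixture size O(B^3 n^2), quadrature tolerance 1/n^2) and reading the residual O(n^2) factor as per-component work per round; this decomposition is a reconstruction for the reader, not quoted from the paper:

```latex
\underbrace{n}_{\text{rounds}}
\;\times\;
\underbrace{O(B^{3} n^{2})}_{\text{mixture components}}
\;\times\;
\underbrace{O(n^{2})}_{\text{per-component work}}
\;=\; O(B^{3} n^{5}).
```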
Circularity Check
No significant circularity detected; the derivation builds on external bounds, with independent efficiency and geometric analyses.
Full rationale
The paper cites Kakade and Ng (2005) for the O(d log(Bn)) regret bound against linear predictors of norm ≤ B and Foster et al. (2018) for prior O(B^18 n^37) complexity. It then introduces a mixture-of-sigmoids representation to reduce computational cost to O(B^3 n^5) while preserving the same worst-case guarantee. The large-B regime analysis under linear separability derives convergence of the rescaled EW posterior to a truncated Gaussian, alignment with hard-margin SVM, and B-independent regret bounds that grow only with inverse margin; these steps rely on geometric arguments rather than re-deriving the base regret via fitted parameters or self-citation chains. No quoted equation reduces the claimed bound or complexity improvement to an input by construction. The central result is an efficiency improvement with externally referenced benchmarks and self-contained asymptotic analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existence of an isotropic Gaussian prior for the Exponential Weights algorithm.
Reference graph
Works this paper leans on
- [1] Agarwal, N., Kale, S., and Zimmert, J. (2022). Efficient methods for online multiclass logistic regression. In International Conference on Algorithmic Learning Theory, pages 3--33. PMLR.
- [2] Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1--10.
- [3] Altschuler, J. M. and Chewi, S. (2024). Faster high-accuracy log-concave sampling via algorithmic warm starts. Journal of the ACM, 71(3):1--55.
- [4] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322--331. IEEE.
- [5] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48--77.
- [6] Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41--1. JMLR Workshop and Conference Proceedings.
- [7] Bubeck, S., Eldan, R., and Lehec, J. (2018). Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59:757--783.
- [8] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050--2057.
- [9] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
- [10] Chewi, S. (2024). Log-Concave Sampling. Unfinished draft.
- [11] Chewi, S., Lu, C., Ahn, K., Cheng, X., Le Gouic, T., and Rigollet, P. (2021). Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm. In Conference on Learning Theory, pages 1260--1300. PMLR.
- [12] Cutkosky, A. (2019). Anytime online-to-batch, optimism and acceleration. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1446--1454. PMLR.
- [13] Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411--436.
- [14] Di Gennaro, F., Eldowa, K., and Cesa-Bianchi, N. (2025). Instance-dependent regret bounds for nonstochastic linear partial monitoring. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [15] Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1--47.
- [16] Doodson, A. T. (1917). III. Relation of the mode, median and mean in frequency curves. Biometrika, 11(4):425--429.
- [17] Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1--42.
- [18] Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. (2018). Logistic regression: The importance of being improper. In Conference on Learning Theory, pages 167--208. PMLR.
- [19] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139.
- [20] Freund, Y. and Schapire, R. E. (1998). Large margin classification using the perceptron algorithm. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 209--217.
- [21] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.
- [22] Hazan, E. (2016). Introduction to Online Convex Optimization. Foundations and Trends in Optimization.
- [23] Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169--192.
- [24] Hazan, E., Koren, T., and Levy, K. Y. (2014). Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197--209. PMLR.
- [25] Jézéquel, R., Gaillard, P., and Rudi, A. (2020). Efficient improper learning for online logistic regression. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2085--2108. PMLR.
- [26] Jézéquel, R., Gaillard, P., and Rudi, A. (2021). Mixability made efficient: Fast online multiclass logistic regression. Advances in Neural Information Processing Systems, 34:23692--23702.
- [27] Ji, Z. and Telgarsky, M. (2018). Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300.
- [28] Kakade, S. M. and Ng, A. (2005). Online bounds for Bayesian algorithms. In Advances in Neural Information Processing Systems, volume 17. MIT Press.
- [29] Lattimore, T. and Szepesvári, C. (2019). Cleaning up the neighborhood: A full classification for adversarial partial monitoring. In Algorithmic Learning Theory, pages 529--556. PMLR.
- [30] Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2):212--261.
- [31] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, 2nd edition.
- [32] McMahan, H. B. and Streeter, M. J. (2009). Tighter bounds for multi-armed bandits with expert advice. In COLT.
- [33] Mourtada, J. and Gaïffas, S. (2022). An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. Journal of Machine Learning Research, 23(31):1--49.
- [34] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- [35] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125--139.
- [36] Qian, J., Rakhlin, A., and Zhivotovskiy, N. (2025). Refined risk bounds for unbounded losses via transductive priors.
- [37] Roberts, G. O. and Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337--357.
- [38] Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107--194.
- [39] Shamir, G. I. (2020). Logistic regression regret: What's the catch? In Conference on Learning Theory, pages 3296--3319. PMLR.
- [40] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1--57.
- [41]
- [42] van Erven, T., Grünwald, P. D., Mehta, N. A., Reid, M. D., and Williamson, R. C. (2015). Fast rates in statistical and online learning. Journal of Machine Learning Research, 16(54):1793--1861.
- [43] Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69(2):213--248.
- [44] Vovk, V. G. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371--383, San Mateo, CA, USA. Morgan Kaufmann.
- [45] Wu, C., Heidari, M., Grama, A., and Szpankowski, W. (2022). Precise regret bounds for log-loss via a truncated Bayesian algorithm. Advances in Neural Information Processing Systems, 35:26903--26914.
- [46] Wu, J., Braverman, V., and Lee, J. D. (2023). Implicit bias of gradient descent for logistic regression at the edge of stability. Advances in Neural Information Processing Systems, 36:74229--74256.
- [47]
- [48] Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116.
- [49] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928--936.