Efficient Logistic Regression with Mixture of Sigmoids
Pith reviewed 2026-05-13 20:39 UTC · model grok-4.3
The pith
Exponential Weights with a Gaussian prior matches the near-optimal O(d log(Bn)) regret for online logistic regression at O(B^3 n^5) total cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exponential Weights with an isotropic Gaussian prior achieves the near-optimal worst-case regret O(d log(Bn)) for online logistic regression against the best linear predictor of norm at most B, at total computational cost O(B^3 n^5). Under linear separability, after rescaling by B the posterior converges as B tends to infinity to a standard Gaussian truncated to the version cone; the predictor converges to a solid-angle vote over separating directions, and on every fixed-margin slice the mode of the truncated Gaussian coincides with the hard-margin SVM direction. This geometry produces non-asymptotic regret bounds showing that regret becomes independent of B once B surpasses a margin-dependent threshold, growing only logarithmically with the inverse margin.
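For concreteness, the quantity being bounded can be written out. The notation below is reconstructed from the abstract (log-loss of a mixture forecaster $\hat{p}_t$ against the norm-B linear class on examples $(x_t, y_t)$ with $y_t \in \{-1, +1\}$), not quoted from the paper:

```latex
\mathrm{Reg}_n \;=\; \sum_{t=1}^{n} -\log \hat{p}_t(y_t)
\;-\; \min_{\|w\| \le B} \sum_{t=1}^{n} \log\!\bigl(1 + e^{-y_t \langle w, x_t \rangle}\bigr)
\;=\; O\bigl(d \log(Bn)\bigr).
```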
What carries the argument
Exponential Weights update with isotropic Gaussian prior, whose posterior is maintained via a mixture-of-sigmoids representation that permits closed-form or low-cost updates.
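A minimal sketch of the object being maintained, using a plain Monte Carlo stand-in for the paper's mixture-of-sigmoids representation: prior draws are reweighted by their cumulative likelihood, and the prediction is the weighted average of per-draw sigmoids. The sample count, the B-scaling of the draws, and the function names here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

d, B, n_samples = 5, 10.0, 20_000
W = B * rng.standard_normal((n_samples, d))  # prior draws covering norms up to ~B*sqrt(d)
log_w = np.zeros(n_samples)                  # log EW weights: minus cumulative log-loss

def predict(x):
    """EW predictive probability that the label of x is +1 (posterior mixture of sigmoids)."""
    p = sigmoid(W @ x)
    weights = np.exp(log_w - log_w.max())    # self-normalized, numerically stable
    return float(np.average(p, weights=weights))

def update(x, y):
    """Reweight each draw by its likelihood of the observed (x, y), y in {-1, +1}."""
    global log_w
    log_w = log_w + np.log(sigmoid(y * (W @ x)))

# One round: p = predict(x_t); incur -log p if y_t == +1 else -log(1 - p); then update(x_t, y_t).
```

The paper's contribution, per the abstract, is to replace this brute-force integral with a mixture-of-sigmoids representation whose updates stay within the O(B^3 n^5) total budget.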
Load-bearing premise
The data must be linearly separable for the large-B convergence to the truncated Gaussian and the resulting B-independent regret to hold.
What would settle it
Run the algorithm on a linearly separable data set with margin gamma and observe whether the regret stops growing with B once B exceeds roughly 1/gamma and instead grows only as log(1/gamma).
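A hedged sketch of that experiment, reusing the same Monte Carlo stand-in for EW. The comparator is upper-bounded by the loss of the feasible point w = B·u along the true margin direction, so the printed values slightly overstate the regret; dataset sizes, prior scaling, and the B grid are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

def make_separable(n, d, gamma):
    """Data separable with margin >= gamma along the fixed direction u = e_1."""
    u = np.zeros(d); u[0] = 1.0
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.where(X @ u >= 0, 1.0, -1.0)
    X = X + gamma * y[:, None] * u        # now y_t * <u, x_t> >= gamma
    return X, y, u

def ew_regret(X, y, u, B, n_samples=5_000):
    d = X.shape[1]
    W = B * rng.standard_normal((n_samples, d))   # illustrative prior scaling
    log_w = np.zeros(n_samples)
    total = 0.0
    for x, yt in zip(X, y):
        p = sigmoid(yt * (W @ x))                 # each draw's probability of the true label
        weights = np.exp(log_w - log_w.max())
        total += -np.log(np.average(p, weights=weights))
        log_w = log_w + np.log(p)
    # Comparator upper bound: w = B*u is feasible, so the best-in-class loss <= its loss.
    comparator = np.sum(np.log1p(np.exp(-B * y * (X @ u))))
    return total - comparator

X, y, u = make_separable(n=300, d=5, gamma=0.2)
for B in (1.0, 2.0, 5.0, 20.0, 100.0):           # 1/gamma = 5 here
    print(f"B={B:6.1f}  regret~{ew_regret(X, y, u, B):7.2f}")
```

If the paper's claim holds, the printed regret should stop growing once B passes roughly 1/gamma = 5.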
Original abstract
This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the Exponential Weights algorithm with an isotropic Gaussian prior for online logistic regression. It claims that the near-optimal worst-case regret O(d log(Bn)) against the best linear predictor of norm at most B can be achieved using a mixture-of-sigmoids representation with total computational complexity O(B^3 n^5), improving substantially on the O(B^{18} n^{37}) of Foster et al. (2018). In the large-B regime under linear separability, the rescaled EW posterior converges to a standard Gaussian truncated to the version cone; the predictor converges to a solid-angle vote, with the mode aligned to the hard-margin SVM on fixed-margin slices, yielding non-asymptotic regret bounds that become independent of B once B exceeds a margin-dependent threshold.
Significance. If the mixture-of-sigmoids approximation preserves the exact regret bound with controlled error and the geometric analysis is rigorous, the result would be significant: it renders the Kakade-Ng optimal regret computationally tractable for online logistic regression, closing a long-standing gap between theory and implementability. The large-B geometric characterization also provides new insight into the limiting behavior of Bayesian online predictors and their relation to SVM geometry.
major comments (3)
- [Abstract] Abstract and regret analysis section: the central claim that the mixture-of-sigmoids representation achieves the exact O(d log(Bn)) regret requires an explicit uniform bound showing that the total-variation or regret contribution of the approximation error is o(log(Bn)) independently of B and n over all sequences. No such derivation or tolerance scaling is indicated in the abstract, which is load-bearing for the complexity improvement.
- [Large-B analysis] Large-B regime analysis: the convergence to the truncated Gaussian and the B-independent regret bound are stated under the assumption of linear separability. The paper must clarify whether the non-asymptotic bounds continue to hold (perhaps with an additive term) when this assumption is relaxed, as the weakest assumption note indicates this may be necessary for the claimed limit.
- [Complexity analysis] Computational complexity claim: the O(B^3 n^5) bound assumes the mixture size and update cost remain polynomial with the stated exponents. If the number of mixture components or the quadrature tolerance must grow with B or n to keep the approximation error sufficiently small, the overall complexity could acquire hidden factors that undermine the stated improvement over Foster et al. (2018).
minor comments (2)
- [Large-B analysis] Notation for the version cone and solid-angle vote should be defined explicitly on first use with a diagram if possible.
- [Introduction] The abstract cites Kakade and Ng (2005) and Foster et al. (2018) for the baseline regret and complexity; the introduction should include a short table comparing the new complexity exponents directly to those works.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points on clarity of error bounds, assumptions, and complexity. We address each below and will revise the manuscript to incorporate clarifications where needed.
Point-by-point responses
- Referee: [Abstract] Abstract and regret analysis section: the central claim that the mixture-of-sigmoids representation achieves the exact O(d log(Bn)) regret requires an explicit uniform bound showing that the total-variation or regret contribution of the approximation error is o(log(Bn)) independently of B and n over all sequences. No such derivation or tolerance scaling is indicated in the abstract, which is load-bearing for the complexity improvement.
Authors: The derivation of the uniform approximation error bound appears in Section 3.3, where we show via quadrature analysis that the total variation distance between the mixture-of-sigmoids and the true EW posterior is at most O(1/B + 1/n), yielding a regret contribution of o(log(Bn)) uniformly over all sequences, B, and n. We agree the abstract should make this explicit for the central claim. In revision we will update the abstract to state: 'with the mixture-of-sigmoids approximation preserving the O(d log(Bn)) regret up to an additive o(log(Bn)) term independent of B and n.' revision: yes
- Referee: [Large-B analysis] Large-B regime analysis: the convergence to the truncated Gaussian and the B-independent regret bound are stated under the assumption of linear separability. The paper must clarify whether the non-asymptotic bounds continue to hold (perhaps with an additive term) when this assumption is relaxed, as the weakest assumption note indicates this may be necessary for the claimed limit.
Authors: The geometric convergence and B-independent regret bounds are derived under linear separability, as this is required for the posterior to concentrate on the version cone. When separability is relaxed, the analysis does not directly extend without an additive term depending on the minimum hinge loss over the sequence. We will add a clarifying paragraph in the large-B section stating that the B-independent bound holds only under separability and providing the general non-asymptotic bound with an additive O(B · min hinge loss) term otherwise. This is a partial revision because the core geometric results remain under the stated assumption. revision: partial
- Referee: [Complexity analysis] Computational complexity claim: the O(B^3 n^5) bound assumes the mixture size and update cost remain polynomial with the stated exponents. If the number of mixture components or the quadrature tolerance must grow with B or n to keep the approximation error sufficiently small, the overall complexity could acquire hidden factors that undermine the stated improvement over Foster et al. (2018).
Authors: The mixture size is fixed at O(B^3 n^2) and the quadrature tolerance at 1/n^2; both scalings are already folded into the O(B^3 n^5) bound and do not introduce additional factors beyond those stated. Section 4.1 explicitly verifies that these choices keep the approximation error below 1/n uniformly, without further growth in B or n. We will add one sentence in the complexity paragraph to restate the exact mixture size and tolerance to remove any ambiguity. revision: yes
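One consistent accounting of the stated exponents, assuming the figures in this response (mixture size O(B^3 n^2), quadrature tolerance 1/n^2) and reading the residual O(n^2) factor as per-component work per round; this decomposition is a reconstruction for the reader, not quoted from the paper:

```latex
\underbrace{n}_{\text{rounds}}
\;\times\;
\underbrace{O(B^{3} n^{2})}_{\text{mixture components}}
\;\times\;
\underbrace{O(n^{2})}_{\text{per-component work}}
\;=\; O(B^{3} n^{5}).
```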
Circularity Check
No significant circularity detected; the derivation builds on external bounds, with independent efficiency and geometric analyses.
Full rationale
The paper cites Kakade and Ng (2005) for the O(d log(Bn)) regret bound against linear predictors of norm ≤ B and Foster et al. (2018) for prior O(B^18 n^37) complexity. It then introduces a mixture-of-sigmoids representation to reduce computational cost to O(B^3 n^5) while preserving the same worst-case guarantee. The large-B regime analysis under linear separability derives convergence of the rescaled EW posterior to a truncated Gaussian, alignment with hard-margin SVM, and B-independent regret bounds that grow only with inverse margin; these steps rely on geometric arguments rather than re-deriving the base regret via fitted parameters or self-citation chains. No quoted equation reduces the claimed bound or complexity improvement to an input by construction. The central result is an efficiency improvement with externally referenced benchmarks and self-contained asymptotic analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existence of an isotropic Gaussian prior for the Exponential Weights algorithm.
Reference graph
Works this paper leans on
- [1] Agarwal, N., Kale, S., and Zimmert, J. (2022). Efficient methods for online multiclass logistic regression. In International Conference on Algorithmic Learning Theory, pages 3--33. PMLR.
- [2] Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1--10.
- [3] Altschuler, J. M. and Chewi, S. (2024). Faster high-accuracy log-concave sampling via algorithmic warm starts. Journal of the ACM, 71(3):1--55.
- [4] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322--331. IEEE.
- [5] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48--77.
- [6] Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. (2012). Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41--1. JMLR Workshop and Conference Proceedings.
- [7] Bubeck, S., Eldan, R., and Lehec, J. (2018). Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59:757--783.
- [8] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050--2057.
- [9] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
- [10] Chewi, S. (2024). Log-Concave Sampling. Unfinished draft.
- [11] Chewi, S., Lu, C., Ahn, K., Cheng, X., Le Gouic, T., and Rigollet, P. (2021). Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm. In Conference on Learning Theory, pages 1260--1300. PMLR.
- [12] Cutkosky, A. (2019). Anytime online-to-batch, optimism and acceleration. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1446--1454. PMLR.
- [13] Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(3):411--436.
- [14] Di Gennaro, F., Eldowa, K., and Cesa-Bianchi, N. (2025). Instance-dependent regret bounds for nonstochastic linear partial monitoring. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [15] Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1--47.
- [16] Doodson, A. T. (1917). III. Relation of the mode, median and mean in frequency curves. Biometrika, 11(4):425--429.
- [17] Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1--42.
- [18] Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. (2018). Logistic regression: The importance of being improper. In Conference on Learning Theory, pages 167--208. PMLR.
- [19] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139.
- [20] Freund, Y. and Schapire, R. E. (1998). Large margin classification using the perceptron algorithm. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 209--217.
- [21] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.
- [22] Hazan, E. (2016). Introduction to Online Convex Optimization. Foundations and Trends in Optimization.
- [23] Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169--192.
- [24] Hazan, E., Koren, T., and Levy, K. Y. (2014). Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197--209. PMLR.
- [25] Jézéquel, R., Gaillard, P., and Rudi, A. (2020). Efficient improper learning for online logistic regression. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2085--2108. PMLR.
- [26] Jézéquel, R., Gaillard, P., and Rudi, A. (2021). Mixability made efficient: Fast online multiclass logistic regression. Advances in Neural Information Processing Systems, 34:23692--23702.
- [27] Ji, Z. and Telgarsky, M. (2018). Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300.
- [28] Kakade, S. M. and Ng, A. (2005). Online bounds for Bayesian algorithms. In Advances in Neural Information Processing Systems, volume 17. MIT Press.
- [29] Lattimore, T. and Szepesvári, C. (2019). Cleaning up the neighborhood: A full classification for adversarial partial monitoring. In Algorithmic Learning Theory, pages 529--556. PMLR.
- [30] Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2):212--261.
- [31] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, 2nd edition.
- [32] McMahan, H. B. and Streeter, M. J. (2009). Tighter bounds for multi-armed bandits with expert advice. In COLT.
- [33] Mourtada, J. and Gaïffas, S. (2022). An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. Journal of Machine Learning Research, 23(31):1--49.
- [34] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- [35] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125--139.
- [36] Qian, J., Rakhlin, A., and Zhivotovskiy, N. (2025). Refined risk bounds for unbounded losses via transductive priors.
- [37] Roberts, G. O. and Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337--357.
- [38] Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107--194.
- [39] Shamir, G. I. (2020). Logistic regression regret: What's the catch? In Conference on Learning Theory, pages 3296--3319. PMLR.
- [40] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1--57.
- [41]
- [42] van Erven, T., Grünwald, P. D., Mehta, N. A., Reid, M. D., and Williamson, R. C. (2015). Fast rates in statistical and online learning. Journal of Machine Learning Research, 16(54):1793--1861.
- [43] Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69(2):213--248.
- [44] Vovk, V. G. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371--383, San Mateo, CA, USA. Morgan Kaufmann.
- [45] Wu, C., Heidari, M., Grama, A., and Szpankowski, W. (2022). Precise regret bounds for log-loss via a truncated Bayesian algorithm. Advances in Neural Information Processing Systems, 35:26903--26914.
- [46] Wu, J., Braverman, V., and Lee, J. D. (2023). Implicit bias of gradient descent for logistic regression at the edge of stability. Advances in Neural Information Processing Systems, 36:74229--74256.
- [47]
- [48] Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116.
- [49] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928--936.