pith. sign in

arxiv: 2605.15571 · v1 · pith:IL4K7Q3Lnew · submitted 2026-05-15 · 📊 stat.ML · cs.LG

MaxSketch: Robust Distinct Counting in Streams via Random Projections

Pith reviewed 2026-05-19 20:04 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords distinct countingstreaming algorithmsrandom projectionsrobust estimationhigh-dimensional streamsMaxSketchgeometric structurelearned representations
0
0 comments X p. Extension
pith:IL4K7Q3L Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{IL4K7Q3L}

Prints a linked pith:IL4K7Q3L badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A max-linear sketch over random Gaussian projections recovers distinct counts in noisy streams using only logarithmic memory when observations share geometric structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that classical distinct-counting sketches break when repeated objects appear as high-dimensional noisy observations rather than identical repeats. Under the assumption that these observations lie in a geometry typical of learned representations, MaxSketch aggregates the maximum projection value across random Gaussian directions and shows that O~(log n / ε²) such directions suffice to approximate the number of latent objects within a (1 + ε) factor. This improves on the worst-case lower bound of Θ(√n) memory for general metric spaces. The result directly connects streaming algorithms to representation learning by turning geometric separation into a memory-efficient counting primitive.

Core claim

We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this assumption m = O~(log n / ε²) random projections (and hence O~(log n/ε²) memory) suffice to recover the true distinct count within a (1+ε) factor.

What carries the argument

MaxSketch, the max-linear sketch that records the highest projection value of each observed point onto m random Gaussian vectors and uses the count of distinct maxima to estimate latent objects.

If this is right

  • Streaming pipelines processing image or embedding streams can maintain distinct counts with memory scaling only logarithmically in the number of objects.
  • The same sketch continues to work on data generated outside the original training distribution, as confirmed by the image-stream experiments.
  • The memory bound removes the √n barrier that holds for arbitrary metric spaces once geometric structure is present.
  • Distinct-count queries become practical subroutines inside larger representation-learning systems that already produce the required embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection mechanism may apply to other streaming primitives such as heavy-hitter detection inside the same embedding space.
  • Quantifying the minimal margin or embedding dimension needed for the geometric separation would give an explicit sample-complexity trade-off.
  • Hybrid sketches could combine MaxSketch with classical hash-based methods for data that mixes exact repeats with noisy variants.

Load-bearing premise

The input observations of each latent object share enough geometric closeness that their random projections produce separable maximum values at the stated memory cost.

What would settle it

A controlled experiment on synthetic data where points are placed uniformly at random without clusters, checking whether the recovered count deviates from the true number of groups by more than (1+ε) when using only O(log n / ε²) projections.

Figures

Figures reproduced from arXiv: 2605.15571 by Christos Tzamos, Constantine Caramanis, Nikos Tsikouras.

Figure 1
Figure 1. Figure 1: MaxSketch counting accuracy on the held-out face identity split, reported at tolerances of [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Exact counting accuracy of three baseline models, MaxSketch, DeepSets, and LSTM, on [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗
read the original abstract

Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar -- for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves $\widetilde{\Theta}(\sqrt{n})$ memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this assumption $m = \widetilde{O} (\log n / \varepsilon^2)$ random projections (and hence $\widetilde{O} (\log n/\varepsilon^2)$ memory) suffice to recover the true distinct count within a $(1+\varepsilon)$ factor. Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond the training regime. Our results bridge classical streaming algorithms and modern representation learning, showing how geometric structure can fundamentally reduce the complexity of distinct counting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MaxSketch, a max-linear sketch constructed from random Gaussian projections, for estimating the number of distinct latent objects in high-dimensional noisy data streams. Under the assumption of geometric structure common in learned representations, it claims that m = O~(log n / ε²) projections (hence the same memory) suffice to recover the true distinct count within a (1+ε) factor, improving on the worst-case Θ(√n) memory of prior metric-space methods. Experiments on image streams are reported to confirm accuracy and generalization.

Significance. If the central memory bound holds under the stated geometric assumption, the result is significant: it shows how representation-learning structure can reduce distinct-counting complexity from √n to polylog(n), bridging classical streaming algorithms with modern ML pipelines. The provision of confirming experiments on real image data and the parameter-free character of the claimed bound (once the geometric assumption is granted) are strengths.

major comments (2)
  1. [Main theorem / proof of the memory bound] Main theorem (the O~(log n / ε²) bound): the proof sketch relies on each new distinct embedding producing a strictly larger max-projection value with high probability. This requires a minimum separation δ (angular or Euclidean) between any two distinct latent vectors that is independent of n. If δ = o(1) and shrinks with the number of distinct items k ≤ n, the per-projection success probability drops and a union bound over Θ(k) items forces an extra log k or 1/δ² factor into m, contradicting the stated bound. The manuscript must explicitly state the dependence of δ on n (or prove δ = Ω(1) under the geometric assumption).
  2. [Proof section (concentration / union-bound argument)] Concentration analysis for the max operator: the argument that m = O~(log n / ε²) suffices for (1+ε) approximation appears to use a per-projection failure probability that is independent of k. Without an explicit lower bound on the probability that a new object exceeds the current max (which itself depends on δ), the overall success probability may fall below 1-1/poly(n) once k grows with n.
minor comments (2)
  1. [Introduction / assumption statement] Clarify the precise definition of the geometric assumption (e.g., minimum angle or distance) and state whether it is assumed to hold with high probability or deterministically.
  2. [Experiments] In the experimental section, report the exact values of m used relative to the theoretical bound and include ablation on projection dimension versus observed error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. The feedback has helped us identify areas where the assumptions and proof details need to be made more explicit. We address each major comment below and plan to incorporate revisions to clarify the geometric assumption and strengthen the analysis.

read point-by-point responses
  1. Referee: [Main theorem / proof of the memory bound] Main theorem (the O~(log n / ε²) bound): the proof sketch relies on each new distinct embedding producing a strictly larger max-projection value with high probability. This requires a minimum separation δ (angular or Euclidean) between any two distinct latent vectors that is independent of n. If δ = o(1) and shrinks with the number of distinct items k ≤ n, the per-projection success probability drops and a union bound over Θ(k) items forces an extra log k or 1/δ² factor into m, contradicting the stated bound. The manuscript must explicitly state the dependence of δ on n (or prove δ = Ω(1) under the geometric assumption).

    Authors: We agree with the referee that the dependence of the minimum separation δ must be stated explicitly. Under the geometric structure assumption for learned representations, we posit that distinct latent vectors are separated by a constant angular distance δ = Ω(1), which is independent of n and k. This is a standard assumption in representation learning, where embeddings of different objects are designed to be well-separated. With this, the probability that a new projection exceeds the current max is bounded below by a positive constant depending only on δ and the Gaussian projection. Consequently, the union bound over k items requires only an additional O(log k) factor, which is absorbed into the Õ notation since k ≤ n. We will revise the manuscript to explicitly state this assumption on δ and include a brief justification based on properties of learned embeddings. revision: yes

  2. Referee: [Proof section (concentration / union-bound argument)] Concentration analysis for the max operator: the argument that m = O~(log n / ε²) suffices for (1+ε) approximation appears to use a per-projection failure probability that is independent of k. Without an explicit lower bound on the probability that a new object exceeds the current max (which itself depends on δ), the overall success probability may fall below 1-1/poly(n) once k grows with n.

    Authors: We thank the referee for this observation. To address this, we will expand the proof to derive an explicit lower bound on the success probability p for a new distinct item to update the max, which is p ≥ c(δ) > 0 when δ = Ω(1). Then, using standard Chernoff bounds or similar concentration inequalities for the number of updates, with m = O~(log n / ε²) we ensure that the estimated count is within (1+ε) with high probability. The failure probability per item is exponentially small in m, allowing the union bound to succeed. We will include these details in the revised proof section. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on independent probabilistic analysis under stated geometric assumption

full rationale

The paper introduces MaxSketch as a max-linear sketch over random Gaussian projections and derives the m = O~(log n / ε²) memory bound from standard concentration properties of Gaussians combined with the external geometric structure assumption on latent embeddings. No step reduces the claimed bound to a fitted parameter, self-definition, or load-bearing self-citation chain; the analysis is presented as following from first-principles random projection arguments rather than renaming or smuggling prior results by the same authors. The central claim remains independent of the target distinct-count output and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of geometric structure in learned representations and on standard properties of random Gaussian projections; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Observations arise from a small number of latent objects whose representations exhibit geometric structure that random projections can exploit for distinct counting.
    The logarithmic memory bound is stated to hold under this assumption; without it the guarantee reverts to the sqrt(n) worst-case bound.

pith-pipeline@v0.9.0 · 5773 in / 1312 out tokens · 52366 ms · 2026-05-19T20:04:45.435299+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Concentration inequalities

    Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer school on machine learning, pages 208–240. Springer, 2003

  2. [2]

    Streaming algorithms for robust distinct elements

    Di Chen and Qin Zhang. Streaming algorithms for robust distinct elements. InProceedings of the 2016 International Conference on Management of Data, pages 1433–1447, 2016

  3. [3]

    Distinct sampling on streaming data with near-duplicates

    Jiecao Chen and Qin Zhang. Distinct sampling on streaming data with near-duplicates. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 369–382, 2018

  4. [4]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  5. [5]

    Loglog counting of large cardinalities

    Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. InEuropean Symposium on Algorithms, pages 605–617. Springer, 2003

  6. [6]

    Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm.Discrete mathematics & theoretical computer science, (Proceedings), 2007

    Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm.Discrete mathematics & theoretical computer science, (Proceedings), 2007

  7. [7]

    Probabilistic counting algorithms for data base applica- tions.Journal of computer and system sciences, 31(2):182–209, 1985

    Philippe Flajolet and G Nigel Martin. Probabilistic counting algorithms for data base applica- tions.Journal of computer and system sciences, 31(2):182–209, 1985

  8. [8]

    Cnn-based density estimation and crowd counting: A survey.arXiv preprint arXiv:2003.12783, 2020

    Guangshuai Gao, Junyu Gao, Qingjie Liu, Qi Wang, and Yunhong Wang. Cnn-based density estimation and crowd counting: A survey.arXiv preprint arXiv:2003.12783, 2020

  9. [9]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

  10. [10]

    Neural collapse under mse loss: Proximity to and dynamics on the central path.arXiv preprint arXiv:2106.02073, 2021

    XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path.arXiv preprint arXiv:2106.02073, 2021

  11. [11]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  12. [12]

    Face recognition across pose using view based active appear- ance models (vbaams) on cmu multi-pie dataset

    Jingu Heo and Marios Savvides. Face recognition across pose using view based active appear- ance models (vbaams) on cmu multi-pie dataset. InInternational Conference on Computer Vision Systems, pages 527–535. Springer, 2008

  13. [13]

    Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm

    Stefan Heule, Marc Nunkesser, and Alexander Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. InProceedings of the 16th International Conference on Extending Database Technology, pages 683–692, 2013

  14. [14]

    Long short-term memory.Neural computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

  15. [15]

    Labeled faces in the wild: A database forstudying face recognition in unconstrained environments

    Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database forstudying face recognition in unconstrained environments. InWorkshop on faces in’Real-Life’Images: detection, alignment, and recognition, 2008

  16. [16]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. InProceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998

  17. [17]

    Locality-preserving hashing in multidimensional spaces

    Piotr Indyk, Rajeev Motwani, Prabhakar Raghavan, and Santosh Vempala. Locality-preserving hashing in multidimensional spaces. InProceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 618–625, 1997. 10

  18. [18]

    An optimal algorithm for the dis- tinct elements problem

    Daniel M Kane, Jelani Nelson, and David P Woodruff. An optimal algorithm for the dis- tinct elements problem. InProceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 41–52, 2010

  19. [19]

    Generalised f-mean aggregation for graph neural networks.Advances in Neural Information Processing Systems, 36:34439–34450, 2023

    Ryan Kortvelesy, Steven Morad, and Amanda Prorok. Generalised f-mean aggregation for graph neural networks.Advances in Neural Information Processing Systems, 36:34439–34450, 2023

  20. [20]

    Springer, 1991

    Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer, 1991

  21. [21]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015

  22. [22]

    Prevalence of neural collapse during the terminal phase of deep learning training.Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020

    Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training.Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020

  23. [23]

    Learning aggregation functions.arXiv preprint arXiv:2012.08482, 2020

    Giovanni Pellegrini, Alessandro Tibo, Paolo Frasconi, Andrea Passerini, and Manfred Jaeger. Learning aggregation functions.arXiv preprint arXiv:2012.08482, 2020

  24. [24]

    A benchmark of facial recognition pipelines and co-usability performances of modules.Bili¸ sim Teknolojileri Dergisi, 17(2):95–107, 2024

    Sefik Serengil and Alper Özpınar. A benchmark of facial recognition pipelines and co-usability performances of modules.Bili¸ sim Teknolojileri Dergisi, 17(2):95–107, 2024

  25. [25]

    Cambridge university press, 2014

    Shai Shalev-Shwartz and Shai Ben-David.Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

  26. [26]

    The one-sided barrier problem for gaussian noise.Bell System Technical Journal, 41(2):463–501, 1962

    David Slepian. The one-sided barrier problem for gaussian noise.Bell System Technical Journal, 41(2):463–501, 1962

  27. [27]

    On deep set learning and the choice of aggregations

    Maximilian Soelch, Adnan Akhundov, Patrick van der Smagt, and Justin Bayer. On deep set learning and the choice of aggregations. InInternational Conference on Artificial Neural Networks, pages 444–457. Springer, 2019

  28. [28]

    On the limitations of representing functions on sets

    Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne. On the limitations of representing functions on sets. InInternational conference on machine learning, pages 6487–6494. PMLR, 2019

  29. [29]

    Deep sets.Advances in neural information processing systems, 30, 2017

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets.Advances in neural information processing systems, 30, 2017

  30. [30]

    Robust statistical analysis on streaming data with near-duplicates in general metric spaces.Proceedings of the ACM on Management of Data, 3(2):1–25, 2025

    Qin Zhang. Robust statistical analysis on streaming data with near-duplicates in general metric spaces.Proceedings of the ACM on Management of Data, 3(2):1–25, 2025

  31. [31]

    On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features

    Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022

  32. [32]

    A geometric analysis of neural collapse with unconstrained features.Advances in Neural Information Processing Systems, 34:29820–29834, 2021

    Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features.Advances in Neural Information Processing Systems, 34:29820–29834, 2021. 11 A Appendix: Extended Related Work A.1 Classical Sketching and Cardinality Estimation The problem of estimating the number of d...

  33. [33]

    Limitations

    is a survey about direct count regression in images (e.g., counting people in a crowd). These methods differ fundamentally from our work: they predict total counts in a single image, not distinct elements in a stream. No theoretical guarantees or handling of weak supervision. A.5 Permutation-invariant models and max aggregation. In deep learning, permutat...

  34. [34]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...