pith. machine review for the scientific record.

arxiv: 2605.07119 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG

Recognition: 2 Lean theorem links

Classification Fields: Arbitrarily Fine Recursive Hierarchical Clustering From Few Examples

Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Paul D. McNicholas, Ruiyang Hong, Yicen Li

Pith reviewed 2026-05-11 01:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords hierarchical clustering · recursive refinement · classification fields · Voronoi partitions · ReLU approximation · fractal hierarchies · metric convergence

The pith

Finite observed prefixes of recursive cluster hierarchies can be used to learn a local refinement rule that generates consistent, arbitrarily deep classification fields with exponential convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines classification fields as infinite-depth hierarchies of Voronoi clusters on R^d produced by repeatedly applying a single local rule that turns each parent center into a fixed number of ordered child centers. It demonstrates that a predictor trained on only a short observed prefix of such a hierarchy can approximate the underlying rule well enough to be unrolled to depths far beyond the training data. The resulting structures converge exponentially fast to the true infinite hierarchy when measured in a completed cell metric that accounts for all finer scales. This approach is shown to work on synthetic recursive partitions, fractal sets, and image-derived hierarchies while preserving child ordering and path distances.

Core claim

A classification field is an infinite recursive partition of R^d generated from a root, a fixed scale factor, and a generator that maps any parent center to an ordered, bounded, separated tuple of child residuals. Given only a finite initial segment of centers and cells, a learned predictor approximates the generator so that the induced metric DAGs and Voronoi cells converge exponentially in the completed cell metric; the predictors themselves are realizable by ReLU networks whose width is O(ε^{-γ}) and depth is Õ(ε^{-3γ/2}) for γ = log K / (-log s).
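To fix ideas, the following minimal Python sketch unrolls such a generator by recursion. The scaling convention (children of a level-ℓ parent c placed at c + s^{ℓ+1} r_k(c)) and the toy generator are our assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def rollout(root, generator, s, depth):
    """Unroll a classification-field generator from a root center.

    generator(c) returns a (K, d) array of ordered child residuals;
    children of a level-l parent c are placed at c + s**(l + 1) * r_k(c)
    (an assumed scaling convention for this sketch).
    """
    levels = [np.asarray([root])]
    for l in range(depth):
        children = [c + s ** (l + 1) * r
                    for c in levels[-1] for r in generator(c)]
        levels.append(np.asarray(children))
    return levels

# Hypothetical K = 3 generator on R^2: a fixed, separated triangle of
# residuals, rotated by an angle that depends on the parent center.
BASE = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])

def toy_generator(c):
    theta = 0.3 * c[0]
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return BASE @ rot.T

levels = rollout(np.zeros(2), toy_generator, s=0.5, depth=6)
print([len(lvl) for lvl in levels])  # [1, 3, 9, 27, 81, 243, 729] = K**l
```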

What carries the argument

The classification field generator: a map sending each parent center to an ordered tuple of child residuals that, together with scaling, produces the next level of centers and cells while preserving separation and boundedness.

If this is right

  • The learned predictor can generate cluster hierarchies at depths much larger than those observed while keeping geometric and ordering fidelity.
  • ReLU networks of width scaling as O(ε^{-γ}) suffice to approximate the generator to any prescribed accuracy.
  • The same finite-prefix learning procedure applies uniformly to CFG-generated trees, iterated function system fractals, and image-induced recursive partitions.
  • Hierarchy-level path metrics and unordered geometry are preserved under recursive rollout of the predictor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be used to extrapolate hierarchical structure in domains where only coarse observations are available, such as partial segmentations of large images.
  • If the underlying rule is approximately self-similar, the same predictor might be reused across different root locations without retraining.
  • Testing the rollout on natural data sets with known finer-scale annotations would directly measure how far the exponential convergence extends in practice.

Load-bearing premise

The refinement rule that turns parents into ordered children is the same at every depth and scale, so a single predictor learned from the observed prefix continues to work uniformly when rolled out further.
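A minimal sketch of what this premise licenses, assuming a plain supervised setup that pairs each observed parent center with its K ordered child residuals (the paper's actual predictor class and training protocol may differ); PyTorch is used here only for brevity.

```python
import torch
import torch.nn as nn

d, K = 2, 3  # ambient dimension and branching factor (illustrative values)

# One network, reused at every depth: the premise is exactly that the
# parent-to-children rule does not change with level or scale.
net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, K * d))

def predict_residuals(centers):
    """centers: (N, d) parent centers -> (N, K, d) ordered child residuals."""
    return net(centers).reshape(-1, K, d)

def fit(parents, targets, steps=2000, lr=1e-3):
    """parents: (N, d) centers from the observed prefix (levels 0..L_train);
    targets: (N, K, d) their true ordered residuals."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((predict_residuals(parents) - targets) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()
```

Rollout then applies `predict_residuals` recursively at every level, exactly as in the generator sketch above; if the depth-uniformity premise fails, nothing in this setup detects it.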

What would settle it

If the Hausdorff distance between the Voronoi cells produced by rolling out the learned predictor and the true cells at greater depth fails to decay exponentially, or if child ordering is lost under rollout, the truncation convergence claim is false.
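A crude version of that test can be scripted directly, using the symmetric Hausdorff distance between center clouds as a stand-in for the paper's completed cell metric on Voronoi cells; the decay-ratio test and its tolerance are our choices, not the paper's.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point clouds."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def decay_profile(levels_true, levels_pred):
    """Per-depth distances between true and rolled-out centers."""
    return [hausdorff(t, p) for t, p in zip(levels_true, levels_pred)]

def looks_exponential(dists, s, tol=0.25):
    """Exponential decay predicts successive ratios near the scale factor s;
    ratios drifting toward 1 would contradict the convergence claim."""
    ratios = [b / a for a, b in zip(dists, dists[1:]) if a > 0]
    return all(r <= s + tol for r in ratios)
```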

Figures

Figures reproduced from arXiv: 2605.07119 by Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Paul D. McNicholas, Ruiyang Hong, Yicen Li.

Figure 1. Learning a classification field from finite observations. The observed hierarchy …
Figure 2. Matched CFG rollout. All methods are trained on levels 0–2 and recursively rolled out for …
Figure 3. Recursive rollout on nonlinear IFS-induced classification fields. From left to right: Sier…
Figure 4. CIFAR-induced approximate classification fields. CLIP-space centroids are visualized after …
Figure 5. Two views of an infinite classification field. Iterating the CFG first produces a rooted …
Figure 6. Observed CFG training prefixes across 9 independent trials. Each panel shows the finite …
Figure 7. Qualitative CFP rollouts on CFG-generated hierarchies. Each panel shows a CFP trained …
Figure 8. Recursive rollout on linear IFS-induced classification fields. From left to right: Sierpi…
Figure 9. Effect of the observed training depth Ltrain on extrapolation MSE. As additional levels are observed during training, the CFP achieves substantially lower error on deeper unseen levels.
Figure 10. Effect of the scale factor s on extrapolation performance in CFG-generated hierarchies. We fix d = 2, K = 3, and Ltrain = 2, and report normalized MSE across unseen levels for different values of s ∈ {0.3, 0.5, 0.7, 0.85}. Larger values of s make recursive rollout substantially more difficult for all methods, while the CFP remains the most robust across the full range.
Figure 11. Sensitivity to child-slot correspondence. We compare the generator-induced ordering, …
Original abstract

Classical clustering methods usually return either a finite partition of the observed data or a finite dendrogram over it. This finite-sample view is inadequate when the hierarchy of interest is a recursive geometric object with fine-scale refinements that continue beyond the levels directly observed. We introduce classification fields: infinite-depth hierarchical cluster structures on $\mathbb{R}^d$ generated by a local parent-to-child refinement rule. A classification field generator maps each parent centre to an ordered, bounded, and separated tuple of child residuals. Together with a root and a scale factor, this rule recursively generates cluster centres, Voronoi cells, and a metric DAG encoding the hierarchy. Given only a finite prefix of such a hierarchy, we learn a classification field predictor that approximates the generator and can be rolled out to unseen depths. We prove exponential truncation convergence in the completed cell metric and ReLU realizability with width $O(\varepsilon^{-\gamma})$ and depth $\widetilde O(\varepsilon^{-3\gamma/2})$, where $\gamma=\log K/(-\log s)$, up to finite-window aspect-ratio factors. The approximation holds at the level of the induced compact metric structures, measured in the completed cell-metric Hausdorff distance. Experimental validation on matched CFG-generated hierarchies, IFS fractals, and image-induced recursive clustering hierarchies shows that learned predictors preserve ordered child slots, unordered geometry, and hierarchy-level path metrics under recursive rollout. These results support the claim that finite hierarchical observations can reveal local refinement rules capable of generating substantially deeper classification fields.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces classification fields as infinite-depth recursive hierarchical cluster structures on R^d generated by a fixed local parent-to-child refinement rule that maps each parent center to an ordered tuple of child residuals. From a finite observed prefix of such a hierarchy (with root and scale factor s), the authors learn a predictor approximating the generator and prove that recursive rollout yields exponentially convergent deeper structures in the completed cell metric, together with ReLU realizability bounds of width O(ε^{-γ}) and depth Õ(ε^{-3γ/2}) where γ = log K / (-log s). Experiments on CFG-generated hierarchies, IFS fractals, and image-induced recursive clusters confirm that the rolled-out structures preserve ordered child slots, unordered geometry, and hierarchy path metrics.

Significance. If the derivations hold, the work supplies a principled method for extrapolating finite hierarchical observations to arbitrary depths under an explicit local-rule consistency assumption, with concrete convergence rates and neural-network complexity bounds tied directly to the generator parameters K and s. This could be relevant for self-similar or fractal data regimes where fine-scale labels are scarce. Strengths include the reduction of the approximation task to standard ReLU bounds on a residual map, the explicit statement of the modeling assumption, and validation across three qualitatively different hierarchy generators. The absence of machine-checked proofs is noted but does not detract from the standard contraction-mapping argument outlined.

major comments (1)
  1. §5.2, Theorem 5.1: the exponential truncation convergence in the completed cell metric is stated to hold uniformly once the predictor approximates the residual map to accuracy ε; however, the proof sketch does not explicitly bound the propagation of approximation error through the metric DAG at arbitrary depths, which is load-bearing for the claim that rollout remains exponentially convergent rather than merely convergent.
minor comments (3)
  1. Abstract and §2: the qualifier 'up to finite-window aspect-ratio factors' appears in the realizability bound but is not restated in the theorem; moving or repeating this qualifier in the formal statement would improve clarity.
  2. §6 (Experiments): the data-exclusion rules and exact prefix depths used for the image-induced hierarchies are not detailed; adding a short paragraph or table entry would aid reproducibility.
  3. Notation: the completed cell metric is introduced in §3 but its Hausdorff-distance formulation is only sketched; a self-contained definition or reference to the precise metric would help readers unfamiliar with the DAG construction.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the recommendation of minor revision. The single major comment identifies a point where the proof sketch can be made more explicit; we address it directly below and will revise the manuscript to incorporate a detailed error-propagation argument.

Point-by-point responses
  1. Referee: [—] §5.2, Theorem 5.1: the exponential truncation convergence in the completed cell metric is stated to hold uniformly once the predictor approximates the residual map to accuracy ε; however, the proof sketch does not explicitly bound the propagation of approximation error through the metric DAG at arbitrary depths, which is load-bearing for the claim that rollout remains exponentially convergent rather than merely convergent.

    Authors: We agree that the current sketch invokes the contraction-mapping property of the generator without spelling out the inductive control of approximation error across arbitrary depths of the metric DAG. In the revision we will insert a supporting lemma that makes this bound explicit. Let d_H denote Hausdorff distance in the completed cell metric. If the learned predictor approximates the true residual map to accuracy ε, the local refinement rule (with scale s < 1 and separated child residuals) implies that the error contributed at each child level is at most s times the parent error plus a term linear in ε. By induction on depth, the accumulated approximation error remains bounded by Cε/(1-s) uniformly in depth, while the truncation error to the infinite structure still contracts geometrically at rate s. Consequently the rolled-out finite-depth structures converge exponentially to the infinite classification field in the completed cell metric, as stated in Theorem 5.1. The added lemma will be placed immediately before the theorem statement. revision: yes
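In symbols, the promised lemma amounts to a two-line recursion; here e_ℓ denotes the accumulated approximation error at depth ℓ and C is a generic constant absorbing the Lipschitz and separation factors (our notation, not the manuscript's):

```latex
e_{\ell+1} \le s\,e_{\ell} + C\varepsilon, \qquad e_{0} \le C\varepsilon, \qquad 0 < s < 1
\;\Longrightarrow\;
e_{\ell} \le C\varepsilon \sum_{j=0}^{\ell} s^{j} \le \frac{C\varepsilon}{1-s}
\quad \text{uniformly in } \ell .
```

The truncation term, by contrast, still carries the untouched factor s^ℓ, which is what keeps the overall convergence exponential rather than merely bounded.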

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core claims rest on an explicit generator definition mapping parent centers to ordered child residuals, a metric DAG construction, a contraction argument in the scale factor s for exponential truncation convergence in the completed cell metric, and an approximation theorem that reduces ReLU realizability bounds directly to standard width/depth results on the residual map. These steps are stated in terms of the generator properties and Hausdorff distance on induced compact structures rather than fitted predictor parameters. The local rule consistency is declared as a modeling assumption, not derived from data. Experiments on CFG, IFS, and image hierarchies serve as separate validation of rollout preservation. No load-bearing step reduces by construction to inputs, fitted values, or self-citations.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the existence of a stable local refinement rule that can be learned from finite prefixes and on standard properties of ReLU networks and Hausdorff-type metrics; no new physical constants or data-fitted scalars are introduced beyond the network weights.

free parameters (2)
  • scale factor s
    Appears in the definition of recursive generation and in the exponent γ = log K / (-log s); treated as part of the generator rather than fitted post-hoc (see the plug-in arithmetic after this ledger).
  • number of children K
    Determines the branching and enters the width bound; part of the generator specification.
axioms (2)
  • domain assumption: The refinement rule produces ordered, bounded, and separated child residuals at every scale.
    Invoked in the definition of the classification field generator and required for the metric DAG and Voronoi cells to remain well-defined under recursion.
  • standard math: ReLU networks can realize the required approximation with the stated width and depth bounds up to finite-window aspect-ratio factors.
    Standard universal approximation results for ReLU nets are extended to the specific function class of the generator.
invented entities (2)
  • classification field (no independent evidence)
    purpose: Infinite-depth hierarchical cluster structure on R^d
    New object defined by recursive application of the generator; no independent evidence outside the paper's construction.
  • completed cell metric (no independent evidence)
    purpose: Distance that measures agreement of the induced compact metric structures at all scales
    Introduced to state the convergence result; defined in terms of the hierarchy itself.
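For orientation, plugging illustrative values into the stated exponents shows why the ledger treats s as the decisive parameter: γ = log K / (-log s) blows up as s → 1, consistent with Figure 10's finding that larger scale factors make rollout harder. The arithmetic below is our own illustration (polylog and aspect-ratio factors suppressed), not the paper's code.

```python
import math

def network_size(K, s, eps):
    """Plug-in arithmetic for the stated bounds: gamma = log K / (-log s),
    width ~ eps**(-gamma), depth ~ eps**(-1.5 * gamma). Illustrative only;
    the paper's depth bound carries additional polylog factors."""
    gamma = math.log(K) / (-math.log(s))
    return gamma, eps ** (-gamma), eps ** (-1.5 * gamma)

for s in (0.3, 0.5, 0.7, 0.85):
    g, w, dep = network_size(K=3, s=s, eps=0.1)
    print(f"s={s}: gamma={g:.2f}, width~{w:.2e}, depth~{dep:.2e}")
```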

pith-pipeline@v0.9.0 · 5591 in / 1818 out tokens · 33057 ms · 2026-05-11T01:29:50.121153+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

    In particular, ∥V(x)∥ op =∥W(x)∥ op = 1 We first claim that, for any skew-symmetric matricesS, T∈R d×d, ∥eS −e T ∥op ≤ ∥S−T∥ op.(47) Indeed, using the standard identity eS −e T = Z 1 0 e(1−u)S(S−T)e uT du, we obtain ∥eS −e T ∥op ≤ R 1 0 ∥e(1−u)S∥op ∥S−T∥ op ∥euT ∥op du. Since S and T are skew- symmetric,e (1−u)S ande uT are orthogonal, whence each operato...