Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings
Pith reviewed 2026-06-29 13:00 UTC · model grok-4.3
The pith
A stateless sparse Johnson-Lindenstrauss projection followed by clipping and scalar quantization stores 384-dimensional neural embeddings in 48 bytes while retaining cosine similarity information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Clark Hash applies a deterministic sparse signed Johnson-Lindenstrauss projection to normalized embedding vectors, clips the projected coordinates, and stores the result as a fixed-width scalar-quantized integer code. The resulting 48-byte sketches are scored against full-precision query vectors by the same cosine function used on the original dense vectors. On the STS17 and STS22 collections the sketches achieve macro Pearson correlations of 0.910 and 0.946 with the dense baseline when the underlying encoder is a multilingual MiniLM model.
What carries the argument
Deterministic sparse signed Johnson-Lindenstrauss projection followed by clipping and scalar quantization, which maps each normalized vector to a short fixed-length integer code that approximately preserves inner-product information.
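The pipeline described above can be sketched in a few lines. This is an illustrative reading, not the paper's Rust implementation: the sketch dimension (96) and the 4 non-zeros per column are the values quoted in the simulated rebuttal on this page, the 4-bit code width is inferred from 96 coordinates fitting in 48 bytes, and the BLAKE2b hashing, the block layout of the non-zeros, the clipping threshold `CLIP`, and the scoring rule (project the float query, then take cosine against the dequantized sketch) are all assumptions.

```python
import hashlib
import math
import struct

DIM = 384                 # input dimensionality (from the abstract)
SKETCH_DIM = 96           # assumed: value quoted in the simulated rebuttal
NNZ = 4                   # assumed: non-zeros per input column (simulated rebuttal)
BITS = 4                  # inferred: 96 coordinates * 4 bits = 48 bytes
CLIP = 0.25               # hypothetical clipping threshold, not from the paper
LEVELS = (1 << BITS) - 1  # highest code value, 15


def _column(j):
    """Deterministic (row, sign) pairs for input coordinate j (hypothetical hashing)."""
    block = SKETCH_DIM // NNZ
    pairs = []
    for k in range(NNZ):
        digest = hashlib.blake2b(struct.pack("<II", j, k), digest_size=8).digest()
        h = int.from_bytes(digest, "little")
        pairs.append((k * block + (h >> 1) % block, 1.0 if h & 1 else -1.0))
    return pairs


COLUMNS = [_column(j) for j in range(DIM)]


def project(x):
    """Normalize x, then apply the fixed sparse signed projection."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    scale = 1.0 / (norm * math.sqrt(NNZ))
    y = [0.0] * SKETCH_DIM
    for j, v in enumerate(x):
        for row, sign in COLUMNS[j]:
            y[row] += sign * v * scale
    return y


def encode(x):
    """Clip each projected coordinate, quantize to 4 bits, pack two codes per byte."""
    codes = []
    for v in project(x):
        c = max(-CLIP, min(CLIP, v))
        codes.append(round((c + CLIP) / (2 * CLIP) * LEVELS))
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, SKETCH_DIM, 2))


def decode(sketch):
    """Unpack the 4-bit codes back to approximate projected coordinates."""
    vals = []
    for b in sketch:
        for c in (b >> 4, b & 0x0F):
            vals.append(c / LEVELS * 2 * CLIP - CLIP)
    return vals


def score(query, sketch):
    """Cosine between the float-precision projected query and the stored sketch."""
    q, d = project(query), decode(sketch)
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0


x = [math.sin(1.7 * i) for i in range(DIM)]  # toy vector, not real embedding data
sketch = encode(x)
print(len(sketch), round(score(x, sketch), 3))
```

Nothing in `encode` depends on other vectors in the collection, which is the sense in which the codec is stateless: the same input always yields the same 48 bytes.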
If this is right
- In the default 384-dimensional setting, embedding collections can be stored using 32 times less memory than dense float32 vectors (48 bytes instead of 1536 bytes per vector).
- New vectors can be encoded and inserted without retraining the codec or recomputing any corpus statistics.
- Query vectors stay in full floating-point precision and are compared directly to the stored codes.
- The codec works for any embedding dimensionality once the projection matrix is fixed and requires no learned parameters.
- It functions as a lightweight alternative to methods that rely on data-dependent codebooks or rotations.
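The 32x figure in the first bullet follows directly from the byte counts in the abstract. The arithmetic below is a small check of those numbers; the one-million-vector corpus size is a hypothetical example, not a figure from the paper.

```python
DIM = 384          # embedding dimensionality (from the abstract)
F32_BYTES = 4      # bytes per float32 value
SKETCH_BYTES = 48  # stored sketch size (from the abstract)

dense_bytes = DIM * F32_BYTES            # 1536
ratio = dense_bytes / SKETCH_BYTES       # 32.0
bits_per_input_dim = SKETCH_BYTES * 8 / DIM  # 1.0 bit per original dimension


def corpus_megabytes(n_vectors, bytes_per_vector):
    """Raw vector storage in megabytes (10**6 bytes), ignoring index overhead."""
    return n_vectors * bytes_per_vector / 1e6


n = 1_000_000  # hypothetical corpus size
print(dense_bytes, ratio, bits_per_input_dim)
print(corpus_megabytes(n, dense_bytes), corpus_megabytes(n, SKETCH_BYTES))
```

The ratio is specific to this operating point: a different input dimensionality or sketch width would change it.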
Where Pith is reading between the lines
- The same fixed projection could be applied to embeddings from domains other than sentences, such as image or graph representations, to test whether the cosine preservation holds without retraining.
- Because the projection matrix is deterministic and stateless, multiple independent databases could share the identical encoding scheme and be merged without re-encoding.
- Pairing the sketches with an existing approximate nearest-neighbor index would allow memory-efficient search at the cost of an extra decompression or scoring step against the quantized codes.
- The method opens a route to theoretical bounds on the worst-case distortion of cosine similarity under this specific clipping-plus-quantization pipeline.
Load-bearing premise
The fixed sparse signed projection, after clipping and quantization, preserves enough of the original cosine information for the sentence-similarity tasks without any corpus-dependent calibration.
What would settle it
Running the identical 48-byte codec on a fresh sentence-similarity collection and obtaining a macro Pearson correlation noticeably below 0.91 with the dense cosine scores would falsify the preservation claim for the reported operating point.
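A check of this kind needs a concrete definition of the metric. The sketch below assumes that "macro Pearson" means the Pearson correlation between dense cosine scores and sketch scores computed separately for each subset and then averaged with equal weight per subset; the paper may define it differently. The function names and the toy scores are hypothetical and are not results from the paper.

```python
import math
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def macro_pearson(subsets):
    """subsets maps a subset name to (dense_scores, sketch_scores).

    Returns the unweighted mean of per-subset correlations, their sample
    standard deviation, and the per-subset values.
    """
    per_subset = {name: pearson(d, s) for name, (d, s) in subsets.items()}
    values = list(per_subset.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread, per_subset


# Toy numbers for illustration only.
toy = {
    "toy-a": ([0.1, 0.5, 0.9], [0.2, 0.6, 1.0]),
    "toy-b": ([0.1, 0.5, 0.9], [0.3, 0.2, 0.8]),
}
macro, spread, per_subset = macro_pearson(toy)
print(macro, spread, per_subset)
```

Reporting the spread and the per-subset values alongside the mean would show whether a macro score near 0.91 hides weak subsets.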
Original abstract
Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Clark Hash, a stateless codec for compressing neural embeddings. It normalizes each vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code (48 bytes for 384-dimensional embeddings). Queries remain in floating point and are scored against the sketches. The central empirical claim is that this procedure achieves macro Pearson correlations of 0.910 on STS17 and 0.946 on STS22 with dense cosine similarities, evaluated on 9,304 labeled pairs from 29 subsets using a multilingual MiniLM encoder, while requiring no training, learned parameters, or corpus statistics.
Significance. If the empirical results hold, the work demonstrates a practical, fully deterministic and stateless method for 32x reduction in embedding storage while retaining sufficient similarity signal for sentence-similarity tasks. Positive aspects include the explicit sequence of operations with no fitted parameters and the provision of a Rust implementation, which supports reproducibility.
major comments (2)
- [Evaluation] Evaluation section: the reported macro Pearson correlations of 0.910 and 0.946 are given as single point estimates with no error bars, standard deviations, per-subset breakdowns, or variance estimates across the 29 subsets, and without a full description of the experimental protocol or ablation studies on individual codec components; this renders the support for the central claim that the 48-byte sketches preserve enough cosine information only moderate.
- [Method] Method section: the precise construction of the deterministic sparse signed Johnson-Lindenstrauss projection (including the exact sparsity level, how the signing is made deterministic, and the target sketch dimension) is described at a high level but lacks the concrete parameter values or pseudocode needed to reproduce the exact numerical results from the textual description alone.
minor comments (1)
- [Abstract] The abstract and introduction could more explicitly contrast Clark Hash with prior quantization and sketching methods to clarify its positioning.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of Clark Hash and the recommendation for minor revision. We address the two major comments point by point below.
- Referee: [Evaluation] Evaluation section: the reported macro Pearson correlations of 0.910 and 0.946 are given as single point estimates with no error bars, standard deviations, per-subset breakdowns, or variance estimates across the 29 subsets, and without a full description of the experimental protocol or ablation studies on individual codec components; this renders the support for the central claim that the 48-byte sketches preserve enough cosine information only moderate.
  Authors: We agree that the evaluation would be strengthened by additional statistical detail. In the revised manuscript we will report standard deviations across the 29 subsets, include per-subset Pearson correlations with error bars, expand the experimental protocol description, and add ablation results on sparsity level, signing determinism, and quantization bit-width. These changes directly address the concern about moderate support for the central claim.
  revision: yes
- Referee: [Method] Method section: the precise construction of the deterministic sparse signed Johnson-Lindenstrauss projection (including the exact sparsity level, how the signing is made deterministic, and the target sketch dimension) is described at a high level but lacks the concrete parameter values or pseudocode needed to reproduce the exact numerical results from the textual description alone.
  Authors: We accept this observation. The revised manuscript will specify the exact sparsity (4 non-zeros per column), the deterministic signing procedure (fixed-seed hash function), the target sketch dimension (96), and will include pseudocode for the full projection-plus-quantization pipeline. These additions will enable exact reproduction from the text while retaining the existing Rust implementation as supplementary material.
  revision: yes
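The parameters named in this response (4 non-zeros per column, a fixed-seed hash, sketch dimension 96) are enough to sketch how such a projection column could be generated on the fly. This is one plausible construction, not the authors' pseudocode: the seed value, the use of BLAKE2b, the layout that places one non-zero in each of four 24-row blocks, and the 1/sqrt(4) scaling are assumptions.

```python
import hashlib
import math
import struct

SKETCH_DIM = 96  # target sketch dimension quoted in the response
NNZ = 4          # non-zeros per column quoted in the response
SEED = 2026      # hypothetical fixed seed


def column(j, seed=SEED):
    """Return NNZ (row, value) entries for input coordinate j.

    One entry falls in each block of SKETCH_DIM // NNZ rows, so the rows of a
    column never collide. Row and sign both come from a seeded hash of (j, k).
    """
    block = SKETCH_DIM // NNZ
    entries = []
    for k in range(NNZ):
        msg = struct.pack("<QII", seed, j, k)
        h = int.from_bytes(hashlib.blake2b(msg, digest_size=8).digest(), "little")
        row = k * block + (h >> 1) % block
        value = (1.0 if h & 1 else -1.0) / math.sqrt(NNZ)
        entries.append((row, value))
    return entries


col = column(7)
rows = [r for r, _ in col]
col_norm_sq = sum(v * v for _, v in col)  # 4 * (1/2)**2 = 1.0
print(rows, col_norm_sq)
```

Because each column is a pure function of the seed and the coordinate index, two machines that agree on those values produce identical sketches without exchanging a matrix.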
Circularity Check
No significant circularity
The paper's construction consists of an explicit, deterministic sequence of operations (vector normalization, fixed sparse signed JL projection, clipping, and scalar quantization to a fixed 48-byte code) with no learned parameters, fitted values, or data-dependent calibration steps. The reported result is a direct empirical Pearson correlation between the resulting sketches and dense cosine scores on held-out STS subsets; this evaluation does not reduce to any self-referential definition, fitted-input prediction, or load-bearing self-citation. The text explicitly states that the method is not a new JL theorem and introduces no uniqueness claims derived from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The Johnson-Lindenstrauss lemma guarantees approximate preservation of inner products under the chosen sparse signed projection.