$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

Ginny Wong; Hanghang Tong; Hang Yin; Lihui Liu; Simon See; Yangqiu Song; Zihao Wang

arxiv: 2601.20844 · v3 · pith:G5VSMIE4new · submitted 2026-01-28 · cs.LG · cs.AI · cs.IR

keywords: retrieval, centroid, dimension, embedding-based, epsilon, robust, sqrt, theoretical

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors such that every subset of size at most $k$ is exactly retrieved by score comparison. Our result shows that the MED is $\Theta(k)$, independent of $m$, for inner product, Euclidean distance, and cosine similarity. We then consider the Robust MED (RMED), where all vectors have unit norm and a score gap of $\epsilon$ is required. We derive the $m$-dependent feasibility ceiling $\epsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$, which approaches $1/\sqrt{k}$ when $m\gg k$, and show that a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulations on synthetic top-$2$ retrieval with cyclic polytope and centroid query optimization confirm our theoretical claims. Experiments on the LIMIT and LIMIT-small datasets also show that simple embedding-based retrieval baselines can overfit and outperform the reported single-vector LLM embedding baseline. Both the theoretical and empirical findings rule out a lack of exact geometric capacity as the obstruction.
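The limiting behaviour of the feasibility ceiling quoted in the abstract can be checked numerically. The following is a minimal sketch that evaluates only the closed form $\epsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$ as stated above; the function names `epsilon_star` and `limit_value` are hypothetical helpers introduced here for illustration and are not taken from the paper.

```python
import math


def epsilon_star(m: int, k: int) -> float:
    """Evaluate the ceiling m / sqrt(k (m-1) (m-k)) quoted in the abstract.

    The restriction 1 <= k < m is an assumption made here so that the
    expression under the square root stays strictly positive.
    """
    if not (1 <= k < m):
        raise ValueError("requires 1 <= k < m")
    return m / math.sqrt(k * (m - 1) * (m - k))


def limit_value(k: int) -> float:
    """The value 1 / sqrt(k) that the ceiling approaches when m >> k."""
    return 1.0 / math.sqrt(k)


if __name__ == "__main__":
    k = 2
    for m in (10, 100, 10_000, 1_000_000):
        # The gap between the two columns shrinks as m grows.
        print(m, epsilon_star(m, k), limit_value(k))
```

Since $m^2 > (m-1)(m-k)$ whenever $1 \le k < m$, the ceiling computed this way always lies strictly above $1/\sqrt{k}$ and decreases toward it as $m$ grows, which is consistent with the abstract's statement for $m \gg k$.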

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Is Dimensionality a Barrier for Retrieval Models?
cs.LG 2026-05 unverdicted novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse que...