SOLANET: Distributed Neighbor Graph Construction on GPU-Accelerated Systems
Pith reviewed 2026-06-29 15:11 UTC · model grok-4.3
The pith
SOLANET builds large neighbor graphs by first creating local GPU graphs then refining them with remote ANN searches via MPI one-sided operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SOLANET first constructs local graphs on each GPU after data partitioning and then refines them via approximate nearest neighbor searches over remote graphs pulled from other GPUs using MPI one-sided operations. It also provides a lock-free single-GPU neighbor graph construction algorithm for AMD GPUs. This yields better performance than prior single-GPU methods and demonstrates 11X speedup from 32 to 512 APUs for 1 billion data points and 6.9x speedup from 64 to 512 APUs for 2 billion points.
What carries the argument
The two-phase construction where local approximate graphs are built on partitioned data and then refined through remote ANN searches accessed with MPI one-sided operations.
If this is right
- Neighbor graphs for 1-2 billion points can be built with near-linear scaling up to 512 APUs.
- Single-GPU construction on AMD MI300A APUs runs faster than existing GPU-based approximate methods.
- The lock-free single-GPU algorithm removes synchronization overhead during local graph building.
- The same refinement approach can be applied to other distributed approximate graph tasks that start from partitioned local results.
Where Pith is reading between the lines
- Further tuning of the remote search frequency or data pull size could extend scaling beyond 512 APUs on slower interconnects.
- The partitioning-plus-refinement pattern might transfer to other irregular distributed algorithms that mix local computation with selective remote access.
- Embedding this construction inside larger distributed analytics pipelines could reduce end-to-end time for graph-based learning workloads.
Load-bearing premise
That the costs of pulling remote graphs and running ANN searches on them stay low enough that the refinement step produces net speedups rather than being dominated by communication or load imbalance.
What would settle it
A runtime measurement on 512 APUs for 1 billion points that is not at least 11 times faster than the runtime on 32 APUs, or data showing that remote access time exceeds local computation time across the workload.
Figures
read the original abstract
Neighbor graphs capture relationships among data points and are widely used in data analytics and AI workloads. Many studies have explored approximate construction methods for single-node systems, including GPUs. However, extending this to distributed systems for larger data and further acceleration remains challenging due to irregular computation patterns. We present SOLANET, a GPU-accelerated distributed neighbor graph construction toolkit. SOLANET first constructs local graphs on each GPU after data partitioning and then refines them via approximate nearest neighbor (ANN) searches over remote graphs pulled from other GPUs using MPI one-sided operations. SOLANET also provides a lock-free single-GPU neighbor graph construction algorithm for AMD GPUs. Our single-GPU implementation outperforms a state-of-the-art GPU-based approximate neighbor graph construction implementation across multiple datasets on a single MI300A APU. Furthermore, SOLANET demonstrates 11X speedup from 32 to 512 APUs for 1 billion data points and 6.9x speedup from 64 to 512 APUs for 2 billion points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SOLANET, a GPU-accelerated distributed toolkit for neighbor graph construction. Data is partitioned across APUs; each builds a local graph, which is then refined by approximate nearest-neighbor searches over remote shards fetched via MPI one-sided operations. A lock-free single-GPU construction algorithm for AMD GPUs is also described. The central empirical claims are single-GPU outperformance versus prior GPU baselines on MI300A APUs and strong-scaling speedups of 11× (32→512 APUs, 1 B points) and 6.9× (64→512 APUs, 2 B points).
Significance. If the reported speedups are reproducible with full experimental disclosure, the work would address a practically important scaling problem for billion-scale neighbor graphs in analytics and AI pipelines. The local-plus-remote-refinement strategy is a plausible way to exploit GPU compute while managing distributed memory, but the absence of any supporting methodology, timings, or balance metrics leaves the net-gain claim unverified.
major comments (2)
- [Abstract] Abstract: the headline speedups (11× and 6.9×) are stated without any description of datasets, baseline implementations, measurement methodology, error bars, per-phase timings, communication volumes, or load-balance statistics. Because these empirical results constitute the central claim, the manuscript cannot be evaluated for soundness.
- [Abstract] Abstract (distributed-construction paragraph): the assumption that partitioning plus local-graph construction plus remote ANN refinement via MPI one-sided pulls produces net compute gain (rather than being dominated by communication or imbalance) is asserted but unsupported by any quantitative evidence. At 512 APUs the per-APU data volume decreases while the number of remote probes increases; without communication or idle-time breakdowns the strong-scaling claim rests on an untested premise.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concerns focus on the level of detail in the abstract regarding experimental methodology and supporting evidence for the distributed scaling claims. We will revise the abstract and, where needed, the experimental presentation to address these points directly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline speedups (11× and 6.9×) are stated without any description of datasets, baseline implementations, measurement methodology, error bars, per-phase timings, communication volumes, or load-balance statistics. Because these empirical results constitute the central claim, the manuscript cannot be evaluated for soundness.
Authors: We agree the abstract is too terse for the central claims. In the revised manuscript we will expand the abstract to name the datasets (synthetic uniform distributions and real-world corpora at 1 B and 2 B points), identify the single-GPU baseline, state that all timings are end-to-end wall-clock on MI300A APUs, and note that error bars, per-phase breakdowns, communication volumes, and load-balance statistics appear in the Experiments section with the associated figures. This change will make the abstract self-contained enough for initial evaluation while preserving its length constraints. revision: yes
-
Referee: [Abstract] Abstract (distributed-construction paragraph): the assumption that partitioning plus local-graph construction plus remote ANN refinement via MPI one-sided pulls produces net compute gain (rather than being dominated by communication or imbalance) is asserted but unsupported by any quantitative evidence. At 512 APUs the per-APU data volume decreases while the number of remote probes increases; without communication or idle-time breakdowns the strong-scaling claim rests on an untested premise.
Authors: The reported speedups are measured end-to-end and therefore already incorporate communication, imbalance, and idle time. Nevertheless, we accept that the abstract does not surface the supporting metrics. We will add one sentence to the abstract summarizing the measured communication-to-compute ratio and load-balance statistics at the 512-APU scale, and we will ensure the Experiments section contains explicit per-phase timing tables and MPI one-sided volume data so that readers can verify the net-gain premise. revision: yes
Circularity Check
Empirical scaling results contain no circular derivations or self-referential predictions
full rationale
The paper reports measured wall-clock speedups on specific hardware configurations (11× from 32→512 APUs at 1 B points; 6.9× from 64→512 at 2 B points) after describing a concrete implementation (local graph construction followed by MPI one-sided remote ANN refinement). No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. All performance claims are direct experimental outcomes rather than derivations that reduce to their own inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
D. Adeniyi, Z. Wei, and Y . Yongquan, “Automated web usage data mining and recommendation system using k-nearest neighbor (knn) classification method,”Applied Computing and Informatics, vol. 12, no. 1, pp. 90–108, 2016. [Online]. Available: https://doi.org/10.1016/j.aci.2014.10.001
-
[2]
Y . Liao and V . Vemuri, “Use of k-nearest neighbor classifier for intrusion detection11an earlier version of this paper is to appear in the proceedings of the 11th usenix security symposium, san francisco, ca, august 2002,”Computers & Security, vol. 21, no. 5, pp. 439–448, 2002. [Online]. Available: https://doi.org/10.1016/S0167-4048(02)00514-X
-
[3]
Retrieval-augmented generation for knowledge-intensive nlp tasks,
P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” inProceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY , USA: Curran Associate...
-
[4]
Vector database management techniques and systems,
J. J. Pan, J. Wang, and G. Li, “Vector database management techniques and systems,” inCompanion of the 2024 International Conference on Management of Data, ser. SIGMOD ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 597–604. [Online]. Available: https://doi.org/10.1145/3626246.3654691
-
[5]
Fast agglomerative cluster- ing using a k-nearest neighbor graph,
P. Franti, O. Virmajoki, and V . Hautamaki, “Fast agglomerative cluster- ing using a k-nearest neighbor graph,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881,
-
[6]
Available: https://doi.org/10.1109/TPAMI.2006.227
[Online]. Available: https://doi.org/10.1109/TPAMI.2006.227
-
[7]
Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection,
M. Brito, E. Ch ´avez, A. Quiroz, and J. Yukich, “Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection,” Statistics & Probability Letters, vol. 35, no. 1, pp. 33–42, 1997. [Online]. Available: https://doi.org/10.1016/S0167-7152(96)00213-1
-
[8]
A novel clustering method based on hybrid k-nearest-neighbor graph,
Y . Qin, Z. L. Yu, C.-D. Wang, Z. Gu, and Y . Li, “A novel clustering method based on hybrid k-nearest-neighbor graph,” Pattern Recognition, vol. 74, pp. 1–14, 2018. [Online]. Available: https://doi.org/10.1016/j.patcog.2017.09.008
-
[9]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.1802.03426
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1802.03426 2020
-
[10]
Efficient integration of heterogeneous single-cell transcriptomes using scanorama,
B. Hie, B. Bryson, and B. Berger, “Efficient integration of heterogeneous single-cell transcriptomes using scanorama,”Nature biotechnology, vol. 37, no. 6, pp. 685–691, 2019. [Online]. Available: https://doi.org/10.1038/s41587-019-0113-3
-
[11]
Scanpy: large-scale single-cell gene expression data analysis,
F. A. Wolf, P. Angerer, and F. J. Theis, “Scanpy: large-scale single-cell gene expression data analysis,”Genome biology, vol. 19, no. 1, p. 15,
-
[12]
Available: https://doi.org/10.1186/s13059-017-1382-0
[Online]. Available: https://doi.org/10.1186/s13059-017-1382-0
-
[13]
Giotto: a toolbox for integrative analysis and visualization of spatial expression data,
R. Dries, Q. Zhu, R. Dong, C.-H. L. Eng, H. Li, K. Liu, Y . Fu, T. Zhao, A. Sarkar, F. Baoet al., “Giotto: a toolbox for integrative analysis and visualization of spatial expression data,” Genome biology, vol. 22, no. 1, p. 78, 2021. [Online]. Available: https://doi.org/10.1186/s13059-021-02286-2
-
[14]
Accelerating t-sne using tree-based algorithms,
L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,”J. Mach. Learn. Res., vol. 15, no. 1, p. 3221–3245, Jan. 2014. [Online]. Available: https://dl.acm.org/doi/10.5555/2627435.2697068
-
[15]
Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS’21 competition track,
H. V . Simhadri, G. Williams, M. Aum ¨uller, A. Babenko, D. Baranchuk, Q. Chen, M. Douze, L. Hosseini, R. Krishnaswamy, G. Srinivasa, S. J. Subramanya, and J. Wang, “Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS’21 competition track,” http://big- ann-benchmarks.com/neurips21.html, [Accessed 30-Mar-2026]
2026
-
[16]
The movielens datasets: History and context,
F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,”ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, dec 2015. [Online]. Available: https://doi.org/10.1145/2827872
-
[17]
Song: Approximate nearest neighbor search on gpu,
W. Zhao, S. Tan, and P. Li, “Song: Approximate nearest neighbor search on gpu,” in2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, pp. 1033–1044. [Online]. Available: https://doi.org/10.1109/ICDE48307.2020.00094
-
[18]
H. Wang, W.-L. Zhao, X. Zeng, and J. Yang,Fast K-NN Graph Construction by GPU Based NN-Descent. New York, NY , USA: Association for Computing Machinery, 2021, pp. 1929–1938. [Online]. Available: https://doi.org/10.1145/3459637.3482344
-
[19]
Ggnn: Graph-based gpu nearest neighbor search,
F. Groh, L. Ruppert, P. Wieschollek, and H. P. A. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,”IEEE Transactions on Big Data, vol. 9, no. 1, pp. 267–279, 2023. [Online]. Available: https://doi.org/10.1109/TBDATA.2022.3161156
-
[20]
Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus,
H. Ootomo, A. Naruse, C. Nolet, R. Wang, T. Feher, and Y . Wang, “ CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs ,” in2024 IEEE 40th International Conference on Data Engineering (ICDE). Los Alamitos, CA, USA: IEEE Computer Society, May 2024, pp. 4236–4247. [Online]. Available: https://doi.ieeecomputersociety.or...
-
[21]
Bang: Billion-scale approximate nearest neighbour search using a single gpu,
K. Venkatasubba, S. Khan, S. Singh, H. V . Simhadri, and J. Vedurada, “Bang: Billion-scale approximate nearest neighbour search using a single gpu,”IEEE Transactions on Big Data, vol. 11, no. 6, pp. 3142–3157, 2025. [Online]. Available: https://doi.org/10.1109/TBDATA.2025.3581085
-
[22]
Gpu-accelerated algorithms for graph vector search: Taxonomy, empirical study, and research directions,
Y . Liu, X. Chen, A. Tian, H. Li, Q. Li, X. Zhang, A. Zhou, C. J. Zhang, Q. Li, and L. Chen, “Gpu-accelerated algorithms for graph vector search: Taxonomy, empirical study, and research directions,”
-
[23]
Available: https://doi.org/10.48550/arXiv.2602.16719
[Online]. Available: https://doi.org/10.48550/arXiv.2602.16719
-
[24]
TOP500 list,
TOP500.org, “TOP500 list,” https://top500.org/, [Accessed 20-Mar- 2026]
2026
-
[25]
Dissecting cpu-gpu unified physical memory on amd mi300a apus,
J. Wahlgren, G. Schieffer, R. Shi, E. A. Le ´on, R. Pearce, M. Gokhale, and I. Peng, “Dissecting cpu-gpu unified physical memory on amd mi300a apus,” in2025 IEEE International Symposium on Workload Characterization (IISWC), 2025, pp. 368–380. [Online]. Available: https://doi.org/10.1109/IISWC66894.2025.00038
-
[26]
M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J ´egou, “The faiss library,”IEEE Transactions on Big Data, vol. 12, no. 2, pp. 346–361, 2026. [Online]. Available: https://doi.org/10.1109/TBDATA.2025.3618474
-
[27]
Mapreduce: simplified data processing on large clusters,
J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,”Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008. [Online]. Available: https://dl.acm.org/doi/10.1145/1327452.1327492
-
[28]
Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures,
W. Dong, C. Moses, and K. Li, “Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures,” inProceedings of the 20th International Conference on World Wide Web, ser. WWW ’11. New York, NY , USA: Association for Computing Machinery, 2011, pp. 577–586. [Online]. Available: https://doi.org/10.1145/1963405.1963487
-
[29]
Efficient k-nearest neighbor graph construction using mapreduce for large-scale data sets,
T. W ARASHINA, K. AOY AMA, H. SAW ADA, and T. HATTORI, “Efficient k-nearest neighbor graph construction using mapreduce for large-scale data sets,”IEICE Transactions on Information and Systems, vol. E97.D, no. 12, pp. 3142–3154, 2014. [Online]. Available: https://doi.org/10.1587/transinf.2014EDP7108
-
[30]
S.-H. Kim and H.-M. Park, “Efficient distributed approximate k-nearest neighbor graph construction by multiway random division forest,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 1097–1106. [Online]. Available: https://doi.org/10.1...
-
[31]
Towards a massive-scale distributed neighborhood graph construction,
K. Iwabuchi, T. Steil, B. Priest, R. Pearce, and G. Sanders, “Towards a massive-scale distributed neighborhood graph construction,” inProceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, ser. SC-W ’23. New York, NY , USA: Association for Computing Machinery, 2023, pp. 730–738. [...
-
[32]
Neo-dnnd: Communication-optimized distributed nearest neighbor graph construction,
K. Iwabuchi, T. Steil, B. W. Priest, R. Pearce, and G. Sanders, “Neo-dnnd: Communication-optimized distributed nearest neighbor graph construction,” inSC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2024, pp. 688–696. [Online]. Available: https://doi.org/10.1109/SCW63240.2024.00096
-
[33]
Scalable knn graph construction for heterogeneous architectures,
W. Ruys, A. Ghafouri, C. Chen, and G. Biros, “Scalable knn graph construction for heterogeneous architectures,”ACM Trans. Parallel Comput., vol. 12, no. 3, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3733610
-
[34]
Fast scalable approximate nearest neighbor search for high-dimensional data,
K. G. Renga Bashyam and S. Vadhiyar, “Fast scalable approximate nearest neighbor search for high-dimensional data,” in2020 IEEE International Conference on Cluster Computing (CLUSTER), 2020, pp. 294–302. [Online]. Available: https://doi.org/10.1109/CLUSTER49012.2020.00040
-
[35]
The diskann library: Graph-based indices for fast, fresh and filtered vector search,
R. Krishnaswamy, R. Krishnaswamy, M. Manohar, and H. Simhadri, “The diskann library: Graph-based indices for fast, fresh and filtered vector search,”IEEE Data Eng. Bull., vol. 48, pp. 20–42, December 2024. [Online]. Available: https://www.microsoft.com/en- us/research/publication/the-diskann-library-graph-based-indices-for-fast- fresh-and-filtered-vector-search/
2024
-
[36]
cuVS: Vector Search and Clustering on the GPU,
“cuVS: Vector Search and Clustering on the GPU,” https://github.com/rapidsai/cuvs, [Accessed 25-Mar-2026]
2026
-
[37]
hipVS: GPU-accelerated vector search for AMD GPUs,
“hipVS: GPU-accelerated vector search for AMD GPUs,” https://github.com/ROCm-DS/hipVS, [Accessed 25-Mar-2026]
2026
-
[38]
M. Wang, X. Xu, Q. Yue, and Y . Wang, “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,”Proc. VLDB Endow., vol. 14, no. 11, pp. 1964–1978, jul 2021. [Online]. Available: https://doi.org/10.14778/3476249.3476255
-
[39]
Fast approximate nearest neighbor search with the navigating spreading-out graph,
C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,”Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, jan 2019. [Online]. Available: https://doi.org/10.14778/3303753.3303754
-
[40]
GitHub - lmcinnes/pynndescent: A Python nearest neighbor descent for approximate nearest neighbors — github.com,
PyNNDescent, “GitHub - lmcinnes/pynndescent: A Python nearest neighbor descent for approximate nearest neighbors — github.com,” https://github.com/lmcinnes/pynndescent, [Accessed 24-Feb-2026]
2026
-
[41]
The communication challenge for MPP: Intel Paragon and Meiko CS-2,
R. W. Hockney, “The communication challenge for MPP: Intel Paragon and Meiko CS-2,”Parallel Computing, vol. 20, no. 3, pp. 389–398, 1994. [Online]. Available: https://doi.org/10.1016/S0167-8191(06)80021-9
-
[42]
Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,
M. Aum ¨uller, E. Bernhardsson, and A. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Information Systems, vol. 87, p. 101374, 2020. [Online]. Available: https://doi.org/10.1016/j.is.2019.02.006
-
[43]
H. Xiao, K. Rasul, and R. V ollgraf. (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. [Online]. Available: https://doi.org/10.48550/arXiv.1708.07747
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1708.07747 2017
-
[44]
D. Newman, “Bag of Words,” UCI Machine Learning Repository, 2008, DOI: https://doi.org/10.24432/C5ZG6P
-
[45]
Product Quantization for Nearest Neighbor Search,
H. J ´egou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, Jan. 2011. [Online]. Available: https://doi.org/10.1109/TPAMI.2010.57
-
[46]
GloVe: Global vectors for word representation,
J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” inProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), A. Moschitti, B. Pang, and W. Daelemans, Eds. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. [Online]. Available: http://doi.org/10.3...
-
[47]
Efficient indexing of billion-scale datasets of deep descriptors,
A. B. Yandex and V . Lempitsky, “Efficient indexing of billion-scale datasets of deep descriptors,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2055–2063. [Online]. Available: https://doi.org/10.1109/CVPR.2016.226
-
[48]
Searching in one billion vectors: Re-rank with source coding,
H. J ´egou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in one billion vectors: Re-rank with source coding,” in2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 861–864. [Online]. Available: https://doi.org/10.1109/ICASSP.2011.5946540
-
[49]
OpenAI, “GPT5,” https://openai.com/gpt-5/, [Accessed 5-Apr-2026]
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.