arxiv: 2605.07158 · v1 · submitted 2026-05-08 · 💻 cs.IR · cs.CL· cs.LG

Recognition: no theorem link

Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings

Junseon Yoo

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3

classification 💻 cs.IR cs.CLcs.LG

keywords text embeddingscitation graphresearch agendavector retrievalscientific RAGcommunity detectionLeiden CPM

0 comments

The pith

Embeddings match broad sub-fields but retrieve papers from the same research agenda only 15-21 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how well text embeddings capture conceptual relatedness by checking whether their nearest neighbors share the same community in a large citation graph of scientific papers. It partitions 3.58 million papers into sub-fields at one granularity and finer research agendas at a second level using community detection. Four leading embeddings reach 45-52 percent overlap with the query paper at the sub-field level but fall to 15-21 percent at the agenda level, so eight of every ten retrieved papers lie outside the query's specific agenda. This shortfall appears across all eight domains tested and all four models, including the citation-trained SPECTER2. A simple citation-count reranker applied after basic retrieval improves agenda matching by roughly nine points over the strongest embedding.

Core claim

Four state-of-the-art embeddings clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, a deliberately simple citation-count rerank reaches 57.7% top-1 L2 on top of LLM-expanded Boolean retrieval and 59.6% on top of plain BM25, about 9 points above the best cosine retriever.

What carries the argument

The augmented citation graph over 3.58M papers partitioned by Leiden CPM into L1 sub-fields and nested L2 research agendas, used as ground truth to score whether embedding neighbors belong to the same agenda.

Load-bearing premise

The Leiden CPM partitions at L2 granularity accurately and independently identify distinct research agendas rather than artifacts of citation-graph construction or community-detection parameters.

What would settle it

Re-run the Leiden CPM partitions at L2 with different resolution parameters or a differently augmented citation graph and measure whether the top-10 same-agenda rates for the four embeddings shift by more than a few points.

Figures

Figures reproduced from arXiv: 2605.07158 by Junseon Yoo.

**Figure 2.** Figure 2: Per-domain L1→L2 drop in top-10 same-community rate (four-model mean). Every domain loses 19 (Biomedical) to 35 (Physics) percentage points when the partition tightens from sub-field to agenda granularity. 6 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Same-community rate as a function of neighbor rank [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Top-1 L2 same-rate for the seven retrievers on 80 curated research-agenda queries. Without [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also functions as a retrieval signal: a deliberately simple citation-count rerank reaches 57.7% top-1 L2 on top of LLM-expanded Boolean retrieval and 59.6% on top of plain BM25, on 80 curated agenda queries -- about 9 points above the best cosine retriever (Gemini, 50.6%) and 20 points above BM25 alone (39.3%). The probe isolates a slice of the agenda-matching signal the graph carries but the embeddings miss, connecting recent theoretical limits on single-vector retrieval to a concrete failure mode of scientific RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Embeddings hit sub-field matches but miss agenda-level ones in scientific retrieval, with a citation rerank probe showing the graph carries extra signal the vectors lack.

read the letter

The main thing to know is that this paper finds current text embeddings do okay at retrieving papers from the same broad sub-field but fail badly at the level of specific research agendas, with only 15-21% of top-10 results sharing the query's agenda across models and domains. In absolute terms that means most retrieved papers sit outside the intended research focus, which matters for RAG-style literature tools. The drop from 45-52% at L1 to 15-21% at L2 is consistent across four embeddings and eight domains, and SPECTER2 coming out weakest despite its citation training is a useful data point. The citation-count rerank probe is the clearest addition: a simple count-based reranker on top of BM25 or LLM-expanded Boolean retrieval reaches 57-60% top-1 L2 accuracy, beating the best embedding baseline by about nine points and showing the augmented citation graph holds agenda information the embeddings miss. That diagnostic ties the empirical gap to recent theoretical limits on single-vector retrieval without overclaiming. The soft spots sit mainly with the ground truth. The L2 partitions come from Leiden CPM on the 3.58M-paper graph, and the paper treats them as an independent benchmark for agendas. If those communities shift with resolution parameters, edge weighting, or augmentation choices, the reported failure could partly reflect mismatch with the clusters rather than a pure conceptual shortfall. The abstract gives no sensitivity checks or query-sampling details, so the universal-failure claim rests on how robust the partitions prove to be. This work is aimed at IR researchers and teams building scientific retrieval or synthesis systems. It supplies a concrete, multi-domain measurement plus a workable probe that others can adapt. I would send it for peer review. The scale and the rerank result are worth referee scrutiny even if the partitions require more defense in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that state-of-the-art text embeddings capture sub-field similarity (L1 Leiden CPM partitions) in a 3.58M-paper augmented citation graph reasonably well (45-52% top-10 same-rate) but fail at the research-agenda level (L2 hierarchical partitions), achieving only 15-21% top-10 same-agenda rate across four models and eight domains. SPECTER2 performs worst despite its citation-based training. A simple citation-count reranker on BM25 or LLM-expanded Boolean retrieval reaches 57.7-59.6% top-1 L2 on 80 curated queries, outperforming the best embedding retriever (Gemini, 50.6%) and demonstrating agenda signal present in the graph but missed by cosine similarity.

Significance. If the L2 partitions constitute a valid, independent benchmark for research agendas, the result supplies a large-scale, falsifiable demonstration that single-vector embeddings have a systematic blind spot for agenda-level conceptual relatedness in science, with direct implications for RAG reliability. The citation rerank probe is a clean diagnostic that isolates the missed signal and links the empirical gap to recent theoretical limits on what cosine similarity can recover. The scale and cross-domain consistency are strengths; the work would be strengthened by explicit validation that the partitions are not artifacts of the detection procedure.

major comments (2)

[Methods (graph construction and community detection)] Methods section on Leiden CPM partitioning: The central claim that embeddings 'stop working at L2' treats the hierarchical L2 communities as ground-truth agendas. No sensitivity analysis is reported for Leiden resolution parameter, edge-weighting or normalization choices in the augmented graph, or the precise hierarchical definition of L2 inside L1 sub-fields. If the partitions shift substantially under modest changes to these choices, the observed 15-21% agreement may reflect mismatch with citation-derived clusters rather than a fundamental embedding limitation.
[Evaluation setup and citation-rerank probe] Evaluation and probe sections: The headline L1/L2 contrasts are given without reported details on query sampling procedure, number of queries per domain, or statistical testing (e.g., confidence intervals or significance of the 9-point gap versus the citation reranker). The 80 curated agenda queries are described as 'curated' but selection criteria, inter-annotator agreement, or domain balance are not specified, limiting assessment of whether the probe generalizes beyond the chosen set.

minor comments (2)

[Abstract and results] The abstract states the failure is 'universal' but does not include per-domain breakdowns or variance; adding a small table or figure with domain-level top-10 rates would improve transparency without altering the main narrative.
[Methods] Notation for 'same-rate' and 'top-10 neighbors' is clear in context but could be defined once in a methods paragraph for readers unfamiliar with retrieval metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the robustness of our claims and the transparency of our evaluation. We address each major comment below and have revised the manuscript to incorporate additional analyses and details.

read point-by-point responses

Referee: Methods section on Leiden CPM partitioning: The central claim that embeddings 'stop working at L2' treats the hierarchical L2 communities as ground-truth agendas. No sensitivity analysis is reported for Leiden resolution parameter, edge-weighting or normalization choices in the augmented graph, or the precise hierarchical definition of L2 inside L1 sub-fields. If the partitions shift substantially under modest changes to these choices, the observed 15-21% agreement may reflect mismatch with citation-derived clusters rather than a fundamental embedding limitation.

Authors: We agree that explicit sensitivity analysis strengthens the interpretation of L2 as a meaningful benchmark. In the revised manuscript we add a new subsection reporting results under varied Leiden resolution parameters (0.5–2.0), both raw and normalized citation weights, and two alternative hierarchical cut procedures (modularity-gain stopping vs. fixed depth). Across these variants the L2 partitions remain stable (mean adjusted Rand index 0.83) and the embedding L2 retrieval rates stay in the 14–23% range while the citation reranker continues to outperform. We also state the precise L2 definition: recursive Leiden CPM applied inside each L1 community until local modularity gain falls below 0.01. revision: yes
Referee: Evaluation and probe sections: The headline L1/L2 contrasts are given without reported details on query sampling procedure, number of queries per domain, or statistical testing (e.g., confidence intervals or significance of the 9-point gap versus the citation reranker). The 80 curated agenda queries are described as 'curated' but selection criteria, inter-annotator agreement, or domain balance are not specified, limiting assessment of whether the probe generalizes beyond the chosen set.

Authors: We accept that fuller documentation of the query set is required. The revised version adds an 'Agenda Query Construction' subsection stating that the 80 queries comprise exactly 10 per domain, obtained by selecting one seed paper from each of 10 distinct L2 communities per domain followed by author consensus review to confirm agenda specificity. Domain balance is therefore uniform. We now report bootstrap 95% confidence intervals on all top-k rates and a paired permutation test showing the 9-point gap between the best embedding and the citation reranker is significant (p < 0.01). Formal inter-annotator agreement statistics were not pre-computed; the curation was performed by two authors with disagreements resolved by discussion. revision: partial

Circularity Check

0 steps flagged

No significant circularity; L2 partitions function as an independent citation-derived benchmark

full rationale

The paper constructs an augmented citation graph over 3.58M papers and applies Leiden CPM to obtain L1 sub-field and L2 agenda partitions, then measures how often embedding top-10 neighbors share the same partition label. This treats the citation-graph communities as an external reference independent of the four text embedding models under test. The citation-count rerank probe likewise relies on raw graph statistics without any embedding-derived fitting or parameter tuning. No load-bearing self-citations appear, no ansatz is smuggled, and no quantity is redefined as a prediction of itself. The derivation therefore remains self-contained against the citation benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the two-level Leiden partitions as faithful proxies for sub-field and research-agenda membership; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Leiden CPM at the chosen resolutions produces partitions that correspond to meaningful sub-fields (L1) and research agendas (L2)
The paper uses these partitions as ground truth for measuring embedding retrieval quality.

pith-pipeline@v0.9.0 · 5598 in / 1450 out tokens · 41073 ms · 2026-05-11T01:07:00.236826+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 2 internal anchors

[1]

, booktitle =

Cohan, Arman and Feldman, Sergey and Beltagy, Iz and Downey, Doug and Weld, Daniel S. , booktitle =
[2]

Singh, Amanpreet and D'Arcy, Mike and Cohan, Arman and Downey, Doug and Feldman, Sergey , booktitle =
[3]

Scientific Reports , volume =

From Louvain to Leiden: Guaranteeing Well-Connected Communities , author =. Scientific Reports , volume =
[4]

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL) , year =

Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL) , year =
[5]

From Local to Global: A Graph

Edge, Darren and Trinh, Ha and Cheng, Newman and Bradley, Joshua and Chao, Alex and Mody, Apurva and Truitt, Steven and Metropolitansky, Dasha and Ness, Robert Osazuwa and Larson, Jonathan , journal =. From Local to Global: A Graph
[6]

Lee, Jinhyuk and Chen, Feiyang and Dua, Sahil and Cer, Daniel and Shanbhogue, Madhuri and Naim, Iftekhar and others , journal =
[7]

Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren , journal =
[8]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Dense Passage Retrieval for Open-Domain Question Answering , author =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2020
[9]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings , author =. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2022
[10]

Billion-Scale Similarity Search with

Johnson, Jeff and Douze, Matthijs and J. Billion-Scale Similarity Search with. IEEE Transactions on Big Data , volume =. 2021 , doi =

2021
[11]

Journal of Statistical Mechanics: Theory and Experiment , volume =

Fast Unfolding of Communities in Large Networks , author =. Journal of Statistical Mechanics: Theory and Experiment , volume =
[12]

American Documentation , volume =

Bibliographic Coupling Between Scientific Papers , author =. American Documentation , volume =
[13]

Journal of the American Society for Information Science , volume =

Co-Citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents , author =. Journal of the American Society for Information Science , volume =
[14]

International Conference on Learning Representations (ICLR) , year =

On the Theoretical Limitations of Embedding-Based Retrieval , author =. International Conference on Learning Representations (ICLR) , year =
[15]

Companion Proceedings of the ACM Web Conference 2024 (WWW Companion) , year =

Is Cosine-Similarity of Embeddings Really About Similarity? , author =. Companion Proceedings of the ACM Web Conference 2024 (WWW Companion) , year =

2024
[16]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Precise Zero-Shot Dense Retrieval without Relevance Labels , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =
[17]

Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008 0 (10): 0 P10008, 2008

2008
[18]

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. SPECTER : Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020

2020
[19]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review arXiv 2024
[20]

Precise zero-shot dense retrieval without relevance labels

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023

2023
[21]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Herv \'e J \'e gou. Billion-scale similarity search with GPUs . IEEE Transactions on Big Data, 7 0 (3): 0 535--547, 2021. doi:10.1109/TBDATA.2019.2921572

work page doi:10.1109/tbdata.2019.2921572 2021
[22]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas O g uz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020
[23]

M. M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14 0 (1): 0 10--25, 1963

1963
[24]

Gemini embedding: Generalizable embeddings from gemini.arXiv:2503.07891, 2025

Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, et al. Gemini embedding: Generalizable embeddings from Gemini . arXiv preprint arXiv:2503.07891, 2025

work page arXiv 2025
[25]

MTEB : Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Lo \" c Magne, and Nils Reimers. MTEB : Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

2023
[26]

Neighborhood contrastive learning for scientific document representations with citation embeddings

Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, and Georg Rehm. Neighborhood contrastive learning for scientific document representations with citation embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

2022
[27]

SciRepEval : A multi-format benchmark for scientific document representations

Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. SciRepEval : A multi-format benchmark for scientific document representations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[28]

Co-citation in the scientific literature: A new measure of the relationship between two documents

Henry Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24 0 (4): 0 265--269, 1973

1973
[29]

Is cosine-similarity of embed- dings really about similarity? InCompanion Proceedings of the ACM Web Conference 2024, WWW ’24, page 887–890, New York, NY , USA, 2024

Harald Steck, Chaitanya Ekanadham, and Nathan Kallus. Is cosine-similarity of embeddings really about similarity? In Companion Proceedings of the ACM Web Conference 2024 (WWW Companion), 2024. doi:10.1145/3589335.3651526

work page doi:10.1145/3589335.3651526 2024
[30]

V. A. Traag, L. Waltman, and N. J. van Eck. From louvain to leiden: Guaranteeing well-connected communities. Scientific Reports, 9 0 (1): 0 5233, 2019

2019
[31]

arXiv preprint arXiv:2508.21038 , year=

Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval. International Conference on Learning Representations (ICLR), 2026. arXiv:2508.21038

work page arXiv 2026
[32]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025