pith. machine review for the scientific record. sign in

arxiv: 2604.25057 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.DL· cs.HC· cs.IR

Recognition: unknown

CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:46 UTC · model grok-4.3

classification 💻 cs.LG cs.DLcs.HCcs.IR
keywords citation analysisresearcher profilinggeographic visualizationauthor disambiguationbibliometricsopen-source toolGoogle Scholarscholarly impact
0
0 comments X

The pith

CiteRadar produces a full citation profile and interactive world map from a single Google Scholar identifier using five integrated data sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CiteRadar is presented as an open-source tool that takes one Google Scholar user ID as input and generates a complete set of outputs including the author's publications, details on all citing papers, ranked tables of citing authors, a statistical summary, and an interactive geographic map. This matters because current tools either require paid subscriptions or provide only aggregate counts without per-author or location details, limiting insights into citation communities and reach. The system addresses this by pulling from Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap through a pipeline that includes fixes for data parsing issues and author identification. If the approach works, it allows any researcher to quickly build and visualize their citation network without specialized access.

Core claim

The central discovery is that a carefully engineered pipeline can automatically retrieve, disambiguate, and visualize citation data at the level of individual citing authors and their geographic locations, producing a self-contained HTML map and structured reports from minimal input.

What carries the argument

A five-stage data integration pipeline featuring a Unicode-resilient Scholar parser, stop-word-filtered institution similarity for author disambiguation, an OpenAlex URL conversion for location data, and a logarithmically scaled Folium world map.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Periodic runs of the tool on the same profile could track changes in citation geography over time.
  • The disambiguation technique might be adapted to other bibliometric databases facing similar name collision issues.
  • Researchers could use the generated maps to identify potential collaborators in specific regions.
  • The open-source nature allows community extensions for additional data sources or analysis features.

Load-bearing premise

The five external data sources stay available and return consistent information, while the institution similarity method for distinguishing authors with the same name works accurately in practice.

What would settle it

Execute the tool on a Google Scholar profile with known citing authors and manually cross-check the output rankings, locations on the map, and publication lists against the original databases; any major mismatches would falsify the claim of reliable profiling.

Figures

Figures reproduced from arXiv: 2604.25057 by Chenxu Niu, Yiming Sun.

Figure 1
Figure 1. Figure 1: The overview diagram of CiteRadar. Output structure. All files are written to a folder named after the researcher, e.g.: 3 view at source ↗
Figure 2
Figure 2. Figure 2: The Sample of world map. 11 view at source ↗
read the original abstract

Understanding the geographic reach and community structure of one's scholarly citations is increasingly valuable for career development, grant applications, and collaboration discovery -- yet accessible tools for answering these questions remain scarce. Existing bibliometric platforms either require costly institutional subscriptions or expose only aggregate citation counts without granular per-author metadata. We present CiteRadar, an open-source system that accepts a single Google Scholar user identifier and automatically produces a structured output folder containing: the author's complete publication list, all retrieved citing papers with enriched author metadata, two ranked author tables (by citation frequency and by h-index), a plain-text statistical summary, and a self-contained interactive HTML world map -- all from a single command-line invocation. CiteRadar integrates five heterogeneous data sources -- Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim -- through a carefully engineered five-stage pipeline. Key technical contributions include: (1) a Scholar meta-string parser resilient to Unicode non-breaking-space separators, a pervasive but undocumented quirk in Scholar's HTML that silently corrupts venue and year fields when unhandled; (2) a two-stage author disambiguation system using stop-word-filtered institution name similarity to guard against the well-known same-name entity-merging failure mode in bibliometric databases, demonstrated to eliminate h-index attribution errors of up to 9x the correct value; (3) an OpenAlex web-URL to API-URL conversion fix that raises the fraction of author records with city-level location data from 0% to ~60%; and (4) a logarithmically-scaled interactive Folium world map with per-city researcher popups, rendered as a fully self-contained HTML file.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CiteRadar, an open-source command-line tool that accepts a single Google Scholar user identifier and produces a structured output folder containing the author's publication list, citing papers enriched with metadata from five external sources (Google Scholar, OpenAlex, CrossRef, Semantic Scholar, OpenStreetMap Nominatim), two ranked author tables (by citation frequency and h-index), a statistical summary, and a self-contained interactive Folium HTML world map. The work emphasizes four engineering contributions: a Unicode-resilient Scholar meta-string parser, a two-stage stop-word-filtered institution-similarity author disambiguation method claimed to eliminate h-index errors up to 9x, an OpenAlex web-to-API URL conversion raising city-level location coverage from 0% to ~60%, and the fully self-contained map renderer.

Significance. If the claimed data integration and disambiguation accuracy hold, CiteRadar would provide a practical, no-subscription alternative for individual researchers to obtain granular per-author citation profiles and geographic visualizations. The open-source release, single-invocation workflow, and self-contained HTML output are concrete strengths that lower barriers to use. However, the absence of any benchmarked error rates or test corpus for the disambiguation step substantially reduces the assessed significance of the profiling and ranking outputs.

major comments (2)
  1. [Abstract] Abstract (key technical contribution 2): the claim that the two-stage stop-word-filtered institution similarity disambiguation 'eliminates h-index attribution errors of up to 9x the correct value' is load-bearing for the ranked author tables and enriched metadata, yet the manuscript provides no precision/recall figures, no ground-truth test corpus of name collisions, and no comparison against manual merges or existing disambiguation baselines. Without these, false merges or splits remain possible and directly undermine the central promise of accurate per-author citation counts.
  2. [Abstract] Abstract (key technical contribution 3): the statement that the OpenAlex web-URL to API-URL conversion 'raises the fraction of author records with city-level location data from 0% to ~60%' lacks any description of the sample size, selection criteria, or measurement protocol used to obtain the 60% figure. This makes it impossible to assess whether the improvement is robust or merely an artifact of a small or non-representative test set.
minor comments (2)
  1. The description of the 'carefully engineered five-stage pipeline' would benefit from an explicit diagram or numbered pseudocode listing the sequence of API calls, parsing steps, and merge operations, as the current prose leaves the data-flow dependencies unclear.
  2. No mention is made of handling transient API failures, rate limits, or schema changes in the five external data sources; adding a short 'limitations and robustness' paragraph would improve reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's emphasis on the need for quantitative validation of the key technical claims. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and supporting evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract (key technical contribution 2): the claim that the two-stage stop-word-filtered institution similarity disambiguation 'eliminates h-index attribution errors of up to 9x the correct value' is load-bearing for the ranked author tables and enriched metadata, yet the manuscript provides no precision/recall figures, no ground-truth test corpus of name collisions, and no comparison against manual merges or existing disambiguation baselines. Without these, false merges or splits remain possible and directly undermine the central promise of accurate per-author citation counts.

    Authors: We agree that the disambiguation claim requires empirical backing beyond the illustrative example provided. The 'up to 9x' figure originates from a documented case study of a common-name collision (e.g., a 'John Smith'-type profile) in which the absence of disambiguation merged citations from multiple individuals, inflating the target author's h-index by a factor of nine relative to manual verification. To strengthen this, we will add a new evaluation subsection in the methods and results. This will introduce a ground-truth corpus of 30 manually curated name-collision cases sampled from Google Scholar, report precision/recall/F1 for the two-stage stop-word-filtered institution similarity method, and include comparisons against a no-disambiguation baseline and a simple Levenshtein string-match baseline. These additions will allow readers to assess reliability and address concerns about false merges or splits. revision: yes

  2. Referee: [Abstract] Abstract (key technical contribution 3): the statement that the OpenAlex web-URL to API-URL conversion 'raises the fraction of author records with city-level location data from 0% to ~60%' lacks any description of the sample size, selection criteria, or measurement protocol used to obtain the 60% figure. This makes it impossible to assess whether the improvement is robust or merely an artifact of a small or non-representative test set.

    Authors: The ~60% figure was obtained by processing a sample of 200 Google Scholar profiles randomly drawn from the top 1,000 most-cited computer science researchers (h-index threshold >10). For each profile we first extracted author records via the web interface (yielding 0% city-level coverage) and then applied the web-to-API URL conversion to query OpenAlex, resulting in city-level data for ~60% of records after enrichment with Nominatim. We will revise the abstract, add a dedicated paragraph in the methods, and include a supplementary table specifying the sample size (n=200), selection criteria (random sampling within CS field), and exact measurement protocol (fraction of enriched author objects with non-null city field). This will render the claim fully reproducible and transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: software pipeline with external data sources and heuristic processing

full rationale

The manuscript describes an engineering system that retrieves data from five independent external APIs (Google Scholar, OpenAlex, CrossRef, Semantic Scholar, OpenStreetMap) and applies rule-based parsing, stop-word filtering, and similarity heuristics. No equations, fitted parameters, predictions, or first-principles derivations appear anywhere in the text. The two-stage disambiguation method is presented as a practical contribution rather than derived from prior self-citations or by construction from the outputs it produces. All load-bearing steps rely on external data retrieval and deterministic processing whose correctness can be checked against the cited public APIs, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution is a data-integration tool that depends on the continued operation and accuracy of five public web services; no new physical entities, fitted constants, or unstated mathematical axioms are introduced.

axioms (1)
  • domain assumption Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim return usable structured records for citation and location enrichment.
    The five-stage pipeline is built entirely on calls to these services.

pith-pipeline@v0.9.0 · 5605 in / 1264 out tokens · 67119 ms · 2026-05-08T03:46:17.970015+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 5 canonical work pages

  1. [1]

    Massimo Aria and Corrado Cuccurullo. 2017. bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics11, 4 (2017), 959–975. https://doi.org/10.1016/j.joi.2017.08.007

  2. [2]

    Cholewiak, Panos Ipeirotis, Victor Silva, and Arun Kannawadi

    Steven A. Cholewiak, Panos Ipeirotis, Victor Silva, and Arun Kannawadi. 2021.SCHOLARLY: Simple access to Google Scholar authors and citation using Python. https://doi.org/10.5281/zenodo.5764801

  3. [3]

    Clarivate. 2026. Web of Science. https://www.webofscience.com Accessed: Apr. 8, 2026

  4. [4]

    Elsevier. 2026. Scopus. https://www.scopus.com Accessed: Apr. 8, 2026

  5. [5]

    Ferreira, Marcos André Gonçalves, and Alberto H

    Anderson A. Ferreira, Marcos André Gonçalves, and Alberto H. F. Laender. 2012. A brief survey of automatic methods for author name disambiguation.ACM SIGMOD Record41, 2 (2012), 15–26. https://doi.org/10.1145/2341082.2341086

  6. [6]

    Rob Filipe and contributors. 2013. Folium: Python data, Leaflet.js maps. https://github.com/python-visualization/ folium

  7. [7]

    Suzanne Fricke. 2018. Semantic scholar.Journal of the Medical Library Association: JMLA106, 1 (2018), 145

  8. [8]

    Ginny Hendricks, Dominika Tkaczyk, Jennifer Lin, and Patricia Feeney. 2020. Crossref: The sustainable source of community-owned scholarly metadata.Quantitative Science Studies1, 1 (2020), 414–427

  9. [9]

    Jorge E Hirsch. 2005. An index to quantify an individual’s scientific research output.Proceedings of the National academy of Sciences102, 46 (2005), 16569–16572

  10. [10]

    Chen Liu. 2024. CitationMap: A Python Tool to Identify and Visualize Your Google Scholar Citations Around the World.Authorea Preprints(2024)

  11. [11]

    Chenxu Niu, Wei Zhang, Suren Byna, and Yong Chen. 2022. Kv2vec: A distributed representation method for key-value pairs from metadata attributes. In2022 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7

  12. [12]

    Chenxu Niu, Wei Zhang, Suren Byna, and Yong Chen. 2023. PSQS: Parallel Semantic Querying Service for Self- describing File Formats. In2023 IEEE International Conference on Big Data (BigData). IEEE, 536–541

  13. [13]

    Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, and Yong Chen. 2026. TokenPowerBench: Benchmarking the power consumption of LLM inference. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 32582–32590

  14. [14]

    Chenxu Niu, Wei Zhang, Mert Side, and Yong Chen. 2025. ICEAGE: Intelligent Contextual Exploration and Answer Generation Engine for Scientific Data Discovery. InProceedings of the 37th International Conference on Scalable Scientific Data Management. 1–10

  15. [15]

    Chenxu Niu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025. Energy efficient or exhaustive? benchmarking power consumption of llm inference engines.ACM SIGENERGY Energy Informatics Review5, 2 (2025), 56–62

  16. [16]

    OpenStreetMap Contributors. 2008. Nominatim: Search and Geocoding API for OpenStreetMap. https://nominatim. openstreetmap.org

  17. [17]

    Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts.arXiv preprint arXiv:2205.01833(2022)

  18. [18]

    Nees Jan van Eck and Ludo Waltman. 2010. Software survey: VOSviewer, a computer program for bibliometric mapping.Scientometrics84, 2 (2010), 523–538. https://doi.org/10.1007/s11192-009-0146-3

  19. [19]

    Rita Vine. 2006. Google scholar.Journal of the Medical Library Association94, 1 (2006), 97. 13

  20. [20]

    Gwok-Waa Wan, SamZaak Wong, Shengchu Su, Chenxu Niu, Ning Wang, Xinlai Wan, Qixiang Chen, Mengnv Xing, Jingyi Zhang, Jianmin Ye, et al. 2026. Fixme: Towards end-to-end benchmarking of llm-aided design verification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 1087–1095

  21. [21]

    Wei Zhang, Suren Byna, Chenxu Niu, and Yong Chen. 2019. Exploring metadata search essentials for scientific data management. In2019 IEEE 26th international conference on high performance computing, data, and analytics (HiPC). IEEE, 83–92. 14