WebKnoGraph: GNN-Powered Internal Linking
Pith reviewed 2026-06-27 23:29 UTC · model grok-4.3
The pith
Automatic link selection via GraphSAGE yields higher authority redistribution than expert-assisted selection, at the cost of semantic coherence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework models a website as a directed graph, represents pages by embeddings, scores candidate links with GraphSAGE, and evaluates interventions by embedding the site into larger host environments. On the Kalicube.com crawl, automatic selection generally produces stronger authority redistribution with higher Authority Yield, but also larger semantic coherence costs. Expert-assisted selection better preserves semantic coherence and, when targeting low-PageRank pages, achieves the highest Authority Yield, although with the least favorable loss-gain balance.
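As a rough illustration of the scoring step, the sketch below runs one GraphSAGE-style mean-aggregation layer over page embeddings and scores a candidate link from the resulting vectors. It is not the paper's implementation: the toy pages, the random (untrained) weight matrix, the choice of neighbourhood, and the sigmoid-of-dot-product link score are all assumptions made for this example.

```python
import math
import random


def mean_vec(vectors, dim):
    """Element-wise mean of the vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sage_layer(features, neighbors, weight):
    """One GraphSAGE-style layer with a mean aggregator.

    features:  {page: list[float]} input embeddings of size d
    neighbors: {page: list[page]}  neighbourhood to aggregate (assumed here)
    weight:    out_dim x 2d matrix applied to concat(self, mean(neighbours))
    """
    dim = len(next(iter(features.values())))
    hidden = {}
    for page, h_self in features.items():
        h_neigh = mean_vec([features[u] for u in neighbors.get(page, [])], dim)
        concat = h_self + h_neigh
        # Linear map followed by ReLU.
        z = [max(0.0, sum(w * x for w, x in zip(row, concat))) for row in weight]
        # L2-normalise; an all-zero vector is left as is.
        norm = math.sqrt(sum(x * x for x in z)) or 1.0
        hidden[page] = [x / norm for x in z]
    return hidden


def link_score(hidden, src, dst):
    """Assumed decoder: sigmoid of the dot product of the two page vectors."""
    dot = sum(a * b for a, b in zip(hidden[src], hidden[dst]))
    return 1.0 / (1.0 + math.exp(-dot))


# Hypothetical three-page site with 2-d embeddings.
features = {"home": [1.0, 0.0], "guide": [0.8, 0.2], "blog": [0.1, 0.9]}
neighbors = {"home": ["guide", "blog"], "guide": ["home"], "blog": ["home"]}

# Untrained weights; a real model would learn these from existing links.
rng = random.Random(0)
weight = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]

hidden = sage_layer(features, neighbors, weight)
score = link_score(hidden, "guide", "blog")
print(f"candidate guide -> blog score: {score:.3f}")
```

Because the hidden vectors are non-negative and unit length, this particular decoder only produces scores between 0.5 and about 0.73; a trained model would typically use a learned decoder with a wider range.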
What carries the argument
The WebKnoGraph framework, which models the target site as a directed graph, scores candidate links with GraphSAGE, and measures authority and coherence after embedding the site inside a FineWeb-based or Barabási-Albert host graph.
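The host-graph idea can be sketched in a few lines: build a synthetic host, attach the site to it, and compare PageRank before and after adding one candidate link. Everything concrete here is an assumption for illustration, including the simplified directed Barabási-Albert generator, the four-page site, the two bridge links, and the damping factor; the paper's own construction and metrics may differ.

```python
import random


def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; rank held by dangling nodes is spread uniformly."""
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not out[v])
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {v: base for v in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
        rank = nxt
    return rank


def ba_host_edges(n, m, rng):
    """Simplified directed Barabási-Albert-style host graph: every new node
    links to m distinct earlier nodes picked in proportion to their degree."""
    edges, pool = [], list(range(m))
    for new in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        ordered = sorted(targets)
        edges.extend((("host", new), ("host", t)) for t in ordered)
        pool.extend(ordered)
        pool.extend([new] * m)
    return edges


# Hypothetical four-page site; "orphan" has no inbound internal links.
site_edges = [("home", "a"), ("home", "b"), ("a", "home"),
              ("b", "home"), ("orphan", "home")]
# Hypothetical bridges between the host graph and the site.
bridges = [(("host", 0), "home"), ("home", ("host", 1))]

host_edges = ba_host_edges(n=50, m=2, rng=random.Random(0))
base_edges = host_edges + site_edges + bridges
nodes = list(dict.fromkeys(v for edge in base_edges for v in edge))

before = pagerank(nodes, base_edges)
after = pagerank(nodes, base_edges + [("home", "orphan")])  # candidate link

site_pages = ["home", "a", "b", "orphan"]
delta = {page: after[page] - before[page] for page in site_pages}
for page in site_pages:
    print(f"{page:>7}: {delta[page]:+.5f}")
```

The per-page deltas are the raw material for any authority metric; how they are combined into Authority Yield or loss-gain balance is defined in the paper, not here.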
If this is right
- Automatic selection produces stronger authority redistribution and higher overall Authority Yield.
- Expert-assisted selection preserves semantic coherence more effectively.
- Expert-assisted selection aimed at low-PageRank pages reaches the single highest Authority Yield.
- Authority Volatility supplies an additional stability signal, although different numbers of intervention sets limit direct comparison.
- A usable workflow generates candidate sets at scale, scores them jointly on the four metrics, and routes the best ones for editorial review before deployment.
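One way to score candidate sets jointly on four metrics without collapsing them into a single weighted number is to keep only the non-dominated sets and send those to editorial review. The sketch below does that with a plain Pareto filter. The metric values are made up, and the direction of each metric (in particular, that a higher loss-gain balance is better) is an assumption, not something the summary states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSet:
    name: str
    authority_yield: float    # assumed: higher is better
    volatility: float         # assumed: lower is better
    loss_gain_balance: float  # assumed: higher is better
    coherence: float          # assumed: higher is better


def objectives(c):
    """Map a candidate set to a tuple in which larger is always better."""
    return (c.authority_yield, -c.volatility, c.loss_gain_balance, c.coherence)


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    oa, ob = objectives(a), objectives(b)
    return all(x >= y for x, y in zip(oa, ob)) and any(x > y for x, y in zip(oa, ob))


def pareto_front(candidates):
    """Candidate sets that no other set dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up numbers, for illustration only.
candidates = [
    CandidateSet("auto_1", 0.9, 0.3, 0.6, 0.5),
    CandidateSet("auto_2", 0.7, 0.4, 0.5, 0.4),
    CandidateSet("expert_1", 0.6, 0.2, 0.4, 0.9),
]

review_queue = [c.name for c in pareto_front(candidates)]
print(review_queue)  # ['auto_1', 'expert_1']
```

In this toy data, auto_2 is dropped because auto_1 is at least as good on every metric, while expert_1 survives on the strength of its coherence and volatility.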
Where Pith is reading between the lines
- Hybrid pipelines that let automatic methods propose many candidates and experts pick among them could combine high yield with acceptable coherence.
- The same evaluation loop could be applied to other site modifications such as content rewrites or navigation menu changes.
- Repeating the experiments across several production sites would reveal whether the observed patterns generalize beyond one crawl.
- If live ranking data later contradict the simulated authority gains, the host-graph construction step would need revision.
Load-bearing premise
That authority and coherence measurements obtained inside the simulated FineWeb or synthetic host graphs will match the outcomes that real search engines produce once the chosen links are added to the live site.
What would settle it
Deploy the top automatic and expert-assisted link sets on the actual Kalicube.com site, then compare subsequent search-ranking shifts, traffic changes, and navigation metrics against an untouched control group.
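If such a deployment were run, one simple way to compare treated pages against the control group is a two-sided permutation test on a per-page outcome. The sketch below assumes the outcome is a per-page change in ranking position and uses invented numbers; the choice of test and outcome measure is an assumption for this example, not something the paper prescribes.

```python
import random
from statistics import mean


def permutation_test(treated, control, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns the observed difference and a p-value with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Invented per-page ranking improvements (positions gained after deployment).
treated_pages = [3.0, 2.0, 4.0, 1.0]   # pages that received new links
control_pages = [0.0, 1.0, -1.0, 0.0]  # untouched pages

observed, p_value = permutation_test(treated_pages, control_pages)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

With only a handful of pages per group the smallest attainable p-value is limited by the number of distinct splits, so a real experiment would need far more pages than this toy example.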
Original abstract
Internal link optimization is a recurring task in search engine optimization, yet many production workflows rely on manual judgment, fixed page templates, or generic tool recommendations. Practitioners need ways to evaluate candidate links before deployment because link changes can redistribute authority and affect semantic coherence in ways that are difficult to isolate after release. We present WebKnoGraph, an open-source framework for evaluating internal linking strategies on website crawls. The framework models a website as a directed graph, represents pages by embeddings, scores candidate links with GraphSAGE, and evaluates interventions by embedding the site into larger host environments. We instantiate WebKnoGraph on a production crawl of Kalicube.com and compare automatic with expert-assisted link selection in an empirical FineWeb-based host graph and a synthetic Barabási-Albert host graph, using PageRank-based authority metrics and semantic coherence. The results show that automatic selection generally produces stronger authority redistribution, with higher Authority Yield, but also larger semantic coherence costs. Expert-assisted selection better preserves semantic coherence and, when targeting low-PageRank pages, achieves the highest Authority Yield, although with the least favorable loss-gain balance. Authority Volatility provides an additional stability perspective, but is interpreted cautiously because the two regimes use different numbers of intervention sets. These findings support a practical workflow in which candidate intervention sets are generated at scale, evaluated jointly across authority gain, volatility, loss-gain balance, and semantic coherence, and then reviewed for editorial deployability before implementation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents WebKnoGraph, an open-source framework that models a website as a directed graph, uses page embeddings and GraphSAGE to score candidate internal links, and evaluates interventions by embedding the target site into larger host graphs (a FineWeb-derived empirical graph and a Barabási-Albert synthetic graph). Authority redistribution is measured via PageRank-derived metrics including Authority Yield, while semantic coherence and Authority Volatility are also tracked. On a production crawl of Kalicube.com, the empirical comparison finds that automatic link selection generally produces higher Authority Yield than expert-assisted selection, but at greater coherence cost; expert-assisted selection preserves coherence better and can achieve the highest Authority Yield when targeting low-PageRank pages.
Significance. If the proxy host graphs produce authority deltas that track real post-deployment PageRank changes, the framework would offer a practical pre-deployment evaluation workflow for internal linking that jointly considers authority gain, volatility, loss-gain balance, and semantic coherence. The open-source release and use of GNN-based scoring are concrete strengths that could support reproducibility and extension.
major comments (2)
- [Evaluation section] Evaluation section (host-graph construction and results): All quantitative comparisons of Authority Yield, coherence, and volatility rest on the assumption that embedding Kalicube.com into the FineWeb-based or Barabási-Albert host graphs yields authority deltas that generalize to real search-engine behavior after link deployment. No external validation, correlation with live crawl data, or sensitivity analysis across host-graph choices is reported, so the relative ordering of automatic vs. expert-assisted strategies cannot yet be treated as deployable evidence.
- [Results] Results paragraphs on Authority Yield and coherence: The claims that automatic selection 'generally produces stronger authority redistribution' and expert-assisted 'achieves the highest Authority Yield when targeting low-PageRank pages' are presented without reported dataset sizes for the intervention sets, error bars, or statistical tests. This makes it impossible to determine whether the observed differences are robust or merely artifacts of the particular crawl and graph realizations.
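The missing error bars could be approximated by repeating the simulation over several host-graph realizations and bootstrapping the difference between strategies. The sketch below shows a percentile bootstrap for a difference in mean Authority Yield. The per-seed values are invented, and treating host-graph seeds as independent replicates is an assumption of this example.

```python
import random
from statistics import mean


def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented Authority Yield values, one per hypothetical host-graph seed.
automatic_yield = [0.42, 0.45, 0.40, 0.47, 0.44]
expert_yield = [0.31, 0.35, 0.33, 0.30, 0.36]

lower, upper = bootstrap_diff_ci(automatic_yield, expert_yield)
print(f"95% interval for the yield difference: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero across realizations would support the claimed ordering within the simulated environments; it would say nothing about the external-validity concern raised in the first major comment.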
minor comments (2)
- [Abstract and Results] The abstract and results sections would benefit from explicit statements of the number of pages in the Kalicube.com crawl, the number of candidate links evaluated, and the precise definitions of Authority Yield and loss-gain balance (including any free parameters).
- [Figures and Evaluation] Figure captions and the description of the two host-graph regimes should clarify why different numbers of intervention sets are used and how this affects the interpretation of Authority Volatility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation assumptions and results presentation. We respond to each major comment below, clarifying the framework's scope as a proxy-based pre-deployment tool while committing to clarifications and additions where possible.
- Referee: [Evaluation section] Evaluation section (host-graph construction and results): All quantitative comparisons of Authority Yield, coherence, and volatility rest on the assumption that embedding Kalicube.com into the FineWeb-based or Barabási-Albert host graphs yields authority deltas that generalize to real search-engine behavior after link deployment. No external validation, correlation with live crawl data, or sensitivity analysis across host-graph choices is reported, so the relative ordering of automatic vs. expert-assisted strategies cannot yet be treated as deployable evidence.
  Authors: The framework is explicitly designed as a simulation using proxy host graphs to enable relative comparisons of linking strategies prior to deployment, rather than a direct model of live search-engine dynamics. The manuscript already cautions on interpretation for Authority Volatility due to differing intervention set sizes. We will revise the evaluation section to more explicitly articulate the proxy limitations and emphasize that results provide comparative insights within the modeled environments, not absolute deployable predictions. Additional sensitivity analysis beyond the empirical and synthetic graphs used is outside the current study scope.
  Revision: partial
- Referee: [Results] Results paragraphs on Authority Yield and coherence: The claims that automatic selection 'generally produces stronger authority redistribution' and expert-assisted 'achieves the highest Authority Yield when targeting low-PageRank pages' are presented without reported dataset sizes for the intervention sets, error bars, or statistical tests. This makes it impossible to determine whether the observed differences are robust or merely artifacts of the particular crawl and graph realizations.
  Authors: We will add the exact sizes of the intervention sets to the results section in the revision. The experiments rely on single realizations of the host graphs, and we will include an explicit statement noting the lack of error bars or statistical tests as a limitation of the current setup, aligning with the existing cautious interpretation of Authority Volatility.
  Revision: yes
- External validation, correlation with live crawl data, or post-deployment PageRank changes, as these require production search engine access and real-world deployment experiments not feasible within this study.
Circularity Check
No circularity; empirical metrics computed directly from simulations on proxy graphs.
The paper describes modeling a site as a graph, scoring links via GraphSAGE, embedding into FineWeb or Barabási-Albert host graphs, and computing PageRank-derived metrics (Authority Yield, coherence, volatility) for automatic vs. expert-assisted interventions. No equations, fitted parameters, or self-citations are shown that reduce any reported result to a definition or input by construction. All quantitative comparisons are independent simulation outputs, not tautological renamings or self-referential fits. The generalization from proxy graphs is an external-validity concern, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Websites can be faithfully represented as directed graphs whose nodes carry semantic embeddings.
- domain assumption: Authority redistribution after link changes can be approximated by PageRank recomputation inside an external host graph.
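To make the first assumption concrete, the sketch below stores a site as a directed edge list plus one embedding per page and computes a coherence score as the mean cosine similarity across links. This definition of coherence is an assumption chosen for illustration; the summary does not give the paper's exact formula, and the pages and 2-d embeddings are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_coherence(edges, embeddings):
    """Assumed coherence measure: mean cosine similarity over all links."""
    if not edges:
        return 0.0
    total = sum(cosine(embeddings[src], embeddings[dst]) for src, dst in edges)
    return total / len(edges)


# Invented pages with toy 2-d embeddings.
embeddings = {"seo": [1.0, 0.0], "links": [1.0, 0.0], "careers": [0.0, 1.0]}
edges = [("seo", "links")]

coherence_before = link_coherence(edges, embeddings)
# Adding an off-topic link lowers the average similarity across links.
coherence_after = link_coherence(edges + [("seo", "careers")], embeddings)
print(coherence_before, coherence_after)  # 1.0 0.5
```

Under this definition, a link set that moves authority toward unrelated pages pays a visible coherence cost, which matches the trade-off the review describes.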
Appendix A. GenAI Usage Disclosure
Large Language Models (LLMs) were used as assistants during the preparation of this paper. Specifically, LLMs supp...