pith. machine review for the scientific record.

arxiv: 2604.05711 · v1 · submitted 2026-04-07 · 💻 cs.SE · cs.AI · cs.CL · cs.IR

Recognition: 2 Lean theorem links

SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.IR
keywords semantic hyperlink verification · Siamese Sentence-BERT · automated test oracle · semantic drift · link rot · web quality assurance · HWPPs dataset · Sentence-BERT

The pith

SemLink uses a Siamese Sentence-BERT network to flag semantic drift in hyperlinks at 96 percent recall while running roughly 47.5 times faster than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hyperlinks on the web can remain technically live yet lose their intended meaning when target pages change over time. SemLink builds an automated oracle that encodes both the source context around a link and the target page content, then scores how well they align semantically. The model is trained on a new dataset of more than 60,000 labeled pairs and reaches recall comparable to current generative models. Because it avoids the latency and resource demands of those models, the approach makes repeated semantic checks practical for large sites. This fills the gap between simple HTTP status checks and full generative analysis for keeping web content consistent.

Core claim

SemLink proposes a Siamese Neural Network with a pre-trained Sentence-BERT backbone that computes semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. Trained and evaluated on the newly introduced Hyperlink-Webpage Positive Pairs dataset of over 60,000 pairs, the system attains 96.00 percent recall, matching the level of GPT-5.2 while operating approximately 47.5 times faster and consuming far fewer computational resources.

What carries the argument

The Siamese Sentence-BERT architecture that encodes source context and target content into separate embeddings and measures their semantic similarity to serve as an automated test oracle.
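As a hedged illustration of that pattern (not the paper's actual implementation), the oracle reduces to a shared encoder plus a cosine-similarity decision. The `encode` callable and the 0.5 threshold below are placeholders:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def link_oracle(source_context, target_content, encode, threshold=0.5):
    """Semantic test oracle: True when source context and target content cohere.

    `encode` stands in for the shared Sentence-BERT backbone; in a Siamese
    setup both inputs pass through the SAME encoder weights. The threshold
    is illustrative, not a value reported by the paper.
    """
    src = encode(source_context)
    tgt = encode(target_content)
    return cosine_similarity(src, tgt) >= threshold
```

In practice `encode` would be a pre-trained SBERT model (e.g. loaded via the sentence-transformers library), and the threshold would be tuned on held-out HWPPs pairs rather than fixed a priori.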

If this is right

  • Large-scale web regression suites can now incorporate semantic checks without incurring the latency or cost of repeated LLM calls.
  • Web crawlers and monitoring services gain the ability to surface links that are technically live but contextually outdated.
  • Continuous integration pipelines for web applications can add an efficient oracle that reduces the chance of shipping mismatched hyperlinks.
  • Quality-assurance teams obtain a practical middle ground between syntactic status checks and full generative models for routine link verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the model generalizes to pages outside the training distribution, it could be applied to other dynamic web elements such as embedded forms or media captions.
  • Deployment in browser-based tools might enable real-time user warnings when a visited page has drifted from the link that led there.
  • The low resource footprint opens the possibility of running semantic checks directly on developer machines rather than in cloud environments.

Load-bearing premise

The HWPPs dataset of over 60,000 pairs accurately represents real-world semantic drift cases, and the Siamese SBERT architecture reliably computes semantic coherence from source context and target content.

What would settle it

Running SemLink on an independent collection of hyperlinks whose semantic drift has been confirmed by human review or site-owner reports and measuring whether recall falls substantially below 96 percent.
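That check reduces to recall over the audited drift cases; a minimal sketch, with illustrative variable names:

```python
def drift_recall(flagged, drifted):
    """Recall of a drift oracle: of the links human review confirmed as
    drifted, what fraction did the oracle flag? `flagged` and `drifted`
    are parallel boolean lists over the same audited links."""
    true_pos = sum(f and d for f, d in zip(flagged, drifted))
    false_neg = sum((not f) and d for f, d in zip(flagged, drifted))
    return true_pos / (true_pos + false_neg)
```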

Figures

Figures reproduced from arXiv: 2604.05711 by Farn Wang, Guan-Yan Yang, Kuo-Hui Yeh, Shu-Yuan Ku, Wei-Ling Wen.

Figure 1. The automated data collection pipeline for the HWPPs dataset.
Figure 2. The SemLink Feature Extraction Pipeline.
Figure 3. The SemLink Siamese Network Architecture.
Figure 4. Efficiency vs. Performance.
Figure 7. Visualizing the Side-Text Heuristic.
Figure 8. A False Negative case: the link redirects to a login portal.
Original abstract

Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SemLink, a Siamese Sentence-BERT architecture for semantic hyperlink verification that detects semantic drift (beyond HTTP 200 failures) by comparing source context (anchor text, DOM, visual features) with target content. It introduces the HWPPs dataset of over 60,000 pairs and reports 96% recall comparable to GPT-5.2 while running 47.5 times faster with lower resource demands.

Significance. If the evaluation is sound, the work offers a practical, scalable alternative to both traditional syntactic oracles and expensive LLMs for web application testing, addressing link rot and semantic inconsistencies efficiently. The introduction of the HWPPs dataset is a positive contribution that could support future benchmark development in semantic web verification.

major comments (2)
  1. The central performance claim (96% recall, 47.5× speedup vs. GPT-5.2) rests entirely on the HWPPs dataset, yet the abstract and evaluation provide no information on pair generation, semantic drift definition, negative pair selection, annotation protocol, or inter-annotator agreement. This is load-bearing because any bias in labeling or pair construction could inflate recall without reflecting real-world generalization.
  2. No details are supplied on error bars, statistical significance tests, or the precise baseline configurations beyond GPT-5.2, making it impossible to assess whether the reported metrics reliably support the claim of comparability with state-of-the-art LLMs.
minor comments (1)
  1. The abstract would be strengthened by a one-sentence summary of the HWPPs construction process to allow readers to immediately gauge the evaluation's credibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important areas for improving the clarity and reproducibility of our work on SemLink and the HWPPs dataset. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: The central performance claim (96% recall, 47.5× speedup vs. GPT-5.2) rests entirely on the HWPPs dataset, yet the abstract and evaluation provide no information on pair generation, semantic drift definition, negative pair selection, annotation protocol, or inter-annotator agreement. This is load-bearing because any bias in labeling or pair construction could inflate recall without reflecting real-world generalization.

    Authors: We acknowledge that the manuscript does not provide sufficient detail on the HWPPs dataset construction within the evaluation section, which limits assessment of potential biases. In the revised version, we will expand Section 4 (Dataset) with a new subsection explicitly describing: the pair generation methodology (including how source contexts from anchor text, DOM, and visual features were matched to target webpage content), the operational definition of semantic drift used for positive/negative labeling, the negative pair selection strategy (e.g., sampling from semantically unrelated pages while controlling for topic overlap), the full annotation protocol (guidelines, annotator training, and quality control steps), and inter-annotator agreement metrics such as Cohen's kappa. These additions will enable readers to evaluate dataset validity and generalization potential. The reported performance figures are derived from the existing dataset splits and will remain unchanged. revision: yes

  2. Referee: No details are supplied on error bars, statistical significance tests, or the precise baseline configurations beyond GPT-5.2, making it impossible to assess whether the reported metrics reliably support the claim of comparability with state-of-the-art LLMs.

    Authors: We agree that the current evaluation lacks these elements of statistical rigor, which weakens the support for claims of comparability. In the revised manuscript, we will augment the results section with: error bars (standard deviation computed over 5-fold cross-validation and multiple random seeds for all metrics including recall), statistical significance tests (e.g., McNemar's test for paired recall comparisons between SemLink and GPT-5.2, with p-values reported), and precise baseline configurations (GPT-5.2 model version, exact prompt templates and parameters used for inference, hardware platform for latency measurements, and any other implementation details). This will allow proper evaluation of the reliability of the 96% recall and 47.5× speedup claims. revision: yes
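The two quantitative additions promised in these responses are textbook computations; a hedged sketch (not code from the paper):

```python
from collections import Counter
from math import comb

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators beyond chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal category frequencies.
    fa, fb = Counter(labels_a), Counter(labels_b)
    expected = sum(fa[c] * fb.get(c, 0) for c in fa) / (n * n)
    return (observed - expected) / (1 - expected)

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from the discordant counts:
    b = items one system got right and the other got wrong, c = the
    reverse. Under H0 the discordant pairs follow Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Applied here, `b` and `c` would count the test links on which exactly one of SemLink and GPT-5.2 detects the drift; a small p-value would support a genuine recall difference rather than noise.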

Circularity Check

0 steps flagged

No load-bearing circularity; model trained and evaluated on newly introduced independent dataset

full rationale

The paper introduces the HWPPs dataset of >60k pairs as a new corpus and trains a Siamese SBERT model on it before reporting recall on (presumably held-out) pairs from the same corpus. No equations, derivations, or fitted parameters are shown to reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. External comparisons to GPT-5.2 are presented as independent benchmarks. The central performance numbers therefore rest on the quality of the new dataset and standard train/test practices rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of pre-trained SBERT embeddings for web content similarity and on the representativeness of the newly introduced HWPPs dataset; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption Pre-trained Sentence-BERT embeddings capture semantic coherence between source hyperlink context and target page content
    Invoked when stating that the Siamese network computes semantic coherence using SBERT backbone.

pith-pipeline@v0.9.0 · 5597 in / 1443 out tokens · 76126 ms · 2026-05-10T19:21:52.627926+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 17 canonical work pages

  1. [1]

    Web application testing—challenges and opportunities,

S. Balsam and D. Mishra, “Web application testing—challenges and opportunities,” Journal of Systems and Software, vol. 219, p. 112186, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121224002309

  2. [2]

    When online content disappears,

A. Chapekis, S. Bestvater, E. Remy, and G. Rivero, “When online content disappears,” Pew Research Center, Tech. Rep., 2024. [Online]. Available: https://www.pewresearch.org/wp-content/uploads/sites/20/2024/05/pl 2024.05.17 link-rot report.pdf

  3. [3]

    The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, May 2015

  4. [4]

    Dead Link Checker,

“Dead Link Checker,” https://www.deadlinkchecker.com/, 2024, accessed: 2025-07-01

  5. [5]

    Screaming Frog SEO Spider,

“Screaming Frog SEO Spider,” https://www.screamingfrog.co.uk/seo-spider/, 2024, accessed: 2025-07-01

  6. [6]

    How do large language models understand relevance? a mechanistic interpretability perspective,

Q. Liu, H. Duan, J. Mao, and J.-R. Wen, “How do large language models understand relevance? a mechanistic interpretability perspective,” ACM Trans. Inf. Syst., Nov. 2025. [Online]. Available: https://doi.org/10.1145/3774942

  7. [7]

    Testora: Using natural language intent to detect behavioral regressions,

M. Pradel, “Testora: Using natural language intent to detect behavioral regressions,” in 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26). Rio de Janeiro, Brazil: ACM, 2026, 13 pages. [Online]. Available: https://doi.org/10.1145/3744916.3764527

  8. [8]

    Jailbroken: How does LLM safety training fail?

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does LLM safety training fail?” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023. [Online]. Available: http://papers.nips.cc/paper_files/paper/2023/hash/fd661313...

  9. [9]

    ArtPerception: ASCII art-based jailbreak on llms with recognition pre-test,

G.-Y. Yang, T.-Y. Cheng, Y.-W. Teng, F. Wang, and K.-H. Yeh, “ArtPerception: ASCII art-based jailbreak on llms with recognition pre-test,” Journal of Network and Computer Applications, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S108480452500253X

  10. [10]

Sentence-BERT: Sentence embeddings using siamese BERT-networks,

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics,...

  11. [11]

    Hyperlink analyses of the World Wide Web: A review,

H. W. Park and M. Thelwall, “Hyperlink analyses of the World Wide Web: A review,” Journal of Computer-Mediated Communication, vol. 8, no. 4, p. JCMC843, 2003

  12. [12]

    Hyperlink analysis for the web,

M. R. Henzinger, “Hyperlink analysis for the web,” IEEE Internet Comput., vol. 5, no. 1, pp. 45–50, 2001. [Online]. Available: https://doi.org/10.1109/4236.895141

  13. [13]

    Updating broken web links: An automatic recommendation system,

J. Martínez-Romo and L. Araujo, “Updating broken web links: An automatic recommendation system,” Inf. Process. Manag., vol. 48, no. 2, pp. 183–203, 2012. [Online]. Available: https://doi.org/10.1016/j.ipm.2011.03.006

  14. [14]

    Recommendation system for automatic recovery of broken web links,

J. Martinez-Romo and L. Araujo, “Recommendation system for automatic recovery of broken web links,” in Advances in Artificial Intelligence – IBERAMIA 2008, 11th Ibero-American Conference on AI, Lisbon, Portugal, October 14–17, 2008. Proceedings, ser. Lecture Notes in Computer Science, vol. 5290. Springer, 2008, pp. 302–311. [Online]. Available: https://doi...

  15. [15]

    Semantic test repair for web applications,

X. Qi, X. Qian, and Y. Li, “Semantic test repair for web applications,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’23). ACM, 2023, pp. 1190–1202

  16. [16]

    Enhancing web test script repair using integrated ui structural and visual information,

Z. Wen, Y. Lu, T. Xu, M. Pan, T. Zhang, and X. Li, “Enhancing web test script repair using integrated ui structural and visual information,” in 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2024, pp. 75–86

  17. [17]

    Automated repair of layout cross browser issues using search-based techniques,

S. Mahajan, A. Alameer, P. McMinn, and W. G. J. Halfond, “Automated repair of layout cross browser issues using search-based techniques,” in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 249–260. [Online]. Available: https://d...

  18. [18]

    Methods for evaluating the quality of hypertext links,

J. Blustein, R. Webber, and J. Tague-Sutcliffe, “Methods for evaluating the quality of hypertext links,” Information Processing & Management, vol. 33, no. 2, pp. 255–271, 1997. [Online]. Available: https://doi.org/10.1016/S0306-4573(96)00066-0

  19. [19]

Using semantic similarity in crawling-based web application testing,

J.-W. Lin, F. Wang, and P. Chu, “Using semantic similarity in crawling-based web application testing,” in 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), 2017, pp. 138–148

  20. [20]

    Unblind your apps: predicting natural-language labels for mobile GUI components by deep learning,

J. Chen, C. Chen, Z. Xing, X. Xu, L. Zhu, G. Li, and J. Wang, “Unblind your apps: predicting natural-language labels for mobile GUI components by deep learning,” in ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June – 19 July, 2020. ACM, 2020, pp. 322–334. [Online]. Available: https://doi.org/10.1145/3377811.3380327

  21. [21]

Similarity-based web element localization for robust test automation,

M. Nass, E. Alégroth, R. Feldt, M. Leotta, and F. Ricca, “Similarity-based web element localization for robust test automation,” ACM Trans. Softw. Eng. Methodol., vol. 32, no. 3, Apr. 2023. [Online]. Available: https://doi.org/10.1145/3571855

  22. [22]

    Nlp-assisted web element identification toward script-free testing,

H. Kirinuki, S. Matsumoto, Y. Higo, and S. Kusumoto, “Nlp-assisted web element identification toward script-free testing,” in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2021, pp. 639–643

  23. [23]

    Aeon: a method for automatic evaluation of nlp test cases,

J.-t. Huang, J. Zhang, W. Wang, P. He, Y. Su, and M. R. Lyu, “Aeon: a method for automatic evaluation of nlp test cases,” ser. ISSTA 2022. New York, NY, USA: Association for Computing Machinery, 2022, pp. 202–214. [Online]. Available: https://doi.org/10.1145/3533767.3534394

  24. [24]

    Automated testing linguistic capabilities of NLP models,

J. Lee, S. Chen, A. Mordahl, C. Liu, W. Yang, and S. Wei, “Automated testing linguistic capabilities of NLP models,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 7, Sep. 2024. [Online]. Available: https://doi.org/10.1145/3672455

  25. [25]

    BERT: Pre- training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short ...

  26. [26]

    A survey on the techniques, applications, and performance of short text semantic similarity,

M. Han, X. Zhang, X. Yuan, J. Jiang, W. Yun, and C. Gao, “A survey on the techniques, applications, and performance of short text semantic similarity,” Concurrency and Computation: Practice and Experience, vol. 33, no. 5, p. e5971, 2021. [Online]. Available: https://doi.org/10.1002/cpe.5971

  27. [27]

    DOM Standard,

WHATWG, “DOM Standard,” https://dom.spec.whatwg.org/, 2024, accessed: 2025-12-10

  28. [28]

    Siamese neural networks: An overview,

D. Chicco, “Siamese neural networks: An overview,” vol. 2190, pp. 73–94, 2021. [Online]. Available: https://doi.org/10.1007/978-1-0716-0826-5_3

  29. [29]

    TextRank: Bringing order into text,

    R. Mihalcea and P. Tarau, “TextRank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, D. Lin and D. Wu, Eds. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 404–411. [Online]. Available: https://aclanthology.org/W04-3252/

  30. [30]

FaceNet: A unified embedding for face recognition and clustering,

F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015. IEEE Computer Society, 2015, pp. 815–823. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7298682

  31. [31]

    Pattern recognition and machine learning,

    C. M. Bishop, Pattern recognition and machine learning, 5th Edition, ser. Information science and statistics. Springer, 2007. [Online]. Available: https://www.worldcat.org/oclc/71008143

  32. [32]

A framework for multiple-instance learning,

O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Advances in Neural Information Processing Systems 10, [NIPS Conference, Denver, Colorado, USA, 1997]. The MIT Press, 1997, pp. 570–576. [Online]. Available: http://papers.nips.cc/paper/1346-a-framework-for-multiple-instance-learning

  33. [33]

    Natural language processing (almost) from scratch,

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011. [Online]. Available: https://dl.acm.org/doi/10.5555/1953048.2078186

  34. [34]

    Laws of organization in perceptual forms

    M. Wertheimer, “Laws of organization in perceptual forms.” 1938. [Online]. Available: https://doi.org/10.1037/11496-005

  35. [35]

    VIPS: a vision-based page segmentation algorithm,

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: a vision-based page segmentation algorithm,” in Microsoft technical report (MSR-TR-2003-79). Microsoft Research, 2003

  36. [36]

Usability engineering,

J. Nielsen, Usability engineering. Academic Press, 1993