pith. machine review for the scientific record.

arxiv: 2605.06239 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

When Graph Language Models Go Beyond Memorization

Mahito Sugiyama, Masatsugu Yamada

Pith reviewed 2026-05-08 13:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph language models · memorization · subgraph mining · structural alignment · scaling · TU benchmarks · diagnostic protocol · frequent subgraph mining

The pith

Graph language models acquire structural regularities beyond memorization at large scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

It remains unclear whether graph language models learn actual structural patterns in graphs or simply recall training examples. The paper introduces a diagnostic protocol that uses frequent subgraph mining, a graph-level bootstrap baseline simulating memorization, and frequency stratification to isolate the two. At small scales the models' outputs match what pure recall would produce. At large scales with 3.75 million graphs, however, verbatim memorization falls sharply while rank correlations with structural patterns stay high, and this alignment persists when analysis is restricted to graphs absent from training.

Core claim

Using the new diagnostic, the authors establish that graph language models acquire structural regularities beyond memorization at scale, primarily for high-frequency patterns. This appears as high subgraph-rank correlations that the memorization bootstrap matches at small scale but cannot explain at large scale, with the novel-only subset analysis confirming that the alignment is not driven solely by recall of seen graphs. High-frequency subgraphs are reproduced reliably across scales while rare ones remain poorly covered with little gain from added capacity.

What carries the argument

The calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to separate memorization effects from structural alignment.
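
As a concrete anchor, here is a minimal sketch of the rank-alignment step, assuming the pattern supports have already been mined upstream (the paper uses a gSpan-style miner); the dict interface and the toy supports below are illustrative, not the authors' code or data.

    from scipy.stats import spearmanr

    def rank_alignment(train_support, gen_support):
        """Spearman rho between subgraph supports in training vs. generated graphs.

        Both inputs map a canonical pattern key (e.g. a gSpan minimum DFS code
        string) to its support count; a pattern missing on one side counts as 0.
        """
        patterns = sorted(set(train_support) | set(gen_support))
        x = [train_support.get(p, 0) for p in patterns]
        y = [gen_support.get(p, 0) for p in patterns]
        rho, _ = spearmanr(x, y)
        return rho

    # Toy illustration (made-up supports, not paper numbers):
    train = {"C-C-C": 120, "C-N": 80, "ring6": 60, "N=O": 5}
    gen = {"C-C-C": 240, "C-N": 150, "ring6": 110, "C=O": 3}
    print(rank_alignment(train, gen))

The memorization bootstrap and the frequency stratification then reuse this same correlation on resampled and stratified supports, respectively.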

If this is right

  • Verbatim memorization drops sharply at large scale while rank correlation with structural patterns remains near ceiling.
  • High-frequency patterns are reproduced well at all scales; rare patterns stay poorly covered with only marginal improvement as capacity grows.
  • The scale-dependent separation from memorization holds under both canonical DFS code and action-sequence graph serializations.
  • On five TU benchmarks, models reach high subgraph-rank correlations that the bootstrap matches or exceeds at small scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling may therefore be more effective for capturing common motifs than for covering the tail of rare structures.
  • The same diagnostic protocol could be applied to sequence or tree generators to test whether similar scale-dependent generalization occurs.
  • Data curation that balances subgraph frequencies might reduce the persistent gap for rare patterns.

Load-bearing premise

The graph-level bootstrap baseline correctly models pure verbatim recall without capturing any structural regularities on its own.
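
A plausible reading of that premise, sketched under assumptions: the null "generations" are training graphs resampled with replacement and mined with the same support function. Here `support_fn`, the resampling scheme, and the ±1.96 σ summary over 10 repeats (as in Figure 1) stand in for the paper's exact procedure.

    import random
    import statistics
    from scipy.stats import spearmanr

    def bootstrap_alignment(train_graphs, n_generated, support_fn, n_boot=10, seed=0):
        """Rank alignment that pure verbatim recall of training graphs would yield.

        support_fn(graphs) -> {pattern: support} is a stand-in for the
        frequent subgraph miner.
        """
        rng = random.Random(seed)
        train_support = support_fn(train_graphs)
        patterns = sorted(train_support)
        rhos = []
        for _ in range(n_boot):
            # "Generate" by recalling training graphs at random, with replacement.
            recalled = rng.choices(train_graphs, k=n_generated)
            boot_support = support_fn(recalled)
            x = [train_support[p] for p in patterns]
            y = [boot_support.get(p, 0) for p in patterns]
            rhos.append(spearmanr(x, y)[0])
        mean_rho = statistics.fmean(rhos)
        half_width = 1.96 * statistics.stdev(rhos)  # Gaussian 95% band, as in Figure 1
        return mean_rho, half_width

If this premise fails, that is, if the baseline itself encodes structural regularities beyond recall, the large-scale crossover loses force; this is exactly the referee's first major comment below.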

What would settle it

If frequent subgraph mining applied only to the novel-only generated graphs produced Spearman correlations that no longer tracked the full-set correlations at the 3.75-million-graph scale, the claim of structural learning beyond memorization would be falsified.
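
A hedged sketch of that check's first step, splitting generations into recalled and novel graphs. The Weisfeiler-Lehman hash used here is a fast proxy for the exact-match test (it can collide and ignores node labels unless node_attr is set), so the paper's own novelty criterion may be stricter.

    import networkx as nx

    def split_generated(generated, train, iterations=3):
        """Partition generated graphs into (recalled, novel) relative to training."""
        train_hashes = {nx.weisfeiler_lehman_graph_hash(g, iterations=iterations)
                        for g in train}
        recalled, novel = [], []
        for g in generated:
            h = nx.weisfeiler_lehman_graph_hash(g, iterations=iterations)
            (recalled if h in train_hashes else novel).append(g)
        return recalled, novel

The memorization rate is len(recalled) / len(generated); the falsification test then compares the rank correlation computed on the novel subset alone with the one computed on all generations at the 3.75-million-graph scale.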

Figures

Figures reproduced from arXiv:2605.06239 by Mahito Sugiyama and Masatsugu Yamada.

Figure 1: LLM (DFS canonical, both_default, single training seed = 42) vs. graph-level bootstrap baseline on the five TU benchmarks. Bars on each panel show the model and bootstrap mean; error bars on the bootstrap bar are ±1.96 σ over 10 bootstrap repeats (a Gaussian approximation of a 95% band; the raw 5–95 percentile would be too noisy at nboot = 10). Bootstrap matches or exceeds the model on Spearman ρ for MUTAG…
Figure 2: Frequency-stratified metrics (DFS canonical, single training seed = 42), bar heights = …
Figure 3: Cross-model PCQM4Mv2 scaling. Memorization, overall …
Figure 4: Per-pattern train-vs-generated support on log–log axes for all five TU benchmarks. Blue …
Figure 5: Per-pattern train-vs-generated support on log–log axes for PCQM4Mv2 across the four …
Figure 6: Per-dataset checkpoint trajectories. Rows: TU benchmark datasets. Columns: serial…
Figure 7: Training (faded curves) and held-out evaluation (markers) cross-entropy loss on …
Figure 8: PCQM4Mv2 LLM-DFS checkpoint trajectory across the four LLaMA capacities sweep…
Figure 9: MUTAG (DFS canonical) early-checkpoint dynamics. Unique rate, precision, and Spearman…
Original abstract

It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a diagnostic protocol combining frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to test whether graph language models acquire structural regularities or merely memorize training graphs. On five TU benchmarks, it reports that LLaMA-style models achieve high subgraph-rank correlation that, at small scales, is matched by the memorization bootstrap, but at 3.75M graphs verbatim memorization drops sharply while rank correlation remains high; crucially, frequent subgraph mining on a novel-only subset tracks the all-generation Spearman correlation, with the pattern holding under two serializations and concentrated in the high-frequency regime.

Significance. If the central scale-dependent crossover holds, the work supplies concrete empirical evidence that graph LMs can move beyond verbatim recall toward structural alignment at sufficient scale, particularly for frequent patterns. The use of an external bootstrap and novel-subset controls, together with the reported rank correlations, strengthens the claim relative to aggregate fidelity metrics alone.

major comments (2)
  1. [Methods / Bootstrap definition] The graph-level bootstrap baseline (described in the methods and used for the small-scale vs. large-scale comparison) samples from training graphs without incorporating the model's learned distribution or the exact serialization constraints (DFS code or action sequences). This risks under-modeling the frequency-rank profile that pure verbatim recall would produce, which could artifactually produce the observed crossover at 3.75M graphs rather than demonstrate structural learning.
  2. [Results / Novel-subset FSM analysis] The novel-only subset analysis (fixed-subsample results) claims that FSM restricted to novel graphs tracks the full-set Spearman correlation. However, without explicit quantification of residual structural overlap (e.g., shared high-frequency subgraphs between novel test graphs and the training corpus), indirect leakage cannot be ruled out; this directly affects whether the alignment is shown to be non-memorized.
minor comments (2)
  1. [Abstract / Methods] Clarify the precise definition of 'verbatim memorization' versus 'rank correlation' and how the three-level frequency stratification is operationalized (e.g., exact thresholds for high/medium/rare); one possible operationalization is sketched after this list.
  2. [Results] The manuscript states the same scale-dependent pattern holds under two serializations; include a direct side-by-side table or figure quantifying any differences in the crossover point or correlation values.
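
On the first minor comment, a minimal sketch of one possible three-level stratification, with hypothetical cutoffs (top 10% of training-ranked patterns as head, next 40% as torso, the rest as tail); the paper's actual thresholds are exactly what the comment asks the authors to state.

    from scipy.stats import spearmanr

    def stratified_alignment(train_support, gen_support, head_frac=0.1, torso_frac=0.4):
        """Per-stratum Spearman rho after a three-level frequency split.

        Patterns are ranked by training support and cut into head / torso / tail
        using hypothetical fractions; patterns absent from the generated side
        count as support 0.
        """
        ranked = sorted(train_support, key=train_support.get, reverse=True)
        n_head = max(1, round(head_frac * len(ranked)))
        n_torso = max(1, round(torso_frac * len(ranked)))
        strata = {
            "head": ranked[:n_head],
            "torso": ranked[n_head:n_head + n_torso],
            "tail": ranked[n_head + n_torso:],
        }
        rhos = {}
        for name, patterns in strata.items():
            if len(patterns) < 2:
                rhos[name] = float("nan")
                continue
            x = [train_support[p] for p in patterns]
            y = [gen_support.get(p, 0) for p in patterns]
            rhos[name] = spearmanr(x, y)[0]
        return rhos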

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our diagnostic protocol for distinguishing memorization from structural learning in graph language models. We address each major comment below, offering clarifications on our methodological choices and committing to revisions where they strengthen the evidence.

Point-by-point responses
  1. Referee: [Methods / Bootstrap definition] The graph-level bootstrap baseline (described in the methods and used for the small-scale vs. large-scale comparison) samples from training graphs without incorporating the model's learned distribution or the exact serialization constraints (DFS code or action sequences). This risks under-modeling the frequency-rank profile that pure verbatim recall would produce, which could artifactually produce the observed crossover at 3.75M graphs rather than demonstrate structural learning.

    Authors: The bootstrap baseline is designed to provide a direct empirical estimate of the subgraph rank profile that would arise from verbatim recall of the training set, by resampling graphs from the training distribution. Because frequent subgraph mining operates on the underlying graph structures (independent of serialization), this approach captures the structural frequencies without introducing model-specific generative biases. We acknowledge that a bootstrap that also respects the exact serialization format used by the model could offer a more precise null model. To address this, we will perform an additional experiment in the revision where bootstrapped graphs are serialized using the same DFS code or action sequence format before mining, allowing us to compare the resulting rank correlations more closely to the model's output distribution. This will clarify whether the observed scale-dependent crossover is robust to these factors. revision: partial

  2. Referee: [Results / Novel-subset FSM analysis] The novel-only subset analysis (fixed-subsample results) claims that FSM restricted to novel graphs tracks the full-set Spearman correlation. However, without explicit quantification of residual structural overlap (e.g., shared high-frequency subgraphs between novel test graphs and the training corpus), indirect leakage cannot be ruled out; this directly affects whether the alignment is shown to be non-memorized.

    Authors: We recognize the importance of quantifying potential structural overlap to strengthen the claim that the alignment in the novel subset is not due to indirect leakage of frequent patterns from the training corpus. While the graphs in the novel subset are distinct from those in training, high-frequency subgraphs may indeed be shared across the dataset. In the revised manuscript, we will include a new analysis that computes the overlap (e.g., via set intersection or rank correlation of subgraph frequencies) between the frequent subgraphs mined from the novel-only subset and those from the training set, particularly focusing on the high-frequency regime. This addition will provide a more rigorous check on the extent of shared structure and support the interpretation of structural alignment beyond memorization. revision: yes
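
A minimal sketch of the overlap check the rebuttal commits to, under assumed metric choices (shared-pattern count, Jaccard overlap, and Spearman ρ on shared patterns); `novel_support` and `train_support` denote subgraph supports mined from the novel-only generations and from the training corpus, and `top_k` is a hypothetical restriction to the high-frequency regime.

    from scipy.stats import spearmanr

    def novel_vs_train_overlap(novel_support, train_support, top_k=None):
        """Quantify structural overlap between novel-subset and training patterns."""
        train_keys = sorted(train_support, key=train_support.get, reverse=True)
        if top_k is not None:
            train_keys = train_keys[:top_k]  # focus on the high-frequency regime
        shared = [p for p in train_keys if p in novel_support]
        union = set(train_keys) | set(novel_support)
        jaccard = len(shared) / len(union) if union else 0.0
        if len(shared) > 1:
            rho = spearmanr([train_support[p] for p in shared],
                            [novel_support[p] for p in shared])[0]
        else:
            rho = float("nan")
        return {"n_shared": len(shared), "jaccard": jaccard, "rho_shared": rho}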

Circularity Check

0 steps flagged

No significant circularity in empirical diagnostic framework

full rationale

The paper's central claims rest on direct empirical measurements: subgraph rank correlations computed via frequent subgraph mining on model-generated graphs, compared against a graph-level bootstrap baseline that resamples from the training corpus to simulate verbatim recall, plus a fixed-subsample analysis restricted to novel-only generations. These quantities are independently observed from the data and controls rather than defined in terms of each other or fitted to produce the reported scale-dependent crossover. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, smuggled ansatzes, or renamings of known results appear in the derivation chain. The protocol is self-contained against external benchmarks and falsifiable via the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical diagnostic study that introduces no new free parameters, invented entities, or non-standard mathematical axioms beyond routine assumptions in machine-learning evaluation.

axioms (1)
  • domain assumption The graph-level bootstrap baseline, built by resampling training graphs, accurately models the output distribution expected from pure verbatim memorization.
    Invoked to establish that model alignment exceeds what memorization alone would produce.

pith-pipeline@v0.9.0 · 5540 in / 1396 out tokens · 51786 ms · 2026-05-08T13:21:07.562330+00:00 · methodology

