pith. machine review for the scientific record.

arxiv: 2604.18092 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Generalization Boundaries of Fine-Tuned Small Language Models for Graph Structural Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph structural inference · small language models · fine-tuning · generalization · ordinal consistency · graph size scaling · graph families · property estimation

The pith

Fine-tuned small language models maintain ordinal consistency in ranking graph structures even on substantially larger graphs and across different families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how well small language models fine-tuned to estimate graph properties retain that ability when graphs grow much larger than the training examples or come from new structural families. The authors run controlled tests on three 3-4B instruction-tuned models using two text serialization formats for graphs, measuring whether the models still order graphs correctly by features such as density or connectivity. A reader would care because the findings delineate the conditions under which these compact models can support graph reasoning tasks without retraining on every new scale or distribution.
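The summary does not name the two serialization formats, but the figure captions identify them as adjacency-list and edge-list encodings. A minimal sketch of how a small graph might be rendered as text under each (the exact prompt templates are assumptions, not the paper's):

```python
# Sketch of two graph-to-text serializations (adjacency list vs. edge
# list). The concrete string layouts are illustrative assumptions.

def to_adjacency_list(edges, n):
    """Render an undirected graph as one 'node: neighbors' line per node."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return "\n".join(f"{u}: {sorted(vs)}" for u, vs in adj.items())

def to_edge_list(edges):
    """Render the same graph as a flat list of (u, v) pairs."""
    return "; ".join(f"({u},{v})" for u, v in sorted(edges))

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(to_adjacency_list(edges, 4))
print(to_edge_list(edges))
```

The figures report that performance gaps between the two representations widen at larger graph sizes, so the choice of encoding is itself an experimental variable.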

Core claim

Using a controlled experimental setup with three instruction-tuned models in the 3-4B parameter class and two graph serialization formats, the work evaluates performance on graphs substantially larger than the training range and across held-out random graph families. The results show that fine-tuned models maintain strong ordinal consistency across structurally distinct graph families and continue to rank graphs by structural properties on inputs substantially larger than those seen during training, with distinct architecture-specific degradation profiles. These findings delineate where fine-tuned small language models generalize reliably for graph-based reasoning tasks.

What carries the argument

Ordinal consistency in ranking graphs by structural properties, measured across size scaling and held-out random graph families on fine-tuned 3-4B language models.

If this is right

  • The models can continue to rank graphs by properties such as connectivity without retraining when inputs exceed training sizes.
  • Ordinal ranking performance transfers across structurally different random graph families.
  • Degradation rates vary by model architecture, allowing architecture choice based on expected graph scale.
  • These patterns provide empirical grounding for deploying fine-tuned small models on graph reasoning without full retraining on new domains.
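"Ordinal consistency" is quantified as a rank correlation: the model's estimates only need to preserve the ordering of graphs by a property, not the property's absolute value. A minimal sketch using scipy's spearmanr (the ρ reported in the paper's figures), with hypothetical values:

```python
# Ordinal consistency as rank correlation: Spearman's rho compares
# rankings, not magnitudes. The property values below are hypothetical.
from scipy.stats import spearmanr

true_density = [0.10, 0.25, 0.40, 0.55, 0.70]     # ground-truth values
model_estimates = [0.12, 0.20, 0.45, 0.50, 0.80]  # assumed model outputs

rho, p_value = spearmanr(true_density, model_estimates)
print(rho)  # 1.0: the ordering is preserved even though the values are off
```

This is why a model can be numerically inaccurate on out-of-range graphs yet still score above ρ = 0.74, as Figure 1 reports for Phi-4-mini and Qwen2.5-3B.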

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ordinal consistency might appear on real-world graphs such as citation or social networks if the random-family results transfer.
  • Testing much larger size jumps or mixed serialization formats could expose additional limits not captured in the current setup.
  • Architecture-specific degradation profiles suggest selecting models according to the scale of graphs expected in a target application.

Load-bearing premise

That results from random graph families and controlled size increases, obtained with specific 3-4B models and serialization formats, reflect the broader generalization behavior of fine-tuned small language models for graph structural inference.

What would settle it

A drop in rank correlation between model outputs and true graph properties to near zero on graphs twice the training size or from a new random family would show the generalization boundaries do not hold.
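One way to operationalize that check is to ask whether the model's ranking remains distinguishable from chance: compute a rank correlation on the out-of-range graphs and compare it against a random-ranking null. A sketch with illustrative values, using scipy's kendalltau (the rebuttal names Kendall's tau-b as the paper's primary metric):

```python
# Chance-level check for the ranking claim: Kendall's tau between true
# and predicted property values, plus a permutation test against a
# random-ranking null. All values here are illustrative.
import random
from scipy.stats import kendalltau

random.seed(0)
true_vals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
preds = [0.15, 0.18, 0.35, 0.33, 0.55, 0.58, 0.72, 0.90]

tau, _ = kendalltau(true_vals, preds)  # tau-b (tie-corrected) by default

# How often does a randomly shuffled ranking score at least this well?
hits, trials = 0, 2000
for _ in range(trials):
    shuffled = preds[:]
    random.shuffle(shuffled)
    t, _ = kendalltau(true_vals, shuffled)
    if t >= tau:
        hits += 1
p_perm = hits / trials
print(round(tau, 3), p_perm)
```

A tau near zero with a permutation p-value near 1 on graphs twice the training size would be exactly the failure mode described above.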

Figures

Figures reproduced from arXiv: 2604.18092 by Michal Podstawski.

Figure 1
Figure 1: All models maintain meaningful structural ordering well beyond the training size range under adjacency-list serialization. Phi-4-mini and Qwen2.5-3B remain above ρ = 0.74 across the entire evaluated range (up to n = 150), while Llama-3.2-3B shows more variable behavior, with performance above ρ = 0.77 in most size bins but a localized dip at n = 130–140. The three models exhibit qualitatively distinct …
Figure 1
Figure 1: Size generalization: Spearman ρ across increasing graph size bins for each model. Solid lines denote adjacency-list serialization, dashed lines denote edge-list. The shaded region marks the training range (n = 20–30). All models show graceful degradation under adjacency-list encoding, with performance gaps between representations widening at larger graph sizes. The non-monotonic pattern suggests greater sensit…
Original abstract

Small language models fine-tuned for graph property estimation have demonstrated strong in-distribution performance, yet their generalization capabilities beyond training conditions remain poorly understood. In this work, we systematically investigate the boundaries of structural inference in fine-tuned small language models along two generalization axes - graph size and graph family distribution - and assess domain-learning capability on real-world graph benchmarks. Using a controlled experimental setup with three instruction-tuned models in the 3-4B parameter class and two graph serialization formats, we evaluate performance on graphs substantially larger than the training range and across held-out random graph families. Our results show that fine-tuned models maintain strong ordinal consistency across structurally distinct graph families and continue to rank graphs by structural properties on inputs substantially larger than those seen during training, with distinct architecture-specific degradation profiles. These findings delineate where fine-tuned small language models generalize reliably, providing empirical grounding for their use in graph-based reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuned small language models in the 3-4B parameter range, trained for graph property estimation, exhibit strong ordinal consistency when ranking graphs by structural properties across held-out random graph families and on graphs substantially larger than the training distribution. Using three instruction-tuned models and two serialization formats, the work reports architecture-specific degradation profiles and assesses domain adaptation on real-world graph benchmarks, framing the contribution as delineating reliable generalization regimes rather than universal performance.

Significance. If the empirical results hold under scrutiny, the study provides actionable boundaries for deploying fine-tuned SLMs in graph-based reasoning applications, particularly where training data is restricted to smaller or specific graph families. The controlled multi-model, multi-format design and separation of random-graph from real-world evaluations are strengths that could inform practical model selection and training strategies in graph ML.

major comments (2)
  1. [Methods/Results] Methods and Results sections: The manuscript reports 'strong ordinal consistency' and 'ranking performance' but does not specify the exact metrics (e.g., Kendall tau, Spearman rank correlation), how ties are handled, or whether statistical significance tests (with p-values or confidence intervals) were applied to the held-out family and size-extrapolation results. This makes it difficult to evaluate whether the observed performance exceeds chance or baseline levels.
  2. [Experimental setup] Experimental setup (likely §3-4): Details on data splits, graph generation parameters for the random families, exact training ranges for graph sizes, and the precise definition of 'substantially larger' inputs are insufficient to reproduce or assess the generalization claims. Without these, the load-bearing assertion that models 'continue to rank graphs' on out-of-distribution sizes cannot be fully verified.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrasing 'delineate where fine-tuned small language models generalize reliably' could be tightened to more precisely reflect the two axes studied (size and family) rather than implying broader generalization.
  2. [Throughout] Notation and figures: Ensure consistent use of terms like 'ordinal consistency' across text and any tables/figures reporting the metrics; clarify if serialization format effects are ablated in all experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The comments on metric specification and experimental reproducibility are well-taken and will strengthen the paper. We address each major comment below.

Point-by-point responses
  1. Referee: [Methods/Results] Methods and Results sections: The manuscript reports 'strong ordinal consistency' and 'ranking performance' but does not specify the exact metrics (e.g., Kendall tau, Spearman rank correlation), how ties are handled, or whether statistical significance tests (with p-values or confidence intervals) were applied to the held-out family and size-extrapolation results. This makes it difficult to evaluate whether the observed performance exceeds chance or baseline levels.

    Authors: We agree that the evaluation metrics and statistical procedures must be stated explicitly. The primary metric used throughout is Kendall's tau rank correlation (tau-b variant) to quantify ordinal consistency, which is appropriate for the ranking task and incorporates a standard tie-handling correction. We also computed Spearman rank correlation as a secondary check. For the held-out family and size-extrapolation results, we applied bootstrap resampling to obtain 95% confidence intervals and conducted one-sided permutation tests against a random-ranking null model to establish statistical significance. We will insert a new subsection in Methods (and reference it in Results) that fully documents these choices, reports the p-values, and includes the confidence intervals in the relevant tables and figures. revision: yes

  2. Referee: [Experimental setup] Experimental setup (likely §3-4): Details on data splits, graph generation parameters for the random families, exact training ranges for graph sizes, and the precise definition of 'substantially larger' inputs are insufficient to reproduce or assess the generalization claims. Without these, the load-bearing assertion that models 'continue to rank graphs' on out-of-distribution sizes cannot be fully verified.

    Authors: We acknowledge that the current description of the experimental setup is insufficient for full reproducibility. We will expand §3 (Experimental Setup) and §4 (Results) with the following additions: (i) explicit train/validation/test split ratios and how they were applied across families; (ii) the precise graph-generation parameters for each random family (e.g., edge-probability ranges for Erdős–Rényi, preferential-attachment parameters for Barabási–Albert, and rewiring probabilities for Watts–Strogatz); (iii) the exact node-count ranges used for training (minimum and maximum); and (iv) a clear operational definition of 'substantially larger' together with the specific size ranges evaluated in the extrapolation experiments. These details will be presented both in the main text and in an expanded Appendix, allowing readers to verify the out-of-distribution ranking claims. revision: yes
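The three random-graph families the rebuttal names are standard generators; a sketch of producing them with networkx and computing the kind of structural properties a fine-tuned model would be asked to estimate. The parameter values are illustrative, not the paper's:

```python
# Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz families via
# networkx; parameter choices are illustrative assumptions.
import networkx as nx

n = 25  # inside the stated training range (n = 20-30)
families = {
    "erdos_renyi": nx.gnp_random_graph(n, p=0.15, seed=0),
    "barabasi_albert": nx.barabasi_albert_graph(n, m=2, seed=0),
    "watts_strogatz": nx.watts_strogatz_graph(n, k=4, p=0.1, seed=0),
}

# Structural properties a model might be asked to rank graphs by
for name, G in families.items():
    print(name, round(nx.density(G), 3), nx.number_connected_components(G))
```

Holding one family out at evaluation time, as the paper does, tests whether ranking ability learned on one degree distribution transfers to a structurally different one.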

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only

Full rationale

The paper is a controlled empirical study that reports observed performance metrics (ordinal consistency, ranking accuracy) of fine-tuned 3-4B models on held-out graph sizes and families. No derivation chain, equations, or predictions are present that could reduce to fitted parameters, self-citations, or ansatzes; all claims are framed as direct experimental outcomes on explicitly separated test distributions. The central contribution is therefore self-contained against external benchmarks and does not rely on any load-bearing internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical machine learning study, the work relies on standard experimental practices in the field rather than new axioms or invented entities. No free parameters are explicitly fitted in the sense of the central claim; model choices and graph generation are part of the experimental design.

pith-pipeline@v0.9.0 · 5443 in / 1104 out tokens · 51432 ms · 2026-05-10T05:35:15.118559+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1] H. Wang, S. Feng, T. He, Z. Tan, X. Han, Y. Tsvetkov, Can Language Models Solve Graph Problems in Natural Language?, In: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 30840–30861
  2. [2] B. Fatemi, J. Halcrow, B. Perozzi, Talk like a Graph: Encoding Graphs for Large Language Models, arXiv preprint arXiv:2310.04560, 2023
  3. [3] J. Guo, L. Du, H. Liu, M. Zhou, X. He, S. Han, GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking, arXiv preprint arXiv:2305.15066, 2023
  4. [4] M. Podstawski, TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference, In: Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART), vol. 5, 2026, pp. 4080–4087
  5. [5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, In: Advances in Neural Information Processing Systems, vol. 30, 2017
  6. [6] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, T.-Y. Liu, Do Transformers Really Perform Bad for Graph Representation?, In: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 28877–28888
  7. [7] L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, D. Beaini, Recipe for a General, Powerful, Scalable Graph Transformer, In: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022
  8. [8] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, arXiv preprint arXiv:2106.09685, 2021
  9. [9] A. Kumar, A. Raghunathan, R. Jones, T. Ma, P. Liang, Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution, arXiv preprint arXiv:2202.10054, 2022
  10. [10] M. Wortsman, G. Ilharco, M. Li, J. W. Kim, H. Hajishirzi, A. Farhadi, H. Namkoong, L. Schmidt, Robust Fine-Tuning of Zero-Shot Models, arXiv preprint arXiv:2109.01903, 2021
  11. [11] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How Powerful are Graph Neural Networks?, In: International Conference on Learning Representations (ICLR), 2019
  12. [12] T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, In: International Conference on Learning Representations (ICLR), 2017
  13. [13] Llama Team, The Llama 3 Herd of Models, arXiv:2407.21783, 2024
  14. [14] Qwen Team, Qwen2.5 Technical Report, arXiv:2412.15115, 2024
  15. [15] Phi Team, Phi-4 Technical Report, arXiv:2412.08905, 2024
  16. [16] Open LLM Leaderboard, https://huggingface.co/spaces/open-llm-leaderboard, accessed April 2026
  17. [17] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, M. Neumann, TUDataset: A Collection of Benchmark Datasets for Learning with Graphs, In: ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020
  18. [18] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, J. Leskovec, Open Graph Benchmark: Datasets for Machine Learning on Graphs, In: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 22118–22133
  19. [19] Anthropic, Claude, https://www.anthropic.com/claude, accessed April 2026