Billion-Scale Graph Foundation Models

Ami Tavory; Andrey Isakov; Daniel Haimovich; David Abensur; Ido Guy; Maya Bechler-Speicher; Udi Weinsberg; Yoel Gottlieb

arxiv: 2602.04768 · v2 · pith:ZTYIUBKLnew · submitted 2026-02-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 10:53 UTC · model grok-4.3

keywords graph foundation models · heterogeneous graphs · neural scaling laws · pretraining · Transformer architecture · billion-scale graphs · few-shot learning · GraphBFF

The pith

GraphBFF supplies a complete recipe to pretrain billion-parameter Transformers on large heterogeneous graphs that then outperform baselines on ten unseen downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that the foundation-model paradigm of large-scale pretraining followed by lightweight adaptation can be made to work for general real-world graphs. It does so by defining an end-to-end process that includes a new Transformer architecture, concrete batching and training procedures, and explicit scaling laws for heterogeneous graphs. A reader would care because many high-value applications rely on graph data yet still train models from scratch for each new task; a working GFM would change that cost structure. The authors validate the approach on a real billion-scale graph and report consistent gains, including in few-shot regimes, across node and link prediction problems.

Core claim

GraphBFF is an end-to-end recipe for building billion-parameter Graph Foundation Models on heterogeneous graphs. Its central component is the GraphBFF Transformer, a flexible and scalable architecture that supports neural scaling laws in which loss falls predictably with added model capacity or training data. When a billion-parameter instance is pretrained on a real-world billion-scale graph and then evaluated on ten diverse downstream tasks on graphs never seen in training, it outperforms baselines by margins reaching 31 PRAUC points, including in few-shot settings.

What carries the argument

The GraphBFF Transformer, a scalable architecture that processes heterogeneous graphs for both pretraining and task-specific adaptation.

If this is right

Loss on heterogeneous graphs decreases in a predictable way when either model size or pretraining data volume is increased, whichever is the current bottleneck.
Explicit methods for data batching, pretraining objectives, and fine-tuning enable practical construction of GFMs at industrial scale.
The same pretrained model delivers strong results on both node-level and link-level classification and regression tasks.
Performance advantages persist in few-shot adaptation to completely new graphs.
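The bottleneck behavior in the first point can be illustrated with the additive power-law form cited further down this page, L(N, D) = L∞ + (Nc/N)^αN + (Dc/D)^αD. The following is a minimal sketch, not the authors' code: the function name `scaling_loss` and every constant are hypothetical and chosen only so the effect is visible.

```python
def scaling_loss(n_params, n_data, l_inf, n_c, alpha_n, d_c, alpha_d):
    """Evaluate L(N, D) = L_inf + (N_c / N)^alpha_N + (D_c / D)^alpha_D."""
    capacity_term = (n_c / n_params) ** alpha_n  # shrinks as the model grows
    data_term = (d_c / n_data) ** alpha_d        # shrinks as the data grows
    return l_inf + capacity_term + data_term


# Hypothetical constants for illustration only; none come from the paper.
CONSTANTS = dict(l_inf=1.0, n_c=1e6, alpha_n=0.5, d_c=1e6, alpha_d=0.5)

# A small model trained on abundant data: capacity is the bottleneck.
base = scaling_loss(n_params=1e8, n_data=1e12, **CONSTANTS)
more_data = scaling_loss(n_params=1e8, n_data=1e14, **CONSTANTS)
bigger_model = scaling_loss(n_params=1e10, n_data=1e12, **CONSTANTS)

gain_from_data = base - more_data
gain_from_model = base - bigger_model
print(f"gain from 100x data:  {gain_from_data:.4f}")
print(f"gain from 100x model: {gain_from_model:.4f}")
```

Under these assumed constants the capacity term dominates the loss, so growing the model helps far more than adding data. Swapping the two starting values reverses the picture.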

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the observed scaling behavior generalizes, organizations could allocate compute between model size and data collection more efficiently when building graph models.
Pretrained GFMs might lower the barrier for applying graph learning in domains that currently lack large labeled datasets.
The open challenges noted in the paper point to the need for new techniques in efficient inference and continual adaptation at even larger scales.
Similar recipe-driven approaches could be tested on other structured domains such as knowledge graphs or temporal networks.

Load-bearing premise

The Transformer architecture remains effective and computationally tractable when scaled to a billion parameters on industrial heterogeneous graphs.

What would settle it

Train the billion-parameter GraphBFF on the same billion-scale graph, then measure performance on the ten held-out tasks; if gains over baselines fall below a few PRAUC points or disappear in few-shot regimes, the central claim does not hold.
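A margin "in PRAUC points" can be computed as sketched below. This assumes PRAUC is estimated as average precision (the step-wise area under the precision-recall curve) and that one point equals 0.01 of area; the paper's exact estimator is not stated on this page. The function names and the toy labels and scores are hypothetical.

```python
def pr_auc(labels, scores):
    """Average precision for binary labels (1 = positive, 0 = negative).

    Assumption: PRAUC is estimated as average precision. Tied scores are
    ordered arbitrarily in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("PRAUC is undefined without positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            area += true_positives / rank  # precision at this recall step
    return area / total_positives


def margin_in_points(labels, model_scores, baseline_scores):
    """Model-minus-baseline PRAUC, scaled so that 1 point = 0.01 of area."""
    return 100.0 * (pr_auc(labels, model_scores) - pr_auc(labels, baseline_scores))


# Toy data, invented for illustration.
labels = [1, 0, 1, 0, 0, 1]
model_scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7]
baseline_scores = [0.9, 0.8, 0.3, 0.7, 0.1, 0.6]
margin = margin_in_points(labels, model_scores, baseline_scores)
print(f"margin: {margin:.1f} PRAUC points")
```

Running the same comparison per task and per few-shot budget would show whether the margin stays above whatever threshold counts as "a few points".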

Original abstract

Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion-Foundation-Fusion (GraphBFF): an end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for large-scale heterogeneous graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present neural scaling laws for heterogeneous graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework over a real-world billion-scale graph, with an evaluation of a billion-parameter GraphBFF Transformer following the proposed recipe. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF consistently outperforms baselines, with large margins of up to 31 PRAUC points, including in few-shot settings. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GraphBFF gives a usable end-to-end recipe for billion-parameter heterogeneous graph pretraining plus scaling-law observations, but the large reported gains may simply reflect unmatched model capacity rather than the new methods.

The main point is that the authors supply a concrete pipeline for pretraining a billion-parameter Transformer on a real billion-scale heterogeneous graph, along with scaling laws showing predictable loss reduction as capacity or data grows. They then adapt the model to ten downstream tasks on held-out graphs and report sizable improvements, including in few-shot regimes. That combination of scale, recipe, and multi-task results is what stands out at first read.

The GraphBFF Transformer and the batching, pretraining, and fine-tuning steps appear detailed enough to be tried by others working on large graphs. The scaling-law plots are straightforward and useful for anyone planning similar runs. Credit is due for actually running the thing at industrial size instead of stopping at small proxies.

The weakest part is the baseline comparison. The abstract and stress-test note both leave open whether the baselines were also trained at billion-parameter scale with comparable data volume. If they were smaller or not pretrained, the margins of up to 31 PRAUC points are consistent with known capacity effects and do not isolate the contribution of the new architecture or recipe. Downstream ablations that hold scale fixed would have made the central claim much stronger. Minor gaps include missing error bars and data-selection details, but those are secondary.

This paper is aimed at practitioners and researchers who already deal with massive heterogeneous graphs in recommendation, fraud, or social-network settings and want to move away from training separate models per task. A reader focused on scaling graph models will get practical value from the methodologies even if the performance story needs tighter controls. It is coherent on its own terms and shows clear engagement with the scaling problem, so it deserves a serious referee. I would send it to review with a specific request for scale-matched baseline results and a few more ablations.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GraphBFF, an end-to-end recipe for billion-parameter Graph Foundation Models on large-scale heterogeneous graphs. It centers on the GraphBFF Transformer architecture, presents neural scaling laws showing predictable loss reduction with model capacity or data scale, and details methodologies for data batching, pretraining, and fine-tuning. A billion-parameter model pretrained on a real-world billion-scale graph is evaluated on ten diverse downstream tasks (node- and link-level classification and regression) on unseen graphs, where it outperforms baselines by margins up to 31 PRAUC points, including in few-shot regimes.

Significance. If the performance margins and scaling observations hold under controlled conditions, the work would be significant for extending the foundation-model paradigm to industrial-scale heterogeneous graphs. The concrete batching, pretraining, and fine-tuning recipes and the empirical demonstration on a billion-scale graph provide actionable guidance that could accelerate practical GFM deployment, while the scaling-law results offer a basis for predicting compute requirements in graph domains.
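To make "a basis for predicting compute requirements" concrete: when data is abundant, the additive form cited on this page reduces to L(N) ≈ L∞ + (Nc/N)^αN, which is a straight line in log-log space once L∞ is subtracted. The sketch below fits that line by ordinary least squares and extrapolates. It assumes L∞ is already known, which is a simplification, and the names `fit_capacity_law` and `predict_loss` and all numbers are hypothetical.

```python
import math


def fit_capacity_law(sizes, losses, l_inf):
    """Fit log(L - L_inf) = alpha * log(N_c) - alpha * log(N) by least squares.

    Assumption: the data term is negligible and l_inf is known in advance.
    Returns (alpha, n_c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss - l_inf) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    alpha = -slope                      # the slope of the line is -alpha
    n_c = math.exp(intercept / alpha)   # the intercept is alpha * log(N_c)
    return alpha, n_c


def predict_loss(n_params, l_inf, alpha, n_c):
    """Extrapolate the fitted capacity-limited law to a new model size."""
    return l_inf + (n_c / n_params) ** alpha


# Synthetic observations generated from known, invented constants.
TRUE_ALPHA, TRUE_N_C, L_INF = 0.3, 1e5, 2.0
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [L_INF + (TRUE_N_C / n) ** TRUE_ALPHA for n in sizes]

alpha, n_c = fit_capacity_law(sizes, losses, L_INF)
forecast = predict_loss(1e10, L_INF, alpha, n_c)
print(f"alpha={alpha:.3f}, N_c={n_c:.3g}, forecast at 1e10 params={forecast:.4f}")
```

With real measurements L∞ would have to be estimated jointly with the other constants, and the fit is only trustworthy inside the regime where model capacity is the bottleneck.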

major comments (2)

[Evaluation / Results] The central claim attributes large downstream gains (up to 31 PRAUC) to the GraphBFF recipe and scaling laws, yet the evaluation provides no parameter counts, data volumes, or training details for the ten baselines. Without matched-scale controls, the margins are consistent with known capacity effects and do not isolate the contribution of the proposed Transformer, batching, or pretraining methods (see abstract and results sections).
[Neural Scaling Laws] The scaling-laws section reports loss trends versus capacity and data but lacks ablations that hold model size fixed while varying only the batching or pretraining components; this leaves the load-bearing claim that the end-to-end recipe (rather than raw scale) drives the observed downstream improvements without direct supporting evidence.

minor comments (2)

[Experimental Setup] Clarify whether the ten downstream tasks use the same heterogeneous graph schema as pretraining or introduce new node/edge types, as this affects claims of generalization to unseen graphs.
[Results] Add error bars or multiple random seeds to the reported PRAUC margins to allow assessment of statistical significance of the largest gains.
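The second minor comment can be sketched as follows: collect PRAUC for the model and a baseline over several seeds, then report the mean margin with its spread. This assumes runs are paired by seed and uses the sample standard deviation and standard error. The function name and the numbers are invented for illustration.

```python
import math
import statistics


def summarize_margin(model_prauc_by_seed, baseline_prauc_by_seed):
    """Return (mean, sample std, standard error) of the margin in PRAUC points.

    Assumption: entry i of both lists comes from the same random seed.
    """
    if len(model_prauc_by_seed) != len(baseline_prauc_by_seed):
        raise ValueError("need one baseline run per model run")
    if len(model_prauc_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    margins = [
        100.0 * (model - baseline)
        for model, baseline in zip(model_prauc_by_seed, baseline_prauc_by_seed)
    ]
    mean = statistics.mean(margins)
    std = statistics.stdev(margins)          # sample standard deviation
    sem = std / math.sqrt(len(margins))      # standard error of the mean
    return mean, std, sem


# Invented PRAUC values for three seeds.
mean, std, sem = summarize_margin([0.80, 0.82, 0.78], [0.50, 0.50, 0.50])
print(f"margin: {mean:.1f} +/- {sem:.1f} PRAUC points (std {std:.1f}, n=3)")
```

A margin whose standard error is small relative to its mean is easier to defend than a single-run number, especially for the largest reported gains.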

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of evaluation rigor and the need to better isolate contributions in our scaling analysis. We address each point below and indicate planned revisions to strengthen the manuscript.

Referee: [Evaluation / Results] The central claim attributes large downstream gains (up to 31 PRAUC) to the GraphBFF recipe and scaling laws, yet the evaluation provides no parameter counts, data volumes, or training details for the ten baselines. Without matched-scale controls, the margins are consistent with known capacity effects and do not isolate the contribution of the proposed Transformer, batching, or pretraining methods (see abstract and results sections).

Authors: We agree that additional details on the baselines are necessary for a fair assessment. In the revised manuscript, we will add a supplementary table reporting parameter counts, data volumes, and training configurations for each baseline (sourced from original publications or our controlled re-implementations). We note that several baselines were not originally designed or scaled to billion-parameter regimes on heterogeneous graphs, which is itself part of the contribution: demonstrating that the GraphBFF recipe enables effective pretraining and adaptation at this scale on unseen graphs, including in few-shot settings. To further address capacity concerns, we will include results from scaled versions of representative baselines where compute permits.

revision: yes

Referee: [Neural Scaling Laws] The scaling-laws section reports loss trends versus capacity and data but lacks ablations that hold model size fixed while varying only the batching or pretraining components; this leaves the load-bearing claim that the end-to-end recipe (rather than raw scale) drives the observed downstream improvements without direct supporting evidence.

Authors: We acknowledge that dedicated ablations isolating batching and pretraining while holding model size fixed would provide stronger evidence. The current scaling laws demonstrate predictable loss reduction under the full proposed recipe. In the revision, we will add smaller-scale controlled ablations (holding capacity fixed) that vary batching strategy and pretraining objectives to quantify their individual contributions. Full-scale ablations at billion parameters remain computationally prohibitive, but the smaller-scale results combined with the end-to-end performance on unseen tasks support the recipe's role beyond raw scale alone.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical scaling laws and downstream results are independent of inputs

The paper reports an architecture (GraphBFF Transformer), observed neural scaling laws showing predictable loss decrease with scale or data, and empirical outperformance on ten held-out downstream tasks after pretraining on a billion-scale graph. These are presented as experimental outcomes from training and evaluation rather than closed-form derivations or predictions that reduce to fitted parameters by construction. No equations or self-citations are invoked to force the central claims; the scaling behavior is described as an observed phenomenon depending on bottlenecks, and results are validated on unseen graphs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claims rest on the unstated assumption that the proposed architecture scales to a billion parameters without additional hidden costs or instabilities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: GraphBFF Transformer with Type-Conditioned Attention (TCA) and Type-Agnostic Attention (TAA); KL-Batching and Round-Robin Batching; neural scaling law $L(N, D) = L_\infty + (N_c/N)^{\alpha_N} + (D_c/D)^{\alpha_D}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: 1.4B-parameter model pretrained on 1B edges; zero-shot/probing gains up to 31 PRAUC on unseen graphs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Neural Sheaf Diffusion
cs.LG · 2026-05 · unverdicted · novelty 6.0

DNSD replaces the sheaf Laplacian with a sheaf adjacency operator to maintain informative signals in deep layers, outperforming GNN and NSD baselines on long-range synthetic and real graph tasks.