OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data

Guillermo Bernárdez; Johan Mathe; Louisa Cornelis; Louis Van Langendonck; Nina Miolane

arxiv: 2605.15511 · v2 · pith:UXKJKG3Gnew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-19 15:40 UTC · model grok-4.3

classification 💻 cs.LG

keywords Graph Neural Networks · Omics Data · Benchmarking · Graph-Level Prediction · Biological Networks · Machine Learning Baselines · Low-Sample High-Dimensional Data

The pith

Graph neural networks often fail to outperform simple MLPs on omics data tasks with few samples and many nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a standardized benchmark for evaluating graph neural networks in the omics domain, where the number of patient samples is small compared to the number of genes or proteins. It constructs families of graphs from raw biological data and tests classical GNNs, specialized variants, MLPs, and other baselines on graph-level prediction tasks. Results indicate that standard GNNs frequently match or fall short of non-graph methods, which questions whether the added graph structure provides meaningful gains in this setting. A sympathetic reader would care because many current approaches in bioinformatics assume graphs help, and this finding points toward the need for methods better matched to high-dimensional, low-sample biological data.

Core claim

OgBench supplies an end-to-end pipeline that turns raw omics measurements into varied featured graphs, then measures the performance of GNNs against MLPs and classical baselines in the n ≪ p regime, where the number of samples n is far smaller than the number of nodes per graph p. The central finding is that widely used GNNs do not reliably surpass simpler models, which challenges the idea that biological graph structure inherently improves predictive accuracy on such data.

What carries the argument

OgBench, a modular benchmarking platform that generates families of featured graphs from raw omics data and runs standardized graph-level prediction experiments.
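The conversion from an expression table to per-sample graphs can be illustrated with a minimal sketch. It assumes a thresholded Pearson co-expression graph whose topology is shared by all samples, with each sample's expression values as node features, in the spirit of the overview in Figure 2. The function names, the 0.8 threshold, and the toy values are hypothetical and are not taken from the OgBench code base.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / sqrt(vx * vy)


def coexpression_edges(expr, threshold=0.8):
    """expr: list of n samples, each a list of p expression values.

    Returns undirected edges (i, j) with i < j and |corr| >= threshold.
    The threshold is an illustrative assumption, not an OgBench default.
    """
    p = len(expr[0])
    columns = [[row[g] for row in expr] for g in range(p)]
    return [
        (i, j)
        for i, j in combinations(range(p), 2)
        if abs(pearson(columns[i], columns[j])) >= threshold
    ]


def sample_graphs(expr, edges):
    """One graph per sample: shared edges, sample-specific node features."""
    return [{"x": list(row), "edges": edges} for row in expr]


# Toy data (hypothetical): n = 4 samples, p = 3 genes.
expr = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.9, 2.0],
]
edges = coexpression_edges(expr, threshold=0.8)
graphs = sample_graphs(expr, edges)
print(edges)       # only genes 0 and 1 are strongly co-expressed
print(len(graphs)) # one graph per sample
```

In this toy setting an MLP would see each row of `expr` as a flat feature vector, while a GNN would additionally receive `edges`; the benchmark's question is whether that extra input helps when n is much smaller than p.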

If this is right

Simpler non-graph models should be included as strong baselines when applying machine learning to omics graphs.
New architectures for biological data must explicitly address the low-sample high-node regime rather than borrowing from dense-graph settings.
The value of incorporating graph structure from omics measurements requires fresh validation rather than being taken as given.
Development of omics-specific GNN variants can now be guided by the standardized evaluation setup provided.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If feature signals dominate over topology in these tasks, then methods that learn adaptive graph construction or edge weighting from data may prove more useful than fixed biological networks.
Extending the benchmark to additional omics modalities or multi-task settings could reveal whether the observed pattern holds beyond the current collection of datasets.
The results suggest that practitioners might first try classical feature-based models before investing in graph-based pipelines for similar biological prediction problems.

Load-bearing premise

The graphs derived from raw omics data encode biologically meaningful relationships that matter for the downstream prediction tasks.

What would settle it

A concrete test would be to run the benchmark on its provided datasets and observe whether any GNN architecture achieves statistically higher accuracy or AUC than the MLP baseline across repeated trials with fixed hyperparameters.
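One way to make such a comparison concrete is an exact paired sign-flip permutation test on per-seed scores. This is a sketch of a generic technique, not the procedure used in the paper; the function name and the F1 values below are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided paired permutation test on the summed difference.

    Under the null hypothesis that both models are exchangeable within
    each paired run, every sign pattern of the differences is equally
    likely, so all 2^k patterns are enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so floating-point ties count as ties.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical per-seed test F1 scores for a GNN and an MLP.
gnn_f1 = [0.71, 0.69, 0.73, 0.70, 0.72]
mlp_f1 = [0.72, 0.70, 0.71, 0.71, 0.73]
p_value = paired_sign_flip_pvalue(gnn_f1, mlp_f1)
print(p_value)
```

With k paired runs the smallest attainable two-sided p-value is 2 / 2^k, so five seeds cannot go below 0.0625. A claim of a statistically significant gap at the 0.05 level therefore needs more than five repeated trials under this test.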

Figures

Figures reproduced from arXiv: 2605.15511 by Guillermo Bernárdez, Johan Mathe, Louisa Cornelis, Louis Van Langendonck, Nina Miolane.

**Figure 1.** Existing graph benchmarks operate in the n ≫ p regime, where the number of graphs n far exceeds the average number of nodes per graph p. Bar plot of n/p for benchmark graph classification datasets from [39, 25, 55, 18, 17]. On the other hand, existing GNN inductive benchmarks, ranging from the recent GraphBench [55] to established ones like OGB [25], TUDataset [39], and LRGB [18], predominantly operate in t…

**Figure 2.** Overview of OgBench: First GNN benchmark platform for omics graph datasets. Left: Transcriptomics or proteomics expression data across p genes/proteins and n samples. Middle: 1) Co-expression or PPI graphs are constructed using classical omics approaches; each sample becomes a graph with nodes representing genes/proteins and normalized expression values as node features. 2) A model is trained on a graph-l…

**Figure 3.** Best model performance per dataset (selected by validation F1, error bars = std across 3 …

**Figure 4.** Best test F1 by node selection method and sampling ratio. For each model family, the configuration with the highest validation F1 is selected per method-ratio combination and evaluated on the test set. […] confidence intervals. On Parkinsons and BRCA, linear baselines remain competitive or superior. Clearly, more complex models (GPS, ChebNet, SAGN, MLA-GNN) do not guarantee better performance, with rankings d…

**Figure 5.** Test F1 by readout type for each GNN backbone, sweeping over node selection method, …

**Figure 6.** Test F1 by edge construction method for each GNN backbone, sweeping over node selection …

**Figure 7.** Test F1-macro for validation-best (K=1) vs. top-K ensembles (K=3, 5, 10) across model …

**Figure 8.** Comparison of model performance under (a) traditional single-best-validation selection …

**Figure 9.** Validation rank vs. test F1 for all hyperparameter configurations (pooled: MLP + GNNs). …

**Figure 10.** Validation rank vs. test F1 (top 100 validation ranks only). All models show a wide (about …

**Figure 11.** Linear regression of node sample ratio vs. Test F1 Macro score. Each subplot shows fitted regression lines for all models within a specific dataset and graph construction method. Solid lines indicate p < 0.05; dashed lines indicate p ≥ 0.05.

Original abstract

Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime n ≫ p, where the number of graphs n greatly exceeds the number of nodes per graph p. This overlooks biological domains such as omics, which operate in the opposite n ≪ p regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: how do GNNs perform in this low-sample, high-node omics setting? We introduce OgBench (Omics-Graph Bench), the first benchmarking platform for graph-level prediction in the n ≪ p regime characteristic of omics data. We provide a standardized, end-to-end modular infrastructure from raw omics data to families of featured graphs with varied structural properties. We benchmark classical GNNs, as well as GNNs designed for large graphs and omics applications, alongside MLPs and machine learning baselines to establish reference performances. Our results show that widely used GNNs often do not outperform simple MLPs and classical baselines. These findings challenge the prevailing assumption that graph structure inherently adds value in this domain, fostering a critical reassessment of current learning paradigms. Ultimately, by exposing these limitations, OgBench provides the open-source ecosystem necessary for the community to develop and validate novel architectures explicitly tailored for biological graphs. The code is available at https://github.com/geometric-intelligence/ogbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OgBench introduces a needed benchmark for GNNs on omics data but its main claim needs validation that the graphs are biologically meaningful.

Hi,

The one or two things to know about this paper are that it introduces OgBench as the first benchmark specifically for graph-level prediction with GNNs in the omics regime where samples are few but nodes are many, and it finds that widely used GNNs often do not beat simple MLPs or classical machine learning methods.

What the paper does well is create a standardized and modular framework that takes raw omics data and produces families of featured graphs with different structural properties. They then benchmark a selection of GNN architectures, including some meant for large graphs and omics applications, against MLPs and other baselines. This setup is practical, the code is available, and it directly tackles a data regime that most existing graph benchmarks ignore.

The soft spots are in the interpretation of the results. The central finding challenges the idea that graph structure adds value, but this depends on the graphs actually encoding biologically relevant relationships for the prediction tasks. The manuscript describes building the graphs but does not include validation steps such as checking overlap with known biological pathways or running enrichment analyses. If the edges are largely uninformative, then the comparison between GNNs and MLPs does not tell us much about whether graph structure can help when the structure is real and meaningful.

This paper is for researchers in machine learning who apply models to biological data and for those developing new GNN architectures for similar high-dimensional low-sample settings. A reader looking for an empirical reference or a platform to test ideas would find it valuable. It deserves a serious referee because the benchmark is new and the questions it raises are relevant, even though the graph validation could be strengthened. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OgBench, a benchmarking framework for graph-level prediction with GNNs on omics data in the n ≪ p regime. It supplies a modular pipeline that converts raw omics data into families of featured graphs with controlled structural properties, then evaluates classical GNNs, large-graph GNNs, omics-specific models, MLPs, and classical ML baselines. The central empirical finding is that widely used GNNs frequently fail to outperform simple MLPs and non-graph baselines, which the authors interpret as evidence that graph structure does not inherently add value in this domain.

Significance. If the constructed graphs can be shown to encode biologically meaningful relationships, the result would be significant: it would supply the first standardized benchmark exposing limitations of current GNN architectures on high-dimensional biological graphs and would motivate the development of new inductive biases tailored to the omics setting. The open-source modular infrastructure is a concrete contribution that could accelerate such work. The current evidence, however, rests on unvalidated graph constructions, which weakens the force of the claim that graph structure itself is unhelpful.

major comments (2)

[§3] §3 (Graph Construction): The manuscript states that families of graphs are built from raw omics data with varied structural properties, yet supplies no external validation—such as overlap with curated pathway databases, gene-set enrichment statistics, or expert review—that the retained edges capture biologically relevant interactions rather than statistical artifacts or arbitrary thresholds. Because the central claim (that GNNs add no value over MLPs) presupposes that the graphs encode task-relevant structure, this omission is load-bearing.
[§5] §5 (Experimental Results): The reported comparisons lack details on statistical testing (e.g., paired t-tests or Wilcoxon tests across random seeds), exact sample sizes per dataset, and the precise graph-construction hyperparameters (thresholds, feature-selection criteria). Without these, it is impossible to judge whether the observed parity or underperformance of GNNs is robust or an artifact of particular dataset realizations.
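The external validation requested in the first major comment could be sketched roughly as follows. The sketch counts how many constructed edges connect genes that share a curated pathway and computes a hypergeometric upper-tail p-value against random edge selection. The gene sets, gene names, and function names are hypothetical; in practice the sets would come from a curated resource such as a pathway database, loaded separately, and all pathway genes are assumed to belong to the graph's node universe.

```python
from itertools import combinations
from math import comb


def pathway_pairs(pathways):
    """All unordered gene pairs that co-occur in at least one pathway."""
    pairs = set()
    for genes in pathways.values():
        pairs.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return pairs


def edge_overlap(edges, pathways, n_genes):
    """Return (k, fraction, p_value) for pathway-supported edges.

    k        : number of edges whose endpoints share a pathway
    fraction : k divided by the number of distinct edges
    p_value  : P(X >= k) when the same number of edges is drawn at random
               from all C(n_genes, 2) gene pairs (hypergeometric tail)
    """
    supported = pathway_pairs(pathways)
    edge_set = {frozenset(e) for e in edges}
    k = len(edge_set & supported)
    n_pairs = comb(n_genes, 2)   # population size
    n_supported = len(supported)  # "successes" in the population
    m = len(edge_set)             # number of draws
    tail = sum(
        comb(n_supported, i) * comb(n_pairs - n_supported, m - i)
        for i in range(k, min(n_supported, m) + 1)
    )
    return k, k / m, tail / comb(n_pairs, m)


# Hypothetical toy universe of 5 genes and two curated gene sets.
pathways = {"P1": {"A", "B", "C"}, "P2": {"D", "E"}}
edges = [("A", "B"), ("B", "C"), ("A", "D")]
k, fraction, p_value = edge_overlap(edges, pathways, n_genes=5)
print(k, fraction, p_value)
```

A low overlap fraction or a large p-value across the graph families would suggest that the edges carry little curated biological signal, which would change how the GNN-versus-MLP parity should be read.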

minor comments (2)

[Abstract] The abstract claims the code is available at the cited GitHub link, but the manuscript should include a permanent archive link (e.g., Zenodo DOI) to satisfy reproducibility standards.
[Tables in §5] Notation for the n ≪ p regime is introduced in the abstract but not consistently reused in the experimental tables; adding a column or row label that explicitly flags this regime would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating where we agree that revisions are warranted to strengthen the work.

Referee: [§3] §3 (Graph Construction): The manuscript states that families of graphs are built from raw omics data with varied structural properties, yet supplies no external validation—such as overlap with curated pathway databases, gene-set enrichment statistics, or expert review—that the retained edges capture biologically relevant interactions rather than statistical artifacts or arbitrary thresholds. Because the central claim (that GNNs add no value over MLPs) presupposes that the graphs encode task-relevant structure, this omission is load-bearing.

Authors: We thank the referee for this important observation. OgBench is constructed to provide modular control over graph families with differing structural properties (e.g., via varying correlation thresholds and feature-selection criteria) precisely so that the community can test the value of graph structure under different assumptions in the n ≪ p regime. We acknowledge that the absence of external validation against pathway databases or enrichment statistics makes it harder to interpret whether the reported parity between GNNs and MLPs reflects the limited utility of graph structure or the limitations of existing GNN inductive biases. In the revised manuscript we will add quantitative validation: overlap statistics with KEGG and Reactome pathways, as well as gene-set enrichment results for the retained edges across the graph families. These additions will be presented in a new subsection of §3 together with a discussion of how the validation affects the strength of the central claim. revision: yes
Referee: [§5] §5 (Experimental Results): The reported comparisons lack details on statistical testing (e.g., paired t-tests or Wilcoxon tests across random seeds), exact sample sizes per dataset, and the precise graph-construction hyperparameters (thresholds, feature-selection criteria). Without these, it is impossible to judge whether the observed parity or underperformance of GNNs is robust or an artifact of particular dataset realizations.

Authors: We agree that these experimental details are necessary for reproducibility and for readers to assess robustness. The current version reports mean performance but does not include formal statistical comparisons or the exact construction hyperparameters. In the revised manuscript we will expand §5 (and the supplementary material) to report: (i) paired t-tests and Wilcoxon signed-rank tests across at least five random seeds for all model comparisons, (ii) the precise values of n (number of graphs) and p (number of nodes) for every dataset, and (iii) the full list of graph-construction hyperparameters, including correlation thresholds, p-value cutoffs, and feature-selection procedures. These additions will allow direct evaluation of whether the observed results are stable across realizations. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no load-bearing derivations or self-referential reductions

The manuscript introduces OgBench as an empirical evaluation framework for GNNs on omics data in the n ≪ p regime. It constructs graph families from raw data, runs standard GNN and MLP baselines on public datasets, and reports comparative performance numbers. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the central claim. The results are direct statistical comparisons against external open-source implementations and classical baselines; the claim that GNNs often fail to outperform MLPs follows from those measurements rather than from any internal redefinition or tautological reduction. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes an empirical benchmark and infrastructure rather than new theoretical derivations; it relies on standard machine-learning evaluation practices and domain conventions for turning omics tables into graphs.

axioms (1)

Domain assumption: Omics measurements can be converted into graphs whose nodes are genes or proteins and whose edges reflect known or inferred biological relationships.
This conversion step is required to produce the featured graphs used in all experiments.

invented entities (1)

OgBench framework (no independent evidence)
Purpose: Modular end-to-end infrastructure that converts raw omics data into families of graphs and runs standardized GNN and baseline evaluations.
New platform introduced to address the lack of benchmarks in the n ≪ p regime.

pith-pipeline@v0.9.0 · 5822 in / 1283 out tokens · 47879 ms · 2026-05-19T15:40:50.935953+00:00 · methodology
