Efficient Synthetic Network Generation via Latent Embedding Reconstruction

Feifan Jiang; Gongjun Xu; Ji Zhu; Shihao Wu; Yinan Bu

arXiv: 2606.00934 · v1 · pith:7FSVTYONnew · submitted 2026-05-31 · stat.ML · cs.LG · stat.AP · stat.ME

Feifan Jiang, Yinan Bu, Shihao Wu, Gongjun Xu, Ji Zhu

Pith reviewed 2026-06-28 16:48 UTC · model grok-4.3

classification stat.ML · cs.LG · stat.AP · stat.ME

keywords synthetic network generation · latent embeddings · network generation · degree distribution · latent space models · consistency results · graph generation

The pith

SyNGLER generates synthetic networks from reconstructed latent embeddings while preserving structural properties with theoretical consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SyNGLER as a framework for generating synthetic networks. It applies a latent space network model to learn low-dimensional embeddings of nodes from an observed network. A distribution-free generator is then built over these embeddings to allow sampling of new embeddings. Synthetic networks are produced by applying the latent space model to the sampled embeddings. This method is designed to efficiently capture and preserve features like sparsity and degree heterogeneity, supported by consistency theorems on edge distribution distances and empirical comparisons showing better preservation of network moments and degree distributions.

Core claim

Given an observed network, SyNGLER learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, it samples node embeddings from the generator and produces synthetic networks using the latent space network model. This yields networks that preserve sparsity and node degree heterogeneity with consistency results on the distance between the true and synthetic edge distributions.
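
The three steps in this claim (embed, reconstruct the latent distribution, regenerate) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes a linear-link (RDPG-style) latent space model fitted by adjacency spectral embedding, and it uses a plain bootstrap over embedding rows as a stand-in for the distribution-free generator. The function names `fit_embeddings`, `resample_embeddings`, and `generate_network` and the toy two-block data are all hypothetical.

```python
import numpy as np

def fit_embeddings(A, r):
    # Assumption: adjacency spectral embedding as a stand-in for the latent space model fit.
    vals, vecs = np.linalg.eigh(A.astype(float))
    idx = np.argsort(vals)[::-1][:r]                      # indices of the r largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def resample_embeddings(Z, n_new, rng):
    # Assumption: row bootstrap as one simple distribution-free generator.
    return Z[rng.integers(0, Z.shape[0], size=n_new)]

def generate_network(Z, rng):
    # Linear link: edge (i, j) ~ Bernoulli(z_i^T z_j), with probabilities clipped to [0, 1].
    P = np.clip(Z @ Z.T, 0.0, 1.0)
    upper = np.triu(rng.random(P.shape) < P, k=1)         # strict upper triangle, no self-loops
    return (upper | upper.T).astype(int)                  # symmetrize to an undirected graph

rng = np.random.default_rng(0)

# Toy observed network: a two-block stochastic block model (illustrative data only).
n = 200
labels = np.repeat([0, 1], n // 2)
B = np.array([[0.20, 0.02], [0.02, 0.15]])
P_true = B[labels][:, labels]
upper = np.triu(rng.random((n, n)) < P_true, k=1)
A_obs = (upper | upper.T).astype(int)

Z_hat = fit_embeddings(A_obs, r=2)                        # step 1: learn embeddings
Z_new = resample_embeddings(Z_hat, n_new=n, rng=rng)      # step 2: sample from the generator
A_syn = generate_network(Z_new, rng)                      # step 3: produce a synthetic network

print("observed density: ", A_obs.sum() / (n * (n - 1)))
print("synthetic density:", A_syn.sum() / (n * (n - 1)))
```

Swapping the bootstrap for a smoother generator, or the linear link for a sigmoid link with degree parameters, changes only the body of the corresponding function; the three-stage structure stays the same.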

What carries the argument

The combination of latent space network models for embedding and a distribution-free generator for resampling embeddings in the latent space.

If this is right

Synthetic networks better preserve key network characteristics such as network moments and degree distributions.
Consistency results hold on the distance between true and synthetic edge distributions.
The method allows efficient training with lower computational cost than many deep architectures.
Unique characteristics like sparsity and degree heterogeneity are preserved through the latent space framework.
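
To make "network moments" concrete, the sketch below computes a few small-subgraph statistics from an adjacency matrix. It assumes the term refers to quantities such as edge density, wedge counts, and triangle counts; the paper's exact definitions may differ, and the helper name `network_moments` is hypothetical.

```python
import numpy as np

def network_moments(A):
    """Edge density, wedge count, triangle count, and transitivity of an undirected simple graph."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    edges = A.sum() / 2.0
    wedges = (deg * (deg - 1)).sum() / 2.0        # paths of length two, counted at the centre node
    triangles = np.trace(A @ A @ A) / 6.0         # each triangle yields six closed 3-walks
    return {
        "edge_density": edges / (n * (n - 1) / 2.0),
        "wedges": wedges,
        "triangles": triangles,
        "transitivity": 3.0 * triangles / wedges if wedges > 0 else 0.0,
    }

# Small example: a triangle on nodes 0, 1, 2 plus a pendant node 3 attached to node 2.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

m = network_moments(A)
print(m)
```

Applying the same function to the observed and the synthetic adjacency matrices gives one simple way to check how closely these statistics are preserved.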

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such methods could enable better simulation studies in social sciences and biology by providing more realistic network data at scale.
The distribution-free generator might be adaptable to other embedding-based models beyond the specific latent space ones used here.
Future work could test the approach on very large networks where computational efficiency is critical.

Load-bearing premise

The low-dimensional embeddings from the latent space model sufficiently encode the network's characteristic structure including sparsity and degree heterogeneity.

What would settle it

Observing that synthetic networks generated by SyNGLER have edge distribution distances or degree distributions that deviate substantially from the observed network beyond what the consistency results predict.
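
One way such a check could be run on degree distributions is a two-sample Kolmogorov-Smirnov-type distance between the observed and synthetic degree sequences. The sketch below is an assumption about how a reader might operationalize the comparison, not a metric taken from the paper; `degree_ks_distance` is a hypothetical helper.

```python
import numpy as np

def degree_ks_distance(A, B):
    """Largest gap between the empirical degree CDFs of two undirected graphs."""
    da = np.sort(np.asarray(A).sum(axis=1))
    db = np.sort(np.asarray(B).sum(axis=1))
    grid = np.union1d(da, db)                                  # all observed degree values
    cdf_a = np.searchsorted(da, grid, side="right") / da.size  # fraction of degrees <= grid value
    cdf_b = np.searchsorted(db, grid, side="right") / db.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def from_edges(n, edges):
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return A

path = from_edges(4, [(0, 1), (1, 2), (2, 3)])   # degrees 1, 2, 2, 1
star = from_edges(4, [(0, 1), (0, 2), (0, 3)])   # degrees 3, 1, 1, 1

d_same = degree_ks_distance(path, path)
d_diff = degree_ks_distance(path, star)
print(d_same, d_diff)
```

A distance that stays large as the network grows, relative to the variation between independent synthetic draws, would be evidence against the claimed preservation.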

Figures

Figures reproduced from arXiv: 2606.00934 by Feifan Jiang, Gongjun Xu, Ji Zhu, Shihao Wu, Yinan Bu.

**Figure 1.** An illustrative SyNGLER pipeline using the YouTube dataset (Yang & Leskovec, 2012) with a two-dimensional latent space. From left to right: observed network in the form of an adjacency matrix; learned latent embeddings; synthetic embeddings from the generator in the latent space; synthetic network.

…tectures by applying diffusion in a continuous latent space. The resulting latent-diffusion approach has bee…

**Figure 2.** Visualization of synthetic networks by different methods, generated on the YouTube dataset. [Plot residue from the source page: axes #nodes and #e-FLOPS; series SyNG-D, EDGE, GRAN, VGAE; panels Training graph, SyNG-D (ours), GRAN, VGAE, EDGE.]

**Figure 3.** Efficiency comparison. Left: number of e-FLOPS versus the number of nodes in the observed graph. Right: visualization of the synthetic graphs generated on DBLP.

4.2. ML Utility Evaluation. We also evaluate the machine learning utility of the generated graphs, that is, whether synthetic graphs can effectively support downstream predictive tasks. Following the evaluation protocol proposed by Li et al. (2024…

**Figure 4.** Degree and eigenvalue distributions for four real-world datasets.

Real-world datasets. We evaluate on four networks spanning thousands to millions of nodes. For Yelp, YouTube, and DBLP, whose full graphs are extremely large and highly sparse, we construct tractable training sets by extracting high-degree nodes and then taking the largest connected component (LCC). In the Yelp and YouTube datasets, nodes re…

**Figure 5.** Wall-clock training time of different methods for datasets of different sizes.

Evaluation metrics and configuration. We compare training and sampling efficiency between SyNG-D and the baseline methods through the time they spend during training and sampling. SyNG-D and VGAE are trained on CPUs, while GRAN and EDGE are trained on a single NVIDIA GeForce RTX 4090 with 24 GB of memory. For each dataset, we tra…

**Figure 6.** Evaluation of SyNG-D(MLP) on the one-million-node SBM network.

**Figure 7.** Visualization via the spring layout. Panels: Training graph, SyNG-D (ours), SyNG-D(MLP) (ours), SyNG-R (ours), VGAE, EDGE, GRAN, ER, GraphMaker.

**Figure 8.** Visualization via the spectral layout. Panels: SyNG-D (ours), SyNG-D(MLP) (ours), SyNG-R (ours), VGAE, EDGE, GRAN, ER, Training graph, GraphMaker.

**Figure 9.** Visualization via the Kamada-Kawai layout.

Original abstract

Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SyNGLER gives an efficient latent-space route to synthetic networks, but its theory on edge distributions leaves the claims about moments and degrees under-supported.

SyNGLER learns low-dimensional embeddings from a latent space network model on the observed network, then fits a distribution-free generator to those embeddings. New embeddings are sampled from the generator and fed back into the latent space model to produce synthetic networks. This setup aims for efficiency and structural preservation without the overhead of deep generative models.

The approach is new in its use of a distribution-free generator over the embeddings for the reconstruction step. It does well by keeping the generation tied to an interpretable latent model, which helps with sparsity and degree heterogeneity. The consistency result for edge distributions is a reasonable theoretical step, and releasing the code is helpful.

The main soft spot is the leap from edge distribution consistency to better preservation of network moments and degree distributions. Edge probability closeness controls individual connections but does not automatically control nonlinear functionals like clustering coefficients or the full degree distribution. The paper would need continuity or Lipschitz-type bounds to bridge that gap, and nothing in the abstract suggests they are provided. Empirical results may show the improvement, but the theory does not fully back the stronger claims.
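
A small worked example may make the gap concrete. It is a sketch under the assumption that "edge distribution" is read as the marginal law of each individual edge; the two models below are illustrative constructions and do not come from the paper.

```latex
% Two random graphs on n nodes with identical single-edge marginals.
% Model 1 (independent edges): A_{ij} ~ Bernoulli(p) independently for all i < j.
% Model 2 (all-or-nothing):    draw U ~ Bernoulli(p) once and set A_{ij} = U for all i < j.
\begin{align*}
P_1(A_{ij} = 1) &= p = P_2(A_{ij} = 1), \\
\mathbb{E}_1[A_{ij} A_{jk} A_{ik}] &= p^3, &
\mathbb{E}_2[A_{ij} A_{jk} A_{ik}] &= \mathbb{E}[U^3] = p, \\
\deg_1(i) &\sim \mathrm{Binomial}(n-1,\, p), &
\deg_2(i) &\in \{0,\, n-1\}.
\end{align*}
```

With p = 0.1 the triangle probabilities are 0.001 and 0.1, so agreement of per-edge marginals alone leaves the triangle density free to differ by two orders of magnitude, and the degree distributions are entirely different. If the paper's distance is instead defined on the joint law of the adjacency matrix, this particular counterexample would not apply.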

This work is for network researchers who need scalable synthetic data generation in applied fields. It shows clear thinking on the framework and merits a serious referee to examine the theoretical connection and the empirical setup.

Referee Report

2 major / 1 minor

Summary. The paper proposes SyNGLER, a framework that fits a latent space network model to an observed network to obtain low-dimensional node embeddings, constructs a distribution-free generator over those embeddings, samples new embeddings from the generator, and induces synthetic networks via the original latent space model. It claims that this yields synthetic networks that preserve sparsity, degree heterogeneity, network moments, and degree distributions, supported by consistency results on the distance between true and synthetic edge distributions, while being computationally more efficient than deep generative alternatives.

Significance. If the consistency results extend to the claimed network functionals and the empirical gains hold under fair capacity controls, the approach would offer a theoretically grounded, scalable alternative to black-box network generators that explicitly leverages latent space structure rather than learning it implicitly.

major comments (2)

[Abstract / theoretical results] Abstract and theoretical section: consistency is established only for the edge-probability distribution (i.e., the law of individual edges). Network moments (transitivity, assortativity) and the empirical degree distribution are nonlinear functionals of the full adjacency matrix. No Lipschitz, continuity, or uniform integrability argument is supplied showing that small edge-distribution distance implies small distance for these functionals under the same metric; this link is load-bearing for the central claim that edge consistency supports preservation of moments and degree distributions.
[Empirical studies] Empirical section: the reported improvements in moment preservation and degree-distribution fidelity are compared against baselines, but the manuscript does not appear to control for the capacity of the latent space model itself versus the generator; without such controls it is unclear whether gains are due to the reconstruction step or simply to the underlying latent space model.

minor comments (1)

[Method] Notation for the generator and the latent space model should be introduced with explicit dimension and parameter counts to clarify computational claims.
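
As an illustration of the accounting this comment asks for, the sketch below counts free parameters for an inner-product model with degree effects, p_ij = sigmoid(alpha_i + alpha_j + z_i^T z_j + rho), which is the form that appears in the paper's simulation setup. Treating rho as a single free scalar and ignoring identifiability constraints are assumptions made here; the helper names are hypothetical.

```python
def latent_model_param_count(n: int, r: int) -> int:
    """Free parameters of p_ij = sigmoid(alpha_i + alpha_j + z_i^T z_j + rho)."""
    embeddings = n * r       # Z in R^{n x r}
    degree_effects = n       # alpha in R^n
    sparsity = 1             # scalar rho (assumed to be a free parameter)
    return embeddings + degree_effects + sparsity

def node_pair_count(n: int) -> int:
    """Number of unordered node pairs, i.e. potential edges in an undirected simple graph."""
    return n * (n - 1) // 2

for n in (1_000, 1_000_000):
    print(n, latent_model_param_count(n, r=2), node_pair_count(n))
```

The count grows linearly in n for fixed r, while the number of node pairs grows quadratically, which is the kind of comparison that would help readers assess the computational claims.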

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that help clarify the scope of our contributions. We address each major point below and indicate planned revisions.

Referee: [Abstract / theoretical results] Abstract and theoretical section: consistency is established only for the edge-probability distribution (i.e., the law of individual edges). Network moments (transitivity, assortativity) and the empirical degree distribution are nonlinear functionals of the full adjacency matrix. No Lipschitz, continuity, or uniform integrability argument is supplied showing that small edge-distribution distance implies small distance for these functionals under the same metric; this link is load-bearing for the central claim that edge consistency supports preservation of moments and degree distributions.

Authors: We agree that the consistency result applies only to the edge-probability distribution and that no general continuity or integrability argument is provided to extend it to nonlinear functionals such as network moments or the degree distribution. The manuscript's theoretical guarantee is limited to edge distributions; preservation of moments and degrees is shown empirically. We will revise the abstract and theoretical section to state the scope of the consistency result explicitly and to remove any suggestion that it directly implies preservation of these functionals. revision: yes
Referee: [Empirical studies] Empirical section: the reported improvements in moment preservation and degree-distribution fidelity are compared against baselines, but the manuscript does not appear to control for the capacity of the latent space model itself versus the generator; without such controls it is unclear whether gains are due to the reconstruction step or simply to the underlying latent space model.

Authors: We acknowledge that the empirical section does not include explicit controls that isolate the latent space model's capacity from the generator. We will add experiments that compare SyNGLER to direct generation from the fitted latent space model (without the distribution-free generator) and to capacity-matched variants, thereby clarifying the contribution of the reconstruction step. revision: yes
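
The control experiment promised in the second response could take roughly the following shape. This is a hedged sketch, not the authors' protocol: it assumes the sigmoid-link model with degree parameters, uses random placeholders in place of fitted quantities, and again substitutes a joint row bootstrap for the distribution-free generator. The names `sample_graph`, `degree_std`, and `ablation` are hypothetical.

```python
import numpy as np

def sample_graph(Z, alpha, rho, rng):
    """Draw an undirected simple graph with p_ij = sigmoid(alpha_i + alpha_j + z_i^T z_j + rho)."""
    logits = alpha[:, None] + alpha[None, :] + Z @ Z.T + rho
    P = 1.0 / (1.0 + np.exp(-logits))
    upper = np.triu(rng.random(P.shape) < P, k=1)    # strict upper triangle, no self-loops
    return (upper | upper.T).astype(int)

def degree_std(A):
    """Standard deviation of node degrees, a crude summary of degree heterogeneity."""
    return float(A.sum(axis=1).std())

def ablation(Z_hat, alpha_hat, rho_hat, n_rep, rng):
    n = Z_hat.shape[0]
    results = {"direct": [], "resampled": []}
    for _ in range(n_rep):
        # (a) direct: keep the fitted latent quantities fixed and only redraw the edges
        results["direct"].append(degree_std(sample_graph(Z_hat, alpha_hat, rho_hat, rng)))
        # (b) resampled: bootstrap (z_i, alpha_i) jointly, then redraw the edges
        idx = rng.integers(0, n, size=n)
        results["resampled"].append(
            degree_std(sample_graph(Z_hat[idx], alpha_hat[idx], rho_hat, rng))
        )
    return {k: float(np.mean(v)) for k, v in results.items()}

rng = np.random.default_rng(1)
n, r = 300, 2
Z_hat = rng.normal(scale=0.5, size=(n, r))     # placeholder for fitted embeddings
alpha_hat = rng.normal(scale=0.5, size=n)      # placeholder for fitted degree parameters
rho_hat = -0.4 * np.log(n)                     # sparsity level of the form used in the paper's simulations
summary = ablation(Z_hat, alpha_hat, rho_hat, n_rep=5, rng=rng)
print(summary)
```

If the two arms produce similar summaries against the observed network, the gains would be attributable mainly to the latent space model; a clear difference would point to the reconstruction step.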

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper applies standard latent space network models to obtain embeddings from observed networks, then constructs a separate distribution-free generator over those embeddings to sample new ones before feeding them back into the same model for synthetic networks. The claimed consistency results are stated as results on the distance between true and synthetic edge distributions, derived from the generative process rather than by re-expressing fitted quantities as predictions. No equations reduce a claimed prediction to a fitted parameter by definition, and the central theoretical guarantee does not rely on self-citations or imported uniqueness theorems from the same authors. Empirical preservation of moments and degree distributions is presented as an experimental outcome, not a definitional consequence of the embedding step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that latent space models capture essential network structure; no free parameters or invented entities are explicitly introduced in the abstract description.

axioms (1)

domain assumption: Latent space network models applied to an observed network produce low-dimensional embeddings that encode the network's characteristic structure including sparsity and degree heterogeneity.
This premise is required for the reconstruction step to produce useful synthetic networks.

pith-pipeline@v0.9.1-grok · 5809 in / 1391 out tokens · 24755 ms · 2026-06-28T16:48:09.179579+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 2 canonical work pages

[1]

Adamic, L. A. and Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pp. 36–43, 2004.

[2]

Erdős, P. and Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci., 5:17–61.

work page doi:10.37236/702

[3]

Haefeli, K. K., Martinkus, K., Perraudin, N., and Wattenhofer, R. Diffusion models for graphs benefit from discrete state spaces. In NeurIPS 2022 Workshop on New Frontiers in Graph Learning, 2022.

work page doi:10.1021/acscentsci.7b00572

[4]

Li, J., Wu, S., Cui, C., Xu, G., and Zhu, J. Statistical inference on latent space models for network data. arXiv preprint arXiv:2312.06605v3.

[5]

Schmidt, R. M. Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv preprint arXiv:1912.05911.

[6]

Wu, S., Yang, J., Xu, G., and Zhu, J. Denoising diffused embeddings: a generative approach for hypergraphs. arXiv preprint arXiv:2501.01541.

[7]

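$\log(R/\gamma_n) + \log(2/\delta)/\sqrt{2n}$ yields the desired result. A.5. Proof of Theorem 3.4. Define the event $E_n = \{ n^{-1} \sum_{i \le n} \|\hat\phi_i - T\phi_i^*\|_2^2 \le (\gamma_n')^2 \}$, where $\gamma_n' = \Omega((w_n n)^{-1/2+\epsilon/2})$ for some fixed $\epsilon > 0$. Then we have that $P(E_n) \to 1$ as $n \to \infty$, as shown in Lemma A.4. On the other hand, we have that $\max_{\phi \in G_{\gamma_n}} |\check q_{\gamma_n} - \hat q_{\gamma_n}| \le \frac{1}{n} \sum_{i=1}^{n} 1\{\mathrm{proj}_{G_{\gamma_n}}(\hat\phi_i) \ne$ ...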

2011
[8]

assumes that each node $i$ has a latent position $z_i \in \mathbb{R}^r$ such that $z_i^\top z_j \in [0,1]$ for all $i, j$. With the embedding matrix $Z = (z_1, \dots, z_n)^\top \in \mathbb{R}^{n \times r}$, RDPG assumes that $A \sim \mathrm{Bernoulli}(ZZ^\top)$, which is exactly a latent space model with the linear link function $p(\cdot \mid \pi) = \mathrm{Bernoulli}(\pi)$. Besides, many block-structured graph models also fall into the scope of latent...

1983
[9]

as the initialization of $Z$ and $\alpha$. The details of this initialization algorithm can be found in Ma et al. (2020). C.2. Dataset Details. Simulated datasets. In the simulated datasets evaluation, we consider $(n, r) \in \{500, 1000, 1500\} \times \{2, 3, 4\}$. For each $(n, r)$ pair and each replicate $t = 1, \dots, 200$, we generate an undirected sparse simple graph $A \in \{0,1\}^{n \times n}$ as follo...

2020
[10]

$= 1/2$ for $i = 1, \dots, n$. Finally, we set $z_i' = \tilde z_i + v^{(L_i)}$ and $z_i = z_i' \cdot \big(n^{-1} \|\sum_i z_i' z_i'^{\top}\|_F\big)^{-1/2}$. Given the latent positions and the degree parameters, we generate the network edges. We set the sparsity parameter $\rho_n^* = -0.4 \log n$. For each pair of nodes $1 \le i < j \le n$, we calculate $p_{ij} = \sigma(\alpha_i + \alpha_j + z_i^\top z_j + \rho_n^*)$. Then we independently sample $A_{ij} = A$...

1976
[11]

and iterative local expansion (ILE) (Bergmeister et al., 2024). Tables 20 and 21 summarize the auxiliary runs used for rebuttal positioning. These results are not used for model selection in the main tables; they are included to clarify how SyNGLER compares with recent scalable baselines when measurements are available. Table 20. Auxiliary scalable-baselin...

2024
[12]

In practice, we use the sigmoid function as the link function when modeling binary networks, so that $\mathrm{Bernoulli}(g(\cdot))$ reduces to a standard logistic formulation for edge probabilities. Algorithm 3: Synthetic Network Generation via Latent Embedding Reconstruction for Attributed Network. 1: Input: adjacency matrix $A \in \{0,1\}^{n \times n}$, attribute matrix $Y \in \mathbb{R}^{n \times p}$. 2: Fit the lat...

arXiv 2024
[13]

Table 38. ML utility evaluation of SyNG-D, SyNG-R, EDGE, GRAN, and GraphMaker across four datasets. Entries marked with “–” indicate OOM issues.

| Method | Config | DBLP | PolBlogs | YouTube | Yelp |
| SyNG-D | 2 | 1.00±0.00 | 0.98±0.01 | 0.94±0.02 | 0.98±0.00 |
| | 3 | 1.00±0.00 | 0.98±0.01 | 0.98±0.01 | 0.99±0... |

2008
