Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models

Abdelhakim Ziani; Andras Horvath; Paolo Ballarini

arXiv: 2605.18931 · v2 · pith:MPG2JXIMnew · submitted 2026-05-18 · stat.ML · cs.AI · cs.LG

Pith reviewed 2026-05-20 08:29 UTC · model grok-4.3

keywords: heavy-tailed distributions · phase-type distributions · variational autoencoders · Markov chains · Lipschitz continuity · generative models · tail approximation · Pareto distributions

The pith

Replacing Gaussian decoders with Phase-Type Markov chain distributions allows Lipschitz VAEs to generate heavy-tailed outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard VAEs with Gaussian decoder likelihoods and Lipschitz-constrained networks cannot produce heavy-tailed data, as the exponential Gaussian tail decay cannot be overcome by bounded amplification from the latent space. The paper replaces only the decoder likelihood with a Phase-Type distribution represented by a continuous-time Markov chain, which can approximate any positive distribution to arbitrary accuracy while preserving the encoder, latent space, and training procedure. Controlled experiments on synthetic Pareto data with tail indices from 2 to 30 and dimensions up to 10 show the change reduces tail Kolmogorov-Smirnov distance by factors of up to 6 and extreme quantile error by factors of up to 10. This establishes a practical route to heavy-tail generation in otherwise standard generative models used for performance evaluation, network traffic, and risk modeling.
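
The tail argument can be sketched with a standard Gaussian concentration bound. This is an illustration under stated assumptions, not the paper's own theorem: the latent prior is taken to be $z \sim \mathcal{N}(0, I_k)$, the decoder mean $\mu$ is assumed to be $L$-Lipschitz, and the symbols $L$, $k$, $m$, and $x_m$ are introduced here only for the example.

```latex
% Assumptions (ours, for illustration): z ~ N(0, I_k); the decoder mean \mu is L-Lipschitz.
\begin{align*}
  &\text{The map } z \mapsto \lVert \mu(z) \rVert \text{ is } L\text{-Lipschitz. With } m = \mathbb{E}\,\lVert \mu(z) \rVert: \\
  &\qquad \Pr\bigl(\lVert \mu(z) \rVert \ge m + s\bigr) \le \exp\!\left(-\frac{s^{2}}{2L^{2}}\right), \qquad s \ge 0. \\[4pt]
  &\text{A Pareto variable with scale } x_m \text{ and tail index } \alpha \text{ satisfies} \\
  &\qquad \Pr(X > t) = \left(\frac{x_m}{t}\right)^{\alpha}, \qquad t \ge x_m. \\[4pt]
  &\text{Hence, for any finite } L \text{ and } \alpha: \\
  &\qquad \lim_{t \to \infty} \frac{\exp\!\bigl(-(t - m)^{2} / (2L^{2})\bigr)}{(x_m / t)^{\alpha}} = 0 .
\end{align*}
```

Under these assumptions the decoder mean has a sub-Gaussian tail, and adding Gaussian observation noise with a fixed variance keeps it sub-Gaussian. A larger Lipschitz constant stretches the scale but cannot change the decay from Gaussian-type to polynomial.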

Core claim

Heavy-tailed distributions pose a fundamental challenge for modern deep generative models because Gaussian tails decay exponentially and Lipschitz continuity prevents the decoder from amplifying rare latent events sufficiently. Replacing the Gaussian decoder with a Phase-Type distribution based on Markov chains, while keeping the encoder, latent space, and training identical, overcomes this structural limitation since Phase-Type distributions approximate any positive-valued distribution, including heavy-tailed families, to arbitrary precision. On synthetic Pareto data across tail indices alpha in {2, 3, 5, 30} and dimensions d in {1, 5, 10}, the Phase-Type decoder reduces tail Kolmogorov-Sm,

What carries the argument

Phase-Type distribution modeled by a continuous-time Markov chain that serves as the decoder likelihood for positive outputs.
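
For readers unfamiliar with the object, a PH distribution is the time to absorption of a finite continuous-time Markov chain, and its density and tail have closed forms in terms of a matrix exponential. The sketch below evaluates both using only the Python standard library. The names `ph_pdf`, `ph_ccdf`, `mat_exp`, and the Erlang-2 example parameters are illustrative choices, not taken from the paper, and the hand-rolled matrix exponential is adequate only for small, well-scaled matrices.

```python
import math

def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, terms=25):
    """Matrix exponential by scaling-and-squaring with a Taylor series."""
    n = len(a)
    norm = max(sum(abs(v) for v in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[v / 2.0 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def ph_ccdf(alpha, T, x):
    """Tail probability P(X > x) = alpha * exp(T x) * 1."""
    e = mat_exp([[v * x for v in row] for row in T])
    return sum(alpha[i] * sum(e[i]) for i in range(len(alpha)))

def ph_pdf(alpha, T, x):
    """Density f(x) = alpha * exp(T x) * t, where t = -T 1 holds the exit rates."""
    exit_rates = [-sum(row) for row in T]
    e = mat_exp([[v * x for v in row] for row in T])
    return sum(alpha[i] * sum(e[i][j] * exit_rates[j] for j in range(len(T)))
               for i in range(len(alpha)))

# Illustrative example: two phases in series, each with rate 3 (an Erlang-2).
alpha = [1.0, 0.0]
T = [[-3.0, 3.0],
     [0.0, -3.0]]
print(ph_pdf(alpha, T, 0.7), ph_ccdf(alpha, T, 0.7))
```

In a decoder, `alpha` and `T` would be produced by the network for each latent code and `ph_pdf` would supply the reconstruction likelihood. A production implementation would more likely rely on a library routine for the matrix exponential.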

If this is right

Generative models can now produce accurate samples from heavy-tailed distributions common in performance evaluation and risk modeling.
The same encoder and latent space can be reused for both light- and heavy-tailed data by swapping only the decoder distribution.
Training remains end-to-end differentiable without additional constraints or changes to the optimization procedure.
The approach directly addresses the structural mismatch between Lipschitz networks and heavy tails without sacrificing model capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoder substitution could be tested in other Lipschitz-constrained architectures such as Wasserstein GANs to check whether the heavy-tail limitation is decoder-specific.
Multivariate extensions of Phase-Type distributions might enable joint heavy-tail modeling in higher-dimensional settings where marginal approximations alone are insufficient.
Because the Markov chain representation is explicit, one could inspect the inferred chain parameters to diagnose which phases capture the heavy-tail behavior.

Load-bearing premise

Phase-Type distributions can approximate any positive-valued distribution including heavy-tailed families to arbitrary precision and can be integrated as the decoder likelihood while leaving the encoder, latent space, and training procedure unchanged.

What would settle it

Running the Phase-Type decoder model on real heavy-tailed datasets such as network traffic traces or financial returns and finding that tail Kolmogorov-Smirnov distance or extreme quantile error shows no reduction or an increase relative to the Gaussian baseline would falsify the practical effectiveness claim.

Figures

Figures reproduced from arXiv: 2605.18931 by Abdelhakim Ziani, Andras Horvath, Paolo Ballarini.

**Figure 1.** A degree 3 PH distribution. PH distributions form a dense family on (0, ∞): any positive-valued distribution can be approximated arbitrarily well by a PH distribution. Although PH distributions are asymptotically light-tailed (their tails ultimately decay exponentially due to the finite Markov chain), they can closely approximate heavy-tailed behavior over any bounded, data-relevant range by using mult…
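
One way to see how a finite chain can track a power-law tail over a bounded range is the hyperexponential, a PH distribution that enters one of several exponential phases at random. The sketch below relies on a standard identity: a Lomax (Pareto type II) tail is exactly a Gamma mixture of exponentials, so discretizing the Gamma mixing density on a log-spaced grid yields a hyperexponential approximation. The choice of Lomax rather than the paper's Pareto data, the 40 phases, and the grid bounds are assumptions made for illustration; this is not the paper's fitting procedure.

```python
import math

def hyperexp_from_gamma(shape, rate, n_phases=40, lo=1e-4, hi=1e2):
    """Discretize a Gamma(shape, rate) mixing density on a log-spaced grid.

    Returns (weights, rates) of a hyperexponential distribution: phase i is
    entered with probability weights[i] and left at rate rates[i].
    """
    step = (math.log(hi) - math.log(lo)) / (n_phases - 1)
    rates = [lo * math.exp(i * step) for i in range(n_phases)]
    # Gamma density times d(lambda) = lambda * d(log lambda); constant factors
    # cancel when the weights are normalized.
    raw = [lam ** shape * math.exp(-rate * lam) for lam in rates]
    total = sum(raw)
    return [r / total for r in raw], rates

def hyperexp_ccdf(weights, rates, x):
    """Tail probability of the hyperexponential mixture."""
    return sum(w * math.exp(-lam * x) for w, lam in zip(weights, rates))

def lomax_ccdf(shape, scale, x):
    """Tail probability of a Lomax distribution: (1 + x / scale) ** (-shape)."""
    return (1.0 + x / scale) ** (-shape)

# Compare the power-law tail, the multi-phase mixture, and a single
# exponential with the same mean (equal to 1 for shape 2, scale 1).
weights, rates = hyperexp_from_gamma(shape=2.0, rate=1.0)
for x in (1.0, 10.0, 100.0):
    print(x, lomax_ccdf(2.0, 1.0, x), hyperexp_ccdf(weights, rates, x), math.exp(-x))
```

Far beyond the reciprocal of the slowest rate on the grid, the mixture decays exponentially again, which is the asymptotic light-tailedness the caption mentions.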

**Figure 2.** Series canonical form Phase-Type Distribution

**Figure 3.** Log-log CCDF of true Pareto data, Gaussian VAE generations, and PH

**Figure 4.** Tail KS distance at the 99th percentile for Gaussian VAE and PH-VAE

Original abstract

Heavy-tailed distributions are prevalent in performance evaluation, network traffic, and risk modeling. This behavior poses a fundamental challenge for modern deep generative models. Standard Variational Autoencoders (VAEs) employ Gaussian decoder likelihoods and Lipschitz-constrained neural networks, a combination that is structurally incapable of producing heavy-tailed outputs: the Gaussian tail decays exponentially, and Lipschitz continuity prevents the decoder from amplifying rare events from the latent space input to sufficiently overcome this decay. We provide both a theoretical characterization of this limitation and a controlled empirical demonstration using synthetic Pareto data across a grid of tail indices $\alpha \in \{2, 3, 5, 30\}$ and dimensions $d \in \{1, 5, 10\}$. As a solution, we replace the Gaussian decoder with a Phase-Type (PH) distribution based on Markov chains, while keeping the encoder, latent space, and training procedure identical. PH distributions allow for arbitrarily precise approximations of any positive-valued distributions, including heavy-tailed families. Experiments showed that the PH-based model reduces tail Kolmogorov-Smirnov distance by up to ×6 and extreme quantile error by up to ×10 compared to the Gaussian baseline for heavy-tailed data. These results demonstrate that integrating Markov chain-based distributions into the decoder of a generative model institutes a principled and practically effective solution to the heavy-tail generation problem.
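
The two headline metrics can be computed along the following lines. The paper's exact definitions are not reproduced on this page, so the sketch makes assumptions: the tail KS distance is taken to be the two-sample KS distance between exceedances above the real data's 99th percentile (suggested by the Figure 4 caption), and the extreme quantile error is taken to be the relative error at the 0.999 quantile. The function names and default levels are hypothetical.

```python
import math
from bisect import bisect_right

def empirical_quantile(sample, p):
    """Smallest order statistic whose empirical CDF is at least p."""
    s = sorted(sample)
    return s[min(len(s) - 1, max(0, math.ceil(p * len(s)) - 1))]

def ks_two_sample(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
               for v in a + b)

def tail_ks(real, generated, q=0.99):
    """KS distance between exceedances above the real data's q-quantile.

    Assumed convention: if either sample has no exceedances, the tails are
    treated as maximally different and 1.0 is returned.
    """
    threshold = empirical_quantile(real, q)
    real_tail = [v for v in real if v > threshold]
    gen_tail = [v for v in generated if v > threshold]
    if not real_tail or not gen_tail:
        return 1.0
    return ks_two_sample(real_tail, gen_tail)

def extreme_quantile_error(real, generated, p=0.999):
    """Relative error of the generated p-quantile against the real one."""
    target = empirical_quantile(real, p)
    return abs(empirical_quantile(generated, p) - target) / abs(target)
```

A generator whose samples never reach the real data's 99th percentile scores 1.0 under this convention, which is the failure mode the abstract attributes to the Gaussian baseline.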

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper flags why Gaussian decoders plus Lipschitz nets can't hit heavy tails and tests Phase-Type Markov chain decoders as a direct swap-in, with theory and synthetic Pareto results.

The core point is that standard VAEs are structurally limited on heavy tails because Gaussian tails decay exponentially and Lipschitz continuity caps how much the decoder can amplify rare latent draws. The authors lay this out cleanly and then swap the decoder for a Phase-Type distribution from a continuous-time Markov chain while leaving the encoder, latent space, and training loop unchanged. PH distributions can approximate any positive distribution to arbitrary accuracy, so the idea is to let the model produce the right tail shape without fighting the Gaussian decay. On synthetic Pareto data across several tail indices and dimensions they report clear gains: up to 6x lower tail KS distance and 10x lower extreme quantile error versus the Gaussian baseline. That comparison is straightforward and the numbers are large enough to notice. The work earns credit for keeping the rest of the architecture fixed so the decoder change is isolated.

The soft spot is the claim that everything else stays identical. Outputting a valid PH parameterization requires the decoder head to produce a subgenerator matrix with negative diagonals, non-negative off-diagonals, and row sums that are non-positive, plus a valid initial probability vector. Standard networks do not enforce these constraints automatically, so some combination of activations, reshaping, or projection is almost certainly needed. That changes the decoder architecture and gradient path in practice, even if the high-level training script looks the same.

The experiments stay on synthetic data only, which is reasonable for a controlled study but leaves the practical payoff on real network or risk data untested. Readers working on generative models for heavy-tailed simulation tasks will find the targeted fix useful. The paper shows clear thinking on a concrete limitation and supplies enough evidence to justify sending it to referees, who will want details on the exact output parameterization and perhaps a real-data check.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that standard VAEs with Gaussian decoders and Lipschitz-constrained networks cannot generate heavy-tailed outputs, as Gaussian tails decay exponentially and Lipschitz continuity bounds the amplification of rare latent events. It provides a theoretical characterization of this limitation and proposes replacing the Gaussian decoder with a Phase-Type (PH) distribution parameterized via continuous-time Markov chains, while keeping the encoder, latent space, and training procedure identical. PH distributions are argued to approximate any positive-valued distribution arbitrarily well. Controlled experiments on synthetic Pareto data across tail indices α ∈ {2, 3, 5, 30} and dimensions d ∈ {1, 5, 10} report that the PH-based model reduces tail Kolmogorov-Smirnov distance by up to a factor of 6 and extreme quantile error by up to a factor of 10 relative to the Gaussian baseline.

Significance. If the central claims hold, the work offers a principled approach to heavy-tailed generation in deep models, relevant to performance evaluation, network traffic, and risk modeling. Strengths include the controlled synthetic experimental grid, direct baseline comparison, and the flexibility of PH approximations for positive support. The theoretical characterization of the Lipschitz-Gaussian limitation is a useful contribution if rigorously derived.

major comments (2)

[§3 (Decoder Architecture)] The assertion that the encoder, latent space, and training procedure remain identical is undermined by the algebraic constraints required for a valid continuous PH distribution. The decoder must output a subgenerator matrix T (negative diagonals, non-negative off-diagonals, non-positive row sums) and initial vector α (non-negative, sums to 1). This necessitates specialized activations (e.g., softplus on rates), output reshaping, or projection steps absent from standard Gaussian decoders, which only require unconstrained mean and positive variance. These changes alter the decoder head, gradient computation, and numerical stability (via matrix exponential), weakening the 'drop-in replacement' claim.
[Experiments section] The quantitative claims of up to ×6 reduction in tail KS distance and ×10 in extreme quantile error lack supporting details on the number of phases used for the PH approximation, variance across random seeds, or statistical significance tests. Without these, it is difficult to evaluate robustness, especially for the heaviest tails (α=2) where approximation quality depends critically on phase count and parameterization.

minor comments (2)

[Abstract] The 'up to ×6' and 'up to ×10' improvements should specify the exact (α, d) pair at which the maxima occur, rather than leaving the range implicit.
[Notation] Define explicitly how the neural network outputs the Markov chain parameters (rates, initial probabilities) and any normalization or constraint enforcement applied during forward passes.
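
The constraint enforcement the referee asks about could look like the sketch below, which maps unconstrained network outputs to a valid initial vector and subgenerator. It is one possible construction, not the paper's: the dense layout of `T`, the separate exit-rate outputs, and the names `ph_head`, `raw_alpha`, `raw_rates`, and `raw_exit` are all assumptions made for illustration.

```python
import math

def softplus(v):
    """Numerically stable log(1 + exp(v)); non-negative in floating point."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def softmax(values):
    """Map logits to non-negative weights that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def ph_head(raw_alpha, raw_rates, raw_exit):
    """Map unconstrained outputs to a valid (alpha, T) pair.

    raw_alpha: n logits for the initial distribution.
    raw_rates: n x n raw values; only the off-diagonal entries are used.
    raw_exit:  n raw values for the absorption (exit) rates.
    """
    n = len(raw_alpha)
    alpha = softmax(raw_alpha)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                T[i][j] = softplus(raw_rates[i][j])
        # Diagonal = -(total outflow), so row i sums to -exit_rate <= 0.
        T[i][i] = -(sum(T[i]) + softplus(raw_exit[i]))
    return alpha, T
```

Deriving the diagonal from the other rates, rather than predicting it directly, guarantees the row-sum condition by construction. A structured form such as the series canonical form shown in Figure 2 would need far fewer outputs than a dense matrix; which form the paper's decoder emits is not stated on this page.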

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, outlining planned changes to improve clarity and completeness while preserving the core contributions of the work.

Referee: [§3 (Decoder Architecture)] The assertion that the encoder, latent space, and training procedure remain identical is undermined by the algebraic constraints required for a valid continuous PH distribution. The decoder must output a subgenerator matrix T (negative diagonals, non-negative off-diagonals, non-positive row sums) and initial vector α (non-negative, sums to 1). This necessitates specialized activations (e.g., softplus on rates), output reshaping, or projection steps absent from standard Gaussian decoders, which only require unconstrained mean and positive variance. These changes alter the decoder head, gradient computation, and numerical stability (via matrix exponential), weakening the 'drop-in replacement' claim.

Authors: We appreciate the referee's careful reading of the implementation requirements. The manuscript's statement that the encoder, latent space, and training procedure remain identical is accurate in the sense that these components are unchanged from the Gaussian baseline; only the decoder likelihood is replaced. However, we agree that the PH decoder requires specific output constraints and activations to produce a valid subgenerator matrix T and probability vector α. In the revised manuscript we will expand §3 with an explicit description of the decoder head, including the use of softplus on the rate magnitudes behind the diagonal and off-diagonal entries of T, with the diagonal negated (to enforce sign constraints), and softmax on α (to ensure non-negativity and summation to one). We will also briefly discuss the matrix-exponential computation and its effect on gradient propagation. These additions clarify the localized nature of the decoder changes without altering the experimental protocol or the central claim.
revision: partial

Referee: [Experiments section] The quantitative claims of up to ×6 reduction in tail KS distance and ×10 in extreme quantile error lack supporting details on the number of phases used for the PH approximation, variance across random seeds, or statistical significance tests. Without these, it is difficult to evaluate robustness, especially for the heaviest tails (α=2) where approximation quality depends critically on phase count and parameterization.

Authors: We concur that additional experimental details are required for reproducibility and to demonstrate robustness. In the revised version we will state that a fixed number of 10 phases was employed for all PH approximations. We will augment the reported metrics with means and standard deviations computed over five independent random seeds and will include paired t-test p-values comparing the PH and Gaussian models on the tail KS and extreme-quantile errors. These updates will appear in the Experiments section and in revised tables and figures, directly addressing concerns about the heaviest tails (α=2).
revision: yes
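
The per-seed comparison described in this response could be summarized as in the sketch below, which computes the mean and standard deviation of the paired differences and the paired t statistic. The function name `paired_summary` and its argument names are hypothetical, and no values from the paper are used.

```python
import math
import statistics

def paired_summary(baseline_errors, candidate_errors):
    """Mean, standard deviation, and paired t statistic of per-seed differences.

    Each list holds one error value per random seed, in the same seed order.
    A positive mean difference means the candidate has the lower error.
    """
    if len(baseline_errors) != len(candidate_errors):
        raise ValueError("need one value per seed for both models")
    diffs = [b - c for b, c in zip(baseline_errors, candidate_errors)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat
```

The p-value follows from a Student t distribution with n - 1 degrees of freedom (4 for five seeds); a statistics library such as SciPy's `scipy.stats.ttest_rel` returns it directly. With only five seeds the test has limited power, so reporting the per-seed values alongside it would help.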

Circularity Check

0 steps flagged

No circularity: modeling substitution and empirical comparison are independent of fitted inputs.

The paper advances a modeling substitution (Gaussian decoder to Phase-Type/Markov-chain decoder) plus a theoretical characterization of Lipschitz+Gaussian limitations, followed by controlled experiments on synthetic Pareto data. No derivation step reduces a claimed prediction or result to a fitted parameter or self-citation by construction. The 'identical training procedure' statement is an empirical claim about implementation, not a definitional equivalence. External benchmarks (KS distance, quantile error) are measured against a separate baseline, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the known approximation power of Phase-Type distributions and the structural analysis of Gaussian-Lipschitz limitations; no new free parameters or invented entities are introduced beyond standard VAE components.

free parameters (1)

Phase-Type distribution parameters
Parameters of the Markov chain (number of phases, transition rates) are fitted as part of decoder training.

axioms (1)

domain assumption: Phase-Type distributions can approximate any positive-valued distribution to arbitrary precision
Invoked to justify that the decoder can represent heavy-tailed families.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"replace the Gaussian decoder with a Phase-Type (PH) distribution based on Markov chains... PH distributions allow for arbitrarily precise approximations of any positive-valued distributions, including heavy-tailed families"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
"Lipschitz continuity prevents the decoder from amplifying rare events... tail collapse"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.