Tail Annealing for Heavy-Tailed Flow Matching

Jean Pachebat

arxiv: 2605.20068 · v1 · pith:X5KY6MGMnew · submitted 2026-05-19 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-20 03:41 UTC · model grok-4.3

keywords: heavy-tailed data · flow matching · generative models · tail transformation · soft-log transform · extreme quantiles · multivariate generation · Pareto tails

The pith

A coordinate-wise soft-log transform lets standard flow matching generate power-law tails from Gaussian noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to resolve the mismatch between standard flow matching models, which start from Gaussian noise and struggle to reach heavy-tailed distributions, and real data that often exhibits power-law tails. The core proposal applies the soft-log transform to the data before training so that heavy tails become exponential-like, then inverts the transform on the generated samples. A simple Hill plot diagnostic selects the transform per coordinate, leaving light-tailed variables untouched. On a benchmark spanning three copulas, dimensions up to 100, and four tail indices, the paper reports better scores than specialized baselines on Wasserstein distance, conditional value-at-risk, and extreme-quantile metrics. The approach requires no heavy-tailed base measures or architectural changes.
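A per-coordinate Hill check of the kind described above might look like the following minimal sketch. The function names, the number of order statistics `k`, and the cut-off `alpha_max` are illustrative assumptions; this page does not state the paper's actual gating rule.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimate of the tail index alpha from the k largest values of |x|."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # descending order statistics
    if not 0 < k < a.size:
        raise ValueError("k must satisfy 0 < k < len(x)")
    top, threshold = a[:k], a[k]
    gamma = np.mean(np.log(top / threshold))  # Hill estimate of 1/alpha
    return 1.0 / gamma

def heavy_tail_mask(data, k=200, alpha_max=5.0):
    """Flag columns whose estimated tail index falls below alpha_max.

    k and alpha_max are hypothetical defaults, not values taken from the paper.
    """
    return np.array([hill_tail_index(col, k) < alpha_max for col in data.T])

rng = np.random.default_rng(0)
n = 20_000
heavy = rng.pareto(2.0, size=n)  # Lomax margin with tail index 2
light = rng.normal(size=n)       # Gaussian margin
mask = heavy_tail_mask(np.column_stack([heavy, light]))
print(mask)  # one boolean per coordinate
```

A fixed cut-off is only one possible rule; a Hill plot inspects the estimate over a range of `k` before deciding.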

Core claim

The soft-log transform maps Pareto tails to exponentials while the induced flow dynamics implement tail annealing via power transformations, allowing unmodified flow matching to generate accurate heavy-tailed multivariate samples.
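The first half of this claim can be checked with a short calculation. The following is a sketch for a nonnegative coordinate with a regularly varying tail; the slowly varying function L and the threshold z are notation introduced here for illustration.

```latex
\begin{align*}
P(X > x) &= L(x)\, x^{-\alpha}, \qquad \alpha > 0, \\
P(\phi(X) > z) &= P\big(X > e^{z} - 1\big) = L(e^{z} - 1)\,(e^{z} - 1)^{-\alpha}, \\
-\tfrac{1}{z} \log P(\phi(X) > z) &\longrightarrow \alpha \quad \text{as } z \to \infty .
\end{align*}
```

The transformed coordinate therefore has an exponential-type tail with rate α: the heavier the original tail (smaller α), the slower the exponential decay.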

What carries the argument

The soft-log transform φ(x) = sign(x) · log(1 + |x|) applied coordinate-wise after a Hill diagnostic decides which margins require compression.
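The transform and its inverse are short enough to write out. This is a minimal NumPy sketch, assuming the inverse φ⁻¹(y) = sign(y) · (e^{|y|} − 1) and a boolean mask marking the coordinates to transform; the helper name `transform_columns` is hypothetical.

```python
import numpy as np

def soft_log(x):
    """phi(x) = sign(x) * log(1 + |x|), applied elementwise."""
    return np.sign(x) * np.log1p(np.abs(x))

def soft_log_inverse(y):
    """phi^{-1}(y) = sign(y) * (exp(|y|) - 1), applied elementwise."""
    return np.sign(y) * np.expm1(np.abs(y))

def transform_columns(data, mask, fn):
    """Apply fn only to the columns selected by the boolean mask."""
    out = np.array(data, dtype=float, copy=True)
    out[:, mask] = fn(out[:, mask])
    return out

x = np.array([[-1e6, 0.5], [0.0, -2.0], [3.0, 1e3]])
mask = np.array([True, False])  # transform only the first coordinate

z = transform_columns(x, mask, soft_log)               # train the flow on z
x_back = transform_columns(z, mask, soft_log_inverse)  # map samples back
print(z)
print(np.allclose(x_back, x))
```

Using `log1p` and `expm1` keeps the round trip accurate near zero, where `log(1 + |x|)` would lose precision.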

If this is right

Standard flow matching becomes competitive with specialized heavy-tailed methods on W1, CVaR99, and extreme-quantile metrics without added model complexity.
The method records zero severe divergences across 2,880 runs on 144 configurations that vary copula type, dimension, and tail index.
Per-coordinate selection preserves light-tailed margins while handling heavy tails, avoiding global compromises in mixed-tail data.
No heavy-tailed base distribution or Lipschitz relaxation is needed once the tails have been annealed by the transform.
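For readers unfamiliar with the metrics named above, here is a minimal sketch of empirical CVaR at the 99% level and the one-dimensional W1 distance between equal-size samples. This page does not say which scalar quantity the paper evaluates these on, so the functions below are generic illustrations with hypothetical names.

```python
import numpy as np

def cvar(sample, level=0.99):
    """Empirical CVaR: mean of the values at or above the level-quantile."""
    s = np.asarray(sample, dtype=float)
    var = np.quantile(s, level)  # value-at-risk at the given level
    return s[s >= var].mean()

def w1_1d(a, b):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("samples must have the same size")
    return np.mean(np.abs(a - b))

rng = np.random.default_rng(0)
real = rng.pareto(2.0, size=10_000)
fake = rng.pareto(2.0, size=10_000)
print(cvar(real), cvar(fake), w1_1d(real, fake))
```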

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preprocessing could be tested on diffusion or score-based models that share the Gaussian-to-data interpolation challenge.
The tail-annealing interpretation may suggest designing flow schedules whose speed varies with tail weight rather than applying a fixed transform.
In high-dimensional settings the coordinate-wise choice could be automated further by coupling the Hill diagnostic with the learned flow itself.

Load-bearing premise

The soft-log transform together with the resulting flow dynamics reliably converts Pareto tails into exponentials without distorting dependence structure enough to degrade sample quality or extreme-value calibration.

What would settle it

If samples generated after the inverse transform systematically fail to recover the input tail indices or exhibit mismatched joint tail dependence on the same benchmark data, the central claim would be refuted.

Original abstract

Standard generative models struggle with heavy-tailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ coordinate-wise to data before training, then exponentiate samples after generation. A Hill diagnostic decides per-coordinate whether to transform, leaving light-tailed margins untouched at no added complexity. This compresses heavy tails into a range where standard flow matching succeeds, without heavy-tailed base distributions or architectural modifications. We provide theoretical intuition for why this works: the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations. On a 144-configuration multivariate benchmark (3 copulas, $d$ up to 100, 4 tail indices), Log-FM dominates specialized baselines on $W_1$, CVaR$_{99}$, and extreme-quantile metrics, and is the only method with zero severe divergences across 2{,}880 runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a coordinate-wise soft-log transform plus Hill selection that lets standard flow matching handle heavy tails without architecture changes, backed by a large benchmark but with an open question on joint tail dependence.

The one thing your colleague should know is that this gives a lightweight preprocessing fix for heavy-tailed flow matching: apply the soft-log transform coordinate-wise where the Hill diagnostic flags heavy tails, train normally, then invert the transform on the outputs. The paper argues that the transform maps Pareto tails to exponential ones and reports strong results on extremes without heavy-tailed base distributions or Lipschitz relaxation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Log-FM, which applies the coordinate-wise soft-log transform φ(x) = sign(x) · log(1 + |x|) to heavy-tailed data before flow matching (with a Hill diagnostic selecting which coordinates to transform) and exponentiates generated samples afterward. It supplies theoretical intuition that the transform maps Pareto tails to exponentials and induces tail annealing via power transformations in the flow dynamics. On a 144-configuration benchmark (3 copulas, d ≤ 100, 4 tail indices), Log-FM outperforms specialized baselines on W1, CVaR99, and extreme-quantile metrics while recording zero severe divergences across 2,880 runs.

Significance. If the central performance claims hold without hidden dependence distortion, the method offers a low-complexity preprocessing route to heavy-tailed generation that avoids architectural changes or heavy-tailed base measures. The scale of the benchmark (covering high dimensions and multiple dependence structures) is a clear strength and supports the claim of robustness.

major comments (2)

[§3] §3 (Theoretical Intuition): the statement that the induced dynamics implement 'tail annealing via power transformations' is presented as intuition without a derivation or explicit mapping from the vector field in φ-space back to the original heavy-tailed measure; this is load-bearing because the central justification for applying a marginal transform to joint data rests on it.
[§5] §5 (Experiments, benchmark tables): aggregate W1/CVaR99/extreme-quantile results are reported, but no direct verification of preserved tail-dependence functions (e.g., χ(u) or extremal coefficients) appears for the three copulas at d=100; given the marginal application of φ, this check is required to confirm that reported gains reflect accurate joint extremes rather than marginal compression alone.

minor comments (2)

[Methods] The precise rule for the Hill threshold (including any default value or sensitivity analysis) should be stated explicitly in the methods section for reproducibility.
[Figures] Figure captions for the multivariate results should clarify whether the reported metrics are averaged over all 2,880 runs or per-configuration, and whether error bars reflect variability across random seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the benchmark's scale as a strength. We address each major comment below with clarifications and proposed revisions.

Referee: [§3] §3 (Theoretical Intuition): the statement that the induced dynamics implement 'tail annealing via power transformations' is presented as intuition without a derivation or explicit mapping from the vector field in φ-space back to the original heavy-tailed measure; this is load-bearing because the central justification for applying a marginal transform to joint data rests on it.

Authors: We agree that the discussion in §3 is framed as intuition. The soft-log transform maps Pareto tails to exponentials, and pulling back the learned vector field through the inverse transform yields dynamics whose effective scaling corresponds to power transformations that anneal the tails. To make this explicit, we will expand §3 in the revision with a derivation showing the composition of the flow ODE with φ^{-1} and the resulting tail index evolution under the Jacobian. revision: yes
Referee: [§5] §5 (Experiments, benchmark tables): aggregate W1/CVaR99/extreme-quantile results are reported, but no direct verification of preserved tail-dependence functions (e.g., χ(u) or extremal coefficients) appears for the three copulas at d=100; given the marginal application of φ, this check is required to confirm that reported gains reflect accurate joint extremes rather than marginal compression alone.

Authors: We acknowledge that direct verification of tail dependence would further substantiate that gains arise from joint modeling rather than marginal effects alone. Although the strictly monotone marginal transforms preserve the underlying copula, we will add in the revision estimated χ(u) curves and extremal coefficient values for the d=100 cases across all three copulas to confirm that extreme dependence structures are retained. revision: yes
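The copula-preservation argument can be illustrated numerically: a rank-based estimate of tail dependence is unchanged by any strictly increasing marginal transform. The sketch below uses the conditional-probability form χ(u) = P(U₂ > u | U₁ > u) on pseudo-observations; the toy data, the level u = 0.95, and the function names are assumptions made for illustration, not the paper's protocol.

```python
import numpy as np

def pseudo_obs(x):
    """Ranks scaled into (0, 1); invariant under strictly increasing maps."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable") + 1
    return ranks / (x.size + 1)

def chi_hat(x1, x2, u=0.95):
    """Empirical chi(u) = P(U2 > u | U1 > u) computed from ranks."""
    u1, u2 = pseudo_obs(x1), pseudo_obs(x2)
    return np.mean((u1 > u) & (u2 > u)) / np.mean(u1 > u)

def soft_log(x):
    return np.sign(x) * np.log1p(np.abs(x))

rng = np.random.default_rng(1)
n = 50_000
shared = rng.pareto(2.0, size=n)  # common heavy-tailed factor
x1 = shared + rng.pareto(2.0, size=n)
x2 = shared + rng.pareto(2.0, size=n)

before = chi_hat(x1, x2)
after = chi_hat(soft_log(x1), soft_log(x2))
print(before, after)  # rank-based, so the two estimates coincide
```

This only shows that the preprocessing leaves the target dependence structure intact. Whether the trained flow reproduces it is the empirical question the referee raises.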

Circularity Check

0 steps flagged

No significant circularity; transform justified by standard tail-mapping property

The paper's core proposal is the coordinate-wise soft-log transform φ(x) = sign(x)·log(1+|x|) applied before flow matching, with a Hill diagnostic for per-coordinate choice. This is presented as a preprocessing step whose justification rests on the elementary fact that the log maps Pareto tails to exponential tails (a direct consequence of the survival function transformation, not fitted from data or self-citations). The subsequent claim that the induced dynamics implement 'tail annealing via power transformations' is offered as intuition rather than a closed derivation that reduces to the method's own outputs. Empirical dominance on the 144-configuration benchmark is reported as validation, not as a 'prediction' that is forced by construction from fitted parameters. No load-bearing self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the derivation chain remains self-contained against external mathematical properties of monotone transforms and standard flow matching.
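The tail-mapping property is easy to check numerically in one special case. NumPy's `Generator.pareto` draws from the Lomax (Pareto II) law with P(X > x) = (1 + x)^(−α), for which log(1 + X) is exactly exponential with rate α. The sample size, seed, and test point `t` below are arbitrary choices for this sketch.

```python
import numpy as np

alpha = 2.0
rng = np.random.default_rng(0)

x = rng.pareto(alpha, size=200_000)  # Lomax: P(X > x) = (1 + x)^(-alpha), x >= 0
z = np.log1p(x)                      # soft-log restricted to nonnegative data

# If z ~ Exponential(rate=alpha), then E[z] = 1/alpha and P(z > t) = exp(-alpha * t).
t = 2.0
empirical_survival = np.mean(z > t)
print(z.mean(), 1.0 / alpha)
print(empirical_survival, np.exp(-alpha * t))
```

For general regularly varying tails the match is asymptotic rather than exact.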

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven mapping from Pareto to exponential tails under the flow dynamics and on the assumption that the Hill diagnostic selects the transform without introducing selection bias. No free parameters or invented entities are explicitly named in the abstract.

axioms (1)

domain assumption: The soft-log transform maps Pareto tails to exponentials and the induced flow implements tail annealing via power transformations.
Stated as theoretical intuition in the abstract; no derivation or reference is provided.
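One way to make the tail-annealing intuition concrete, offered as a sketch rather than the paper's own derivation: assume a Gaussian interpolant applied in transformed space and look at the noise-free part of a nonnegative coordinate. Here a_t stands for the schedule coefficient α_t, renamed to avoid a clash with the tail index α, and a_t is taken in (0, 1].

```latex
\begin{align*}
Y_t &= a_t\, Y_0 + \beta_t\, \varepsilon, \qquad Y_0 = \phi(X_0), \quad \varepsilon \sim \mathcal{N}(0, I_d), \\
\phi^{-1}(a_t Y_0) &= e^{a_t \log(1 + X_0)} - 1 = (1 + X_0)^{a_t} - 1 \qquad (X_0 \ge 0), \\
P\big((1 + X_0)^{a_t} - 1 > x\big) &= P\big(X_0 > (1 + x)^{1/a_t} - 1\big) \sim x^{-\alpha / a_t}
\quad \text{if } P(X_0 > x) \sim x^{-\alpha}.
\end{align*}
```

Under these assumptions the signal part at time t is a power transform of the data with tail index α / a_t, which grows from α at a_t = 1 toward infinity as a_t → 0. The exponentiated Gaussian noise term contributes a log-normal factor that this sketch ignores.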

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations X_0^{α_t}"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "ϕ(x) = sign(x)·log(1+|x|); ϕ^{-1}(y) = sign(y)·(e^{|y|}-1)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Ho, J., Jain, A., and Abbeel, P

arXiv:2406.16971. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851,

work page arXiv
[2]

arXiv preprint arXiv:2410.14171 , year=

ISBN 9781316511732. Pandey, K., Pathak, J., Xu, Y., Mandt, S., Pritchard, M., Vahdat, A., and Mardani, M. Heavy-tailed diffusion models. arXiv preprint arXiv:2410.14171,

work page arXiv
[3]

Extreme Value Theory: Technical Details This appendix provides the formal definitions and results from extreme value theory used in the main text

A. Extreme Value Theory: Technical Details. This appendix provides the formal definitions and results from extreme value theory used in the main text. See (Resnick, 1987; de Haan & Ferreira,

work page 1987
[4]

for a comprehensive treatment. A.1. Heavy-Tailed Distributions. Definition A.1 (Heavy-Tailed (Nair et al., 2022)). A random variable $X$ is heavy-tailed if $\mathbb{E}[e^{\lambda X}] = \infty$ for all $\lambda > 0$. Definition A.2 (Regular Variation). A measurable function $L: (0,\infty) \to (0,\infty)$ is slowly varying at infinity if $L(tx)/L(t) \to 1$ as $t \to \infty$ for all $x > 0$. A distribution $F$ is regularly varying with index $-\alpha$ ...

work page 2022
[5]

$= \mathcal{N}(x_t; \alpha_t x_0, \beta_t^2 I_d)$. (5) Common schedules. Two standard choices are the variance-preserving (VP) schedule with $\alpha_t^2 + \beta_t^2 = 1$ (Ho et al., 2020), which ensures $\mathrm{Var}(X_t) = \mathrm{Var}(X$

work page 2020
[6]

Table 6 summarizes common schedule choices with their derivatives

when $X_0$ has unit variance, and the linear (flow matching) schedule with $(\alpha_t, \beta_t) = (1 - t, t)$ (Lipman et al., 2023), corresponding to straight-line interpolation between data and noise. Table 6 summarizes common schedule choices with their derivatives. Table 6. Interpolation schedules for flow matching. All satisfy boundary conditions $(\alpha_0, \beta_0) = (1, 0)$ and ($\alpha_1$,...

work page 2023
[7]

Taking the conditional expectation given $X_t = x_t$: $\mathbb{E}\left[\frac{\alpha_t X_0 - x_t}{\beta_t^2} \,\middle|\, X_t = x_t\right] = \frac{\alpha_t \hat{x}_0(x_t, t) - x_t}{\beta_t^2} = -\frac{\hat{x}_1(x_t, t)}{\beta_t}$, where the last equality uses (6)

$= (\alpha_t x_0 - x_t)/\beta_t^2$. Taking the conditional expectation given $X_t = x_t$: $\mathbb{E}\left[\frac{\alpha_t X_0 - x_t}{\beta_t^2} \,\middle|\, X_t = x_t\right] = \frac{\alpha_t \hat{x}_0(x_t, t) - x_t}{\beta_t^2} = -\frac{\hat{x}_1(x_t, t)}{\beta_t}$, where the last equality uses (6). Thus, learning the denoiser $\hat{x}_1$ is equivalent to learning the score $\nabla \log p_t$. B.4. Training Objectives. The denoiser can be trained by regr...

work page 2005
[8]

unifying

B.5. DDIM Sampling. The DDIM framework (Song et al., 2021a) is canonically defined under the variance-preserving constraint $\alpha_t^2 + \beta_t^2 = 1$ and produces a one-parameter family of reverse transitions sharing the marginals of (5). Given timesteps $(t_k)_{k=0}^{K}$ with $t_K = 1$ and $t_0 = 0$, the transition from $t_{k+1}$ to $t_k$ is: $x_{t_k} = \alpha_{t_k}\, \hat{x}_0^{\theta}(x_{t_{k+1}}, t_{k+1}) + \sqrt{\beta_{t_k}^2 - \eta^2 \cdots}$ ...

work page 2020
[9]

Identifying $t = k/K$, $\alpha_t = \sqrt{\bar\alpha_k}$, and $\beta_t = \sqrt{1 - \bar\alpha_k}$ recovers (5) exactly

$= \mathcal{N}\big(x_k; \sqrt{\bar\alpha_k}\, x_0, (1 - \bar\alpha_k) I_d\big)$. Identifying $t = k/K$, $\alpha_t = \sqrt{\bar\alpha_k}$, and $\beta_t = \sqrt{1 - \bar\alpha_k}$ recovers (5) exactly. The DDPM noise schedule $(\beta_k^{\mathrm{DDPM}})$ is therefore one particular choice within the VP family in Table 6, with $\alpha_t^2 + \beta_t^2 = 1$ by construction. (ii) Forward score SDE. Song et al. (2021b) unify diffusion models throu...

work page 2023
[10]

Proposition C.1 (Log-Transform of Regularly Varying; precise). Let $X$ be a nonnegative random variable that is regularly varying with index $-\alpha$, $\alpha > 0$

for a comprehensive treatment. Proposition C.1 (Log-Transform of Regularly Varying; precise). Let $X$ be a nonnegative random variable that is regularly varying with index $-\alpha$, $\alpha > 0$. Set $\tilde{X} = \phi(X)$ with $\phi(x) = \mathrm{sign}(x) \log(1 + |x|)$. Then for every $\epsilon > 0$ there exists $z_0 = z_0(\epsilon)$ such that for all $z \ge z_0$, $e^{-(\alpha + \epsilon) z} \le P(\tilde{X} > z) \le e^{-(\alpha - \epsilon) z}$. (14) In particular, $-\log P(\tilde{X} > z)$ ...

work page 1987
[11]

A simple data-driven choice is $s_2^{(j)} = c/\mathrm{IQR}(X_j)$ for a constant $c$ (e.g

Coordinate-wise $s_2^{(j)}$ and the adaptive instance. In the multivariate setting we may pick a coordinate-dependent scale $s_2^{(j)}$. A simple data-driven choice is $s_2^{(j)} = c/\mathrm{IQR}(X_j)$ for a constant $c$ (e.g. $c = 1$), which puts the cross-over at a robust scale of the marginal. The Hill-gated method of Section 4.2 is...

work page 2025
[12]

Baseline Validation To ensure fair comparison, we validate that our implementations of the baseline methods reproduce the results reported in Hickling & Prangle (2025)

E.4. Baseline Validation. To ensure fair comparison, we validate that our implementations of the baseline methods reproduce the results reported in Hickling & Prangle (2025). Table 10 compares our NLL values (per dimension) with their Table 7 reference values across Table 8. $W_1$ on Fama–French 5 (mean ± std, 10...

work page 2025
[13]

Table 10.NLL Validation: Our Implementation vs Hickling Reference (Table 7)

α     10     20     50     100    200    500
1.5   0.521  0.543  0.568  0.579  0.584  0.588
2.0   0.137  0.135  0.138  0.141  0.142  0.143
2.5   0.084  0.081  0.083  0.084  0.085  0.086

the original benchmark configurations. Table 10. NLL Validation: Our Implementation vs Hickling Reference (Table 7). Format: ours [ref]. The coupling-layer baselines (TTFfix, TTF) reproduce the reference values withi...

work page 2025
