pith. machine review for the scientific record.

arxiv: 2605.13386 · v1 · submitted 2026-05-13 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Support-Conditioned Flow Matching Is Kernel Smoothing

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords flow matching · Nadaraya-Watson · kernel smoothing · cross-attention · optimal transport · generative models · conditioning

The pith

Under the Gaussian optimal-transport path, conditioning flow matching on a finite support set is exactly Nadaraya-Watson kernel smoothing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conditioning a flow matching model on a small set of examples produces a velocity field that is precisely a Nadaraya-Watson kernel smoother when the Gaussian optimal transport path is followed. The smoother's bandwidth shrinks steadily with flow time, so early steps average broadly across the support while late steps act like nearest-neighbor lookup. A single Gaussian-kernel attention head implements this field exactly, linking cross-attention to classical nonparametric regression. The same analysis identifies three concrete regimes where the smoothing breaks down, and experiments on mixtures and image features confirm that learned attention improves exactly where the theory predicts failures.
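As a concrete illustration of the claimed equivalence (an editorial sketch, not the authors' code; it assumes the common parameterization x_t = (1 − t)·x_0 + t·x_1 with x_0 ~ N(0, I) and no extra noise term, and the function name is illustrative), the plug-in velocity field over a finite support set is a softmax over squared distances, i.e. a single Gaussian-kernel attention head whose temperature plays the role of the Nadaraya-Watson bandwidth:

```python
import numpy as np

def plug_in_velocity(x, support, t):
    """Plug-in NW velocity field at query points x (n, d) given a support
    set (m, d), under the assumed path x_t = (1-t) x0 + t x1, x0 ~ N(0, I).
    The weights are a softmax over squared distances, i.e. a single
    Gaussian-kernel attention head with bandwidth (1 - t)."""
    h = 1.0 - t                                    # bandwidth shrinks with flow time
    d2 = ((x[:, None, :] - t * support[None, :, :]) ** 2).sum(-1)  # (n, m)
    logits = -d2 / (2.0 * h ** 2)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # NW / attention weights
    x1_hat = w @ support                           # E[x1 | x_t] as a kernel smoother
    return (x1_hat - x) / h                        # conditional velocity E[x1 - x0 | x_t]

# toy usage: 5 support points in 2-D, 3 query points at mid-flow
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 2))
x = rng.normal(size=(3, 2))
v = plug_in_velocity(x, support, t=0.5)
print(v.shape)  # (3, 2)
```

As t → 1 the bandwidth (1 − t) shrinks, the softmax concentrates on the nearest support point, and the field transitions from broad averaging to nearest-neighbor lookup, matching the behavior described above.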

Core claim

Under the Gaussian optimal-transport path, the exact velocity field induced by a finite support set is a Nadaraya-Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation.

What carries the argument

Nadaraya-Watson kernel smoother with time-dependent bandwidth under the Gaussian optimal-transport path, computed exactly by a single Gaussian-kernel attention head.
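Stated as a formula, under the same assumed parameterization as the sketch above (the paper's exact bandwidth schedule may differ):

```latex
u_t(x) \;=\; \frac{\hat{x}_1(x,t) - x}{1 - t},
\qquad
\hat{x}_1(x,t) \;=\; \sum_{i=1}^{m} w_i(x,t)\, x_1^{(i)},
\qquad
w_i(x,t) \;=\;
  \frac{\exp\!\bigl(-\lVert x - t\,x_1^{(i)}\rVert^{2} / 2(1-t)^{2}\bigr)}
       {\sum_{j=1}^{m} \exp\!\bigl(-\lVert x - t\,x_1^{(j)}\rVert^{2} / 2(1-t)^{2}\bigr)} .
```

Here the endpoint estimate is a Nadaraya-Watson regression over the support set, the weights w_i are exactly what a Gaussian-kernel attention head computes, and the effective bandwidth is proportional to 1 − t.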

If this is right

  • The velocity field induced by any finite support set is exactly the weighted average given by a Gaussian-kernel Nadaraya-Watson estimator.
  • The estimator's bandwidth shrinks monotonically, producing a continuous transition from global averaging to local nearest-neighbor behavior.
  • A single attention head with Gaussian kernel exactly realizes the required velocity field without further parameters.
  • Conditioning fails when dimension causes kernel collapse to nearest neighbors, when the isotropic kernel mismatches the data geometry, or when the support size is too small for stable nonparametric estimation (a small diagnostic sketch for the first regime follows this list).
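One way to operationalize the first failure mode (an editorial sketch; the paper's own neff definition may differ): the effective number of support points carrying weight, computed as the inverse participation ratio of the kernel weights, collapses toward 1 when the kernel degenerates into nearest-neighbor lookup.

```python
import numpy as np

def effective_support_size(w):
    """Inverse participation ratio of NW / attention weights.
    w: (n, m) rows summing to 1.  Returns n_eff per query in [1, m];
    values near m mean broad averaging, values near 1 mean 1-NN collapse."""
    return 1.0 / (w ** 2).sum(axis=1)

# toy example: at higher dimension the Gaussian kernel concentrates on one neighbor
rng = np.random.default_rng(0)
for d in (2, 16):
    support = rng.normal(size=(50, d))
    x = rng.normal(size=(1, d))
    t, h = 0.56, 1.0 - 0.56
    d2 = ((x[:, None, :] - t * support[None, :, :]) ** 2).sum(-1)
    w = np.exp(-(d2 - d2.min()) / (2 * h ** 2))
    w /= w.sum(axis=1, keepdims=True)
    print(d, effective_support_size(w))
```

At low dimension the weights spread over several support points; at d = 16 the same query typically attends to a single neighbor, which is the collapse the figures track as neff.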

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention-based conditioning in other generative models can be re-derived as adaptive kernel smoothers once the transport path is fixed.
  • Replacing the isotropic Gaussian kernel with a geometry-aware kernel would directly address one of the predicted failure regimes.
  • The same reduction may apply to score-based diffusion models under Gaussian paths, offering a statistical interpretation of their conditioning layers.
  • IP-Adapter-style adapters already implement an approximation to this smoother, which explains why they succeed on image features but may still degrade in high dimensions.

Load-bearing premise

The derivation assumes the Gaussian optimal-transport path between the data and noise distributions.
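In one standard form of that path (the paper may include an additional noise scale σ(t), which only enlarges the conditional variance):

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \quad x_0 \sim \mathcal{N}(0, I)
\;\;\Longrightarrow\;\;
x_t \mid x_1 \;\sim\; \mathcal{N}\!\bigl(t\,x_1,\; (1 - t)^{2} I\bigr).
```

It is this Gaussian conditional that turns the posterior over support points into a softmax of squared distances, i.e. the Gaussian kernel weights above; a different conditional yields different weights.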

What would settle it

Explicitly computing the velocity field for a non-Gaussian path such as linear interpolation and verifying that the weights no longer equal the Nadaraya-Watson kernel form.
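One concrete way to run such a check (an editorial sketch; the Laplace base and the function names are illustrative choices, not from the paper): compute the exact posterior weights over support points for a linear interpolant under two base distributions and compare them with the Gaussian NW form.

```python
import numpy as np

def posterior_weights(x, support, t, base_logpdf):
    """Exact posterior p(i | x_t = x) for the linear path
    x_t = (1-t) x0 + t x1 with x0 ~ base and x1 uniform on the support:
    w_i is proportional to base_pdf((x - t x_i) / (1 - t))."""
    z = (x[None, :] - t * support) / (1.0 - t)      # implied x0 for each support point
    logw = base_logpdf(z)
    w = np.exp(logw - logw.max())
    return w / w.sum()

gauss = lambda z: -0.5 * (z ** 2).sum(-1)           # N(0, I) log-density, up to constants
laplace = lambda z: -np.abs(z).sum(-1)              # Laplace(0, 1) log-density, up to constants

rng = np.random.default_rng(0)
support = rng.normal(size=(20, 4))
x, t = rng.normal(size=4), 0.7

w_gauss = posterior_weights(x, support, t, gauss)      # Gaussian NW weights
w_laplace = posterior_weights(x, support, t, laplace)  # L1 kernel, not the NW form
print(np.abs(w_gauss - w_laplace).max())
```

With the Gaussian base the weights reduce to the Gaussian-kernel NW weights; with a Laplace base they become an L1 (Laplacian) kernel, so the closed-form NW identity is specific to the Gaussian conditional.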

Figures

Figures reproduced from arXiv: 2605.13386 by Daniel Matsui Smola.

Figure 1: (a) Exact-head benefit (%) vs. dimension (≥ 4 seeds): declines from +28% at d = 2 to 0% at d = 16. (b) neff at mid-flow (t = 0.56) vs. dimension: the kernel collapses to 1-NN over the same range. (c) Multi-head scaling on ℝ⁸ Gaussian mixtures: the exponent α (from MMD² ∼ m^(−α)) increases with head count H; more heads (each in lower d_k = d_model/H dimensions) yield faster rates. view at source ↗
Figure 2: (a) On spherical shells at d = 8, the isotropic NW kernel collapses to 1-NN at late flow times, while learned heads maintain multi-neighbor smoothing with differentiated bandwidths. (b) On DINOv2+PCA ImageNet features (d = 16), the learned model outperforms the plug-in at small m; the plug-in catches up as m grows, with a crossover near m ≈ 25. view at source ↗
Figure 3. view at source ↗
Figure 4: (a) ImageNet transition curve: the exact head is marginal at d ≤ 16 and harmful at d = 64 (3/5 ON seeds diverge vs. 0/3 OFF). view at source ↗
Figure 5: Whitened ImageNet control: five whitening strengths interpolating from the original features (neff = 12.35) to fully whitened (neff = 1.05). Plug-in MMD² is flat (0.0075–0.0077), confirming that neff collapse alone does not degrade the plug-in. Panels: (a) neff drops with whitening strength α; (b) plug-in MMD² vs. whitening strength α. view at source ↗
Figure 6: (a) ℝ² GMMs: the learned model maintains a 2.4–5.5× advantage at all m. (b) Shells at d = 8: α_shells ≈ α_GMM; the rate bottleneck is meta-learning, not task complexity. view at source ↗
Figure 7: Leave-one-family-out: plug-in MMD² (red) vs. LOFO held-out MMD² (orange). GMMs held out from curve families: 37× worse. Spirals held out: 0.4× (better than plug-in). view at source ↗
Figure 8: Per-head attention entropy at d = 8, H = 8 (layer 0). Heads specialize to different effective bandwidths: some maintain high entropy (broad smoothing) while others collapse to near-zero (sharp, nearest-neighbor-like). view at source ↗
Figure 9: IP-Adapter null-model analysis (SD 1.5, 50 images, 128 heads); per-head Spearman ρ at t = 1 (mean = 0.88) against a permutation null, with the corresponding CDF. view at source ↗
Figure 10: Support-conditioned generation on four ℝ² families (m = 50). Columns: ground truth, support set, plug-in, learned ICFM. The plug-in places mass near support points; the learned model generates fresh samples matching the ground-truth density. view at source ↗
read the original abstract

Generative models are often conditioned on a small set of examples via cross-attention. Under the Gaussian optimal-transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya--Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP-Adapter's cross-attention implements approximate NW smoothing in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that under the Gaussian optimal-transport path between noise and data, the exact velocity field induced by conditioning flow matching on a finite support set is identical to the Nadaraya-Watson kernel smoother whose bandwidth shrinks monotonically with flow time t (broad averaging early, nearest-neighbor late). It shows that this field is computed exactly by a single Gaussian-kernel attention head, thereby linking cross-attention conditioning to classical kernel regression. The theory predicts three concrete failure regimes (high-dimensional NN collapse, isotropic-kernel/geometry mismatch, and insufficient support size) that are then tested on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features.

Significance. If the central equivalence holds, the work supplies a parameter-free derivation that unifies attention-based conditioning in flow models with nonparametric statistics, explains observed behaviors, and yields falsifiable predictions about when conditioning succeeds or fails. The explicit connection to the Gaussian OT path and the supporting mixture and feature experiments constitute a clear contribution to the theoretical understanding of conditional generative models.

major comments (1)
  1. [§3] §3 (velocity-field derivation): the reduction to the exact Nadaraya-Watson form (with the stated time-dependent bandwidth schedule) is shown only for the Gaussian OT interpolation; the manuscript correctly flags this scope restriction, but an explicit one-line statement of the conditional-expectation step that produces the kernel weights would make the load-bearing algebraic step easier to verify without re-deriving the entire OT path.
minor comments (2)
  1. [Experiments] Figure 3 (DINOv2 experiment): the caption and surrounding text do not report the effective support size or ambient dimension used for the nearest-neighbor-collapse prediction, making it harder to map the observed improvement directly onto the three failure regimes listed in the theory section.
  2. Notation: the symbol for the time-dependent bandwidth (denoted sigma(t) in the text) is not introduced with an explicit equation reference in the main derivation paragraph, which slightly obscures the link between the closed-form velocity and the plotted kernel widths.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept. We address the single major comment below.

read point-by-point responses
  1. Referee: [§3] §3 (velocity-field derivation): the reduction to the exact Nadaraya-Watson form (with the stated time-dependent bandwidth schedule) is shown only for the Gaussian OT interpolation; the manuscript correctly flags this scope restriction, but an explicit one-line statement of the conditional-expectation step that produces the kernel weights would make the load-bearing algebraic step easier to verify without re-deriving the entire OT path.

    Authors: We thank the referee for this constructive suggestion. We will add a single explicit sentence immediately after the conditional-expectation expression in §3 that identifies the algebraic step producing the kernel weights, thereby allowing verification without re-deriving the full OT path. This change will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation from Gaussian OT conditional expectation to NW kernel form is direct and self-contained

full rationale

The paper computes the exact velocity field as the conditional expectation E[v_t | x_t] under the explicit Gaussian optimal-transport interpolation x_t = (1-t)x_0 + t x_1 + sigma(t) z with z ~ N(0,I). This expectation reduces algebraically to the Nadaraya-Watson weighted average of the finite support points with time-dependent bandwidth sigma(t), without any fitted parameters, without invoking prior self-citations for the identity, and without renaming an existing empirical pattern. The paper explicitly restricts the claim to this path and flags that other interpolations break the closed form, so the central result does not reduce to its inputs by construction.
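For completeness, the conditional-expectation step the referee asks to see can be sketched as follows (assuming x_0 is the standard Gaussian base and writing h(t)² for the total conditional variance, i.e. (1 − t)² plus any explicit σ(t)² term; the paper's exact schedule may differ):

```latex
% uniform prior over the m support points; Gaussian conditional from the path
p\bigl(x_1^{(i)} \mid x_t = x\bigr)
  \;=\; \frac{\mathcal{N}\!\bigl(x;\; t\,x_1^{(i)},\; h(t)^{2} I\bigr)}
             {\sum_{j=1}^{m} \mathcal{N}\!\bigl(x;\; t\,x_1^{(j)},\; h(t)^{2} I\bigr)}
  \;=\; w_i(x, t),
\qquad
u_t(x) \;=\; \sum_{i=1}^{m} w_i(x, t)\;
  \mathbb{E}\bigl[\dot{x}_t \mid x_t = x,\; x_1 = x_1^{(i)}\bigr].
```

The exact field is thus the w_i-weighted average of per-support conditional velocities, which is the Nadaraya-Watson smoother with bandwidth h(t) shrinking to zero as t → 1.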

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Gaussian optimal-transport path between noise and data; no free parameters are introduced beyond the standard kernel bandwidth schedule that emerges from the derivation. No new entities are postulated.

axioms (1)
  • domain assumption: the interpolation path between noise and data is the Gaussian optimal-transport path.
    Invoked in the first sentence of the abstract as the setting under which the velocity field is derived.

pith-pipeline@v0.9.0 · 5430 in / 1244 out tokens · 39482 ms · 2026-05-14T20:20:36.293939+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

27 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1] Albergo, M. S. & Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In ICLR, 2023.
  2. [2] Abu Hamad, H. & Rosenbaum, D. Flow matching neural processes. In NeurIPS, 2025.
  3. [3] Kunkel, L. & Trabs, M. On the minimax optimality of flow matching through the connection to kernel density estimation. arXiv preprint arXiv:2504.13336, 2025.
  4. [4] Zhou, Z., Zhang, Z., & Amini, A. A. Flow matching generalizes through discretization bias. Submitted to ICLR, 2026.
  5. [5] Bengio, Y., Delalleau, O., & Le Roux, N. The curse of highly variable functions for local kernel machines. In NeurIPS, 2005.
  6. [6] Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. When is "nearest neighbor" meaningful? In ICDT, 1999.
  7. [7] Chen, W., Hu, H., Li, Y., Ruiz, N., Jia, X., Chang, M.-W., & Cohen, W. W. Subject-driven text-to-image generation via apprenticeship learning. In NeurIPS, 2023.
  8. [8] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., & Weller, A. Rethinking attention with Performers. In ICLR, 2021.
  9. [9] Efron, B. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  10. [10] Fukumizu, K., Suzuki, T., Isobe, N., Oko, K., & Koyama, M. Flow matching achieves almost minimax optimal convergence. In ICLR, 2025.
  11. [11] Goel, G. & Bartlett, P. Can a transformer represent a Kalman filter? In L4DC (PMLR vol. 242), 2024.
  12. [12] Giannone, G., Nielsen, D., & Winther, O. Few-shot diffusion models. arXiv preprint arXiv:2205.15463, 2022.
  13. [13] Ilin, V. & Sushko, P. DiScoFormer: Plug-in density and score estimation with transformers. arXiv preprint arXiv:2511.05924, 2025.
  14. [14] Najdenkoska, I., Sinha, A., Dubey, A., Mahajan, D., Ramanathan, V., & Radenovic, F. Context diffusion: In-context aware image generation. In ECCV, 2024.
  15. [15] Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020.
  16. [16] Lepski, O., Mammen, E., & Spokoiny, V. Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. Annals of Statistics, 25(3):929–947, 1997.
  17. [17] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. Flow matching for generative modeling. In ICLR, 2023.
  18. [18] Liu, H., Liu, J., Li, Y., Bai, L., Ji, Y., Guo, Y., Wan, S., & Wen, H. From navigation to refinement: Revealing the two-stage nature of flow-based diffusion models through oracle velocity. arXiv preprint arXiv:2512.02826, 2025.
  19. [19] Kim, J., Kim, S., & Lee, J.-S. How diffusion models memorize. arXiv preprint arXiv:2509.25705, 2025.
  20. [20] Meunier, D., Li, Z., Gretton, A., & Kpotufe, S. Nonlinear meta-learning can guarantee faster rates. SIAM Journal on Mathematics of Data Science, 7(4):1594–1615, 2025.
  21. [21] Nadaraya, E. A. On estimating regression. Theory of Probability and its Applications, 9(1):141–142, 1964.
  22. [22] Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P., Li, S., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. DINOv2: Learning robust visual f...
  23. [23] Stone, C. J. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053, 1982.
  24. [24] Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer, 2009.
  25. [25] Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., & Salakhutdinov, R. Transformer dissection: A unified understanding of transformer's attention via the lens of kernel. In EMNLP, 2019.
  26. [26] Watson, G. S. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.
  27. [27] Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.