pith. sign in

arxiv: 2605.28075 · v1 · pith:KL4HOGFNnew · submitted 2026-05-27 · 💻 cs.LG

Measure-to-measure Regression with Transformers

Pith reviewed 2026-06-29 14:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords measure-to-measure regressiontransformersprobability measuresmean-field limitsinteracting particle systemsorganoid datanonlinear operatorspopulation dynamics
0
0 comments X

The pith

Transformers can learn nonlinear maps between probability measures from finite observed input-output pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes nonlinear measure-to-measure regression as the task of learning an operator that transforms one probability measure into another, using entire distributions rather than individual points as data. It introduces two transformer-based methods that exploit the architecture's natural dependence on measures and its mean-field limit to produce scalable, expressive solutions. This framing matters for settings like cellular biology where populations evolve collectively rather than independently. The methods are shown to generalize to unseen measures across synthetic data, interacting particle systems, and a large organoid dataset.

Core claim

We formalize nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions.

What carries the argument

Transformers used as static maps from input measure to output measure, or as velocity fields that generate the map via a dynamical system, both exploiting the architecture's permutation-equivariant, mean-field behavior on sets of points.

If this is right

  • The static and dynamic transformer formulations both generalize to measures not seen during training on synthetic and particle-system benchmarks.
  • The same models produce accurate predictions of treatment-induced shifts in cell-population distributions on a colorectal-cancer organoid dataset.
  • Because the methods operate directly on point-cloud representations of measures, they scale to the sample sizes typical of modern single-cell experiments without requiring explicit density estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dynamic velocity-field formulation may allow integration with existing continuous-time models of population dynamics outside the training distribution.
  • If the mean-field transformer limit captures the necessary nonlinearities, similar architectures could be applied to other measure-valued processes such as empirical distributions in fluid mechanics or opinion dynamics.
  • The finite-sample guarantee implicit in the organoid results suggests that collecting paired measure observations may be more data-efficient than collecting matched individual trajectories when the goal is population-level prediction.

Load-bearing premise

The measure-dependent and mean-field structure of transformers is sufficient to learn nonlinear maps on the space of probability distributions from finite observed input-output pairs of measures.

What would settle it

On the held-out patient-derived organoid dataset, if the transformer-predicted output measures show no improvement in matching the observed treatment-response distributions compared with a baseline that maps each input point cloud independently without using the global measure structure, the practical claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.28075 by Lazar Atanackovic, Martha White, Matthew Vandergrift, Philippe Rigollet, Yury Polyanskiy.

Figure 1
Figure 1. Figure 1: Illustration of the measure-to-measure (M2M) regression problem. In M2M regression, we treat entire distributions/measures as paired data points (e.g., (µi, νi) ∈ P(R d ) × P(R d )) to learn a contin￾uous mapping between them. Leveraging the natu￾ral measure-dependent structure of transformers, we introduce two classes of methods for learning non￾linear M2M operators: static one-step pushforward maps F : µ… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of model predictions on multi-measure objects for unseen measures. Methods are tasked with transporting particles from source measures (top row) to target measures (bottom row). We show visuals for models trained to reverse (a) a diffusion process, and (b) a kernel interaction process. unknown meta-distribution on P(R d ), and the goal is to sample new measures from this same meta￾distributio… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of various model predictions on 2 dimen￾sional Kuramoto system. We set new initial conditions, apply the kuramoto dynamics to it and have the models predict. For static methods we only show M2M-W1. We evaluate our approaches for pre￾dicting the evolution of synthetic McKean-Vlasov processes [McK￾ean Jr, 1966]. We use McKean-Vlasov processes because their dynamics are distribution dependent an… view at source ↗
Figure 4
Figure 4. Figure 4: Empirical distribution of ED across all measures. Statistics reported for each split. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Extended visualization of model predictions on multi-measure objects for unseen measures. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
read the original abstract

Many learning problems require predicting how populations evolve under an unknown transformation. A natural representation for such populations is a probability measure, with point clouds as a key example. In this work, we study the measure-to-measure (M2M) regression problem, in which one seeks to learn a map between probability measures from a finite collection of observed input-output pairs. In contrast to classical regression, where individual samples are transformed independently, M2M regression treats entire distributions as the data points. This perspective is vital in certain scientific applications, for example, cellular and molecular biology, where cells are known to evolve not as independent data points but as a collection. However, few existing approaches address the problem of M2M regression with sufficient expressivity and scalability. We present a formalization of nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions. We illustrate the effectiveness of our proposed method to generalize to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for predicting treatment response in colorectal cancer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript formalizes the nonlinear measure-to-measure (M2M) regression problem of learning maps between probability measures from finite observed input-output pairs. It introduces two transformer-based approaches—static M2M maps and dynamic M2M velocity fields—that exploit the measure-dependent and mean-field structure of transformers to learn nonlinear operators on the space of distributions. Effectiveness is illustrated via generalization to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for colorectal cancer treatment response prediction.

Significance. If the central claim holds—that the mean-field structure of transformers is sufficient to learn arbitrary nonlinear maps on probability measures from finite pairs—this would supply a scalable, expressive framework for population-level prediction tasks in biology and physics where classical pointwise regression is inadequate. The explicit use of transformers' built-in permutation invariance and aggregation properties for distributional data is a natural architectural choice that could reduce the need for custom measure-valued architectures.

major comments (2)
  1. [Abstract] Abstract: the claim that the proposed methods 'illustrate the effectiveness' of learning nonlinear M2M maps is unsupported by any quantitative results, error metrics, baseline comparisons, or ablation studies, preventing assessment of whether the data actually supports the sufficiency of the transformer mean-field structure for the stated task.
  2. [Abstract] Abstract: no equations, architectural specifications, or expressivity arguments are supplied to substantiate that the measure-dependent structure of transformers is in fact sufficient to learn arbitrary nonlinear maps on probability measures from finite input-output pairs, leaving the central modeling assumption untestable from the provided text.
minor comments (1)
  1. [Abstract] The abstract refers to 'synthetic experiments' and 'a large-scale patient-derived organoid dataset' without naming the specific datasets, number of measures, or dimensionality, which would aid reproducibility even at the high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We respond to the major comments on the abstract below. The full manuscript contains the supporting details, but we will consider revisions to strengthen the abstract where possible.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the proposed methods 'illustrate the effectiveness' of learning nonlinear M2M maps is unsupported by any quantitative results, error metrics, baseline comparisons, or ablation studies, preventing assessment of whether the data actually supports the sufficiency of the transformer mean-field structure for the stated task.

    Authors: The abstract provides a concise summary of the work. The manuscript includes quantitative experimental results on synthetic data, particle systems, and the organoid dataset, with comparisons and ablations detailed in the experimental section. We agree that the abstract could better hint at these results and will revise it to include a brief reference to the observed generalization performance and comparisons, subject to length constraints. revision: partial

  2. Referee: [Abstract] Abstract: no equations, architectural specifications, or expressivity arguments are supplied to substantiate that the measure-dependent structure of transformers is in fact sufficient to learn arbitrary nonlinear maps on probability measures from finite input-output pairs, leaving the central modeling assumption untestable from the provided text.

    Authors: We note that the abstract does not claim sufficiency for arbitrary nonlinear maps; it states that the approach leverages the structure to learn nonlinear M2M maps and illustrates effectiveness on specific tasks. The full paper provides the formalization, model architectures, and theoretical motivations in the methods section. The central claim is empirical effectiveness rather than universal sufficiency. We can revise the abstract to clarify the scope of the claims if needed, but the supporting details are in the main text. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The provided abstract and context contain no equations, derivations, or load-bearing steps that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on leveraging the pre-existing measure-dependent and mean-field properties of transformers (an external architectural fact) to address M2M regression, without any internal reduction of predictions to inputs or renaming of known results as novel derivations. This is the most common honest finding for papers that apply established models without introducing self-referential formalisms.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5769 in / 986 out tokens · 34272 ms · 2026-06-29T14:44:01.589162+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797,

  2. [2]

    Meta optimal transport.arXiv preprint arXiv:2206.05262,

    Brandon Amos, Samuel Cohen, Giulia Luise, and Ievgen Redko. Meta optimal transport.arXiv preprint arXiv:2206.05262,

  3. [3]

    Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv preprint arXiv:2406.07507,

    Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv preprint arXiv:2406.07507,

  4. [4]

    A Unified Perspective on the Dynamics of Deep Transformers

    Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A unified perspective on the dynamics of deep transformers.arXiv preprint arXiv:2501.18322,

  5. [5]

    Quantitative Clustering in Mean-Field Transformer Models

    Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Quantitative clustering in mean-field transformer models, 2025a. arXiv:2504.14697. Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Critical attention scaling in long-context transformers. InInternational Conference on Learning Representations,

  6. [6]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint arXiv:2602.04770,

  7. [7]

    Minibatch optimal transport distances; analysis and applications.arXiv preprint arXiv:2101.01792,

    Kilian Fatras, Younes Zine, Szymon Majewski, Rémi Flamary, Rémi Gribonval, and Nicolas Courty. Minibatch optimal transport distances; analysis and applications.arXiv preprint arXiv:2101.01792,

  8. [8]

    Generative distribution embeddings.arXiv preprint arXiv:2505.18150,

    Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, and Omar Abudayyeh. Generative distribution embeddings.arXiv preprint arXiv:2505.18150,

  9. [9]

    Distribution-conditioned transport.arXiv preprint arXiv:2603.04736,

    Nic Fishman, Gokul Gowri, Paolo LB Fischer, Marinka Zitnik, Omar Abudayyeh, and Jonathan Gootenberg. Distribution-conditioned transport.arXiv preprint arXiv:2603.04736,

  10. [10]

    The emergence of clusters in self-attention dynamics

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. InNeural Information Processing Systems, 2024a. Borjan Geshkovski, Philippe Rigollet, and Domènec Ruiz-Balet. Measure-to-measure interpolation using transformers.arXiv preprint arXiv:2411.04551, 2024b. Borjan Geshkovski, Cyril L...

  11. [11]

    Panaretos

    Laya Ghodrati and Victor M. Panaretos. Transportation of measure regression in higher dimensions. arXiv preprint arXiv:2305.17503,

  12. [12]

    A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots

    Vincent Guan, Lazar Atanackovic, and Kirill Neklyudov. A call to lagrangian action: Learning population mechanics from temporal snapshots.arXiv preprint arXiv:2605.08550,

  13. [13]

    12 Bowen Jing, Hannes Stärk, Tommi Jaakkola, and Bonnie Berger

    arXiv preprint arXiv:2302.13752. 12 Bowen Jing, Hannes Stärk, Tommi Jaakkola, and Bonnie Berger. Generative modeling of molecular dynamics trajectories.Neural Information Processing Systems,

  14. [14]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

  15. [15]

    Continuous transformations of probability measures and their transport representations

    Hugo Lavenant and Giuseppe Savaré. Continuous transformations of probability measures and their transport representations.arXiv preprint arXiv:2604.16653,

  16. [16]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

  17. [17]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003,

  18. [18]

    A class of markov processes associated with nonlinear parabolic equations

    Henry P McKean Jr. A class of markov processes associated with nonlinear parabolic equations. Proceedings of the National Academy of Sciences, 56(6):1907–1911,

  19. [19]

    Wfr-fm: Simulation-free dynamic unbalanced optimal transport.arXiv preprint arXiv:2601.06810,

    Qiangwei Peng, Zihan Wang, Junda Ying, Yuhao Sun, Qing Nie, Lei Zhang, Tiejun Li, and Peijie Zhou. Wfr-fm: Simulation-free dynamic unbalanced optimal transport.arXiv preprint arXiv:2601.06810,

  20. [20]

    Katarina Petrovi ´c, Lazar Atanackovic, Viggo Moro, Kacper Kapu ´sniak, Ismail Ilkan Ceylan, Michael M

    arXiv preprint arXiv:2402.14753. Katarina Petrovi ´c, Lazar Atanackovic, Viggo Moro, Kacper Kapu ´sniak, Ismail Ilkan Ceylan, Michael M. Bronstein, Joey Bose, and Alexander Tong. Curly flow matching for learning non- gradient field dynamics. InNeural Information Processing Systems,

  21. [21]

    Meta Flow Maps enable scalable reward alignment

    13 Peter Potaptchik, Adhi Saravanan, Abbas Mammadov, Alvaro Prat, Michael S Albergo, and Yee Whye Teh. Meta flow maps enable scalable reward alignment.arXiv preprint arXiv:2601.14430,

  22. [22]

    SplineFlow: Flow Matching for Dynamical Systems with B-Spline Interpolants

    Santanu Subhash Rathod, Pietro Liò, and Xiao Zhang. Splineflow: Flow matching for dynamical systems with b-spline interpolants.arXiv preprint arXiv:2601.23072,

  23. [23]

    Modeling microenvironment trajectories on spatial transcriptomics with nicheflow.arXiv preprint arXiv:2511.00977,

    Kristiyan Sakalyan, Alessandro Palma, Filippo Guerranti, Fabian J Theis, and Stephan Günnemann. Modeling microenvironment trajectories on spatial transcriptomics with nicheflow.arXiv preprint arXiv:2511.00977,

  24. [24]

    Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré

    arXiv preprint arXiv:2410.03011. Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. InInternational Conference on Artificial Intelligence and Statistics. PMLR,

  25. [25]

    Improving and generalizing flow-based generative models with minibatch optimal transport

    Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector- Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport.arXiv preprint arXiv:2302.00482,

  26. [26]

    Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Ping Li, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, et al

    arXiv preprint arXiv:2402.00522. Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Ping Li, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction.Nature Methods,

  27. [27]

    Tahoe-100m: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling.BioRxiv, pages 2025–02,

    Jesse Zhang, Airol A Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, Aidan Winters, Umair Khan, Matthew G Jones, et al. Tahoe-100m: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling.BioRxiv, pages 2025–02,

  28. [28]

    Learning gaussian mix- ture models via transformer measure flows

    Aleksandr Zimin, Anastasiia Kutakh, Yury Polyanskiy, and Philippe Rigollet. Learning gaussian mix- ture models via transformer measure flows. InICML 2025 Workshop on Methods and Opportunities at Small Scale,

  29. [29]

    15 A Additional Experimental Details A.1 Multi-measure Objects In this section we outline additional experimental setup details for the Multi-measure Objects experiments. The kernel interaction process in this experiment is defined by the following dynamics equation, xt+∆t =x t +η(x t − X j Aijxj)∆t+σ √ ∆tϵ where Aij = exp(−∥x i −x j∥2/2h2)/P k exp(−∥xi −...

  30. [30]

    We choose to treat these pairs as independent rather than employ complex solutions used for multi-marginal flow matching Rohbeck et al

    We update on each of these 16 pairs in parallel. We choose to treat these pairs as independent rather than employ complex solutions used for multi-marginal flow matching Rohbeck et al. [2025]. This means that the models trained in this setup, sample two adjacent time-points for updating. Since the model is conditioned on time, it learns a single map then ...

  31. [31]

    Statistics reported for each split

    Cut-off Threshold: 0 .1 (c) PDO-75 Figure 4: Empirical distribution of ED across all measures. Statistics reported for each split. 17 A.4 Additional Information on Baselines In this section we offer information about our baseline methods. All models incorporate positional encoding for samples/particles. For time-dependent models, we use sinusoidal embeddin...

  32. [32]

    maximum mean discrepancy (MMD)

    demonstrated that OT-CFM yields competitive performance relative to CellOT [Bunne et al., 2023], an input convex neural network (ICNN)-based approach, on the colorectal cancer treatment-response dataset, which we consider in Section 5.3. Similar to [Bunne et al., 2023], OT-CFM provides an optimal transport-based approximation for the M2M operator, and as ...

  33. [33]

    Computation of Losses.To compute distributional losses efficiently, we employ the Geometric Loss Functions package [Feydy et al., 2019]

    for a definition of this metric. Computation of Losses.To compute distributional losses efficiently, we employ the Geometric Loss Functions package [Feydy et al., 2019]. This package provides auto-differentiable versions of these distances enabling efficient and scalable training of our static approaches. Specifically for the p-Wasserstein loss, we approx...

  34. [34]

    Time is encoded and passed into the model with a sinusoidal timestep embedding. Together, this is then fed into a stack of AdaLNBlocks [Perez et al., 2018] blocks, each block uses adaptive layer normalization modulated by time embedding (and any additional covariates/conditions that are used, e.g. treatments in the biological expeirmets), and provides gat...

  35. [35]

    20 Table 4: Hyperparameters for the M2M-TFM model on multi-measure Object experiments. Hyperparameter Value Input dimension 2 Hidden dimension 64 Number of layers 5 Number of attention heads 8 Learning rate5×10 −5 Dropout 0.05 Time embedding dimension 128 Time-varying dynamics True Time steps at inference 100 Measure Batch Size 16 Particle Batch Size 512 ...

  36. [36]

    The hyperparameters for Meta-FM are shown in table Table

    We use the same Meta-FM hyperparameters as [Atanackovic et al., 2025], but instead of a graph convolution network we use a transformer population embedding which uses 3 layers, with a width of 64, and 8 heads. The hyperparameters for Meta-FM are shown in table Table

  37. [37]

    Table 10: Hyperparameters for the CFM model in the multi-measure object experiments. Hyperparameter Value Input dimension 2 Hidden dimension 512 Number of layers 4 Skip connections False OT during Training True Time steps at inference 100 Learning rate1×10 −4 Measure Batch Size 16 Particle Batch Size 512 Epochs (passes over training data) 1000 Table 11: H...

  38. [38]

    In Table 24 we provide results for our ablation over the number of inference steps, i.e number of function evaluations used in inference forM2M-TFMon the50dimensional systems. 24 Source Letters Numbers Shapes Multi-object CFM Meta−FM WFM M2M−ED M2M−MMD M2M−W1 M2M−W2 M2M− OTMSE M2M−TFM Target (a)Diffusion process Source Letters Numbers Shapes Multi-object ...