Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning

Hanlin Zhu; Haoran Geng; Jitendra Malik; Pieter Abbeel; Somayeh Sojoudi; Xin Guo; Yixiao Huang; Ziheng Cheng

arXiv: 2605.25210 · v1 · pith:LVCBKI3Rnew · submitted 2026-05-24 · 💻 cs.LG · cs.AI · stat.ML

Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik, Pieter Abbeel, Xin Guo

Pith reviewed 2026-06-30 12:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords multi-objective learning · diffusion models · semi-supervised learning · generalization bounds · specialist models · distillation · diffusion policies

The pith

A two-stage procedure lets diffusion models handle multiple tasks while the number of paired samples needed scales only with specialist complexity, not with generalist capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a semi-supervised framework for training diffusion models on multiple tasks when paired data is limited. It uses a two-stage process where small specialist models are trained on scarce paired data, then used to generate pseudo-samples to train a larger generalist model. Generalization bounds show that the number of required paired samples depends only on the specialists' complexity, not the generalist's. This is extended to diffusion policies accounting for distribution shift in rollouts. Experiments on robotics and image tasks support the approach.

Core claim

In multi-objective learning for diffusion models under semi-supervised settings, a two-stage procedure—fitting specialist models then distilling via pseudo-samples—ensures that the sample complexity for paired data is determined solely by the complexity of the specialist classes rather than the larger generalist class.

What carries the argument

The two-stage training procedure that fits lightweight specialists from paired data and distills them into a generalist using generated pseudo-samples, supported by generalization bounds.
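
The two-stage procedure can be illustrated with a deliberately simplified sketch. It is not the paper's implementation: the diffusion specialists are replaced here by one-parameter linear-Gaussian conditional models, the generalist by a per-task polynomial regressor, and all names (`fit_specialist`, `pseudo_label`, `fit_generalist`, the slopes and sample sizes) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: K tasks, each a linear-Gaussian conditional x | y.
true_slopes = np.array([1.5, -0.7, 0.3])
noise_std = 0.1
K = len(true_slopes)
n_paired, n_unlabeled = 20, 5000  # scarce pairs, abundant condition-only data


def sample_pairs(k, n):
    """Draw n paired samples (x, y) for task k."""
    y = rng.normal(size=n)
    x = true_slopes[k] * y + noise_std * rng.normal(size=n)
    return x, y


def fit_specialist(x, y):
    """Stage 1: a small specialist (one slope plus a residual scale)."""
    slope = float(x @ y / (y @ y))
    resid_std = float(np.std(x - slope * y))
    return slope, resid_std


def pseudo_label(specialist, y_unlabeled):
    """Generate pseudo-samples from the fitted specialist."""
    slope, resid_std = specialist
    return slope * y_unlabeled + resid_std * rng.normal(size=y_unlabeled.shape)


def fit_generalist(xs, ys, task_ids, degree=5):
    """Stage 2: a larger model (per-task polynomial features), least squares."""
    feats = np.zeros((len(xs), K * degree))
    for k in range(K):
        mask = task_ids == k
        for d in range(degree):
            feats[mask, k * degree + d] = ys[mask] ** (d + 1)
    coef, *_ = np.linalg.lstsq(feats, xs, rcond=None)
    return coef.reshape(K, degree)


xs, ys, ids = [], [], []
for k in range(K):
    x_p, y_p = sample_pairs(k, n_paired)
    specialist = fit_specialist(x_p, y_p)
    y_u = rng.normal(size=n_unlabeled)   # condition-only data
    x_u = pseudo_label(specialist, y_u)  # pseudo-paired data
    xs += [x_p, x_u]
    ys += [y_p, y_u]
    ids += [np.full(n_paired + n_unlabeled, k)]

generalist = fit_generalist(np.concatenate(xs), np.concatenate(ys), np.concatenate(ids))
linear_terms = generalist[:, 0]  # coefficient on y for each task
print(np.round(linear_terms, 2), true_slopes)
```

In this toy version the generalist has five times as many parameters per task as a specialist, yet it only ever sees 20 true pairs per task; the rest of its training signal comes from specialist-generated pseudo-samples.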

Load-bearing premise

The pseudo-samples generated by the specialist models must be of sufficient quality and match the target distributions closely enough for the generalist to learn the Pareto trade-offs.

What would settle it

An experiment showing that increasing generalist capacity still requires proportionally more paired samples despite high-quality pseudo-samples from specialists would falsify the bound.

Figures

Figures reproduced from arXiv: 2605.25210 by Hanlin Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, Somayeh Sojoudi, Xin Guo, Yixiao Huang, Ziheng Cheng.

**Figure 1.** Semi-supervised multi-objective learning for conditional diffusion models. Given limited paired data $\{(x_i^k, y_i^k)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_k$ and abundant condition-only data $\{\tilde{y}_i^k\}_{i=1}^{N} \overset{\text{i.i.d.}}{\sim} P_k^Y$ for each objective $k$, we first train lightweight specialists $\hat{h}_k \in \mathcal{H}_k$ and use them to generate pseudo-paired data. A large diffusion generalist in class $\mathcal{F}$ is then trained on labeled and pseudo-labeled data. T…

**Figure 2.** Visualization of the domain-randomization levels used in the robotics manipulation experiments. Level 0 uses the canonical scene, Level 1 adds scene/material randomization, Level 2 further adds lighting randomization, and Level 3 additionally introduces camera-pose randomization. Levels 2–3 are used as held-out OOD evaluations.

weak supervision, have not been established. The paradigm of distilling special…

**Figure 3.** Visualization of the CelebA-HQ inpainting objectives. Each objective corresponds to a different mask family, producing a distinct corrupted image for the same clean target.

head and a ResNet-18 visual encoder. Per-task specialists (≈85M parameters) are paired with a wider generalist (≈293M parameters), realizing the theoretical capacity gap $C_{\mathcal{F}} \gg C_{\mathcal{H}}$. Specialists are trained on ∼900 (StackCube) or 100 (P…

Original abstract

Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in text-to-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffices for solving any individual task, thereby increasing statistical cost since sample complexity typically scales with the model complexity. To reconcile this, we develop a principled MOL framework for diffusion models with limited data: a semi-supervised regime where paired (labeled) samples are scarce, but (unlabeled) condition data are abundant. We propose a two-stage training procedure that first fits lightweight specialist models from limited paired data, and then distills them into a generalist model by generating pseudo-samples. We establish generalization bounds showing that the required number of paired samples only depends on the complexity of the specialist model classes. We further extend the theory to diffusion policies for sequential decision making to account for distribution shift in on-policy rollouts. Extensive experiments on robotic control and image restoration tasks are conducted to verify our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The decoupling claim for paired-sample complexity looks unlikely to hold under standard uniform-convergence arguments for the generalist stage.

The paper proposes fitting small specialist diffusion models on scarce paired data, then using them to generate pseudo-samples that train a larger generalist on abundant unlabeled conditions. The headline theoretical claim is that the number of paired samples needed depends only on specialist complexity, not generalist capacity. Experiments on robotic control and image restoration are said to back this up, and the authors extend the setup to diffusion policies under on-policy distribution shift.

The two-stage procedure itself is a reasonable way to handle multi-objective diffusion when labeled pairs are expensive. The extension to sequential decision making is a concrete addition.

The soft spot is the one flagged in the stress test. Training the generalist on specialist-generated pseudo-samples means any excess-risk bound must control the gap between the induced measure and the true conditional. A uniform-convergence argument over the generalist class produces a term that grows with generalist complexity; keeping that term small enough to preserve the target Pareto gap would require the specialist error (hence paired-sample count) to shrink with generalist capacity. The abstract gives no sign that the proof replaces this step with a complexity-independent argument. Without seeing the actual derivation it is hard to tell whether the claimed decoupling survives.

This is for researchers working on sample-efficient multi-task generative models or diffusion policies. The idea merits a serious referee who can check whether the proof avoids the uniform-convergence term, or whether the experiments only demonstrate the procedure without testing the tightness of the bound.

Referee Report

1 major / 2 minor

Summary. The paper proposes a two-stage semi-supervised multi-objective learning framework for diffusion models: lightweight specialist models are first fit to limited paired (labeled) data for each objective, after which pseudo-samples generated by the specialists are used to train a higher-capacity generalist model. Generalization bounds are claimed to show that the number of required paired samples depends only on the complexity of the specialist classes (not the generalist). The theory is extended to diffusion policies for sequential decision making to handle on-policy distribution shift. Experiments on robotic control and image restoration tasks are used to verify the results.

Significance. If the claimed decoupling of paired-sample complexity from generalist capacity holds, the result would be significant for statistical learning theory in multi-objective settings with scarce labels, as it would allow scaling generalist capacity for better Pareto trade-offs without a corresponding increase in labeled data. The extension to diffusion policies with distribution shift is a constructive addition. The manuscript does not appear to include machine-checked proofs or fully reproducible code artifacts.

major comments (1)

[Abstract / generalization bound section] Abstract (central claim paragraph) and the generalization bound derivation: the claim that 'the required number of paired samples only depends on the complexity of the specialist model classes' is load-bearing. In the two-stage procedure the generalist is trained by empirical risk minimization on samples drawn from the fitted specialists rather than the true conditional. Any excess-risk bound for the generalist must therefore control the total variation (or similar) between the specialist-induced measure and the true measure. Standard uniform-convergence arguments (Rademacher complexity or covering numbers) over the generalist function class produce a deviation term whose control requires the specialist estimation error—and hence the paired-sample size—to shrink at a rate that depends on generalist capacity. The manuscript must exhibit the specific theorem or lemma that replaces this uni

minor comments (2)

Notation for the two-stage procedure (specialist vs. generalist risk functionals) should be introduced with explicit definitions before the bound statements to avoid ambiguity in the semi-supervised regime.
The experimental section should report the exact number of paired samples used relative to the specialist and generalist capacities, together with the measured Pareto gap, to allow direct comparison with the stated sample-complexity scaling.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing aspect of our central claim. We address the concern on the generalization bound below and will clarify the proof structure in revision.

Referee: [Abstract / generalization bound section] Abstract (central claim paragraph) and the generalization bound derivation: the claim that 'the required number of paired samples only depends on the complexity of the specialist model classes' is load-bearing. In the two-stage procedure the generalist is trained by empirical risk minimization on samples drawn from the fitted specialists rather than the true conditional. Any excess-risk bound for the generalist must therefore control the total variation (or similar) between the specialist-induced measure and the true measure. Standard uniform-convergence arguments (Rademacher complexity or covering numbers) over the generalist function class produce a deviation term whose control requires the specialist estimation error—and hence the paired-sample size—to shrink at a rate that depends on generalist capacity. The manuscript must exhibit the specif

Authors: We agree that a direct application of uniform convergence to the generalist on the true measure would couple the rates. Our analysis avoids this by a two-part decomposition: (i) the specialist approximation error is controlled solely by paired samples via standard Rademacher bounds on the specialist classes (Theorem 3.1), and (ii) the generalist is analyzed with respect to the specialist-induced measure, where excess risk is controlled by unlabeled data whose complexity depends on generalist capacity. The total-variation term between specialist and true measures is bounded separately using the score-matching objective of diffusion models (Lemma 4.2), which yields an additive error independent of generalist capacity. The triangle inequality then yields the claimed decoupling. We will add an explicit proof-strategy subsection and pointer to these results in the revision. revision: partial
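
The decomposition described in this response can be written schematically as follows. This is a generic sketch and not the paper's proof. It assumes a loss bounded in absolute value by $M$, writes $P_k$ for the true task-$k$ distribution and $\hat{P}_k$ for the specialist-induced one, sets $L_Q(f) = \mathbb{E}_Q[\ell(f)]$, and takes $f^\star \in \mathcal{F}$ as any comparator; none of this notation is taken from the manuscript. It uses only the standard fact that $|\mathbb{E}_P g - \mathbb{E}_Q g| \le 2M\,\mathrm{TV}(P, Q)$ whenever $|g| \le M$.

```latex
\begin{align*}
L_{P_k}(\hat f) - L_{P_k}(f^\star)
 &= \bigl[L_{P_k}(\hat f) - L_{\hat P_k}(\hat f)\bigr]
  + \bigl[L_{\hat P_k}(\hat f) - L_{\hat P_k}(f^\star)\bigr]
  + \bigl[L_{\hat P_k}(f^\star) - L_{P_k}(f^\star)\bigr] \\
 &\le \underbrace{4M\,\mathrm{TV}(P_k, \hat P_k)}_{\text{specialist error (paired data)}}
  + \underbrace{L_{\hat P_k}(\hat f) - \inf_{f \in \mathcal{F}} L_{\hat P_k}(f)}_{\text{generalist excess risk (pseudo-labeled data)}}
\end{align*}
```

Under this reading, the referee's question becomes whether the first term can be driven down at a rate that involves only the specialist classes, and whether the target accuracy for it is itself free of generalist capacity.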

Circularity Check

0 steps flagged

No circularity: standard two-stage semi-supervised bound with independent content

The abstract and description present a two-stage procedure (fit specialists on paired data, then distill into a generalist via pseudo-samples) followed by generalization bounds whose stated dependence is only on specialist complexity. No equations, self-citations, or fitted-parameter renamings are supplied that would reduce the claimed bound to a tautology or to a self-referential fit. The skeptic's concern is whether the bound is correct (the uniform-convergence term), which is a correctness issue rather than a circularity issue. The derivation is therefore treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full paper required to audit.

pith-pipeline@v0.9.1-grok · 5776 in / 1114 out tokens · 27882 ms · 2026-06-30T12:06:48.414125+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 16 canonical work pages · 7 internal anchors

[1]

Chen, M., Jiang, H., Liao, W., and Zhao, T. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022a.

Chen, M., Huang, K., Zhao, T., and Wang, M. Score approximation, estimation and distribution recovery of di...

[2]

Fu, H., Yang, Z., Wang, M., and Chen, M. Unveil conditional diffusion models with classifier-free guidance: A sharp statistical theory. arXiv preprint arXiv:2403.11968.

[3]

URL https://arxiv.org/abs/2504.18904.

Gong, R., Huang, J., Zhao, Y., Geng, H., Gao, X., Wu, Q., Ai, W., Zhou, Z., Terzopoulos, D., Zhu, S.-C., et al. Arnold: A benchmark for language-grounded task learning with continuous states in realistic 3D scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20483–20495.

[4]

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

[5]

Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991.

[6]

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[7]

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

[8]

Li, Z., Yuan, H., Huang, K., Ni, C., Ye, Y., Chen, M., and Wang, M. Diffusion model for data-driven black-box optimization. arXiv preprint arXiv:2403.13219.

[9]

Lin, L., Bai, Y., and Mei, S. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. arXiv preprint arXiv:2310.08566.

[10]

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.

[11]

NVIDIA. NVIDIA Isaac Sim, 2024a. URL https://developer.nvidia.com/isaac/sim. Accessed: 2026-01-29.

NVIDIA. Vmaterials. https://developer.nvidia.com/vmaterials, 2024b. Accessed: 2026-01-29.

Oko, K., Akiyama, S., and Suzuki, T. Diffusion models are minimax optimal distribution estimators. In International Conference on Machine Learning, pp. 26517–26582. PMLR.

[12]

Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., and Norouzi, M. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022.

[13]

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

[14]

Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. arXiv preprint arXiv:2111.08005.

[15]

Wegel, T., So, G., Park, J., and Yang, F. On the sample complexity of semi-supervised multi-objective learning. arXiv preprint arXiv:2508.17152.

[16]

Xiao, W., Lin, H., Peng, A., Xue, H., He, T., Xie, Y., Hu, F., Wu, J., Luo, Z., Fan, L., et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091.

[17]

Xu, C., Li, Q., Luo, J., and Levine, S. RLDG: Robotic generalist policy distillation via reinforcement learning. arXiv preprint arXiv:2412.09858.

[18]

Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954.

[19]

Specialists use a lightweight U-Net (downsampling channels [256, 512, 1024], ≈85M parameters); the generalist doubles the U-Net width ([512, 1024, 2048], ≈293M parameters). Data and pseudo-rollout protocol. Expert demonstrations are collected by motion-planning rollouts in Isaac Sim. For each training variant, we start from a pool of 1000 expert trajectories. To...

[20]

$+\ \bar{\beta}_t \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, where $x_{\mathrm{cond}}$ is the masked-image condition. The U-Net takes the noisy state $x_t$ as input and predicts both a residual $\hat{r}$ and noise $\hat{\epsilon}$, from which the clean image is reconstructed as $\hat{x}_0 = x_t - \bar{\alpha}_t \hat{r} - \bar{\beta}_t \hat{\epsilon}$. Reverse sampling is initialized at $x_T = x_{\mathrm{cond}} + \epsilon$, so the condition enters through the endpoint of the reverse trajectory rather than by ch...

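
The reconstruction formula in this excerpt can be checked numerically with a small sketch. The forward process is cut off in the excerpt, so the form used below, x_t = x_0 + ᾱ_t r + β̄_t ε with residual r = x_cond − x_0, is an assumption chosen to be consistent with the reconstruction formula and with the stated initialization x_T = x_cond + ε (which it reproduces when ᾱ_T = β̄_T = 1). The function name `reconstruct_x0` and the schedule values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def reconstruct_x0(x_t, r_hat, eps_hat, alpha_bar_t, beta_bar_t):
    """Clean-image estimate: x0_hat = x_t - alpha_bar_t * r_hat - beta_bar_t * eps_hat."""
    return x_t - alpha_bar_t * r_hat - beta_bar_t * eps_hat


# Toy 8x8 "image" and a random binary mask (both hypothetical).
x0 = rng.uniform(size=(8, 8))
mask = rng.uniform(size=(8, 8)) > 0.5
x_cond = x0 * mask               # masked-image condition
r = x_cond - x0                  # ASSUMED residual definition
eps = rng.normal(size=x0.shape)

# ASSUMED forward process at an intermediate step t.
alpha_bar_t, beta_bar_t = 0.6, 0.3
x_t = x0 + alpha_bar_t * r + beta_bar_t * eps

# With exact residual and noise predictions the clean image is recovered.
x0_hat = reconstruct_x0(x_t, r, eps, alpha_bar_t, beta_bar_t)

# At the endpoint (alpha_bar_T = beta_bar_T = 1) the assumed forward process
# gives x_T = x_cond + eps, matching the initialization in the excerpt.
x_T = x0 + 1.0 * r + 1.0 * eps
```

In practice `r` and `eps` would be replaced by the network's predictions, so the quality of `x0_hat` depends on both prediction heads.
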
[21]

Let $j_0 = \lfloor \log_2 n \rfloor$. Then by Bousquet (2002, Lemma 6.1), it holds that with probability no less than $1-\delta$, for any $j \le j_0$ and $\varphi \in \Phi^{(j)}$, $\frac{1}{n}\sum_{i=1}^{n} \varphi(x_i) - \mathbb{E}_P[\varphi] \lesssim R_n(\Phi^{(j)}) + \sqrt{(B\epsilon_j + B \ldots}$

[22]

… $\frac{\log((\log n)/\delta)}{n} + \frac{b \log((\log n)/\delta)}{n} =: F_n(\epsilon_j)$. (B.16) Noticing that $\mathbb{E}_P[\varphi] \le \epsilon_j \le 2\,\mathbb{E}_P[\varphi]$, it reduces to $\frac{1}{n}\sum_{i=1}^{n} \varphi(x_i) - \mathbb{E}_P[\varphi] \lesssim F_n(\mathbb{E}_P[\varphi])$. (B.17) Hence we have, by noting that $F_n$ is also a non-decreasing sub-root function, $\mathbb{E}_P[\varphi] \le \frac{2}{n}\sum_{i=1}^{n} \varphi(x_i) + C'(B \vee b)\, r^* \ldots$

[23]

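Therefore we have $|\tilde{\ell}(x, y, h)| \le M$. Step 2. To bound the second order moment,

$\mathbb{E}_{(x,y)\sim P}\bigl[\mathbb{1}_{\|x\|_\infty \le R}\,(\ell(x, y, h) - \ell(x, y, s^*))^2\bigr]$
$= \mathbb{E}_{(x,y)\sim P}\bigl[\mathbb{1}_{\|x\|_\infty \le R}\,\bigl(\mathbb{E}_{t, x_t \mid x}\bigl[\|h(x_t, y, t) - \nabla \log \phi_t(x_t \mid x)\|^2 - \|s^*(x_t, y, t) - \nabla \log \phi_t(x_t \mid x)\|^2\bigr]\bigr)^2\bigr]$
$\le \mathbb{E}_{(x,y)\sim P}\,\mathbb{1}_{\|x\|_\infty \le R}\,\mathbb{E}_{t, x_t \mid x}\|h(x_t, y, t) - s^*(x_t, y, t)\|^2 \cdot \mathbb{E}_{t, x_t \mid x}\|h(x_t, y, t) + s^*(x_t, y, t) - 2\nabla \log \phi_t(x_t \mid x)\|^2$
$\le 4M\,\mathbb{E}_{(x,y)\sim P}\,\mathbb{1}_{\|x\|_\infty \le R}\,\mathbb{E}_{t, x_t \ldots}$
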
[24]

in Lemma 3.1, $|\bar{\ell}(x, y, f)| \le M_R$ for any $f \in \mathcal{F}$. And since $\tilde{x}_i^k \sim \tilde{P}_{\hat{h}_k}$ is truncated, we have $\|\tilde{x}_i^k\|_\infty \le R$ for all $1 \le i \le N$. According to Wainwright (2019, Thm. 4.10), it holds that with probability at least $1 - \delta/(2K)$, for any $f \in \mathcal{F}$, $\mathbb{E}_{y \sim P_k^Y,\, x \sim P_{\hat{h}_k}(\cdot \mid y)}\, \bar{\ell}(x, y, f) - \frac{1}{N}\sum_{i=1}^{N} \bar{\ell}(\tilde{x}_i^k, \tilde{y}_i^k, f) \le 2R_N(\Psi_k) + M_R\sqrt{\frac{2\log(2K/\delta)}{N}}$. (B.55) Note that for any $f_1$, f...

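
To get a feel for the size of the deviation term in (B.55), the second term on the right-hand side, $M_R\sqrt{2\log(2K/\delta)/N}$, can be evaluated directly. The sketch below only evaluates that expression; the function name `concentration_term` and the numerical values are hypothetical, and the Rademacher term $2R_N(\Psi_k)$ is not modeled.

```python
import math


def concentration_term(M_R, K, delta, N):
    """Evaluate M_R * sqrt(2 * log(2K / delta) / N), the deviation term in (B.55)."""
    return M_R * math.sqrt(2.0 * math.log(2.0 * K / delta) / N)


# Hypothetical values: loss bound 1, three objectives, 95% overall confidence.
small = concentration_term(M_R=1.0, K=3, delta=0.05, N=1_000)
large = concentration_term(M_R=1.0, K=3, delta=0.05, N=4_000)
print(small, large)
```

Quadrupling the number of pseudo-labeled samples N halves this term, and the number of objectives K enters only logarithmically.
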
[25]

Define $F(\vec{u}) := S(\vec{u}_+)$ where $\vec{u}_+ = \max\{\vec{u}, 0\}$. Then $F$ also satisfies reverse-triangle inequality and positive homogeneity. Let $C := \{\vec{v} \in \mathbb{R}^K : \vec{v} \cdot \vec{u} \le F(\vec{u}),\ \forall \vec{u} \in \mathbb{R}^K\}$. (B.65) According to Hahn–Banach separation (see e.g., Simons (2008, Coro. 2.4)), $F(\vec{u}) = \sup_{\vec{v} \in C} \vec{v} \cdot \vec{u},\ \forall \vec{u} \in \mathbb{R}^K$. (B.66) Further notice that $0 \le F(\vec{u}) \le \|\vec{u}\|_\infty$, hence for any $\vec{v} \in C$, $\vec{v} \ge 0$ and $\sum^{K}$ ...

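
The dual representation (B.66) can be sanity-checked on one concrete case. The sketch below assumes, purely as an example, that $S$ is the coordinate-wise maximum, so $F(\vec{u}) = \max_k \max(u_k, 0)$; the excerpt does not say which scalarizations the paper uses. For this choice the set $C$ in (B.65) works out to $\{\vec{v} \ge 0,\ \sum_k v_k \le 1\}$, whose extreme points are the origin and the standard basis vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4


def F(u):
    """F(u) = S(u_+), with S ASSUMED to be the coordinate-wise max."""
    return float(np.max(np.maximum(u, 0.0)))


# Extreme points of C = {v >= 0, sum(v) <= 1} for this choice of S.
vertices = np.vstack([np.zeros(K), np.eye(K)])


def support(u):
    """sup over C of v . u, attained at an extreme point of C."""
    return float(np.max(vertices @ u))


max_gap = 0.0
violations = 0
for _ in range(1000):
    u = rng.normal(size=K)
    max_gap = max(max_gap, abs(F(u) - support(u)))
    # A random interior point of C must satisfy v . u <= F(u).
    v = rng.dirichlet(np.ones(K)) * rng.uniform()
    if v @ u > F(u) + 1e-12:
        violations += 1

print(max_gap, violations)
```

The check covers only this one scalarization; it illustrates the shape of the statement rather than verifying the general result.
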
[26]

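ψ2 $\le 4\|\bar{\ell}(\cdot, \cdot, f_1, S_1) - \bar{\ell}(\cdot, \cdot, f_2, S_2)\|_{L^2(\tilde{P}_k)}$, (B.76) where $\tilde{P}_k := \frac{1}{N}\sum_{i=1}^{N} \delta_{(\tilde{x}_i^k, \tilde{y}_i^k)}$. Define $\mathrm{diam}(\Psi_k(r), \|\cdot\|_{L^2(\tilde{P}_k)}) = D_r$. By Dudley's bound (Van Handel, 2014; Wainwright, 2019), there exists an absolute constant $C_0$ such that for any $\theta > 0$, $R_N(\Psi_k(r)) \le C_0\left(\theta + \int_{\theta}^{D_r} \sqrt{\frac{\log \mathcal{N}(\Psi_k(r), \|\cdot\|_{L^2(\tilde{P}_k)}, \varepsilon)}{N}}\, d\varepsilon\right)$. (B.77) By the same arguments in Step
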
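
The entropy-integral bound (B.77) can be evaluated numerically once a covering-number estimate is available. The sketch below is illustrative only: the metric entropy $\log \mathcal{N}(\varepsilon) = d \log(3 D_r / \varepsilon)$ is a hypothetical stand-in for a $d$-parameter class, the absolute constant $C_0$ is set to 1 as a placeholder, and `dudley_bound` is an invented helper that uses a simple midpoint rule.

```python
import math


def dudley_bound(log_cov, D_r, N, theta, C0=1.0, steps=10_000):
    """C0 * (theta + integral from theta to D_r of sqrt(log_cov(eps) / N) d eps)."""
    if theta >= D_r:
        return C0 * theta
    h = (D_r - theta) / steps
    integral = h * sum(
        math.sqrt(max(log_cov(theta + (i + 0.5) * h), 0.0) / N)
        for i in range(steps)
    )
    return C0 * (theta + integral)


# Hypothetical parametric class with d = 50 effective parameters.
d, D_r, theta = 50, 1.0, 1e-3


def log_cov(eps):
    """ASSUMED metric entropy: log N(eps) = d * log(3 * D_r / eps)."""
    return d * math.log(3.0 * D_r / eps)


bound_1k = dudley_bound(log_cov, D_r, N=1_000, theta=theta)
bound_4k = dudley_bound(log_cov, D_r, N=4_000, theta=theta)
print(bound_1k, bound_4k)
```

For a fixed θ the integral part scales as $1/\sqrt{N}$, and its dependence on the class enters through `log_cov`, which is where the generalist's capacity would appear.
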
[27]

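in Lemma 3.1, for any $(S_1, f_1), (S_2, f_2) \in \mathcal{M}_k(r)$, $\sqrt{\frac{1}{N}\sum_{i=1}^{N} \bigl(\bar{\ell}(\tilde{x}_i^k, \tilde{y}_i^k, f_1, S_1) - \bar{\ell}(\tilde{x}_i^k, \tilde{y}_i^k, f_2, S_2)\bigr)^2} \le 2M_R^{1/2}\bigl[\|f_1 - f_2\|_{L^\infty(\Omega_R)} + \|\tilde{f}_{S_1} - \tilde{f}_{S_2}\|_{L^\infty(\Omega_R)}\bigr] + 4C_1' M_R^{1/2} \exp(-C_2' R^2/2)$. (B.78) Let $\mathcal{F}' := \{\tilde{f}_S : S \in \mathcal{S}_{\mathrm{lin}}\} \subseteq \mathcal{F}$. Hence for any $\varepsilon \ge 8C_1' M_R^{1/2} \exp(-C_2' R^2/2)$, $\log \mathcal{N}(\Psi_k(r), \|\cdot\|_{L^2(\tilde{P}_k)}, \varepsilon) \le \log \mathcal{N}(\mathcal{F}, \|\cdot\|_{L^\infty(\Omega_R)}, \varepsilon/(8M_R^{1/2}))$ + l...
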
