pith. machine review for the scientific record.


Scaling text-to-image diffusion transformers with representation autoencoders

10 Pith papers cite this work. Polarity classification is still indexing.



fields: cs.CV (9) · cs.LG (1)

years: 2026 (10)

verdicts: unverdicted (10)

representative citing papers

Asymmetric Flow Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finetuning from latent models.
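The low-rank restriction this summary describes can be sketched as an orthogonal projection of the predicted noise onto an r-dimensional subspace. This is an illustrative sketch, not the paper's implementation; the dimensions, basis construction, and function names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 8                      # ambient dim and subspace rank (illustrative)
A = rng.standard_normal((d, r))
U, _ = np.linalg.qr(A)            # orthonormal basis of a rank-r subspace

def restrict(eps_hat: np.ndarray) -> np.ndarray:
    """Project a predicted noise vector onto the low-rank subspace."""
    return U @ (U.T @ eps_hat)

eps_hat = rng.standard_normal(d)
eps_lr = restrict(eps_hat)

# An orthogonal projection is idempotent and never increases the norm.
assert np.allclose(restrict(eps_lr), eps_lr)
assert np.linalg.norm(eps_lr) <= np.linalg.norm(eps_hat) + 1e-9
```

In practice the subspace would be learned or data-derived rather than random; the point is only that the model's noise prediction lives in r dimensions instead of d.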

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.

Taming Outlier Tokens in Diffusion Transformers

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

Continuous Adversarial Flow Models

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
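The substitution described above — an adversarial objective in place of the flow-matching MSE — can be sketched with a toy velocity target and a placeholder discriminator. Everything here (the logistic discriminator `D`, the stand-in prediction, the dimensions) is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

x0 = rng.standard_normal(d)       # noise sample
x1 = rng.standard_normal(d)       # data sample
t = 0.3
x_t = (1 - t) * x0 + t * x1       # linear interpolation along the flow path
v_target = x1 - x0                # flow-matching velocity target

v_pred = v_target + 0.1 * rng.standard_normal(d)  # stand-in model output

# Standard flow matching: pointwise MSE against the target velocity.
mse_loss = np.mean((v_pred - v_target) ** 2)

# Adversarial variant (sketch): a discriminator D scores (x_t, v) pairs,
# and the generator minimizes -log D(x_t, v_pred) instead of the MSE.
# D here is an untrained logistic score, purely for illustration.
w = rng.standard_normal(2 * d) * 0.01
def D(x, v):
    return 1.0 / (1.0 + np.exp(-w @ np.concatenate([x, v])))

adv_loss = -np.log(D(x_t, v_pred) + 1e-12)

assert mse_loss >= 0 and adv_loss > 0
```

The design point is that the discriminator judges whole velocity fields rather than penalizing per-coordinate squared error, which is what lets the adversarial objective sidestep the averaging behavior of MSE.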

Video Generation with Predictive Latents

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and a 34.42-point FVD improvement over the Wan2.2 VAE on UCF101.

citing papers explorer

Showing 10 of 10 citing papers.

  • Asymmetric Flow Models cs.CV · 2026-05-13 · unverdicted · none · ref 64

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finetuning from latent models.

  • Coevolving Representations in Joint Image-Feature Diffusion cs.CV · 2026-04-19 · unverdicted · none · ref 42

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

  • What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 80

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  • ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters cs.CV · 2026-05-06 · unverdicted · none · ref 13

    ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.

  • Taming Outlier Tokens in Diffusion Transformers cs.CV · 2026-05-06 · unverdicted · none · ref 31

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  • Latent Denoising Improves Visual Alignment in Large Multimodal Models cs.CV · 2026-04-23 · unverdicted · none · ref 81

    A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

  • Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 69

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

  • Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 60

    Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

  • Steering Visual Generation in Unified Multimodal Models with Understanding Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 49

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  • Video Generation with Predictive Latents cs.CV · 2026-05-04 · unverdicted · none · ref 44

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and a 34.42-point FVD improvement over the Wan2.2 VAE on UCF101.