Unified Latents (UL): How to train your latents

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

years: 2026 (8)

polarities: background (3)

representative citing papers

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding, with reportedly better fidelity than cascaded baselines.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

Do multimodal models imagine electric sheep?

cs.CV · 2026-05-10 · conditional · novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles gives rise to internal visual representations that improve performance when integrated into reasoning.

Understanding Latent Diffusability via Fisher Geometry

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Latent diffusability is quantified by decomposing the MMSE rate along diffusion trajectories into Fisher Information and Fisher Information Rate, with three geometric penalties (dimensional compression, tangential distortion, curvature injection) identified as sources of failure.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step

SAME: A Semantically-Aligned Music Autoencoder

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

SAME is a semantically regularized transformer autoencoder for music that delivers 4096x compression and is released with open weights in large and small variants.

citing papers explorer

  • PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion cs.CV · 2026-05-22 · unverdicted · none · ref 12

  • Improved Baselines with Representation Autoencoders cs.CV · 2026-05-18 · conditional · none · ref 22

  • Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine cs.LG · 2026-05-16 · unverdicted · none · ref 25

    SiLD is a score-matching framework that learns both manifold projection and intrinsic density from a single objective, with proven sample complexity depending only on intrinsic dimension.

  • Do multimodal models imagine electric sheep? cs.CV · 2026-05-10 · conditional · none · ref 47

  • What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 31

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  • Understanding Latent Diffusability via Fisher Geometry cs.LG · 2026-04-03 · unverdicted · none · ref 10

  • Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 75

  • SAME: A Semantically-Aligned Music Autoencoder cs.SD · 2026-05-18 · unverdicted · none · ref 27
