PixelGen: Improving Pixel Diffusion with Perceptual Supervision
Pith reviewed 2026-05-16 07:51 UTC · model grok-4.3
The pith
PixelGen adds noise-gated perceptual losses to x-prediction so that direct pixel diffusion matches or beats latent diffusion quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PixelGen augments x-prediction in pixel diffusion with two perceptual losses—an LPIPS term for local textures and a P-DINO term for global semantics—applied only at lower-noise timesteps via noise-gating. On ImageNet-256 without classifier-free guidance this yields an FID of 5.11 after 80 training epochs, surpassing latent diffusion baselines, while the same method scales to text-to-image synthesis reaching a GenEval score of 0.79 after six days on eight H800 GPUs, all within a single-stage pixel pipeline.
What carries the argument
Noise-gated perceptual supervision (LPIPS plus P-DINO) added to the x-prediction objective, active only below a noise threshold to preserve coverage.
If this is right
- Pixel diffusion can reach competitive FID scores without any VAE encoder-decoder stage.
- Perceptual terms improve high-frequency detail without requiring classifier-free guidance at inference.
- The same gated supervision extends directly to conditional text-to-image training with modest compute.
- Training schedules can remain short (under 100 epochs) while still outperforming latent baselines.
- The overall pipeline stays end-to-end differentiable in pixel space.
Where Pith is reading between the lines
- If the noise threshold can be learned rather than fixed, the method might adapt automatically across datasets or resolutions.
- The same LPIPS-plus-P-DINO combination could be tested on other pixel-space generators such as autoregressive or flow models.
- Removing the VAE stage entirely may reduce compounding reconstruction errors that latent methods accumulate.
- The approach suggests that explicit perceptual alignment at late denoising steps is more efficient than uniform pixel losses.
Load-bearing premise
That applying perceptual losses only at lower-noise timesteps preserves sample diversity and does not create new artifacts or mode collapse.
What would settle it
A side-by-side comparison of PixelGen trained with and without noise-gating, measuring both FID and a diversity metric such as precision-recall on the same dataset and schedule; if the no-gating version shows equal or better FID with lower diversity, the gating premise fails.
Figures
read the original abstract
Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Codes are available at https://github.com/Zehong-Ma/PixelGen.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. PixelGen augments standard x-prediction pixel diffusion with two perceptual losses (LPIPS for local textures and P-DINO for global semantics) that are applied only at low-noise timesteps via a noise-gating mechanism. The method reports an FID of 5.11 on ImageNet-256 without classifier-free guidance after 80 epochs and a GenEval score of 0.79 for text-to-image generation after 6 days on 8xH800 GPUs, claiming to narrow the gap to latent diffusion while retaining a simple one-stage pipeline. Public code is provided.
Significance. If the empirical gains hold under fair baselines, the result shows that targeted perceptual supervision can improve sample quality in direct pixel-space diffusion without VAEs or two-stage training, while the gating strategy and efficient scaling to text-to-image are practically relevant. The public implementation strengthens reproducibility.
major comments (2)
- [§3.3] §3.3 (noise-gating): the claim that gating perceptual losses to low-noise timesteps preserves coverage and avoids mode collapse is central to the diversity argument, yet the manuscript provides no quantitative diversity metrics (e.g., precision-recall or coverage) comparing gated vs. ungated variants; the FID improvement alone does not rule out reduced sample variety.
- [Table 2] Table 2 (ImageNet-256 results): the comparison to latent diffusion baselines should explicitly state whether those baselines were re-trained for exactly 80 epochs with identical optimizer and data augmentation; otherwise the 5.11 FID advantage cannot be attributed solely to the perceptual losses.
minor comments (2)
- [§4.1] The LPIPS and P-DINO loss weights are listed as free parameters; a brief sensitivity table or default values should be added to §4.1 for reproducibility.
- [Figure 3] Figure 3 (qualitative samples): the caption should note the exact noise threshold used for gating so readers can replicate the visual comparison.
Simulated Author's Rebuttal
We thank the referee for the encouraging review and the recommendation for minor revision. We appreciate the constructive feedback on the noise-gating mechanism and baseline comparisons. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§3.3] §3.3 (noise-gating): the claim that gating perceptual losses to low-noise timesteps preserves coverage and avoids mode collapse is central to the diversity argument, yet the manuscript provides no quantitative diversity metrics (e.g., precision-recall or coverage) comparing gated vs. ungated variants; the FID improvement alone does not rule out reduced sample variety.
Authors: We agree that quantitative diversity metrics would provide stronger support for the claim that noise-gating preserves coverage. In the revised manuscript, we will add precision-recall and coverage metrics comparing the gated and ungated variants on ImageNet-256 to quantitatively demonstrate maintained sample diversity. revision: yes
-
Referee: [Table 2] Table 2 (ImageNet-256 results): the comparison to latent diffusion baselines should explicitly state whether those baselines were re-trained for exactly 80 epochs with identical optimizer and data augmentation; otherwise the 5.11 FID advantage cannot be attributed solely to the perceptual losses.
Authors: We will explicitly clarify in the revised manuscript that the latent diffusion baselines in Table 2 are taken from their original publications and were not re-trained under identical conditions (80 epochs, same optimizer, and data augmentations). This limits direct attribution of the FID gap solely to perceptual losses, and we will discuss the efficiency advantages of our approach under constrained training budgets. Re-training all baselines identically is not feasible within our current compute resources. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical framework augmenting x-prediction diffusion with gated perceptual losses (LPIPS for textures, P-DINO for semantics). All load-bearing claims are benchmark results (FID 5.11 on ImageNet-256, GenEval 0.79 for text-to-image) measured against external datasets and prior models. No equations, predictions, or uniqueness arguments reduce by construction to fitted inputs or self-citations; the noise-gating choice is an explicit, testable engineering decision whose coverage effects are directly verifiable via the released code.
Axiom & Free-Parameter Ledger
free parameters (2)
- LPIPS and P-DINO loss weights
- noise threshold for gating
axioms (1)
- domain assumption LPIPS and DINO features provide a reliable proxy for human perceptual quality
Forward citations
Cited by 7 Pith papers
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
-
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
-
Spectral Progressive Diffusion for Efficient Image and Video Generation
Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...
-
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
Spectral Progressive Diffusion for Efficient Image and Video Generation
Spectral Progressive Diffusion progressively grows resolution during denoising of pretrained diffusion models via spectral noise expansion and a power-spectrum-derived schedule, enabling training-free speedups and a f...
-
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
Reference graph
Works this paper leans on
-
[1]
Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y ., and Han, S. Deep compression autoen- coder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733,
-
[2]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Chen, J., Xu, Z., Pan, X., Hu, Y ., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025a. Chen, S., Ge, C., Zhang, S., Sun, P., and Luo, P. Pixelflow: Pixel-space generative models with flow.arXiv prepri...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Esser, P., Kulal, S., Blattmann, A., Entezari, R., M¨uller, J., Saini, H., Levi, Y ., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Fluid: Scaling autoregressive text-to-image generative models with continuous tokens
Fan, L., Li, T., Qin, S., Li, Y ., Sun, C., Rubinstein, M., Sun, D., He, K., and Tian, Y . Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863,
-
[5]
Gao, S., Zhou, P., Cheng, M.-M., and Yan, S. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 23164–23173, 2023a. Gao, S., Zhou, P., Cheng, M.-M., and Yan, S. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Co...
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
Gong, L., Hou, X., Li, F., Li, L., Lian, X., Liu, F., Liu, L., Liu, W., Lu, W., Shi, Y ., et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703,
work page internal anchor Pith review arXiv
-
[7]
Classifier-Free Diffusion Guidance
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Applying guidance in a limited interval improves sample and distribution quality in diffusion models
Kynk¨a¨anniemi, T., Aittala, M., Karras, T., Laine, S., Aila, T., and Lehtinen, J. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724,
-
[9]
Lei, J., Liu, K., Berner, J., Yu, H., Zheng, H., Wu, J., and Chu, X. There is no vae: End-to-end pixel-space gen- erative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586,
-
[10]
Leng, X., Singh, J., Hou, Y ., Xing, Z., Xie, S., and Zheng, L. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483,
-
[11]
Back to Basics: Let Denoising Generative Models Denoise
URL https://arxiv.org/ abs/2511.13720. Li, T., Sun, Q., Fan, L., and He, K. Fractal generative models.arXiv preprint arXiv:2502.17437,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., and Huang, W. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472,
work page internal anchor Pith review arXiv
-
[13]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden- Eijnden, E., and Xie, S. Sit: Exploring flow and diffusion- based generative models with scalable interpolant trans- formers.arXiv preprint arXiv:2401.08740,
-
[14]
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Ma, Z., Wei, L., Wang, S., Zhang, S., and Tian, Q. Deco: Frequency-decoupled pixel diffusion for end-to-end im- age generation.arXiv preprint arXiv:2511.19365,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, M., Darcet, T., Moutakanni, T., V o, H., Szafraniec, M., Khalidov, V ., Fernandez, P., Haziza, D., Massa, F., El- Nouby, A., et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Stylegan-xl: Scaling stylegan to large diverse datasets
Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. InACM SIGGRAPH 2022 conference proceedings, pp. 1–10,
work page 2022
-
[18]
Denoising Diffusion Implicit Models
URL https://arxiv.org/abs/2010.02502. Song, T., Feng, W., Wang, S., Li, X., Ge, T., Zheng, B., and Wang, L. Dmm: Building a versatile image genera- tion model via distillation-based model merging.arXiv preprint arXiv:2504.12364,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[19]
Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis
Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., and Tang, J. Relay diffusion: Unifying diffusion process across resolutions for image synthesis.arXiv preprint arXiv:2309.03350,
-
[20]
arXiv preprint arXiv:2405.14224 , year=
Teng, Y ., Wu, Y ., Shi, H., Ning, X., Dai, G., Wang, Y ., Li, Z., and Liu, X. Dim: Diffusion mamba for effi- cient high-resolution image synthesis.arXiv preprint arXiv:2405.14224,
-
[21]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971, 2023a. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S...
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025
Wang, S., Gao, Z., Zhu, C., Huang, W., and Wang, L. Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025a. Wang, S., Tian, Z., Huang, W., and Wang, L. Decoupled diffusion transformer.arXiv preprint arXiv:2504.05741, 2025b. Wang, Z., Bai, L., Yue, X., Ouyang, W., and Zhang, Y . Native-resolution image synthesis.arXiv preprint arXiv:2...
-
[23]
Yao, J. and Wang, X. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models.arXiv preprint arXiv:2501.01423,
-
[24]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., and Xie, S. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
PixelDiT: Pixel Diffusion Transformers for Image Generation
Yu, Y ., Xiong, W., Nie, W., Sheng, Y ., Liu, S., and Luo, J. Pixeldit: Pixel diffusion transformers for image genera- tion.arXiv preprint arXiv:2511.20645,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Diffu- sion models need visual priors for image generation.arXiv preprint arXiv:2410.08531, 2024
Yue, X., Wang, Z., Lu, Z., Sun, S., Wei, M., Ouyang, W., Bai, L., and Zhou, L. Diffusion models need visual priors for image generation.arXiv preprint arXiv:2410.08531,
-
[27]
Normalizing flows are capable generative models,
Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., and Susskind, J. Normalizing flows are capable generative models.arXiv preprint arXiv:2412.06329,
-
[28]
Diffusion Transformers with Representation Autoencoders
Zheng, B., Ma, N., Tong, S., and Xie, S. Diffusion trans- formers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025a. Zheng, G., Zhao, Q., Yang, T., Xiao, F., Lin, Z., Wu, J., Deng, J., Zhang, Y ., and Zhu, R. Farmer: Flow autoregressive transformer over pixels.arXiv preprint arXiv:2510.23588, 2025b. Zhou, M., Zheng, H., Wang, Z., Yin...
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
11 PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss A. More Implementary Details A.1. Baseline Comparisons In this subsection, we summarize the settings used for all baseline comparisons. In the baseline comparisons, all diffusion models are trained on ImageNet at 256×256 resolution for 200k iterations using a large DiT variant. Follo...
work page 2023
-
[30]
We also report results for the two-stage JiT-L/2 that requires a V AE. For a comprehensive comparison, we integrate DDT (Wang et al., 2025b) into the pixel diffusion to form PixDDT. A.2. Class-to-Image Generation This subsection describes the more implementation details for class-to-image generation. The batch size and learning rate follow the default set...
work page 2025
-
[31]
For evaluation, we use a Heun sampler with 50 inference steps following JiT (Li & He, 2025)
is set to (0.1, 0.9). For evaluation, we use a Heun sampler with 50 inference steps following JiT (Li & He, 2025). The timeshift is set to 2.0 to match the time sampler. A.3. Text-to-Image Generation We adopt Qwen3-1.7B (Yang et al.,
work page 2025
-
[32]
as the text encoder. To improve the alignment of frozen text features (Fan et al., 2024), we jointly train several transformer layers on the frozen text features similar to Fluid (Fan et al., 2024). The total batch size is 1536 for 256×256 resolution pretraining and 512 for 512×512 resolution pretraining. Following PixNerd (Wang et al., 2025a), we pretrai...
work page 2024
-
[33]
as future works. A.4. Experiment Configurations Table 7 summarizes the experiment configurations for PixelGen-L/16, PixelGen-XL/16, and PixelGen-XXL/16. In practice, we follow the training setups from previous works such as DiT (Peebles & Xie, 2023), SiT (Ma et al., 2024), and PixNerd (Wang et al., 2025a). B. Text-to-Image Prompts Below, we list the promp...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.