pith. sign in

arxiv: 2602.02493 · v2 · submitted 2026-02-02 · 💻 cs.CV · cs.AI

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

Pith reviewed 2026-05-16 07:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pixel diffusionperceptual lossx-predictionLPIPSDINOimage generationFIDtext-to-image
0
0 comments X

The pith

PixelGen adds noise-gated perceptual losses to x-prediction so that direct pixel diffusion matches or beats latent diffusion quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard pixel-wise losses in diffusion models waste capacity on low-level signals and produce blurry outputs. PixelGen adds LPIPS supervision for local texture and P-DINO supervision for global structure on top of x-prediction, but only at lower noise levels through a gating mechanism that keeps high-noise steps focused on basic structure. This one-stage pixel-space approach reaches an FID of 5.11 on ImageNet-256 after 80 epochs without classifier-free guidance, outperforming typical latent diffusion baselines, and extends to text-to-image generation with a GenEval score of 0.79 after brief training. The claim is that perceptual signals can be injected without sacrificing coverage or introducing new artifacts, closing the gap between simple pixel pipelines and two-stage latent ones.

Core claim

PixelGen augments x-prediction in pixel diffusion with two perceptual losses—an LPIPS term for local textures and a P-DINO term for global semantics—applied only at lower-noise timesteps via noise-gating. On ImageNet-256 without classifier-free guidance this yields an FID of 5.11 after 80 training epochs, surpassing latent diffusion baselines, while the same method scales to text-to-image synthesis reaching a GenEval score of 0.79 after six days on eight H800 GPUs, all within a single-stage pixel pipeline.

What carries the argument

Noise-gated perceptual supervision (LPIPS plus P-DINO) added to the x-prediction objective, active only below a noise threshold to preserve coverage.

If this is right

  • Pixel diffusion can reach competitive FID scores without any VAE encoder-decoder stage.
  • Perceptual terms improve high-frequency detail without requiring classifier-free guidance at inference.
  • The same gated supervision extends directly to conditional text-to-image training with modest compute.
  • Training schedules can remain short (under 100 epochs) while still outperforming latent baselines.
  • The overall pipeline stays end-to-end differentiable in pixel space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the noise threshold can be learned rather than fixed, the method might adapt automatically across datasets or resolutions.
  • The same LPIPS-plus-P-DINO combination could be tested on other pixel-space generators such as autoregressive or flow models.
  • Removing the VAE stage entirely may reduce compounding reconstruction errors that latent methods accumulate.
  • The approach suggests that explicit perceptual alignment at late denoising steps is more efficient than uniform pixel losses.

Load-bearing premise

That applying perceptual losses only at lower-noise timesteps preserves sample diversity and does not create new artifacts or mode collapse.

What would settle it

A side-by-side comparison of PixelGen trained with and without noise-gating, measuring both FID and a diversity metric such as precision-recall on the same dataset and schedule; if the no-gating version shows equal or better FID with lower diversity, the gating premise fails.

Figures

Figures reproduced from arXiv: 2602.02493 by Ruihan Xu, Shiliang Zhang, Zehong Ma.

Figure 1
Figure 1. Figure 1: This work shows that pixel diffusion with perceptual loss outperforms latent diffusion. (a) A traditional two-stage latent diffusion denoises in the latent space, which is influenced by the artifacts of the VAE. (b) PixelGen introduces perceptual loss to encourage the diffusion model to focus on the perceptual manifold, enabling the pixel diffusion to learn a meaningful manifold rather than the complex ful… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of different manifolds within the pixel space. The image manifold is a large manifold containing both perceptu￾ally significant information and imperceptible signals. The percep￾tual manifold contains perceptually important signals, providing a better target for pixel space diffusion. P-DINO and LPIPS are the two complementary perceptual supervision utilized in PixelGen. learn a high-dimension… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of PixelGen. The diffusion model directly predicts the image x instead of velocity v or noise ϵ to simplify the prediction target. A flow-matching diffusion loss is retained to keep the advantages of flow matching via velocity conversion. Two complementary perceptual losses are introduced to encourage the diffusion model to focus on the perceptual manifold. 3.3. P-DINO Loss While LPIPS provides st… view at source ↗
Figure 4
Figure 4. Figure 4: Effectiveness of perceptual supervision in PixelGen. LPIPS and P-DINO losses are progressively added to a baseline pixel diffusion model. The LPIPS loss improves local texture fidelity, while P-DINO further enhances global semantics. are better preserved. This indicates that LPIPS effectively emphasizes perceptually important local patterns. When the P-DINO loss is further introduced, the generated im￾ages… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of text-to-image generation of PixelGen. All images are 512×512 resolution [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of class-to-image generation of PixelGen with CFG. All images are 256×256 resolution [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7 [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: More qualitative results of class-to-image generation at a 256×256 resolution with CFG. The CFG scale is set to 3.0. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
read the original abstract

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Codes are available at https://github.com/Zehong-Ma/PixelGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. PixelGen augments standard x-prediction pixel diffusion with two perceptual losses (LPIPS for local textures and P-DINO for global semantics) that are applied only at low-noise timesteps via a noise-gating mechanism. The method reports an FID of 5.11 on ImageNet-256 without classifier-free guidance after 80 epochs and a GenEval score of 0.79 for text-to-image generation after 6 days on 8xH800 GPUs, claiming to narrow the gap to latent diffusion while retaining a simple one-stage pipeline. Public code is provided.

Significance. If the empirical gains hold under fair baselines, the result shows that targeted perceptual supervision can improve sample quality in direct pixel-space diffusion without VAEs or two-stage training, while the gating strategy and efficient scaling to text-to-image are practically relevant. The public implementation strengthens reproducibility.

major comments (2)
  1. [§3.3] §3.3 (noise-gating): the claim that gating perceptual losses to low-noise timesteps preserves coverage and avoids mode collapse is central to the diversity argument, yet the manuscript provides no quantitative diversity metrics (e.g., precision-recall or coverage) comparing gated vs. ungated variants; the FID improvement alone does not rule out reduced sample variety.
  2. [Table 2] Table 2 (ImageNet-256 results): the comparison to latent diffusion baselines should explicitly state whether those baselines were re-trained for exactly 80 epochs with identical optimizer and data augmentation; otherwise the 5.11 FID advantage cannot be attributed solely to the perceptual losses.
minor comments (2)
  1. [§4.1] The LPIPS and P-DINO loss weights are listed as free parameters; a brief sensitivity table or default values should be added to §4.1 for reproducibility.
  2. [Figure 3] Figure 3 (qualitative samples): the caption should note the exact noise threshold used for gating so readers can replicate the visual comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the encouraging review and the recommendation for minor revision. We appreciate the constructive feedback on the noise-gating mechanism and baseline comparisons. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (noise-gating): the claim that gating perceptual losses to low-noise timesteps preserves coverage and avoids mode collapse is central to the diversity argument, yet the manuscript provides no quantitative diversity metrics (e.g., precision-recall or coverage) comparing gated vs. ungated variants; the FID improvement alone does not rule out reduced sample variety.

    Authors: We agree that quantitative diversity metrics would provide stronger support for the claim that noise-gating preserves coverage. In the revised manuscript, we will add precision-recall and coverage metrics comparing the gated and ungated variants on ImageNet-256 to quantitatively demonstrate maintained sample diversity. revision: yes

  2. Referee: [Table 2] Table 2 (ImageNet-256 results): the comparison to latent diffusion baselines should explicitly state whether those baselines were re-trained for exactly 80 epochs with identical optimizer and data augmentation; otherwise the 5.11 FID advantage cannot be attributed solely to the perceptual losses.

    Authors: We will explicitly clarify in the revised manuscript that the latent diffusion baselines in Table 2 are taken from their original publications and were not re-trained under identical conditions (80 epochs, same optimizer, and data augmentations). This limits direct attribution of the FID gap solely to perceptual losses, and we will discuss the efficiency advantages of our approach under constrained training budgets. Re-training all baselines identically is not feasible within our current compute resources. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical framework augmenting x-prediction diffusion with gated perceptual losses (LPIPS for textures, P-DINO for semantics). All load-bearing claims are benchmark results (FID 5.11 on ImageNet-256, GenEval 0.79 for text-to-image) measured against external datasets and prior models. No equations, predictions, or uniqueness arguments reduce by construction to fitted inputs or self-citations; the noise-gating choice is an explicit, testable engineering decision whose coverage effects are directly verifiable via the released code.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach relies on standard diffusion training plus two perceptual losses whose weights and the noise threshold for gating are hyperparameters chosen to balance quality and coverage.

free parameters (2)
  • LPIPS and P-DINO loss weights
    Scaling factors for the perceptual terms that must be tuned to avoid overpowering the diffusion loss.
  • noise threshold for gating
    Timestep cutoff below which perceptual losses are applied; chosen to preserve diversity at high noise.
axioms (1)
  • domain assumption LPIPS and DINO features provide a reliable proxy for human perceptual quality
    Invoked when claiming the added losses improve visual quality without new artifacts.

pith-pipeline@v0.9.0 · 5533 in / 1261 out tokens · 29626 ms · 2026-05-16T07:51:11.210603+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  2. Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

  3. Spectral Progressive Diffusion for Efficient Image and Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...

  4. HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

  5. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  6. Spectral Progressive Diffusion for Efficient Image and Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    Spectral Progressive Diffusion progressively grows resolution during denoising of pretrained diffusion models via spectral noise expansion and a power-spectrum-derived schedule, enabling training-free speedups and a f...

  7. FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 6 Pith papers · 14 internal anchors

  1. [1]

    Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

    Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y ., and Han, S. Deep compression autoen- coder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733,

  2. [2]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Chen, J., Xu, Z., Pan, X., Hu, Y ., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025a. Chen, S., Ge, C., Zhang, S., Sun, P., and Luo, P. Pixelflow: Pixel-space generative models with flow.arXiv prepri...

  3. [3]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., M¨uller, J., Saini, H., Levi, Y ., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206,

  4. [4]

    Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

    Fan, L., Li, T., Qin, S., Li, Y ., Sun, C., Rubinstein, M., Sun, D., He, K., and Tian, Y . Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863,

  5. [5]

    Seedream 3.0 Technical Report

    Gao, S., Zhou, P., Cheng, M.-M., and Yan, S. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 23164–23173, 2023a. Gao, S., Zhou, P., Cheng, M.-M., and Yan, S. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Co...

  6. [6]

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Gong, L., Hou, X., Li, F., Li, L., Lian, X., Liu, F., Liu, L., Liu, W., Lu, W., Shi, Y ., et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703,

  7. [7]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  8. [8]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Kynk¨a¨anniemi, T., Aittala, M., Karras, T., Laine, S., Aila, T., and Lehtinen, J. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724,

  9. [9]

    Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586, 2025

    Lei, J., Liu, K., Berner, J., Yu, H., Zheng, H., Wu, J., and Chu, X. There is no vae: End-to-end pixel-space gen- erative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586,

  10. [10]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

    Leng, X., Singh, J., Hou, Y ., Xing, Z., Xie, S., and Zheng, L. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483,

  11. [11]

    Back to Basics: Let Denoising Generative Models Denoise

    URL https://arxiv.org/ abs/2511.13720. Li, T., Sun, Q., Fan, L., and He, K. Fractal generative models.arXiv preprint arXiv:2502.17437,

  12. [12]

    Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., and Huang, W. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472,

  13. [13]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden- Eijnden, E., and Xie, S. Sit: Exploring flow and diffusion- based generative models with scalable interpolant trans- formers.arXiv preprint arXiv:2401.08740,

  14. [14]

    DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    Ma, Z., Wei, L., Wang, S., Zhang, S., and Tian, Q. Deco: Frequency-decoupled pixel diffusion for end-to-end im- age generation.arXiv preprint arXiv:2511.19365,

  15. [15]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., V o, H., Szafraniec, M., Khalidov, V ., Fernandez, P., Haziza, D., Massa, F., El- Nouby, A., et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

  16. [16]

    Stylegan-xl: Scaling stylegan to large diverse datasets

    Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. InACM SIGGRAPH 2022 conference proceedings, pp. 1–10,

  17. [18]

    Denoising Diffusion Implicit Models

    URL https://arxiv.org/abs/2010.02502. Song, T., Feng, W., Wang, S., Li, X., Ge, T., Zheng, B., and Wang, L. Dmm: Building a versatile image genera- tion model via distillation-based model merging.arXiv preprint arXiv:2504.12364,

  18. [19]

    Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

    Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., and Tang, J. Relay diffusion: Unifying diffusion process across resolutions for image synthesis.arXiv preprint arXiv:2309.03350,

  19. [20]

    arXiv preprint arXiv:2405.14224 , year=

    Teng, Y ., Wu, Y ., Shi, H., Ning, X., Dai, G., Wang, Y ., Li, Z., and Liu, X. Dim: Diffusion mamba for effi- cient high-resolution image synthesis.arXiv preprint arXiv:2405.14224,

  20. [21]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971, 2023a. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S...

  21. [22]

    Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025

    Wang, S., Gao, Z., Zhu, C., Huang, W., and Wang, L. Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025a. Wang, S., Tian, Z., Huang, W., and Wang, L. Decoupled diffusion transformer.arXiv preprint arXiv:2504.05741, 2025b. Wang, Z., Bai, L., Yue, X., Ouyang, W., and Zhang, Y . Native-resolution image synthesis.arXiv preprint arXiv:2...

  22. [23]

    Reconstruction vs

    Yao, J. and Wang, X. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models.arXiv preprint arXiv:2501.01423,

  23. [24]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., and Xie, S. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940,

  24. [25]

    PixelDiT: Pixel Diffusion Transformers for Image Generation

    Yu, Y ., Xiong, W., Nie, W., Sheng, Y ., Liu, S., and Luo, J. Pixeldit: Pixel diffusion transformers for image genera- tion.arXiv preprint arXiv:2511.20645,

  25. [26]

    Diffu- sion models need visual priors for image generation.arXiv preprint arXiv:2410.08531, 2024

    Yue, X., Wang, Z., Lu, Z., Sun, S., Wei, M., Ouyang, W., Bai, L., and Zhou, L. Diffusion models need visual priors for image generation.arXiv preprint arXiv:2410.08531,

  26. [27]

    Normalizing flows are capable generative models,

    Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., and Susskind, J. Normalizing flows are capable generative models.arXiv preprint arXiv:2412.06329,

  27. [28]

    Diffusion Transformers with Representation Autoencoders

    Zheng, B., Ma, N., Tong, S., and Xie, S. Diffusion trans- formers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025a. Zheng, G., Zhao, Q., Yang, T., Xiao, F., Lin, Z., Wu, J., Deng, J., Zhang, Y ., and Zhu, R. Farmer: Flow autoregressive transformer over pixels.arXiv preprint arXiv:2510.23588, 2025b. Zhou, M., Zheng, H., Wang, Z., Yin...

  28. [29]

    More Implementary Details A.1

    11 PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss A. More Implementary Details A.1. Baseline Comparisons In this subsection, we summarize the settings used for all baseline comparisons. In the baseline comparisons, all diffusion models are trained on ImageNet at 256×256 resolution for 200k iterations using a large DiT variant. Follo...

  29. [30]

    For a comprehensive comparison, we integrate DDT (Wang et al., 2025b) into the pixel diffusion to form PixDDT

    We also report results for the two-stage JiT-L/2 that requires a V AE. For a comprehensive comparison, we integrate DDT (Wang et al., 2025b) into the pixel diffusion to form PixDDT. A.2. Class-to-Image Generation This subsection describes the more implementation details for class-to-image generation. The batch size and learning rate follow the default set...

  30. [31]

    For evaluation, we use a Heun sampler with 50 inference steps following JiT (Li & He, 2025)

    is set to (0.1, 0.9). For evaluation, we use a Heun sampler with 50 inference steps following JiT (Li & He, 2025). The timeshift is set to 2.0 to match the time sampler. A.3. Text-to-Image Generation We adopt Qwen3-1.7B (Yang et al.,

  31. [32]

    To improve the alignment of frozen text features (Fan et al., 2024), we jointly train several transformer layers on the frozen text features similar to Fluid (Fan et al., 2024)

    as the text encoder. To improve the alignment of frozen text features (Fan et al., 2024), we jointly train several transformer layers on the frozen text features similar to Fluid (Fan et al., 2024). The total batch size is 1536 for 256×256 resolution pretraining and 512 for 512×512 resolution pretraining. Following PixNerd (Wang et al., 2025a), we pretrai...

  32. [33]

    as future works. A.4. Experiment Configurations Table 7 summarizes the experiment configurations for PixelGen-L/16, PixelGen-XL/16, and PixelGen-XXL/16. In practice, we follow the training setups from previous works such as DiT (Peebles & Xie, 2023), SiT (Ma et al., 2024), and PixNerd (Wang et al., 2025a). B. Text-to-Image Prompts Below, we list the promp...