{"total":22,"items":[{"citing_arxiv_id":"2606.27978","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27760","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion","primary_cat":"cs.CV","submitted_at":"2026-06-26T06:39:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09048","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BareWave: Waveform-Native Flow-Matching Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-06-08T05:36:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and 
velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03455","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling","primary_cat":"eess.AS","submitted_at":"2026-06-02T10:33:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31604","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Representation Forcing for Bottleneck-Free Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21981","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RiT: Vanilla Diffusion Transformers Suffice in Representation Space","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:21:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 
features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20147","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18267","ref_index":18,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:03:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SRC-Flow compresses RAE features via a Semantic Representation Compressor into a low-dimensional space, enabling normalizing flows to reach gFID 1.65 on ImageNet 256x256 and 2.07 on 512x512 while retaining exact likelihoods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17759","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel 
Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:25:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15741","ref_index":40,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:51:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HyperDiT reports FID 1.56 on ImageNet 256x256 using hyper-connected cross-scale attention, SA-RoPE, and VFM registers in pixel space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12013","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"L2P: Unlocking Latent Potential for Pixel Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:01:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06421","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image 
Generation","primary_cat":"cs.CV","submitted_at":"2026-05-07T15:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21984","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Soft Anisotropic Diagrams for Differentiable Image Representation","primary_cat":"cs.CV","submitted_at":"2026-04-23T18:07:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAD is a new explicit differentiable image representation based on soft anisotropic additively weighted Voronoi partitions that achieves higher PSNR and 4-19x faster training than Image-GS and Instant-NGP at matched bitrate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20041","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Normalizing Flows with Iterative Denoising","primary_cat":"cs.CV","submitted_at":"2026-04-21T22:52:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17492","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Coevolving Representations in Joint Image-Feature 
Diffusion","primary_cat":"cs.CV","submitted_at":"2026-04-19T15:29:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"increasing resolutions to manage the high dimensionality of pixel space [7,41], at the expense of more complex training and inference procedures. More recent work has explored alternative architectures to sidestep these issues, including transformer-based normalizing flows [51], fractal generative models [26], DiT-based models that predict neural field parameters per patch [45], and methods that predict the clean image directly to anchor generation to the low-dimensional data manifold [25]. DeCo [29] decouples the generation of high and low frequency components, leveraging a lightweight pixel decoder to reduce the complexity of direct pixel synthesis. 
Despite these advances, the integration of visual represen-"},{"citing_arxiv_id":"2604.16558","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing","primary_cat":"cs.LG","submitted_at":"2026-04-17T08:39:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12525","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoD-Lite: Real-Time Diffusion-Based Generative Image Compression","primary_cat":"cs.CV","submitted_at":"2026-04-14T09:56:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11521","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continuous Adversarial Flow Models","primary_cat":"cs.LG","submitted_at":"2026-04-13T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains 
for JiT and text-to-image benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"SiT-XL/2+CAFM 675M 1.53 SiT-XL/2+REPA [79] 675M 1.42 DDT-XL [75] 675M 1.26 Table 4: Pixel-space continuous flow models on ImageNet 256px. Please consider that architectures and settings vary. Guided Method Param FID↓ No ADM [14] 554M 10.94 JiT-H/16 [37] 956M 7.17 JiT-H/16+CAFM 956M 3.57 SiD [22] 2B 2.77 Yes ADM-G [14] 554M 4.59 SiD [22] 2B 2.44 PixNerd-XL/16 [74] 700M 2.15 PixelFlow-XL/4 [9] 677M 1.98 JiT-H/16 [37] 956M 1.86 JiT-G/16 [37] 2B 1.82 JiT-H/16+CAFM 956M 1.80 SiD2 [23] 653M 1.38 fit the training data better. Overall, our method also achieves very competitive performance in the pixel space. 4.2 Text-to-Image Generation Post-training Setup. We experiment post-training with a text-to-image generation model, Z-"},{"citing_arxiv_id":"2602.02493","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelGen: Improving Pixel Diffusion with Perceptual Supervision","primary_cat":"cs.CV","submitted_at":"2026-02-02T18:59:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20645","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelDiT: Pixel Diffusion Transformers for Image Generation","primary_cat":"cs.CV","submitted_at":"2025-11-25T18:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, 
outperforming prior pixel-space models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.19365","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation","primary_cat":"cs.CV","submitted_at":"2025-11-24T17:59:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.13720","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Back to Basics: Let Denoising Generative Models Denoise","primary_cat":"cs.CV","submitted_at":"2025-11-17T18:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"16×16 pixels), resulting in a high-dimensional token space that can be comparable to, or larger than, the Transformer's hidden dimension. SiD2 [26] and PixelFlow [6] adopt hierarchical designs that start from smaller patches; however, these models are \"FLOP-heavy\" [26] and lose the inherent generality and simplicity of standard Transformers. 
PixNerd [70] adopts a NeRF head [43] that integrates information from the Transformer output, noisy input, and spatial coordinates, with training further assisted by representation alignment [74]. Even with these special-purpose designs, the architectures in these works typically start from the \"L\" (Large) or \"XL\" size. In fact, a latest work [73] suggests that a large"}],"limit":50,"offset":0}