{"total":11,"items":[{"citing_arxiv_id":"2605.16732","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-16T00:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11494","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T04:10:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"1B-parameter MMDiT distilled to 4 denoising steps (38 joint transformer blocks, 4096 tokens at 1024×1024). These models represent the two dominant acceleration paradigms: single-step (no iterative refinement) and few-step distillation (compressed denoising schedule). Datasets. We evaluate on four standard text-to-image benchmarks: MS-COCO [25] for large-scale distributional diversity, DrawBench [33] for compositionally challenging prompts, PartiPrompts [41] for broad category coverage, and GenEval [14] for compositional accuracy. For MS-COCO, we use a subset of 2,000 captions from the 2014 validation split. See Supp. Sec B.1. 
Metrics. We report four primary metrics. InBatchSim (InBSim) ↓ measures the average pairwise CLIP similarity among images generated from the same prompt, directly quantifying mode collapse,"},{"citing_arxiv_id":"2605.10198","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:46:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684-10695, 2022. 9 Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models [3] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022. [4] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479-36494, 2022. 
[5] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo"},{"citing_arxiv_id":"2605.09296","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts","primary_cat":"cs.CV","submitted_at":"2026-05-10T03:44:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022. [34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684-10695, 2022. [35] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479-36494, 2022. 
[36] Kihyuk Sohn, Honglak Lee, and Xinchen Yan."},{"citing_arxiv_id":"2605.04412","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-06T02:08:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04358","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Intermediate Representations are Strong AI-Generated Image Detectors","primary_cat":"cs.CV","submitted_at":"2026-05-05T23:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24416","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scaling Properties of Continuous Diffusion Spoken Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-27T12:45:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays 
difficult.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14379","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Step-level Denoising-time Diffusion Alignment with Multiple Objectives","primary_cat":"cs.LG","submitted_at":"2026-04-15T19:52:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19261","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning","primary_cat":"cs.CV","submitted_at":"2025-05-25T18:33:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DiT-ST converts complete-text captions into split-text primitives via LLMs and injects them hierarchically across denoising stages to reduce semantic confusion in DiT-based text-to-image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16527","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Joint Relational Database Generation via Graph-Conditional Diffusion Models","primary_cat":"cs.LG","submitted_at":"2025-05-22T11:12:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GRDM jointly generates relational database tables via graph-conditional diffusion without table ordering, outperforming autoregressive baselines on multi-hop correlations and 
single-table fidelity across six real RDBs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.05470","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Flow-GRPO: Training Flow Matching Models via Online RL","primary_cat":"cs.CV","submitted_at":"2025-05-08T17:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"∗Equal contribution, †Corresponding author 39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2505.05470v5 [cs.CV] 27 Oct 2025 (b) Image Quality (c) Preference Score (a) GenEval Performance Figure 1: (a) GenEval performance rises steadily throughout Flow-GRPO's training and outperforms GPT-4o. (b) Image quality metrics on DrawBench [1] remain essentially unchanged. (c) Human Preference Scores on DrawBench improves after training. Results show that Flow-GRPO enhances the desired capability while preserving image quality and exhibiting minimal reward-hacking. to collect training data, but flow models typically require many iterative steps to generate each sample, limiting efficiency."}],"limit":50,"offset":0}