{"total":14,"items":[{"citing_arxiv_id":"2605.27235","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale","primary_cat":"cs.CV","submitted_at":"2026-05-26T16:16:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20708","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Cross-Layer Information Routing in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-20T05:07:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20147","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15684","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ElasticDiT: Efficient Diffusion 
Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices","primary_cat":"cs.CV","submitted_at":"2026-05-15T07:13:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18168","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation","primary_cat":"cs.CV","submitted_at":"2026-04-20T12:28:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[42-44] series have continuously improved image generation capabilities. Recent large-scale models like FLUX [45, 46], Nano Banana [47], Qwen-Image [48], and HunyuanImage 3.0 [49] have demonstrated the ability to synthesize complex content and accurately edit images. To enhance semantic understanding and instruction following abilities of generative models, models such as Playground v3 [50], SANA-1.5 [14], and BLIP3o-NEXT [51] focus on integrating LLMs [17,18] effectively into the generation framework. 
Meanwhile, given that high-quality image synthesis typically requires multiple denoising iterations, reducing the number of denoising steps to improve generation efficiency has become another important research direction. 2.2. Few-step Generation"},{"citing_arxiv_id":"2604.12322","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Adversarial One Step Generation via Condition Shifting","primary_cat":"cs.CV","submitted_at":"2026-04-14T05:54:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"(11) When the model is imperfect, $x^{\\mathrm{fake}}$ deviates from the true $x$, capturing the model's current generation error. We train the network under the shifted condition $c_{\\mathrm{fake}}$ to fit the trajectory toward $x^{\\mathrm{fake}}$. Construct fake trajectory: $x_t^{\\mathrm{fake}} = \\alpha(t) z + \\gamma(t) x^{\\mathrm{fake}}$. The fake flow loss is defined as: $\\mathcal{L}_{\\mathrm{fake}}(\\theta) = \\mathbb{E}_{x_t, z, t}\\left[ \\lVert F_\\theta(x_t^{\\mathrm{fake}}, t, c_{\\mathrm{fake}}) - (z - x^{\\mathrm{fake}}) \\rVert^2 \\right]$. (12) Concretely, $\\partial x^{\\mathrm{fake}} / \\partial\\theta = -t \\cdot \\partial F_\\theta(x_t, t, c) / \\partial\\theta$, so $\\mathcal{L}_{\\mathrm{fake}}$ simultaneously trains the $c_{\\mathrm{fake}}$ branch and injects a direct adversarial gradient into $F_\\theta(\\cdot, \\cdot, c)$. The stop gradient in APEX is applied separately in $\\mathcal{L}_{\\mathrm{cons}}$, where $v_{\\mathrm{fake}} := \\mathrm{sg}(F_\\theta(x_t, t, c_{\\mathrm{fake}}))$ serves as a correction reference. 
When $\\mathcal{L}_{\\mathrm{fake}}$ is minimized, $v_{\\mathrm{fake}}(\\cdot, \\cdot, c_{\\mathrm{fake}})$ approximates the velocity field of the fake distribution $p_{\\mathrm{fake}}$."},{"citing_arxiv_id":"2601.03233","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LTX-2: Efficient Joint Audio-Visual Foundation Model","primary_cat":"cs.CV","submitted_at":"2026-01-06T18:24:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"To support the phonetic precision required for synchronized speech, we move beyond simple global text embeddings. Our conditioning pipeline uses Gemma3-12B [27] as a backbone, refined through two specialized stages (Figure 4). For complex conditional generation tasks, relying exclusively on the final-layer embeddings from decoder-only LLMs has been shown to be sub-optimal [16, 29]. Moreover, Gemma3-12B decoder-only architecture employs causal (unidirectional) attention rather than full bidirectional context modeling. Therefore, we employ the following two methods to compensate for the limitations of causal attention and sub-optimal final-layer embeddings in complex conditional generation. 
3.2.1 Multi-Layer Feature Extractor"},{"citing_arxiv_id":"2511.20645","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelDiT: Pixel Diffusion Transformers for Image Generation","primary_cat":"cs.CV","submitted_at":"2025-11-25T18:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, outperforming prior pixel-space models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.03147","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2025-06-03T17:59:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"generation models, while \"Unified\" indicates models capable of both understanding and generation. †: Results of GPT-4o-Image are tested by [47]. Model Cultural↑ Time↑ Space↑ Biology↑ Physics↑ Chemistry↑ Overall↑ Gen. 
Only SDXL [34] 0.43 0.48 0.47 0.44 0.45 0.27 0.43 SD3.5-large [10] 0.44 0.50 0.58 0.44 0.52 0.31 0.46 PixArt-Alpha [5] 0.45 0.50 0.48 0.49 0.56 0.34 0.47 playground-v2.5 [24] 0.49 0.58 0.55 0.43 0.48 0.33 0.49 FLUX.1-dev [16] 0.48 0.58 0.62 0.42 0.51 0.35 0.50 Unified Janus [44] 0.16 0.26 0.35 0.28 0.30 0.14 0.23 Show-o [46] 0.28 0.40 0.48 0.30 0.46 0.30 0.35 Janus-Pro-7B [7] 0.30 0.37 0.49 0.36 0.42 0.26 0.35 Emu3 [43] 0.34 0.45 0.48 0.41 0.45 0.27 0.39 MetaQuery-XL [32] 0.56 0.55 0.62 0.49 0.63 0.41 0.55 BAGEL [9] 0."},{"citing_arxiv_id":"2412.14169","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Autoregressive Video Generation without Vector Quantization","primary_cat":"cs.CV","submitted_at":"2024-12-18T18:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.00131","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open-Sora Plan: Open-Source Large Video Generation Model","primary_cat":"cs.CV","submitted_at":"2024-11-28T14:07:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and 
weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.24164","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","primary_cat":"cs.LG","submitted_at":"2024-10-31T17:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models are typically concerned with image generation, but our action generation model builds on a number of previously proposed concepts. Like Zhou et al. [59], we train our model via a diffusion-style (flow matching) loss applied on individual sequence elements, in lieu of the standard cross-entropy loss for decoder-only transformers. Like Liu et al. [29], we use a separate set of weights for the tokens corresponding to diffusion. Incorporating these concepts into a VLA model, we introduce what to our knowledge is the first flow matching VLA that produces high-frequency action chunks for dexterous control. Our work also builds on a rich history of prior works on large-scale robot learning. 
Early work in this area often utilized"},{"citing_arxiv_id":"2410.10629","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2024-10-14T15:36:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.18869","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emu3: Next-Token Prediction is All You Need","primary_cat":"cs.CV","submitted_at":"2024-09-27T16:06:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024. [53] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689-26699, 2024. [54] Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, and Daiqing Li. 
Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024. [55] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual"}],"limit":50,"offset":0}