{"total":15,"items":[{"citing_arxiv_id":"2605.21272","ref_index":66,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"3synthetic), using aesthetic pre-filtering, multi-classifier safety filtering, deduplication, and domain-based filtering for source governance. Each surviving image is re-captioned by multiple VLMs, ranging from short concept-level to long fine-grained descriptions, and the corpus is augmented with synthetic samples generated by Apache 2.0 T2I models. All samples are shipped with standard image embeddings (DINOv2 [64], CLIP [70], SSCD [66]), classifiers and detectors (YOLO [41], Mediapipe [61]), and pre-encoded with SANA VAE [102]. 
We also provide a comprehensive analysis of the dataset, including statistics, content and topic analyses, and human quality assessment, and validate its usefulness by training a 4B-parameter T2I model exclusively on MONET, which achieves competitive evaluation scores."},{"citing_arxiv_id":"2605.18324","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Improved Baselines with Representation Autoencoders","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13404","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering","primary_cat":"cs.SD","submitted_at":"2026-05-13T11:59:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DrumGAN: Synthesis of drum sounds with timbral feature conditioning using generative adversarial networks. In Proceedings of ISMIR, 2020. [23] Patrick O'Reilly, Julia Barnett, Hugo Flores García, Annie Chu, Nathan Pruyne, Prem Seetharaman, and Bryan Pardo. The Rhythm In Anything: Audio-prompted drums generation with masked language modeling. In Proceedings of ISMIR, 2025. [24] William Peebles and Saining Xie. 
Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195-4205, 2023. [25] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of"},{"citing_arxiv_id":"2605.12271","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[8] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. [9] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195-4205, 2023. [10] Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H Chan, and Jean-michel Morel. Redefining temporal modeling in video diffusion: The vectorized timestep approach. arXiv preprint arXiv:2410.03160, 2024. 
[11] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al."},{"citing_arxiv_id":"2605.07286","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Sparse Random-Feature Neural Networks with Krylov-Based SVD for Singularly Perturbed ODE","primary_cat":"math.NA","submitted_at":"2026-05-08T05:48:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse RFNNs with sSVD via Lanczos-Golub-Kahan bidiagonalization maintain accuracy while improving efficiency and robustness for 1D steady convection-diffusion equations with strong advection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"convection-diffusion equations case having stronger advection, while achieving substantial gains in training efficiency and robustness compared to standard dense implementations. 1 Introduction Neural networks have demonstrated remarkable versatility and effectiveness across a broad range of applications, including images and audios [1, 2, 3], natural language processing [3, 4, 5], and complex scientific modelling [6, 7, 8]. However, neural networks have often failed to cater the scientific community where precision is of utmost importance. 
For an instance, it is a challenge for neural networks to learn some simple low dimensional mathematical or physical symbolic functions with a misfit in the order of machine precision [9]."},{"citing_arxiv_id":"2605.05781","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25299","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents","primary_cat":"cs.CV","submitted_at":"2026-04-28T07:09:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25289","ref_index":36,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds","primary_cat":"cs.LG","submitted_at":"2026-04-28T06:53:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality 
generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser by using separate time spaces per class.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24959","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoreFlow: Low-Rank Matrix Generative Models","primary_cat":"cs.LG","submitted_at":"2026-04-27T19:56:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03233","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LTX-2: Efficient Joint Audio-Visual Foundation Model","primary_cat":"cs.CV","submitted_at":"2026-01-06T18:24:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"alignment. Through these contributions, LTX-2 establishes a new open-source foundation for T2AV generation, capable of producing coherent, expressive, and richly detailed content at unprecedented speed. 2 Related Work Diffusion Transformers (DiTs) have emerged as a unifying architecture for large-scale generative modeling. 
Introduced by Peebles and Xie [23], DiTs replace the traditional U-Net backbone with a transformer operating in latent space, enabling superior scalability and global receptive fields. Subsequent advances in Rectified Flow [15] have further optimized these models by framing denoising as a continuous flow, reducing sampling steps and improving efficiency. These developments form the architectural foundation for recent advances in multimodal generative models."},{"citing_arxiv_id":"2507.01925","ref_index":218,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RLBench, CALVIN;tabletopmanipulation(real-world) Environment(zero-shot) NR CoTDiffusion[216] SemanticAlignmentModule,Diffusion Model Trained on 10K trajectorieswith annotated ground truthkeyframe for each task(coarse-grained pretraining,fine-grained train) Image ViT Encoder,MLP Trained on 10K trajectories VIMA-Bench [217] Placement,Object,Task(zero-shot) N/A CoT-VLA [218] VILA-U Fine-tuned on robot actiondata, videos without actionlabels (VILA-U's vision towerfrozen) Image Full-AttentionModule (basedon VILA-U) Fine-tuned on taskdemonstrations collected ondownstream robot tasks LIBERO;Bridge-V2 [219],tabletop manipulation(real-world) Environment,Task,Instruction(zero-shot) WidowX,Franka Panda Multi-frame UniPi [220]"},{"citing_arxiv_id":"2504.20690","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"In-Context Edit: Enabling 
Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-04-29T12:14:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.09992","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Large Language Diffusion Models","primary_cat":"cs.CL","submitted_at":"2025-02-14T08:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(2)) itself, that fundamentally underpin the essential properties of LLMs. In particular, we argue that scalability is primarily a consequence of the interplay between Transformers [7], model size, data size, and Fisher consistency 5 [8] induced by the generative principles in Eq. (1), rather than a unique result of the ARMs in Eq. (2). The success of diffusion transformers [9, 10] on visual data [11] supports this claim. Furthermore, the instruction-following and in-context learning [4] capabilities appear to be intrinsic properties of all conditional generative models on structurally consistent linguistic tasks, rather than exclusive advantages of ARMs. 
In addition, while ARMs can be interpreted as a lossless data compressor [12, 13], any sufficiently"},{"citing_arxiv_id":"2406.03520","ref_index":77,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VideoPhy: Evaluating Physical Commonsense for Video Generation","primary_cat":"cs.CV","submitted_at":"2024-06-05T17:53:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"duction for All - github.com. https://github.com/hpcaitech/Open-Sora, 2024. [76] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024. [77] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195-4205, 2023. [78] pika. Pika - pika.art. https://pika.art/. [79] Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. 
Intuitive physics learning in a deep-learning model inspired by developmental psychology."},{"citing_arxiv_id":"2405.14430","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference","primary_cat":"cs.CV","submitted_at":"2024-05-23T11:00:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}