{"total":18,"items":[{"citing_arxiv_id":"2605.20090","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-19T16:47:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18074","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-18T08:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segmentation, flow prediction, and motion forecasting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11111","ref_index":79,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ShardTensor: Domain Parallelism for Scientific Machine Learning","primary_cat":"cs.DC","submitted_at":"2026-05-11T18:20:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per 
device.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10256","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Cold Diffusion Approach for Percussive Dereverberation","primary_cat":"cs.SD","submitted_at":"2026-05-11T09:23:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A cold diffusion model with direct and delta-normalized reverse processes, using UNet and transformer backbones, outperforms diffusion baselines for dereverberating acoustic and electronic drum stems on in-domain and out-of-domain tests.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"prediction modes: (i) Direct (next-state) prediction, which predicts the next less-reverberant intermediate spectrogram at each step, and (ii) ∆-normalized residual (velocity-style) prediction, which predicts the step-size-normalized difference between consecutive intermediate spectrograms. We instantiate both with a UNet [21] and a diffusion Transformer [22] (DiT), and compare them against two strong diffusion baselines originally developed for speech enhancement/dereverberation: SGMSE+ [11], [12] and CDiffuSE [10]. 
Since most related diffusion-based pipelines are designed and evaluated in the speech domain, this is reflected not only in the training data and degradations but also in common evaluation protocols."},{"citing_arxiv_id":"2605.07861","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:21:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Stable Diffusion [7] further advanced this paradigm by performing the noising and denoising processes in the latent space, significantly reducing computational costs. DiT [31] introduced a Transformer-based architecture that replaces the conventional U-Net used in previous diffusion models. Stable Diffusion 3 [9] and Flux integrate flow mechanisms [32], [33] and transformer-based diffusion models, further enhancing the models' image generation capabilities. C. Reinforcement Learning for Diffusion Models Reinforcement learning has emerged as a key technique in the post-training phase of large language models (LLMs). 
When equipped with appropriate rewards, it enables LLMs to better align with specific preferences or further enhance"},{"citing_arxiv_id":"2604.26917","ref_index":73,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation","primary_cat":"cs.CV","submitted_at":"2026-04-29T17:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00874","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Latent Space Probing for Adult Content Detection in Video Generative Models","primary_cat":"cs.CV","submitted_at":"2026-04-25T01:01:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19570","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-21T15:24:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on 
BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19330","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation","primary_cat":"eess.AS","submitted_at":"2026-04-21T10:58:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18344","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:41:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18210","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics","primary_cat":"cs.AI","submitted_at":"2026-04-20T12:57:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TacticGen generates realistic, adaptable football 
tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11793","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentangled Point Diffusion for Precise Object Placement","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:55:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAX-DPD combines a feed-forward dense GMM for global placement priors with disentangled point cloud diffusion for local geometry and pose to achieve precise robotic object placement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06568","ref_index":61,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Noise Constrained Diffusion (NC-Diffusion) Framework for High Fidelity Image Compression","primary_cat":"eess.IV","submitted_at":"2026-04-08T01:35:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NC-Diffusion matches quantization noise to the diffusion forward process, adds an adaptive frequency filter and zero-shot enhancement, and reports superior fidelity on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04166","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Primitive-based Truncated Diffusion for Efficient Trajectory Generation of Differential Drive Mobile Manipulators","primary_cat":"cs.RO","submitted_at":"2026-04-05T16:13:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 
primitive-based truncated diffusion model with keypoint attention encoding generates more efficient and diverse trajectories for mobile manipulators than vanilla diffusion in cluttered 3D simulations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Furthermore, our current diffusion process is conducted at discrete path points, still requiring trajectory optimization-based post-processing to obtain dynamically feasible trajectories, thereby increasing the complexity of the framework. In the future, we will try to incorporate advanced diffusion frameworks, such as EDM [30], Consistency model [31], and perform diffusion directly in the trajectory representation space to further accelerate inference and improve success rate [32]. We also plan to deploy the planner on physical robots to test and bridge the potential sim-to-real gap. To further exploit the advantages of the efficiency of our method, we will also consider adapting it to mobile manipulators"},{"citing_arxiv_id":"2603.28489","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This field has undergone a transformative journey, progressing from early generative adversarial networks (GANs) [1], [2] and pixel-level auto-regressive (AR) models [3], [4] to high-fidelity diffusion-based approaches [5]-[13], and more recently to large-scale architectures that function as \"World 
Simulators\" capable of modeling physical laws and long-horizon causalities [14], [15]. This progression marks a substantial leap in generative capabilities, enabling models not only to synthesize visual content but to understand and predict the underlying physics of the environment, thereby paving the way for AGI [16], [17]. To fully appreciate this leap, it is essential to understand video generation has the potential to achieve world modeling."},{"citing_arxiv_id":"2604.16362","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning","primary_cat":"cs.LG","submitted_at":"2026-03-20T13:29:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09204","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization","primary_cat":"cs.RO","submitted_at":"2025-10-10T09:43:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flow-Opt combines a flow-matching DiT model with a custom differentiable safety filter and learned initialization to enable fast centralized trajectory optimization for tens of robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14159","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MIMIC-D: Multi-modal 
Imitation for MultI-agent Coordination with Decentralized Diffusion Policies","primary_cat":"cs.RO","submitted_at":"2025-09-17T16:41:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIMIC-D enables multi-modal multi-agent coordination via joint training of decentralized diffusion policies using only local information.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}