{"total":28,"items":[{"citing_arxiv_id":"2605.18010","ref_index":162,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Functionalization via Structure Completion and Motion Rectification","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17850","ref_index":52,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures","primary_cat":"stat.ML","submitted_at":"2026-05-18T04:45:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"URGE performs unbiased inference-time scaling for diffusion models by attaching multiplicative path weights from Girsanov estimation and resampling trajectories, with a proven equivalence to prior particle-wise SMC schemes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17527","ref_index":16,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Designing streetscapes from street-view imagery using diffusion models","primary_cat":"cs.CV","submitted_at":"2026-05-17T16:20:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multimodal diffusion model generates controllable alternative streetscapes from street-view imagery using visual metrics and 
text, shown on Chicago and Orlando data with gains in semantic consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16863","ref_index":35,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning","primary_cat":"cs.RO","submitted_at":"2026-05-16T07:53:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XDiffuser combines extrinsic graph planning with diffusion models to guide denoising and improve performance on long-horizon robotic tasks including multi-agent coordination and TSP-style problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":13,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"M = ⟨S, A, Θ, T, R, γ, P_Θ⟩, where S is the state space, A is the action space, and Θ is the space of task-specific latent parameters. For each θ ∈ Θ, the transition and reward functions are given by T_θ : S × A → P(S) and R_θ : S × A → ℝ, respectively. The parameter θ is sampled from a prior distribution P_Θ at the beginning of an episode and remains fixed during the episode. The discount factor is denoted by γ ∈ [0, 1). 
This framework defines a family of MDPs indexed by the latent parameter θ, with each θ inducing a different set of dynamics and reward functions. It can be seen as a special case of a contextual MDP where the context is latent and fixed per episode. Xie et al. (2021) 32 Published as a conference paper at ICLR 2026 further generalize this framework by allowing the task parameter θ to evolve dynamically across episodes, rather than being fixed. Bayes-Adaptive MDPs (BAMDPs) are closely related to both HiP-MDPs and contextual MDPs (CMDPs). In BAMDPs, the agent maintains a posterior distribution over MDPs based on its interaction history. Specifically, it maintains a belief b_t(R, T) = p(R, T | τ_{:t}), where τ_{:t} = {s_0, a_0, r_0, ..., s_t} denotes the trajectory observed up to time t. This belief captures the agent's uncertainty about the underlying transition and reward functions. The transition and reward functions can then be defined in expectation over this posterior, effectively conditioning decision-making on the belief b_t. When the environment is driven by hidden contextual variables or latent task parameters, such as in CMDPs or HiP-MDPs, this belief can be interpreted as a distribution over these latent variables. In this view, BAMDPs provide a non-parametric framework for reasoning over hidden structure, while approaches like ours explicitly model such latent variables and infer their posterior distributions using amortized inference. 
Both aim to enable adaptive planning and learning under uncertainty, but differ in how latent structure is represented and "},{"citing_arxiv_id":"2605.14703","ref_index":151,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Generating HDR Video from SDR Video","primary_cat":"cs.CV","submitted_at":"2026-05-14T11:21:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14654","ref_index":99,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging","primary_cat":"cs.CV","submitted_at":"2026-05-14T10:10:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classification tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14270","ref_index":14,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Diagnosing and Correcting Concept Omission in Multimodal Diffusion 
Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:14:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13333","ref_index":24,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-13T10:51:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12379","ref_index":46,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Discrete Flow Matching for Offline-to-Online Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11387","ref_index":73,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Behavioral Mode Discovery for Fine-tuning Multimodal Generative 
Policies","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:19:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11361","ref_index":17,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives","primary_cat":"cs.LG","submitted_at":"2026-05-12T00:25:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11311","ref_index":72,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Couple to Control: Joint Initial Noise Design in Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T22:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Coupled initial noises in diffusion models, with designed dependence but unchanged marginal Gaussians, improve generated image diversity on Stable Diffusion variants while preserving quality and 
alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16384","ref_index":66,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice","primary_cat":"cs.CV","submitted_at":"2026-05-11T10:51:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10218","ref_index":33,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Relative Score Policy Optimization for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08574","ref_index":6,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Post-hoc Selective Classification for Reliable Synthetic Image Detection","primary_cat":"cs.CV","submitted_at":"2026-05-09T00:25:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReSIDe generalizes logit-based confidence scores to intermediate layers of synthetic image detectors and uses preference 
optimization to aggregate them, cutting area under the risk-coverage curve by up to 69.55% under covariate shifts.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"that the length of the shorter edge is uniformly random in [64, 96]; (4) FGSM: images are perturbed using the fast gradient sign method (FGSM) [Goodfellow et al., 2015], with the distortion budget ϵ uniformly random in {1/255, 2/255, ..., 8/255}; (5) JPEG: images are compressed with JPEG with q ∼ Uniform[30, 70], where q is the quality factor of JPEG compression; (6) Unseen generators (Unseen Gen.): synthetic images are generated by Wukong, Midjourney, Stable Diffusion V1.4, and Stable Diffusion V1.5, which are not presented in training. Note that to isolate the effect of different covariate shifts and study them independently, each covariate-shifted testing distribution differs from the training distribution in only one aspect, while other aspects stay the same as the"},{"citing_arxiv_id":"2605.07701","ref_index":5,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:12:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ance scales at highly noisy stages, larger scales at intermediate steps, and smaller scales again near convergence. Such intuitions often lead to parameterized guidance curves with a small number of degrees of freedom. Let a guidance curve be parameterized as g_θ(t) with parameters θ ∈ ℝ^d. 
Searching for a task-specific guidance strategy then amounts to solving the following optimization problem: θ⋆ = arg max_θ E_{τ∼g_θ}[R(τ)]. (5) However, extending these approaches to task-specific guidance quickly becomes computationally infeasible. Searching over flexible curve families significantly enlarges the action space, and evaluating each candidate curve requires multiple diffusion sampling runs. Given the high computational cost and low sampling efficiency of discrete diffusion language models, curve-level search in"},{"citing_arxiv_id":"2605.05756","ref_index":34,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation","primary_cat":"cs.RO","submitted_at":"2026-05-07T06:52:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03399","ref_index":11,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution","primary_cat":"cs.LG","submitted_at":"2026-05-05T06:21:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PODiff performs conditional diffusion in a fixed, variance-ordered POD latent space to enable efficient probabilistic super-resolution of high-dimensional scientific fields with lower memory and better-calibrated uncertainty than pixel-space or dropout 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02583","ref_index":69,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Stylistic Attribute Control in Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-04T13:34:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02973","ref_index":31,"ref_count":2,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges","primary_cat":"cs.LG","submitted_at":"2026-05-03T16:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01568","ref_index":27,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Unifying Deep Stochastic Processes for Image Enhancement","primary_cat":"cs.CV","submitted_at":"2026-05-02T18:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Stochastic image enhancement methods are shown to be variants of a shared SDE differing in drift, diffusion, terminal distributions and boundary conditions, with controlled experiments revealing no single dominant family and a new modular 
library released.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01220","ref_index":35,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Visual Implicit Autoregressive Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-02T03:23:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00935","ref_index":27,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding","primary_cat":"cs.LG","submitted_at":"2026-05-01T03:26:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Timestep embeddings in diffusion models function as a separable side channel that can carry dedicated information for adversarial injection or detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18343","ref_index":38,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications","primary_cat":"cs.RO","submitted_at":"2026-04-20T14:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown 
dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18258","ref_index":24,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Long-Text-to-Image Generation via Compositional Prompt Decomposition","primary_cat":"cs.CV","submitted_at":"2026-04-20T13:31:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.06885","ref_index":36,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","primary_cat":"eess.AS","submitted_at":"2024-10-09T13:46:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.00426","ref_index":152,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PixArt-$\\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis","primary_cat":"cs.CV","submitted_at":"2023-09-30T16:18:00+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixArt-α matches commercial text-to-image quality with a diffusion transformer trained in 675 A100 
GPU days through decomposed training stages, cross-attention text injection, and vision-language model dense captions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}