{"total":16,"items":[{"citing_arxiv_id":"2606.25344","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Follow Your Track: Precise Skeleton Animation Controlled by 3D Trajectories","primary_cat":"cs.CV","submitted_at":"2026-06-24T03:18:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ACT is a trajectory-conditioned framework for topology-general skeletal animation that injects 3D point trajectories from monocular video into skeletons via a Routed Trajectory Injector for improved fidelity and temporal consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11670","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-10T05:29:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10671","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-06-09T10:22:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve 
short-term dynamics and long-range coherence under fixed cache budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02575","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Zero to Hero: Training-Free Custom Concept Spawning in World Models","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30349","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaState: Self-Evolving Anchors for Streaming Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30090","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-28T15:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video 
generation workflows using structured checkpoints and multi-agent evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19484","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing","primary_cat":"cs.CV","submitted_at":"2026-05-19T07:35:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CutVerse benchmark evaluates GUI agents on 186 complex media post-production tasks in seven apps and reports 36% success rate for existing models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18346","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while improving quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17312","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:03:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with 
style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14487","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06356","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:34:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution synthesis.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Generation (CSG), we use M=3 and N=1 for inference; see Appendix C.6 for the experiment that motivates this choice. Since 2K training data is relatively limited, we employ a curriculum strategy: we first train with 1080P videos from OpenViD-HD [14], then continue for 10K steps with 90K 2K videos from UltraVideo [28], mixed with our synthesized samples. 
Evaluation Metrics. We use VBench-I2V [8] as our primary evaluation suite. It measures I2V-specific fidelity (e.g., i2v subject and i2v background) as well as general video quality metrics. I2V generation is conditioned on both the input image and text, where the text prompts are taken from the official VBench-I2V prompt set. We also report runtime and GPU memory efficiency in Section 4.3 to validate different pipelines' practicality."},{"citing_arxiv_id":"2604.15911","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Video Diffusion Models: Advancements and Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and 𝐷𝜓 is the discriminator parameterized by 𝜓. This objective sharpens details and is commonly used either jointly with consistency or distribution distillation or as a later refinement stage. This subsection includes both combined and independent adversarial designs. 3.3.1 Combined Adversarial Distillation. While the conceptual roots of adversarial distillation lie in GANs [58], the inherent instability of GAN training remains a significant challenge when applied to video diffusion distillation. In the combined setting, adversarial design is most often used as an auxiliary objective to enhance consistency- or distribution-based distillation rather than as a standalone distillation route. 
Recent advancements over established"},{"citing_arxiv_id":"2604.05200","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Investigating Ethical Data Communication with Purrsuasion: An Educational Game about Negotiated Data Disclosure","primary_cat":"cs.HC","submitted_at":"2026-04-06T21:57:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Purrsuasion is a negotiation game that surfaces satisficing and intent-attribution difficulties when students practice ethical data disclosure under real constraints.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ways of satisfying both the receiver's information need and the disclosure constraint at once. Although this may be partially attributable to the 20-minute round structure in our deployment of Purrsuasion, it also suggests opportunities for future work on authoring interfaces supporting rapid ideation under disclosure constraints (cf. recent AI-assisted visualization tools [13, 30, 47]). Receivers struggle to articulate information need. We find that receivers mostly treat written requests for information as narrow queries for relevant evidence, but lacking initial insight into available data, they seldom formulate these queries with a beneficial level of precision. 
Instead, receivers make vague references to the data signal named in their"},{"citing_arxiv_id":"2604.03118","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-03T15:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"upon strong real-time baselines including Self Forcing, LongLive, and Causal Forcing, and apply the full Salt framework (SC-DMD with cache-conditioned training) without modifying model architectures or inference pipelines. Benchmarks. For non-autoregressive I2V, we train PCM, DMD, and our method on the same proprietary internal I2V training set and evaluate on VBench-I2V from VBench++ [18], reporting the overall I2V Score and Quality Score together with key dimensions such as Subject/Background Consistency, Motion Smoothness, Dynamic Degree, Imaging Quality, and Temporal Flickering. For non-autoregressive and autoregressive T2V, we report VBench [17] scores on 5-second generation. 
To assess long-horizon transfer, we additionally evaluate"},{"citing_arxiv_id":"2604.02535","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Spectral Framework for Multi-Scale Nonlinear Dimensionality Reduction","primary_cat":"cs.LG","submitted_at":"2026-04-02T21:39:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A spectral framework for nonlinear DR uses spectral bases plus cross-entropy optimization to create multi-scale embeddings that preserve both global manifold geometry and local neighborhoods while supporting graph-frequency analysis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9):1431-1443, 2009. doi: 10.1016/j.neucom.2008.12.017 7 [40] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998. doi: 10.1137/1.9780898719628 6 [41] N. Li, V. van Unen, T. Höllt, A. Thompson, J. van Bergen, N. Pezzotti et al. Mass cytometry reveals innate lymphoid cell differentiation pathways in the human fetal intestine. J. Exp. Med., 215(5):1383-1396, 2018. doi: 10.1084/jem.20171934 8 [42] D. Liao, C. Liu, B. W. Christensen, A. Tong, G. Huguet, G. Wolf et al. 
Assessing neural network representations during training using noise-"},{"citing_arxiv_id":"2602.07775","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-02-08T02:16:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Rank↓2.6250 2.0625 1.3125 Rolling Sink13 Evaluation Benchmark & Metrics. In this work, we adopt VBench-Long [40, 41, 113] as the primary quantitative benchmark for evaluating Rolling Sink's performance and the performance gains of different steps during the systematic analysis. VBench-Long is a long-video evaluation benchmark released as part of VBench++ [41], extending the original VBench [40] on long-horizon video generations while maintaining the same fine-grained evaluation philosophy (i.e., decomposing the \"video quality\" into multiple diagnostic dimensions, each measured by one or multiple expert models that are massively pretrained). Prior SOTA Baselines. We compare Rolling Sink against two well-recognized,"}],"limit":50,"offset":0}