{"total":19,"items":[{"citing_arxiv_id":"2607.00902","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization","primary_cat":"cs.CV","submitted_at":"2026-07-01T13:04:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MG-RWKV combines bidirectional RWKV, multi-granularity mixture of experts, and cross-granularity consistency to achieve state-of-the-art temporal forgery localization with linear complexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00522","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST","primary_cat":"cs.CV","submitted_at":"2026-05-30T04:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Authors adapt the SegAnyMo baseline with DAVIS data plus simulated turbulence and a spatio-temporal cleanup module to rank 2nd on the DOST challenge track.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18010","ref_index":281,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Functionalization via Structure Completion and Motion Rectification","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired 
dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12006","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robust Promptable Video Object Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-12T11:55:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper creates a real-world corruption benchmark for promptable video object segmentation and proposes MoGA, which uses object-specific memory to improve robustness and temporal consistency under adverse conditions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"jitter, Gaussian noise, ISO noise, motion blur, resampling blur, fog, rain, and snow) to existing VOS datasets. Using corruption synthesis algorithms [4, 13] with Fourier-based temporal modulation [19], we vary corruption intensity smoothly across frames to simulate degradation patterns in real videos. For training, we corrupt videos from MOSE [7], YouTube-VOS [47], and DAVIS [34], gener- SAM2 Mask Decoder Memory Encoder Image Encoder Q K, V Memory Attention MoGA Prompt Encoder prompts∅ Memory Bank Current Frame Linear MoGA sharedshared Avg. sharedshared Elementwise add.Matrix mul. Figure 3. Overview of MoGA integrated into SAM2 [37]. Top: an example input video under adverse weather conditions. 
Frames with"},{"citing_arxiv_id":"2605.03276","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-05T02:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Qwen3-VL-32B-Instruct [28] Qwen3 [32] 32B 34.36%34.36%28.99% 39.72% 39.72% 0.00 0.00 Qwen3-VL-30B-A3B-Instruct [28] Qwen3 [32] 30B 28.84% 31.65% 28.30% 29.37% 34.99% 0.00 0.00 Qwen3-VL-8B-Instruct [28] Qwen3 [32] 8B 32.39% 33.86% 28.64% 36.14% 39.08% 0.00 0.00 Qwen3-VL-4B-Instruct [28] Qwen3 [32] 4B 30.20% 30.14% 28.60% 31.80% 31.67% 0.00 0.00 Qwen2.5-VL-7B-Instruct [1] Qwen2.5 [38] 7B 28.74% 27.79% 27.47% 30.01% 28.10% 0.00 0.01 InternVL3-14B [42] Qwen2.5 [38] 14B 29.12% 28.55% 27.08% 31.16% 30.01% 0.00 0.01 InternVL3-8B [42] Qwen2.5 [38] 7B 27.99% 26.08% 25.21% 30.78% 26.95% 0.00 0.00 (a) On-screen Text (b) Frame Separator Figure 5. 
Examples of two different visual prompts in the stitched videos: (a) On-screen Text and (b) Frame Separator."},{"citing_arxiv_id":"2604.27322","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal","primary_cat":"cs.CV","submitted_at":"2026-04-30T02:08:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26488","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:51:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00891","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X2SAM: Any Segmentation in Images and Videos","primary_cat":"cs.CV","submitted_at":"2026-04-27T16:24:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint 
training on mixed datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14630","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-16T05:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07901","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-09T07:17:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08831","ref_index":92,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"3AM: 3egment Anything with Geometric Consistency in Videos","primary_cat":"cs.CV","submitted_at":"2026-01-13T18:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object 
segmentation from RGB alone, with large gains on wide-baseline datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To demonstrate the capability of our model, we evaluate its performance on 2D object tracking under challenging camera motion. Specifically, we focus on scenarios where the camera undergoes significant translation and rotation, and where objects may disappear and later reappear. Traditional VOS benchmarks such as LVOS [26], VOST [75], DAVIS [61], and YTOS [92] primarily assume a relatively fixed camera and stable scene surroundings, making them insufficient for testing robustness under large viewpoint changes. To address this limitation, we use 3D-based datasets, ScanNet++ [109] and Replica [69], which naturally provide extensive camera trajectory variation due to their 3D reconstruction requirements. Their wide-baseline viewpoints make"},{"citing_arxiv_id":"2512.22046","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models","primary_cat":"cs.CV","submitted_at":"2025-12-26T14:48:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BadVSFM is the first effective backdoor attack on prompt-driven video segmentation foundation models, using a two-stage encoder-decoder strategy to achieve high attack success rates with limited clean performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13684","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recurrent Video Masked Autoencoders","primary_cat":"cs.CV","submitted_at":"2025-12-15T18:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RVM uses recurrent computation inside a 
masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16719","ref_index":145,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM 3: Segment Anything with Concepts","primary_cat":"cs.CV","submitted_at":"2025-11-20T18:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18822","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM 2++: Tracking Anything at Any Granularity","primary_cat":"cs.CV","submitted_at":"2025-10-21T17:20:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05425","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-05T05:51:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA 
pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.12169","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline","primary_cat":"cs.CV","submitted_at":"2025-04-16T15:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A self-supervised Degradation Estimation Network estimates parameters for physics-informed noise distributions to generate realistic synthetic low-light data, showing gains on noise replication, enhancement, and detection tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00714","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SAM 2: Segment Anything in Images and Videos","primary_cat":"cs.CV","submitted_at":"2024-08-01T17:00:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.06119","ref_index":209,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Deep Learning Techniques for Image Segmentation","primary_cat":"cs.CV","submitted_at":"2019-07-13T19:23:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":1.0,"formal_verification":"none","one_line_summary":"A 2019 survey that 
categorizes and intuitively explains major deep learning techniques for image segmentation, progressing from classical methods to modern neural architectures.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Stanford Background Dataset [72] Microsoft COCO [122] MIT Scene parsing data(ADE20K) [222, 223] Semantic Boundaries Dataset [75] Microsoft Research Cambridge Object Recognition Image Database (MSRC) [188] Video Densely Annotated Video Segmentation(DAVIS) [168] Segmentation Video Segmentation Benchmark(VSB100) [64] Dataset YouTube-Video object Segmentation [209] Autonomous Cambridge-driving Labeled Video Database (CamVid) [23] Driving Cityscapes: Semantic Urban Scene Understanding [41] Mapillary Vistas Dataset [155] SYNTHIA: Synthetic collection of Imagery and Annotations [178] KITTI Vision Benchmark Suite [67] Berkeley Deep Drive [212] India Driving Dataset(IDD) [202] Aerial Inria Aerial Image Labeling Dataset [134]"}],"limit":50,"offset":0}