{"total":11,"items":[{"citing_arxiv_id":"2606.29504","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance","primary_cat":"cs.CV","submitted_at":"2026-06-28T17:05:56+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concluding the approach does not enable reliable keystroke reconstruction outside contro","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08674","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension","primary_cat":"cs.CV","submitted_at":"2026-06-07T15:23:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23428","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis","primary_cat":"cs.CV","submitted_at":"2026-05-22T09:41:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hybrid motion estimation 
framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17311","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18878","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis","primary_cat":"eess.SP","submitted_at":"2026-05-16T02:49:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"View Decision-Tree Random-Forest SVM MLP MLP-Large TabPFN Left-1 0.60 [0.40-0.80] 0.55 [0.34-0.77] 0.41 [0.21-0.62] 0.53 [0.33-0.73] 0.53 [0.33-0.73] 0.53 [0.31-0.77] Left-2 0.42 [0.21-0.63] 0.55 [0.34-0.74] 0.49 [0.29-0.68] 0.57 [0.37-0.76] 0.49 [0.29-0.68] 0.53 [0.32-0.74] 
Left-3 0.63 [0.43-0.82] 0.58 [0.36-0.80] 0.60 [0.39-0.78] 0.68 [0.48-0.84] 0.68 [0.48-0.84] 0.50 [0.28-0.72] Right-1 0.41 [0.21-0.60] 0.64 [0.43-0.84] 0.65 [0.45-0.83] 0.44 [0.24-0.65] 0.57 [0.37-0.76] 0.50 [0.29-0.71] Right-2 0.41 [0.22-0.62] 0.64 [0.40-0.86] 0.52 [0.32-0.73] 0.47 [0.25-0.68] 0.47 [0.25-0.69] 0.51 [0.29-0.75] Right-3 0.50 [0.29-0.69] 0.55 [0.34-0.77] 0.55 [0.34-0.75] 0.61 [0.40-0.82] 0.68 [0.48-0.86] 0.64 [0.43-0.84] All Views 0.53 [0.32-0.73] 0."},{"citing_arxiv_id":"2605.02094","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-03T23:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Multi-modal methods combine video and keypoints, or additional modalities such as optical flow and depth, to improve robustness, but often require longer input clips, heavier architectures, or more modalities during inference [24,7,11,29]. More broadly, masked autoencoding has become an effective self-supervised paradigm for visual representation learning: VideoMAE [23] extends this idea to videos with random tube masking, and VideoMAEv2 [25] improves efficiency with a stronger training recipe and partial masked-token decoding. However, such generic masking strategies are not optimized for sign language, where hands occupy only a small portion of the frame. 
This motivates our segmentation-guided, hand-centric pretraining method, which focuses"},{"citing_arxiv_id":"2604.19683","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mask World Model: Predicting What Matters for Robust Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:05:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15096","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography","primary_cat":"cs.CV","submitted_at":"2026-04-16T14:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10333","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zero-shot World Models Are Developmentally Efficient Learners","primary_cat":"cs.AI","submitted_at":"2026-04-11T19:32:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A zero-shot visual world model trained on one child's experience 
achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"http://arxiv.org/abs/2006.07733. arXiv:2006.07733 [cs]. [21] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, October 2022. URL http://arxiv.org/abs/2203.12602. arXiv:2203.12602 [cs]. 15 Zero-shot World Models Are Developmentally Efficient Learners Aw et al. [22] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers, May 2021. URL http://arxiv.org/abs/2104.14294. arXiv:2104.14294 [cs]. [23] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas."},{"citing_arxiv_id":"2604.06783","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer","primary_cat":"cs.CV","submitted_at":"2026-04-08T07:52:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.01852","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic 
Alignment","primary_cat":"cs.CV","submitted_at":"2023-10-03T07:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}