{"total":10,"items":[{"citing_arxiv_id":"2605.22678","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Swift Sampling: Selecting Temporal Surprises via Taylor Series","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:20:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19506","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19218","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference","primary_cat":"cs.CV","submitted_at":"2026-05-19T00:45:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs 
compared to prior token or channel methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":72,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"[71] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892-34916, 2023. [72] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296-26306, 2024. [73] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. [74] Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. 
Mardini: Masked autoregressive diffusion for video generation at scale."},{"citing_arxiv_id":"2605.12056","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T12:42:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Speechprune: Context-aware token pruning for speech information retrieval. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1-6. IEEE, 2025. [32] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296-26306, 2024. [33] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. [34] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. 
Medgemma technical report. arXiv preprint arXiv:2507."},{"citing_arxiv_id":"2605.11559","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:42:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"with fragmented or incoherent visual grounding. 5 Experiment 5.1 Experiment Setup Baselines and models. We evaluate our method on four representative MLLMs for image-language understanding: InstructBLIP [38], LLaVA-1.5-7B [39], Qwen-VL-Chat [3], and GLM-4V-9B [40]. For video-language understanding, we additionally report results on Video-LLaVA-7B [41] and LLaVA-Next-Video-7B-DPO [4]. For comparison, we include several strong training-free hallucination mitigation baselines, including OPERA [14], DoLa [28], VCD [15], ICD [29], SID [42], DeCo [36], MemVR [19], ASCD [32], NoLan [34], AIR [35], DMAS [27] and Self-PEP [43]. 
Datasets and Evaluation. To rigorously assess the effectiveness of our proposed method, we conduct a comprehensive set of experiments across POPE benchmark [44], CHAIR [45], MME [46], MM-"},{"citing_arxiv_id":"2605.00891","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"X2SAM: Any Segmentation in Images and Videos","primary_cat":"cs.CV","submitted_at":"2026-04-27T16:24:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11789","ref_index":99,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.17419","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-02-19T14:50:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EAGLE achieves 
up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17726","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM","primary_cat":"cs.CV","submitted_at":"2025-05-23T10:43:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}