{"total":14,"items":[{"citing_arxiv_id":"2606.00579","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sandboxed Coding Agents are Competitive Omni-modal Task Solvers","primary_cat":"cs.CL","submitted_at":"2026-05-30T07:04:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00508","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-LynX: Token Interface Alignment for Video+X LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-30T03:54:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12056","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T12:42:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024. [41] Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4246- 4255, 2025. [42] Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024. [43] Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. Tokencarve: Information-preserving visual token compression in multimodal"},{"citing_arxiv_id":"2605.01024","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness","primary_cat":"cs.CV","submitted_at":"2026-05-01T18:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11244","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding","primary_cat":"cs.CV","submitted_at":"2026-04-13T09:50:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02605","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Audio-Visual Large Language Models Really See and Hear?","primary_cat":"cs.AI","submitted_at":"2026-04-03T00:48:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[49] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image cap- tioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035-4045, 2018. 4 [50] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023. 2 [51] Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024. 1 [52] Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross- modal hallucination benchmark for audio-visual large lan-"},{"citing_arxiv_id":"2512.02231","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T21:57:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14582","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-11-18T15:22:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.15148","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models","primary_cat":"cs.CV","submitted_at":"2025-10-16T21:10:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.20215","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-26T04:17:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12937","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","primary_cat":"cs.AI","submitted_at":"2025-03-17T08:51:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.04326","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2025-02-06T18:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13106","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Videogpt+: Integrating image and video encoders for enhanced video understanding.arxiv, 2024. [23] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. [24] Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024. [25] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data."},{"citing_arxiv_id":"2501.01957","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","primary_cat":"cs.CV","submitted_at":"2025-01-03T18:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}