pith. machine review for the scientific record.

arxiv: 2605.12056 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 04:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords omnimodal llms · token compression · audio-video alignment · cross-modal reasoning · dynamic programming · inference efficiency · cooperative compression

The pith

OmniRefine refines audio-video chunk boundaries for alignment, then jointly compresses tokens to cut inference cost while holding performance near full-token levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a two-stage training-free process can compress the many tokens from long videos and audio tracks in omnimodal models without breaking the cross-modal links those models need for reasoning. It first adjusts the starting and ending points of each compression chunk so that frames and audio segments line up more closely, using similarity scores and dynamic programming. It then compresses the video and audio tokens inside each newly aligned chunk together rather than separately. If the claim holds, models become practical to run on longer inputs because they retain most of their accuracy even after discarding more than half the original tokens.

Core claim

OmniRefine is a training-free two-stage framework for audio-visual token compression in Omni-LLMs. Correspondence-Preserving Chunk Refinement first converts native chunk boundaries into cross-modally aligned compression units by measuring frame-audio similarity and applying dynamic programming. Modality-Aware Cooperative Compression then jointly reduces redundancy in video and audio tokens inside each refined unit while keeping critical evidence intact.
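The paper's text above describes Modality-Aware Cooperative Compression only at this level of detail, so the sketch below is one plausible, hypothetical reading rather than the authors' method: each token is scored by cosine similarity to the other modality's chunk-level mean embedding, and a single retention budget is shared across the pooled video and audio tokens of a refined unit. Every function name and parameter here is illustrative.

```python
import numpy as np

def cooperative_compress(video_tok, audio_tok, retention=0.44, eps=1e-8):
    """Illustrative joint token reduction inside one refined unit (hypothetical).

    Assumption of this sketch: each token is scored by cosine similarity to the
    mean embedding of the *other* modality in the unit, and one retention
    budget is shared across the pooled video+audio tokens.
    video_tok, audio_tok: (n_v, d) and (n_a, d) token embedding arrays.
    Returns the kept (video, audio) token arrays.
    """
    def unit_norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

    v, a = unit_norm(video_tok), unit_norm(audio_tok)
    a_centroid = unit_norm(a.mean(axis=0, keepdims=True))[0]
    v_centroid = unit_norm(v.mean(axis=0, keepdims=True))[0]

    # Cross-modal relevance: video tokens scored against the unit's audio
    # context, audio tokens against the unit's video context.
    v_scores = v @ a_centroid
    a_scores = a @ v_centroid
    scores = np.concatenate([v_scores, a_scores])

    budget = max(2, int(round(retention * scores.size)))
    keep = set(np.argsort(-scores)[:budget].tolist())
    # Guarantee at least one token from each modality survives.
    keep.add(int(np.argmax(v_scores)))
    keep.add(len(v_scores) + int(np.argmax(a_scores)))

    v_idx = sorted(i for i in keep if i < len(v_scores))
    a_idx = sorted(i - len(v_scores) for i in keep if i >= len(v_scores))
    return video_tok[v_idx], audio_tok[a_idx]

# Toy call: 40 video tokens and 10 audio tokens, 16-dim embeddings.
rng = np.random.default_rng(0)
kept_v, kept_a = cooperative_compress(rng.normal(size=(40, 16)),
                                      rng.normal(size=(10, 16)))
print(kept_v.shape, kept_a.shape)
```

The shared budget (rather than separate per-modality quotas) is the point of the "cooperative" framing; the specific scoring rule above is an assumption made for the sketch, not something the paper specifies.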

What carries the argument

Correspondence-Preserving Chunk Refinement followed by Modality-Aware Cooperative Compression, which together produce aligned units that keep complementary audio-video information available during joint token reduction.

If this is right

  • The method delivers a better efficiency-performance trade-off than baselines that use fixed or native compression units.
  • Performance stays stable even when token retention drops to 44 percent.
  • On WorldSense the approach reaches 46.7 percent accuracy at 44 percent retention, nearly matching the full-token baseline.
  • Longer video streams and denser audio sequences become feasible at lower inference cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary-refinement idea could extend to other paired modalities such as text and images where misalignment also hurts compression.
  • Lower token counts may enable real-time omnimodal applications on devices with limited memory or power.
  • Preserving alignment during compression reduces the chance that downstream reasoning steps lose synchronization between sound and visuals.

Load-bearing premise

Refining native chunk boundaries with frame-audio similarity and dynamic programming will create compression units that keep the complementary cross-modal information required for audio-video reasoning intact.

What would settle it

A controlled test at 44 percent token retention on WorldSense that runs the cooperative compression stage on refined aligned units and on native unrefined chunks; the load-bearing premise fails if accuracy with the refined units falls more than a few points below the accuracy obtained with native chunks at the same retention rate.
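That comparison fits in a few lines. A minimal harness sketch, assuming a hypothetical evaluate callable that wraps the model and benchmark and accepts a chunking mode; nothing here is taken from the paper or its released code.

```python
def refinement_ablation(evaluate_fn, retention=0.44, margin=2.0):
    """Two-condition comparison at fixed token retention (hypothetical harness).

    evaluate_fn(chunking, retention) -> accuracy in percent; the callable is
    assumed to wrap the model plus benchmark and accept a chunking mode of
    "refined" (Stage 1 + Stage 2) or "native" (Stage 2 on unrefined chunks).
    """
    acc_refined = evaluate_fn(chunking="refined", retention=retention)
    acc_native = evaluate_fn(chunking="native", retention=retention)
    gap = acc_refined - acc_native
    if gap < -margin:
        verdict = "refinement hurts: the load-bearing premise fails"
    elif gap > margin:
        verdict = "refinement carries a real share of the gain"
    else:
        verdict = "inconclusive within the chosen margin"
    return {"refined": acc_refined, "native": acc_native, "gap": gap,
            "verdict": verdict}

# Call shape only; the accuracy values below are arbitrary placeholders.
print(refinement_ablation(lambda chunking, retention:
                          {"refined": 46.0, "native": 45.0}[chunking]))
```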

Figures

Figures reproduced from arXiv: 2605.12056 by Feidiao Yang, Hai-Tao Zheng, Jie Wang, Yuchen Deng, Yuxing Han, Zidang Cai.

Figure 1. (a) Overview of OmniRefine. (b) OmniRefine outperforms baselines on WorldSense and … [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2. Motivating analysis of native chunk boundaries in Qwen2.5-Omni. (a) Shallow-layer (layer …) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]
Figure 3. Overview of OmniRefine. Given encoded audio-visual tokens, OmniRefine first applies … [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]
Figure 4. Ablation on the audio budget. Performance across budget parameters … [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]
Figure 5. Visualization of Dynamic Pruning. Video and audio retention ratios per chunk. (Adjacent body text reports a 1.15× speedup in overall inference and a 5.26× acceleration during prefilling on the 7B model compared to the full-token baseline, alongside roughly 15 GB of GPU memory saved.)
read the original abstract

Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes OmniRefine, a training-free two-stage framework for token compression in omnimodal LLMs. Stage 1 (Correspondence-Preserving Chunk Refinement) uses frame-audio similarity and dynamic programming to adjust native chunk boundaries for better cross-modal alignment. Stage 2 (Modality-Aware Cooperative Compression) then jointly compresses video and audio tokens within each refined unit to reduce redundancy. The central empirical claim is that this yields a superior efficiency-performance trade-off versus baselines, with stable results under aggressive compression (e.g., 46.7% accuracy on WorldSense at 44% token retention, nearly matching the full-token baseline).

Significance. If the results hold after proper isolation of components, the work would be moderately significant for practical deployment of omnimodal models, as it targets a real deployment bottleneck (cross-modal misalignment during compression) with a training-free method. The emphasis on preserving complementary audio-video evidence and the promise of code release are positive for reproducibility.

major comments (2)
  1. §4 (Experiments) and §4.2 (Ablation Studies): No ablation isolates the contribution of Correspondence-Preserving Chunk Refinement. The WorldSense result (46.7% at 44% retention) is reported only for the full pipeline; there is no direct comparison against modality-aware cooperative compression applied to native (unrefined) chunks. This leaves the load-bearing assumption—that frame-audio similarity + DP produces units preserving complementary cross-modal information—untested, so the performance gain cannot be confidently attributed to the alignment step rather than the second stage alone.
  2. §4.1 (Experimental Setup): The manuscript supplies the headline WorldSense number but provides insufficient detail on exact baseline implementations, the precise definition and computation of frame-audio similarity, the dynamic programming objective function and constraints, number of runs, and any statistical tests for the reported superiority. This weakens the ability to evaluate the efficiency-performance trade-off claims.
minor comments (2)
  1. §3.1: The dynamic programming recurrence for boundary refinement would benefit from an explicit small-scale example or pseudocode to clarify how similarity scores translate into the final chunk boundaries.
  2. Figure 3 / Table 2: Axis labels and legend entries for compression ratios and accuracy could be made more consistent across panels to improve readability of the efficiency curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and completeness.

read point-by-point responses
  1. Referee: §4 (Experiments) and §4.2 (Ablation Studies): No ablation isolates the contribution of Correspondence-Preserving Chunk Refinement. The WorldSense result (46.7% at 44% retention) is reported only for the full pipeline; there is no direct comparison against modality-aware cooperative compression applied to native (unrefined) chunks. This leaves the load-bearing assumption—that frame-audio similarity + DP produces units preserving complementary cross-modal information—untested, so the performance gain cannot be confidently attributed to the alignment step rather than the second stage alone.

    Authors: We agree that an explicit ablation isolating the Correspondence-Preserving Chunk Refinement stage is needed to strengthen attribution of gains. The current results show the full pipeline's performance, but we will add a new ablation in the revised §4.2 comparing Modality-Aware Cooperative Compression applied to native chunks versus the refined chunks on WorldSense (and other benchmarks). This will directly test whether the alignment-aware refinement preserves complementary cross-modal information beyond what the second stage achieves alone. We expect the refined units to yield higher accuracy at the same retention ratio, consistent with the paper's motivation that native chunks can disrupt audio-video correspondence. revision: yes

  2. Referee: §4.1 (Experimental Setup): The manuscript supplies the headline WorldSense number but provides insufficient detail on exact baseline implementations, the precise definition and computation of frame-audio similarity, the dynamic programming objective function and constraints, number of runs, and any statistical tests for the reported superiority. This weakens the ability to evaluate the efficiency-performance trade-off claims.

    Authors: We agree more implementation details are required for reproducibility. In the revised §4.1 we will add: exact baseline reimplementations (using the same Omni-LLM backbone and matching the original papers' token selection logic at equivalent ratios); frame-audio similarity defined as average cosine similarity between video frame embeddings and corresponding audio segment embeddings extracted from the model's encoders; the DP objective as maximizing the cumulative alignment score (sum of per-chunk similarities) subject to constraints of minimum one frame and one audio segment per refined chunk plus upper bounds on chunk duration to control compute; all results averaged over 3 random seeds with standard deviation reported; and paired t-tests confirming statistical significance (p < 0.05) versus baselines on WorldSense. These additions will allow readers to fully evaluate the trade-off claims. revision: yes
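Taking the rebuttal's stated definitions at face value (average cosine similarity between frame and audio embeddings, a DP that maximizes the summed per-chunk alignment score under a minimum-content constraint and a duration cap), the refinement stage admits a compact sketch. This is an illustrative reconstruction, not the authors' code: it assumes frame and audio embeddings are already co-indexed per time step and projected into a shared space, scores each chunk against its own mean audio embedding, and uses a hypothetical max_len in place of the unspecified duration bound (under this simplification the one-frame, one-segment minimum collapses to a minimum chunk length of one).

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def chunk_score(frames, audio, start, end):
    """Alignment score of chunk [start, end): average cosine similarity
    between each frame embedding and the chunk's mean audio embedding."""
    audio_centroid = audio[start:end].mean(axis=0)
    sims = [cosine(frames[i], audio_centroid) for i in range(start, end)]
    return sum(sims) / len(sims)

def refine_boundaries(frames, audio, max_len=8):
    """Choose chunk boundaries that maximize the summed per-chunk alignment
    score, each chunk covering between 1 and max_len time steps.

    frames, audio: (N, d) per-time-step embeddings in a shared space.
    Returns boundary indices b such that chunk k spans [b[k], b[k+1]).
    """
    n = len(frames)
    best = [float("-inf")] * (n + 1)   # best[j] = best score for the first j steps
    parent = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + chunk_score(frames, audio, start, end)
            if cand > best[end]:
                best[end], parent[end] = cand, start
    # Trace parent pointers back from n to recover the boundary list.
    bounds, j = [n], n
    while j > 0:
        j = parent[j]
        bounds.append(j)
    return bounds[::-1]

# Toy example: 12 time steps with loosely aligned 16-dim frame/audio embeddings.
rng = np.random.default_rng(0)
V = rng.normal(size=(12, 16))
A = V + 0.1 * rng.normal(size=(12, 16))
print(refine_boundaries(V, A, max_len=4))
```

The recurrence is the standard segmentation DP: best[end] looks back at most max_len steps and adds the score of the candidate chunk, so the complexity is O(N · max_len) states times the cost of scoring a chunk.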

Circularity Check

0 steps flagged

No circularity in algorithmic derivation or claims

full rationale

The paper describes a training-free two-stage algorithmic procedure (Correspondence-Preserving Chunk Refinement via frame-audio similarity and dynamic programming, followed by Modality-Aware Cooperative Compression) evaluated directly on external benchmarks such as WorldSense. No equations, fitted parameters presented as predictions, self-definitional quantities, or load-bearing self-citations appear in the provided text that would reduce the reported performance retention (e.g., 46.7% at 44% retention) to an input by construction. The central claims rest on empirical results from independent datasets rather than any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted constants, or new postulated entities are mentioned; the approach relies on standard similarity metrics and dynamic programming whose details are not supplied in the abstract.

pith-pipeline@v0.9.0 · 5530 in / 1112 out tokens · 59652 ms · 2026-05-13T04:46:50.439143+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 14 internal anchors

  1. [1]

    Ming-omni: A unified multimodal model for perception and generation.arXiv preprint arXiv:2506.09344, 2025

    Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, et al. Ming-omni: A unified multimodal model for perception and generation.arXiv preprint arXiv:2506.09344, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

  4. [4]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

  5. [5]

    Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025

    Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  7. [7]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  8. [8]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  9. [9]

    Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

  10. [10]

    Beyond boundary frames: Audio-visual semantic guidance for context-aware video interpolation.arXiv preprint arXiv:2512.03590, 2025

    Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, and Yuxing Han. Beyond boundary frames: Audio-visual semantic guidance for context-aware video interpolation.arXiv preprint arXiv:2512.03590, 2025

  11. [11]

    Avatarsync: Rethinking talking-head animation through autoregressive perspective.arXiv e-prints, pages arXiv–2509, 2025

    Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, and Yuxing Han. Avatarsync: Rethinking talking-head animation through autoregressive perspective.arXiv e-prints, pages arXiv–2509, 2025

  12. [12]

    Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models.arXiv preprint arXiv:2602.04804, 2026

    Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, et al. Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models.arXiv preprint arXiv:2602.04804, 2026

  13. [13]

    Vita: Towards open-source interactive omni multimodal llm

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, et al. Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  15. [15]

    Framefusion: Combining similarity and importance for video token reduction on large vision language models. arXiv preprint arXiv:2501.01986, 2024

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large vision language models. arXiv preprint arXiv:2501.01986, 2024

  16. [16]

    Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

    Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

  17. [17]

    Echoing-pixels: Cross-modal adaptive token reduction for efficient audio-visual llms. arXiv preprint arXiv:2512.10324, 2025

    Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, and Jingjing Chen. Echoing-pixels: Cross-modal adaptive token reduction for efficient audio-visual llms. arXiv preprint arXiv:2512.10324, 2025

  18. [18]

    WorldSense: Evaluating Real-World Omnimodal Understanding for Multimodal LLMs

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326, 2025

  19. [19]

    Prunevid: Visual token pruning for efficient video large language models

    Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19959–19973, 2025

  20. [20]

    Multi-granular spatio-temporal token merging for training-free acceleration of video llms

    Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, and Minho Shim. Multi-granular spatio-temporal token merging for training-free acceleration of video llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23990–24000, 2025

  21. [21]

    Acckv: Towards efficient audio-video llms inference via adaptive-focusing and cross-calibration kv cache optimization

    Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. Acckv: Towards efficient audio-video llms inference via adaptive-focusing and cross-calibration kv cache optimization. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 5494–5502, 2026

  22. [22]

    Fastav: Efficient token pruning for audio-visual large language model inference.arXiv preprint arXiv:2601.13143, 2026

    Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, and Joon Son Chung. Fastav: Efficient token pruning for audio-visual large language model inference.arXiv preprint arXiv:2601.13143, 2026

  23. [23]

    Token pruning in audio transformers: Optimizing performance and decoding patch importance.arXiv preprint arXiv:2504.01690, 2025

    Taehan Lee and Hyukjun Lee. Token pruning in audio transformers: Optimizing performance and decoding patch importance.arXiv preprint arXiv:2504.01690, 2025

  24. [24]

    Dash: Dynamic audio-driven semantic chunking for efficient omnimodal token compression.arXiv preprint arXiv:2603.15685, 2026

    Bingzhou Li and Tao Huang. Dash: Dynamic audio-driven semantic chunking for efficient omnimodal token compression.arXiv preprint arXiv:2603.15685, 2026

  25. [25]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  27. [27]

    Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

  28. [28]

    Baichuan-omni technical report.arXiv preprint arXiv:2410.08565, 2024

    Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, et al. Baichuan-omni technical report.arXiv preprint arXiv:2410.08565, 2024

  29. [29]

    Accelerating transducers through adjacent token merging.arXiv preprint arXiv:2306.16009, 2023

    Yuang Li, Yu Wu, Jinyu Li, and Shujie Liu. Accelerating transducers through adjacent token merging.arXiv preprint arXiv:2306.16009, 2023

  30. [30]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

  31. [31]

    Speechprune: Context-aware token pruning for speech information retrieval

    Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai Helen Li, and Yiran Chen. Speechprune: Context-aware token pruning for speech information retrieval. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

  32. [32]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  33. [33]

    Llavanext: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

  34. [34]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

  35. [35]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  36. [36]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025

  37. [37]

    Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334, 2025

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334, 2025

  38. [38]

    When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression Across Images, Videos, and Audios

    Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang. When tokens talk too much: A survey of multimodal long-context token compression across images, videos, and audios. arXiv preprint arXiv:2507.20198, 2025

  39. [39]

    FastVid: Dynamic Density Pruning for Fast Video Large Language Models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187, 2025

  40. [40]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

  41. [41]

    Audio-visual llm for video understanding

    Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4246–4255, 2025

  42. [42]

    video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

  43. [43]

    Tokencarve: Information-preserving visual token compression in multimodal large language models.arXiv preprint arXiv:2503.10501, 2025

    Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. Tokencarve: Information-preserving visual token compression in multimodal large language models.arXiv preprint arXiv:2503.10501, 2025

  44. [44]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025

  45. [45]

    OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

    Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Huan Wang, et al. Omnizip: Audio-guided dynamic token compression for fast omnimodal large language models. arXiv preprint arXiv:2511.14582, 2025

  46. [46]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

  47. [47]

    Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue.arXiv preprint arXiv:2510.13747, 2025

    Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, et al. Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue.arXiv preprint arXiv:2510.13747, 2025

  48. [48]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  49. [49]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  50. [50]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190, 2024

    Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190, 2024

  51. [51]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction.arXiv preprint arXiv:2410.17247, 2024

  52. [52]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URLhttps://arxiv.org/abs/2503.20215

  53. [53]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  54. [54]

    Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

    Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19803–19813, 2025

  55. [55]

    Humanomniv2: From understanding to omni-modal reasoning with context, 2025

    Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

  56. [56]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025

  57. [57]

    Audio-centric video understanding benchmark without text shortcut

    Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, and Chao Zhang. Audio-centric video understanding benchmark without text shortcut. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6580–6598, 2025

  58. [58]

    Omnivinci: Enhancing architecture and data for omni-modal understanding llm.arXiv preprint arXiv:2510.15870, 2025

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, et al. Omnivinci: Enhancing architecture and data for omni-modal understanding llm.arXiv preprint arXiv:2510.15870, 2025

  59. [59]

    Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22128–22136, 2025

  60. [60]

    vllm-omni: Fully disaggregated serving for any-to-any multimodal models.arXiv preprint arXiv:2602.02204, 2026

    Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models.arXiv preprint arXiv:2602.02204, 2026

  61. [61]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025

  62. [62]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024