pith · machine review for the scientific record

arxiv: 2604.02689 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D MLLMs · visual token pruning · token reduction · efficient inference · debiased estimation · adaptive rebalancing · 3D vision-language benchmarks

The pith

Efficient3D prunes visual tokens in 3D MLLMs using debiased estimates and scene-adaptive balancing to speed inference while raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Efficient3D to reduce the computational cost of 3D multimodal large language models by removing less essential visual tokens from their inputs. It builds a Debiased Visual Token Importance Estimator that draws on attention patterns from early layers to judge which tokens matter most. An Adaptive Token Rebalancing step then varies the pruning rate according to how complex each 3D scene appears. Experiments across five benchmarks show the pruned models often outperform the original full-token versions.

Core claim

Efficient3D shows that debiased token-importance scoring, combined with pruning strength rebalanced dynamically by scene complexity, yields faster 3D MLLM inference while preserving semantic content and scoring higher on standard 3D vision-language tasks than unpruned baselines.

What carries the argument

Debiased Visual Token Importance Estimator (DVTIE) that aggregates attention while accounting for shallow initial layers to generate reliable token importance scores, together with Adaptive Token Rebalancing that adjusts pruning intensity per scene.
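As a concrete illustration, the estimator's attention-aggregation step might look like the sketch below. This is an editorial reconstruction, not the paper's code: the array shapes, the down-weighting of shallow layers, and the names `token_importance` and `prune` are all assumptions.

```python
import numpy as np

def token_importance(attn, shallow_layers=3, shallow_weight=0.5):
    """Score visual tokens by aggregating text-to-visual attention.

    attn: (num_layers, num_text_tokens, num_visual_tokens) attention
    weights from text queries to visual keys. Shallow layers are kept
    but down-weighted, one plausible reading of "debiased" aggregation.
    """
    per_layer = attn.mean(axis=1)               # (num_layers, num_visual_tokens)
    weights = np.ones(attn.shape[0])
    weights[:shallow_layers] = shallow_weight   # damp the biased early layers
    weights /= weights.sum()
    return weights @ per_layer                  # (num_visual_tokens,)

def prune(tokens, scores, keep_ratio=0.35):
    """Keep the top-k visual tokens by importance, preserving order."""
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]
```

With a 65% pruning ratio (keep_ratio=0.35), the language model's sequence length, and hence its attention cost, drops roughly threefold on the visual side.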

If this is right

  • Token reduction becomes context-aware for varying 3D scene difficulties.
  • Inference runs faster on resource-limited hardware without accuracy loss.
  • Semantic completeness is maintained across captioning and question-answering tasks.
  • The same framework improves results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pruning logic could be tested on 2D image-language models to check broader applicability.
  • Early token reduction may lower memory footprint in addition to compute time.
  • Combining the method with quantization could produce further efficiency gains on edge devices.

Load-bearing premise

That including attention signals from shallow initial layers produces more trustworthy rankings of which visual tokens are important.

What would settle it

An experiment in which replacing the shallow-layer component of the importance estimator with a standard deep-layer version yields equal or higher benchmark scores would show the debiased step is not required.

Figures

Figures reproduced from arXiv: 2604.02689 by Guangliang Cheng, Jimin Xiao, Siyue Yu, Yuhui Lin, Yuxing Yang.

Figure 1. (a) illustrates that the initial layers in the 3D MLLM …
Figure 2. Overview of the Efficient3D framework. First, we perform unpruned training on a pretrained 3D MLLM and extract importance scores of visual tokens. Next, we use the visual importance scores as supervision targets for training the proposed DVTIE. During inference, the 3D MLLM leverages the predicted importance scores from the DVTIE to perform visual token pruning. Furthermore, we propose an ATR strategy …
Figure 3. Visualization of the DVTIE network under different visual token pruning ratios. The results at average pruning ratios of 35%, 65%, …
Original abstract

Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
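The ATR idea from the abstract, mapping scene complexity to pruning strength, can be sketched as follows. The entropy-based complexity proxy and the ratio bounds are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def adaptive_keep_ratio(scores, r_min=0.35, r_max=0.65):
    """Pick a per-scene keep ratio from token-importance scores.

    Complexity is proxied here by the normalized entropy of the
    importance distribution: diffuse importance (a cluttered scene)
    keeps more tokens; peaked importance (a simple scene) keeps fewer.
    """
    p = scores / scores.sum()
    complexity = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # in [0, 1]
    return r_min + (r_max - r_min) * complexity

# A uniform importance distribution (hard scene) keeps near the maximum
# ratio; a sharply peaked one (easy scene) keeps near the minimum.
hard = adaptive_keep_ratio(np.ones(100))
easy = adaptive_keep_ratio(np.array([100.0] + [1e-6] * 99))
```

Any monotone map from a complexity signal to a keep ratio would serve the same role; the point is that the pruning budget is decided per scene rather than fixed globally.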

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Efficient3D, a unified framework for visual token pruning in 3D MLLMs. It introduces the Debiased Visual Token Importance Estimator (DVTIE), which aggregates attention scores from shallow initial layers to estimate token importance more reliably, and the Adaptive Token Rebalancing (ATR) strategy, which dynamically modulates pruning ratios according to scene complexity. Experiments across five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) report that the method outperforms unpruned baselines, including a +2.57% CIDEr gain on Scan2Cap, while reducing inference cost.

Significance. If the reported gains are reproducible and the causal contribution of DVTIE is isolated from ATR and training variance, the work would offer a practical route to deploying 3D MLLMs on edge devices. The open-sourced code strengthens the contribution by enabling direct verification of the efficiency claims.

major comments (3)
  1. [Experiments] Experiments section (and abstract): the +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.
  2. [§3.2] §3.2 (DVTIE): the central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.
  3. [Table 2 / §4.2] Table 2 / §4.2: no comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.
minor comments (2)
  1. [§3.2] Notation for the attention aggregation in DVTIE is introduced without an explicit equation; adding a numbered equation would clarify the precise weighting of shallow vs. deep layers.
  2. [Abstract / §4] The abstract states “superior performance compared with unpruned baselines” but the main text should explicitly list the exact unpruned model variants (e.g., LLaVA-3D-7B) and their CIDEr scores for each dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the paper. We address each major point below and will revise the manuscript to improve experimental reporting, add ablations, and include comparisons.

Point-by-point responses
  1. Referee: Experiments section (and abstract): the +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.

    Authors: We acknowledge the need for statistical rigor. In the revised manuscript we will rerun the Scan2Cap experiments over three random seeds, report mean and standard deviation, and include a paired t-test against the unpruned baseline to confirm the gain exceeds training variance. revision: yes
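The promised check reduces to a paired t-test over seeds. A minimal sketch; the scores below are invented placeholders, not results from the paper, and only the procedure is the point.

```python
import math

# Hypothetical CIDEr scores over three seeds (illustrative numbers only).
pruned   = [78.1, 77.6, 78.4]
unpruned = [75.5, 75.9, 75.2]

diffs = [a - b for a, b in zip(pruned, unpruned)]
n = len(diffs)
mean_d = sum(diffs) / n
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
t_stat = mean_d / math.sqrt(var_d / n)                   # paired t statistic

# Two-sided critical value of Student's t with df = n - 1 = 2 at alpha = 0.05.
T_CRIT = 4.303
significant = abs(t_stat) > T_CRIT
print(f"mean gain {mean_d:.2f}, t = {t_stat:.2f}, significant: {significant}")
```

With only three seeds the critical value is large (4.303), so a gain must be very stable across runs to register as significant, which is exactly the bar the referee is asking the authors to clear.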

  2. Referee: §3.2 (DVTIE): the central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.

    Authors: We will add an ablation table in §3.2 comparing DVTIE variants that aggregate attention from layers 1–3 versus deeper layers only, isolating the contribution of shallow-layer aggregation. We do not possess human-annotated token relevance labels, so a direct correlation study is not feasible; however, we will include qualitative attention visualizations and additional quantitative metrics (e.g., token retention rates on semantically critical objects) to support the reliability claim. We will also present an ablation that applies ATR alone versus DVTIE+ATR to clarify their separate effects. revision: partial

  3. Referee: Table 2 / §4.2: no comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.

    Authors: We agree that matched-FLOPs comparisons are essential. In the revision we will adapt FastV and ToMe to the 3D MLLM setting, enforce identical FLOPs budgets, and add these baselines to Table 2 together with a discussion in §4.2 that positions DVTIE+ATR relative to prior work. revision: yes
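Matching FLOPs budgets across pruning methods is mostly bookkeeping over token counts. A rough sketch using a standard decoder FLOPs approximation (quadratic attention term plus linear projection/MLP terms); the constants and the `keep_ratio_for_budget` helper are assumptions for illustration, not the paper's accounting.

```python
def decoder_flops(num_tokens, num_layers=32, d_model=4096):
    """Approximate forward FLOPs of a decoder stack over num_tokens tokens:
    projections/MLP scale as N*d^2, attention score/value mixing as N^2*d."""
    n, d = num_tokens, d_model
    per_layer = 12 * n * d * d + 2 * n * n * d
    return num_layers * per_layer

def keep_ratio_for_budget(full_tokens, target_flops, **kw):
    """Largest keep ratio (in 1% steps) whose token count fits the budget,
    so competing pruning methods can be compared at matched cost."""
    for pct in range(100, 0, -1):
        if decoder_flops(full_tokens * pct // 100, **kw) <= target_flops:
            return pct / 100
    return 0.01
```

Given a budget, each baseline (FastV, ToMe, DVTIE+ATR) would be run at the keep ratio this returns, so that accuracy differences cannot be attributed to unequal compute.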

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical engineering contribution

Full rationale

The paper introduces DVTIE and ATR as modules for token pruning in 3D MLLMs, with performance validated on benchmarks like Scan2Cap (+2.57% CIDEr). No equations, derivations, or self-referential definitions are present that reduce any prediction or result to fitted inputs by construction. Claims rely on experimental comparisons to unpruned baselines rather than tautological redefinitions or self-citation chains. The framework is presented as a practical engineering solution without load-bearing self-citations, ansatzes, or uniqueness theorems imported from prior author work. This is the standard non-finding for empirical pruning papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework introduces two named modules whose internal mechanics are not detailed here.

pith-pipeline@v0.9.0 · 5584 in / 1009 out tokens · 42395 ms · 2026-05-13T19:42:08.841467+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 2 internal anchors

  1. [1]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In CVPR, 2022.

  2. [2]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.

  3. [3]

    Madtp: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer

    Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, and Tao Chen. Madtp: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. In CVPR, 2024.

  4. [4]

    Honeybee: Locality-enhanced projector for multimodal llm

    Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. In CVPR, 2024.

  5. [5]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In ECCV, 2020.

  6. [6]

    Llavolta: Efficient multi-modal models via stage-wise visual context compression

    Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. In NeurIPS, 2024.

  7. [7]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.

  8. [8]

    Diffrate: Differentiable compression rate for efficient vision transformers

    Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. Diffrate: Differentiable compression rate for efficient vision transformers. In ICCV, 2023.

  9. [9]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning and planning. In CVPR, 2024.

  10. [10]

    Grounded 3d-llm with referent tokens

    Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, and Jiangmiao Pang. Grounded 3d-llm with referent tokens. arXiv preprint arXiv:2405.10370, 2024.

  11. [11]

    Scan2cap: Context-aware dense captioning in rgb-d scans

    Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In CVPR, 2021.

  12. [12]

    3D-AffordanceLLM: Harnessing large language models for open-vocabulary affordance detection in 3d worlds

    Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3D-AffordanceLLM: Harnessing large language models for open-vocabulary affordance detection in 3d worlds. In ICLR.

  13. [13]

    Mobilevlm: A fast, strong and open vision language assistant for mobile devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.

  14. [14]

    Mobilevlm v2: Faster and stronger baseline for vision language model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

  15. [15]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2019.

  16. [16]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.

  17. [17]

    Adaptive token sampling for efficient vision transformers

    Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In ECCV, 2022.

  18. [18]

    Scene-llm: Extending language model for 3d visual understanding and reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. In WACV, 2025.

  19. [19]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023.

  20. [20]

    Imagebind-llm: Multi-modality instruction tuning

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023.

  21. [21]

    Dynamic neural networks: A survey

    Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. TPAMI, 44(11):7436–7456, 2021.

  22. [22]

    Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration

    Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024.

  23. [23]

    Latency-aware unified dynamic networks for efficient image recognition

    Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, and Gao Huang. Latency-aware unified dynamic networks for efficient image recognition. TPAMI.

  24. [24]

    Segpoint: Segment any point cloud via large language model

    Shuting He, Henghui Ding, Xudong Jiang, and Bihan Wen. Segpoint: Segment any point cloud via large language model. In ECCV, 2024.

  25. [25]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. NeurIPS, 2023.

  26. [26]

    Chat-3d v2: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tai Wang, Runsen Xu, and Zhou Zhao. Chat-3d v2: Bridging 3d scene and large language models with object identifiers. In NeurIPS, 2024.

  27. [27]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers. In NeurIPS.

  28. [28]

    Zero-shot 3d question answering via voxel-based dynamic token compression

    Hsiang-Wei Huang, Fu-Chen Chen, Wenhao Chai, Che-Chun Su, Lu Xia, Sanghun Jung, Cheng-Yen Yang, Jenq-Neng Hwang, Min Sun, and Cheng-Hao Kuo. Zero-shot 3d question answering via voxel-based dynamic token compression. In CVPR, 2025.

  29. [29]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In ICML, 2024.

  30. [30]

    Ivtp: Instruction-guided visual token pruning for large vision-language models

    Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. Ivtp: Instruction-guided visual token pruning for large vision-language models. In ECCV, 2024.

  31. [31]

    Resolving multi-condition confusion for finetuning-free personalized image generation

    Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. In AAAI.

  32. [32]

    Dense object grounding in 3d scenes

    Wencan Huang, Daizong Liu, and Wei Hu. Dense object grounding in 3d scenes. In ACM MM, pages 5017–5026.

  33. [33]

    Advancing 3d object grounding beyond a single 3d scene

    Wencan Huang, Daizong Liu, and Wei Hu. Advancing 3d object grounding beyond a single 3d scene. In ACM MM, pages 7995–8004, 2024.

  34. [34]

    Fast3d: Accelerating 3d multi-modal large language models for efficient 3d scene understanding

    Wencan Huang, Daizong Liu, and Wei Hu. Fast3d: Accelerating 3d multi-modal large language models for efficient 3d scene understanding. In ACM MM, 2025.

  35. [35]

    Dynamic diffusion transformer for accurate image generation

    Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, and Zhendong Mao. Dynamic diffusion transformer for accurate image generation. In CVPR, 2025.

  36. [36]

    What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph

    Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph. In AAAI, 2025.

  37. [37]

    Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation

    Shuo Jin, Siyue Yu, Bingfeng Zhang, Mingjie Sun, Yi Dong, and Jimin Xiao. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In ICCV, 2025.

  38. [38]

    Turbo: Informativity-driven acceleration plug-in for vision-language large models

    Chen Ju, Haicheng Wang, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao, and Bo Zheng. Turbo: Informativity-driven acceleration plug-in for vision-language large models. In ECCV, 2024.

  39. [39]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In CVPR, 2024.

  40. [40]

    Beyond token compression: A training-free reduction framework for efficient visual processing in mllms

    Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, and Lianwen Jin. Beyond token compression: A training-free reduction framework for efficient visual processing in mllms. arXiv preprint arXiv:2501.19036, 2025.

  41. [41]

    Tokenpacker: Efficient visual projector for multimodal llm

    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. IJCV, 133:6794–6812.

  42. [42]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV.

  43. [43]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.

  44. [44]

    Boosting multimodal large language models with visual tokens withdrawal for rapid inference

    Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025.

  45. [45]

    Seeing is not believing: Adversarial natural object optimization for hard-label 3d scene attacks

    Daizong Liu and Wei Hu. Seeing is not believing: Adversarial natural object optimization for hard-label 3d scene attacks. In CVPR, 2025.

  46. [46]

    A survey on text-guided 3d visual grounding: elements, recent advances, and future directions

    Daizong Liu, Yang Liu, Wencan Huang, and Wei Hu. A survey on text-guided 3d visual grounding: elements, recent advances, and future directions. arXiv preprint arXiv:2406.05785, 2024.

  47. [47]

    A survey of attacks on large vision-language models: Resources, advances, and future trends

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. A survey of attacks on large vision-language models: Resources, advances, and future trends. arXiv preprint arXiv:2407.07403, 2024.

  48. [48]

    Pandora’s box: Towards building universal attackers against real-world large vision-language models

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. Pandora’s box: Towards building universal attackers against real-world large vision-language models. In NeurIPS, 2024.

  49. [49]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.

  50. [50]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  51. [51]

    Sqa3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. In ICLR, 2023.

  52. [52]

    Adavit: Adaptive vision transformers for efficient image recognition

    Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. Adavit: Adaptive vision transformers for efficient image recognition. In CVPR, pages 12309–12318, 2022.

  53. [53]

    Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In NeurIPS, 2024.

  54. [54]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.

  55. [55]

    Shapellm: Universal 3d object understanding for embodied interaction

    Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. In ECCV, 2024.

  56. [56]

    Gpt4point: A unified framework for point-language understanding and generation

    Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. Gpt4point: A unified framework for point-language understanding and generation. In CVPR, 2024.

  57. [57]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.

  58. [58]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. NeurIPS, 34:13937–13949, 2021.

  59. [59]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong-Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.

  60. [60]

    Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers

    Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang. Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers. arXiv preprint arXiv:2305.17455, 2023.

  61. [61]

    Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors

    Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, and Min Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. In ACM MM, 2024.

  62. [62]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  63. [63]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  64. [64]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.

  65. [65]

    [cls] token tells everything needed for training-free efficient mllms

    Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [cls] token tells everything needed for training-free efficient mllms. arXiv preprint arXiv:2412.05819, 2024.

  66. [66]

    Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers

    Hongjie Wang, Bhishma Dedhia, and Niraj K Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In CVPR, 2024.

  67. [67]

    Gpt4video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation

    Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, and Zhaopeng Tu. Gpt4video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation. In ACM MM, 2024.

  68. [68]

    Data-efficiently learn large language model for universal 3d scene perception

    Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, and Zhou Zhao. Data-efficiently learn large language model for universal 3d scene perception. In ACL, 2025. 2

  69. [69]

    Efficient multimodal large language models via progressive consistency distillation

    Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, and Linfeng Zhang. Efficient multimodal large language models via progressive consistency distillation. arXiv preprint arXiv:2510.00515, 2025. 5

  70. [70]

    Pointllm: Empowering large language models to understand point clouds

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. In ECCV, 2024. 1, 2

  71. [71]

    DeCo: Decoupling token compression from semantic abstraction in multimodal large language models

    Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. DeCo: Decoupling token compression from semantic abstraction in multimodal large language models. arXiv preprint arXiv:2405.20985, 2024. 1, 2

  72. [72]

    Fit and prune: Fast and training-free visual token pruning for multimodal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multimodal large language models. In AAAI, 2025. 3

  73. [73]

    VoCo-LLaMA: Towards vision compression with large language models

    Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, and Yansong Tang. VoCo-LLaMA: Towards vision compression with large language models. In CVPR, 2025. 1, 2

  74. [74]

    Excap3d: Expressive 3d scene understanding via object captioning with varying detail

    Chandan Yeshwanth, David Rozenberszki, and Angela Dai. Excap3d: Expressive 3d scene understanding via object captioning with varying detail. In ICCV, 2025. 2

  75. [75]

    Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning

    Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, and Jianke Zhu. Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning. In CVPR,

  76. [76]

    3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding

    Tatiana Zemskova and Dmitry Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. In ICCV, 2025. 2

  77. [77]

    Multi3drefer: Grounding text description to multiple 3d objects

    Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3drefer: Grounding text description to multiple 3d objects. In ICCV, 2023. 6

  78. [78]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In ICML, 2025. 3

  79. [79]

    Accelerating multimodal large language models by searching optimal vision token reduction

    Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, and Licheng Yu. Accelerating multimodal large language models by searching optimal vision token reduction. In CVPR, 2025. 3

  80. [80]

    Dynamic tuning towards parameter and inference efficiency for vit adaptation

    Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, and Yang You. Dynamic tuning towards parameter and inference efficiency for vit adaptation. In NeurIPS, 2024. 2

Showing first 80 references.