pith · machine review for the scientific record

arxiv: 2604.02689 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D MLLMs · visual token pruning · token reduction · efficient inference · debiased estimation · adaptive rebalancing · 3D vision-language benchmarks

The pith

Efficient3D prunes visual tokens in 3D MLLMs using debiased estimates and scene-adaptive balancing to speed inference while raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Efficient3D to reduce the computational cost of 3D multimodal large language models by removing less essential visual tokens from their inputs. It builds a Debiased Visual Token Importance Estimator that draws on attention patterns from early layers to judge which tokens matter most. An Adaptive Token Rebalancing step then varies the pruning rate according to how complex each 3D scene appears. Experiments across five benchmarks show the pruned models often outperform the original full-token versions.

Core claim

Efficient3D shows that debiased token-importance scoring, combined with pruning strength rebalanced dynamically by scene complexity, yields faster 3D MLLM inference while preserving semantic content and scoring higher on standard 3D vision-language tasks than unpruned baselines.

What carries the argument

Debiased Visual Token Importance Estimator (DVTIE) that aggregates attention while accounting for shallow initial layers to generate reliable token importance scores, together with Adaptive Token Rebalancing that adjusts pruning intensity per scene.
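As a concrete illustration, the estimator's attention-aggregation step might look like the sketch below. This is an editorial reconstruction, not the paper's code: the array shapes, the down-weighting of shallow layers, and the names `token_importance` and `prune` are all assumptions.

```python
import numpy as np

def token_importance(attn, shallow_layers=3, shallow_weight=0.5):
    """Score visual tokens by aggregating text-to-visual attention.

    attn: (num_layers, num_text_tokens, num_visual_tokens) attention
    weights from text queries to visual keys. Shallow layers are kept
    but down-weighted, one plausible reading of "debiased" aggregation.
    """
    per_layer = attn.mean(axis=1)               # (num_layers, num_visual_tokens)
    weights = np.ones(attn.shape[0])
    weights[:shallow_layers] = shallow_weight   # damp the biased early layers
    weights /= weights.sum()
    return weights @ per_layer                  # (num_visual_tokens,)

def prune(tokens, scores, keep_ratio=0.35):
    """Keep the top-k visual tokens by importance, preserving order."""
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]
```

With a 65% pruning ratio (keep_ratio=0.35), the language model's sequence length, and hence its attention cost, drops roughly threefold on the visual side.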

If this is right

  • Token reduction becomes context-aware for varying 3D scene difficulties.
  • Inference runs faster on resource-limited hardware without accuracy loss.
  • Semantic completeness is maintained across captioning and question-answering tasks.
  • The same framework improves results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pruning logic could be tested on 2D image-language models to check broader applicability.
  • Early token reduction may lower memory footprint in addition to compute time.
  • Combining the method with quantization could produce further efficiency gains on edge devices.

Load-bearing premise

That including attention signals from shallow initial layers produces more trustworthy rankings of which visual tokens are important.

What would settle it

An experiment in which replacing the shallow-layer component of the importance estimator with a standard deep-layer version yields equal or higher benchmark scores would show the debiased step is not required.

Figures

Figures reproduced from arXiv: 2604.02689 by Guangliang Cheng, Jimin Xiao, Siyue Yu, Yuhui Lin, Yuxing Yang.

Figure 1. (a) illustrates that the initial layers in the 3D MLLM …
Figure 2. Overview of the Efficient3D framework. First, we perform unpruned training on a pretrained 3D MLLM and extract importance scores of visual tokens. Next, we use the visual importance scores as supervision targets for training the proposed DVTIE. During inference, the 3D MLLM leverages the predicted importance scores from the DVTIE to perform visual token pruning. Furthermore, we propose an ATR strategy …
Figure 3. Visualization of the DVTIE network under different visual token pruning ratios. The results at average pruning ratios of 35%, 65%, …
Original abstract

Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
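The ATR idea from the abstract, mapping scene complexity to pruning strength, can be sketched as follows. The entropy-based complexity proxy and the ratio bounds are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def adaptive_keep_ratio(scores, r_min=0.35, r_max=0.65):
    """Pick a per-scene keep ratio from token-importance scores.

    Complexity is proxied here by the normalized entropy of the
    importance distribution: diffuse importance (a cluttered scene)
    keeps more tokens; peaked importance (a simple scene) keeps fewer.
    """
    p = scores / scores.sum()
    complexity = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # in [0, 1]
    return r_min + (r_max - r_min) * complexity

# A uniform importance distribution (hard scene) keeps near the maximum
# ratio; a sharply peaked one (easy scene) keeps near the minimum.
hard = adaptive_keep_ratio(np.ones(100))
easy = adaptive_keep_ratio(np.array([100.0] + [1e-6] * 99))
```

Any monotone map from a complexity signal to a keep ratio would serve the same role; the point is that the pruning budget is decided per scene rather than fixed globally.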

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Efficient3D, a unified framework for visual token pruning in 3D MLLMs. It introduces the Debiased Visual Token Importance Estimator (DVTIE), which aggregates attention scores from shallow initial layers to estimate token importance more reliably, and the Adaptive Token Rebalancing (ATR) strategy, which dynamically modulates pruning ratios according to scene complexity. Experiments across five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) report that the method outperforms unpruned baselines, including a +2.57% CIDEr gain on Scan2Cap, while reducing inference cost.

Significance. If the reported gains are reproducible and the causal contribution of DVTIE is isolated from ATR and training variance, the work would offer a practical route to deploying 3D MLLMs on edge devices. The open-sourced code strengthens the contribution by enabling direct verification of the efficiency claims.

major comments (3)
  1. [Experiments] Experiments section (and abstract): the +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.
  2. [§3.2] §3.2 (DVTIE): the central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.
  3. [Table 2 / §4.2] Table 2 / §4.2: no comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.
minor comments (2)
  1. [§3.2] Notation for the attention aggregation in DVTIE is introduced without an explicit equation; adding a numbered equation would clarify the precise weighting of shallow vs. deep layers.
  2. [Abstract / §4] The abstract states “superior performance compared with unpruned baselines” but the main text should explicitly list the exact unpruned model variants (e.g., LLaVA-3D-7B) and their CIDEr scores for each dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the paper. We address each major point below and will revise the manuscript to improve experimental reporting, add ablations, and include comparisons.

Point-by-point responses
  1. Referee: Experiments section (and abstract): the +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.

    Authors: We acknowledge the need for statistical rigor. In the revised manuscript we will rerun the Scan2Cap experiments over three random seeds, report mean and standard deviation, and include a paired t-test against the unpruned baseline to confirm the gain exceeds training variance. revision: yes
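The promised check reduces to a paired t-test over seeds. A minimal sketch; the scores below are invented placeholders, not results from the paper, and only the procedure is the point.

```python
import math

# Hypothetical CIDEr scores over three seeds (illustrative numbers only).
pruned   = [78.1, 77.6, 78.4]
unpruned = [75.5, 75.9, 75.2]

diffs = [a - b for a, b in zip(pruned, unpruned)]
n = len(diffs)
mean_d = sum(diffs) / n
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
t_stat = mean_d / math.sqrt(var_d / n)                   # paired t statistic

# Two-sided critical value of Student's t with df = n - 1 = 2 at alpha = 0.05.
T_CRIT = 4.303
significant = abs(t_stat) > T_CRIT
print(f"mean gain {mean_d:.2f}, t = {t_stat:.2f}, significant: {significant}")
```

With only three seeds the critical value is large (4.303), so a gain must be very stable across runs to register as significant, which is exactly the bar the referee is asking the authors to clear.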

  2. Referee: §3.2 (DVTIE): the central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.

    Authors: We will add an ablation table in §3.2 comparing DVTIE variants that aggregate attention from layers 1–3 versus deeper layers only, isolating the contribution of shallow-layer aggregation. We do not possess human-annotated token relevance labels, so a direct correlation study is not feasible; however, we will include qualitative attention visualizations and additional quantitative metrics (e.g., token retention rates on semantically critical objects) to support the reliability claim. We will also present an ablation that applies ATR alone versus DVTIE+ATR to clarify their separate effects. revision: partial

  3. Referee: Table 2 / §4.2: no comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.

    Authors: We agree that matched-FLOPs comparisons are essential. In the revision we will adapt FastV and ToMe to the 3D MLLM setting, enforce identical FLOPs budgets, and add these baselines to Table 2 together with a discussion in §4.2 that positions DVTIE+ATR relative to prior work. revision: yes
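Matching FLOPs budgets across pruning methods is mostly bookkeeping over token counts. A rough sketch using a standard decoder FLOPs approximation (quadratic attention term plus linear projection/MLP terms); the constants and the `keep_ratio_for_budget` helper are assumptions for illustration, not the paper's accounting.

```python
def decoder_flops(num_tokens, num_layers=32, d_model=4096):
    """Approximate forward FLOPs of a decoder stack over num_tokens tokens:
    projections/MLP scale as N*d^2, attention score/value mixing as N^2*d."""
    n, d = num_tokens, d_model
    per_layer = 12 * n * d * d + 2 * n * n * d
    return num_layers * per_layer

def keep_ratio_for_budget(full_tokens, target_flops, **kw):
    """Largest keep ratio (in 1% steps) whose token count fits the budget,
    so competing pruning methods can be compared at matched cost."""
    for pct in range(100, 0, -1):
        if decoder_flops(full_tokens * pct // 100, **kw) <= target_flops:
            return pct / 100
    return 0.01
```

Given a budget, each baseline (FastV, ToMe, DVTIE+ATR) would be run at the keep ratio this returns, so that accuracy differences cannot be attributed to unequal compute.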

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical engineering contribution

Full rationale

The paper introduces DVTIE and ATR as modules for token pruning in 3D MLLMs, with performance validated on benchmarks like Scan2Cap (+2.57% CIDEr). No equations, derivations, or self-referential definitions are present that reduce any prediction or result to fitted inputs by construction. Claims rely on experimental comparisons to unpruned baselines rather than tautological redefinitions or self-citation chains. The framework is presented as a practical engineering solution without load-bearing self-citations, ansatzes, or uniqueness theorems imported from prior author work. This is the standard non-finding for empirical pruning papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework introduces two named modules whose internal mechanics are not detailed here.

pith-pipeline@v0.9.0 · 5584 in / 1009 out tokens · 42395 ms · 2026-05-13T19:42:08.841467+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 2 internal anchors

  1. [1]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In CVPR, 2022.

  2. [2]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.

  3. [3]

    Madtp: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer

    Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, and Tao Chen. Madtp: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. In CVPR, 2024.

  4. [4]

    Honeybee: Locality-enhanced projector for multimodal llm

    Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. In CVPR, 2024.

  5. [5]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In ECCV, 2020.

  6. [6]

    Llavolta: Efficient multi-modal models via stage-wise visual context compression

    Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. In NeurIPS, 2024.

  7. [7]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.

  8. [8]

    Diffrate: Differentiable compression rate for efficient vision transformers

    Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. Diffrate: Differentiable compression rate for efficient vision transformers. In ICCV, 2023.

  9. [9]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning and planning. In CVPR, 2024.

  10. [10]

    Grounded 3d-llm with referent tokens

    Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, and Jiangmiao Pang. Grounded 3d-llm with referent tokens. arXiv preprint arXiv:2405.10370, 2024.

  11. [11]

    Scan2cap: Context-aware dense captioning in rgb-d scans

    Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In CVPR, 2021.

  12. [12]

    3D-AffordanceLLM: Harnessing large language models for open-vocabulary affordance detection in 3d worlds

    Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3D-AffordanceLLM: Harnessing large language models for open-vocabulary affordance detection in 3d worlds. In ICLR.

  13. [13]

    Mobilevlm: A fast, strong and open vision language assistant for mobile devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.

  14. [14]

    Mobilevlm v2: Faster and stronger baseline for vision language model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

  15. [15]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2019.

  16. [16]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.

  17. [17]

    Adaptive token sampling for efficient vision transformers

    Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In ECCV, 2022.

  18. [18]

    Scene-llm: Extending language model for 3d visual understanding and reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. In WACV, 2025.

  19. [19]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023.

  20. [20]

    Imagebind-llm: Multi-modality instruction tuning

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023.

  21. [21]

    Dynamic neural networks: A survey

    Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. TPAMI, 44(11):7436–7456, 2021.

  22. [22]

    Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration

    Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024.

  23. [23]

    Latency-aware unified dynamic networks for efficient image recognition

    Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, and Gao Huang. Latency-aware unified dynamic networks for efficient image recognition. TPAMI.

  24. [24]

    Segpoint: Segment any point cloud via large language model

    Shuting He, Henghui Ding, Xudong Jiang, and Bihan Wen. Segpoint: Segment any point cloud via large language model. In ECCV, 2024.

  25. [25]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. NeurIPS, 2023.

  26. [26]

    Chat-3d v2: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tai Wang, Runsen Xu, and Zhou Zhao. Chat-3d v2: Bridging 3d scene and large language models with object identifiers. In NeurIPS, 2024.

  27. [27]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers. In NeurIPS.

  28. [28]

    Zero-shot 3d question answering via voxel-based dynamic token compression

    Hsiang-Wei Huang, Fu-Chen Chen, Wenhao Chai, Che-Chun Su, Lu Xia, Sanghun Jung, Cheng-Yen Yang, Jenq-Neng Hwang, Min Sun, and Cheng-Hao Kuo. Zero-shot 3d question answering via voxel-based dynamic token compression. In CVPR, 2025.

  29. [29]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In ICML, 2024.

  30. [30]

    Ivtp: Instruction-guided visual token pruning for large vision-language models

    Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. Ivtp: Instruction-guided visual token pruning for large vision-language models. In ECCV, 2024.

  31. [31]

    Resolving multi-condition confusion for finetuning-free personalized image generation

    Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. In AAAI.

  32. [32]

    Dense object grounding in 3d scenes

    Wencan Huang, Daizong Liu, and Wei Hu. Dense object grounding in 3d scenes. In ACM MM, pages 5017–5026.

  33. [33]

    Advancing 3d object grounding beyond a single 3d scene

    Wencan Huang, Daizong Liu, and Wei Hu. Advancing 3d object grounding beyond a single 3d scene. In ACM MM, pages 7995–8004, 2024.

  34. [34]

    Fast3d: Accelerating 3d multi-modal large language models for efficient 3d scene understanding

    Wencan Huang, Daizong Liu, and Wei Hu. Fast3d: Accelerating 3d multi-modal large language models for efficient 3d scene understanding. In ACM MM, 2025.

  35. [35]

    Dynamic diffusion transformer for accurate image generation

    Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, and Zhendong Mao. Dynamic diffusion transformer for accurate image generation. In CVPR, 2025.

  36. [36]

    What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph

    Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph. In AAAI, 2025.

  37. [37]

    Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation

    Shuo Jin, Siyue Yu, Bingfeng Zhang, Mingjie Sun, Yi Dong, and Jimin Xiao. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In ICCV, 2025.

  38. [38]

    Turbo: Informativity-driven acceleration plug-in for vision-language large models

    Chen Ju, Haicheng Wang, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao, and Bo Zheng. Turbo: Informativity-driven acceleration plug-in for vision-language large models. In ECCV, 2024.

  39. [39]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In CVPR, 2024.

  40. [40]

    Beyond token compression: A training-free reduction framework for efficient visual processing in mllms

    Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, and Lianwen Jin. Beyond token compression: A training-free reduction framework for efficient visual processing in mllms. arXiv preprint arXiv:2501.19036, 2025.

  41. [41]

    Tokenpacker: Efficient visual projector for multimodal llm

    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. IJCV, 133:6794–6812.

  42. [42]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV.

  43. [43]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.

  44. [44]

    Boosting multimodal large language models with visual tokens withdrawal for rapid inference

    Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025.

  45. [45]

    Seeing is not believing: Adversarial natural object optimization for hard-label 3d scene attacks

    Daizong Liu and Wei Hu. Seeing is not believing: Adversarial natural object optimization for hard-label 3d scene attacks. In CVPR, 2025.

  46. [46]

    A survey on text-guided 3d visual grounding: elements, recent advances, and future directions

    Daizong Liu, Yang Liu, Wencan Huang, and Wei Hu. A survey on text-guided 3d visual grounding: elements, recent advances, and future directions. arXiv preprint arXiv:2406.05785, 2024.

  47. [47]

    A survey of attacks on large vision-language models: Resources, advances, and future trends

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. A survey of attacks on large vision-language models: Resources, advances, and future trends. arXiv preprint arXiv:2407.07403, 2024.

  48. [48]

    Pandora’s box: Towards building universal attackers against real-world large vision-language models

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. Pandora’s box: Towards building universal attackers against real-world large vision-language models. In NeurIPS, 2024.

  49. [49]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.

  50. [50]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  51. [51]

    Sqa3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. In ICLR, 2023.

  52. [52]

    Adavit: Adaptive vision transformers for efficient image recognition

    Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. Adavit: Adaptive vision transformers for efficient image recognition. In CVPR, pages 12309–12318, 2022.

  53. [53]

    Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In NeurIPS, 2024.

  54. [54]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.

  55. [55]

    Shapellm: Universal 3d object understanding for embodied interaction

    Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. In ECCV, 2024.

  56. [56]

    Gpt4point: A unified framework for point-language understanding and generation

    Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. Gpt4point: A unified framework for point-language understanding and generation. In CVPR, 2024.

  57. [57]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.

  58. [58]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. NeurIPS, 34:13937–13949, 2021.

  59. [59]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong-Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.

  60. [60]

    Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers

    Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang. Crossget: Cross-guided ensemble of tokens for accelerating vision-language transformers. arXiv preprint arXiv:2305.17455, 2023.

  61. [61]

    Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors

    Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, and Min Chen. Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors. In ACM MM, 2024.

  62. [62]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  63. [63]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  64. [64]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.

  65. [65]

    [cls] token tells everything needed for training-free efficient mllms

    Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [cls] token tells everything needed for training-free efficient mllms. arXiv preprint arXiv:2412.05819, 2024.

  66. [66]

    Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers

    Hongjie Wang, Bhishma Dedhia, and Niraj K Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In CVPR, 2024.

  67. [67]

    Gpt4video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation

    Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, and Zhaopeng Tu. Gpt4video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation. In ACM MM, 2024.

  68. [68]

    Data-efficiently learn large language model for universal 3d scene perception

    Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, and Zhou Zhao. Data-efficiently learn large language model for universal 3d scene perception. In ACL, 2025. 2

  69. [69]

    Efficient multimodal large language models via progressive consistency distillation

    Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, and Linfeng Zhang. Efficient multimodal large language models via progressive consistency distillation. arXiv preprint arXiv:2510.00515, 2025. 5

  70. [70]

    Pointllm: Empowering large language models to understand point clouds

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. In ECCV, 2024. 1, 2

  71. [71]

    DeCo: Decoupling token compression from semantic abstraction in multimodal large language models

    Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. DeCo: Decoupling token compression from semantic abstraction in multimodal large language models. arXiv preprint arXiv:2405.20985, 2024. 1, 2

  72. [72]

    Fit and prune: Fast and training-free visual token pruning for multimodal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multimodal large language models. In AAAI, 2025. 3

  73. [73]

    VoCo-LLaMA: Towards vision compression with large language models

    Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, and Yansong Tang. VoCo-LLaMA: Towards vision compression with large language models. In CVPR, 2025. 1, 2

  74. [74]

    Excap3d: Expressive 3d scene understanding via object captioning with varying detail

    Chandan Yeshwanth, David Rozenberszki, and Angela Dai. Excap3d: Expressive 3d scene understanding via object captioning with varying detail. In ICCV, 2025. 2

  75. [75]

    Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning

    Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, and Jianke Zhu. Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning. In CVPR,

  76. [76]

    3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding

    Tatiana Zemskova and Dmitry Yudin. 3dgraphllm: Combining semantic graphs and large language models for 3d scene understanding. In ICCV, 2025. 2

  77. [77]

    Multi3drefer: Grounding text description to multiple 3d objects

    Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3drefer: Grounding text description to multiple 3d objects. In ICCV, 2023. 6

  78. [78]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In ICML, 2025. 3

  79. [79]

    Accelerating multimodal large language models by searching optimal vision token reduction

    Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, and Licheng Yu. Accelerating multimodal large language models by searching optimal vision token reduction. In CVPR, 2025. 3

  80. [80]

    Dynamic tuning towards parameter and inference efficiency for vit adaptation

    Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, and Yang You. Dynamic tuning towards parameter and inference efficiency for vit adaptation. In NeurIPS, 2024. 2

Showing first 80 references.