Recognition: 2 theorem links
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
Pith reviewed 2026-05-13 19:42 UTC · model grok-4.3
The pith
Efficient3D prunes visual tokens in 3D MLLMs using debiased estimates and scene-adaptive balancing to speed inference while raising accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Efficient3D shows that debiased token importance scoring, combined with dynamic rebalancing of pruning strength according to scene complexity, produces faster 3D MLLM inference while preserving semantic content and yielding higher scores on standard 3D vision-language tasks than unpruned baselines.
What carries the argument
A Debiased Visual Token Importance Estimator (DVTIE) that aggregates attention while accounting for the bias of shallow initial layers, producing more reliable token importance scores, together with an Adaptive Token Rebalancing (ATR) strategy that adjusts pruning intensity per scene.
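To make that machinery concrete, here is a minimal PyTorch sketch of the general recipe: attention-based importance scoring that blends shallow and deep layers, plus an entropy-driven keep ratio standing in for scene complexity. The function names, the blend weight, and the entropy proxy are illustrative assumptions, not the authors' released implementation.

```python
import torch

def debiased_importance(attn_maps, shallow_k=3, shallow_weight=0.5):
    """Blend shallow- and deep-layer attention into one importance score
    per visual token. attn_maps is a list of [num_visual_tokens] tensors,
    one per decoder layer (text-to-visual attention, already averaged
    over heads and query positions)."""
    shallow = torch.stack(attn_maps[:shallow_k]).mean(dim=0)
    deep = torch.stack(attn_maps[shallow_k:]).mean(dim=0)
    # Keep the shallow signal but cap its influence instead of letting
    # early layers dominate the ranking.
    return shallow_weight * shallow + (1.0 - shallow_weight) * deep

def adaptive_keep_ratio(importance, base_ratio=0.5, lo=0.3, hi=0.9):
    """Choose a per-scene pruning ratio from the entropy of the importance
    distribution: a flatter (higher-entropy) distribution suggests a more
    complex scene, so more tokens are kept."""
    p = importance / importance.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(p.numel())))
    complexity = (entropy / max_entropy).item()  # normalized to [0, 1]
    return min(hi, max(lo, base_ratio * (0.5 + complexity)))

def prune_tokens(tokens, importance, keep_ratio):
    """Keep the top-k visual tokens by importance, preserving order."""
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = torch.topk(importance, k).indices.sort().values
    return tokens[idx]
```

The real DVTIE and ATR modules presumably differ in how they weight layers and measure complexity; the sketch only fixes the two moving parts the pith names, a debiased score and a scene-dependent ratio.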
If this is right
- Token reduction becomes context-aware for varying 3D scene difficulties.
- Inference runs faster on resource-limited hardware without accuracy loss.
- Semantic completeness is maintained across captioning and question-answering tasks.
- The same framework improves results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
Where Pith is reading between the lines
- The same pruning logic could be tested on 2D image-language models to check broader applicability.
- Early token reduction may lower memory footprint in addition to compute time.
- Combining the method with quantization could produce further efficiency gains on edge devices.
Load-bearing premise
That including attention signals from shallow initial layers produces more trustworthy rankings of which visual tokens are important.
What would settle it
An experiment in which replacing the shallow-layer component of the importance estimator with a standard deep-layer version yields equal or higher benchmark scores would show the debiased step is not required.
Original abstract
Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Efficient3D, a unified framework for visual token pruning in 3D MLLMs. It introduces the Debiased Visual Token Importance Estimator (DVTIE), which aggregates attention scores from shallow initial layers to estimate token importance more reliably, and the Adaptive Token Rebalancing (ATR) strategy, which dynamically modulates pruning ratios according to scene complexity. Experiments across five 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) report that the method outperforms unpruned baselines, including a +2.57% CIDEr gain on Scan2Cap, while reducing inference cost.
Significance. If the reported gains are reproducible and the causal contribution of DVTIE is isolated from ATR and training variance, the work would offer a practical route to deploying 3D MLLMs on edge devices. The open-sourced code strengthens the contribution by enabling direct verification of the efficiency claims.
Major comments (3)
- [Experiments / Abstract] The +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.
- [§3.2, DVTIE] The central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.
- [Table 2 / §4.2] No comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.
Minor comments (2)
- [§3.2] Notation for the attention aggregation in DVTIE is introduced without an explicit equation; adding a numbered equation would clarify the precise weighting of shallow vs. deep layers (one hypothetical form is sketched after this list).
- [Abstract / §4] The abstract states “superior performance compared with unpruned baselines” but the main text should explicitly list the exact unpruned model variants (e.g., LLaVA-3D-7B) and their CIDEr scores for each dataset.
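For the first minor comment, one hypothetical form such a numbered equation could take, with all notation assumed rather than taken from the paper: per-token importance as a convex combination of head-averaged text-to-visual attention from the shallow block (layers 1..K) and the deep block (layers K+1..L).

```latex
% Hypothetical DVTIE aggregation; lambda, K, and the attention symbols
% are assumptions for illustration, not the paper's notation.
\begin{equation}
  s_i = \frac{\lambda}{K} \sum_{l=1}^{K} \bar{A}^{(l)}_{i}
      + \frac{1-\lambda}{L-K} \sum_{l=K+1}^{L} \bar{A}^{(l)}_{i},
  \qquad
  \bar{A}^{(l)}_{i} = \frac{1}{H} \sum_{h=1}^{H} A^{(l,h)}_{\mathrm{txt}\to i}
  \label{eq:dvtie}
\end{equation}
```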
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the paper. We address each major point below and will revise the manuscript to improve experimental reporting, add ablations, and include comparisons.
Point-by-point responses
-
Referee: Experiments section (and abstract): the +2.57% CIDEr improvement on Scan2Cap is stated without reporting the number of random seeds, standard deviation across runs, or statistical significance tests against the unpruned baseline. This omission prevents assessment of whether the gain exceeds training stochasticity.
Authors: We acknowledge the need for statistical rigor. In the revised manuscript we will rerun the Scan2Cap experiments over three random seeds, report mean and standard deviation, and include a paired t-test against the unpruned baseline to confirm the gain exceeds training variance. revision: yes
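A minimal sketch of the promised significance check, assuming per-seed CIDEr scores are collected for both systems; the scores below are placeholders, not paper data, and scipy.stats.ttest_rel implements the paired test over matched seeds.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed Scan2Cap CIDEr scores (placeholders, not paper data);
# the same three seeds are used for both systems, so the test is paired.
baseline = np.array([77.1, 76.8, 77.4])     # unpruned
efficient3d = np.array([79.5, 79.2, 80.1])  # pruned

t_stat, p_value = stats.ttest_rel(efficient3d, baseline)
print(f"mean gain: {np.mean(efficient3d - baseline):+.2f} CIDEr, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```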
-
Referee: §3.2 (DVTIE): the central claim that aggregating shallow-layer attention yields more reliable importance scores lacks supporting ablations (e.g., DVTIE with vs. without layers 1–3) or correlation analysis against human-annotated token relevance. Shallow layers predominantly encode low-level spatial cues; without this evidence the reported superiority could be driven by ATR rather than the debiased estimator.
Authors: We will add an ablation table in §3.2 comparing DVTIE variants that aggregate attention from layers 1–3 versus deeper layers only, isolating the contribution of shallow-layer aggregation. We do not possess human-annotated token relevance labels, so a direct correlation study is not feasible; however, we will include qualitative attention visualizations and additional quantitative metrics (e.g., token retention rates on semantically critical objects) to support the reliability claim. We will also present an ablation that applies ATR alone versus DVTIE+ATR to clarify their separate effects. revision: partial
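The "token retention rates on semantically critical objects" metric mentioned above could be computed along these lines; the mask convention and helper name are assumptions for illustration, not the authors' protocol.

```python
import numpy as np

def critical_retention(kept_idx: np.ndarray, critical_mask: np.ndarray) -> float:
    """Fraction of tokens covering annotated target objects that survive pruning.

    kept_idx: indices of visual tokens retained by the pruner.
    critical_mask: boolean array over all visual tokens, True where a token
    overlaps a ground-truth target object.
    """
    critical = np.flatnonzero(critical_mask)
    if critical.size == 0:
        return 1.0  # nothing critical in this scene
    return np.intersect1d(kept_idx, critical).size / critical.size

# Toy example: 10 tokens, tokens 2-4 cover the referred object,
# and the pruner keeps tokens 0, 2, 3, 7 -> retention 2/3.
mask = np.zeros(10, dtype=bool)
mask[2:5] = True
print(critical_retention(np.array([0, 2, 3, 7]), mask))
```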
-
Referee: Table 2 / §4.2: no comparison is provided against prior token-pruning methods for 3D MLLMs (or strong 2D MLLM baselines such as FastV or ToMe) under matched FLOPs budgets, making it impossible to judge whether DVTIE+ATR advances the state of the art or merely matches existing engineering practice.
Authors: We agree that matched-FLOPs comparisons are essential. In the revision we will adapt FastV and ToMe to the 3D MLLM setting, enforce identical FLOPs budgets, and add these baselines to Table 2 together with a discussion in §4.2 that positions DVTIE+ATR relative to prior work. revision: yes
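Matching FLOPs budgets across pruning methods can be done by inverting a standard dense-transformer cost estimate to find the visual-token count each baseline is allowed to keep. The constants below are the usual rough per-layer estimate, not measurements of any specific 3D MLLM or the released code.

```python
def layer_flops(n_tokens: int, d_model: int) -> float:
    # Rough dense-decoder estimate per layer: Q/K/V/O projections (8*N*d^2),
    # attention scores and values (4*N^2*d), and a 4x-wide MLP (16*N*d^2).
    return 24 * n_tokens * d_model**2 + 4 * n_tokens**2 * d_model

def total_flops(n_visual: int, n_text: int, d_model: int, n_layers: int) -> float:
    return n_layers * layer_flops(n_visual + n_text, d_model)

def matched_keep_count(budget: float, n_text: int, d_model: int,
                       n_layers: int, n_max: int) -> int:
    """Largest visual-token count whose estimated cost stays within budget."""
    for n in range(n_max, 0, -1):
        if total_flops(n, n_text, d_model, n_layers) <= budget:
            return n
    return 1

# Example: give every baseline 50% of the unpruned cost for a
# hypothetical 7B-scale decoder that starts from 1024 visual tokens.
full = total_flops(n_visual=1024, n_text=128, d_model=4096, n_layers=32)
print(matched_keep_count(0.5 * full, n_text=128, d_model=4096,
                         n_layers=32, n_max=1024))
```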
Circularity Check
No circularity in derivation chain; empirical engineering contribution
Full rationale
The paper introduces DVTIE and ATR as modules for token pruning in 3D MLLMs, with performance validated on benchmarks like Scan2Cap (+2.57% CIDEr). No equations, derivations, or self-referential definitions are present that reduce any prediction or result to fitted inputs by construction. Claims rely on experimental comparisons to unpruned baselines rather than tautological redefinitions or self-citation chains. The framework is presented as a practical engineering solution without load-bearing self-citations, ansatzes, or uniqueness theorems imported from prior author work. This is the standard non-finding for empirical pruning papers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel), tagged unclear.
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Debiased Visual Token Importance Estimator (DVTIE) … aggregate attention matrices from layer K … ranking loss L_rank … Adaptive Token Rebalancing (ATR) … shadow factor s_k"
- IndisputableMonolith/Foundation/DimensionForcing.lean (alexander_duality_circle_linking), tagged unclear.
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "… +2.57% CIDEr improvement on Scan2Cap …"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In CVPR, 2022.
- [2] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.
- [3] Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, and Tao Chen. MADTP: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. In CVPR, 2024.
- [4] Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal LLM. In CVPR, 2024.
- [5] Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. ScanRefer: 3D object localization in RGB-D scans using natural language. In ECCV, 2020.
- [6] Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. LLaVolta: Efficient multimodal models via stage-wise visual context compression. In NeurIPS, 2024.
- [7] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.
- [8] Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. DiffRate: Differentiable compression rate for efficient vision transformers. In ICCV, 2023.
- [9] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning. In CVPR, 2024.
- [10] Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, and Jiangmiao Pang. Grounded 3D-LLM with referent tokens. arXiv preprint arXiv:2405.10370, 2024.
- [11] Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X. Chang. Scan2Cap: Context-aware dense captioning in RGB-D scans. In CVPR, 2021.
- [12] Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3D-AffordanceLLM: Harnessing large language models for open-vocabulary affordance detection in 3D worlds. In ICLR.
- [13] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. MobileVLM: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
- [14] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
- [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2019.
- [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [17] Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In ECCV, 2022.
- [18] Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-LLM: Extending language model for 3D visual understanding and reasoning. In WACV, 2025.
- [19] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023.
- [20] Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905, 2023.
- [21] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. TPAMI, 44(11):7436–7456, 2021.
- [22] Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Rethinking token reduction in MLLMs: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024.
- [23] Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, and Gao Huang. Latency-aware unified dynamic networks for efficient image recognition. TPAMI.
- [24] Shuting He, Henghui Ding, Xudong Jiang, and Bihan Wen. SegPoint: Segment any point cloud via large language model. In ECCV, 2024.
- [25] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. NeurIPS, 2023.
- [26] Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tai Wang, Runsen Xu, and Zhou Zhao. Chat-3D v2: Bridging 3D scene and large language models with object identifiers. In NeurIPS, 2024.
- [27] Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-Scene: Bridging 3D scene and large language models with object identifiers. In NeurIPS.
- [28] Hsiang-Wei Huang, Fu-Chen Chen, Wenhao Chai, Che-Chun Su, Lu Xia, Sanghun Jung, Cheng-Yen Yang, Jenq-Neng Hwang, Min Sun, and Cheng-Hao Kuo. Zero-shot 3D question answering via voxel-based dynamic token compression. In CVPR, 2025.
- [29] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. In ICML, 2024.
- [30] Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. IVTP: Instruction-guided visual token pruning for large vision-language models. In ECCV, 2024.
- [31] Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. In AAAI.
- [32] Wencan Huang, Daizong Liu, and Wei Hu. Dense object grounding in 3D scenes. In ACM MM, pages 5017–5026.
- [33] Wencan Huang, Daizong Liu, and Wei Hu. Advancing 3D object grounding beyond a single 3D scene. In ACM MM, pages 7995–8004, 2024.
- [34] Wencan Huang, Daizong Liu, and Wei Hu. Fast3D: Accelerating 3D multi-modal large language models for efficient 3D scene understanding. In ACM MM, 2025.
- [35] Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, and Zhendong Mao. Dynamic diffusion transformer for accurate image generation. In CVPR, 2025.
- [36] Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? Training-free visual token pruning for multi-modal large language models from the perspective of graph. In AAAI, 2025.
- [37] Shuo Jin, Siyue Yu, Bingfeng Zhang, Mingjie Sun, Yi Dong, and Jimin Xiao. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In ICCV, 2025.
- [38] Chen Ju, Haicheng Wang, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao, and Bo Zheng. Turbo: Informativity-driven acceleration plug-in for vision-language large models. In ECCV, 2024.
- [39] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In CVPR, 2024.
- [40] Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, and Lianwen Jin. Beyond token compression: A training-free reduction framework for efficient visual processing in MLLMs. arXiv preprint arXiv:2501.19036, 2025.
- [41] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang. TokenPacker: Efficient visual projector for multimodal LLM. IJCV, 133:6794–6812.
- [42] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. In ECCV.
- [43] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.
- [44] Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025.
- [45] Daizong Liu and Wei Hu. Seeing is not believing: Adversarial natural object optimization for hard-label 3D scene attacks. In CVPR, 2025.
- [46] Daizong Liu, Yang Liu, Wencan Huang, and Wei Hu. A survey on text-guided 3D visual grounding: Elements, recent advances, and future directions. arXiv preprint arXiv:2406.05785, 2024.
- [47] Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. A survey of attacks on large vision-language models: Resources, advances, and future trends. arXiv preprint arXiv:2407.07403, 2024.
- [48] Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. Pandora's box: Towards building universal attackers against real-world large vision-language models. In NeurIPS, 2024.
- [49] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.
- [50] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [51] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3D scenes. In ICLR, 2023.
- [52] Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. AdaViT: Adaptive vision transformers for efficient image recognition. In CVPR, pages 12309–12318, 2022.
- [53] Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for LMMs. In NeurIPS, 2024.
- [54] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
- [55] Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. ShapeLLM: Universal 3D object understanding for embodied interaction. In ECCV, 2024.
- [56] Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. GPT4Point: A unified framework for point-language understanding and generation. In CVPR, 2024.
- [57] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.
- [58] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. NeurIPS, 34:13937–13949, 2021.
- [59] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.
- [60] Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang. CrossGET: Cross-guided ensemble of tokens for accelerating vision-language transformers. arXiv preprint arXiv:2305.17455, 2023.
- [61] Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, and Min Chen. MiniGPT-3D: Efficiently aligning 3D point clouds with large language models using 2D priors. In ACM MM, 2024.
- [62] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [63] Ashish Vaswani et al. Attention is all you need. NeurIPS, 2017.
- [64] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
- [65] Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [CLS] token tells everything needed for training-free efficient MLLMs. arXiv preprint arXiv:2412.05819, 2024.
- [66] Hongjie Wang, Bhishma Dedhia, and Niraj K. Jha. Zero-TPrune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In CVPR, 2024.
- [67] Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, and Zhaopeng Tu. GPT4Video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation. In ACM MM, 2024.
- [68] Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, and Zhou Zhao. Data-efficiently learn large language model for universal 3D scene perception. In ACL, 2025.
- [69] Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, and Linfeng Zhang. Efficient multimodal large language models via progressive consistency distillation. arXiv preprint arXiv:2510.00515, 2025.
- [70] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In ECCV, 2024.
- [71] Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. DeCo: Decoupling token compression from semantic abstraction in multimodal large language models. arXiv preprint arXiv:2405.20985, 2024.
- [72] Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In AAAI, 2025.
- [73] Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, and Yansong Tang. VoCo-LLaMA: Towards vision compression with large language models. In CVPR, 2025.
- [74] Chandan Yeshwanth, David Rozenberszki, and Angela Dai. ExCap3D: Expressive 3D scene understanding via object captioning with varying detail. In ICCV, 2025.
- [75] Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, and Jianke Zhu. Inst3D-LMM: Instance-aware 3D scene understanding with multi-modal instruction tuning. In CVPR.
- [76] Tatiana Zemskova and Dmitry Yudin. 3DGraphLLM: Combining semantic graphs and large language models for 3D scene understanding. In ICCV, 2025.
- [77] Yiming Zhang, ZeMing Gong, and Angel X. Chang. Multi3DRefer: Grounding text description to multiple 3D objects. In ICCV, 2023.
- [78] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. SparseVLM: Visual token sparsification for efficient vision-language model inference. In ICML, 2025.
- [79] Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, and Licheng Yu. Accelerating multimodal large language models by searching optimal vision token reduction. In CVPR, 2025.
- [80] Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, and Yang You. Dynamic tuning towards parameter and inference efficiency for ViT adaptation. In NeurIPS, 2024.