pith. machine review for the scientific record.

arxiv: 2605.12309 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual token reduction · unified multimodal models · generation-guided selection · VAE latent consistency · image understanding · image editing · efficient inference · token merging

The pith

Generation-guided token reduction cuts visual tokens in unified multimodal models by nearly 2x while preserving both reasoning accuracy and editing quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Separate-encoder unified multimodal models incur high inference costs from dense visual token processing in the understanding branch. Prior reduction methods assume the goal is only discriminative reasoning, but these models must also retain information sufficient for image editing via latent-space reconstruction. G²TR uses the generation branch to supply a task-agnostic importance signal based on consistency with VAE latents, then applies balanced selection and merges redundant tokens. The training-free approach is applied only after encoding and yields 1.94x fewer tokens and lower prefill cost with no significant loss on understanding or editing benchmarks.

Core claim

Estimating understanding-side token importance from consistency with the generation branch's VAE latent representation allows balanced selection and merging that reduce the visual token count while keeping both semantic understanding and latent-space reconstruction capabilities intact for editing tasks.
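The abstract does not state the consistency metric explicitly (the referee's second minor comment below makes the same point). As a hedged sketch, one natural instantiation scores each understanding-side token by its cosine agreement with the spatially aligned VAE latent patch and keeps the top-scoring set under a balance constraint; the notation below is illustrative, not the paper's.

```latex
% Hypothetical instantiation of the importance score; u_i, z_i, P are
% illustrative notation, not taken from the paper.
% u_i : understanding-side visual token i
% z_i : spatially aligned VAE latent patch from the generation branch
% P   : a projection of u_i into the latent space
s_i \;=\; \frac{\langle P(u_i),\, z_i \rangle}{\lVert P(u_i) \rVert \,\lVert z_i \rVert},
\qquad
\mathcal{K} \;=\; \operatorname{top\text{-}k}\bigl(\{ s_i \}_{i=1}^{N}\bigr)
\ \text{subject to a per-region balance constraint.}
```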

What carries the argument

Generation-guided importance scoring from VAE latent consistency, followed by balanced token selection and representative merging to limit information loss.
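A minimal sketch of these three stages, under stated assumptions: cosine similarity as the consistency metric, a fixed spatial grid for balanced selection, and nearest-representative averaging as the merge rule. None of these choices are confirmed by the abstract; the code illustrates the shape of the pipeline, not the paper's implementation.

```python
# Assumptions:
#  - `tokens`  : (N, d) understanding-side visual tokens laid out on an H x W grid
#  - `latents` : (N, d) generation-branch VAE latent patches projected to dimension d
#  - H and W divisible by `grid`
import torch
import torch.nn.functional as F

def generation_guided_reduce(tokens, latents, H, W, keep_ratio=0.5, grid=4):
    N, _ = tokens.shape
    # Importance: consistency of each understanding token with its VAE latent patch.
    scores = F.cosine_similarity(tokens, latents, dim=-1)                  # (N,)

    # Balanced selection: keep the top-scoring tokens inside each spatial cell,
    # so no region of the image is pruned away entirely.
    rows, cols = torch.arange(N) // W, torch.arange(N) % W
    region = (rows // (H // grid)) * grid + (cols // (W // grid))          # (N,)
    keep = torch.zeros(N, dtype=torch.bool)
    for r in range(grid * grid):
        idx = (region == r).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        k = max(1, int(keep_ratio * idx.numel()))
        keep[idx[scores[idx].topk(k).indices]] = True

    kept = keep.nonzero(as_tuple=True)[0]
    dropped = (~keep).nonzero(as_tuple=True)[0]

    # Merge: fold each dropped token into its most similar retained representative
    # by averaging, so its information is not discarded outright.
    sim = F.cosine_similarity(tokens[dropped, None, :], tokens[None, kept, :], dim=-1)
    assign = sim.argmax(dim=-1)                                            # (N_drop,)
    merged = tokens[kept].clone()
    counts = torch.ones(kept.numel(), 1)
    merged.index_add_(0, assign, tokens[dropped])
    counts.index_add_(0, assign, torch.ones(dropped.numel(), 1))
    return merged / counts                                                 # (N_keep, d)
```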

If this is right

  • Visual token counts drop substantially, directly lowering prefill computation time in inference pipelines (a back-of-envelope cost sketch follows this list).
  • Reasoning performance on standard image understanding benchmarks stays comparable to the full-token model.
  • Editing quality holds because tokens important for latent reconstruction are preferentially retained.
  • The method integrates into existing separate-encoder UMM workflows without any retraining or architecture changes.
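For a rough sense of what the reported 1.94x reduction buys during prefill, the sketch below counts per-layer decoder FLOPs as 12·n·d² (projections and MLP) plus 2·n²·d (attention). The model width, depth, and 576-token visual grid are hypothetical, not the paper's configuration.

```python
# Back-of-envelope prefill cost for the visual tokens only; illustrative numbers.
def prefill_flops(n_tokens, d_model=4096, n_layers=32):
    per_layer = 12 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model
    return n_layers * per_layer

full = prefill_flops(576)                 # e.g. a hypothetical 24 x 24 visual-token grid
reduced = prefill_flops(int(576 / 1.94))  # ~296 tokens after reduction
print(f"visual prefill FLOPs ratio: {full / reduced:.2f}x")  # close to the 1.94x token ratio
```

Because the linear terms dominate at these sizes, the FLOP ratio tracks the token ratio almost exactly.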

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation-branch signal might improve token efficiency in other multimodal systems that separate understanding from generation.
  • Extending the approach to video or multi-image inputs could test whether VAE consistency generalizes beyond single images.
  • Hybrid strategies that combine VAE consistency with attention scores may further reduce token counts without extra loss.

Load-bearing premise

That consistency with the VAE latent reliably identifies the tokens needed for both understanding and latent-space image reconstruction, so that discarding the rest causes no meaningful loss for editing.

What would settle it

A controlled test showing clear degradation in editing quality or reconstruction fidelity on complex edits after applying the VAE-consistency reduction, while full-token baselines remain unaffected.
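A hedged sketch of such a test, assuming paired edited outputs from the full-token and reduced-token models for the same source images. CLIP similarity to the source (a content-preservation proxy) and a paired Wilcoxon signed-rank test are illustrative choices; the checkpoint name and metric are not the paper's protocol.

```python
import torch
import open_clip
from scipy.stats import wilcoxon

# Illustrative open_clip checkpoint; any CLIP-style image encoder would do.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

@torch.no_grad()
def clip_image_sim(img_a, img_b):
    # img_a, img_b: PIL images; returns cosine similarity of their CLIP embeddings.
    feats = model.encode_image(torch.stack([preprocess(img_a), preprocess(img_b)]))
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])

def editing_degradation_test(sources, edits_full, edits_reduced):
    """Mean similarity gap (full minus reduced) and paired Wilcoxon p-value."""
    s_full = [clip_image_sim(s, e) for s, e in zip(sources, edits_full)]
    s_red = [clip_image_sim(s, e) for s, e in zip(sources, edits_reduced)]
    _, p = wilcoxon(s_full, s_red)
    gap = sum(s_full) / len(s_full) - sum(s_red) / len(s_red)
    return gap, p  # a positive gap with a small p-value would indicate editing degradation
```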

Figures

Figures reproduced from arXiv: 2605.12309 by Junxian Li, Kai Liu, Renjing Pei, Yulun Zhang, Zhikai Chen, Zhixin Wang, Zizhong Ding.

Figure 1: Intuitive view of G²TR. Left: our method preserves visual details crucial for image editing. Right: G²TR lies on the Pareto frontier compared with previous SOTA methods. Notably, UMM relative averages are calculated as (understanding relative average + editing relative average) / 2. “U” means relative averages, which can be found in …
Figure 2: The two observations under this scenario. (1) The left figure indicates that limited difference …
Figure 3: Method overview. Here we focus on tasks in which understanding-side visual tokens are …
Figure 4: Visualization of editing results. The instructions for the four groups of images, from top to bottom, are: “Draw what it will look like after being frozen.”; “Draw what it will look like after one hour on a hot grill.”; “What does this dish look like after it has been baked?”; “Can I see the appearance of this ring on a finger?”. Generally, G²TR performs better in UMMs for image editing than the compared baselines. …
Figure 5: Visualization of retained tokens. The first line shows one case of image understanding, and …
Figure 6: More visualization results. Instructions from top to bottom are: “What does the item in …
Original abstract

The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity and so on, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes G²TR, a training-free, plug-and-play framework for visual token reduction in separate-encoder unified multimodal models (UMMs). Token importance is estimated from consistency between understanding-encoded tokens and VAE latents from the generation branch; balanced selection followed by merging of redundant tokens is then applied post-encoding. Experiments claim a 1.94× reduction in visual tokens and prefill compute while preserving accuracy on understanding benchmarks and quality on editing benchmarks, outperforming attention- and similarity-based baselines.

Significance. If the central empirical claim holds, the work offers a practical efficiency improvement for UMM inference by supplying a generation-derived, task-agnostic importance signal that respects both reasoning and latent-space editing requirements—an issue not addressed by prior MLLM token-reduction methods.

major comments (3)
  1. [§3] §3 (Method), token-importance definition: the claim that VAE-latent consistency supplies a task-agnostic signal for editing-critical tokens rests on an untested assumption. VAE reconstruction optimizes pixel-level fidelity; no correlation analysis, ablation, or counter-example is provided showing that the retained tokens align with regions or features required for text-guided editing rather than reconstruction.
  2. [§4] §4 (Experiments) and associated tables: the 1.94× reduction and “maintained editing quality” are reported without error bars, statistical significance tests, or per-benchmark editing-specific metrics (e.g., CLIP similarity, FID, or human preference scores). The post-hoc balanced selection and merging steps could mask localized information loss that only appears under certain editing prompts.
  3. [§4] §4, baseline comparisons: the statement that G²TR “outperforms baselines on almost all benchmarks” lacks a complete list of baselines, their token-reduction ratios, and the exact metric values; without these, the relative advantage over attention-score or text-image similarity methods cannot be verified.
minor comments (2)
  1. [Abstract] Abstract and §1: the phrase “outperforming baselines on almost all benchmarks” should be replaced by a precise count or table reference.
  2. [§3] Notation in §3: the exact formula or distance metric used to compute “consistency with VAE latent” is not stated explicitly; a numbered equation would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below and have prepared revisions to strengthen the presentation and evidence where the comments identify gaps.

Point-by-point responses
  1. Referee: [§3] §3 (Method), token-importance definition: the claim that VAE-latent consistency supplies a task-agnostic signal for editing-critical tokens rests on an untested assumption. VAE reconstruction optimizes pixel-level fidelity; no correlation analysis, ablation, or counter-example is provided showing that the retained tokens align with regions or features required for text-guided editing rather than reconstruction.

    Authors: We appreciate the referee’s observation that the original submission did not contain explicit correlation analysis or ablations linking retained tokens to editing-specific features. The design choice is motivated by the architecture of separate-encoder UMMs, in which the generation branch directly operates on VAE latents for text-guided editing; therefore, tokens whose understanding embeddings are consistent with those latents are expected to preserve the information required for both reconstruction fidelity and subsequent editing. To address the concern directly, the revised manuscript will include a new ablation subsection with (i) quantitative overlap statistics between high-importance tokens and regions altered by editing prompts (a minimal sketch of one such statistic appears after these responses) and (ii) qualitative visualizations comparing token retention under understanding-only versus editing-aware prompts. revision: yes

  2. Referee: [§4] §4 (Experiments) and associated tables: the 1.94× reduction and “maintained editing quality” are reported without error bars, statistical significance tests, or per-benchmark editing-specific metrics (e.g., CLIP similarity, FID, or human preference scores). The post-hoc balanced selection and merging steps could mask localized information loss that only appears under certain editing prompts.

    Authors: We agree that the experimental reporting can be strengthened. In the revision we will (a) add error bars computed across three random seeds and report paired statistical significance tests (Wilcoxon signed-rank) for all main results, (b) expand the editing evaluation to include CLIP similarity and FID scores in addition to the benchmark-native metrics, and (c) add a short discussion with case studies illustrating that the balanced selection and merging steps do not produce noticeable localized degradation on representative editing prompts. These additions will be placed in an updated §4 and the corresponding tables. revision: yes

  3. Referee: [§4] §4, baseline comparisons: the statement that G²TR “outperforms baselines on almost all benchmarks” lacks a complete list of baselines, their token-reduction ratios, and the exact metric values; without these, the relative advantage over attention-score or text-image similarity methods cannot be verified.

    Authors: We acknowledge that the original text did not enumerate every baseline with its exact reduction ratio and metric values in a single location. Tables 1–3 already contain the full set of numbers for the attention-based and similarity-based methods we compared against, with reduction ratios matched to 1.94× for fairness. The revised manuscript will add an explicit supplementary table (and a clarifying paragraph in §4) that lists every baseline, its token-reduction ratio, and the complete per-benchmark metric values so that the claimed advantages can be verified directly. revision: yes
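As a concrete reading of the overlap statistic promised in response 1, the sketch below measures how much of the prompt-edited region is covered by retained tokens at the token-grid resolution. The mask names and the recall-style definition are assumptions, not the authors' planned metric.

```python
import numpy as np

def retained_edit_overlap(keep_mask, edit_mask):
    """keep_mask, edit_mask: boolean (H, W) arrays on the visual-token grid."""
    keep_mask = np.asarray(keep_mask, dtype=bool)   # tokens retained by the reduction
    edit_mask = np.asarray(edit_mask, dtype=bool)   # cells altered by the editing prompt
    covered = np.logical_and(keep_mask, edit_mask).sum()
    return covered / max(int(edit_mask.sum()), 1)   # 1.0 = edited area fully retained
```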

Circularity Check

0 steps flagged

No circularity: token importance from external VAE latent consistency

full rationale

The paper's core mechanism estimates token importance via consistency with an external VAE latent (independent of the UMM's own fitted quantities or outputs). This signal is used for balanced selection and merging in a training-free manner after understanding encoding. No self-definitional loops appear (e.g., no importance defined in terms of the reduction result itself), no fitted inputs renamed as predictions, and no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is self-contained against external benchmarks and VAE reconstruction objectives, with empirical validation on understanding/editing tasks rather than tautological equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the claim rests on the domain assumption that VAE latents supply a reliable, task-agnostic importance signal; no free parameters or new entities are explicitly introduced.

axioms (1)
  • domain assumption: VAE latent space captures essential image information sufficient for both reconstruction and generation tasks
    Invoked as the basis for estimating token importance from consistency with VAE latent.

pith-pipeline@v0.9.0 · 5578 in / 1340 out tokens · 126399 ms · 2026-05-13T05:28:37.832351+00:00 · methodology

