pith. machine review for the scientific record.

arxiv: 2605.12309 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual token reduction · unified multimodal models · generation-guided selection · VAE latent consistency · image understanding · image editing · efficient inference · token merging

The pith

Generation-guided token reduction cuts visual tokens in unified multimodal models by nearly 2x while preserving both reasoning accuracy and editing quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Separate-encoder unified multimodal models incur high inference costs from dense visual token processing in the understanding branch. Prior reduction methods assume the goal is only discriminative reasoning, but these models must also retain information sufficient for image editing via latent-space reconstruction. G²TR uses the generation branch to supply a task-agnostic importance signal based on consistency with VAE latents, then applies balanced selection and merges redundant tokens. The training-free approach is applied only after encoding and yields 1.94x fewer tokens and lower prefill cost with no significant loss on understanding or editing benchmarks.

Core claim

Estimating understanding-side token importance from consistency with the generation branch's VAE latent representation allows balanced selection and merging that reduce the visual token count while keeping both semantic understanding and latent-space reconstruction capabilities intact for editing tasks.
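The abstract does not state the consistency metric explicitly (the referee's second minor comment below makes the same point). As a hedged sketch, one natural instantiation scores each understanding-side token by its cosine agreement with the spatially aligned VAE latent patch and keeps the top-scoring set under a balance constraint; the notation below is illustrative, not the paper's.

```latex
% Hypothetical instantiation of the importance score; u_i, z_i, P are
% illustrative notation, not taken from the paper.
% u_i : understanding-side visual token i
% z_i : spatially aligned VAE latent patch from the generation branch
% P   : a projection of u_i into the latent space
s_i \;=\; \frac{\langle P(u_i),\, z_i \rangle}{\lVert P(u_i) \rVert \,\lVert z_i \rVert},
\qquad
\mathcal{K} \;=\; \operatorname{top\text{-}k}\bigl(\{ s_i \}_{i=1}^{N}\bigr)
\ \text{subject to a per-region balance constraint.}
```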

What carries the argument

Generation-guided importance scoring from VAE latent consistency, followed by balanced token selection and representative merging to limit information loss.
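A minimal sketch of these three stages, under stated assumptions: cosine similarity as the consistency metric, a fixed spatial grid for balanced selection, and nearest-representative averaging as the merge rule. None of these choices are confirmed by the abstract; the code illustrates the shape of the pipeline, not the paper's implementation.

```python
# Assumptions:
#  - `tokens`  : (N, d) understanding-side visual tokens laid out on an H x W grid
#  - `latents` : (N, d) generation-branch VAE latent patches projected to dimension d
#  - H and W divisible by `grid`
import torch
import torch.nn.functional as F

def generation_guided_reduce(tokens, latents, H, W, keep_ratio=0.5, grid=4):
    N, _ = tokens.shape
    # Importance: consistency of each understanding token with its VAE latent patch.
    scores = F.cosine_similarity(tokens, latents, dim=-1)                  # (N,)

    # Balanced selection: keep the top-scoring tokens inside each spatial cell,
    # so no region of the image is pruned away entirely.
    rows, cols = torch.arange(N) // W, torch.arange(N) % W
    region = (rows // (H // grid)) * grid + (cols // (W // grid))          # (N,)
    keep = torch.zeros(N, dtype=torch.bool)
    for r in range(grid * grid):
        idx = (region == r).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        k = max(1, int(keep_ratio * idx.numel()))
        keep[idx[scores[idx].topk(k).indices]] = True

    kept = keep.nonzero(as_tuple=True)[0]
    dropped = (~keep).nonzero(as_tuple=True)[0]

    # Merge: fold each dropped token into its most similar retained representative
    # by averaging, so its information is not discarded outright.
    sim = F.cosine_similarity(tokens[dropped, None, :], tokens[None, kept, :], dim=-1)
    assign = sim.argmax(dim=-1)                                            # (N_drop,)
    merged = tokens[kept].clone()
    counts = torch.ones(kept.numel(), 1)
    merged.index_add_(0, assign, tokens[dropped])
    counts.index_add_(0, assign, torch.ones(dropped.numel(), 1))
    return merged / counts                                                 # (N_keep, d)
```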

If this is right

  • Visual token counts drop substantially, directly lowering prefill computation time in inference pipelines (a back-of-envelope cost sketch follows this list).
  • Reasoning performance on standard image understanding benchmarks stays comparable to the full-token model.
  • Editing quality holds because tokens important for latent reconstruction are preferentially retained.
  • The method integrates into existing separate-encoder UMM workflows without any retraining or architecture changes.
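For a rough sense of what the reported 1.94x reduction buys during prefill, the sketch below counts per-layer decoder FLOPs as 12·n·d² (projections and MLP) plus 2·n²·d (attention). The model width, depth, and 576-token visual grid are hypothetical, not the paper's configuration.

```python
# Back-of-envelope prefill cost for the visual tokens only; illustrative numbers.
def prefill_flops(n_tokens, d_model=4096, n_layers=32):
    per_layer = 12 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model
    return n_layers * per_layer

full = prefill_flops(576)                 # e.g. a hypothetical 24 x 24 visual-token grid
reduced = prefill_flops(int(576 / 1.94))  # ~296 tokens after reduction
print(f"visual prefill FLOPs ratio: {full / reduced:.2f}x")  # close to the 1.94x token ratio
```

Because the linear terms dominate at these sizes, the FLOP ratio tracks the token ratio almost exactly.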

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation-branch signal might improve token efficiency in other multimodal systems that separate understanding from generation.
  • Extending the approach to video or multi-image inputs could test whether VAE consistency generalizes beyond single images.
  • Hybrid strategies that combine VAE consistency with attention scores may further reduce token counts without extra loss.

Load-bearing premise

That consistency with the VAE latent reliably identifies the tokens needed for both understanding and latent-space image reconstruction, so that discarding the rest causes no meaningful loss for editing.

What would settle it

A controlled test showing clear degradation in editing quality or reconstruction fidelity on complex edits after applying the VAE-consistency reduction, while full-token baselines remain unaffected.
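A hedged sketch of such a test, assuming paired edited outputs from the full-token and reduced-token models for the same source images. CLIP similarity to the source (a content-preservation proxy) and a paired Wilcoxon signed-rank test are illustrative choices; the checkpoint name and metric are not the paper's protocol.

```python
import torch
import open_clip
from scipy.stats import wilcoxon

# Illustrative open_clip checkpoint; any CLIP-style image encoder would do.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

@torch.no_grad()
def clip_image_sim(img_a, img_b):
    # img_a, img_b: PIL images; returns cosine similarity of their CLIP embeddings.
    feats = model.encode_image(torch.stack([preprocess(img_a), preprocess(img_b)]))
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])

def editing_degradation_test(sources, edits_full, edits_reduced):
    """Mean similarity gap (full minus reduced) and paired Wilcoxon p-value."""
    s_full = [clip_image_sim(s, e) for s, e in zip(sources, edits_full)]
    s_red = [clip_image_sim(s, e) for s, e in zip(sources, edits_reduced)]
    _, p = wilcoxon(s_full, s_red)
    gap = sum(s_full) / len(s_full) - sum(s_red) / len(s_red)
    return gap, p  # a positive gap with a small p-value would indicate editing degradation
```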

Figures

Figures reproduced from arXiv: 2605.12309 by Junxian Li, Kai Liu, Renjing Pei, Yulun Zhang, Zhikai Chen, Zhixin Wang, Zizhong Ding.

Figure 1: Intuitive view of G²TR. Left: our method preserves visual details crucial for image editing. Right: G²TR lies on the Pareto frontier compared with previous SOTA methods. Notably, UMM relative averages are calculated as (understanding relative average + editing relative average) / 2. “U” means relative averages, which can be found in …
Figure 2: The two observations under this scenario. (1) The left figure indicates that limited difference …
Figure 3: Method overview. Here we focus on tasks in which understanding-side visual tokens are …
Figure 4: Visualization of editing results. The instructions for the four groups of images, from top to bottom, are: “Draw what it will look like after being frozen.”; “Draw what it will look like after one hour on a hot grill.”; “What does this dish look like after it has been baked?”; “Can I see the appearance of this ring on a finger?”. Generally, G²TR performs better in UMMs for image editing than the compared baselines. …
Figure 5: Visualization of retained tokens. The first line shows one case of image understanding, and …
Figure 6: More visualization results. Instructions from top to bottom are: “What does the item in …
Original abstract

The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity and so on, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes G²TR, a training-free, plug-and-play framework for visual token reduction in separate-encoder unified multimodal models (UMMs). Token importance is estimated from consistency between understanding-encoded tokens and VAE latents from the generation branch; balanced selection followed by merging of redundant tokens is then applied post-encoding. Experiments claim a 1.94× reduction in visual tokens and prefill compute while preserving accuracy on understanding benchmarks and quality on editing benchmarks, outperforming attention- and similarity-based baselines.

Significance. If the central empirical claim holds, the work offers a practical efficiency improvement for UMM inference by supplying a generation-derived, task-agnostic importance signal that respects both reasoning and latent-space editing requirements—an issue not addressed by prior MLLM token-reduction methods.

major comments (3)
  1. [§3] §3 (Method), token-importance definition: the claim that VAE-latent consistency supplies a task-agnostic signal for editing-critical tokens rests on an untested assumption. VAE reconstruction optimizes pixel-level fidelity; no correlation analysis, ablation, or counter-example is provided showing that the retained tokens align with regions or features required for text-guided editing rather than reconstruction.
  2. [§4] §4 (Experiments) and associated tables: the 1.94× reduction and “maintained editing quality” are reported without error bars, statistical significance tests, or per-benchmark editing-specific metrics (e.g., CLIP similarity, FID, or human preference scores). The post-hoc balanced selection and merging steps could mask localized information loss that only appears under certain editing prompts.
  3. [§4] §4, baseline comparisons: the statement that G²TR “outperforms baselines on almost all benchmarks” lacks a complete list of baselines, their token-reduction ratios, and the exact metric values; without these, the relative advantage over attention-score or text-image similarity methods cannot be verified.
minor comments (2)
  1. [Abstract] Abstract and §1: the phrase “outperforming baselines on almost all benchmarks” should be replaced by a precise count or table reference.
  2. [§3] Notation in §3: the exact formula or distance metric used to compute “consistency with VAE latent” is not stated explicitly; a numbered equation would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below and have prepared revisions to strengthen the presentation and evidence where the comments identify gaps.

Point-by-point responses
  1. Referee: [§3] §3 (Method), token-importance definition: the claim that VAE-latent consistency supplies a task-agnostic signal for editing-critical tokens rests on an untested assumption. VAE reconstruction optimizes pixel-level fidelity; no correlation analysis, ablation, or counter-example is provided showing that the retained tokens align with regions or features required for text-guided editing rather than reconstruction.

    Authors: We appreciate the referee’s observation that the original submission did not contain explicit correlation analysis or ablations linking retained tokens to editing-specific features. The design choice is motivated by the architecture of separate-encoder UMMs, in which the generation branch directly operates on VAE latents for text-guided editing; therefore, tokens whose understanding embeddings are consistent with those latents are expected to preserve the information required for both reconstruction fidelity and subsequent editing. To address the concern directly, the revised manuscript will include a new ablation subsection with (i) quantitative overlap statistics between high-importance tokens and regions altered by editing prompts (a minimal sketch of one such statistic appears after these responses) and (ii) qualitative visualizations comparing token retention under understanding-only versus editing-aware prompts. revision: yes

  2. Referee: [§4] §4 (Experiments) and associated tables: the 1.94× reduction and “maintained editing quality” are reported without error bars, statistical significance tests, or per-benchmark editing-specific metrics (e.g., CLIP similarity, FID, or human preference scores). The post-hoc balanced selection and merging steps could mask localized information loss that only appears under certain editing prompts.

    Authors: We agree that the experimental reporting can be strengthened. In the revision we will (a) add error bars computed across three random seeds and report paired statistical significance tests (Wilcoxon signed-rank) for all main results, (b) expand the editing evaluation to include CLIP similarity and FID scores in addition to the benchmark-native metrics, and (c) add a short discussion with case studies illustrating that the balanced selection and merging steps do not produce noticeable localized degradation on representative editing prompts. These additions will be placed in an updated §4 and the corresponding tables. revision: yes

  3. Referee: [§4] §4, baseline comparisons: the statement that G²TR “outperforms baselines on almost all benchmarks” lacks a complete list of baselines, their token-reduction ratios, and the exact metric values; without these, the relative advantage over attention-score or text-image similarity methods cannot be verified.

    Authors: We acknowledge that the original text did not enumerate every baseline with its exact reduction ratio and metric values in a single location. Tables 1–3 already contain the full set of numbers for the attention-based and similarity-based methods we compared against, with reduction ratios matched to 1.94× for fairness. The revised manuscript will add an explicit supplementary table (and a clarifying paragraph in §4) that lists every baseline, its token-reduction ratio, and the complete per-benchmark metric values so that the claimed advantages can be verified directly. revision: yes
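As a concrete reading of the overlap statistic promised in response 1, the sketch below measures how much of the prompt-edited region is covered by retained tokens at the token-grid resolution. The mask names and the recall-style definition are assumptions, not the authors' planned metric.

```python
import numpy as np

def retained_edit_overlap(keep_mask, edit_mask):
    """keep_mask, edit_mask: boolean (H, W) arrays on the visual-token grid."""
    keep_mask = np.asarray(keep_mask, dtype=bool)   # tokens retained by the reduction
    edit_mask = np.asarray(edit_mask, dtype=bool)   # cells altered by the editing prompt
    covered = np.logical_and(keep_mask, edit_mask).sum()
    return covered / max(int(edit_mask.sum()), 1)   # 1.0 = edited area fully retained
```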

Circularity Check

0 steps flagged

No circularity: token importance from external VAE latent consistency

full rationale

The paper's core mechanism estimates token importance via consistency with an external VAE latent (independent of the UMM's own fitted quantities or outputs). This signal is used for balanced selection and merging in a training-free manner after understanding encoding. No self-definitional loops appear (e.g., no importance defined in terms of the reduction result itself), no fitted inputs renamed as predictions, and no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is self-contained against external benchmarks and VAE reconstruction objectives, with empirical validation on understanding/editing tasks rather than tautological equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the claim rests on the domain assumption that VAE latents supply a reliable, task-agnostic importance signal; no free parameters or new entities are explicitly introduced.

axioms (1)
  • domain assumption: VAE latent space captures essential image information sufficient for both reconstruction and generation tasks
    Invoked as the basis for estimating token importance from consistency with VAE latent.

pith-pipeline@v0.9.0 · 5578 in / 1340 out tokens · 126399 ms · 2026-05-13T05:28:37.832351+00:00 · methodology

