arxiv: 2505.07062 · v1 · submitted 2025-05-11 · 💻 cs.CV · cs.AI

Seed1.5-VL Technical Report

Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Chenggang Li, Cheng Lin, Cheng Yuan, Chengzhi Wei, Chenhui Gou, Chenwei Lou, Chundian Liu, Chunyuan Li, Deyao Zhu, Dong Guo, Donghong Zhong, Faming Wu, Feida Zhu, Feng Li, Feng Zhang, Fuxing Leng, Gang Wu, Guang Shi, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haobin Chen, Haoming Wang, Haoqi Fan, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianhui Duan, Jianpeng Jiao, Jian Wang, Jianyu Jiang, Jiashi Feng, Jiawei Wang, Jiaze Chen, Jihao Liu, Jingjia Huang, Jingji Chen, Jingqun Tang, Jingyu Sun, Jin Zeng, Joya Chen, Junda Feng, Junfeng Zhan, Junjie Fang, Jun Long, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Kang Lei, Ke Shen, Ke Wang, Keyu Pan, Kunchang Li, Kun Zhang, Lanxin Li, Lei Li, Lei Shi, Liangqiang Chen, Liang Xiang, Li Han, Lin Chen, Lin Li, Lin Yan, Liping Yuan, Lishu Luo, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Liu, Pengfei Wu, Qinghao Ye, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Rui Qian, Rui Wang, Rui Yang, Rui Zhao, Ru Zhang, Shaoqiang Xu, Shen Yan, Shihao Liang, Shipeng Yan, Shixiong Zhao, Shuai Peng, Shuaishuai Cao, Shuangye Li, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Shu Zhong, Sihang Yuan, Sijin Wu, Songhua Cai, Tenglong Ao, Tianhao Yang, Tianheng Cheng, Tingting Zhang, Wanjun Zhong, Weihao Yu, Wei Jia, Weiwei Liu, Wei Weng, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenqian Wang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xianhan Zeng, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaolei Zhu, Xiao Li, Xiao Liu, Xiaoying Jia, Xiaoying Zhang, Xijin Zhang, Xinchen Zhang, Xin Liu, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuanwei Zhang, Xuefeng Xiao, Xuehan Xiong, Xujing Li, Yanghua Peng, Yangrui Chen, Yanwei Li, Yan Wu, Yanxu Hu, Yawei Wen, Yifan Du, Yihao Zhang, Yi Lin, Yining Ye, Yiyuan Hu, Yiyuan Zhang, Yonghui Wu, 
Youbin Wu, Yudong Liu, Yue Ling, Yufeng Yuan, Yufeng Zhou, Yuhang Xu, Yuhong Yang, Yujia Qin, Yu Li, Yu Liu, Yunhao Fang, Yuntao Li, Yun Zhang, Yurui Ren, Yuwen Xiong, Yu Yue, Zanbo Wang, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiwu He, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song

Pith reviewed 2026-05-11 05:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language model · multimodal understanding · mixture-of-experts · visual reasoning · agent control · GUI tasks · benchmark evaluation

The pith

A vision-language model pairs a 532-million-parameter vision encoder with a mixture-of-experts language model of 20 billion active parameters and reports state-of-the-art results on 38 of 60 public benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Seed1.5-VL to advance general-purpose multimodal understanding and reasoning through a specific compact architecture. It combines a 532 million parameter vision encoder with a mixture-of-experts language model that activates 20 billion parameters. This design is shown to deliver leading results on a wide range of public tests for visual understanding, video, and agent control while also handling reasoning tasks such as visual puzzles. A sympathetic reader would care because the report claims these outcomes arise from deliberate choices in model design, data construction, and staged training, suggesting efficient paths to capable multimodal systems.

Core claim

Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles.

What carries the argument

The pairing of a 532 million parameter vision encoder with a mixture-of-experts language model of 20 billion active parameters, refined through multi-stage data construction and training.

If this is right

The model supports strong performance in interactive agent tasks such as GUI control and gameplay.
It handles multimodal reasoning challenges including visual puzzles effectively.
The outcomes follow from targeted choices across model architecture, data construction, and training stages.
These capabilities position the model for wider use across diverse multimodal applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance holds on uncontaminated tests, similar compact encoder-plus-experts designs could reduce the compute needed for real-world agent deployment.
Exploring whether the same training recipe scales to other modalities might reveal transferable patterns in vision-language integration.
Testing the model on entirely fresh puzzle and control scenarios created after training would provide a clearer check on reasoning claims.

Load-bearing premise

The selected public benchmarks and internal suites accurately reflect genuine generalization rather than results inflated by overlap with training data.

What would settle it

Direct measurement showing that model accuracy drops sharply on newly constructed benchmarks with zero possible overlap to any training examples, or explicit documentation of test-set contamination in the training corpus.

Original abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Seed1.5-VL is a practical engineering report on a compact MoE VLM that posts strong benchmark and agent numbers, but the SOTA claims rest on unverified data hygiene.

The main point is a new vision-language model with a 532M vision encoder and 20B-active MoE LLM that claims SOTA on 38 of 60 public benchmarks plus better agent performance than OpenAI CUA and Claude 3.7 on GUI and gameplay tasks. The report walks through the full build: vision encoder choices, MoE architecture, data construction stages, and the training sequence from pretraining to alignment. Those concrete details on what they actually did at each step are the useful part for anyone trying to reproduce or scale similar systems. The agent results are the most interesting angle because they move past static benchmarks into interactive control, which matters for real applications. The soft spot is the missing decontamination evidence. The paper describes large-scale multimodal data collection but supplies no overlap statistics, embedding-based checks, or ablation on cleaned subsets. Without that, the benchmark wins could partly reflect training-test leakage rather than generalization, which is a standard risk for models trained on web-scale corpora. This report is for practitioners who train or deploy VLMs and want a working recipe at this size rather than new theory. It deserves peer review because the architecture and process details are specific enough to evaluate and the empirical scope is broad, even if more leakage analysis would be required in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces Seed1.5-VL, a compact vision-language model consisting of a 532M-parameter vision encoder and a 20B-active-parameter MoE LLM. It claims state-of-the-art results on 38 out of 60 public VLM benchmarks, superior performance on agent-centric tasks (GUI control, gameplay) relative to OpenAI CUA and Claude 3.7, and strong multimodal reasoning on visual puzzles. The report focuses on model design, data construction, and multi-stage training procedures rather than exhaustive experimental ablations.

Significance. If the empirical claims are substantiated by verifiable generalization (rather than benchmark contamination), the work would demonstrate that relatively compact MoE-based VLMs can match or exceed larger systems on both standard understanding benchmarks and agentic tasks. This would be a meaningful data point for efficient multimodal architectures and could usefully inform subsequent research on data curation and training curricula for VLMs.

major comments (2)

[Evaluation / Experiments] Evaluation section (and abstract): The headline claim of SOTA performance on 38/60 public benchmarks and superiority on internal agent suites is presented without any quantitative leakage audit (n-gram overlap, embedding similarity, or training-cutoff statistics) between the training corpus and the test sets of the cited benchmarks. This omission directly undermines the ability to interpret the reported scores as evidence of generalization rather than memorization.
[Data Construction] Data construction section: The description of the multimodal training corpus and curation pipeline supplies no aggregate statistics on data sources, decontamination steps, or temporal cutoffs relative to benchmark release dates. Without these, the central empirical claims rest on an untested premise that the evaluation suites are uncontaminated.
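
The n-gram overlap audit requested in the major comments could look roughly like the following. This is a minimal sketch, not the authors' procedure: the function names, the whitespace tokenizer, and the default window of 8 words are all assumptions chosen for illustration, and a real audit over a web-scale corpus would need hashing or a Bloom filter rather than an in-memory set.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of test items that share at least one n-gram with the corpus.

    Hypothetical helper: a test item counts as flagged if any of its
    word n-grams also appears in any training document.
    """
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    items = list(test_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / len(items)
```

Reporting this rate per benchmark, alongside scores on the unflagged subset, is one way to show how much of a reported result survives decontamination. Exact n-gram matching misses paraphrases and does not cover image content, which is why the referee also lists embedding similarity.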

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise table summarizing the 60 public benchmarks and the exact metrics on which SOTA is claimed, rather than the aggregate 38/60 figure alone.
[Model Architecture] Notation for the MoE LLM (active vs. total parameters) is introduced but not consistently referenced in later sections when discussing scaling or inference cost.
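
The distinction between active and total parameters raised in the last minor comment can be made concrete with a small bookkeeping sketch. It assumes a generic top-k routed MoE in which every token uses the shared weights plus `top_k` experts; the class and field names are hypothetical, and nothing here reflects the actual Seed1.5-VL configuration, which this page does not state beyond the 20B active figure.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical parameter bookkeeping for a top-k routed MoE model."""

    shared_params: float      # attention, embeddings, routers: used by every token
    params_per_expert: float  # parameters of one expert, summed over all MoE layers
    num_experts: int          # experts available in each MoE layer
    top_k: int                # experts activated per token

    def total_params(self) -> float:
        """All weights that must be stored (drives memory footprint)."""
        return self.shared_params + self.num_experts * self.params_per_expert

    def active_params(self) -> float:
        """Weights touched for a single token (drives per-token compute)."""
        return self.shared_params + self.top_k * self.params_per_expert
```

With toy values of 2 shared units, 1 unit per expert, 8 experts, and top-2 routing, the total is 10 units while only 4 are active per token. Inference cost discussions should say which of the two quantities they mean, since memory scales with the former and per-token FLOPs with the latter.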

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and for emphasizing the need for greater transparency around data contamination to support the validity of our benchmark results. We address the two major comments below and outline targeted revisions to the manuscript.

Referee: [Evaluation / Experiments] Evaluation section (and abstract): The headline claim of SOTA performance on 38/60 public benchmarks and superiority on internal agent suites is presented without any quantitative leakage audit (n-gram overlap, embedding similarity, or training-cutoff statistics) between the training corpus and the test sets of the cited benchmarks. This omission directly undermines the ability to interpret the reported scores as evidence of generalization rather than memorization.

Authors: We agree that explicit quantitative leakage audits would strengthen the interpretation of the reported results. The original manuscript prioritized descriptions of model design, data construction principles, and multi-stage training over detailed decontamination metrics. Internally, we applied release-date-based temporal filtering and content deduplication for known public benchmarks to reduce contamination risk, but we did not compute or report specific n-gram overlap or embedding similarity statistics. In the revised version, we will add a subsection in the evaluation section summarizing our contamination mitigation approach and any high-level statistics that can be disclosed without compromising proprietary data sources.

Revision: yes

Referee: [Data Construction] Data construction section: The description of the multimodal training corpus and curation pipeline supplies no aggregate statistics on data sources, decontamination steps, or temporal cutoffs relative to benchmark release dates. Without these, the central empirical claims rest on an untested premise that the evaluation suites are uncontaminated.

Authors: The data construction section focuses on the high-level curation pipeline and training experiences rather than exhaustive statistics, which aligns with the report's stated goal of sharing practical insights. We recognize that aggregate statistics on sources, decontamination, and temporal cutoffs would improve transparency. We will expand this section with approximate proportions of data sources and a clearer description of our deduplication and temporal cutoff procedures. Full per-benchmark quantitative audits remain challenging to publish in detail due to data scale and proprietary constraints, but the added description will clarify our general safeguards.

Revision: partial
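
The safeguards named in the simulated rebuttal, release-date-based temporal filtering and content deduplication, can be sketched as follows. This is one possible reading under stated assumptions, not the authors' pipeline: the `Sample` type, the rule of dropping anything collected on or after the benchmark release date, and exact-match hashing after whitespace and case normalization are all illustrative choices.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample with its collection date."""
    text: str
    collected: date


def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_corpus(
    samples: Iterable[Sample],
    benchmark_release: date,
    benchmark_items: Iterable[str],
) -> List[Sample]:
    """Keep samples that pass a temporal cutoff and an exact-duplicate check."""
    blocked: Set[str] = {fingerprint(item) for item in benchmark_items}
    seen: Set[str] = set()
    kept: List[Sample] = []
    for sample in samples:
        # Temporal filter (assumed rule): drop data collected on or after release.
        if sample.collected >= benchmark_release:
            continue
        fp = fingerprint(sample.text)
        # Drop verbatim copies of benchmark items and repeated training samples.
        if fp in blocked or fp in seen:
            continue
        seen.add(fp)
        kept.append(sample)
    return kept
```

Counting how many samples each rule removes would yield the kind of aggregate statistic the referee asks for. Exact hashing catches only verbatim copies after normalization, so near-duplicates and paraphrased benchmark items pass through unless a fuzzy method is added.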

Circularity Check

0 steps flagged

No circularity: empirical report with independent benchmark evaluations

The paper is a technical report on model architecture (532M vision encoder + 20B MoE LLM), data construction, and training stages, culminating in reported empirical results on 60 public benchmarks and internal suites. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Public benchmark scores are external, independently measured quantities not defined from the model's fitted parameters or self-citations. Internal suites carry selection-bias risk but do not create circularity per the enumerated patterns. The central SOTA claim rests on observable performance metrics rather than self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The report is an empirical model description with no mathematical derivations; all claims rest on benchmark numbers whose validity depends on unstated training and evaluation choices.

free parameters (2)

Vision encoder size (532M parameters)
Architectural scale chosen by the authors.
MoE LLM active parameters (20B)
Mixture-of-experts configuration and activation count selected during design.

axioms (1)

Domain assumption: standard large-scale multimodal pretraining and fine-tuning procedures
The abstract assumes conventional vision-language training pipelines without stating deviations.

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
cs.CV 2026-04 unverdicted novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
cs.CV 2026-05 conditional novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 7.0

An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
cs.CV 2026-04 unverdicted novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
cs.RO 2026-05 unverdicted novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-05 conditional novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0

Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
cs.DC 2026-05 unverdicted novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 6.0

VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
cs.CV 2026-05 unverdicted novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
Leveraging Verifier-Based Reinforcement Learning in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
cs.CV 2026-04 unverdicted novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
cs.CV 2026-04 unverdicted novelty 6.0

DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV 2026-04 unverdicted novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping
cs.RO 2026-04 unverdicted novelty 6.0

CLASP achieves 87% success in open-vocabulary desktop grasping via dual-pathway perception, asynchronous closed-loop evaluation, and automated multimodal data synthesis.
EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
cs.CV 2026-04 unverdicted novelty 6.0

EpiAgent is a new agent-centric system that restores degraded ancient inscriptions with better quality and generalization than prior rigid AI methods by using an LLM planner to coordinate multimodal tools and iterativ...
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
cs.CV 2026-04 unverdicted novelty 6.0

LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
cs.CV 2026-04 unverdicted novelty 6.0

GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
InstructTable: Improving Table Structure Recognition Through Instructions
cs.CV 2026-04 unverdicted novelty 6.0

InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Text-Guided Multi-Scale Frequency Representation Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
Perceptual Flow Network for Visually Grounded Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
cs.CV 2026-04 unverdicted novelty 5.0

An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
cs.CV 2025-09 unverdicted novelty 5.0

LLaVA-OneVision-1.5 provides open datasets, code, and models that match or exceed closed competitors on 27 benchmarks at low cost through curated data and efficient training.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
cs.CV 2026-04 unverdicted novelty 3.0

Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...
Seedance 2.0: Advancing Video Generation for World Complexity
cs.CV 2026-04 unverdicted novelty 3.0

Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Seedream 4.0: Toward Next-generation Multimodal Image Generation
cs.CV 2025-09 unverdicted novelty 3.0

Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

208 extracted references · 208 canonical work pages · cited by 48 Pith papers · 46 internal anchors

[1]

Fuyu-8b: A multimodal architecture for ai agents. https://www.adept.ai/blog/fuyu-8b, 2023

work page 2023
[2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Countgd: Multi-modal open-world counting

Niki Amini-Naieni, Tengda Han, and Andrew Zisserman. Countgd: Multi-modal open-world counting. Advances in Neural Information Processing Systems, 37:48810–48837, 2024

work page 2024
[4]

Understanding alignment in multimodal llms: A comprehensive study

Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, et al. Understanding alignment in multimodal llms: A comprehensive study. arXiv preprint arXiv:2407.02477, 2024

work page arXiv 2024
[5]

Claude 3.7 sonnet system card

Anthropic. Claude 3.7 sonnet system card. 2025

work page 2025
[6]

Claude’s extended thinking, 2025

Anthropic. Claude’s extended thinking, 2025. URL https://www.anthropic.com/news/visible-extended-thinking

work page 2025
[7]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019

work page 2019
[9]

Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

work page arXiv 2021
[10]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review arXiv 2024
[11]

Windows agent arena: Evaluating multi-modal os agents at scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024

work page arXiv 2024
[12]

Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024

work page arXiv 2024
[13]

Flux: Fast software-based communication overlap on gpus through kernel fusion

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. Flux: Fast software-based communication overlap on gpus through kernel fusion. arXiv preprint arXiv:2406.06858, 2024

work page arXiv 2024
[14]

MMDetection: Open MMLab Detection Toolbox and Benchmark

Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and b...

work page Pith review arXiv 1906
[15]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024

work page internal anchor Pith review arXiv 2024
[16]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

work page 2024
[17]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16901–16911, June 2024

work page 2024
[18]

Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns, 2024

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns, 2024. URL https://arxiv.org/abs/2403.13315

work page arXiv 2024
[19]

Lost in time: A new temporal benchmark for videollms

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Tvbench: Redesigning video-language evaluation. arXiv preprint arXiv:2410.07752, 2024

work page arXiv 2024
[20]

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023

work page 2023
[21]

Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024

work page arXiv 2024
[22]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

work page 2009
[24]

Unveiling encoder-free vision-language models

Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024

work page arXiv 2024
[25]

Self-play with execution feedback: Improving instruction-following capabilities of large language models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models, 2024. URL https://arxiv.org/abs/2406.13542

work page arXiv 2024
[26]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[27]

Counting out time: Class agnostic video repetition counting in the wild

Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Counting out time: Class agnostic video repetition counting in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10387–10396, 2020

work page 2020
[28]

Data filtering networks

Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023

work page arXiv 2023
[29]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19358–19369, 2023

work page 2023
[30]

Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation

Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, and Minlan Yu. Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation. arXiv preprint arXiv:2408.03505, 2024

work page arXiv 2024
[31]

Helix: A vision-language-action model for generalist humanoid control

Figure AI. Helix: A vision-language-action model for generalist humanoid control. https://www.figure.ai/news/helix, 2025. Accessed: 2025-04-23

work page 2025
[32]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

work page 2024
[34]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

work page 2017
[35]

Experiment with gemini 2.0 flash native image generation

Google. Experiment with gemini 2.0 flash native image generation. https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation, 2025

work page 2025
[36]

Saliency-guided detr for moment retrieval and highlight detection

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, and Maksim Kuprashevich. Saliency-guided detr for moment retrieval and highlight detection. arXiv preprint arXiv:2410.01615, 2024

work page arXiv 2024
[37]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

work page 2024
[39]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019

work page 2019
[41]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

work page internal anchor Pith review arXiv 2024
[42]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

work page arXiv 2024
[43]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

work page 2021
[44]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021

work page 2021
[45]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review arXiv 2010
[46]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

work page internal anchor Pith review arXiv 1904
[48]

Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. arXiv preprint arXiv:2501.02955, 2025

work page internal anchor Pith review arXiv 2025
[49]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

work page internal anchor Pith review arXiv 2025
[50]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

work page 2019
[51]

Online video understanding: A comprehensive benchmark and memory-augmented method

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: A comprehensive benchmark and memory-augmented method. arXiv preprint arXiv:2501.00584, 2024

work page arXiv 2024
[52]

Classification done right for vision-language pre-training

Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, and Haoqi Fan. Classification done right for vision-language pre-training. Advances in Neural Information Processing Systems, 37:96483–96504, 2024

work page 2024
[53]

J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55

work page doi:10.1109/mcse.2007.55 2007
[54]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

MegaScale: Scaling large language model training to more than 10,000 GPUs

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, 2024

work page 2024
[58]

Figureqa: An annotated figure dataset for visual reasoning

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017

work page arXiv 2017
[59]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[60]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

work page 2014
[61]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016

work page 2016
[62]

Ocr-free document understanding transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV), 2022

work page 2022
[63]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Grace Lam, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kro...

work page 2025
[64]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[65]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023

work page 2023
[66]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020

work page 2020
[67]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. URL https://arxiv.org/abs/2309.06180

work page internal anchor Pith review arXiv
[70]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[71]

Echarts: a declarative framework for rapid construction of web-based visualization

Deqing Li, Honghui Mei, Yi Shen, Shuang Su, Wenli Zhang, Junting Wang, Ming Zu, and Wei Chen. Echarts: a declarative framework for rapid construction of web-based visualization. Visual Informatics, 2(2):136–146, 2018

work page 2018
[72]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review arXiv 2024
[73]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

work page arXiv 2025
[74]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

work page 2024
[75]

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? arXiv preprint arXiv:2501.05510, 2025

work page arXiv 2025
[76]

The devil is in the details: Tackling unimodal spurious correlations for generalizable multimodal reward models

Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, and Le Sun. The devil is in the details: Tackling unimodal spurious correlations for generalizable multimodal reward models. arXiv preprint arXiv:2503.03122, 2025

work page arXiv 2025
[77]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024

work page arXiv 2024
[78]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review arXiv 2023
[79]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

work page 2023
[80]

Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?

Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? arXiv preprint arXiv:2404.05955, 2024

work page arXiv 2024
[81]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024

work page 2024
[82]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

work page 2024

Showing first 80 references.