ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
hub Mixed citations
Seed1.5-VL Technical Report
Mixed citation behavior. Most common role is background (43%).
abstract
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal sy
co-cited works
representative citing papers
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Embodied3DBench creates a new evaluation benchmark for low-level embodied spatial intelligence in VLMs, evaluates 13 models showing gaps in interaction perception, and supplies a large synthetic training set that yields measurable gains.
POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
ProSR adds a Counterfactual Invariance Penalty and a Tail Drift Penalty to shape VLM reasoning trajectories for better visual dependence and stability on spatial tasks.
Introduces the first benchmark for metaphorical video understanding, identifies MLLM weaknesses in cross-domain mapping, and proposes an inference-time enhancement using a knowledge graph.
citing papers explorer
No citing papers match the current filters.