Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen · 2025 · cs.CV · arXiv 2505.07062

Mixed citation behavior. Most common role is background (43%).

100 Pith papers citing it

Background 43% of classified citations

abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).

citation-role summary

background: 16 · baseline: 12 · method: 6 · dataset: 1

citation-polarity summary

background: 15 · baseline: 12 · use method: 6 · unclear: 1 · use dataset: 1

authors

Dong Guo Faming Wu Feida Zhu Fuxing Leng Guang Shi Haobin Chen

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

MSQA benchmark shows LLMs exhibit cultural degradation and a locality effect where competence tracks pre-training exposure more than reasoning, and common inference-time fixes do not resolve it.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Embodied3DBench creates a new evaluation benchmark for low-level embodied spatial intelligence in VLMs, evaluates 13 models showing gaps in interaction perception, and supplies a large synthetic training set that yields measurable gains.

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

cs.RO · 2026-05-27 · unverdicted · novelty 7.0

POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

cs.AI · 2026-05-09 · unverdicted · novelty 7.0 · 3 refs

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

Benchmarking and Improving GUI Agents in High-Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

cs.CV · 2026-02-05 · unverdicted · novelty 7.0

GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

cs.CR · 2025-10-11 · unverdicted · novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.

VGR: Visual Grounded Reasoning

cs.CV · 2025-06-13 · unverdicted · novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.

citing papers explorer

Showing 6 of 6 citing papers after filters.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · none · ref 9 · internal anchor

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

cs.CV · 2026-05-08 · unverdicted · none · ref 10 · internal anchor

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

cs.CV · 2026-04-19 · unverdicted · none · ref 13 · internal anchor

DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · none · ref 26 · internal anchor

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens and supports streaming via detachable KV-cache.

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

cs.CV · 2026-04-19 · unverdicted · none · ref 41 · internal anchor

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

Seedream 4.0: Toward Next-generation Multimodal Image Generation

cs.CV · 2025-09-24 · unverdicted · none · ref 6 · internal anchor

Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.
