Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen · 2025 · cs.CV · arXiv 2505.07062

Mixed citation behavior. Most common role is background (43%).

77 Pith papers citing it

Background 43% of classified citations

abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

citation-role summary

background 16 · baseline 12 · method 6 · dataset 1

citation-polarity summary

background 15 · baseline 12 · use method 6 · unclear 1 · use dataset 1
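
The counts above can be turned into percentage shares with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and the `shares` helper are hypothetical names, and it assumes shares are rounded to whole percentages over all 35 classified citations, which the page does not state.

```python
# Counts copied from the citation-polarity summary above.
polarity_counts = {
    "background": 15,
    "baseline": 12,
    "use method": 6,
    "unclear": 1,
    "use dataset": 1,
}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


polarity_shares = shares(polarity_counts)
print(polarity_shares["background"])  # share for 15 of 35 classified citations
```

Under these assumptions the background share comes out at 43%, matching the headline figure. Applying the same function to the citation-role counts (16, 12, 6, 1) gives 46% instead, so the headline number appears to follow the polarity counts; that is an inference from the arithmetic, not something the page confirms.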

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

cs.AI · 2026-05-09 · unverdicted · novelty 7.0 · 3 refs

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

Benchmarking and Improving GUI Agents in High-Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

cs.CV · 2026-02-05 · unverdicted · novelty 7.0

GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

cs.CR · 2025-10-11 · unverdicted · novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.

VGR: Visual Grounded Reasoning

cs.CV · 2025-06-13 · unverdicted · novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

cs.CE · 2026-05-18 · unverdicted · novelty 6.0

FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.

Unlocking Dense Metric Depth Estimation in VLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.

SEED: Targeted Data Selection by Weighted Independent Set

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods on instruction tuning and segmentation tasks.

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy retention.

citing papers explorer

Showing 50 of 77 citing papers.

ViMU: Benchmarking Video Metaphorical Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 10 · internal anchor
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation cs.CV · 2026-04-13 · unverdicted · none · ref 11 · internal anchor
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue cs.CV · 2026-05-30 · unverdicted · none · ref 22 · internal anchor
MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 62 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors cs.CV · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation cs.CV · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation cs.CV · 2026-05-11 · conditional · none · ref 53 · internal anchor
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents cs.AI · 2026-05-09 · unverdicted · none · ref 33 · 3 links · internal anchor
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 9 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 16 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning cs.CV · 2026-02-05 · unverdicted · none · ref 4 · internal anchor
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 28 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents cs.CR · 2025-10-11 · unverdicted · none · ref 11 · internal anchor
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
VGR: Visual Grounded Reasoning cs.CV · 2025-06-13 · unverdicted · none · ref 13 · internal anchor
VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 10 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 41 · internal anchor
GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation cs.CE · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 62 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Unlocking Dense Metric Depth Estimation in VLMs cs.CV · 2026-05-15 · unverdicted · none · ref 21 · 2 links · internal anchor
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
SEED: Targeted Data Selection by Weighted Independent Set cs.LG · 2026-05-15 · unverdicted · none · ref 25 · internal anchor
SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods on instruction tuning and segmentation tasks.
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs cs.CV · 2026-05-15 · unverdicted · none · ref 18 · internal anchor
LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy retention.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 29 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 4 · internal anchor
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production cs.DC · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning cs.CV · 2026-05-06 · unverdicted · none · ref 10 · internal anchor
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding cs.CV · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 22 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection cs.CV · 2026-04-27 · unverdicted · none · ref 19 · internal anchor
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs cs.CV · 2026-04-27 · unverdicted · none · ref 18 · internal anchor
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model cs.RO · 2026-04-24 · unverdicted · none · ref 10 · internal anchor
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation cs.CV · 2026-04-20 · unverdicted · none · ref 37 · 2 links · internal anchor
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior cs.CV · 2026-04-19 · unverdicted · none · ref 13 · internal anchor
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding cs.CV · 2026-04-15 · unverdicted · none · ref 8 · internal anchor
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch cs.CV · 2026-04-15 · unverdicted · none · ref 14 · internal anchor
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 26 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping cs.RO · 2026-04-13 · unverdicted · none · ref 31 · internal anchor
CLASP achieves 87% success in open-vocabulary desktop grasping via dual-pathway perception, asynchronous closed-loop evaluation, and automated multimodal data synthesis.
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation cs.CV · 2026-04-09 · unverdicted · none · ref 28 · internal anchor
LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents cs.CV · 2026-04-08 · unverdicted · none · ref 31 · internal anchor
GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
InstructTable: Improving Table Structure Recognition Through Instructions cs.CV · 2026-04-03 · unverdicted · none · ref 13 · internal anchor
InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public benchmarks and a new complex-table test set.
AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison cs.CV · 2026-03-14 · conditional · none · ref 5 · internal anchor
AD-Copilot trains an MLLM on a new curated industrial dataset Chat-AD with a Comparison Encoder that uses cross-attention on image pairs, reaching 82.3% accuracy on MMAD and 3.35x gains on MMAD-BBox while generalizing and exceeding human experts on some tasks.
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies cs.CV · 2026-02-23 · unverdicted · none · ref 15 · internal anchor
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning cs.CV · 2025-12-17 · unverdicted · none · ref 16 · internal anchor
Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CV · 2025-11-25 · unverdicted · none · ref 14 · internal anchor
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO · 2025-11-20 · unverdicted · none · ref 20 · internal anchor
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
