CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang · 2024 · cs.CV · arXiv 2408.16500

Canonical reference. 83% of citing Pith papers cite this work as background.

35 Pith papers citing it

abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

citation-role summary

background 6

citation-polarity summary

background 5 · unclear 1

representative citing papers

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

Towards Unconstrained Human-Object Interaction

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

cs.RO · 2025-05-27 · conditional · novelty 7.0

EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

cs.CL · 2025-03-31 · unverdicted · novelty 7.0

AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack

cs.CR · 2024-10-13 · unverdicted · novelty 7.0

S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

cs.CV · 2025-07-01 · unverdicted · novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

cs.CV · 2025-03-29 · unverdicted · novelty 6.0

Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.

Improving Video Generation with Human Feedback

cs.CV · 2025-01-23 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · none · ref 41 · internal anchor

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · none · ref 21 · internal anchor

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack

cs.CR · 2024-10-13 · unverdicted · none · ref 76 · internal anchor

S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · none · ref 12 · internal anchor

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · none · ref 37 · internal anchor

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · none · ref 30 · internal anchor

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · none · ref 81 · internal anchor

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

cs.CV · 2024-10-08 · unverdicted · none · ref 61 · internal anchor

PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
