hub Canonical reference

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang · 2024 · cs.CV · arXiv 2408.16500

Canonical reference. 83% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 5 unclear 1

representative citing papers

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

Towards Unconstrained Human-Object Interaction

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

cs.RO · 2025-05-27 · conditional · novelty 7.0

EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

cs.CL · 2025-03-31 · unverdicted · novelty 7.0

AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack

cs.CR · 2024-10-13 · unverdicted · novelty 7.0

S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

cs.CV · 2025-07-01 · unverdicted · novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

cs.CV · 2025-03-29 · unverdicted · novelty 6.0

Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.

Improving Video Generation with Human Feedback

cs.CV · 2025-01-23 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

citing papers explorer

Showing 35 of 35 citing papers.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 41 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL · 2026-04-20 · unverdicted · none · ref 77 · internal anchor
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
Towards Unconstrained Human-Object Interaction cs.CV · 2026-04-15 · unverdicted · none · ref 13 · internal anchor
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models cs.CV · 2026-04-08 · unverdicted · none · ref 9 · internal anchor
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 25 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild cs.RO · 2025-05-27 · conditional · none · ref 21 · internal anchor
EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization cs.CL · 2025-03-31 · unverdicted · none · ref 8 · internal anchor
AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks cs.CV · 2024-12-23 · unverdicted · none · ref 21 · internal anchor
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack cs.CR · 2024-10-13 · unverdicted · none · ref 76 · internal anchor
S⁴ST shows that dimensionally consistent scaling with low-redundancy complementary transforms achieves state-of-the-art data-free transferable targeted attacks by exploiting visual data's multi-scale nature.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 12 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking cs.CV · 2026-06-30 · unverdicted · none · ref 13 · internal anchor
PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 10 · internal anchor
MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models cs.CL · 2026-05-19 · conditional · none · ref 6 · internal anchor
Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CV · 2026-05-17 · unverdicted · none · ref 25 · 2 links · internal anchor
Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering cs.CV · 2026-05-06 · unverdicted · none · ref 53 · internal anchor
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis cs.CV · 2026-04-08 · unverdicted · none · ref 16 · internal anchor
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing cs.CV · 2026-04-06 · unverdicted · none · ref 13 · internal anchor
MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 20 · internal anchor
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning? cs.CV · 2025-03-29 · unverdicted · none · ref 69 · internal anchor
Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.
Improving Video Generation with Human Feedback cs.CV · 2025-01-23 · unverdicted · none · ref 23 · internal anchor
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 15 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation cs.CV · 2024-12-30 · unverdicted · none · ref 37 · internal anchor
VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 30 · internal anchor
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer cs.CV · 2024-08-12 · unverdicted · none · ref 81 · internal anchor
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting cs.CV · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
TrackRef3D proposes a fully automatic multi-view consistent track-then-label method for open-world referring segmentation in 3D Gaussian Splatting using TSCM, visibility-aware descriptions, and hybrid contrastive training.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 36 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models cs.CV · 2026-05-09 · conditional · none · ref 9 · internal anchor
A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
KD-CVG: A Knowledge-Driven Approach for Creative Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 13 · internal anchor
KD-CVG uses an Advertising Creative Knowledge Base plus Semantic-Aware Retrieval and Multimodal Knowledge Reference modules to improve semantic alignment and motion realism in text-to-video generation for advertising.
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures cs.AI · 2026-04-23 · unverdicted · none · ref 11 · 2 links · internal anchor
ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence cs.CV · 2025-09-18 · unverdicted · none · ref 2 · internal anchor
EchoVLM applies a Mixture-of-Experts vision-language model to ultrasound imaging across seven body regions, reporting gains of 10.15 BLEU-1 and 4.77 ROUGE-1 over Qwen2-VL on report generation.
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling cs.CV · 2024-10-08 · unverdicted · none · ref 61 · internal anchor
PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 153 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction cs.CV · 2025-01-03 · conditional · none · ref 57 · internal anchor
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation cs.CV · 2026-01-30 · unreviewed · ref 4 · internal anchor
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models cs.CV · 2025-12-26 · unreviewed · ref 11 · internal anchor

CogVLM2: Visual Language Models for Image and Video Understanding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer