{"paper":{"title":"CogVLM2: Visual Language Models for Image and Video Understanding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Xu, Da Yin, Debing Liu, Guanyu Feng, Jie Tang, Ji Qi, Juanzi Li, Junhui Ji, Lei Zhao, Ming Ding, Peng Zhang, Qingsong Lv, Shiyu Huang, Weihan Wang, Wenmeng Yu, Wenyi Hong, Xiaohan Zhang, Xiaotao Gu, Xixuan Song, Yan Wang, Yean Cheng, Yuxiao Dong, Zhao Xue, Zhuoyi Yang, Zihan Wang","submitted_at":"2024-08-29T12:59:12Z","abstract_excerpt":"Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \\times 1344$ pixels. As a video understan"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the reported benchmark improvements stem primarily from the described architecture changes and training recipes rather than undisclosed increases in model size, data volume, or compute.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3ecd0993b7bf0b1783f0a1fa505f6502c498677a9c5ed094444249ccd61a893a"},"source":{"id":"2408.16500","kind":"arxiv","version":1},"verdict":{"id":"9b90ac07-92ba-474a-85a7-b8331cff8803","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:07:23.935615Z","strongest_claim":"CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.","one_line_summary":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the reported benchmark improvements stem primarily from the described architecture changes and training recipes rather than undisclosed increases in model size, data volume, or compute.","pith_extraction_headline":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes."},"references":{"count":94,"sample":[{"doi":"","year":2019,"title":"M. Acharya, K. Kafle, and C. Kanan. Tallyqa: Answering complex counting questions. In Proc. of Association for the Advancement of Artificial Intelligence, 2019","work_id":"1d36bd2e-b486-49fb-912a-ad929c7f9d24","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2010,"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","ref_index":3,"cited_arxiv_id":"2010.11929","is_internal_anchor":true},{"doi":"","year":2015,"title":"S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proc. of International Conference on Computer Vision, pages 2425–2433, 2015","work_id":"4b105d61-5d3e-438b-83f4-246542fc3464","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":6,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":94,"snapshot_sha256":"796a5f33c1c97d617ef32885969f6e583bf386a5f9a1ad2b1cc06655c5d1c13d","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"dc13a115ff0563c7151273119d2ba3fe874c3a609ff6f85e259e1983daeec241"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}