{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:CLXB2SP2UUUTKFNJPVGOMMGOJK","short_pith_number":"pith:CLXB2SP2","schema_version":"1.0","canonical_sha256":"12ee1d49faa5293515a97d4ce630ce4ab4f212633893fb842a1cdacd5ad6e731","source":{"kind":"arxiv","id":"2408.16500","version":1},"attestation_state":"computed","paper":{"title":"CogVLM2: Visual Language Models for Image and Video Understanding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Xu, Da Yin, Debing Liu, Guanyu Feng, Jie Tang, Ji Qi, Juanzi Li, Junhui Ji, Lei Zhao, Ming Ding, Peng Zhang, Qingsong Lv, Shiyu Huang, Weihan Wang, Wenmeng Yu, Wenyi Hong, Xiaohan Zhang, Xiaotao Gu, Xixuan Song, Yan Wang, Yean Cheng, Yuxiao Dong, Zhao Xue, Zhuoyi Yang, Zihan Wang","submitted_at":"2024-08-29T12:59:12Z","abstract_excerpt":"Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \\times 1344$ pixels. As a video understan"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2408.16500","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-08-29T12:59:12Z","cross_cats_sorted":[],"title_canon_sha256":"9b21e119188e35c13e6672981c4e7ab473790a80b12ce65aa9eb12dddf1a2839","abstract_canon_sha256":"4a0b0561fb635c897f4d823de89ce5cf6644e8c9e66a8d0e0a72614a246a4b9f"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.729497Z","signature_b64":"Qullf6aLRVObn+u37v61mzQB+Ie8m7KnRLRo+mKzsje9WM4sbBATgNKMQbOsmxy0KKImC/ksZZVC94n/kDncBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"12ee1d49faa5293515a97d4ce630ce4ab4f212633893fb842a1cdacd5ad6e731","last_reissued_at":"2026-05-17T23:38:46.728968Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.728968Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"CogVLM2: Visual Language Models for Image and Video Understanding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Xu, Da Yin, Debing Liu, Guanyu Feng, Jie Tang, Ji Qi, Juanzi Li, Junhui Ji, Lei Zhao, Ming Ding, Peng Zhang, Qingsong Lv, Shiyu Huang, Weihan Wang, Wenmeng Yu, Wenyi Hong, Xiaohan Zhang, Xiaotao Gu, Xixuan Song, Yan Wang, Yean Cheng, Yuxiao Dong, Zhao Xue, Zhuoyi Yang, Zihan Wang","submitted_at":"2024-08-29T12:59:12Z","abstract_excerpt":"Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \\times 1344$ pixels. As a video understan"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the reported benchmark improvements stem primarily from the described architecture changes and training recipes rather than undisclosed increases in model size, data volume, or compute.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3ecd0993b7bf0b1783f0a1fa505f6502c498677a9c5ed094444249ccd61a893a"},"source":{"id":"2408.16500","kind":"arxiv","version":1},"verdict":{"id":"9b90ac07-92ba-474a-85a7-b8331cff8803","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:07:23.935615Z","strongest_claim":"CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.","one_line_summary":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the reported benchmark improvements stem primarily from the described architecture changes and training recipes rather than undisclosed increases in model size, data volume, or compute.","pith_extraction_headline":"The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes."},"references":{"count":94,"sample":[{"doi":"","year":2019,"title":"M. Acharya, K. Kafle, and C. Kanan. Tallyqa: Answering complex counting questions. In Proc. of Association for the Advancement of Artificial Intelligence, 2019","work_id":"1d36bd2e-b486-49fb-912a-ad929c7f9d24","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2010,"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","ref_index":3,"cited_arxiv_id":"2010.11929","is_internal_anchor":true},{"doi":"","year":2015,"title":"S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proc. of International Conference on Computer Vision, pages 2425–2433, 2015","work_id":"4b105d61-5d3e-438b-83f4-246542fc3464","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":6,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":94,"snapshot_sha256":"796a5f33c1c97d617ef32885969f6e583bf386a5f9a1ad2b1cc06655c5d1c13d","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"dc13a115ff0563c7151273119d2ba3fe874c3a609ff6f85e259e1983daeec241"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2408.16500","created_at":"2026-05-17T23:38:46.729053+00:00"},{"alias_kind":"arxiv_version","alias_value":"2408.16500v1","created_at":"2026-05-17T23:38:46.729053+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2408.16500","created_at":"2026-05-17T23:38:46.729053+00:00"},{"alias_kind":"pith_short_12","alias_value":"CLXB2SP2UUUT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"CLXB2SP2UUUTKFNJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"CLXB2SP2","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2410.05970","citing_title":"PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2410.13891","citing_title":"S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2412.17574","citing_title":"HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2501.02955","citing_title":"MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2503.23733","citing_title":"AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2503.23137","citing_title":"When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21272","citing_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17360","citing_title":"Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20177","citing_title":"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21282","citing_title":"EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2406.08035","citing_title":"LVBench: An Extreme Long Video Understanding Benchmark","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2509.14977","citing_title":"EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2501.01957","citing_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21374","citing_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2412.06224","citing_title":"Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21815","citing_title":"High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2412.21059","citing_title":"VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2601.23286","citing_title":"VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13918","citing_title":"Improving Video Generation with Human Feedback","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21232","citing_title":"ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08841","citing_title":"Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07634","citing_title":"VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01006","citing_title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13106","citing_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","ref_index":153,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK","json":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK.json","graph_json":"https://pith.science/api/pith-number/CLXB2SP2UUUTKFNJPVGOMMGOJK/graph.json","events_json":"https://pith.science/api/pith-number/CLXB2SP2UUUTKFNJPVGOMMGOJK/events.json","paper":"https://pith.science/paper/CLXB2SP2"},"agent_actions":{"view_html":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK","download_json":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK.json","view_paper":"https://pith.science/paper/CLXB2SP2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2408.16500&json=true","fetch_graph":"https://pith.science/api/pith-number/CLXB2SP2UUUTKFNJPVGOMMGOJK/graph.json","fetch_events":"https://pith.science/api/pith-number/CLXB2SP2UUUTKFNJPVGOMMGOJK/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK/action/timestamp_anchor","attest_storage":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK/action/storage_attestation","attest_author":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK/action/author_attestation","sign_citation":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK/action/citation_signature","submit_replication":"https://pith.science/pith/CLXB2SP2UUUTKFNJPVGOMMGOJK/action/replication_record"}},"created_at":"2026-05-17T23:38:46.729053+00:00","updated_at":"2026-05-17T23:38:46.729053+00:00"}