Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

cs.CV · 2026-02-19 · unverdicted · novelty 7.0

Introduces the Visual Information Gain (VIG) metric, which measures visual contribution via perplexity reduction, and applies it to selectively train LVLMs on high-VIG samples and tokens, improving grounding with reduced supervision.

SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

cs.CV · 2026-04-29 · unverdicted · novelty 4.0

A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authentic samples.

citing papers explorer

Showing 3 of 3 citing papers.

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

cs.CV · 2026-04-15 · unverdicted · none · ref 13

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

cs.CV · 2026-02-19 · unverdicted · none · ref 16

SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

cs.CV · 2026-04-29 · unverdicted · none · ref 48
