Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Pith reviewed 2026-05-10 12:44 UTC · model grok-4.3
The pith
Vision-language models gain localization and text-reading skills by aligning image, caption, and box data during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a language model base, the models acquire visual capacity through a visual receptor, input-output interface, three-stage training pipeline, and multilingual multimodal cleaned corpus. Aligning image-caption-box tuples adds grounding and text-reading abilities. The resulting models, including the base and chat versions, set new records for generalist models of similar scale on visual-centric benchmarks such as image captioning, question answering, and visual grounding in both zero-shot and few-shot settings, and they also outperform prior vision-language chatbots on real-world dialog tasks.
What carries the argument
Alignment of image-caption-box tuples within a three-stage training pipeline that adds visual capacity to a language model base.
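The abstract does not specify how an image-caption-box tuple is serialized during training. A minimal sketch of one plausible serialization follows, with hypothetical <ref>/<box> marker tokens and coordinates normalized to a 0-999 integer grid so that boxes become ordinary text; the tag names and the normalization are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical serialization of an image-caption-box tuple for grounding
# training. Tag names and the 0-999 coordinate grid are illustrative
# assumptions; the paper's exact format may differ.

def serialize_grounded_caption(image_path, phrase, box, img_w, img_h):
    """Render one (image, caption phrase, bounding box) tuple as training text."""
    x1, y1, x2, y2 = box  # absolute pixel coordinates
    # Normalize to a fixed integer grid so box coordinates become plain tokens.
    nx1, ny1 = int(1000 * x1 / img_w), int(1000 * y1 / img_h)
    nx2, ny2 = int(1000 * x2 / img_w), int(1000 * y2 / img_h)
    return (
        f"<img>{image_path}</img>"
        f"<ref>{phrase}</ref>"
        f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"
    )

# Example: a 640x480 image with a dog at pixels (32, 100)-(320, 460).
print(serialize_grounded_caption("dog.jpg", "the brown dog", (32, 100, 320, 460), 640, 480))
# -> <img>dog.jpg</img><ref>the brown dog</ref><box>(50,208),(500,958)</box>
```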
If this is right
- The models achieve leading results among similar-scale generalists on image captioning, question answering, and visual grounding benchmarks.
- They maintain strong performance in zero-shot and few-shot evaluation settings without task-specific fine-tuning.
- The chat-tuned version surpasses existing vision-language chatbots on real-world dialog benchmarks.
- The models can perform localization and text reading in addition to basic image description and question answering.
Where Pith is reading between the lines
- The same training approach could be applied to other language model bases to create additional versatile multimodal systems.
- Emphasis on multilingual cleaned data may lead to stronger results on visual tasks involving non-English text or captions.
- These capabilities could support downstream uses such as interactive image analysis tools that reference specific objects by location.
Load-bearing premise
That the specific combination of visual receptor, input-output interface, three-stage training pipeline, multilingual multimodal cleaned corpus, and image-caption-box alignment produces genuine generalization rather than benchmark-specific gains or data artifacts.
What would settle it
A new test set of images containing text and objects in novel combinations where the models fail to outperform prior generalist models of similar size on localization or text-reading accuracy.
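Such a test would presumably score localization with a box-overlap criterion. A minimal sketch of Acc@0.5-style grounding accuracy follows; the 0.5 IoU threshold is a common convention assumed here, not a protocol stated in the abstract.

```python
# Minimal sketch of IoU-based grounding accuracy (Acc@0.5-style), the kind of
# metric a held-out localization test would likely use. The 0.5 threshold is a
# common convention assumed here, not a detail taken from this paper.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def grounding_accuracy(predictions, ground_truth, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the reference box meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)
```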
Original abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Qwen-VL series of large vision-language models built on the Qwen-LM foundation. It adds visual capacity via a visual receptor, input-output interface, 3-stage training pipeline, and a cleaned multilingual multimodal corpus, with additional image-caption-box alignment to enable grounding and text-reading. The resulting Qwen-VL and Qwen-VL-Chat models are claimed to set new records for generalist models of similar scale on visual-centric benchmarks (captioning, VQA, grounding) in zero- and few-shot settings, and to outperform existing vision-language chatbots on real-world dialog benchmarks. Code, models, and a demo are released.
Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for demonstrating a scalable recipe that extends a strong language model into a versatile generalist capable of localization and text reading in addition to standard VQA and captioning. The open release of code, models, and demo is a clear strength that enables reproducibility and follow-on research.
major comments (3)
- [Experiments] Experiments section (and associated tables): the manuscript asserts new state-of-the-art results on multiple benchmarks but provides no component-wise ablations (e.g., 2-stage vs. 3-stage training, with vs. without box alignment, or controlled data-matched baselines against a Qwen-LM-scale model alone). This leaves untested the central claim that the reported gains arise specifically from the visual receptor, 3-stage pipeline, and alignment rather than from data volume/quality or model scale; an illustrative ablation grid is sketched after this list.
- [Results] Results tables: no error bars, multiple runs, or statistical significance tests are reported for the benchmark numbers, and the evaluation protocols (exact prompts, few-shot examples, preprocessing) are not fully specified, making it impossible to verify the claimed records or compare fairly to prior work.
- [Training Pipeline] Section 3 (training pipeline): the description of the 3-stage training and the image-caption-box alignment objective is high-level; without quantitative isolation of each stage's contribution or details on how the alignment loss interacts with the language-modeling objective, the causal role of these design choices in the final performance remains unclear.
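One way to read the ablation request above is as a small configuration grid over the contested design choices. A minimal sketch follows, with hypothetical configuration names and fields; it illustrates the requested comparisons, not an experiment reported in the paper.

```python
# Illustrative ablation configurations matching the referee's three requests.
# Names and fields are hypothetical; they are not taken from the paper.
ablations = [
    {"name": "full",            "pipeline": "3-stage", "box_alignment": True,  "visual_receptor": True},
    {"name": "2-stage",         "pipeline": "2-stage", "box_alignment": True,  "visual_receptor": True},
    {"name": "no-box-align",    "pipeline": "3-stage", "box_alignment": False, "visual_receptor": True},
    {"name": "lm-only-control", "pipeline": "3-stage", "box_alignment": False, "visual_receptor": False},
]

# Each configuration would be trained on the identical data mixture and scored
# with the same benchmark protocol, so that any gap isolates one design choice.
for cfg in ablations:
    print(cfg["name"], cfg)
```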
minor comments (2)
- [Abstract] The abstract and introduction use the phrase 'set new records' without immediately citing the specific tables or prior best scores being surpassed; adding explicit cross-references would improve readability.
- [Model Architecture] Notation for the visual receptor and input-output interface could be made more precise (e.g., explicit tensor shapes or layer dimensions) to aid replication.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript accordingly to improve clarity, rigor, and reproducibility.
Point-by-point responses
-
Referee: [Experiments] Experiments section (and associated tables): the manuscript asserts new state-of-the-art results on multiple benchmarks but provides no component-wise ablations (e.g., 2-stage vs. 3-stage training, with vs. without box alignment, or controlled data-matched baselines against Qwen-LM scale alone). This leaves the central claim that the reported gains arise specifically from the visual receptor, 3-stage pipeline, and alignment rather than from data volume/quality or model scale untested.
Authors: We agree that explicit component-wise ablations strengthen the attribution of gains to our design choices. In the revised manuscript we add a dedicated ablation subsection that reports: (i) 3-stage vs. 2-stage training on the same data mixture, (ii) performance with and without the image-caption-box alignment objective, and (iii) a controlled comparison against a Qwen-LM-scale baseline trained on identical multimodal data but without the visual receptor and alignment stages. These results indicate that the largest incremental gains arise from the 3-stage pipeline and box alignment rather than data volume alone. revision: yes
-
Referee: [Results] Results tables: no error bars, multiple runs, or statistical significance tests are reported for the benchmark numbers, and the evaluation protocols (exact prompts, few-shot examples, preprocessing) are not fully specified, making it impossible to verify the claimed records or compare fairly to prior work.
Authors: We have expanded the experimental setup and appendix to provide complete evaluation protocols, including the exact prompts, few-shot example selections, and preprocessing pipelines used for every benchmark. Regarding error bars and multiple runs, each full training run of these models requires substantial compute; we therefore report single-run results and have added an explicit limitations paragraph noting this constraint and the consequent absence of statistical significance tests. revision: partial
-
Referee: [Training Pipeline] Section 3 (training pipeline): the description of the 3-stage training and the image-caption-box alignment objective is high-level; without quantitative isolation of each stage's contribution or details on how the alignment loss interacts with the language-modeling objective, the causal role of these design choices in the final performance remains unclear.
Authors: We have revised Section 3 to include a more granular description of each training stage, the precise form of the alignment loss, and the weighting schedule used to combine it with the standard language-modeling objective. The new ablation studies mentioned above provide quantitative isolation of each stage's contribution, directly addressing the request for evidence of causal impact. revision: yes
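The rebuttal describes, but does not reproduce, the combined objective. One generic form such a combination could take, with a stage-dependent weight, is sketched below; this is purely illustrative, and in practice the box coordinates may simply be tokenized and absorbed into the ordinary language-modeling loss.

```latex
% Illustrative combined objective; not the paper's actual loss.
% L_LM is the standard next-token loss, L_align a grounding/alignment term,
% and lambda_t a stage-dependent weight over the three training stages.
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{LM}}(\theta) + \lambda_t \, \mathcal{L}_{\mathrm{align}}(\theta),
\qquad t \in \{1, 2, 3\}.
```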
Circularity Check
No circularity: empirical model description and benchmark reporting
Full rationale
The paper describes a sequence of engineering choices (visual receptor, I/O interface, 3-stage training, multilingual corpus, image-caption-box alignment) starting from Qwen-LM, then reports aggregate results on standard visual-centric benchmarks under zero-shot and few-shot settings. No mathematical derivations, first-principles predictions, or equations appear in the provided text. Performance claims are direct empirical comparisons against other models of similar scale; they do not reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations of uniqueness theorems. Training decisions are presented as design choices, not as outputs derived from the target results. This is a standard LVLM release paper whose central claims rest on external benchmark numbers rather than internal circular reductions.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale and architecture dimensions
- 3-stage training hyperparameters
axioms (2)
- domain assumption: The visual receptor and input-output interface successfully integrate image features into the language model without catastrophic interference (one common receptor design is sketched after this list).
- domain assumption: Alignment of image-caption-box tuples produces reliable grounding and text-reading abilities.
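The first assumption concerns how image features enter the language model. A minimal sketch of one common visual-receptor design follows, in which a fixed set of learned query vectors cross-attends to image-encoder patch features and is projected into the LLM embedding space; module names, dimensions, and the query count are illustrative, not the paper's actual architecture.

```python
# Minimal sketch of a generic "visual receptor": a cross-attention adapter that
# compresses image-encoder patch features into a fixed number of soft tokens in
# the LLM's embedding space. Dimensions and module names are illustrative; this
# is one common design, not necessarily the paper's exact architecture.
import torch
import torch.nn as nn

class VisualReceptor(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=256, num_heads=8):
        super().__init__()
        # Learned queries that become the image's soft tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # input-output interface into the LLM

    def forward(self, patch_features):
        """patch_features: (batch, num_patches, vision_dim) from an image encoder."""
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, patch_features, patch_features)
        return self.proj(attended)  # (batch, num_queries, llm_dim) soft tokens

# Example: 1024 ViT patch features compressed to 256 LLM-space tokens.
receptor = VisualReceptor()
soft_tokens = receptor(torch.randn(2, 1024, 1024))
print(soft_tokens.shape)  # torch.Size([2, 256, 4096])
```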
Forward citations
Cited by 60 Pith papers
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
-
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
-
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
-
CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
-
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real ex...
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
-
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...
-
TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
Membership Inference Attacks Against Video Large Language Models
A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
-
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
CNSL-bench shows current MLLMs perform substantially worse than humans on Chinese sign language tasks with systematic gaps across modalities and articulatory forms.
-
Probing Visual Planning in Image Editing Models
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
-
Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...
-
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
-
Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
Paza is a zero-shot, model-agnostic pipeline that uses behavioral pre-filters on cheap object and pose models to trigger expensive VLMs only when needed, delivering 89.5% precision and 92.8% specificity on a synthesiz...
-
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror
MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
-
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
-
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models
QAPruner introduces a hybrid sensitivity metric that combines group-wise quantization error simulation and outlier intensity with semantic scores to prune visual tokens, yielding 2.24% higher accuracy than naive basel...
-
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
-
SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
-
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
-
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction
ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...
Reference graph
Works this paper leans on
- CC12M (Changpinyo et al., 2021)
- SBU Captions (Ordonez et al., 2011)
- VQAv2 (Goyal et al., 2017)
- Puppeteer (Google, 2023)
- PyMuPDF