hub Mixed citations

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing · 2023 · cs.CV · arXiv 2312.14238

Mixed citation behavior. Most common role is background (60%).

47 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 47 citing papers arXiv PDF

abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 5 method 1

citation-polarity summary

background 9 baseline 5 use method 1

representative citing papers

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

cs.RO · 2026-05-18 · conditional · novelty 7.0 · 2 refs

CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate after fine-tuning.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

Anisotropic Modality Align

cs.MM · 2026-05-08 · unverdicted · novelty 6.0

Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware evaluation.

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

cs.DC · 2026-04-07 · unverdicted · novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

cs.CV · 2026-03-14 · conditional · novelty 6.0

AD-Copilot trains an MLLM on a new curated industrial dataset Chat-AD with a Comparison Encoder that uses cross-attention on image pairs, reaching 82.3% accuracy on MMAD and 3.35x gains on MMAD-BBox while generalizing and exceeding human experts on some tasks.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

cs.CL · 2024-11-15 · conditional · novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

cs.CV · 2024-10-22 · unverdicted · novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.

Long Context Transfer from Language to Vision

cs.CV · 2024-06-24 · unverdicted · novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

cs.CL · 2024-02-18 · unverdicted · novelty 6.0

ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.

citing papers explorer

Showing 47 of 47 citing papers.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 3 · internal anchor
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 10 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? cs.CV · 2024-08-23 · conditional · none · ref 12 · internal anchor
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 11 · internal anchor
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization cs.RO · 2026-05-18 · conditional · none · ref 20 · 2 links · internal anchor
CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate after fine-tuning.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization cs.CV · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 26 · internal anchor
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 11 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? cs.AI · 2024-07-01 · accept · none · ref 46 · internal anchor
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 9 · internal anchor
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
Anisotropic Modality Align cs.MM · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
Agentic AI for Remote Sensing: Technical Challenges and Research Directions cs.CV · 2026-04-27 · unverdicted · none · ref 18 · 2 links · internal anchor
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware evaluation.
If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems cs.CV · 2026-04-21 · unverdicted · none · ref 26 · internal anchor
LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference cs.DC · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis cs.CV · 2026-04-07 · unverdicted · none · ref 10 · internal anchor
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison cs.CV · 2026-03-14 · conditional · none · ref 77 · internal anchor
AD-Copilot trains an MLLM on a new curated industrial dataset Chat-AD with a Comparison Encoder that uses cross-attention on image pairs, reaching 82.3% accuracy on MMAD and 3.35x gains on MMAD-BBox while generalizing and exceeding human experts on some tasks.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 5 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 19 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 7 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
Long Context Transfer from Language to Vision cs.CV · 2024-06-24 · unverdicted · none · ref 14 · internal anchor
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 6 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 72 · internal anchor
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems cs.CV · 2026-05-19 · unverdicted · none · ref 35 · internal anchor
MetaRA applies metamorphic testing to VQA tasks and shows that MLLM models exhibit sensitivity to linguistic perturbations and superficial visual cues not detected by conventional accuracy benchmarks.
FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing cs.CV · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting attention latency by 3x.
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding cs.AI · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media cs.CV · 2026-04-23 · unverdicted · none · ref 58 · internal anchor
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
Are Face Embeddings Compatible Across Deep Neural Network Models? cs.CV · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
Simple affine transformations align face embeddings across different DNN models, substantially improving cross-model identification and verification performance.
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation cs.CV · 2026-04-06 · unverdicted · none · ref 7 · internal anchor
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 4 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
Qwen2.5-VL Technical Report cs.CV · 2025-02-19 · unverdicted · none · ref 3 · internal anchor
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding cs.CV · 2025-01-09 · unverdicted · none · ref 15 · internal anchor
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 15 · internal anchor
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
CogVLM2: Visual Language Models for Image and Video Understanding cs.CV · 2024-08-29 · conditional · none · ref 14 · internal anchor
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 22 · internal anchor
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 26 · internal anchor
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise cs.CV · 2026-05-06 · unverdicted · none · ref 1 · 2 links · internal anchor
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving cs.CV · 2025-06-05 · unverdicted · none · ref 16 · internal anchor
Introduces structured NuScenes-S dataset and 0.9B FastDrive VLM claiming 20% higher decision accuracy and over 10x inference speedup versus larger unstructured VLMs.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 31 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction cs.CV · 2025-01-03 · conditional · none · ref 60 · internal anchor
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
PaliGemma: A versatile 3B VLM for transfer cs.CV · 2024-07-10 · unverdicted · none · ref 30 · internal anchor
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 18 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A Survey on Hallucination in Large Vision-Language Models cs.CV · 2024-02-01 · unverdicted · none · ref 5 · internal anchor
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
SAGE Celer 2.6 Technical Card cs.CL · 2026-03-24 · unverdicted · none · ref 3 · internal anchor
SAGE Celer 2.6 is a new line of language models with inverse reasoning training, integrated vision, and strong performance on math, coding, and South Asian language benchmarks.
CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding cs.CV · 2026-05-09 · unreviewed · ref 3 · internal anchor
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models cs.CV · 2025-12-26 · unreviewed · ref 3 · internal anchor

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer