Title resolution pending

Minesh Mathew, Dimosthenis Karatzas · 2021 · arXiv 2104.12756

11 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2 · baseline 1 · dataset 1

citation-polarity summary

background 3 · baseline 1

representative citing papers

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

cs.LG · 2026-06-15 · unverdicted · novelty 7.0

MODE decomposes expert selection frequency by modality, filters redundant vision tokens, adds per-modality sensitivity, and uses ILP to assign bit-widths, limiting average loss to 2.9% at W3A16 on MoE-MLLMs.

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

cs.CV · 2024-12-31 · accept · novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

cs.CV · 2025-05-22 · unverdicted · novelty 6.0

Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.

Kimi K2.5: Visual Agentic Intelligence

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

cs.CV · 2026-07-02 · unverdicted · novelty 4.0

EADP filters textual noise via statistical entropy then casts token selection as submodular maximization with spatial prior to preserve fine-grained cues in VLMs under strict budgets.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams · cs.CV · 2026-06-02 · unverdicted · none · ref 16
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment · cs.CV · 2026-04-09 · unverdicted · none · ref 51
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding · cs.CV · 2025-04-14 · unverdicted · none · ref 52
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning · cs.CV · 2024-12-31 · accept · none · ref 119
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models · cs.CV · 2026-06-17 · unverdicted · none · ref 30
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models · cs.CV · 2026-04-09 · unverdicted · none · ref 23
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models · cs.CV · 2025-05-22 · unverdicted · none · ref 17
Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning · cs.CV · 2026-07-02 · unverdicted · none · ref 46
