MODE decomposes expert selection frequency by modality, filters redundant vision tokens, adds per-modality sensitivity, and uses ILP to assign bit-widths, limiting average loss to 2.9% at W3A16 on MoE-MLLMs.
11 Pith papers cite this work.
representative citing papers
Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
citing papers explorer
- Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams
Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.
- Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
- FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
- PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.
- Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
- Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.
- Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning
EADP filters textual noise via statistical entropy then casts token selection as submodular maximization with spatial prior to preserve fine-grained cues in VLMs under strict budgets.