SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
10 Pith papers cite this work. Polarity classification is still being indexed.
Citation-role summary: background (2). Citation-polarity summary: pending.
Citing papers
- Aligned Multi-View Scripts for Universal Chart-to-Code Generation
Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
- Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.