super hub Baseline reference

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Ao Zhang, Chongyi Wang, Hongji Zhu, Junbo Cui, Tianyu Yu, Yuan Yao · 2024 · cs.CV · arXiv 2408.01800

Baseline reference. 62% of citing Pith papers use this work as a benchmark or comparison.

137 Pith papers citing it

Baseline 62% of classified citations

open full Pith review browse 137 citing papers more from Ao Zhang arXiv PDF

abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 16 background 10 dataset 2 method 1

citation-polarity summary

baseline 16 background 10 use dataset 2 use method 1

claims ledger

abstract The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive

authors

Ao Zhang Chongyi Wang Hongji Zhu Junbo Cui Tianyu Yu Yuan Yao

co-cited works

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

cs.NE · 2026-04-13 · unverdicted · novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Introduces SANSA paradigm for semantic-agnostic vision-language segmentation via dictionary or example-based prompts, with finetuning delivering up to 20% mIoU gains on the new task while retaining standard performance.

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

VoiceGiraffe is a new benchmark showing that long-range memory persistence remains a key bottleneck for large audio language models on hour-scale audio.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.

An Efficient Streaming Video Understanding Framework with Agentic Control

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

cs.CV · 2026-05-11 · conditional · novelty 7.0 · 2 refs

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

Liberating LLM Capabilities in Full-Duplex Speech Models

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

quant-ph · 2026-04-28 · unverdicted · novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

Towards Temporal Compositional Reasoning in Long-Form Sports Videos

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

Grounding Video Reasoning in Physical Signals

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

WildFireVQA is a new large-scale visual question answering benchmark that pairs RGB imagery with radiometric thermal measurements for aerial wildfire monitoring across six task categories.

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

citing papers explorer

Showing 50 of 137 citing papers.

FMplex: Model Virtualization for Serving Extensible Foundation Models cs.DC · 2026-06-08 · unverdicted · none · ref 83 · internal anchor
FMplex is a serving system that virtualizes FM backbones for sharing across tasks, claiming up to 80% lower latency and 6x more tasks hosted versus prior approaches.
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding cs.CV · 2026-06-05 · unverdicted · none · ref 98 · internal anchor
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 27 · internal anchor
MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.
Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting cs.CV · 2026-06-02 · unverdicted · none · ref 61 · internal anchor
Prompt-aware weighting strategies W-Switch and W-Composite improve multi-concept LoRA composition in diffusion models without training.
Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models cs.CV · 2026-06-01 · conditional · none · ref 27 · 2 links · internal anchor
Leading VLMs show high cross-view consistency paired with low metric accuracy on distance queries, indicating evidence-insensitive reasoning rather than geometric grounding.
Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models cs.CV · 2026-06-01 · unverdicted · none · ref 11 · internal anchor
VLMs recognize rotated images when shown directly but fail to predict rotated outcomes from originals on the new RotOutBench benchmark.
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs cs.AI · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
BilliardPhys-Bench is a new procedural benchmark that evaluates multimodal LLMs on ball-to-ball collisions, wall bounces, and final positions in simulated billiards, revealing performance drops with time and complexity plus a 'stasis bias' toward predicting no interaction.
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation cs.AI · 2026-05-27 · unverdicted · none · ref 36 · internal anchor
MTAVG-Bench 2.0 is a new benchmark that evaluates omni LLMs on diagnosing high-level cinematic failures in multi-talker audio-video generation using a taxonomy of acting, narrative, atmosphere, and audio-visual language.
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy cs.CV · 2026-05-23 · unverdicted · none · ref 74 · internal anchor
EgoProx benchmark shows MLLMs have some spatial knowledge but struggle to leverage it for egocentric 3D proximity reasoning VQA.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding cs.CV · 2026-05-19 · unverdicted · none · ref 25 · 3 links · internal anchor
FineBench is a large-scale human-centric VQA benchmark exposing weaknesses in open VLMs for fine-grained activity understanding, with FineAgent providing a practical enhancement method.
A More Word-like Image Tokenization for MLLMs cs.CV · 2026-05-18 · unverdicted · none · ref 46 · internal anchor
DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 29 · internal anchor
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CV · 2026-05-17 · unverdicted · none · ref 7 · 2 links · internal anchor
Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 30 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Deep Pre-Alignment for VLMs cs.CV · 2026-05-14 · unverdicted · none · ref 131 · internal anchor
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance cs.AI · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.
EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 57 · internal anchor
EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains under strong noise.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? cs.CV · 2026-05-09 · unverdicted · none · ref 47 · internal anchor
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 39 · 2 links · internal anchor
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
From Priors to Perception: Grounding Video-LLMs in Physical Reality cs.CV · 2026-05-06 · unverdicted · none · ref 40 · internal anchor
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
KARMA-MV: A Benchmark for Causal Question Answering on Music Videos cs.CV · 2026-05-05 · unverdicted · none · ref 35 · internal anchor
KARMA-MV is a new benchmark showing that causal knowledge graphs improve VLMs on causal audio-visual reasoning in music videos.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction cs.CL · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection cs.CV · 2026-04-27 · unverdicted · none · ref 56 · internal anchor
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition cs.CV · 2026-04-25 · unverdicted · none · ref 76 · internal anchor
CineMEC performs multimodal entity coreference by clustering visual entities and aligning them with text role mentions to boost captioning and grounding performance on an extended VidSitu dataset.
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 59 · internal anchor
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models cs.AI · 2026-04-18 · unverdicted · none · ref 44 · internal anchor
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models cs.CV · 2026-04-16 · unverdicted · none · ref 50 · internal anchor
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging cs.CL · 2026-04-15 · unverdicted · none · ref 81 · internal anchor
MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing cs.CV · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-language models.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 102 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 40 · internal anchor
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis cs.CV · 2026-04-07 · unverdicted · none · ref 47 · internal anchor
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward cs.CV · 2026-04-06 · unverdicted · none · ref 83 · internal anchor
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 47 · internal anchor
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 55 · internal anchor
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs cs.CV · 2026-04-04 · unverdicted · none · ref 68 · internal anchor
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 46 · internal anchor
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis cs.CL · 2026-03-10 · unverdicted · none · ref 28 · internal anchor
C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization in multimodal analysis.
Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains cs.AI · 2026-01-29 · unverdicted · none · ref 1 · internal anchor
Lang2Act boosts VLM visual perception over 4% by letting models self-generate linguistic toolchains through two-stage RL training.
Benchmarking and Enhancing VLM for Compressed Image Understanding cs.CV · 2025-12-24 · unverdicted · none · ref 16 · internal anchor
Introduces a benchmark for VLMs on compressed images and a universal adaptor to improve performance across codecs and bitrates.
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models cs.CV · 2025-12-01 · conditional · none · ref 56 · internal anchor
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing cs.CV · 2025-09-26 · unverdicted · none · ref 54 · internal anchor
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 165 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 50 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics cs.LG · 2025-06-02 · unverdicted · none · ref 47 · internal anchor
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models cs.CV · 2025-05-22 · unverdicted · none · ref 26 · internal anchor
Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.
SkyReels-V2: Infinite-length Film Generative Model cs.CV · 2025-04-17 · unverdicted · none · ref 68 · internal anchor
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 136 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
SmolVLM: Redefining small and efficient multimodal models cs.AI · 2025-04-07 · unverdicted · none · ref 36 · internal anchor
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 46 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer