MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
hub Canonical reference
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
PedestrianQA is a new benchmark that turns pedestrian behavior prediction into VLM question-answering with rationales, reporting improved intention classification, trajectory accuracy, and explanation quality after fine-tuning on multiple existing video datasets.
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
Introduces the first large-scale 3D PET/CT dataset with fine-grained RoI annotations for Vietnamese and a graph-enhanced HiRRA framework that achieves SOTA report generation by modeling RoI dependencies.
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
ReVisIT achieves near-SOTA performance on open multimodal tasks by retrieving and reasoning over labeled images as visual exemplars in a train-free scaffold, closing the open-vs-closed gap for models like Qwen3-VL-30B.
WorldBench is a visually diverse multimodal reasoning benchmark where the strongest of 15 tested MLLMs reaches only 64% accuracy.
Robust vision encoders from multimodal adversarial pretraining transfer to MLLMs and deliver large gains in adversarial captioning and VQA performance, while test-time stochastic transformations provide an effective black-box defense.
BYORn defends autoregressive vision-language models against backdoor attacks in supervised fine-tuning by dynamically replacing semantically implausible poisoned responses with model-generated alternatives, improving robustness while preserving clean performance.
MATE is a multi-modal MoE trajectory policy using a cosine router and stochastic noise to improve expert balance, reporting 4.75% higher average success rate than prior methods on LIBERO under data scarcity.
GAUC selects coresets in pre-trained VLM embedding space by jointly optimizing distributional fidelity via MMD, prompt robustness via effective mutual information difference, and output stability via predictive variance penalty.
citing papers explorer
-
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.