Title resolution pending

Qwen2 · 2025

18 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

cs.CV · 2026-04-22 · unverdicted · novelty 8.0

CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0 · 2 refs

Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

cs.MM · 2026-04-24 · unverdicted · novelty 7.0

BRITE benchmark reveals that leading T2V models handle static object composition well but degrade sharply on object-action binding and audio-visual synchronization for implausible prompts.

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.

Beyond Thinking: Imagining in 360° for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

F^3A is a training-free visual token pruning router that treats pruning as task-conditioned evidence search and allocates a fixed vision token budget using question cues and frozen sparse heads without extra LLM passes.

Probabilistic Programs of Thought

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

cs.CV · 2026-04-07 · unverdicted · novelty 5.0

VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

cs.CV · 2026-05-22

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

cs.AI · 2026-05-13

Unified Pix Token And Word Token Generative Language Model

cs.CV · 2026-05-13

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

cs.CV · 2026-05-12 · 3 refs

citing papers explorer

Showing 18 of 18 citing papers.

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 51
Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.
CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs cs.CV · 2026-04-22 · unverdicted · none · ref 33
CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation cs.CV · 2026-05-16 · unverdicted · none · ref 54 · 2 links
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios cs.MM · 2026-04-24 · unverdicted · none · ref 18
BRITE benchmark reveals that leading T2V models handle static object composition well but degrade sharply on object-action binding and audio-visual synchronization for implausible prompts.
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs cs.CV · 2026-05-22 · unverdicted · none · ref 36
PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.
DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs cs.CR · 2026-05-18 · unverdicted · none · ref 16
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 43
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Prefix-Adaptive Block Diffusion for Efficient Document Recognition cs.CV · 2026-05-16 · unverdicted · none · ref 16
PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.
Beyond Thinking: Imagining in 360° for Humanoid Visual Search cs.CV · 2026-05-09 · unverdicted · none · ref 23
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.
How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A cs.CV · 2026-05-09 · unverdicted · none · ref 3
F^3A is a training-free visual token pruning router that treats pruning as task-conditioned evidence search and allocates a fixed vision token budget using question cues and frozen sparse heads without extra LLM passes.
Probabilistic Programs of Thought cs.CL · 2026-04-19 · unverdicted · none · ref 36
Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.
PRISM: Perception Reasoning Interleaved for Sequential Decision Making cs.AI · 2026-05-06 · unverdicted · none · ref 23
PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.
Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models cs.CV · 2026-04-07 · unverdicted · none · ref 48
VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 166
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing cs.CV · 2026-05-22 · unreviewed · ref 105
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning cs.AI · 2026-05-13 · unreviewed · ref 79
Unified Pix Token And Word Token Generative Language Model cs.CV · 2026-05-13 · unreviewed · ref 7
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models cs.CV · 2026-05-12 · unreviewed · ref 21 · 3 links
