International Conference on Machine Learning
18 papers cite this work.
Representative citing papers
-
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
-
DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport
DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
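The late-interaction (MaxSim) scoring that ColPali-style retrievers build on can be sketched as below. This is a generic illustration with a hypothetical function name (`maxsim_score`), not the paper's global-layout embedding, and it assumes token and patch embeddings are already computed.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance: for each query token,
    take its maximum similarity over all document patch embeddings,
    then sum those maxima over the query tokens."""
    # (num_query_tokens, num_doc_patches) similarity matrix
    sims = query_tokens @ doc_patches.T
    return float(sims.max(axis=1).sum())

# Toy example: 2 query-token vectors scored against 3 patch vectors.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
score = maxsim_score(q, d)
```

The paper's contribution is an additional text-supervised layout embedding on top of these local patch representations; the sketch shows only the baseline interaction it augments.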
-
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
-
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and showing encouraging transfer to physical robots.
-
Post-hoc Selective Classification for Reliable Synthetic Image Detection
ReSIDe generalizes logit-based confidence scores to intermediate layers of synthetic image detectors and uses preference optimization to aggregate them, cutting area under the risk-coverage curve by up to 69.55% under covariate shifts.
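The evaluation metric named here, area under the risk-coverage curve (AURC), has a standard definition that can be sketched as follows; the function name `aurc` is illustrative and this is not ReSIDe's aggregation method itself.

```python
import numpy as np

def aurc(confidences: np.ndarray, errors: np.ndarray) -> float:
    """Area under the risk-coverage curve for selective classification.
    Samples are accepted in order of decreasing confidence; the risk at
    coverage k/n is the mean error over the k most confident samples,
    and AURC averages that risk over all coverage levels."""
    order = np.argsort(-confidences)
    sorted_errors = errors[order]
    # Running error rate at coverage 1/n, 2/n, ..., n/n.
    risks = np.cumsum(sorted_errors) / np.arange(1, len(errors) + 1)
    return float(risks.mean())
```

A lower AURC means the confidence score ranks correct predictions ahead of incorrect ones, which is exactly what the reported reduction under covariate shift measures.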
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A plug-and-play RL method adds batch-level distributional supervision via concordance correlation coefficient (CCC) rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
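For reference, Lin's concordance correlation coefficient underlying the CCC reward can be sketched as below; this is the standard formula, not the paper's RL pipeline, and the function name `ccc` is illustrative.

```python
import numpy as np

def ccc(preds: np.ndarray, targets: np.ndarray) -> float:
    """Lin's concordance correlation coefficient. Unlike Pearson
    correlation, it also penalizes mean and scale shifts, so a batch
    of predictions collapsed toward the target mean scores poorly:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    mx, my = preds.mean(), targets.mean()
    vx, vy = preds.var(), targets.var()
    cov = ((preds - mx) * (targets - my)).mean()
    return float(2.0 * cov / (vx + vy + (mx - my) ** 2))
```

A constant prediction has zero covariance with any varying target, so its CCC is 0; this is why a CCC-based batch reward counteracts regression-to-the-mean.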
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on CALVIN ABC-D and 31.6% higher real-world success rates.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
APEX is an assumption-free image quality metric that applies the Sliced Wasserstein Distance to CLIP and DINOv2 embeddings, reporting superior robustness to degradations and cross-dataset stability.
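The Sliced Wasserstein Distance between two sets of embeddings can be estimated by Monte Carlo projection, as sketched below; this is a generic SW-1 estimator for equal-sized point clouds with an illustrative function name, not APEX's exact formulation.

```python
import numpy as np

def sliced_wasserstein(x: np.ndarray, y: np.ndarray,
                       n_projections: int = 64, seed: int = 0) -> float:
    """Monte Carlo Sliced Wasserstein-1 distance between two point
    clouds of equal size: project onto random unit directions, sort
    each set of 1-D projections, and average the absolute differences
    between matched order statistics."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px = np.sort(x @ dirs.T, axis=0)  # (n_points, n_projections)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.abs(px - py).mean())
```

Sorting reduces each 1-D transport problem to matching order statistics, which is what makes the sliced variant cheap compared with full optimal transport in embedding space.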
-
Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition
A neurosymbolic framework grounds skeleton motion in learnable pose and dynamics concepts, then reasons over them with differentiable logic to recognize actions interpretably on the NTU and NW-UCLA benchmarks.
-
CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning
CERSA derives low-rank fine-tuning subspaces from SVD principal components that retain 90-95% spectral energy, delivering higher performance than LoRA and other PEFT baselines at substantially lower memory cost across vision, generation, and language tasks.
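The cumulative-energy criterion for choosing a rank can be sketched as below; this is the standard SVD energy rule with an illustrative function name (`energy_rank`), not CERSA's full adaptation scheme.

```python
import numpy as np

def energy_rank(weight: np.ndarray, energy: float = 0.9) -> int:
    """Smallest rank r such that the top-r singular values retain at
    least `energy` of the total spectral energy, where energy is the
    sum of squared singular values."""
    s = np.linalg.svd(weight, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index whose cumulative energy reaches the threshold.
    return int(np.searchsorted(cum, energy) + 1)
```

Retaining 90-95% of the spectral energy, as the summary describes, typically yields a rank far below the full matrix dimension, which is where the memory savings relative to full fine-tuning come from.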
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
-
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
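"Linearly decodable" is typically tested with a ridge-regression probe on frozen embeddings, which can be sketched as below; this is a generic closed-form probe with hypothetical names, not the paper's analysis pipeline.

```python
import numpy as np

def ridge_probe(features: np.ndarray, ratings: np.ndarray,
                alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression, the standard linear-probe test
    for whether a quantity (here, interestingness ratings) is linearly
    decodable from frozen model embeddings:
    w = (X^T X + alpha*I)^{-1} X^T y."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + alpha * np.eye(d),
                           features.T @ ratings)
```

Probing each layer separately with held-out correlation is what lets one say the representation becomes "progressively more structured" across depth.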