CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
Canonical reference
Orthogonal subspace learning for language model continual learning
Canonical reference. 86% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
roles
background 7representative citing papers
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
INTENT mitigates cross-modal correspondence noise and modality-inherent noise in composed image retrieval via FFT-based visual invariant composition and bi-objective discriminative learning.
HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good patterns and correct bad ones.
ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
KORE combines knowledge-oriented data augmentations with null-space projections on activation covariance matrices to inject new knowledge into LMMs while preserving prior knowledge.
citing papers explorer
-
Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
-
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
-
INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
INTENT mitigates cross-modal correspondence noise and modality-inherent noise in composed image retrieval via FFT-based visual invariant composition and bi-objective discriminative learning.
-
HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval
HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good patterns and correct bad ones.
-
ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
-
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
-
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls
KORE combines knowledge-oriented data augmentations with null-space projections on activation covariance matrices to inject new knowledge into LMMs while preserving prior knowledge.