CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

Changshuo Wang; Wanlong Fang; Xiang Fang

arxiv: 2605.29602 · v1 · pith:ZP4CAWQGnew · submitted 2026-05-28 · 💻 cs.CV

CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

Xiang Fang , Wanlong Fang , Changshuo Wang This is my paper

Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-modal retrieval-augmented generationcognitive reflectionRiemannian manifoldspectral graph theoryoptimal transportknowledge graphsmulti-modal LLMsinformation geometry

0 comments

The pith

CogniVerse enhances multi-modal RAG by using cognitive reflection to filter retrieval, Riemannian manifolds for alignment, and optimal transport for coherent generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CogniVerse as a framework to fix problems in multi-modal retrieval-augmented generation like noisy retrieval and semantic misalignment. It does this with three parts: a module that thinks about whether to retrieve and filters content, a retrieval system that maps things onto curved spaces for better matching and uses graph theory to clean up connections, and a generation part that uses transport math to keep both small details and big picture consistent. If this works, systems answering questions with images and text could be more accurate and faster. A sympathetic reader would care because better multi-modal AI could improve tools for education, search, and decision making.

Core claim

CogniVerse addresses limitations in existing MMRAG frameworks by integrating a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, and a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence, leading to significant outperformance in accuracy and coherence with reduced latency.

What carries the argument

The three synergistic components: Cognitive Reflection Module for dynamic filtering, Multi-modal Retrieval Module on Riemannian manifold with spectral graph refinement for precise alignment, and Hierarchical Generation Module with optimal transport loss for balancing local and global coherence.

If this is right

Reduces noise and irrelevant retrieval in multi-modal queries
Improves cross-modal semantic alignment through geometric methods
Enables adaptive reasoning by assessing retrieval needs
Achieves more coherent generation across local and global contexts
Lowers retrieval latency while boosting accuracy

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such a framework might extend to single-modal or other AI tasks requiring reflection and geometry.
Future work could test if the Riemannian approach generalizes to other embedding spaces.
The optimal transport loss might apply to other generation models beyond MMRAG.
Integration with existing LLMs could be explored for practical deployment.

Load-bearing premise

The three modules can be combined and implemented in practice to deliver the claimed improvements in accuracy, coherence, and latency.

What would settle it

An experiment where CogniVerse is implemented and tested on standard MMRAG benchmarks showing no significant gains over baselines or increased latency.

Figures

Figures reproduced from arXiv: 2605.29602 by Changshuo Wang, Wanlong Fang, Xiang Fang.

**Figure 1.** Figure 1: Overview of our proposed CogniVerse. The framework begins with a Cognitive Reflection Module to assess retrieval necessity, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Performance Robustness to 20% Query Noise (on Mul [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Spectral graph refinement in CogniVerse. Left: Original [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce \textbf{CogniVerse}, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CogniVerse proposes combining cognitive reflection, Riemannian manifold retrieval, and optimal transport for multi-modal RAG but supplies no experiments, equations, or results to support any performance gains.

read the letter

The main takeaway is that this paper describes a proposed MMRAG framework called CogniVerse built from three modules: a cognitive reflection step to assess retrieval needs and filter noise, a retrieval module that aligns embeddings on a Riemannian manifold and refines graphs with spectral methods, and a generation module using an optimal transport loss for local-global balance. It claims these yield higher accuracy, better coherence, and lower latency than prior systems.

The combination of those existing techniques applied to this specific setting is the only element that counts as new. The paper does a clear job naming real, recurring problems in the area such as irrelevant retrievals, cross-modal misalignment, and incoherent outputs.

The soft spots are substantial and central. The abstract states that extensive experiments show significant outperformance, yet no datasets, baselines, metrics, ablations, or even error bars appear. No equations are given for the manifold alignment, the spectral refinement, or the OT loss, despite the repeated mention of mathematical rigor. Without any of that material, the superiority claims reduce to assertions, and there is no way to check whether the modules can be implemented without recreating the listed limitations.

This kind of high-level system sketch might interest practitioners who build applied multi-modal pipelines and want architectural ideas to adapt. It offers nothing concrete for verification or extension.

I would not bring the paper to a reading group and would not cite it. It does not deserve peer review because there is no verifiable content for referees to assess.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces CogniVerse, a multi-modal RAG framework for knowledge-intensive QA that combines three components: (1) a Cognitive Reflection Module to assess retrieval necessity and filter content, (2) a Multi-modal Retrieval Module that aligns embeddings on a Riemannian manifold via information geometry and refines knowledge graphs with spectral graph theory, and (3) a Hierarchical Generation Module using an optimal transport loss to balance token-level accuracy and global coherence. The abstract asserts that extensive experiments show significant gains over SOTA systems in accuracy, coherence, and reduced retrieval latency.

Significance. If the performance claims were supported by rigorous experiments, the combination of cognitive reflection with geometric retrieval and optimal transport generation could offer a meaningful advance in addressing noise, misalignment, and incoherence in MMRAG. However, the complete absence of any empirical validation, equations, or implementation details means the work currently contributes no verifiable advance.

major comments (3)

[Abstract] Abstract: The central claim that 'extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency' is unsupported by any datasets, baselines, metrics, results tables, ablation studies, or error bars. This absence is load-bearing for the primary contribution.
[Abstract] Abstract: The Multi-modal Retrieval Module is described as performing alignment 'in a Riemannian manifold using information geometry' and refinement 'via spectral graph theory,' yet no manifold metric, embedding alignment objective, spectral operator, or pseudocode is supplied. Without these, it is impossible to verify whether the approach resolves cross-modal semantic misalignment or merely restates standard techniques.
[Abstract] Abstract: The Hierarchical Generation Module is said to employ 'an optimal transport-based loss to balance token-level accuracy and global semantic coherence,' but the loss function, transport plan formulation, and how it interacts with the LLM decoder are not defined. This prevents assessment of whether the claimed coherence gains are achievable.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and substantive review. The comments accurately identify that the submitted manuscript consists of a high-level conceptual proposal without empirical results, equations, or implementation details. We will revise the abstract and text to remove unsupported claims and clarify the scope of the work.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency' is unsupported by any datasets, baselines, metrics, results tables, ablation studies, or error bars. This absence is load-bearing for the primary contribution.

Authors: We agree that the abstract's claim of extensive experiments is unsupported, as the manuscript contains no experimental section, datasets, baselines, or results. This constitutes an overstatement. We will revise the abstract to describe CogniVerse as a proposed framework without asserting empirical superiority. revision: yes
Referee: [Abstract] Abstract: The Multi-modal Retrieval Module is described as performing alignment 'in a Riemannian manifold using information geometry' and refinement 'via spectral graph theory,' yet no manifold metric, embedding alignment objective, spectral operator, or pseudocode is supplied. Without these, it is impossible to verify whether the approach resolves cross-modal semantic misalignment or merely restates standard techniques.

Authors: The manuscript provides only a descriptive overview of the module and does not include any manifold metric, alignment objective, spectral operator, or pseudocode. We acknowledge that this prevents verification or assessment of novelty. We will revise the text to indicate these elements are conceptual and not formally specified. revision: partial
Referee: [Abstract] Abstract: The Hierarchical Generation Module is said to employ 'an optimal transport-based loss to balance token-level accuracy and global semantic coherence,' but the loss function, transport plan formulation, and how it interacts with the LLM decoder are not defined. This prevents assessment of whether the claimed coherence gains are achievable.

Authors: The manuscript does not define the optimal transport loss, transport plan, or its interaction with the decoder. We agree this prevents evaluation of the claimed benefits. We will revise the description to note that the loss is proposed at a conceptual level without mathematical formulation. revision: partial

standing simulated objections not resolved

The complete absence of empirical validation, equations, implementation details, or results, which cannot be supplied without conducting new experiments and derivations absent from the original manuscript.

Circularity Check

0 steps flagged

No circularity: no derivation chain or self-referential reductions present

full rationale

The provided abstract and placeholder full text contain only high-level module descriptions and an assertion of experimental superiority, with no equations, parameter fittings, self-citations, uniqueness theorems, or ansatzes that could reduce a claimed result to its inputs by construction. No load-bearing steps exist to inspect for any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the framework implicitly assumes that the named geometric and transport techniques will address the listed MMRAG limitations when combined.

pith-pipeline@v0.9.1-grok · 5735 in / 1189 out tokens · 25612 ms · 2026-06-29T08:11:56.916910+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

133 extracted references · 17 canonical work pages · 6 internal anchors

[1]

Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving

Fuyao Cai, Daizong Liu, Xiang Fang, Jixiang Yu, Keke Tang, and Pan Zhou. Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025. 1

2025
[2]

Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,

Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jian- feng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,
[3]

Webqa: Multihop and multimodal qa

Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 16495–16504, 2022. 6

2022
[4]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023. 1, 2, 3, 6

2023
[5]

Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,

Aditya Desai, Zhaozhuo Xu, Menal Gupta, Anu Chan- dran, Antoine Vial-Aussavy, and Anshumali Shrivastava. Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,
[6]

Improving adversarially robust few-shot image classi- fication with generalizable representations

Junhao Dong, Yuan Wang, Jian-Huang Lai, and Xiaohua Xie. Improving adversarially robust few-shot image classi- fication with generalizable representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 9025–9034, 2022. 1

2022
[7]

The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training

Junhao Dong, Seyed-Mohsen Moosavi-Dezfooli, Jian- huang Lai, and Xiaohua Xie. The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 24678–24687, 2023

2023
[8]

Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023

Junhao Dong, Yuan Wang, Jianhuang Lai, and Xiaohua Xie. Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023

2023
[9]

Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024

Junhao Dong, Junxi Chen, Xiaohua Xie, Jianhuang Lai, and Hao Chen. Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024

2024
[10]

Adversarially robust distillation by reducing the student-teacher variance gap

Junhao Dong, Piotr Koniusz, Junxi Chen, and Yew-Soon Ong. Adversarially robust distillation by reducing the student-teacher variance gap. InEuropean Conference on Computer Vision, pages 92–111. Springer, 2024

2024
[11]

Robust distillation via untargeted and tar- geted intermediate adversarial samples

Junhao Dong, Piotr Koniusz, Junxi Chen, Z Jane Wang, and Yew-Soon Ong. Robust distillation via untargeted and tar- geted intermediate adversarial samples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28432–28442, 2024

2024
[12]

Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners

Junhao Dong, Piotr Koniusz, Junxi Chen, Xiaohua Xie, and Yew-Soon Ong. Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28535– 28544, 2024

2024
[13]

Robustifying zero-shot vision language models by sub- spaces alignment

Junhao Dong, Piotr Koniusz, Liaoyuan Feng, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, and Yew-Soon Ong. Robustifying zero-shot vision language models by sub- spaces alignment. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 21037– 21047, 2025

2025
[14]

Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms

Junhao Dong, Piotr Koniusz, Xinghua Qu, and Yew-Soon Ong. Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms. InPro- ceedings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 1, pages 236–247, 2025

2025
[15]

Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices

Junhao Dong, Piotr Koniusz, Yifei Zhang, Hao Zhu, Weim- ing Liu, Xinghua Qu, and Yew-Soon Ong. Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices. In Forty-second International Conference on Machine Learn- ing, 2025

2025
[16]

Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness

Junhao Dong, Jiao Liu, Xinghua Qu, and Yew-Soon Ong. Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 624–634, 2025

2025
[17]

Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models

Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[18]

Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Junhao Dong, Raoof Zare Moayedi, Yew-Soon Ong, and Seyed-Mohsen Moosavi-Dezfooli. Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026
[19]

Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models

Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 1

2026
[20]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InInternational Conference on Learning Representations, 2021. 2

2021
[21]

An empirical study of training end-to- end vision-and-language transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuo- hang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, and Michael Zeng. An empirical study of training end-to- end vision-and-language transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022. 2

2022
[22]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, T Krzyzanowski, F Basisty, et al. A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

To align or not to align: Strategic multimodal representation align- ment for optimal performance

Wanlong Fang, Tianle Zhang, and Alvin Chan. To align or not to align: Strategic multimodal representation align- ment for optimal performance. InProceedings of the AAAI Conference on Artificial Intelligence, pages 21056–21064,
[24]

Towards understanding modality interaction in multimodal language models via partial information decomposition

Wanlong Fang, Tianle Zhang, Wen Tao, and Alvin Chan. Towards understanding modality interaction in multimodal language models via partial information decomposition. In International Conference on Machine Learning, 2026

2026
[25]

Advancing out-of-distribution detection across diverse scenarios

Xiang Fang. Advancing out-of-distribution detection across diverse scenarios. InProceedings of the AAAI Conference on Artificial Intelligence, pages 41042–41043, 2026

2026
[26]

Disentangling adversarial prompts: A semantic-graph defense for robust llm security

Xiang Fang and Wanlong Fang. Disentangling adversarial prompts: A semantic-graph defense for robust llm security. InProceedings of the AAAI Conference on Artificial Intel- ligence, 2026

2026
[27]

Slap: The semantic least action principle for variational video-language modeling

Xiang Fang and Wanlong Fang. Slap: The semantic least action principle for variational video-language modeling. In International Conference on Machine Learning, 2026

2026
[28]

Double Self-weighted Multi-view Clustering via Adaptive View Fusion

Xiang Fang and Yuchong Hu. Double self-weighted multi- view clustering via adaptive view fusion.arXiv preprint arXiv:2011.10396, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[29]

V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020. 1

2020
[30]

An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Wu. An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021

2021
[31]

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. Unbalanced incomplete multi-view clustering via the scheme of view evolution: Weak views are meat; strong views do eat.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(4):913–927, 2021. 1

2021
[32]

Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022

Xiang Fang, Daizong Liu, Pan Zhou, and Yuchong Hu. Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022

2022
[33]

Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, and Kai Zou. Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 8721–8733, 2023. 1

2023
[34]

You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos

Xiang Fang, Daizong Liu, Pan Zhou, and Guoshun Nan. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2448– 2460, 2023

2023
[35]

Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023

Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, and Ruixuan Li. Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023

2023
[36]

Not all inputs are valid: Towards open- set video moment retrieval using language

Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jian- feng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, et al. Not all inputs are valid: Towards open- set video moment retrieval using language. InProceedings of the 32nd ACM International Conference on Multimedia, pages 28–37, 2024. 1

2024
[37]

Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, and Renfu Li. Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 1735–1743, 2024

2024
[38]

Rethinking weakly-supervised video tempo- ral grounding from a game perspective

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, and Daizong Liu. Rethinking weakly-supervised video tempo- ral grounding from a game perspective. InEuropean Con- ference on Computer Vision. Springer, 2024. 1

2024
[39]

Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection

Xiang Fang, Arvind Easwaran, and Blaise Genest. Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection. InInternational Conference on Ma- chine Learning, 2025

2025
[40]

Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025

Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025. 1

2025
[41]

Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025

Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025

2025
[42]

Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval

Xiang Fang, Wanlong Fang, Wei Ji, and Tat-Seng Chua. Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval. InACM International Conference on Multimedia, 2025. 1

2025
[43]

Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation

Xiang Fang, Wanlong Fang, and Changshuo Wang. Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation. InAdvances in Neural Information Processing Sys- tems, 2025

2025
[44]

Multi-pair temporal sentence grounding via multi-thread knowledge transfer network

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, 2025. 1

2025
[45]

Multi-pair temporal sentence grounding via multi-thread knowledge transfer network

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2915–2923,
[46]

Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness

Xiang Fang, Wanlong Fang, and Wei Ji. Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness. InInter- national Conference on Machine Learning, 2026

2026
[47]

Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization

Xiang Fang, Wanlong Fang, and Changshuo Wang. Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1

2026
[48]

Rethinking video-language model from the language input perspective

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, and Daizong Liu. Rethinking video-language model from the language input perspective. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026
[49]

Towards unified vision-language models with incomplete multi-modal in- puts

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, and Wei Ji. Towards unified vision-language models with incomplete multi-modal in- puts. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1

2026
[50]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learn- ing, pages 3929–3938. PMLR, 2020. 2

2020
[51]

Retrieval-Augmented Generation with Graphs (GraphRAG)

Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. Retrieval-augmented generation with graphs (graphrag). arXiv preprint arXiv:2501.00309, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Fine-grained cross-modal alignment network for text-video retrieval

Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. Fine-grained cross-modal alignment network for text-video retrieval. InProceedings of the 29th ACM International Conference on Multimedia, pages 3826–3834, 2021. 1

2021
[53]

A closer look at backdoor attacks on clip

Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, and Lei Feng. A closer look at backdoor attacks on clip. InICML, 2025. 1

2025
[54]

Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025. 1

work page arXiv 2025
[55]

On the comparison be- tween multi-modal and single-modal contrastive learning

Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, and Taiji Suzuki. On the comparison be- tween multi-modal and single-modal contrastive learning. Advances in Neural Information Processing Systems, 37: 81549–81605, 2024. 1

2024
[56]

What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021

Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021. 1

2021
[57]

Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InIn- ternational Conference on Machine Learning, pages 4904– 4916, 2021. 2

2021
[58]

Adv-watermark: A novel watermark perturbation for adversarial examples

Xiaojun Jia, Xingxing Wei, Xiaochun Cao, and Xiaoguang Han. Adv-watermark: A novel watermark perturbation for adversarial examples. InProceedings of the 28th ACM in- ternational conference on multimedia, pages 1579–1587,
[59]

Improved techniques for optimization-based jailbreaking on large lan- guage models.arXiv preprint arXiv:2405.21018, 2024

Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, and Min Lin. Improved techniques for optimization-based jailbreaking on large lan- guage models.arXiv preprint arXiv:2405.21018, 2024

work page arXiv 2024
[60]

Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Xiaojun Jia, Sensen Gao, Qing Guo, Simeng Qin, Ke Ma, Yihao Huang, Yang Liu, Ivor Tsang, and Xiaochun Cao. Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[61]

Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y .; and Smith, N

Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, and Xiaochun Cao. Evolution-based region adversarial prompt learning for ro- bustness enhancement in vision-language models.arXiv preprint arXiv:2503.12874, 2025

work page arXiv 2025
[62]

Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025. 1

work page arXiv 2025
[63]

Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023

Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023. 1

2023
[64]

Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026

Mingjin Kuai, You Qin, Xiang Fang, Wei Ji, and Roger Zimmermann. Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026. 1

2026
[65]

Natural questions: a benchmark for question answering re- search.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering re- search.Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 1

2019
[66]

Ex- ploring disentangled appearance-motion contexts for tem- poral activity localization

Huashuo Lei, Xiaowen Cai, Daizong Liu, Xiang Fang, Xi- aoye Qu, Jianfeng Dong, Jixiang Yu, and Keyan Jin. Ex- ploring disentangled appearance-motion contexts for tem- poral activity localization. In2025 International Joint Con- ference on Neural Networks (IJCNN), pages 1–8. IEEE,
[67]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Se- bastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, pages 9459– 9474, 2020. 1, 2

2020
[68]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1, 2, 3, 6

2023
[69]

Self- supervised geometric features discovery via interpretable attention for vehicle re-identification and beyond

Ming Li, Xinming Huang, and Ziming Zhang. Self- supervised geometric features discovery via interpretable attention for vehicle re-identification and beyond. InICCV,
[70]

Exploiting multi-view part-wise correlation via an efficient transformer for vehicle re-identification.TOM, 2021

Ming Li, Jun Liu, Ce Zheng, Xinming Huang, and Zim- ing Zhang. Exploiting multi-view part-wise correlation via an efficient transformer for vehicle re-identification.TOM, 2021

2021
[71]

Dr-fer: Discrimina- tive and robust representation learning for facial expression recognition.IEEE Transactions on Multimedia, 26:6297– 6309, 2023

Ming Li, Huazhu Fu, Shengfeng He, Hehe Fan, Jun Liu, Jussi Keppo, and Mike Zheng Shou. Dr-fer: Discrimina- tive and robust representation learning for facial expression recognition.IEEE Transactions on Multimedia, 26:6297– 6309, 2023

2023
[72]

Stprivacy: Spatio-temporal privacy- preserving action recognition

Ming Li, Xiangyu Xu, Hehe Fan, Pan Zhou, Jun Liu, Jia-Wei Liu, Jiahe Li, Jussi Keppo, Mike Zheng Shou, and Shuicheng Yan. Stprivacy: Spatio-temporal privacy- preserving action recognition. InICCV, 2023

2023
[73]

Instant3d: instant text- to-3d generation.IJCV, 2024

Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, and Xiangyu Xu. Instant3d: instant text- to-3d generation.IJCV, 2024. 1

2024
[74]

Merging clinical knowledge into large language models for medical research and applications: A survey.arXiv e-prints, pages arXiv– 2502, 2025

Qiyuan Li, Haijiang Liu, Caicai Guo, Deyu Chen, Meng Wang, Feng Gao, and Jinguang Gu. Merging clinical knowledge into large language models for medical research and applications: A survey.arXiv e-prints, pages arXiv– 2502, 2025. 1

2025
[75]

Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, and Xuelong Li. Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025. 1

work page arXiv 2025
[76]

Mmcoqa: Conver- sational question answering over text, tables, and images

Yongqi Li, Wenjie Li, and Liqiang Nie. Mmcoqa: Conver- sational question answering over text, tables, and images. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4220–4231, 2022. 2, 3, 6

2022
[77]

Integrating reinforcement learning with vi- sual generative models: foundations and advances.Vici- nagearth, 3(1):2, 2026

Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, and Chi Zhang. Integrating reinforcement learning with vi- sual generative models: foundations and advances.Vici- nagearth, 3(1):2, 2026. 1

2026
[78]

Commongen: A constrained text generation challenge for generative commonsense reasoning

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. Commongen: A constrained text generation challenge for generative commonsense reasoning. InFindings of the As- sociation for Computational Linguistics: EMNLP 2020, pages 1823–1840, 2020. 1

2020
[79]

Explor- ing optical-flow-guided motion and detection-based appear- ance for temporal sentence grounding.IEEE Transactions on Multimedia, 25:8539–8553, 2023

Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. Explor- ing optical-flow-guided motion and detection-based appear- ance for temporal sentence grounding.IEEE Transactions on Multimedia, 25:8539–8553, 2023. 1

2023
[80]

Hypotheses tree building for one- shot temporal sentence localization

Daizong Liu, Xiang Fang, Pan Zhou, Xing Di, Weining Lu, and Yu Cheng. Hypotheses tree building for one- shot temporal sentence localization. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1640– 1648, 2023

2023

Showing first 80 references.

[1] [1]

Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving

Fuyao Cai, Daizong Liu, Xiang Fang, Jixiang Yu, Keke Tang, and Pan Zhou. Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025. 1

2025

[2] [2]

Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,

Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jian- feng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,

[3] [3]

Webqa: Multihop and multimodal qa

Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 16495–16504, 2022. 6

2022

[4] [4]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023. 1, 2, 3, 6

2023

[5] [5]

Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,

Aditya Desai, Zhaozhuo Xu, Menal Gupta, Anu Chan- dran, Antoine Vial-Aussavy, and Anshumali Shrivastava. Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,

[6] [6]

Improving adversarially robust few-shot image classi- fication with generalizable representations

Junhao Dong, Yuan Wang, Jian-Huang Lai, and Xiaohua Xie. Improving adversarially robust few-shot image classi- fication with generalizable representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 9025–9034, 2022. 1

2022

[7] [7]

The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training

Junhao Dong, Seyed-Mohsen Moosavi-Dezfooli, Jian- huang Lai, and Xiaohua Xie. The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 24678–24687, 2023

2023

[8] [8]

Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023

Junhao Dong, Yuan Wang, Jianhuang Lai, and Xiaohua Xie. Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023

2023

[9] [9]

Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024

Junhao Dong, Junxi Chen, Xiaohua Xie, Jianhuang Lai, and Hao Chen. Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024

2024

[10] [10]

Adversarially robust distillation by reducing the student-teacher variance gap

Junhao Dong, Piotr Koniusz, Junxi Chen, and Yew-Soon Ong. Adversarially robust distillation by reducing the student-teacher variance gap. InEuropean Conference on Computer Vision, pages 92–111. Springer, 2024

2024

[11] [11]

Robust distillation via untargeted and tar- geted intermediate adversarial samples

Junhao Dong, Piotr Koniusz, Junxi Chen, Z Jane Wang, and Yew-Soon Ong. Robust distillation via untargeted and tar- geted intermediate adversarial samples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28432–28442, 2024

2024

[12] [12]

Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners

Junhao Dong, Piotr Koniusz, Junxi Chen, Xiaohua Xie, and Yew-Soon Ong. Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28535– 28544, 2024

2024

[13] [13]

Robustifying zero-shot vision language models by sub- spaces alignment

Junhao Dong, Piotr Koniusz, Liaoyuan Feng, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, and Yew-Soon Ong. Robustifying zero-shot vision language models by sub- spaces alignment. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 21037– 21047, 2025

2025

[14] [14]

Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms

Junhao Dong, Piotr Koniusz, Xinghua Qu, and Yew-Soon Ong. Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms. InPro- ceedings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 1, pages 236–247, 2025

2025

[15] [15]

Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices

Junhao Dong, Piotr Koniusz, Yifei Zhang, Hao Zhu, Weim- ing Liu, Xinghua Qu, and Yew-Soon Ong. Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices. In Forty-second International Conference on Machine Learn- ing, 2025

2025

[16] [16]

Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness

Junhao Dong, Jiao Liu, Xinghua Qu, and Yew-Soon Ong. Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 624–634, 2025

2025

[17] [17]

Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models

Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[18] [18]

Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Junhao Dong, Raoof Zare Moayedi, Yew-Soon Ong, and Seyed-Mohsen Moosavi-Dezfooli. Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026

[19] [19]

Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models

Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 1

2026

[20] [20]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InInternational Conference on Learning Representations, 2021. 2

2021

[21] [21]

An empirical study of training end-to- end vision-and-language transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuo- hang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, and Michael Zeng. An empirical study of training end-to- end vision-and-language transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022. 2

2022

[22] [22]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, T Krzyzanowski, F Basisty, et al. A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

To align or not to align: Strategic multimodal representation align- ment for optimal performance

Wanlong Fang, Tianle Zhang, and Alvin Chan. To align or not to align: Strategic multimodal representation align- ment for optimal performance. InProceedings of the AAAI Conference on Artificial Intelligence, pages 21056–21064,

[24] [24]

Towards understanding modality interaction in multimodal language models via partial information decomposition

Wanlong Fang, Tianle Zhang, Wen Tao, and Alvin Chan. Towards understanding modality interaction in multimodal language models via partial information decomposition. In International Conference on Machine Learning, 2026

2026

[25] [25]

Advancing out-of-distribution detection across diverse scenarios

Xiang Fang. Advancing out-of-distribution detection across diverse scenarios. InProceedings of the AAAI Conference on Artificial Intelligence, pages 41042–41043, 2026

2026

[26] [26]

Disentangling adversarial prompts: A semantic-graph defense for robust llm security

Xiang Fang and Wanlong Fang. Disentangling adversarial prompts: A semantic-graph defense for robust llm security. InProceedings of the AAAI Conference on Artificial Intel- ligence, 2026

2026

[27] [27]

Slap: The semantic least action principle for variational video-language modeling

Xiang Fang and Wanlong Fang. Slap: The semantic least action principle for variational video-language modeling. In International Conference on Machine Learning, 2026

2026

[28] [28]

Double Self-weighted Multi-view Clustering via Adaptive View Fusion

Xiang Fang and Yuchong Hu. Double self-weighted multi- view clustering via adaptive view fusion.arXiv preprint arXiv:2011.10396, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011

[29] [29]

V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020. 1

2020

[30] [30]

An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Wu. An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021

2021

[31] [31]

Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. Unbalanced incomplete multi-view clustering via the scheme of view evolution: Weak views are meat; strong views do eat.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(4):913–927, 2021. 1

2021

[32] [32]

Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022

Xiang Fang, Daizong Liu, Pan Zhou, and Yuchong Hu. Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022

2022

[33] [33]

Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, and Kai Zou. Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 8721–8733, 2023. 1

2023

[34] [34]

You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos

Xiang Fang, Daizong Liu, Pan Zhou, and Guoshun Nan. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2448– 2460, 2023

2023

[35] [35]

Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023

Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, and Ruixuan Li. Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023

2023

[36] [36]

Not all inputs are valid: Towards open- set video moment retrieval using language

Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jian- feng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, et al. Not all inputs are valid: Towards open- set video moment retrieval using language. InProceedings of the 32nd ACM International Conference on Multimedia, pages 28–37, 2024. 1

2024

[37] [37]

Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, and Renfu Li. Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 1735–1743, 2024

2024

[38] [38]

Rethinking weakly-supervised video tempo- ral grounding from a game perspective

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, and Daizong Liu. Rethinking weakly-supervised video tempo- ral grounding from a game perspective. InEuropean Con- ference on Computer Vision. Springer, 2024. 1

2024

[39] [39]

Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection

Xiang Fang, Arvind Easwaran, and Blaise Genest. Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection. InInternational Conference on Ma- chine Learning, 2025

2025

[40] [40]

Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025

Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025. 1

2025

[41] [41]

Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025

Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025

2025

[42] [42]

Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval

Xiang Fang, Wanlong Fang, Wei Ji, and Tat-Seng Chua. Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval. InACM International Conference on Multimedia, 2025. 1

2025

[43] [43]

Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation

Xiang Fang, Wanlong Fang, and Changshuo Wang. Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation. InAdvances in Neural Information Processing Sys- tems, 2025

2025

[44] [44]

Multi-pair temporal sentence grounding via multi-thread knowledge transfer network

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, 2025. 1

2025

[45] [45]

Multi-pair temporal sentence grounding via multi-thread knowledge transfer network

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2915–2923,

[46] [46]

Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness

Xiang Fang, Wanlong Fang, and Wei Ji. Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness. InInter- national Conference on Machine Learning, 2026

2026

[47] [47]

Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization

Xiang Fang, Wanlong Fang, and Changshuo Wang. Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1

2026

[48] [48]

Rethinking video-language model from the language input perspective

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, and Daizong Liu. Rethinking video-language model from the language input perspective. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026

[49] [49]

Towards unified vision-language models with incomplete multi-modal in- puts

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, and Wei Ji. Towards unified vision-language models with incomplete multi-modal in- puts. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1

2026

[50] [50]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learn- ing, pages 3929–3938. PMLR, 2020. 2

2020

[51] [51]

Retrieval-Augmented Generation with Graphs (GraphRAG)

Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. Retrieval-augmented generation with graphs (graphrag). arXiv preprint arXiv:2501.00309, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Fine-grained cross-modal alignment network for text-video retrieval

Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. Fine-grained cross-modal alignment network for text-video retrieval. InProceedings of the 29th ACM International Conference on Multimedia, pages 3826–3834, 2021. 1

2021

[53] [53]

A closer look at backdoor attacks on clip

Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, and Lei Feng. A closer look at backdoor attacks on clip. InICML, 2025. 1

2025

[54] [54]

Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025. 1

work page arXiv 2025

[55] [55]

On the comparison be- tween multi-modal and single-modal contrastive learning

Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, and Taiji Suzuki. On the comparison be- tween multi-modal and single-modal contrastive learning. Advances in Neural Information Processing Systems, 37: 81549–81605, 2024. 1

2024

[56] [56]

What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021

Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021. 1

2021

[57] [57]

Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InIn- ternational Conference on Machine Learning, pages 4904– 4916, 2021. 2

2021

[58] [58]

Adv-watermark: A novel watermark perturbation for adversarial examples

Xiaojun Jia, Xingxing Wei, Xiaochun Cao, and Xiaoguang Han. Adv-watermark: A novel watermark perturbation for adversarial examples. InProceedings of the 28th ACM in- ternational conference on multimedia, pages 1579–1587,

[59] [59]

Improved techniques for optimization-based jailbreaking on large lan- guage models.arXiv preprint arXiv:2405.21018, 2024

Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, and Min Lin. Improved techniques for optimization-based jailbreaking on large lan- guage models.arXiv preprint arXiv:2405.21018, 2024

work page arXiv 2024

[60] [60]

Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Xiaojun Jia, Sensen Gao, Qing Guo, Simeng Qin, Ke Ma, Yihao Huang, Yang Liu, Ivor Tsang, and Xiaochun Cao. Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[61] [61]

Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y .; and Smith, N

Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, and Xiaochun Cao. Evolution-based region adversarial prompt learning for ro- bustness enhancement in vision-language models.arXiv preprint arXiv:2503.12874, 2025

work page arXiv 2025

[62] [62]

Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025. 1

work page arXiv 2025

[63] [63]

Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023

Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023. 1

2023

[64] [64]

Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026

Mingjin Kuai, You Qin, Xiang Fang, Wei Ji, and Roger Zimmermann. Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026. 1

2026

[65] [65]

Natural questions: a benchmark for question answering re- search.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering re- search.Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 1

2019

[66] [66]

Ex- ploring disentangled appearance-motion contexts for tem- poral activity localization

Huashuo Lei, Xiaowen Cai, Daizong Liu, Xiang Fang, Xi- aoye Qu, Jianfeng Dong, Jixiang Yu, and Keyan Jin. Ex- ploring disentangled appearance-motion contexts for tem- poral activity localization. In2025 International Joint Con- ference on Neural Networks (IJCNN), pages 1–8. IEEE,

[67] [67]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Se- bastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, pages 9459– 9474, 2020. 1, 2

2020

[68] [68]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1, 2, 3, 6

2023

[69] [69]

Self- supervised geometric features discovery via interpretable attention for vehicle re-identification and beyond

Ming Li, Xinming Huang, and Ziming Zhang. Self- supervised geometric features discovery via interpretable attention for vehicle re-identification and beyond. InICCV,

[70] [70]

Exploiting multi-view part-wise correlation via an efficient transformer for vehicle re-identification.TOM, 2021

Ming Li, Jun Liu, Ce Zheng, Xinming Huang, and Zim- ing Zhang. Exploiting multi-view part-wise correlation via an efficient transformer for vehicle re-identification.TOM, 2021

2021

[71] [71]

Dr-fer: Discrimina- tive and robust representation learning for facial expression recognition.IEEE Transactions on Multimedia, 26:6297– 6309, 2023

Ming Li, Huazhu Fu, Shengfeng He, Hehe Fan, Jun Liu, Jussi Keppo, and Mike Zheng Shou. Dr-fer: Discrimina- tive and robust representation learning for facial expression recognition.IEEE Transactions on Multimedia, 26:6297– 6309, 2023

2023

[72] [72]

Stprivacy: Spatio-temporal privacy- preserving action recognition

Ming Li, Xiangyu Xu, Hehe Fan, Pan Zhou, Jun Liu, Jia-Wei Liu, Jiahe Li, Jussi Keppo, Mike Zheng Shou, and Shuicheng Yan. Stprivacy: Spatio-temporal privacy- preserving action recognition. InICCV, 2023

2023

[73] [73]

Instant3d: instant text- to-3d generation.IJCV, 2024

Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, and Xiangyu Xu. Instant3d: instant text- to-3d generation.IJCV, 2024. 1

2024

[74] [74]

Merging clinical knowledge into large language models for medical research and applications: A survey.arXiv e-prints, pages arXiv– 2502, 2025

Qiyuan Li, Haijiang Liu, Caicai Guo, Deyu Chen, Meng Wang, Feng Gao, and Jinguang Gu. Merging clinical knowledge into large language models for medical research and applications: A survey.arXiv e-prints, pages arXiv– 2502, 2025. 1

2025

[75] [75]

Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, and Xuelong Li. Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025. 1

work page arXiv 2025

[76] [76]

Mmcoqa: Conver- sational question answering over text, tables, and images

Yongqi Li, Wenjie Li, and Liqiang Nie. Mmcoqa: Conver- sational question answering over text, tables, and images. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4220–4231, 2022. 2, 3, 6

2022

[77] [77]

Integrating reinforcement learning with vi- sual generative models: foundations and advances.Vici- nagearth, 3(1):2, 2026

Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, and Chi Zhang. Integrating reinforcement learning with vi- sual generative models: foundations and advances.Vici- nagearth, 3(1):2, 2026. 1

2026

[78] [78]

Commongen: A constrained text generation challenge for generative commonsense reasoning

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. Commongen: A constrained text generation challenge for generative commonsense reasoning. InFindings of the As- sociation for Computational Linguistics: EMNLP 2020, pages 1823–1840, 2020. 1

2020

[79] [79]

Explor- ing optical-flow-guided motion and detection-based appear- ance for temporal sentence grounding.IEEE Transactions on Multimedia, 25:8539–8553, 2023

Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. Explor- ing optical-flow-guided motion and detection-based appear- ance for temporal sentence grounding.IEEE Transactions on Multimedia, 25:8539–8553, 2023. 1

2023

[80] [80]

Hypotheses tree building for one- shot temporal sentence localization

Daizong Liu, Xiang Fang, Pan Zhou, Xing Di, Weining Lu, and Yu Cheng. Hypotheses tree building for one- shot temporal sentence localization. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1640– 1648, 2023

2023