CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3
The pith
CogniVerse enhances multi-modal RAG by using cognitive reflection to filter retrieval, Riemannian manifolds for alignment, and optimal transport for coherent generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CogniVerse addresses limitations of existing MMRAG frameworks by integrating three modules: a Cognitive Reflection Module that dynamically assesses whether retrieval is needed and filters relevant multi-modal content; a Multi-modal Retrieval Module that aligns embeddings on a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory; and a Hierarchical Generation Module that uses an optimal transport-based loss to balance token-level accuracy and global semantic coherence. The paper claims that this combination significantly outperforms prior systems in accuracy and coherence while reducing retrieval latency.
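The paper does not specify how the Cognitive Reflection Module decides whether to retrieve or how it filters content. As a rough illustration only, the sketch below assumes a gate based on the entropy of the model's answer distribution and a filter based on cosine similarity. The function names, thresholds, and item identifiers are hypothetical and are not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def needs_retrieval(answer_probs, max_entropy=0.5):
    """Assumed gate: retrieve only when the model looks uncertain."""
    return entropy(answer_probs) > max_entropy

def filter_candidates(query_vec, candidates, min_sim=0.6):
    """Keep ids of (id, vector) pairs whose similarity to the query clears min_sim."""
    return [cid for cid, vec in candidates if cosine(query_vec, vec) >= min_sim]

# Toy inputs (hypothetical): one confident and one uncertain answer distribution.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
candidates = [("img_1", [1.0, 0.0]), ("txt_7", [0.0, 1.0]), ("kg_3", [0.8, 0.6])]

print(needs_retrieval(confident))
print(needs_retrieval(uncertain))
print(filter_candidates([1.0, 0.0], candidates))
```

Skipping retrieval for confident queries is one plausible route to the claimed latency reduction, but whether the paper's module works this way cannot be determined from the text.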
What carries the argument
The three synergistic components: Cognitive Reflection Module for dynamic filtering, Multi-modal Retrieval Module on Riemannian manifold with spectral graph refinement for precise alignment, and Hierarchical Generation Module with optimal transport loss for balancing local and global coherence.
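To give "alignment on a Riemannian manifold using information geometry" a concrete reading, the sketch below computes the Fisher-Rao geodesic distance between two categorical distributions, a standard information-geometric metric on the probability simplex. Treating each modality's embedding as a categorical distribution is an assumption made here for illustration; the paper does not state which manifold or metric it uses.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Uses the closed form d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Bhattacharyya coefficient, clamped to guard against rounding error.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)

# Hypothetical distributions over four shared concepts for an image and a caption.
image_dist = [0.70, 0.20, 0.05, 0.05]
text_dist = [0.60, 0.25, 0.10, 0.05]
unrelated = [0.05, 0.05, 0.20, 0.70]

print(fisher_rao_distance(image_dist, text_dist))
print(fisher_rao_distance(image_dist, unrelated))
```

An alignment objective in this spirit would push matched image-text pairs toward small geodesic distance and mismatched pairs toward large distance; that objective is likewise an assumption, not something the paper defines.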
If this is right
- Reduces noise and irrelevant retrieval in multi-modal queries
- Improves cross-modal semantic alignment through geometric methods
- Enables adaptive reasoning by assessing retrieval needs
- Achieves more coherent generation across local and global contexts
- Lowers retrieval latency while boosting accuracy
Where Pith is reading between the lines
- Such a framework might extend to single-modal or other AI tasks requiring reflection and geometry.
- Future work could test if the Riemannian approach generalizes to other embedding spaces.
- The optimal transport loss might apply to other generation models beyond MMRAG.
- Integration with existing LLMs could be explored for practical deployment.
Load-bearing premise
The three modules can be combined and implemented in practice to deliver the claimed improvements in accuracy, coherence, and latency.
What would settle it
An experiment in which CogniVerse is implemented and tested on standard MMRAG benchmarks and shows either no significant gains over baselines or increased retrieval latency.
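One way such an experiment could decide whether a gain is "significant" is a paired bootstrap over per-question correctness. The sketch below is a generic statistical procedure, not an evaluation protocol from the paper; the function name, the resample count, and the toy data are illustrative assumptions.

```python
import random

def paired_bootstrap_ci(system_correct, baseline_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the accuracy difference (system - baseline).

    Both inputs are equal-length lists of 0/1 correctness on the same questions.
    Returns (observed_difference, ci_low, ci_high).
    """
    if not system_correct or len(system_correct) != len(baseline_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(system_correct)
    diffs = [s - b for s, b in zip(system_correct, baseline_correct)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return observed, means[lo_idx], means[hi_idx]

# Toy data (hypothetical): the system fixes two baseline errors and adds one.
baseline = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]
system = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
print(paired_bootstrap_ci(system, baseline))
```

If the interval contains zero, the benchmark gives no evidence of a gain at the chosen level; the same procedure applies to per-query latency differences.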
Original abstract
Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CogniVerse, a multi-modal RAG framework for knowledge-intensive QA that combines three components: (1) a Cognitive Reflection Module to assess retrieval necessity and filter content, (2) a Multi-modal Retrieval Module that aligns embeddings on a Riemannian manifold via information geometry and refines knowledge graphs with spectral graph theory, and (3) a Hierarchical Generation Module using an optimal transport loss to balance token-level accuracy and global coherence. The abstract asserts that extensive experiments show significant gains over SOTA systems in accuracy, coherence, and reduced retrieval latency.
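For readers unfamiliar with the spectral side of the second component, the sketch below shows one elementary operation built on the graph Laplacian: a single diffusion step that smooths node relevance scores along knowledge-graph edges. This is a textbook construction chosen for illustration; the paper does not say which spectral operator it uses, and the graph and scores here are made up.

```python
def laplacian(adj):
    """Combinatorial Laplacian L = D - A of a symmetric, zero-diagonal adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def dirichlet_energy(adj, x):
    """x^T L x, i.e. the sum over undirected edges of w_ij * (x_i - x_j)^2."""
    n = len(adj)
    return sum(adj[i][j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def smooth(adj, x, step=0.2):
    """One diffusion step x <- x - step * L x (stable for step < 1 / max_degree)."""
    lap = laplacian(adj)
    n = len(x)
    return [x[i] - step * sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]

# Hypothetical 4-node path graph with one noisy relevance score at node 2.
adj = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
scores = [1.0, 0.9, 0.1, 0.8]
smoothed = smooth(adj, scores)

print(smoothed)
print(dirichlet_energy(adj, scores), dirichlet_energy(adj, smoothed))
```

The step damps high-frequency components of the score signal, which is one sense in which a graph could be "refined" spectrally; pruning edges by eigenvector structure would be another, and the manuscript does not indicate which is meant.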
Significance. If the performance claims were supported by rigorous experiments, the combination of cognitive reflection with geometric retrieval and optimal transport generation could offer a meaningful advance in addressing noise, misalignment, and incoherence in MMRAG. However, the complete absence of any empirical validation, equations, or implementation details means the work currently contributes no verifiable advance.
Major comments (3)
- [Abstract] The central claim that 'extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency' is unsupported by any datasets, baselines, metrics, results tables, ablation studies, or error bars. This absence is load-bearing for the primary contribution.
- [Abstract] The Multi-modal Retrieval Module is described as performing alignment 'in a Riemannian manifold using information geometry' and refinement 'via spectral graph theory,' yet no manifold metric, embedding alignment objective, spectral operator, or pseudocode is supplied. Without these, it is impossible to verify whether the approach resolves cross-modal semantic misalignment or merely restates standard techniques.
- [Abstract] The Hierarchical Generation Module is said to employ 'an optimal transport-based loss to balance token-level accuracy and global semantic coherence,' but the loss function, transport plan formulation, and how it interacts with the LLM decoder are not defined. This prevents assessment of whether the claimed coherence gains are achievable.
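To indicate the level of specification the last comment asks for, the sketch below shows what an optimal transport term could look like: entropy-regularized transport solved with Sinkhorn iterations, added to a token-level loss with a weight. This is a standard formulation offered as an assumption; it is not the paper's loss, and the names `sinkhorn_cost`, `combined_loss`, `eps`, and `lam` are hypothetical.

```python
import math

def sinkhorn_cost(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between histograms a and b under a cost matrix.

    Returns (transport_cost, plan), where plan[i][j] is the mass moved from i to j.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan

def combined_loss(token_nll, ot_cost, lam=0.5):
    """Assumed objective: token-level NLL plus a weighted global OT term."""
    return token_nll + lam * ot_cost

# Toy case (hypothetical): generated and reference segments that already match.
a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
ot_cost, plan = sinkhorn_cost(a, b, cost)
print(ot_cost)
print(combined_loss(token_nll=2.0, ot_cost=ot_cost))
```

A full specification would also have to state what the two histograms represent (for example, generated versus retrieved segment embeddings), how the cost matrix is computed, and how gradients reach the decoder, none of which the manuscript provides.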
Simulated Author's Rebuttal
We thank the referee for the detailed and substantive review. The comments accurately identify that the submitted manuscript consists of a high-level conceptual proposal without empirical results, equations, or implementation details. We will revise the abstract and text to remove unsupported claims and clarify the scope of the work.
- Referee: [Abstract] The central claim that 'extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency' is unsupported by any datasets, baselines, metrics, results tables, ablation studies, or error bars. This absence is load-bearing for the primary contribution.
  Authors: We agree that the abstract's claim of extensive experiments is unsupported, as the manuscript contains no experimental section, datasets, baselines, or results. This constitutes an overstatement. We will revise the abstract to describe CogniVerse as a proposed framework without asserting empirical superiority.
  Revision: yes
- Referee: [Abstract] The Multi-modal Retrieval Module is described as performing alignment 'in a Riemannian manifold using information geometry' and refinement 'via spectral graph theory,' yet no manifold metric, embedding alignment objective, spectral operator, or pseudocode is supplied. Without these, it is impossible to verify whether the approach resolves cross-modal semantic misalignment or merely restates standard techniques.
  Authors: The manuscript provides only a descriptive overview of the module and does not include any manifold metric, alignment objective, spectral operator, or pseudocode. We acknowledge that this prevents verification or assessment of novelty. We will revise the text to indicate these elements are conceptual and not formally specified.
  Revision: partial
- Referee: [Abstract] The Hierarchical Generation Module is said to employ 'an optimal transport-based loss to balance token-level accuracy and global semantic coherence,' but the loss function, transport plan formulation, and how it interacts with the LLM decoder are not defined. This prevents assessment of whether the claimed coherence gains are achievable.
  Authors: The manuscript does not define the optimal transport loss, transport plan, or its interaction with the decoder. We agree this prevents evaluation of the claimed benefits. We will revise the description to note that the loss is proposed at a conceptual level without mathematical formulation.
  Revision: partial
- The complete absence of empirical validation, equations, implementation details, or results, which cannot be supplied without conducting new experiments and derivations absent from the original manuscript.
Circularity Check
No circularity: no derivation chain or self-referential reductions present
The provided abstract and placeholder full text contain only high-level module descriptions and an assertion of experimental superiority, with no equations, parameter fittings, self-citations, uniqueness theorems, or ansatzes that could reduce a claimed result to its inputs by construction. No load-bearing steps exist to inspect for any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
-
[1]
Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving
Fuyao Cai, Daizong Liu, Xiang Fang, Jixiang Yu, Keke Tang, and Pan Zhou. Imperceptible beam-sensitive ad- versarial attacks for lidar-based object detection in au- tonomous driving. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025. 1
2025
-
[2]
Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,
Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jian- feng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. Towards building model/prompt-transferable attackers against large vision-language models.Advances in Neu- ral Information Processing Systems, 38:174022–174058,
-
[3]
Webqa: Multihop and multimodal qa
Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 16495–16504, 2022. 6
2022
-
[4]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023. 1, 2, 3, 6
2023
-
[5]
Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,
Aditya Desai, Zhaozhuo Xu, Menal Gupta, Anu Chan- dran, Antoine Vial-Aussavy, and Anshumali Shrivastava. Raw nav-merge seismic data to subsurface properties with mlp based multi-modal information unscrambler.Advances in Neural Information Processing Systems, 34:8740–8752,
-
[6]
Improving adversarially robust few-shot image classi- fication with generalizable representations
Junhao Dong, Yuan Wang, Jian-Huang Lai, and Xiaohua Xie. Improving adversarially robust few-shot image classi- fication with generalizable representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 9025–9034, 2022. 1
2022
-
[7]
The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training
Junhao Dong, Seyed-Mohsen Moosavi-Dezfooli, Jian- huang Lai, and Xiaohua Xie. The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 24678–24687, 2023
2023
-
[8]
Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023
Junhao Dong, Yuan Wang, Jianhuang Lai, and Xiaohua Xie. Restricted black-box adversarial attack against deep- fake face swapping.IEEE Transactions on Information Forensics and Security, 18:2596–2608, 2023
2023
-
[9]
Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024
Junhao Dong, Junxi Chen, Xiaohua Xie, Jianhuang Lai, and Hao Chen. Survey on adversarial attack and defense for medical image analysis: Methods and challenges.ACM Computing Surveys, 57(3):1–38, 2024
2024
-
[10]
Adversarially robust distillation by reducing the student-teacher variance gap
Junhao Dong, Piotr Koniusz, Junxi Chen, and Yew-Soon Ong. Adversarially robust distillation by reducing the student-teacher variance gap. InEuropean Conference on Computer Vision, pages 92–111. Springer, 2024
2024
-
[11]
Robust distillation via untargeted and tar- geted intermediate adversarial samples
Junhao Dong, Piotr Koniusz, Junxi Chen, Z Jane Wang, and Yew-Soon Ong. Robust distillation via untargeted and tar- geted intermediate adversarial samples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28432–28442, 2024
2024
-
[12]
Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners
Junhao Dong, Piotr Koniusz, Junxi Chen, Xiaohua Xie, and Yew-Soon Ong. Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28535– 28544, 2024
2024
-
[13]
Robustifying zero-shot vision language models by sub- spaces alignment
Junhao Dong, Piotr Koniusz, Liaoyuan Feng, Yifei Zhang, Hao Zhu, Weiming Liu, Xinghua Qu, and Yew-Soon Ong. Robustifying zero-shot vision language models by sub- spaces alignment. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 21037– 21047, 2025
2025
-
[14]
Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms
Junhao Dong, Piotr Koniusz, Xinghua Qu, and Yew-Soon Ong. Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms. InPro- ceedings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 1, pages 236–247, 2025
2025
-
[15]
Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices
Junhao Dong, Piotr Koniusz, Yifei Zhang, Hao Zhu, Weim- ing Liu, Xinghua Qu, and Yew-Soon Ong. Improving zero- shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices. In Forty-second International Conference on Machine Learn- ing, 2025
2025
-
[16]
Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness
Junhao Dong, Jiao Liu, Xinghua Qu, and Yew-Soon Ong. Confound from all sides, distill with resilience: Multi- objective adversarial paths to zero-shot robustness. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 624–634, 2025
2025
-
[17]
Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models
Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[18]
Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
Junhao Dong, Raoof Zare Moayedi, Yew-Soon Ong, and Seyed-Mohsen Moosavi-Dezfooli. Allies teach better than enemies: Inverse adversaries for robust knowledge distilla- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
2026
-
[19]
Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models
Junhao Dong, Cong Zhang, Xinghua Qu, Zejun Ma, Pi- otr Koniusz, and Yew-Soon Ong. Robust superalign- ment: Weak-to-strong robustness generalization for vision- language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 1
2026
-
[20]
An image is worth 16x16 words: Trans- formers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InInternational Conference on Learning Representations, 2021. 2
2021
-
[21]
An empirical study of training end-to- end vision-and-language transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuo- hang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, and Michael Zeng. An empirical study of training end-to- end vision-and-language transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022. 2
2022
-
[22]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge, T Krzyzanowski, F Basisty, et al. A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024. 2, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
To align or not to align: Strategic multimodal representation align- ment for optimal performance
Wanlong Fang, Tianle Zhang, and Alvin Chan. To align or not to align: Strategic multimodal representation align- ment for optimal performance. InProceedings of the AAAI Conference on Artificial Intelligence, pages 21056–21064,
-
[24]
Towards understanding modality interaction in multimodal language models via partial information decomposition
Wanlong Fang, Tianle Zhang, Wen Tao, and Alvin Chan. Towards understanding modality interaction in multimodal language models via partial information decomposition. In International Conference on Machine Learning, 2026
2026
-
[25]
Advancing out-of-distribution detection across diverse scenarios
Xiang Fang. Advancing out-of-distribution detection across diverse scenarios. InProceedings of the AAAI Conference on Artificial Intelligence, pages 41042–41043, 2026
2026
-
[26]
Disentangling adversarial prompts: A semantic-graph defense for robust llm security
Xiang Fang and Wanlong Fang. Disentangling adversarial prompts: A semantic-graph defense for robust llm security. InProceedings of the AAAI Conference on Artificial Intel- ligence, 2026
2026
-
[27]
Slap: The semantic least action principle for variational video-language modeling
Xiang Fang and Wanlong Fang. Slap: The semantic least action principle for variational video-language modeling. In International Conference on Machine Learning, 2026
2026
-
[28]
Double Self-weighted Multi-view Clustering via Adaptive View Fusion
Xiang Fang and Yuchong Hu. Double self-weighted multi- view clustering via adaptive view fusion.arXiv preprint arXiv:2011.10396, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[29]
V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020
Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. V3h: View variation and view heredity for incomplete multiview clustering.IEEE Transactions on Artificial Intel- ligence, 1(3):233–247, 2020. 1
2020
-
[30]
An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021
Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Wu. An- imc: A soft approach for autoweighted noisy and incom- plete multiview clustering.IEEE Transactions on Artificial Intelligence, 3(2):192–206, 2021
2021
-
[31]
Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. Unbalanced incomplete multi-view clustering via the scheme of view evolution: Weak views are meat; strong views do eat.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(4):913–927, 2021. 1
2021
-
[32]
Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022
Xiang Fang, Daizong Liu, Pan Zhou, and Yuchong Hu. Multi-modal cross-domain alignment network for video moment retrieval.IEEE Transactions on Multimedia, 25: 7517–7532, 2022
2022
-
[33]
Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding
Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, and Kai Zou. Annotations are not all you need: A cross-modal knowledge transfer network for unsupervised temporal sentence grounding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 8721–8733, 2023. 1
2023
-
[34]
You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos
Xiang Fang, Daizong Liu, Pan Zhou, and Guoshun Nan. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2448– 2460, 2023
2023
-
[35]
Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023
Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, and Ruixuan Li. Hierarchical local-global transformer for tem- poral sentence grounding.IEEE Transactions on Multime- dia, 2023
2023
-
[36]
Not all inputs are valid: Towards open- set video moment retrieval using language
Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jian- feng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, et al. Not all inputs are valid: Towards open- set video moment retrieval using language. InProceedings of the 32nd ACM International Conference on Multimedia, pages 28–37, 2024. 1
2024
-
[37]
Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language
Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, and Renfu Li. Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 1735–1743, 2024
2024
-
[38]
Rethinking weakly-supervised video tempo- ral grounding from a game perspective
Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, and Daizong Liu. Rethinking weakly-supervised video tempo- ral grounding from a game perspective. InEuropean Con- ference on Computer Vision. Springer, 2024. 1
2024
-
[39]
Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection
Xiang Fang, Arvind Easwaran, and Blaise Genest. Adap- tive multi-prompt contrastive network for few-shot out-of- distribution detection. InInternational Conference on Ma- chine Learning, 2025
2025
-
[40]
Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025
Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Adaptive hierarchical graph cut for multi-granularity out-of-distribution detec- tion.IEEE Transactions on Artificial Intelligence, 2025. 1
2025
-
[41]
Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025
Xiang Fang, Arvind Easwaran, Blaise Genest, and Pon- nuthurai Nagaratnam Suganthan. Your data is not per- fect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications, 2025
2025
-
[42]
Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval
Xiang Fang, Wanlong Fang, Wei Ji, and Tat-Seng Chua. Turing patterns for multimedia: Reaction-diffusion multi- modal fusion for language-guided video moment retrieval. InACM International Conference on Multimedia, 2025. 1
2025
-
[43]
Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation
Xiang Fang, Wanlong Fang, and Changshuo Wang. Hi- erarchical semantic-augmented navigation: Optimal trans- port and graph-driven reasoning for vision-language navi- gation. InAdvances in Neural Information Processing Sys- tems, 2025
2025
-
[44]
Multi-pair temporal sentence grounding via multi-thread knowledge transfer network
Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, 2025. 1
2025
-
[45]
Multi-pair temporal sentence grounding via multi-thread knowledge transfer network
Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2915–2923,
-
[46]
Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness
Xiang Fang, Wanlong Fang, and Wei Ji. Immuno-vlm: Im- munizing large vision-language models via generative se- mantic antibodies for open-world trustworthiness. InInter- national Conference on Machine Learning, 2026
2026
-
[47]
Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization
Xiang Fang, Wanlong Fang, and Changshuo Wang. Unveil- ing the fragility of vision-language models: Multi-modal adversarial synergy via texture-constrained perturbations and cross-modal optimization. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1
2026
-
[48]
Rethinking video-language model from the language input perspective
Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, and Daizong Liu. Rethinking video-language model from the language input perspective. InProceedings of the AAAI Conference on Artificial Intelligence, 2026
2026
-
[49]
Towards unified vision-language models with incomplete multi-modal in- puts
Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, and Wei Ji. Towards unified vision-language models with incomplete multi-modal in- puts. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. 1
2026
-
[50]
Retrieval augmented language model pre-training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learn- ing, pages 3929–3938. PMLR, 2020. 2
2020
-
[51]
Retrieval-Augmented Generation with Graphs (GraphRAG)
Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. Retrieval-augmented generation with graphs (graphrag). arXiv preprint arXiv:2501.00309, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Fine-grained cross-modal alignment network for text-video retrieval
Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. Fine-grained cross-modal alignment network for text-video retrieval. InProceedings of the 29th ACM International Conference on Multimedia, pages 3826–3834, 2021. 1
2021
-
[53]
A closer look at backdoor attacks on clip
Shuo He, Zhifang Zhang, Feng Liu, Roy Ka-Wei Lee, Bo An, and Lei Feng. A closer look at backdoor attacks on clip. InICML, 2025. 1
2025
-
[54]
Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025. 1
-
[55]
On the comparison be- tween multi-modal and single-modal contrastive learning
Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, and Taiji Suzuki. On the comparison be- tween multi-modal and single-modal contrastive learning. Advances in Neural Information Processing Systems, 37: 81549–81605, 2024. 1
2024
-
[56]
What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021
Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. What makes multi-modal learning better than single (provably).Advances in Neural Information Processing Systems, 34:10944–10956, 2021. 1
2021
-
[57]
Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InIn- ternational Conference on Machine Learning, pages 4904– 4916, 2021. 2
2021
-
[58]
Adv-watermark: A novel watermark perturbation for adversarial examples
Xiaojun Jia, Xingxing Wei, Xiaochun Cao, and Xiaoguang Han. Adv-watermark: A novel watermark perturbation for adversarial examples. InProceedings of the 28th ACM in- ternational conference on multimedia, pages 1579–1587,
-
[59]
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, and Min Lin. Improved techniques for optimization-based jailbreaking on large lan- guage models.arXiv preprint arXiv:2405.21018, 2024
-
[60]
Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Xiaojun Jia, Sensen Gao, Qing Guo, Simeng Qin, Ke Ma, Yihao Huang, Yang Liu, Ivor Tsang, and Xiaochun Cao. Semantic-aligned adversarial evolution triangle for high- transferability vision-language attack.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[61]
Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y .; and Smith, N
Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, and Xiaochun Cao. Evolution-based region adversarial prompt learning for ro- bustness enhancement in vision-language models.arXiv preprint arXiv:2503.12874, 2025
-
[62]
Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025. 1
-
[63]
Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023
Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning dis- tillation for small language models in knowledge-intensive tasks.Advances in Neural Information Processing Systems, 36:48573–48602, 2023. 1
2023
-
[64]
Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026
Mingjin Kuai, You Qin, Xiang Fang, Wei Ji, and Roger Zimmermann. Dynamic graph-enhanced event refinement for temporal sentence grounding of micro-moments.IEEE Transactions on Multimedia, 2026. 1
2026
-
[65]
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 1
-
[66]
Huashuo Lei, Xiaowen Cai, Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, Jixiang Yu, and Keyan Jin. Exploring disentangled appearance-motion contexts for temporal activity localization. In 2025 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE,
-
[67]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, pages 9459–9474, 2020. 1, 2
-
[68]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 1, 2, 3, 6
-
[69]
Ming Li, Xinming Huang, and Ziming Zhang. Self-supervised geometric features discovery via interpretable attention for vehicle re-identification and beyond. In ICCV,
-
[70]
Ming Li, Jun Liu, Ce Zheng, Xinming Huang, and Ziming Zhang. Exploiting multi-view part-wise correlation via an efficient transformer for vehicle re-identification. TOM, 2021
-
[71]
Ming Li, Huazhu Fu, Shengfeng He, Hehe Fan, Jun Liu, Jussi Keppo, and Mike Zheng Shou. Dr-fer: Discriminative and robust representation learning for facial expression recognition. IEEE Transactions on Multimedia, 26:6297–6309, 2023
-
[72]
Ming Li, Xiangyu Xu, Hehe Fan, Pan Zhou, Jun Liu, Jia-Wei Liu, Jiahe Li, Jussi Keppo, Mike Zheng Shou, and Shuicheng Yan. Stprivacy: Spatio-temporal privacy-preserving action recognition. In ICCV, 2023
-
[73]
Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, and Xiangyu Xu. Instant3d: instant text-to-3d generation. IJCV, 2024. 1
-
[74]
Qiyuan Li, Haijiang Liu, Caicai Guo, Deyu Chen, Meng Wang, Feng Gao, and Jinguang Gu. Merging clinical knowledge into large language models for medical research and applications: A survey. arXiv e-prints, pages arXiv–2502, 2025. 1
-
[75]
Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, and Xuelong Li. Growing with the generator: Self-paced grpo for video generation. arXiv preprint arXiv:2511.19356, 2025. 1
-
[76]
Yongqi Li, Wenjie Li, and Liqiang Nie. Mmcoqa: Conversational question answering over text, tables, and images. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4220–4231, 2022. 2, 3, 6
-
[77]
Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, and Chi Zhang. Integrating reinforcement learning with visual generative models: foundations and advances. Vicinagearth, 3(1):2, 2026. 1
-
[78]
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. Commongen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, 2020. 1
-
[79]
Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding. IEEE Transactions on Multimedia, 25:8539–8553, 2023. 1
-
[80]
Daizong Liu, Xiang Fang, Pan Zhou, Xing Di, Weining Lu, and Yu Cheng. Hypotheses tree building for one-shot temporal sentence localization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1640–1648, 2023