pith. sign in

arxiv: 2606.01615 · v1 · pith:QA4EKHDWnew · submitted 2026-06-01 · 💻 cs.CV · cs.MM

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

Pith reviewed 2026-06-28 15:43 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords reaction-diffusionmultimodal fusionvideo moment retrievalTuring patternsGray-Scott modellanguage-guided retrievalvideo-language alignment
0
0 comments X

The pith

Video-language alignment modeled as a reaction-diffusion process creates emergent patterns that adaptively fuse modalities for moment retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that static cross-attention mechanisms fail to capture evolving video-text relationships, and proposes instead to treat modality fusion as a reaction-diffusion process drawn from Turing pattern formation. Video features are allowed to diffuse across time while text interactions act as nonlinear reactions that amplify signal and suppress noise. The Gray-Scott model supplies the concrete dynamics, with stability and convergence proven via Turing instability criteria. The resulting fusion module plugs into standard pretrained encoders and DETR heads. If the approach holds, it supplies a biologically motivated alternative that can improve alignment and generalization on language-guided video tasks.

Core claim

By reimagining video-language alignment as a reaction-diffusion process in which video features diffuse across time and text-video interactions serve as nonlinear reactions, the RDMF framework produces emergent patterns that improve modality alignment and moment retrieval, with mathematical guarantees of stability drawn from Turing instability analysis of the Gray-Scott model.

What carries the argument

Reaction-Diffusion Multimodal Fusion (RDMF) module based on the Gray-Scott model, where diffusion spreads video context temporally and nonlinear reactions between modalities generate adaptive alignment patterns.

If this is right

  • The fusion module integrates directly with pretrained video and text encoders plus DETR-style heads for both moment retrieval and saliency prediction.
  • Turing instability criteria provide explicit conditions for stable pattern formation and convergence of the fusion process.
  • The approach supplies a new biologically grounded paradigm that can replace or augment conventional multimodal fusion layers.
  • Preliminary results indicate potential gains in identifying salient video moments over existing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reaction-diffusion dynamics could be tested on other video-language tasks such as highlight detection or dense video captioning.
  • Emergent patterns might be inspected post-training to reveal which temporal regions receive amplified alignment.
  • Because the module is built from standard differential-equation discretizations, it may admit efficient GPU implementations without custom attention kernels.

Load-bearing premise

That modeling video and text interactions as a reaction-diffusion process will generate emergent patterns that adaptively improve alignment beyond what static cross-attention achieves.

What would settle it

A controlled comparison on a standard benchmark such as Charades-STA or ActivityNet Captions in which the RDMF fusion module yields equal or lower retrieval accuracy than a matched static cross-attention baseline.

Figures

Figures reproduced from arXiv: 2606.01615 by Tat-Seng Chua, Wanlong Fang, Wei Ji, Xiang Fang.

Figure 1
Figure 1. Figure 1: Comparison between traditional model and our [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of the Reaction-Diffusion Multimodal Fusion (RDMF) framework. Video and text inputs are processed [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of the number of reaction-diffusion steps ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Temporal diffusion effects on QVHighlights, where [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of saliency predictions on [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose \textbf{Reaction-Diffusion Multimodal Fusion (RDMF)}, a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Reaction-Diffusion Multimodal Fusion (RDMF) for language-guided video moment retrieval. It reinterprets video-text alignment as a Gray-Scott reaction-diffusion process in which video features diffuse temporally and text-video interactions act as nonlinear reactions, generating emergent Turing patterns for improved modality alignment. The work claims a rigorous mathematical analysis of stability and convergence via Turing instability criteria, uses pretrained encoders and DETR-style heads, and reports that preliminary experiments indicate outperformance over existing methods.

Significance. If the claimed stability analysis holds and the discretized RD system is shown to operate inside the Turing regime while producing alignment-improving patterns, the approach would constitute a genuinely novel interdisciplinary contribution that could address limitations of static cross-attention in dynamic multimodal settings.

major comments (2)
  1. [Abstract] Abstract: the central claim that the RDMF module is 'supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria' is load-bearing, yet the manuscript supplies no Jacobian, dispersion relation, trace/determinant conditions, or verification that the chosen Gray-Scott coefficients and discrete temporal diffusion operator satisfy the inequalities required for positive real-part eigenvalues; without these the assertion that emergent patterns improve alignment cannot be evaluated.
  2. [Abstract] Abstract: the claim that 'preliminary experiments demonstrate its potential to outperform existing methods' is load-bearing for practical viability, yet no datasets, baselines, metrics, or quantitative results are reported, leaving the empirical support for the modeling choice unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the manuscript's claims require additional substantiation. We address each major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the RDMF module is 'supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria' is load-bearing, yet the manuscript supplies no Jacobian, dispersion relation, trace/determinant conditions, or verification that the chosen Gray-Scott coefficients and discrete temporal diffusion operator satisfy the inequalities required for positive real-part eigenvalues; without these the assertion that emergent patterns improve alignment cannot be evaluated.

    Authors: We acknowledge that the current version of the manuscript states the existence of a stability analysis but does not supply the explicit derivations. In the revised manuscript we will add a dedicated subsection (and appendix) that presents the Jacobian matrix of the discretized Gray-Scott system, derives the dispersion relation under the chosen temporal diffusion operator, states the trace/determinant conditions, and verifies that the selected coefficients place the system inside the Turing-unstable regime with positive real-part eigenvalues. This will allow direct evaluation of the claim that the emergent patterns improve modality alignment. revision: yes

  2. Referee: [Abstract] Abstract: the claim that 'preliminary experiments demonstrate its potential to outperform existing methods' is load-bearing for practical viability, yet no datasets, baselines, metrics, or quantitative results are reported, leaving the empirical support for the modeling choice unverifiable.

    Authors: The present manuscript contains only the qualitative statement about preliminary experiments. We will expand the paper with a new Experiments section that reports quantitative results on standard language-guided moment retrieval benchmarks (Charades-STA, ActivityNet Captions), using established baselines (Moment-DETR, 2D-TAN, etc.) and metrics (R@1, mIoU). The abstract will be updated to reference these concrete results rather than the current placeholder phrasing. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and description present the RDMF module as a design choice inspired by the Gray-Scott model, with a claimed mathematical analysis of stability via Turing criteria. No equations, self-citations, fitted parameters renamed as predictions, or uniqueness theorems are quoted that reduce any claimed result to its own inputs by construction. The framework is described as reimagining alignment as an RD process, but this is an ansatz and modeling decision rather than a derivation that loops back on itself. The paper is self-contained against external benchmarks in the provided text, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that video features diffuse temporally and text-video interactions act as non-linear reactions, plus the applicability of Turing instability criteria; these are domain assumptions not independently evidenced in the abstract. No free parameters or invented entities are explicitly quantified.

free parameters (1)
  • Gray-Scott reaction and diffusion coefficients
    Specific parameter values required to produce useful emergent patterns are not stated and would need to be selected or fitted.
axioms (2)
  • domain assumption Video features can be usefully modeled as diffusing across time while text-video interactions behave as non-linear reactions that amplify relevant features and suppress noise.
    This is the core reimagining stated in the abstract.
  • domain assumption Turing instability criteria guarantee stable pattern formation and convergence for the fusion module.
    Invoked to support the claim of rigorous mathematical analysis.

pith-pipeline@v0.9.1-grok · 5802 in / 1480 out tokens · 44805 ms · 2026-06-28T15:43:01.206141+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 2 linked inside Pith

  1. [1]

    Fuyao Cai, Daizong Liu, Xiang Fang, Jixiang Yu, Keke Tang, and Pan Zhou

  2. [2]

    In2025 IEEE International Conference on Multimedia and Expo (ICME)

    Imperceptible Beam-Sensitive Adversarial Attacks for LiDAR-based Object Detection in Autonomous Driving. In2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6. Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval MM ’25, October 27–31, 2025, Dublin, Ireland

  3. [3]

    Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jianfeng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. 2026. Towards building model/prompt- transferable attackers against large vision-language models.Advances in Neural Information Processing Systems38 (2026), 174022–174058

  4. [4]

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexan- der Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. InEuropean conference on computer vision. Springer, 213–229

  5. [5]

    Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018. Tem- porally grounding natural sentence in video. InProceedings of the 2018 conference on empirical methods in natural language processing. 162–171

  6. [6]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

  7. [7]

    Wanlong Fang, Tianle Zhang, and Alvin Chan. 2026. To align or not to align: Strategic multimodal representation alignment for optimal performance. InPro- ceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 21056–21064

  8. [8]

    Wanlong Fang, Tianle Zhang, Wen Tao, and Alvin Chan. 2026. Towards Un- derstanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition. InInternational Conference on Machine Learning

  9. [9]

    Xiang Fang. 2026. Advancing Out-of-Distribution Detection Across Diverse Scenarios. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 41042–41043

  10. [10]

    Xiang Fang, Arvind Easwaran, and Blaise Genest. 2025. Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection. InInternational Conference on Machine Learning

  11. [11]

    Xiang Fang, Arvind Easwaran, Blaise Genest, and Ponnuthurai Nagaratnam Suganthan. 2025. Adaptive Hierarchical Graph Cut for Multi-granularity Out-of- distribution Detection.IEEE Transactions on Artificial Intelligence(2025)

  12. [12]

    Xiang Fang, Arvind Easwaran, Blaise Genest, and Ponnuthurai Nagaratnam Sug- anthan. 2025. Your data is not perfect: Towards cross-domain out-of-distribution detection in class-imbalanced data.Expert Systems with Applications(2025)

  13. [13]

    Xiang Fang and Wanlong Fang. 2026. Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security. InProceedings of the AAAI Conference on Artificial Intelligence

  14. [14]

    Xiang Fang and Wanlong Fang. 2026. SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling. InInternational Conference on Machine Learning

  15. [15]

    Xiang Fang, Wanlong Fang, and Wei Ji. 2026. Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness. InInternational Conference on Machine Learning

  16. [16]

    Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, et al. 2024. Not all inputs are valid: Towards open-set video moment retrieval using language. InProceedings of the 32nd ACM International Conference on Multimedia. 28–37

  17. [17]

    Xiang Fang, Wanlong Fang, and Changshuo Wang. 2025. Hierarchical Semantic- Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation. InAdvances in Neural Information Processing Sys- tems

  18. [18]

    Xiang Fang, Wanlong Fang, and Changshuo Wang. 2026. CogniVerse: Revolution- izing Multi-modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  19. [19]

    Xiang Fang, Wanlong Fang, and Changshuo Wang. 2026. Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture- Constrained Perturbations and Cross-Modal Optimization. InProceedings of the AAAI Conference on Artificial Intelligence

  20. [20]

    Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. 2025. Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 2915–2923

  21. [21]

    Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, and Beibei Li. 2025. Multi-Pair Temporal Sentence Ground- ing via Multi-Thread Knowledge Transfer Network. InProceedings of the AAAI Conference on Artificial Intelligence

  22. [22]

    Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, and Daizong Liu

  23. [23]

    InProceedings of the AAAI Conference on Artificial Intelligence

    Rethinking Video-language Model From the Language Input Perspective. InProceedings of the AAAI Conference on Artificial Intelligence

  24. [24]

    Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, and Wei Ji. 2026. Towards Unified Vision-Language Models With Incom- plete Multi-Modal Inputs. InProceedings of the AAAI Conference on Artificial Intelligence

  25. [25]

    Xiang Fang and Yuchong Hu. 2020. Double self-weighted multi-view clustering via adaptive view fusion.arXiv preprint arXiv:2011.10396(2020)

  26. [26]

    Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Wu. 2021. Animc: A soft approach for autoweighted noisy and incomplete multiview clustering.IEEE Transactions on Artificial Intelligence3, 2 (2021), 192–206

  27. [27]

    Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. 2020. V3H: View variation and view heredity for incomplete multiview clustering.IEEE Transac- tions on Artificial Intelligence1, 3 (2020), 233–247

  28. [28]

    Xiang Fang, Yuchong Hu, Pan Zhou, and Dapeng Oliver Wu. 2021. Unbalanced in- complete multi-view clustering via the scheme of view evolution: Weak views are meat; strong views do eat.IEEE Transactions on Emerging Topics in Computational Intelligence6, 4 (2021), 913–927

  29. [29]

    Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, and Kai Zou. 2023. Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding. InFindings of the Association for Computational Linguistics: EMNLP 2023. 8721–8733

  30. [30]

    Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, and Renfu Li. 2024. Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 1735–1743

  31. [31]

    Xiang Fang, Daizong Liu, Pan Zhou, and Yuchong Hu. 2022. Multi-modal cross- domain alignment network for video moment retrieval.IEEE Transactions on Multimedia25 (2022), 7517–7532

  32. [32]

    Xiang Fang, Daizong Liu, Pan Zhou, and Guoshun Nan. 2023. You can ground earlier than see: An effective and efficient pipeline for temporal sentence ground- ing in compressed videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2448–2460

  33. [33]

    Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, and Ruixuan Li. 2023. Hierarchi- cal local-global transformer for temporal sentence grounding.IEEE Transactions on Multimedia(2023)

  34. [34]

    Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, and Daizong Liu. 2024. Rethinking Weakly- supervised Video Temporal Grounding From a Game Perspective. InEuropean Conference on Computer Vision. Springer

  35. [35]

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slow- fast networks for video recognition. InProceedings of the IEEE/CVF international conference on computer vision. 6202–6211

  36. [36]

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision. 5267–5275

  37. [37]

    Peter Gray and Stephen K Scott. 1983. Autocatalytic reactions in the isothermal, continuous stirred tank reactor: isolas and other forms of multistability.Chemical Engineering Science38, 1 (1983), 29–43

  38. [38]

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision- language representation learning with noisy text supervision. InInternational conference on machine learning. PMLR, 4904–4916

  39. [39]

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles

  40. [40]

    InProceedings of the IEEE international conference on computer vision

    Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision. 706–715

  41. [41]

    Mingjin Kuai, You Qin, Xiang Fang, Wei Ji, and Roger Zimmermann. 2026. Dy- namic Graph-enhanced Event Refinement for Temporal Sentence Grounding of Micro-moments.IEEE Transactions on Multimedia(2026)

  42. [42]

    Huashuo Lei, Xiaowen Cai, Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, Jixiang Yu, and Keyan Jin. 2025. Exploring Disentangled Appearance-Motion Contexts for Temporal Activity Localization. In2025 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8

  43. [43]

    Jie Lei, Tamara L Berg, and Mohit Bansal. 2021. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems34 (2021), 11846–11858

  44. [44]

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. 2023. Univtg: To- wards unified video-language temporal grounding. InProceedings of the IEEE/CVF international conference on computer vision. 2794–2804

  45. [45]

    Daizong Liu, Xiaowen Cai, Junhao Dong, Zhongliang Guo, Xiaoye Qu, Runwei Guan, Xiang Fang, and Dengpan Ye. 2026. Attacking Gray-Box Large Vision- Language Models with Adaptive SVD-Structured Adversarial Alignment. In International Conference on Machine Learning

  46. [46]

    Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. 2023. Exploring optical-flow- guided motion and detection-based appearance for temporal sentence grounding. IEEE Transactions on Multimedia25 (2023), 8539–8553

  47. [47]

    Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, He Yan, Yang Yang, Pan Zhou, and Yu Cheng. 2024. Unsupervised domain adaptative temporal sentence localization with mutual information maximization. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 3567–3575

  48. [48]

    Daizong Liu, Xiang Fang, Pan Zhou, Xing Di, Weining Lu, and Yu Cheng. 2023. Hypotheses tree building for one-shot temporal sentence localization. InProceed- ings of the AAAI Conference on Artificial Intelligence, Vol. 37. 1640–1648

  49. [49]

    Daizong Liu, Xiaoye Qu, Xiang Fang, Jianfeng Dong, Pan Zhou, Guoshun Nan, Keke Tang, Wanlong Fang, and Yu Cheng. 2024. Towards robust temporal activity localization learning with noisy labels. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 16630–16642. MM ’25, Oc...

  50. [50]

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. 2024. Pandora’s box: Towards building universal attackers against real-world large vision-language models.Advances in Neural Information Processing Systems37 (2024), 52127–52158

  51. [51]

    Daizong Liu, Jiahao Zhu, Xiang Fang, Zeyu Xiong, Huan Wang, Renfu Li, and Pan Zhou. 2023. Conditional video diffusion network for fine-grained temporal sentence grounding.IEEE Transactions on Multimedia26 (2023), 5461–5476

  52. [52]

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3202–3211

  53. [53]

    WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae-Pil Heo. 2023. Correlation- guided query-dependency calibration for video temporal grounding.arXiv preprint arXiv:2311.08835(2023)

  54. [54]

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo

  55. [55]

    InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Query-dependent video representation for moment retrieval and highlight detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 23023–23033

  56. [56]

    2003.Mathematical biology: II: spatial models and biomedical applications

    James Dickson Murray and James Dickson Murray. 2003.Mathematical biology: II: spatial models and biomedical applications. Vol. 18. Springer

  57. [57]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  58. [58]

    Keke Tang, Chao Hou, Weilong Peng, Xiang Fang, Zhize Wu, Yongwei Nie, Wenping Wang, and Zhihong Tian. 2025. Simplification is all you need against out-of-distribution overconfidence. InProceedings of the Computer Vision and Pattern Recognition Conference. 5030–5040

  59. [59]

    Keke Tang, Wenyu Zhao, Weilong Peng, Xiang Fang, Xiaodong Cui, Peican Zhu, and Zhihong Tian. 2024. Reparameterization head for efficient multi-input networks. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6190–6194

  60. [60]

    Alan Mathison Turing. 1990. The chemical basis of morphogenesis.Bulletin of mathematical biology52, 1 (1990), 153–197

  61. [61]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  62. [62]

    Changshuo Wang, Xiang Fang, and Prayag Tiwari. 2025. DyPolySeg: Taylor Series-Inspired Dynamic Polynomial Fitting Network for Few-shot Point Cloud Semantic Segmentation. InForty-second International Conference on Machine Learning

  63. [63]

    Changshuo Wang, Shuting He, Xiang Fang, Jiawei Han, Zhonghang Liu, Xin Ning, Weijun Li, and Prayag Tiwari. 2025. Point clouds meets physics: Dynamic acoustic field fitting network for point cloud understanding. InProceedings of the Computer Vision and Pattern Recognition Conference. 22182–22192

  64. [64]

    Changshuo Wang, Shuting He, Xiang Fang, Zhijian Hu, Jia-Hong Huang, Yixian Shen, and Prayag Tiwari. 2026. Reasoning beyond points: A visual introspec- tive approach for few-shot 3d segmentation.Advances in Neural Information Processing Systems38 (2026), 117394–117414

  65. [65]

    Changshuo Wang, Shuting He, Xiang Fang, Weijun Li, Yixian Shen, Mingkun Xu, Zhongtian Sun, and Prayag Tiwari. 2026. TopAdapter: Topology-Aware Prompt Tuning for Efficient Point Cloud Understanding.International Conference on Machine Learning(2026)

  66. [66]

    Changshuo Wang, Shuting He, Xiang Fang, Fangzhe Nan, and Prayag Tiwari

  67. [67]

    InProceedings of the 33rd ACM International Conference on Multimedia

    Seeing the Overlooked: Bio-Visual Inspired Weak Saliency Feedback Trans- former for Person Re-identification. InProceedings of the 33rd ACM International Conference on Multimedia. 3192–3201

  68. [68]

    Changshuo Wang, Shuting He, Xiang Fang, Meiqing Wu, Siew-Kei Lam, and Prayag Tiwari. 2025. Taylor series-inspired local structure fitting network for few- shot point cloud semantic segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 7527–7535

  69. [69]

    Changshuo Wang, Zhijian Hu, Xiang Fang, Zai Yang Yu, Yibin Wu, Mingkun Xu, Yusong Wang, Xingyu Gao, and Prayag Tiwari. 2026. Biologically-Inspired Evolutionary Domain Symbiosis for Few-shot and Zero-shot Point Cloud Seman- tic Segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 9666–9674

  70. [70]

    Changshuo Wang, Weijun Li, Shuting He, Xiang Fang, Xingyu Gao, Zhonghang Liu, Prayag Tiwari, and Dimitrios Kanoulas. 2026. From Coarse to Fine: Deep Prototype Refinement Network for Few-Shot Point Cloud Semantic Segmentation. International Conference on Machine Learning(2026)

  71. [71]

    Junyi Wang, Jinjiang Li, Guodong Fan, Yakun Ju, Xiang Fang, and Alex C Kot

  72. [72]

    Prototype-driven structure synergy network for remote sensing images segmentation.IEEE Transactions on Geoscience and Remote Sensing(2025)

  73. [73]

    Siyi Wang, Suman Dutta, Wei Jie Bryan Lee, Jerrie Feng, Xiang Fang, and Anupam Chattopadhyay. 2025. Reducing T-Depth and T-Count in Quantum Multiplication Using Compressor Primitives. InProceedings of the Great Lakes Symposium on VLSI 2025. 35–40

  74. [74]

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al . 2022. Internvideo: General video foundation models via generative and discriminative learning.arXiv preprint arXiv:2212.03191(2022)

  75. [75]

    Zeyu Xiong, Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, Jiahao Zhu, Keke Tang, and Pan Zhou. 2024. Rethinking video sentence grounding from a tracking perspective with memory network and masked attention.IEEE Transac- tions on Multimedia26 (2024), 11204–11218

  76. [76]

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. InProceedings of the 2021 conference on empirical methods in natural language processing. 6787– 6800

  77. [77]

    Hai Yan, Haijian Ma, Xiaowen Cai, Daizong Liu, Zenghui Yuan, Xiaoye Qu, Jianfeng Dong, Runwei Guan, Xiang Fang, Hongyang He, et al . 2026. Fit the distribution: Cross-image/prompt adversarial attacks on multimodal large lan- guage models.Advances in Neural Information Processing Systems38 (2026), 75204–75247

  78. [78]

    Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022. Tubedetr: Spatio-temporal video grounding with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16442– 16453

  79. [79]

    Guide Yang, Chao Hou, Weilong Peng, Xiang Fang, Yongwei Nie, Peican Zhu, and Keke Tang. 2025. EOOD: Entropy-based Out-of-distribution Detection. In 2025 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8

  80. [80]

    Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020. Span-based localiz- ing network for natural language video localization. InProceedings of the 58th annual meeting of the association for computational linguistics. 6543–6554

Showing first 80 references.