Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3
The pith
A quad-modal language model fuses video, audio, physiological time-series, and text to infer feline intentions beyond surface behavior patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Meow-Omni 1 is the first open-source quad-modal large language model built for computational ethology; it integrates specialized scientific encoders for video, audio, and physiological time-series into a single backbone, then uses physiologically grounded cross-modal alignment to perform intent inference, reaching 71.16% accuracy on the expert-verified MeowBench benchmark while exceeding leading vision-language and omni-modal baselines.
What carries the argument
Physiologically grounded cross-modal alignment that fuses quad-modal streams inside a unified backbone to distinguish semantically aliased external signals according to internal physiological context.
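To make the mechanism concrete: a minimal sketch of what such an alignment objective could look like, assuming an InfoNCE-style contrastive loss that pulls each clip's audio/video embedding toward the physiological embedding recorded at the same moment. Module names, dimensions, and the loss itself are illustrative assumptions; the paper's own formulation is only summarized above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    # Projects audio/video and physiological embeddings into a shared space
    # so that matched pairs (same clip) score higher than mismatched ones.
    def __init__(self, dim_av=768, dim_phys=128, dim_joint=256):
        super().__init__()
        self.proj_av = nn.Linear(dim_av, dim_joint)
        self.proj_phys = nn.Linear(dim_phys, dim_joint)

    def forward(self, av_emb, phys_emb, temperature=0.07):
        # av_emb: (B, dim_av), phys_emb: (B, dim_phys), paired by clip index
        a = F.normalize(self.proj_av(av_emb), dim=-1)
        p = F.normalize(self.proj_phys(phys_emb), dim=-1)
        logits = a @ p.t() / temperature           # (B, B) similarity matrix
        targets = torch.arange(a.size(0))          # matched pairs on the diagonal
        # Symmetric InfoNCE: align both directions
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

Under an objective of this shape, two acoustically identical purrs recorded under different heart-rate contexts land in different regions of the joint space, which is exactly the separation the semantic-aliasing claim requires.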
If this is right
- Semantic aliasing in animal signals can be resolved by explicit physiological context instead of relying on behavioral patterns alone.
- Open release of the model, training code and Meow-10K dataset enables direct extension to other species and diagnostic tasks.
- Foundation models for ethology can now incorporate high-frequency biological time-series without custom post-processing pipelines (see the encoder sketch after this list).
- Veterinary and conservation applications gain a concrete pathway from raw multimodal recordings to actionable intent labels.
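On the time-series point, a high-frequency physiological stream could enter the backbone as ordinary tokens via a strided 1D-CNN tokenizer; a minimal sketch, with channel counts, dimensions, and sampling rates entirely illustrative rather than the paper's:

import torch
import torch.nn as nn

class PhysEncoder(nn.Module):
    # Downsamples raw multi-channel physiological samples (e.g., heart rate,
    # accelerometry) into a short sequence of embeddings that a language-model
    # backbone can attend to alongside video and audio tokens.
    def __init__(self, in_channels=4, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, dim, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
        )

    def forward(self, x):
        # x: (batch, channels, samples) raw signal -> (batch, tokens, dim)
        return self.net(x).transpose(1, 2)

tokens = PhysEncoder()(torch.randn(2, 4, 1024))  # shape: (2, 64, 128)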
Where Pith is reading between the lines
- The same alignment technique could be tested on livestock or zoo animals where collar or implant data already exist.
- If the approach generalizes, it reduces the need for prolonged visual observation in field studies of elusive species.
- Future versions might predict physiological state from behavior alone once the alignment is learned, enabling non-contact monitoring.
Load-bearing premise
The new MeowBench benchmark, even after expert verification, measures genuine physiologically grounded intent rather than superficial correlations between observable signals.
What would settle it
Present the model with matched video-audio clips where physiological readings clearly contradict the most probable behavioral interpretation; if accuracy drops to near chance while human experts still succeed using the physiology, the claim of latent-state reasoning fails.
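A minimal sketch of that test, with the model interface and dataset fields as hypothetical stand-ins for whatever harness MeowBench provides:

def split_accuracy(model, clips):
    # Compare accuracy on clips where physiology agrees with the obvious
    # behavioral reading against clips where it contradicts it.
    scores = {"consistent": [], "contradictory": []}
    for clip in clips:
        pred = model.predict(clip.video, clip.audio, clip.physiology)
        key = "contradictory" if clip.physiology_contradicts_behavior else "consistent"
        scores[key].append(pred == clip.expert_label)
    return {k: sum(v) / len(v) for k, v in scores.items() if v}

If the contradictory split drops to chance while the consistent split stays near the headline 71.16%, the model is matching surface signals rather than reasoning over latent state.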
Original abstract
Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat's purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Meow-Omni 1, the first open-source quad-modal MLLM for computational ethology that natively fuses video, audio, physiological time-series, and text to address semantic aliasing in feline intent recognition. It presents the self-introduced MeowBench benchmark and Meow-10K dataset, claiming SOTA intent-recognition accuracy of 71.16% that substantially outperforms vision-language and omni-modal baselines, and releases the model weights, training framework, and dataset.
Significance. If the central claims hold, this would constitute a meaningful advance in multimodal AI for ethology by demonstrating the value of physiological grounding for latent-state reasoning over behavioral patterns, with downstream potential in veterinary diagnostics and conservation. The open-source release of weights, code, and data is a clear strength that would support reproducibility and community follow-up work.
Major comments (3)
- Abstract: The SOTA claim of 71.16% intent-recognition accuracy on MeowBench is presented without any description of the model architecture, integration of scientific encoders, training procedure, or formalization of 'physiologically grounded cross-modal alignment.' These details are load-bearing for determining whether outperformance reflects the claimed architectural advance rather than dataset-specific fitting.
- Benchmark section: MeowBench is described as 'expert-verified' yet supplies no verification protocol, inter-expert agreement statistics, label construction details (e.g., how purrs are disambiguated via heart-rate or cortisol proxies), or negative controls for superficial video/audio cues. This is critical because the benchmark and Meow-10K dataset are author-created, directly affecting the validity of the cross-baseline comparison.
- Experiments section: No error analysis, ablation studies on modality contributions, or explicit comparison methodology (e.g., baseline implementations, evaluation splits) are provided to support the reported accuracy gains. This omission prevents assessment of whether the 71.16% figure reliably demonstrates genuine latent-state reasoning.
Minor comments (1)
- Abstract: The abstract would benefit from a brief statement on model scale (parameters) and training data volume to contextualize the quad-modal fusion claims.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which will help us improve the manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
Referee: Abstract: The SOTA claim of 71.16% intent-recognition accuracy on MeowBench is presented without any description of the model architecture, integration of scientific encoders, training procedure, or formalization of 'physiologically grounded cross-modal alignment.' These details are load-bearing for determining whether outperformance reflects the claimed architectural advance rather than dataset-specific fitting.
Authors: The abstract is intentionally concise to highlight the key contributions. Detailed descriptions of the model architecture, including the integration of scientific encoders for video, audio, physiological time-series, and text, the training procedure, and the formalization of physiologically grounded cross-modal alignment are provided in Sections 3 and 4 of the manuscript. To better support the SOTA claim in the abstract, we will revise it to include a brief mention of these elements, ensuring readers can assess the architectural novelty without needing to read the full text immediately.
Revision: yes
Referee: Benchmark section: MeowBench is described as 'expert-verified' yet supplies no verification protocol, inter-expert agreement statistics, label construction details (e.g., how purrs are disambiguated via heart-rate or cortisol proxies), or negative controls for superficial video/audio cues. This is critical because the benchmark and Meow-10K dataset are author-created, directly affecting the validity of the cross-baseline comparison.
Authors: We agree that more transparency is needed regarding the construction and verification of MeowBench and Meow-10K. In the revised version, we will expand the benchmark section with a detailed verification protocol, report inter-expert agreement statistics, explain label construction using physiological proxies such as heart-rate and cortisol levels for disambiguating intents like purring, and include negative controls to rule out reliance on superficial cues. This will substantiate the expert-verified claim and support the validity of our comparisons.
Revision: yes
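The promised agreement statistics could take the form of Fleiss' kappa over the expert labels; a minimal sketch in which only the formula is standard and the toy ratings are made up:

import numpy as np

def fleiss_kappa(counts):
    # counts: (clips, categories); counts[i, j] = number of experts who
    # assigned clip i to intent category j (constant raters per clip assumed)
    n = counts.sum(axis=1)[0]                   # raters per clip
    p_j = counts.sum(axis=0) / counts.sum()     # overall category proportions
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-clip agreement
    return (P_i.mean() - (p_j ** 2).sum()) / (1 - (p_j ** 2).sum())

ratings = np.array([[5, 0, 0], [3, 2, 0], [1, 1, 3]])  # 3 clips, 5 experts, 3 intents
print(round(fleiss_kappa(ratings), 3))  # ~0.226 here; benchmarks typically want > 0.6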
Referee: Experiments section: No error analysis, ablation studies on modality contributions, or explicit comparison methodology (e.g., baseline implementations, evaluation splits) are provided to support the reported accuracy gains. This omission prevents assessment of whether the 71.16% figure reliably demonstrates genuine latent-state reasoning.
Authors: We acknowledge this gap in the current presentation. While the experiments section reports the accuracy and comparisons to baselines, we will add comprehensive error analysis, ablation studies isolating the contribution of each modality (video, audio, physiological time-series, and text), and detailed methodology for baseline implementations and evaluation splits. These additions will provide stronger evidence that the performance gains stem from the quad-modal fusion and physiologically grounded alignment rather than other factors.
Revision: yes
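The promised ablations could be as simple as re-evaluating with one input stream masked at a time; evaluate and the modality names below are illustrative placeholders, not the paper's harness:

MODALITIES = ("video", "audio", "physiology", "text")

def ablate(model, benchmark, evaluate):
    # Re-run evaluation with each modality zeroed out in turn; the accuracy
    # drop attributable to each stream indicates its contribution.
    report = {"full": evaluate(model, benchmark, mask=())}
    for m in MODALITIES:
        report["minus_" + m] = evaluate(model, benchmark, mask=(m,))
    return report

A large drop when masking physiology, relative to masking video or audio, would directly support the claim that the gains come from physiological grounding rather than generic multimodal fusion.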
Circularity Check
No significant circularity in the derivation chain.
Full rationale
The paper introduces a new quad-modal model, benchmark (MeowBench), and dataset (Meow-10K) and reports performance (71.16% intent-recognition accuracy) that outperforms external vision-language and omni-modal baselines. No equations, fitted parameters renamed as predictions, self-definitional reductions, or load-bearing self-citations appear in the abstract or described structure. The central claim rests on comparative evaluation against independent baselines rather than reducing to the authors' own inputs by construction. The release of the full pipeline and dataset provides external verifiability, so the derivation chain can be checked independently rather than taken on the authors' word.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Meow-Omni 1 (no independent evidence)