Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3
The pith
A quad-modal language model fuses video, audio, physiological time-series, and text to infer feline intentions beyond surface behavior patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Meow-Omni 1 is the first open-source quad-modal large language model built for computational ethology; it integrates specialized scientific encoders for video, audio, and physiological time-series into a single backbone, then uses physiologically grounded cross-modal alignment to perform intent inference, reaching 71.16% accuracy on the expert-verified MeowBench benchmark while exceeding leading vision-language and omni-modal baselines.
What carries the argument
Physiologically grounded cross-modal alignment that fuses quad-modal streams inside a unified backbone to distinguish semantically aliased external signals according to internal physiological context.
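To make the mechanism concrete: a minimal sketch of what such an alignment objective could look like, assuming an InfoNCE-style contrastive loss that pulls each clip's audio/video embedding toward the physiological embedding recorded at the same moment. Module names, dimensions, and the loss itself are illustrative assumptions; the paper's own formulation is only summarized above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    # Projects audio/video and physiological embeddings into a shared space
    # so that matched pairs (same clip) score higher than mismatched ones.
    def __init__(self, dim_av=768, dim_phys=128, dim_joint=256):
        super().__init__()
        self.proj_av = nn.Linear(dim_av, dim_joint)
        self.proj_phys = nn.Linear(dim_phys, dim_joint)

    def forward(self, av_emb, phys_emb, temperature=0.07):
        # av_emb: (B, dim_av), phys_emb: (B, dim_phys), paired by clip index
        a = F.normalize(self.proj_av(av_emb), dim=-1)
        p = F.normalize(self.proj_phys(phys_emb), dim=-1)
        logits = a @ p.t() / temperature           # (B, B) similarity matrix
        targets = torch.arange(a.size(0))          # matched pairs on the diagonal
        # Symmetric InfoNCE: align both directions
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

Under an objective of this shape, two acoustically identical purrs recorded under different heart-rate contexts land in different regions of the joint space, which is exactly the separation the semantic-aliasing claim requires.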
If this is right
- Semantic aliasing in animal signals can be resolved by explicit physiological context instead of relying on behavioral patterns alone.
- Open release of the model, training code and Meow-10K dataset enables direct extension to other species and diagnostic tasks.
- Foundation models for ethology can now incorporate high-frequency biological time-series without custom post-processing pipelines (see the encoder sketch after this list).
- Veterinary and conservation applications gain a concrete pathway from raw multimodal recordings to actionable intent labels.
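On the time-series point, a high-frequency physiological stream could enter the backbone as ordinary tokens via a strided 1D-CNN tokenizer; a minimal sketch, with channel counts, dimensions, and sampling rates entirely illustrative rather than the paper's:

import torch
import torch.nn as nn

class PhysEncoder(nn.Module):
    # Downsamples raw multi-channel physiological samples (e.g., heart rate,
    # accelerometry) into a short sequence of embeddings that a language-model
    # backbone can attend to alongside video and audio tokens.
    def __init__(self, in_channels=4, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, dim, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
        )

    def forward(self, x):
        # x: (batch, channels, samples) raw signal -> (batch, tokens, dim)
        return self.net(x).transpose(1, 2)

tokens = PhysEncoder()(torch.randn(2, 4, 1024))  # shape: (2, 64, 128)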
Where Pith is reading between the lines
- The same alignment technique could be tested on livestock or zoo animals where collar or implant data already exist.
- If the approach generalizes, it reduces the need for prolonged visual observation in field studies of elusive species.
- Future versions might predict physiological state from behavior alone once the alignment is learned, enabling non-contact monitoring.
Load-bearing premise
The new MeowBench benchmark, even after expert verification, measures genuine physiologically grounded intent rather than superficial correlations between observable signals.
What would settle it
Present the model with matched video-audio clips where physiological readings clearly contradict the most probable behavioral interpretation; if accuracy drops to near chance while human experts still succeed using the physiology, the claim of latent-state reasoning fails.
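A minimal sketch of that test, with the model interface and dataset fields as hypothetical stand-ins for whatever harness MeowBench provides:

def split_accuracy(model, clips):
    # Compare accuracy on clips where physiology agrees with the obvious
    # behavioral reading against clips where it contradicts it.
    scores = {"consistent": [], "contradictory": []}
    for clip in clips:
        pred = model.predict(clip.video, clip.audio, clip.physiology)
        key = "contradictory" if clip.physiology_contradicts_behavior else "consistent"
        scores[key].append(pred == clip.expert_label)
    return {k: sum(v) / len(v) for k, v in scores.items() if v}

If the contradictory split drops to chance while the consistent split stays near the headline 71.16%, the model is matching surface signals rather than reasoning over latent state.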
Original abstract
Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat's purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Meow-Omni 1, the first open-source quad-modal MLLM for computational ethology that natively fuses video, audio, physiological time-series, and text to address semantic aliasing in feline intent recognition. It presents the self-introduced MeowBench benchmark and Meow-10K dataset, claiming SOTA intent-recognition accuracy of 71.16% that substantially outperforms vision-language and omni-modal baselines, and releases the model weights, training framework, and dataset.
Significance. If the central claims hold, this would constitute a meaningful advance in multimodal AI for ethology by demonstrating the value of physiological grounding for latent-state reasoning over behavioral patterns, with downstream potential in veterinary diagnostics and conservation. The open-source release of weights, code, and data is a clear strength that would support reproducibility and community follow-up work.
Major comments (3)
- Abstract: The SOTA claim of 71.16% intent-recognition accuracy on MeowBench is presented without any description of the model architecture, integration of scientific encoders, training procedure, or formalization of 'physiologically grounded cross-modal alignment.' These details are load-bearing for determining whether outperformance reflects the claimed architectural advance rather than dataset-specific fitting.
- Benchmark section: MeowBench is described as 'expert-verified' yet supplies no verification protocol, inter-expert agreement statistics, label construction details (e.g., how purrs are disambiguated via heart-rate or cortisol proxies), or negative controls for superficial video/audio cues. This is critical because the benchmark and Meow-10K dataset are author-created, directly affecting the validity of the cross-baseline comparison.
- Experiments section: No error analysis, ablation studies on modality contributions, or explicit comparison methodology (e.g., baseline implementations, evaluation splits) are provided to support the reported accuracy gains. This omission prevents assessment of whether the 71.16% figure reliably demonstrates genuine latent-state reasoning.
Minor comments (1)
- Abstract: The abstract would benefit from a brief statement on model scale (parameters) and training data volume to contextualize the quad-modal fusion claims.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which will help us improve the manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
Referee: Abstract: The SOTA claim of 71.16% intent-recognition accuracy on MeowBench is presented without any description of the model architecture, integration of scientific encoders, training procedure, or formalization of 'physiologically grounded cross-modal alignment.' These details are load-bearing for determining whether outperformance reflects the claimed architectural advance rather than dataset-specific fitting.
Authors: The abstract is intentionally concise to highlight the key contributions. Detailed descriptions of the model architecture, including the integration of scientific encoders for video, audio, physiological time-series, and text, the training procedure, and the formalization of physiologically grounded cross-modal alignment are provided in Sections 3 and 4 of the manuscript. To better support the SOTA claim in the abstract, we will revise it to include a brief mention of these elements, ensuring readers can assess the architectural novelty without needing to read the full text immediately.
Revision: yes
Referee: Benchmark section: MeowBench is described as 'expert-verified' yet supplies no verification protocol, inter-expert agreement statistics, label construction details (e.g., how purrs are disambiguated via heart-rate or cortisol proxies), or negative controls for superficial video/audio cues. This is critical because the benchmark and Meow-10K dataset are author-created, directly affecting the validity of the cross-baseline comparison.
Authors: We agree that more transparency is needed regarding the construction and verification of MeowBench and Meow-10K. In the revised version, we will expand the benchmark section with a detailed verification protocol, report inter-expert agreement statistics, explain label construction using physiological proxies such as heart-rate and cortisol levels for disambiguating intents like purring, and include negative controls to rule out reliance on superficial cues. This will substantiate the expert-verified claim and support the validity of our comparisons.
Revision: yes
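The promised agreement statistics could take the form of Fleiss' kappa over the expert labels; a minimal sketch in which only the formula is standard and the toy ratings are made up:

import numpy as np

def fleiss_kappa(counts):
    # counts: (clips, categories); counts[i, j] = number of experts who
    # assigned clip i to intent category j (constant raters per clip assumed)
    n = counts.sum(axis=1)[0]                   # raters per clip
    p_j = counts.sum(axis=0) / counts.sum()     # overall category proportions
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-clip agreement
    return (P_i.mean() - (p_j ** 2).sum()) / (1 - (p_j ** 2).sum())

ratings = np.array([[5, 0, 0], [3, 2, 0], [1, 1, 3]])  # 3 clips, 5 experts, 3 intents
print(round(fleiss_kappa(ratings), 3))  # ~0.226 here; benchmarks typically want > 0.6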
Referee: Experiments section: No error analysis, ablation studies on modality contributions, or explicit comparison methodology (e.g., baseline implementations, evaluation splits) are provided to support the reported accuracy gains. This omission prevents assessment of whether the 71.16% figure reliably demonstrates genuine latent-state reasoning.
Authors: We acknowledge this gap in the current presentation. While the experiments section reports the accuracy and comparisons to baselines, we will add comprehensive error analysis, ablation studies isolating the contribution of each modality (video, audio, physiological time-series, and text), and detailed methodology for baseline implementations and evaluation splits. These additions will provide stronger evidence that the performance gains stem from the quad-modal fusion and physiologically grounded alignment rather than other factors.
Revision: yes
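The promised ablations could be as simple as re-evaluating with one input stream masked at a time; evaluate and the modality names below are illustrative placeholders, not the paper's harness:

MODALITIES = ("video", "audio", "physiology", "text")

def ablate(model, benchmark, evaluate):
    # Re-run evaluation with each modality zeroed out in turn; the accuracy
    # drop attributable to each stream indicates its contribution.
    report = {"full": evaluate(model, benchmark, mask=())}
    for m in MODALITIES:
        report["minus_" + m] = evaluate(model, benchmark, mask=(m,))
    return report

A large drop when masking physiology, relative to masking video or audio, would directly support the claim that the gains come from physiological grounding rather than generic multimodal fusion.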
Circularity Check
No significant circularity in the derivation chain.
Full rationale
The paper introduces a new quad-modal model, benchmark (MeowBench), and dataset (Meow-10K) and reports performance (71.16% intent-recognition accuracy) that outperforms external vision-language and omni-modal baselines. No equations, fitted parameters renamed as predictions, self-definitional reductions, or load-bearing self-citations appear in the abstract or described structure. The central claim rests on comparative evaluation against independent baselines rather than reducing to the authors' own inputs by construction. The release of the full pipeline and dataset provides external verifiability, so the derivation chain can be checked independently rather than taken on the authors' word.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Meow-Omni 1 (no independent evidence)