Recognition: unknown
MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
Pith reviewed 2026-05-08 09:09 UTC · model grok-4.3
The pith
Multimodal large language models can predict social dominance hierarchies among mice from raw video of their pairwise interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned multimodal large language models perform zero-shot inference on unseen mouse behavioral videos to predict social dominance, producing rankings that agree closely with those obtained from conventional tube tests; this demonstrates that foundation models can be applied to ethology without the design of domain-specific models.
What carries the argument
MTT-Bench, the benchmark of annotated pairwise mouse interaction videos used to fine-tune MLLMs for zero-shot dominance prediction from raw footage.
If this is right
- Dominance can be predicted without supplying explicit labels at test time.
- Foundation models become usable for social-behavior analysis in animals without custom engineering per domain.
- Automated ranking of hierarchies becomes feasible on larger video collections than manual scoring allows.
- The same pipeline can be applied to other pairwise interaction tasks once additional annotated video sets exist.
Where Pith is reading between the lines
- Similar video-based fine-tuning might be tried on other species or on different social behaviors such as mating or aggression.
- If the models truly capture dominance rather than low-level motion statistics, they could be tested on videos recorded under varied lighting or camera angles.
- The approach could reduce the total annotation burden if a single fine-tuned model transfers across multiple behavioral assays.
Load-bearing premise
Fine-tuned multimodal LLMs extract genuine dominance signals from limited mouse videos rather than latching onto spurious visual cues that fail to generalize.
What would settle it
A fresh collection of tube-test videos on which the model's predicted dominance order shows low agreement with both repeated tube-test outcomes and independent human observers.
Figures
read the original abstract
Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models(MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MTT-Bench, a benchmark of annotated videos capturing pairwise mouse interactions in the Mouse Tube Test. It fine-tunes multimodal large language models (MLLMs) on this data to enable zero-shot prediction of social dominance hierarchies directly from raw behavioral videos on unseen sequences. The central claim is that the resulting framework achieves high agreement with traditional tube test rankings, offering a general-purpose foundation-model approach to ethology that avoids the need for hand-crafted domain-specific models.
Significance. If the quantitative claims hold under proper validation, the work could meaningfully advance the application of general-purpose MLLMs to animal behavior analysis in neuroscience, potentially streamlining dominance hierarchy inference without specialized feature engineering. The introduction of a public benchmark is a constructive step, but the absence of any reported metrics, controls, or baselines currently prevents determining whether the approach captures genuine behavioral signals or spurious video cues.
major comments (2)
- [Abstract] Abstract: The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.
- [Framework] Framework description: No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one concrete performance number and the size of MTT-Bench to allow readers to gauge the scale of the reported agreement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that both the abstract and framework sections require additional quantitative and technical details to strengthen the central claims. We will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.
Authors: We agree that the abstract should include quantitative support to allow immediate evaluation of the claims. The full manuscript reports these metrics in the Results section (accuracy, Spearman rank correlation, Cohen's kappa, sample sizes for train/test splits, and statistical tests). We will revise the abstract to explicitly state key values such as the agreement rate with tube-test rankings, the number of videos evaluated, and the train/test protocol. This change will directly address concerns about potential confounders by highlighting the controlled evaluation setup. revision: yes
-
Referee: [Framework] Framework description: No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.
Authors: We acknowledge the need for these technical specifics. The revised manuscript will expand the Methods section to detail the fine-tuning objective (pairwise ranking via contrastive alignment), the loss function, output decoding (LLM-based pairwise comparison), the exact MLLM backbones used, and ablation studies that control for non-behavioral cues such as body size and lighting. These additions will demonstrate that performance derives from behavioral signals rather than dataset shortcuts. revision: yes
Circularity Check
No significant circularity; empirical pipeline with no derivations
full rationale
The paper introduces a new benchmark (MTT-Bench) of annotated mouse videos and fine-tunes standard MLLM architectures to predict dominance rankings, reporting empirical agreement with tube-test outcomes. No equations, theoretical derivations, or first-principles results exist that could reduce to their own inputs by construction. The central claim is an empirical performance statement on held-out sequences, not a self-referential fit or renamed pattern. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is self-contained as a standard ML benchmark + fine-tuning study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The mouse ascending: perspectives for human- disease models
Rosenthal N, Brown S. The mouse ascending: perspectives for human- disease models. Nat Cell Biol. 2007;9(9):993-999. doi:10.1038/ncb437
-
[2]
The Applicability of Mouse Models to the Study of Human Disease
Rydell-T ¨orm¨anen K, Johnson JR. The Applicability of Mouse Models to the Study of Human Disease. Methods Mol Biol. 2019;1940:3-22. https://doi.org/10.1007/978-1-4939-9086-3 1
-
[3]
Datta SR, Anderson DJ, Branson K, Perona P, Leifer A. (2019). Computational neuroethology: a call to action. Neuron, 104(1), 11-24. https://doi.org/10.1016/j.neuron.2019.09.038
-
[4]
Machine Learning in Modeling of Mouse Behavior
Gharagozloo M, Amrani A, Wittingstall K, Hamilton-Wright A, Gris D. Machine Learning in Modeling of Mouse Behavior. Front Neurosci. 2021;15:700253. Published 2021 Sep 14. doi:10.3389/fnins.2021.700253
-
[5]
Behavioral tests for the assessment of social hierarchy in mice
Zheng H, Chen D, Zhong Z, et al. Behavioral tests for the assessment of social hierarchy in mice. Front Behav Neurosci. 2025;19:1549666. Published 2025 Mar 5. doi:10.3389/fnbeh.2025.1549666
-
[6]
Fetcho, R.N., Hall, B.S., Estrin, D.J. et al. Regulation of social interaction in mice by a frontostriatal circuit modulated by es- tablished hierarchical relationships. Nat Commun 14, 2487 (2023). https://doi.org/10.1038/s41467-023-37460-6
-
[7]
Pereira, T. D., Shaevitz, J. W., & Murthy, M. (2020). Quantifying be- havior to understand the brain. Nature neuroscience, 23(12), 1537–1549. https://doi.org/10.1038/s41593-020-00734-z
-
[8]
Pereira, T.D., Tabris, N., Matsliah, A. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nat Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1
-
[9]
Ye, S., Filippova, A., Lauer, J. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Nat Commun 15, 5165 (2024). https://doi.org/10.1038/s41467-024-48792-2
-
[10]
A. Hsu and E. Yttri, “B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors,” Nature Communica- tions, vol. 12, no. 1, Aug. 2021, Art. no. 5188, doi: 10.1038/s41467- 021-25420-x
-
[11]
Weinreb, C., Pearl, J., Lin, S., Osman, M. A. M., Zhang, L., Annapra- gada, S., Conlin, E., Hoffman, R., Makowska, S., Gillis, W. F., Jay, M., Ye, S., Mathis, A., Mathis, M. W., Pereira, T., Linderman, S. W., & Datta, S. R. (2023). Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. bioRxiv : the preprint server for biology, 2023....
-
[12]
Fox, S. N., Kashyap, S. N., Murchison, C. F., Arrant, A. E., & Roberson, E. D. (2025). Assessing Social Dominance in Mouse Models Using the Tube Test. Journal of visualized experiments : JoVE, (220), 10.3791/67919. https://doi.org/10.3791/67919
-
[13]
Feixiang Zhou, Xinyu Yang, Fang Chen, Long Chen, Zhenheng Jiang, Hu Zhu, Reiko Heckel, Haikuan Wang, Minrui Fei, and Huiyu Zhou, “Cross-Skeleton Interaction Graph Aggregation Net- work for Representation Learning of Mouse Social Behaviour,” arXiv preprint arXiv:2208.03819, Jan. 2025. https://doi.org/10.48550/arXiv. 2208.03819
work page internal anchor Pith review doi:10.48550/arxiv 2025
-
[14]
Hilbert’s sixth problem: derivation of fluid equations via Boltzmann’s kinetic theory,
T. Xu, T. Zhou, Y . Wang, et al., “MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis,” *arXiv preprint arXiv:2503.10212*, Mar. 2025. https://doi.org/10.48550/arXiv.2503. 10212
-
[15]
Varholick, J.A., Pontiggia, A., Murphy, E. et al. Social dominance hierarchy type and rank contribute to phenotypic variation within cages of laboratory mice. Sci Rep 9, 13650 (2019). https://doi.org/10.1038/ s41598-019-49612-0
2019
-
[16]
Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, ”Time Series Data Augmentation for Deep Learning: A Survey,” in *Proc. 30th Int. Joint Conf. Artif. Intell. (IJCAI)*, Montreal, Canada, 2021, pp. 4653-4660, doi: 10.24963/ijcai.2021/631
- [17]
-
[18]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 139, pp. 8748–8763, 2021
2021
-
[20]
VideoCoCa: Video captioning with contrastive learning,
Z. Yang, P.-Y . Huang, Y . Wang, X. Li, B. Zoph, Q. Le, and Y . Wu, “VideoCoCa: Video captioning with contrastive learning,” arXiv preprint arXiv:2205.11074, 2022
-
[21]
Some methods for classification and analysis of multi- variate observations,
J. MacQueen, “Some methods for classification and analysis of multi- variate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 281–297, 1967
1967
-
[22]
Unsupervised learning of visual features by contrasting cluster assign- ments,
M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assign- ments,” in Proc. NeurIPS, 2020
2020
-
[24]
Time-Contrastive Networks: Self-Supervised Learning from Video,
P. Sermanet, C. Lynch, Y . Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-Contrastive Networks: Self-Supervised Learning from Video,” arXiv preprint arXiv:1704.06888, 2018. https://arxiv.org/abs/ 1704.06888
-
[25]
J. Li, D. Li, S. Savarese, and S. Fei-Fei, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” arXiv preprint arXiv:2301.12597, 2023
work page internal anchor Pith review arXiv 2023
-
[26]
Flamingo: a Visual Language Model for Few-Shot Learning
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, et al., “Flamingo: A Visual Language Model for Few-Shot Learning,” arXiv preprint arXiv:2204.14198, 2022
work page internal anchor Pith review arXiv 2022
-
[27]
InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns,
C. Xu, X. Lin, W. Zhang, et al., “InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns,” arXiv preprint arXiv:2305.11967, 2023
-
[28]
Hello gpt-4o
OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt4o, 2024
2024
-
[29]
InternVL3-1B
Shanghai AI Laboratory. InternVL3-1B. https://opencompass.org.cn/ internvl3-1b, 2024
2024
-
[30]
InternVL3-2B
Shanghai AI Laboratory. InternVL3-2B. https://opencompass.org.cn/ internvl3-2b, 2024
2024
-
[31]
InternVL3-8B
Shanghai AI Laboratory. InternVL3-8B. https://opencompass.org.cn/ internvl3-8b, 2024
2024
-
[32]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review arXiv 2024
-
[33]
Gpt-4 technical report, 2023
OpenAI. Gpt-4 technical report, 2023. Technical report
2023
-
[34]
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?,
Y . Li, J. Niu, Z. Miao, C. Ge, Y . Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian, P. Zhang, Y . Zang, Y . Cao, C. He, and J. Wang, “OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?,” arXiv preprint arXiv:2501.05510, 2025. https: //arxiv.org/abs/2501.05510
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.