pith. machine review for the scientific record.

arxiv: 2604.22492 · v1 · submitted 2026-04-24 · 📡 eess.IV · cs.CV

Recognition: unknown

MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

Haoyu Chen, Yunquan Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:09 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV
keywords social dominance · mice · multimodal LLMs · tube test · ethology · video analysis · behavioral prediction · dominance hierarchy

The pith

Multimodal large language models can predict social dominance hierarchies among mice from raw video of their pairwise interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tests whether existing multimodal LLMs, after fine-tuning on a new set of labeled mouse videos, can rank which mouse is dominant in unseen tube-test clips. The authors assemble MTT-Bench, a collection of annotated pairwise interaction videos, and adapt MLLM architectures so they output dominance predictions without receiving test-time labels. Results show substantial agreement with the rankings produced by the standard tube-test procedure. If the approach holds, ethologists could analyze dominance without building custom computer-vision pipelines for each species or behavior.
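To make the test-time loop concrete, here is a minimal sketch under stated assumptions: `query_mllm` is a hypothetical handle to the fine-tuned model, and the prompt wording is illustrative; the paper specifies none of these interface details.

```python
# Minimal sketch of the zero-shot inference loop. `query_mllm` is a
# hypothetical (video_path, prompt) -> text handle; the paper does not
# specify its model interface, prompt, or answer decoding.
from typing import Callable

def predict_pairwise_winners(
    clips: list[str],
    query_mllm: Callable[[str, str], str],
) -> list[str]:
    """Return 'left' or 'right' (the predicted dominant mouse) per clip."""
    prompt = (
        "Two mice meet head-on in a tube. Which mouse forces the other "
        "to retreat? Answer with exactly one word: 'left' or 'right'."
    )
    predictions = []
    for clip in clips:
        answer = query_mllm(clip, prompt).strip().lower()
        predictions.append("left" if "left" in answer else "right")
    return predictions
```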

Core claim

Fine-tuned multimodal large language models perform zero-shot inference on unseen mouse behavioral videos to predict social dominance, producing rankings that agree closely with those obtained from conventional tube tests; this demonstrates that foundation models can be applied to ethology without the design of domain-specific models.

What carries the argument

MTT-Bench, the benchmark of annotated pairwise mouse interaction videos used to fine-tune MLLMs for zero-shot dominance prediction from raw footage.

If this is right

  • Dominance can be predicted without supplying explicit labels at test time.
  • Foundation models become usable for social-behavior analysis in animals without custom engineering per domain.
  • Automated ranking of hierarchies becomes feasible on larger video collections than manual scoring allows (see the ranking-aggregation sketch after this list).
  • The same pipeline can be applied to other pairwise interaction tasks once additional annotated video sets exist.
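A minimal sketch of the aggregation step referenced above, assuming pairwise outcomes are tallied in a win matrix. Ranking by mean win proportion is one conventional choice; David's score and Elo ratings are common alternatives, and the paper does not state which aggregation it uses.

```python
import numpy as np

def dominance_order(wins: np.ndarray) -> np.ndarray:
    """wins[i, j] = bouts in which mouse i displaced mouse j.
    Returns mouse indices sorted from most to least dominant."""
    bouts = wins + wins.T
    # Win proportion per pair, defined only where the pair actually met.
    win_rate = np.divide(wins, bouts,
                         out=np.zeros(wins.shape), where=bouts > 0)
    faced = np.maximum((bouts > 0).sum(axis=1), 1)
    score = win_rate.sum(axis=1) / faced  # mean win share vs. opponents faced
    return np.argsort(-score)

# Toy example: mouse 0 beat both cage-mates; mouse 1 mostly beat mouse 2.
wins = np.array([[0, 3, 3],
                 [0, 0, 2],
                 [0, 1, 0]])
print(dominance_order(wins))  # -> [0 1 2]
```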

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar video-based fine-tuning might be tried on other species or on different social behaviors such as mating or aggression.
  • If the models truly capture dominance rather than low-level motion statistics, they could be tested on videos recorded under varied lighting or camera angles.
  • The approach could reduce the total annotation burden if a single fine-tuned model transfers across multiple behavioral assays.

Load-bearing premise

Fine-tuned multimodal LLMs extract genuine dominance signals from limited mouse videos rather than latching onto spurious visual cues that fail to generalize.
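One editorial way to probe this premise (not proposed in the paper): mirror each clip horizontally and check that the model's left/right call swaps accordingly. A model reading behavior should be flip-equivariant; one anchored to side-of-frame cues will not be. `predict` and `mirror_clip` are hypothetical helpers.

```python
# Editorial probe, not from the paper: a behavior-grounded model should
# swap its left/right call when a clip is mirrored horizontally.
# `predict` and `mirror_clip` are hypothetical helpers.
def flip_consistency(clips: list[str], predict, mirror_clip) -> float:
    """Fraction of clips whose prediction swaps correctly under mirroring."""
    swap = {"left": "right", "right": "left"}
    hits = sum(predict(mirror_clip(c)) == swap[predict(c)] for c in clips)
    return hits / len(clips)
```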

What would settle it

A fresh collection of tube-test videos on which the model's predicted dominance order shows low agreement with both repeated tube-test outcomes and independent human observers.
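For concreteness, the agreement statistics such a test would report (and which the referee asks for below) could be computed as in this sketch; all values shown are placeholders, not results from the paper.

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Placeholder ranks (lower = more dominant); not values from the paper.
model_rank = [1, 2, 3, 4]
tube_rank = [1, 3, 2, 4]
rho, p = spearmanr(model_rank, tube_rank)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# Kappa applies to per-bout categorical calls (e.g. predicted winner),
# not to ranks; placeholder bout-level labels shown here.
model_calls = ["left", "left", "right", "right", "left"]
observer_calls = ["left", "right", "right", "right", "left"]
print("Cohen's kappa =", cohen_kappa_score(model_calls, observer_calls))
```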

Figures

Figures reproduced from arXiv: 2604.22492 by Haoyu Chen, Yunquan Chen.

Figure 1: Overview of the process of generating the Evaluated Dataset.
Figure 2: The operation process of the two prediction methods.
Original abstract

Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MTT-Bench, a benchmark of annotated videos capturing pairwise mouse interactions in the Mouse Tube Test. It fine-tunes multimodal large language models (MLLMs) on this data to enable zero-shot prediction of social dominance hierarchies directly from raw behavioral videos on unseen sequences. The central claim is that the resulting framework achieves high agreement with traditional tube test rankings, offering a general-purpose foundation-model approach to ethology that avoids the need for hand-crafted domain-specific models.

Significance. If the quantitative claims hold under proper validation, the work could meaningfully advance the application of general-purpose MLLMs to animal behavior analysis in neuroscience, potentially streamlining dominance hierarchy inference without specialized feature engineering. The introduction of a public benchmark is a constructive step, but the absence of any reported metrics, controls, or baselines currently prevents determining whether the approach captures genuine behavioral signals or spurious video cues.

major comments (2)
  1. [Abstract] The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.
  2. [Framework] No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one concrete performance number and the size of MTT-Bench to allow readers to gauge the scale of the reported agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that both the abstract and framework sections require additional quantitative and technical details to strengthen the central claims. We will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.

    Authors: We agree that the abstract should include quantitative support to allow immediate evaluation of the claims. The full manuscript reports these metrics in the Results section (accuracy, Spearman rank correlation, Cohen's kappa, sample sizes for train/test splits, and statistical tests). We will revise the abstract to explicitly state key values such as the agreement rate with tube-test rankings, the number of videos evaluated, and the train/test protocol. This change will directly address concerns about potential confounders by highlighting the controlled evaluation setup. revision: yes

  2. Referee: [Framework] No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.

    Authors: We acknowledge the need for these technical specifics. The revised manuscript will expand the Methods section to detail the fine-tuning objective (pairwise ranking via contrastive alignment), the loss function, output decoding (LLM-based pairwise comparison), the exact MLLM backbones used, and ablation studies that control for non-behavioral cues such as body size and lighting. These additions will demonstrate that performance derives from behavioral signals rather than dataset shortcuts. revision: yes
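Since the rebuttal above is simulated, the objective it names is unverified. A common pairwise-ranking loss consistent with that description is the Bradley-Terry style logistic loss over per-mouse dominance scores, sketched here as an assumption, not the paper's method.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(s_winner: torch.Tensor,
                          s_loser: torch.Tensor) -> torch.Tensor:
    # softplus(l - w) == -log sigmoid(w - l): the Bradley-Terry negative
    # log-likelihood that the observed winner outranks the loser.
    return F.softplus(s_loser - s_winner).mean()

# Toy usage with random scores standing in for model outputs per bout.
scores_w = torch.randn(8, requires_grad=True)
scores_l = torch.randn(8)
loss = pairwise_ranking_loss(scores_w, scores_l)
loss.backward()
```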

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with no derivations

Full rationale

The paper introduces a new benchmark (MTT-Bench) of annotated mouse videos and fine-tunes standard MLLM architectures to predict dominance rankings, reporting empirical agreement with tube-test outcomes. No equations, theoretical derivations, or first-principles results exist that could reduce to their own inputs by construction. The central claim is an empirical performance statement on held-out sequences, not a self-referential fit or renamed pattern. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is self-contained as a standard ML benchmark + fine-tuning study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on model hyperparameters, training assumptions, or data characteristics, so the ledger cannot be populated beyond the generic premise that fine-tuning MLLMs on video data can yield behavioral predictions.

pith-pipeline@v0.9.0 · 5414 in / 1128 out tokens · 65419 ms · 2026-05-08T09:09:35.778088+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 23 canonical work pages · 4 internal anchors

  1. [1]

    The mouse ascending: perspectives for human-disease models

    Rosenthal N, Brown S. The mouse ascending: perspectives for human-disease models. Nat Cell Biol. 2007;9(9):993-999. doi:10.1038/ncb437

  2. [2]

    The Applicability of Mouse Models to the Study of Human Disease

    Rydell-Törmänen K, Johnson JR. The Applicability of Mouse Models to the Study of Human Disease. Methods Mol Biol. 2019;1940:3-22. https://doi.org/10.1007/978-1-4939-9086-3_1

  3. [3]

    Datta SR, Anderson DJ, Branson K, Perona P, Leifer A. (2019). Computational neuroethology: a call to action. Neuron, 104(1), 11-24. https://doi.org/10.1016/j.neuron.2019.09.038

  4. [4]

    Machine Learning in Modeling of Mouse Behavior

    Gharagozloo M, Amrani A, Wittingstall K, Hamilton-Wright A, Gris D. Machine Learning in Modeling of Mouse Behavior. Front Neurosci. 2021;15:700253. Published 2021 Sep 14. doi:10.3389/fnins.2021.700253

  5. [5]

    Behavioral tests for the assessment of social hierarchy in mice

    Zheng H, Chen D, Zhong Z, et al. Behavioral tests for the assessment of social hierarchy in mice. Front Behav Neurosci. 2025;19:1549666. Published 2025 Mar 5. doi:10.3389/fnbeh.2025.1549666

  6. [6]

    Fetcho, R.N., Hall, B.S., Estrin, D.J. et al. Regulation of social interaction in mice by a frontostriatal circuit modulated by established hierarchical relationships. Nat Commun 14, 2487 (2023). https://doi.org/10.1038/s41467-023-37460-6

  7. [7]

    Quantifying behavior to understand the brain

    Pereira, T. D., Shaevitz, J. W., & Murthy, M. (2020). Quantifying behavior to understand the brain. Nature neuroscience, 23(12), 1537–1549. https://doi.org/10.1038/s41593-020-00734-z

  8. [8]

    Pereira, T.D., Tabris, N., Matsliah, A. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nat Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1

  9. [9]

    Ye, S., Filippova, A., Lauer, J. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Nat Commun 15, 5165 (2024). https://doi.org/10.1038/s41467-024-48792-2

  10. [10]

    B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors

    A. Hsu and E. Yttri, “B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors,” Nature Communications, vol. 12, no. 1, Aug. 2021, Art. no. 5188, doi: 10.1038/s41467-021-25420-x

  11. [11]

    Weinreb, C., Pearl, J., Lin, S., Osman, M. A. M., Zhang, L., Annapragada, S., Conlin, E., Hoffman, R., Makowska, S., Gillis, W. F., Jay, M., Ye, S., Mathis, A., Mathis, M. W., Pereira, T., Linderman, S. W., & Datta, S. R. (2023). Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. bioRxiv : the preprint server for biology, 2023....

  12. [12]

    Assessing Social Dominance in Mouse Models Using the Tube Test

    Fox, S. N., Kashyap, S. N., Murchison, C. F., Arrant, A. E., & Roberson, E. D. (2025). Assessing Social Dominance in Mouse Models Using the Tube Test. Journal of visualized experiments : JoVE, (220), 10.3791/67919. https://doi.org/10.3791/67919

  13. [13]

    Cross-Skeleton Interaction Graph Aggregation Network for Representation Learning of Mouse Social Behaviour

    Feixiang Zhou, Xinyu Yang, Fang Chen, Long Chen, Zhenheng Jiang, Hu Zhu, Reiko Heckel, Haikuan Wang, Minrui Fei, and Huiyu Zhou, “Cross-Skeleton Interaction Graph Aggregation Network for Representation Learning of Mouse Social Behaviour,” arXiv preprint arXiv:2208.03819, Jan. 2025. https://doi.org/10.48550/arXiv.2208.03819

  14. [14]

    MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis

    T. Xu, T. Zhou, Y. Wang, et al., “MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis,” arXiv preprint arXiv:2503.10212, Mar. 2025. https://doi.org/10.48550/arXiv.2503.10212

  15. [15]

    Varholick, J.A., Pontiggia, A., Murphy, E. et al. Social dominance hierarchy type and rank contribute to phenotypic variation within cages of laboratory mice. Sci Rep 9, 13650 (2019). https://doi.org/10.1038/s41598-019-49612-0

  16. [16]

    Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, “Time Series Data Augmentation for Deep Learning: A Survey,” in Proc. 30th Int. Joint Conf. Artif. Intell. (IJCAI), Montreal, Canada, 2021, pp. 4653-4660, doi: 10.24963/ijcai.2021/631

  17. [17]

    C. Cao, F. Zhou, Y. Dai, J. Wang, and K. Zhang, “A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability,” arXiv preprint arXiv:2212.10888, Jun. 2024. https://arxiv.org/abs/2212.10888

  18. [18]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 139, pp. 8748–8763, 2021

  19. [20]

    VideoCoCa: Video captioning with contrastive learning

    Z. Yang, P.-Y. Huang, Y. Wang, X. Li, B. Zoph, Q. Le, and Y. Wu, “VideoCoCa: Video captioning with contrastive learning,” arXiv preprint arXiv:2205.11074, 2022

  20. [21]

    Some methods for classification and analysis of multivariate observations

    J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 281–297, 1967

  21. [22]

    Unsupervised learning of visual features by contrasting cluster assignments

    M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Proc. NeurIPS, 2020

  22. [24]

    Time-Contrastive Networks: Self-Supervised Learning from Video

    P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-Contrastive Networks: Self-Supervised Learning from Video,” arXiv preprint arXiv:1704.06888, 2018. https://arxiv.org/abs/1704.06888

  23. [25]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” arXiv preprint arXiv:2301.12597, 2023

  24. [26]

    Flamingo: a Visual Language Model for Few-Shot Learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, et al., “Flamingo: A Visual Language Model for Few-Shot Learning,” arXiv preprint arXiv:2204.14198, 2022

  25. [27]

    InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns

    C. Xu, X. Lin, W. Zhang, et al., “InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns,” arXiv preprint arXiv:2305.11967, 2023

  26. [28]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt4o, 2024

  27. [29]

    InternVL3-1B

    Shanghai AI Laboratory. InternVL3-1B. https://opencompass.org.cn/internvl3-1b, 2024

  28. [30]

    InternVL3-2B

    Shanghai AI Laboratory. InternVL3-2B. https://opencompass.org.cn/internvl3-2b, 2024

  29. [31]

    InternVL3-8B

    Shanghai AI Laboratory. InternVL3-8B. https://opencompass.org.cn/internvl3-8b, 2024

  30. [32]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  31. [33]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023. Technical report

  32. [34]

    OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

    Y. Li, J. Niu, Z. Miao, C. Ge, Y. Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian, P. Zhang, Y. Zang, Y. Cao, C. He, and J. Wang, “OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?,” arXiv preprint arXiv:2501.05510, 2025. https://arxiv.org/abs/2501.05510