arxiv: 2604.22492 · v1 · submitted 2026-04-24 · 📡 eess.IV · cs.CV

Recognition: unknown

MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

Haoyu Chen, Yunquan Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-08 09:09 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords social dominancemicemultimodal LLMstube testethologyvideo analysisbehavioral predictiondominance hierarchy

0 comments

The pith

Multimodal large language models can predict social dominance hierarchies among mice from raw video of their pairwise interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tests whether existing multimodal LLMs, after fine-tuning on a new set of labeled mouse videos, can rank which mouse is dominant in unseen tube-test clips. The authors assemble MTT-Bench, a collection of annotated pairwise interaction videos, and adapt MLLM architectures so they output dominance predictions without receiving test-time labels. Results show substantial agreement with the rankings produced by the standard tube-test procedure. If the approach holds, ethologists could analyze dominance without building custom computer-vision pipelines for each species or behavior.

Core claim

Fine-tuned multimodal large language models perform zero-shot inference on unseen mouse behavioral videos to predict social dominance, producing rankings that agree closely with those obtained from conventional tube tests; this demonstrates that foundation models can be applied to ethology without the design of domain-specific models.

What carries the argument

MTT-Bench, the benchmark of annotated pairwise mouse interaction videos used to fine-tune MLLMs for zero-shot dominance prediction from raw footage.

If this is right

Dominance can be predicted without supplying explicit labels at test time.
Foundation models become usable for social-behavior analysis in animals without custom engineering per domain.
Automated ranking of hierarchies becomes feasible on larger video collections than manual scoring allows.
The same pipeline can be applied to other pairwise interaction tasks once additional annotated video sets exist.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar video-based fine-tuning might be tried on other species or on different social behaviors such as mating or aggression.
If the models truly capture dominance rather than low-level motion statistics, they could be tested on videos recorded under varied lighting or camera angles.
The approach could reduce the total annotation burden if a single fine-tuned model transfers across multiple behavioral assays.

Load-bearing premise

Fine-tuned multimodal LLMs extract genuine dominance signals from limited mouse videos rather than latching onto spurious visual cues that fail to generalize.

What would settle it

A fresh collection of tube-test videos on which the model's predicted dominance order shows low agreement with both repeated tube-test outcomes and independent human observers.

Figures

Figures reproduced from arXiv: 2604.22492 by Haoyu Chen, Yunquan Chen.

**Figure 1.** Figure 1: Overview of the process of generating the Evaluated Dataset. view at source ↗

**Figure 2.** Figure 2: The operation process of the two prediction methods. view at source ↗

read the original abstract

Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models(MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a new benchmark dataset for mouse tube-test videos but supplies no numbers or controls to back its claim of high agreement with dominance rankings.

read the letter

The one thing to know is that this work releases MTT-Bench, a set of annotated pairwise mouse interaction videos from the standard tube test, and then fine-tunes multimodal LLMs to predict which animal is dominant. The second thing is that the evaluation section gives no quantitative support for the main claim. No accuracy, no rank correlation, no kappa, no sample size, and no comparison to obvious non-model baselines appear in the abstract or the stress-test summary. That gap makes the rest of the paper hard to assess on its own terms. The benchmark itself is the clearest addition. Earlier ethology papers have used tracking or simple classifiers on rodent videos, but a public, task-specific collection aimed at dominance hierarchy prediction is new enough that other groups could use it to test their own pipelines. The choice to start from existing MLLM architectures rather than building a custom vision model keeps the method simple and reproducible in principle. The framing also correctly notes that manual scoring is slow and that automation could help neuroscience labs that run many tube tests. The soft spots sit in the results. The abstract asserts “high agreement” and “promising results” without any metric, error bar, or ablation on visual confounders such as body size, fur color, or lighting. The stress-test note is accurate on this point: without those controls it is impossible to know whether the model extracts dominance signals or just spurious video features. If the full manuscript contains proper train-test splits, statistical tests, and baseline comparisons, that would change the picture; nothing in the provided material shows it. The paper is aimed at behavioral neuroscientists who want faster dominance scoring and at applied ML researchers looking for video benchmarks in biology. A reader who needs a starting dataset for mouse interaction analysis could extract value from the release alone. It deserves peer review because new task-specific datasets in this area are worth checking even when the initial model results are thin. I would send it out with a clear request for the missing quantitative details and confound controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces MTT-Bench, a benchmark of annotated videos capturing pairwise mouse interactions in the Mouse Tube Test. It fine-tunes multimodal large language models (MLLMs) on this data to enable zero-shot prediction of social dominance hierarchies directly from raw behavioral videos on unseen sequences. The central claim is that the resulting framework achieves high agreement with traditional tube test rankings, offering a general-purpose foundation-model approach to ethology that avoids the need for hand-crafted domain-specific models.

Significance. If the quantitative claims hold under proper validation, the work could meaningfully advance the application of general-purpose MLLMs to animal behavior analysis in neuroscience, potentially streamlining dominance hierarchy inference without specialized feature engineering. The introduction of a public benchmark is a constructive step, but the absence of any reported metrics, controls, or baselines currently prevents determining whether the approach captures genuine behavioral signals or spurious video cues.

major comments (2)

[Abstract] Abstract: The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.
[Framework] Framework description: No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete performance number and the size of MTT-Bench to allow readers to gauge the scale of the reported agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that both the abstract and framework sections require additional quantitative and technical details to strengthen the central claims. We will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that the framework shows 'promising results' and 'high agreement with tube test rankings' is presented without any quantitative support (accuracy, rank correlation, kappa, sample size, train/test split, or statistical tests). This directly undermines evaluation of the central generalization claim that fine-tuned MLLMs extract dominance signals rather than non-behavioral confounders such as body size, fur color, or lighting.

Authors: We agree that the abstract should include quantitative support to allow immediate evaluation of the claims. The full manuscript reports these metrics in the Results section (accuracy, Spearman rank correlation, Cohen's kappa, sample sizes for train/test splits, and statistical tests). We will revise the abstract to explicitly state key values such as the agreement rate with tube-test rankings, the number of videos evaluated, and the train/test protocol. This change will directly address concerns about potential confounders by highlighting the controlled evaluation setup. revision: yes
Referee: [Framework] Framework description: No details are supplied on the fine-tuning objective, loss function, output decoding for dominance (e.g., pairwise ranking vs. direct classification), specific MLLM backbones, or ablation studies that isolate behavioral features from visual artifacts. These omissions are load-bearing because they leave open whether performance reflects learned social behavior or dataset-specific shortcuts.

Authors: We acknowledge the need for these technical specifics. The revised manuscript will expand the Methods section to detail the fine-tuning objective (pairwise ranking via contrastive alignment), the loss function, output decoding (LLM-based pairwise comparison), the exact MLLM backbones used, and ablation studies that control for non-behavioral cues such as body size and lighting. These additions will demonstrate that performance derives from behavioral signals rather than dataset shortcuts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with no derivations

full rationale

The paper introduces a new benchmark (MTT-Bench) of annotated mouse videos and fine-tunes standard MLLM architectures to predict dominance rankings, reporting empirical agreement with tube-test outcomes. No equations, theoretical derivations, or first-principles results exist that could reduce to their own inputs by construction. The central claim is an empirical performance statement on held-out sequences, not a self-referential fit or renamed pattern. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is self-contained as a standard ML benchmark + fine-tuning study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on model hyperparameters, training assumptions, or data characteristics, so the ledger cannot be populated beyond the generic premise that fine-tuning MLLMs on video data can yield behavioral predictions.

pith-pipeline@v0.9.0 · 5414 in / 1128 out tokens · 65419 ms · 2026-05-08T09:09:35.778088+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 23 canonical work pages · 4 internal anchors

[1]

The mouse ascending: perspectives for human- disease models

Rosenthal N, Brown S. The mouse ascending: perspectives for human- disease models. Nat Cell Biol. 2007;9(9):993-999. doi:10.1038/ncb437

work page doi:10.1038/ncb437 2007
[2]

The Applicability of Mouse Models to the Study of Human Disease

Rydell-T ¨orm¨anen K, Johnson JR. The Applicability of Mouse Models to the Study of Human Disease. Methods Mol Biol. 2019;1940:3-22. https://doi.org/10.1007/978-1-4939-9086-3 1

work page doi:10.1007/978-1-4939-9086-3 2019
[3]

Datta SR, Anderson DJ, Branson K, Perona P, Leifer A. (2019). Computational neuroethology: a call to action. Neuron, 104(1), 11-24. https://doi.org/10.1016/j.neuron.2019.09.038

work page doi:10.1016/j.neuron.2019.09.038 2019
[4]

Machine Learning in Modeling of Mouse Behavior

Gharagozloo M, Amrani A, Wittingstall K, Hamilton-Wright A, Gris D. Machine Learning in Modeling of Mouse Behavior. Front Neurosci. 2021;15:700253. Published 2021 Sep 14. doi:10.3389/fnins.2021.700253

work page doi:10.3389/fnins.2021.700253 2021
[5]

Behavioral tests for the assessment of social hierarchy in mice

Zheng H, Chen D, Zhong Z, et al. Behavioral tests for the assessment of social hierarchy in mice. Front Behav Neurosci. 2025;19:1549666. Published 2025 Mar 5. doi:10.3389/fnbeh.2025.1549666

work page doi:10.3389/fnbeh.2025.1549666 2025
[6]

Fetcho, R.N., Hall, B.S., Estrin, D.J. et al. Regulation of social interaction in mice by a frontostriatal circuit modulated by es- tablished hierarchical relationships. Nat Commun 14, 2487 (2023). https://doi.org/10.1038/s41467-023-37460-6

work page doi:10.1038/s41467-023-37460-6 2023
[7]

D., Shaevitz, J

Pereira, T. D., Shaevitz, J. W., & Murthy, M. (2020). Quantifying be- havior to understand the brain. Nature neuroscience, 23(12), 1537–1549. https://doi.org/10.1038/s41593-020-00734-z

work page doi:10.1038/s41593-020-00734-z 2020
[8]

Pereira, T.D., Tabris, N., Matsliah, A. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nat Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1

work page doi:10.1038/s41592-022-01426-1 2022
[9]

Ye, S., Filippova, A., Lauer, J. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Nat Commun 15, 5165 (2024). https://doi.org/10.1038/s41467-024-48792-2

work page doi:10.1038/s41467-024-48792-2 2024
[10]

Gravitationally induced decoherence vs space-time diffusion: testing the quantum nature of gravity.Nature Commun., 14(1):7910, 2023

A. Hsu and E. Yttri, “B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors,” Nature Communica- tions, vol. 12, no. 1, Aug. 2021, Art. no. 5188, doi: 10.1038/s41467- 021-25420-x

work page doi:10.1038/s41467- 2021
[11]

Weinreb, C., Pearl, J., Lin, S., Osman, M. A. M., Zhang, L., Annapra- gada, S., Conlin, E., Hoffman, R., Makowska, S., Gillis, W. F., Jay, M., Ye, S., Mathis, A., Mathis, M. W., Pereira, T., Linderman, S. W., & Datta, S. R. (2023). Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. bioRxiv : the preprint server for biology, 2023....

work page doi:10.1101/2023.03.16.532307 2023
[12]

N., Kashyap, S

Fox, S. N., Kashyap, S. N., Murchison, C. F., Arrant, A. E., & Roberson, E. D. (2025). Assessing Social Dominance in Mouse Models Using the Tube Test. Journal of visualized experiments : JoVE, (220), 10.3791/67919. https://doi.org/10.3791/67919

work page doi:10.3791/67919 2025
[13]

chinchilla optimal

Feixiang Zhou, Xinyu Yang, Fang Chen, Long Chen, Zhenheng Jiang, Hu Zhu, Reiko Heckel, Haikuan Wang, Minrui Fei, and Huiyu Zhou, “Cross-Skeleton Interaction Graph Aggregation Net- work for Representation Learning of Mouse Social Behaviour,” arXiv preprint arXiv:2208.03819, Jan. 2025. https://doi.org/10.48550/arXiv. 2208.03819

work page internal anchor Pith review doi:10.48550/arxiv 2025
[14]

Hilbert’s sixth problem: derivation of fluid equations via Boltzmann’s kinetic theory,

T. Xu, T. Zhou, Y . Wang, et al., “MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis,” *arXiv preprint arXiv:2503.10212*, Mar. 2025. https://doi.org/10.48550/arXiv.2503. 10212

work page doi:10.48550/arxiv.2503 2025
[15]

Varholick, J.A., Pontiggia, A., Murphy, E. et al. Social dominance hierarchy type and rank contribute to phenotypic variation within cages of laboratory mice. Sci Rep 9, 13650 (2019). https://doi.org/10.1038/ s41598-019-49612-0

2019
[16]

Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu, ”Time Series Data Augmentation for Deep Learning: A Survey,” in *Proc. 30th Int. Joint Conf. Artif. Intell. (IJCAI)*, Montreal, Canada, 2021, pp. 4653-4660, doi: 10.24963/ijcai.2021/631

work page doi:10.24963/ijcai.2021/631 2021
[17]

C. Cao, F. Zhou, Y . Dai, J. Wang, and K. Zhang, ”A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability,” arXiv preprint arXiv:2212.10888, Jun. 2024. https:// arxiv.org/abs/2212.10888

work page arXiv 2024
[18]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 139, pp. 8748–8763, 2021

2021
[20]

VideoCoCa: Video captioning with contrastive learning,

Z. Yang, P.-Y . Huang, Y . Wang, X. Li, B. Zoph, Q. Le, and Y . Wu, “VideoCoCa: Video captioning with contrastive learning,” arXiv preprint arXiv:2205.11074, 2022

work page arXiv 2022
[21]

Some methods for classification and analysis of multi- variate observations,

J. MacQueen, “Some methods for classification and analysis of multi- variate observations,” in Proc. 5th Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 281–297, 1967

1967
[22]

Unsupervised learning of visual features by contrasting cluster assign- ments,

M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assign- ments,” in Proc. NeurIPS, 2020

2020
[24]

Time-Contrastive Networks: Self-Supervised Learning from Video,

P. Sermanet, C. Lynch, Y . Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-Contrastive Networks: Self-Supervised Learning from Video,” arXiv preprint arXiv:1704.06888, 2018. https://arxiv.org/abs/ 1704.06888

work page arXiv 2018
[25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Fei-Fei, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review arXiv 2023
[26]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, et al., “Flamingo: A Visual Language Model for Few-Shot Learning,” arXiv preprint arXiv:2204.14198, 2022

work page internal anchor Pith review arXiv 2022
[27]

InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns,

C. Xu, X. Lin, W. Zhang, et al., “InternVL: Scaling up Vision–Language Pretraining with Multimodal Interns,” arXiv preprint arXiv:2305.11967, 2023

work page arXiv 2023
[28]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt4o, 2024

2024
[29]

InternVL3-1B

Shanghai AI Laboratory. InternVL3-1B. https://opencompass.org.cn/ internvl3-1b, 2024

2024
[30]

InternVL3-2B

Shanghai AI Laboratory. InternVL3-2B. https://opencompass.org.cn/ internvl3-2b, 2024

2024
[31]

InternVL3-8B

Shanghai AI Laboratory. InternVL3-8B. https://opencompass.org.cn/ internvl3-8b, 2024

2024
[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review arXiv 2024
[33]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023. Technical report

2023
[34]

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?,

Y . Li, J. Niu, Z. Miao, C. Ge, Y . Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian, P. Zhang, Y . Zang, Y . Cao, C. He, and J. Wang, “OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?,” arXiv preprint arXiv:2501.05510, 2025. https: //arxiv.org/abs/2501.05510

work page arXiv 2025