Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic
Pith reviewed 2026-06-26 21:18 UTC · model grok-4.3
The pith
Task arithmetic merges fine-tuned models to boost zero-shot open-vocabulary action recognition
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using model merging and task arithmetic, the authors extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. In out-of-distribution settings, the resulting merged model generalizes zero-shot better than the pre-trained base model.
What carries the argument
Task vectors. Each task vector is the difference between the weights of a fine-tuned model and the weights of the base model; the vectors are then added together to merge capabilities from multiple tasks.
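The definition above can be illustrated with a minimal sketch. It assumes weights are stored as plain dictionaries of flattened float lists; the names `task_vector`, `merge`, and the `scale` coefficient are illustrative and not taken from the paper, whose exact merging recipe is not specified on this page.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened weight values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference: fine-tuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, task_vectors: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled sum of all task vectors to the base weights.

    `scale` is an assumed hyperparameter, not a value reported by the paper.
    """
    merged: Weights = {}
    for name, values in base.items():
        summed = [
            sum(tv[name][i] for tv in task_vectors) for i in range(len(values))
        ]
        merged[name] = [b + scale * s for b, s in zip(values, summed)]
    return merged


# Toy example with one two-element parameter and two fine-tuned models.
base = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}
finetuned_b = {"w": [1.0, 1.0]}

vectors = [task_vector(finetuned_a, base), task_vector(finetuned_b, base)]
merged = merge(base, vectors, scale=1.0)
print(merged)
```

With real checkpoints the same arithmetic would be applied tensor by tensor; the plain lists here only keep the sketch free of dependencies.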
Load-bearing premise
Task vectors from models fine-tuned on different datasets can be linearly combined to produce a model that generalizes robustly to new actions and domains.
What would settle it
If the merged model fails to exceed the base model's action recognition accuracy on a held-out out-of-distribution benchmark, the central claim would be falsified.
Original abstract
Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain training and recombines knowledge from existing datasets and models. Leveraging model merging and task arithmetic, we extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. We show that, in out-of-distribution settings, the resulting merged model achieves superior zero-shot generalization to the pre-trained base model. Code is available at https://github.com/omaymaMoussadek/robust-ovar
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using task arithmetic to extract and recombine task vectors from models fine-tuned on multiple public open-vocabulary action recognition (OVAR) datasets, producing a merged model that, in out-of-distribution settings, achieves better zero-shot generalization than the pre-trained base model without any target-domain adaptation or training.
Significance. If the central empirical claim holds under properly controlled OOD conditions, the work would demonstrate a practical, training-free route to improving robustness in OVAR by leveraging existing public models and datasets, with direct relevance to privacy-sensitive applications. Public code release is a clear strength for reproducibility.
major comments (1)
- [Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.
minor comments (1)
- [Abstract] Abstract: quantitative metrics, dataset names, baseline comparisons, and error bars are absent; these details should be summarized even at the abstract level for a methods paper.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for explicit verification of the zero-shot OOD protocol. We address the concern directly below.
Point-by-point responses
- Referee: [Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.
  Authors: We agree that explicit verification is required to rigorously support the zero-shot claim. In the revised manuscript we will add a new subsection (and accompanying table) in the Experiments section that enumerates all action classes and visual domains appearing in the source fine-tuning datasets and confirms their complete disjointness from the held-out OOD test sets. Our internal analysis already establishes this disjointness; the added material will make the check transparent and reproducible.
  Revision: yes
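The class-level part of the disjointness check discussed above could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the dataset names and labels are hypothetical placeholders, and `normalize` and `overlapping_classes` are helper names introduced here. Exact matching after normalization only catches lexical overlap; synonymous labels and shared visual domains would still need a separate review.

```python
from typing import Dict, Iterable, Set


def normalize(label: str) -> str:
    """Lowercase and unify separators so 'Push_Up' and 'push up' compare equal."""
    cleaned = label.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())


def overlapping_classes(
    source_labels: Dict[str, Iterable[str]], test_labels: Iterable[str]
) -> Dict[str, Set[str]]:
    """Map each source dataset to the normalized labels it shares with the test set.

    Datasets with no shared labels are omitted, so an empty result means
    no lexical overlap was found.
    """
    test = {normalize(label) for label in test_labels}
    report: Dict[str, Set[str]] = {}
    for name, labels in source_labels.items():
        shared = {normalize(label) for label in labels} & test
        if shared:
            report[name] = shared
    return report


# Hypothetical label lists; real ones would come from each dataset's class file.
sources = {
    "source_a": ["Push_Up", "juggling balls"],
    "source_b": ["Archery"],
}
held_out = ["push up", "fencing"]

leaks = overlapping_classes(sources, held_out)
print(leaks)
```

A per-dataset report like this maps naturally onto the table the authors promise, with one row per source dataset and the shared labels, if any, listed alongside.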
Circularity Check
No circularity: empirical merging on external datasets
Full rationale
The paper describes an empirical procedure that extracts task vectors from models fine-tuned on public OVAR datasets and merges them via task arithmetic to produce a model evaluated on out-of-distribution test sets. No equations, predictions, or first-principles claims are presented that reduce by construction to quantities defined or fitted within the paper itself. The central result is a comparative empirical performance claim against a pre-trained base model, supported by external data and benchmarks rather than self-referential definitions or self-citation chains that carry the load-bearing argument.