Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic

Alessandro Banzatti; Angelo Porrello; Federico Venturini; Francesca Morandi; Francesco Cannarile; Mauro Suardi; Omayma Moussadek; Simone Calderara

arxiv: 2606.20734 · v1 · pith:O5QFBFANnew · submitted 2026-06-17 · 💻 cs.CV · cs.AI

Francesca Morandi, Omayma Moussadek, Federico Venturini, Mauro Suardi, Alessandro Banzatti, Francesco Cannarile, Angelo Porrello, Simone Calderara

Pith reviewed 2026-06-26 21:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords open-vocabulary action recognition · task arithmetic · model merging · zero-shot generalization · out-of-distribution · vision-language models · action recognition

The pith

Task arithmetic merges fine-tuned models to boost zero-shot open-vocabulary action recognition

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that combining task vectors extracted from models fine-tuned on various public open-vocabulary action recognition datasets produces a merged model with improved performance. In out-of-distribution settings, this merged model generalizes better to novel actions and domains than the original pre-trained model. The method eliminates the need for fine-tuning on the target domain, which is often expensive and raises privacy issues. Readers would care because it provides a practical way to achieve robust generalization using only existing models and datasets.

Core claim

Using model merging and task arithmetic, the authors extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. In out-of-distribution settings, the resulting merged model generalizes zero-shot better than the pre-trained base model.

What carries the argument

Task vectors. Each one is the difference between the weights of a fine-tuned model and those of the base model; the vectors are then added together to merge capabilities from multiple tasks.
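The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Weights` type, the function names, the toy values, and the scaling coefficient `alpha` are all assumptions introduced here (a scaling coefficient is common in task-arithmetic practice, but the abstract does not say whether the paper uses one).

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
# (Hypothetical type, introduced only for this sketch.)
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: fine-tuned weights minus base weights, per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, vectors: List[Weights], alpha: float = 1.0) -> Weights:
    """Add the sum of task vectors, scaled by alpha, back onto the base weights."""
    merged: Weights = {}
    for name, values in base.items():
        # Element-wise sum of all task vectors for this parameter.
        total = [sum(v[name][i] for v in vectors) for i in range(len(values))]
        merged[name] = [b + alpha * t for b, t in zip(values, total)]
    return merged


# Made-up weights for one base model and two fine-tuned copies of it.
base = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}
ft_b = {"w": [1.0, 3.0]}

vectors = [task_vector(ft_a, base), task_vector(ft_b, base)]
merged = merge(base, vectors, alpha=0.5)
print(merged)
```

All fine-tuned models must share the base model's architecture and parameter names for the subtraction to be meaningful, which is why the sketch iterates over the keys of `base`.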

Load-bearing premise

Task vectors from models fine-tuned on different datasets can be linearly combined to produce a model that generalizes robustly to new actions and domains.
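Written symbolically, the premise amounts to the following. The notation is assumed here for illustration and follows common task-arithmetic usage; the abstract gives no formula, and the per-task coefficients $\lambda_i$ in particular are a hypothetical generalization (plain addition corresponds to $\lambda_i = 1$ for every $i$).

```latex
\tau_i = \theta_i - \theta_0, \qquad
\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{N} \lambda_i \, \tau_i
```

Here $\theta_0$ denotes the pre-trained base weights, $\theta_i$ the weights after fine-tuning on the $i$-th source dataset, and $N$ the number of source models.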

What would settle it

If the merged model fails to outperform the base model in action-recognition accuracy on a held-out out-of-distribution benchmark, the central claim would be falsified.
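A minimal sketch of that comparison, assuming top-1 accuracy as the metric (the abstract does not name one). The labels and predictions are made up for illustration, and `top1_accuracy` is a hypothetical helper, not a function from the paper's code release.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of clips whose predicted action matches the ground-truth label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy held-out OOD benchmark with four clips (fabricated for illustration only).
labels = ["run", "jump", "wave", "kick"]
base_preds = ["run", "wave", "wave", "jump"]
merged_preds = ["run", "jump", "wave", "jump"]

base_acc = top1_accuracy(base_preds, labels)
merged_acc = top1_accuracy(merged_preds, labels)

# The central claim survives this benchmark only if the merged model is strictly better.
claim_survives = merged_acc > base_acc
print(base_acc, merged_acc, claim_survives)
```

A real test would also need enough clips, and ideally repeated runs, for the accuracy gap to be distinguishable from noise.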

Figures

Figures reproduced from arXiv: 2606.20734 by Alessandro Banzatti, Angelo Porrello, Federico Venturini, Francesca Morandi, Francesco Cannarile, Mauro Suardi, Omayma Moussadek, Simone Calderara.

**Figure 2.** Experimental protocol for zero-shot evaluation. Source models are fine-tuned on individual datasets

**Figure 3.** For each dataset, we report its out-of-distribution (OOD) shift relative…

**Figure 4.** Target accuracy versus number of fused source models, showing how performance scales as more task vectors are aggregated.

Original abstract

Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain training and recombines knowledge from existing datasets and models. Leveraging model merging and task arithmetic, we extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. We show that, in out-of-distribution settings, the resulting merged model achieves superior zero-shot generalization to the pre-trained base model. Code is available at https://github.com/omaymaMoussadek/robust-ovar

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Task arithmetic on OVAR fine-tunes gives a training-free route to better zero-shot OOD results, but the evaluation needs explicit checks that test actions and domains are absent from the source datasets.

The paper takes task vectors from models fine-tuned on several public OVAR datasets, adds them together, and reports that the merged model beats the base model on out-of-distribution zero-shot tests. That is the central move.

It does one thing cleanly: it shows a practical way to recombine existing fine-tunes without touching target data, which matters when fine-tuning is expensive or restricted. The code release lets others reproduce the merge directly, and the method stays within standard task-arithmetic practice rather than inventing new machinery.

The soft spot is the OOD claim. The stress-test concern holds: if any test actions or visual domains appear in the source fine-tuning sets, the gains could reflect partial leakage instead of arithmetic-enabled generalization. The abstract gives no numbers, no dataset lists, and no mention of class-level or domain disjointness checks. The full paper must supply those controls plus the actual metrics and baselines; without them the result stays provisional.

The work is for people already working on model merging or zero-shot video tasks who want a simple recipe they can try on their own data. It shows clear, non-circular thinking on the method side and does not rely on fitted values or internal contradictions.

I would bring it to a reading group to walk through the merge procedure and the dataset details. I would not cite it yet. It deserves peer review because the idea is testable, the code is public, and a referee can ask for the missing overlap checks and numbers in one round.

Referee Report

1 major / 1 minor

Summary. The paper proposes using task arithmetic to extract and recombine task vectors from models fine-tuned on multiple public open-vocabulary action recognition (OVAR) datasets, producing a merged model that, in out-of-distribution settings, achieves better zero-shot generalization than the pre-trained base model without any target-domain adaptation or training.

Significance. If the central empirical claim holds under properly controlled OOD conditions, the work would demonstrate a practical, training-free route to improving robustness in OVAR by leveraging existing public models and datasets, with direct relevance to privacy-sensitive applications. Public code release is a clear strength for reproducibility.

major comments (1)

[Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.
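The class-level part of the requested check could be sketched as follows. This is an illustration under stated assumptions: the dataset names and labels are placeholders, `normalize` and `class_overlap` are hypothetical helpers, and exact string matching after normalization will miss synonyms and paraphrased action names. It does not address visual-domain overlap, which needs a separate analysis.

```python
from typing import Dict, Set


def normalize(label: str) -> str:
    """Lowercase and unify separators so 'Push_Up' and 'push up' compare equal."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def class_overlap(
    source_classes: Dict[str, Set[str]], target_classes: Set[str]
) -> Dict[str, Set[str]]:
    """Return, per source dataset, the normalized labels it shares with the target set."""
    target = {normalize(c) for c in target_classes}
    overlap: Dict[str, Set[str]] = {}
    for name, classes in source_classes.items():
        shared = {normalize(c) for c in classes} & target
        if shared:
            overlap[name] = shared
    return overlap


# Placeholder label sets, not taken from any real dataset.
sources = {
    "source_a": {"Push_Up", "playing guitar"},
    "source_b": {"brushing teeth"},
}
target = {"push up", "fighting"}

report = class_overlap(sources, target)
print(report)  # an empty dict would mean no exact label overlap was found
```

An empty report is necessary but not sufficient evidence of disjointness, since semantically equivalent classes can carry different names across datasets.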

minor comments (1)

[Abstract] Abstract: quantitative metrics, dataset names, baseline comparisons, and error bars are absent; these details should be summarized even at the abstract level for a methods paper.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for explicit verification of the zero-shot OOD protocol. We address the concern directly below.

Referee: [Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.

Authors: We agree that explicit verification is required to rigorously support the zero-shot claim. In the revised manuscript we will add a new subsection (and accompanying table) in the Experiments section that enumerates all action classes and visual domains appearing in the source fine-tuning datasets and confirms their complete disjointness from the held-out OOD test sets. Our internal analysis already establishes this disjointness; the added material will make the check transparent and reproducible.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical merging on external datasets

The paper describes an empirical procedure that extracts task vectors from models fine-tuned on public OVAR datasets and merges them via task arithmetic to produce a model evaluated on out-of-distribution test sets. No equations, predictions, or first-principles claims are presented that reduce by construction to quantities defined or fitted within the paper itself. The central result is a comparative empirical performance claim against a pre-trained base model, supported by external data and benchmarks rather than self-referential definitions or self-citation chains that carry the load-bearing argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The approach implicitly assumes standard task arithmetic operations and the utility of public OVAR datasets but does not introduce new ones.

pith-pipeline@v0.9.1-grok · 5683 in / 1153 out tokens · 28470 ms · 2026-06-26T21:18:21.968411+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 4 canonical work pages · 2 internal anchors

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.

[2] M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith et al., “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” in ICML, 2022.

[3] M. S. Matena and C. A. Raffel, “Merging models with Fisher-weighted averaging,” Advances in Neural Information Processing Systems, vol. 35, pp. 17703–17716, 2022.

[4] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” in ICLR, 2022.

[5] G. Ortiz-Jimenez, A. Favero, and P. Frossard, “Task arithmetic in the tangent space: Improved editing of pre-trained models,” NeurIPS, 2023.

[6] P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” Advances in Neural Information Processing Systems, vol. 36, pp. 7093–7115, 2023.

[7] K. Wang, N. Dimitriadis, G. Ortiz-Jimenez, F. Fleuret, and P. Frossard, “Localizing task information for improved model merging and compression,” arXiv preprint arXiv:2405.07813, 2024.

[8] A. A. Gargiulo, D. Crisostomi, M. S. Bucarelli, S. Scardapane, F. Silvestri, and E. Rodola, “Task singular vectors: Reducing task interference in model merging,” in CVPR, 2025.

[9] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A short note on the Kinetics-700 human action dataset,” arXiv preprint arXiv:1907.06987, 2019.

[10] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.

[11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in ICCV, 2011.

[12] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, “Not only look, but also listen: Learning multimodal violence detection under weak supervision,” in ECCV. Springer, 2020.

[13] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are Super Mario: Absorbing abilities from homologous models as a free lunch,” in Forty-first International Conference on Machine Learning, 2024.

[14] L. Lumetti, G. Capitani, E. Ficarra, S. Calderara, C. Grana, A. Porrello, and F. Bolelli, “U-Net transplant: the role of pre-training for model merging in 3D medical segmentation,” in MICCAI, 2025.

[15] D. Marczak, S. Magistri, S. Cygert, B. Twardowski, A. D. Bagdanov, and J. Van De Weijer, “No task left behind: Isotropic model merging with common and task-specific subspaces,” in ICML, 2025.

[16] Z. Weng, X. Yang, A. Li, Z. Wu, and Y.-G. Jiang, “Open-VCLIP: Transforming CLIP to an open-vocabulary video model via interpolated weight optimization,” in ICML, 2023.

[17] K. Yoshida, Y. Naraki, T. Horie, R. Yamaki, R. Shimizu, Y. Saito, J. McAuley, and H. Naganuma, “Mastering task arithmetic: τjp as a key indicator for weight disentanglement,” in ICLR, 2025.

[18] A. Porrello, P. Buzzega, F. Dangel, T. Sommariva, R. Salami, L. Bonicelli, and S. Calderara, “Dataless weight disentanglement in task arithmetic via Kronecker-factored approximate curvature,” in ICLR, 2026.

[19] A. Porrello, L. Bonicelli, P. Buzzega, M. Millunzi, S. Calderara, and R. Cucchiara, “A second-order perspective on model compositionality and incremental learning,” in ICLR, 2025.

[20] T. Sommariva, F. Morandi, S. Calderara, and A. Porrello, “Distilling linearized behavior into non-linear fine-tuning for effective task arithmetic,” in ICML, 2026.

[21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The Kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.

[22] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
