
arxiv: 2606.20734 · v1 · pith:O5QFBFANnew · submitted 2026-06-17 · 💻 cs.CV · cs.AI

Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic

Pith reviewed 2026-06-26 21:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open-vocabulary action recognition · task arithmetic · model merging · zero-shot generalization · out-of-distribution · vision-language models · action recognition

The pith

Task arithmetic merges fine-tuned models to boost zero-shot open-vocabulary action recognition

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that combining task vectors extracted from models fine-tuned on various public open-vocabulary action recognition datasets produces a merged model with improved performance. In out-of-distribution settings, this merged model generalizes better to novel actions and domains than the original pre-trained model. The method eliminates the need for fine-tuning on the target domain, which is often expensive and raises privacy issues. Readers would care because it provides a practical way to achieve robust generalization using only existing models and datasets.

Core claim

Task vectors are extracted from models fine-tuned on diverse public OVAR datasets and combined through model merging and task arithmetic. In out-of-distribution settings, the resulting merged model generalizes zero-shot better than the pre-trained base model.

What carries the argument

Task vectors. Each one is the difference between the weights of a fine-tuned model and the weights of the base model; the vectors are then added together to merge capabilities from multiple tasks.
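
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Weights` type, the function names `task_vector` and `merge`, the toy parameter values, and the scaling coefficient `alpha` are all assumptions introduced here (the source only says that task vectors are added together).

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: fine-tuned weights minus base weights, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base: Weights, vectors: List[Weights], alpha: float = 1.0) -> Weights:
    """Add the sum of task vectors, scaled by alpha, back onto the base weights.

    alpha is a hypothetical scaling coefficient; alpha=1.0 is plain addition.
    """
    merged: Weights = {}
    for name, base_vals in base.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base_vals))]
        merged[name] = [b + alpha * s for b, s in zip(base_vals, summed)]
    return merged


# Hypothetical weights: one base model and two models fine-tuned from it.
base = {"w": [1.0, 2.0], "b": [0.0]}
ft_a = {"w": [1.5, 2.0], "b": [0.25]}
ft_b = {"w": [1.0, 1.0], "b": [0.5]}

tau_a = task_vector(ft_a, base)  # {'w': [0.5, 0.0], 'b': [0.25]}
tau_b = task_vector(ft_b, base)  # {'w': [0.0, -1.0], 'b': [0.5]}
merged = merge(base, [tau_a, tau_b])
print(merged)  # {'w': [1.5, 1.0], 'b': [0.75]}
```

Both fine-tuned models must start from the same base weights for the subtraction to be meaningful, which matches the setting described in the Figure 1 caption.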

Load-bearing premise

Task vectors from models fine-tuned on different datasets can be linearly combined to produce a model that generalizes robustly to new actions and domains.

What would settle it

The central claim would be falsified if the merged model failed to exceed the base model's action recognition accuracy on a held-out out-of-distribution benchmark.

Figures

Figures reproduced from arXiv: 2606.20734 by Alessandro Banzatti, Angelo Porrello, Federico Venturini, Francesca Morandi, Francesco Cannarile, Mauro Suardi, Omayma Moussadek, Simone Calderara.

Figure 1: Given two models fine-tuned from the same weights, task addition [4]
Figure 2: Experimental protocol for zero-shot evaluation. Source models are fine-tuned on individual datasets
Figure 3: For each dataset, we report its out-of-distribution (OOD) shift relative
Figure 4: Target accuracy versus number of fused source models, showing how performance scales as more task vectors are aggregated.
Original abstract

Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain training and recombines knowledge from existing datasets and models. Leveraging model merging and task arithmetic, we extract and combine task vectors from models fine-tuned on diverse public OVAR datasets. We show that, in out-of-distribution settings, the resulting merged model achieves superior zero-shot generalization to the pre-trained base model. Code is available at https://github.com/omaymaMoussadek/robust-ovar

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes using task arithmetic to extract and recombine task vectors from models fine-tuned on multiple public open-vocabulary action recognition (OVAR) datasets, producing a merged model that, in out-of-distribution settings, achieves better zero-shot generalization than the pre-trained base model without any target-domain adaptation or training.

Significance. If the central empirical claim holds under properly controlled OOD conditions, the work would demonstrate a practical, training-free route to improving robustness in OVAR by leveraging existing public models and datasets, with direct relevance to privacy-sensitive applications. Public code release is a clear strength for reproducibility.

major comments (1)
  1. [Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.
minor comments (1)
  1. [Abstract] Abstract: quantitative metrics, dataset names, baseline comparisons, and error bars are absent; these details should be summarized even at the abstract level for a methods paper.
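
The class-disjointness check requested in the major comment could look roughly like the sketch below. It is an assumption-laden illustration: the helper names `normalize` and `overlapping_classes`, the dataset keys, and the label sets are all hypothetical, and matching normalized strings only detects lexical overlap, not synonyms or visually equivalent actions.

```python
from typing import Dict, Set


def normalize(label: str) -> str:
    """Lowercase and unify separators so 'Push_Up' and 'push up' compare equal."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def overlapping_classes(
    sources: Dict[str, Set[str]], target: Set[str]
) -> Dict[str, Set[str]]:
    """Return, per source dataset, the class names it shares with the target set."""
    target_norm = {normalize(c) for c in target}
    report: Dict[str, Set[str]] = {}
    for name, classes in sources.items():
        shared = {normalize(c) for c in classes} & target_norm
        if shared:
            report[name] = shared
    return report


# Hypothetical label sets, for illustration only.
sources = {"source_a": {"Push_Up", "riding bike"}, "source_b": {"archery"}}
target = {"push up", "fencing"}

leaks = overlapping_classes(sources, target)
print(leaks)  # {'source_a': {'push up'}}
```

An empty report is necessary but not sufficient for a clean zero-shot protocol, since the referee also asks about visual domains, which a label comparison cannot capture.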

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for explicit verification of the zero-shot OOD protocol. We address the concern directly below.

point-by-point responses
  1. Referee: [Experiments] Experiments section (and any associated tables/figures reporting OOD results): the manuscript must explicitly verify and report that the action classes and visual domains in the held-out OOD test sets are disjoint from those appearing in all source fine-tuning datasets used to extract the task vectors. Without such checks, gains versus the base model could be explained by partial leakage rather than the arithmetic recombination itself, directly undermining the zero-shot generalization claim.

    Authors: We agree that explicit verification is required to rigorously support the zero-shot claim. In the revised manuscript we will add a new subsection (and accompanying table) in the Experiments section that enumerates all action classes and visual domains appearing in the source fine-tuning datasets and confirms their complete disjointness from the held-out OOD test sets. Our internal analysis already establishes this disjointness; the added material will make the check transparent and reproducible.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical merging on external datasets

The paper describes an empirical procedure that extracts task vectors from models fine-tuned on public OVAR datasets and merges them via task arithmetic to produce a model evaluated on out-of-distribution test sets. No equations, predictions, or first-principles claims are presented that reduce by construction to quantities defined or fitted within the paper itself. The central result is a comparative empirical performance claim against a pre-trained base model, supported by external data and benchmarks rather than self-referential definitions or self-citation chains that carry the load-bearing argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The approach implicitly assumes standard task arithmetic operations and the utility of public OVAR datasets but does not introduce new ones.

Reference graph

Works this paper leans on

22 extracted references · 4 canonical work pages · 2 internal anchors

  [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
  [2] M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith et al., “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” in ICML, 2022
  [3] M. S. Matena and C. A. Raffel, “Merging models with Fisher-weighted averaging,” Advances in Neural Information Processing Systems, vol. 35, pp. 17703–17716, 2022
  [4] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” in ICLR, 2022
  [5] G. Ortiz-Jimenez, A. Favero, and P. Frossard, “Task arithmetic in the tangent space: Improved editing of pre-trained models,” NeurIPS, 2023
  [6] P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” Advances in Neural Information Processing Systems, vol. 36, pp. 7093–7115, 2023
  [7] K. Wang, N. Dimitriadis, G. Ortiz-Jimenez, F. Fleuret, and P. Frossard, “Localizing task information for improved model merging and compression,” arXiv preprint arXiv:2405.07813, 2024
  [8] A. A. Gargiulo, D. Crisostomi, M. S. Bucarelli, S. Scardapane, F. Silvestri, and E. Rodola, “Task singular vectors: Reducing task interference in model merging,” in CVPR, 2025
  [9] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A short note on the Kinetics-700 human action dataset,” arXiv preprint arXiv:1907.06987, 2019
  [10] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012
  [11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in ICCV, 2011
  [12] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, “Not only look, but also listen: Learning multimodal violence detection under weak supervision,” in ECCV. Springer, 2020
  [13] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super mario: Absorbing abilities from homologous models as a free lunch,” in Forty-first International Conference on Machine Learning, 2024
  [14] L. Lumetti, G. Capitani, E. Ficarra, S. Calderara, C. Grana, A. Porrello, and F. Bolelli, “U-Net transplant: the role of pre-training for model merging in 3D medical segmentation,” in MICCAI, 2025
  [15] D. Marczak, S. Magistri, S. Cygert, B. Twardowski, A. D. Bagdanov, and J. Van De Weijer, “No task left behind: Isotropic model merging with common and task-specific subspaces,” in ICML, 2025
  [16] Z. Weng, X. Yang, A. Li, Z. Wu, and Y.-G. Jiang, “Open-VCLIP: Transforming CLIP to an open-vocabulary video model via interpolated weight optimization,” in ICML, 2023
  [17] K. Yoshida, Y. Naraki, T. Horie, R. Yamaki, R. Shimizu, Y. Saito, J. McAuley, and H. Naganuma, “Mastering task arithmetic: τjp as a key indicator for weight disentanglement,” in ICLR, 2025
  [18] A. Porrello, P. Buzzega, F. Dangel, T. Sommariva, R. Salami, L. Bonicelli, and S. Calderara, “Dataless weight disentanglement in task arithmetic via Kronecker-factored approximate curvature,” in ICLR, 2026
  [19] A. Porrello, L. Bonicelli, P. Buzzega, M. Millunzi, S. Calderara, and R. Cucchiara, “A second-order perspective on model compositionality and incremental learning,” in ICLR, vol. 2025, 2025
  [20] T. Sommariva, F. Morandi, S. Calderara, and A. Porrello, “Distilling linearized behavior into non-linear fine-tuning for effective task arithmetic,” in ICML, 2026
  [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The Kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017
  [22] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019