pith. machine review for the scientific record.

arxiv: 2604.23665 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

Cigdem Beyan, Francesco Dibitonto, Vittorio Murino

Pith reviewed 2026-05-08 06:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords hyperbolic adaptation · CLIP · zero-shot VQA · parameter-efficient fine-tuning · visual question answering · reasoning tasks · hyperbolic geometry

The pith

Lightweight adaptation moves pretrained CLIP into hyperbolic space and raises zero-shot VQA accuracy, especially on reasoning questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a small set of added parameters can shift an existing CLIP model from Euclidean to hyperbolic geometry without retraining the whole network. The shift is learned on image-text pairs that share no overlap with any VQA test set, yet the resulting model answers visual questions more accurately than both the original CLIP and earlier fully hyperbolic CLIP variants. Gains reach 1.9 points on average for reasoning-heavy benchmarks. If the improvement stems from hyperbolic geometry rather than the adaptation routine alone, existing vision-language models could be upgraded cheaply to handle hierarchical visual-textual relations more cleanly.
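The review text does not spell out the projection itself, but a standard construction in hyperbolic representation learning is the exponential map at the origin of the Poincaré ball, which carries a Euclidean embedding onto the hyperbolic manifold. A minimal sketch under that assumption; the function name and the curvature default c = 1.0 are illustrative, not taken from the paper's code:

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Carries a Euclidean vector (e.g., a CLIP embedding) onto the ball, so that
    downstream similarities can use hyperbolic rather than cosine distance.
    This is a generic construction, not the paper's specific projection.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Toy check: projected points land strictly inside the unit ball.
feats = torch.randn(4, 512)  # stand-in for a batch of CLIP embeddings
ball = expmap0(feats)
assert (ball.norm(dim=-1) < 1.0).all()
```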

Core claim

HAC performs parameter-efficient fine-tuning that projects pretrained CLIP embeddings into hyperbolic space, producing representations that outperform Euclidean CLIP and prior hyperbolic CLIP models on a range of zero-shot VQA benchmarks spanning general, reasoning, and OCR categories, with the largest lift on reasoning tasks.

What carries the argument

The HAC adaptation module, which adds a lightweight projection into hyperbolic space while freezing most of the original CLIP weights.
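Figure 2 (reproduced below) describes Uℓ,T as any lightweight transformation wrapped around a selected block: sequential, residual, or low-rank. A minimal sketch of the residual variant, implemented low-rank for parameter efficiency; the class and argument names are illustrative rather than drawn from the released code:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Hypothetical U_{l,T}: a low-rank residual update around a frozen block."""

    def __init__(self, block: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False       # the pretrained CLIP block stays frozen
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)    # adapter starts as an exact identity wrapper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.block(x)
        return h + self.up(self.down(h))  # only down/up are trainable

# Wrapping a stand-in block: ~8k trainable parameters on a 512-d stream.
adapted = ResidualAdapter(nn.Linear(512, 512), dim=512, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(trainable)  # 2 * 512 * 8 = 8192
```

Zero-initializing the up-projection makes the wrapped network reproduce the pretrained CLIP exactly at the first step, a common choice for stable adapter training.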

If this is right

  • Existing CLIP checkpoints can be reused rather than discarded when moving to hyperbolic embeddings.
  • Zero-shot VQA evaluation remains strict because no VQA data enters training.
  • Reasoning tasks benefit most, suggesting hyperbolic space organizes the hierarchical relations needed for multi-step visual inference.
  • The same adaptation recipe may apply to other CLIP-based tasks without task-specific data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be tested on retrieval or captioning benchmarks that also rely on hierarchical structure.
  • If hyperbolic geometry proves consistently superior, future foundation models might be trained directly in hyperbolic space from the start.
  • The parameter budget used for adaptation offers a practical knob for trading compute against accuracy in other multimodal settings.

Load-bearing premise

The performance gains come specifically from the hyperbolic geometry rather than from the fine-tuning procedure or from incidental alignment between the adaptation data and the VQA tests.

What would settle it

An ablation that trains an otherwise identical Euclidean adapter with the same number of parameters and the same non-VQA data, then measures whether its VQA accuracy matches or exceeds the hyperbolic version.
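One way to make that control airtight is to pin every training choice in a single configuration and let only the geometry flag vary. A schematic in Python; all field names and default values are hypothetical, not the paper's:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AdaptConfig:
    adapter_rank: int = 8
    lr: float = 1e-4
    weight_decay: float = 1e-2
    epochs: int = 5
    train_data: str = "non-vqa-corpus"  # same non-VQA data for both runs
    geometry: str = "hyperbolic"        # the ONLY factor allowed to differ

hyperbolic_run = AdaptConfig()
euclidean_control = replace(hyperbolic_run, geometry="euclidean")

# Sanity check: the two configs are identical up to the geometry flag.
assert replace(euclidean_control, geometry="hyperbolic") == hyperbolic_run
```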

Figures

Figures reproduced from arXiv: 2604.23665 by Cigdem Beyan, Francesco Dibitonto, Vittorio Murino.

Figure 1: Overview of our HAC enabling geometric adaptation of pretrained CLIP models through parameter-efficient fine-tuning. HAC updates a limited number of Adaptation parameters that are added by wrapping each selected block ℓ with an Adaptation module Uℓ,T, where T is a lightweight transformation of the block. HAC also fully trains the final LayerNorm, Linear Projection layers, and Projection α scalars. denotes…

Figure 2: HAC Adapted Transformer Blocks: for a selected block ℓ, the HAC Adaptation module Uℓ,T can implement any lightweight transformation, such as a sequential (left), a residual (middle), or a low-rank transformation (right) applied to the selected block's submodules. denotes the only trainable parameters. impose a fixed gating effect on the encoder's outputs. These projection heads and final LayerNorms are the only compone…

Figure 3: (a) HoroPCA 2D projection of embeddings from our best HAC-B model.

Figure 4: HoroPCA projections comparing the geometric structure learned with and without Compositional Entailment Loss LhCE: (a) HAC-S with LhCE: clear hierarchical separation, with object boxes near the origin and scene-level embeddings at larger radii. (b) HAC-S without LhCE: loss of hierarchical organization; embeddings lack radial ordering. (c) HAC-B with LhCE: clear object-scene hierarchy. (d) HAC-S without LhC…
Original abstract

Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes HAC, a parameter-efficient framework for adapting pretrained CLIP models into hyperbolic space via lightweight fine-tuning on a non-VQA dataset. It evaluates the approach in a strict zero-shot setting on diverse VQA benchmarks (General, Reasoning, OCR), claiming consistent gains over Euclidean CLIP baselines and prior hyperbolic methods, including up to a +1.9 point average improvement for HAC-B on reasoning-intensive tasks.

Significance. If the gains can be attributed to hyperbolic geometry rather than adaptation alone, the work would offer a practical route to more expressive embeddings in vision-language models without full retraining from scratch. The task-agnostic training protocol and public code release strengthen the contribution by enabling verification and extension.

major comments (1)
  1. [Experimental Evaluation] Experimental section (results tables and ablation studies): the headline performance deltas (e.g., +1.9 on reasoning tasks) are reported only against standard CLIP-B and earlier hyperbolic CLIP variants. No Euclidean control is presented that applies the identical parameter-efficient adapter, training data, optimizer, and hyperparameters entirely in Euclidean space. Without this ablation, the central claim that improvements arise from hyperbolic geometry cannot be isolated from the effects of the adaptation procedure itself.
minor comments (2)
  1. [Abstract] Abstract and §3: the description of the adaptation architecture, loss functions, and hyperparameter selection is high-level; even with the code link, explicit equations or pseudocode for the hyperbolic projection and adapter modules would improve clarity.
  2. [Results] Table captions and §4: report standard deviations or statistical significance tests across runs to support the claim of consistent outperformance.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

Point-by-point responses
  1. Referee: Experimental section (results tables and ablation studies): the headline performance deltas (e.g., +1.9 on reasoning tasks) are reported only against standard CLIP-B and earlier hyperbolic CLIP variants. No Euclidean control is presented that applies the identical parameter-efficient adapter, training data, optimizer, and hyperparameters entirely in Euclidean space. Without this ablation, the central claim that improvements arise from hyperbolic geometry cannot be isolated from the effects of the adaptation procedure itself.

    Authors: We appreciate the referee's observation that a direct Euclidean control using the identical adapter would better isolate the contribution of hyperbolic geometry. The current manuscript compares HAC against unadapted Euclidean CLIP-B baselines and prior hyperbolic CLIP variants trained from scratch. While these results support the overall effectiveness of the approach, we agree that the suggested ablation strengthens the central claim. We will add this control experiment to the revised manuscript, applying the same parameter-efficient adapter, training data, optimizer, and hyperparameters but operating entirely in Euclidean space. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent evaluation

Rationale

The paper introduces HAC as a parameter-efficient adaptation procedure that fine-tunes pretrained CLIP into hyperbolic space on a non-VQA corpus, then reports zero-shot VQA performance gains. No equations, derivations, or first-principles predictions appear in the provided text. The central claims rest on direct experimental comparisons against Euclidean baselines and prior hyperbolic models; these comparisons are not forced by construction, self-definition, or load-bearing self-citation. The evaluation protocol (non-overlapping training data, strict zero-shot) is externally verifiable and does not reduce to a renaming or fitted input presented as a prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that hyperbolic geometry better organizes hierarchical visual-textual relations than Euclidean space, plus standard assumptions of CLIP pretraining and transfer learning. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Hyperbolic geometry captures hierarchical structures more effectively than Euclidean embeddings for vision-language tasks
    Stated directly in the opening sentence of the abstract as motivation for the work; a numeric sketch of the geometry's radial behavior follows below.
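As a numeric illustration of why that assumption is plausible (an editorial addition, not a result from the paper): in the Poincaré ball, equal Euclidean steps cost more hyperbolic distance near the rim. That exponential growth of room toward the boundary is what lets tree-like hierarchies embed with low distortion, and it matches the radial object-to-scene ordering the HoroPCA figures describe. A self-contained check of the closed-form distance:

```python
import torch

def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance in the Poincare ball with curvature -1."""
    sq = ((x - y) ** 2).sum(-1)
    denom = ((1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)

origin = torch.zeros(2)
print(poincare_dist(origin, torch.tensor([0.1, 0.0])))  # ~0.20: near the origin, distances stay small
print(poincare_dist(origin, torch.tensor([0.9, 0.0])))  # ~2.94: the rim is disproportionately far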

pith-pipeline@v0.9.0 · 5513 in / 1157 out tokens · 27618 ms · 2026-05-08T06:32:51.837467+00:00 · methodology

