GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
Pith reviewed 2026-05-08 13:12 UTC · model grok-4.3
The pith
GeoStack stacks any number of domain experts into a VLM by imposing geometric constraints on their adapters, preserving base knowledge and folding the weights for constant-time inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoStack is a framework that composes independently trained domain experts into one unified VLM by imposing geometric and structural constraints on the adapter manifold. This preserves the base model's foundational knowledge without loss or interference. The framework further demonstrates a weight-folding property that reduces inference complexity to O(1) regardless of the number of experts. Experiments on multi-domain adaptation and class-incremental learning confirm that the method supports efficient long-term composition while mitigating catastrophic forgetting.
What carries the argument
The weight-folding property, realized through geometric constraints on the adapter manifold, enables experts to be added without changing inference cost or degrading base performance.
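A minimal sketch of what folding buys, assuming LoRA-style low-rank adapters and additive composition (both assumptions; the paper's adapter parameterization is not quoted here). All names and shapes are illustrative.

import numpy as np

# Hypothetical weight folding: each expert is a low-rank delta B_i @ A_i,
# and all deltas are summed into the base weight once, offline. Inference
# then uses a single matrix, so per-sample cost does not depend on the
# number of experts.
d, r, n_experts = 512, 8, 12
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d, d)) / np.sqrt(d)
experts = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(n_experts)]

# Fold: one-time composition of all expert deltas into an effective weight.
W_eff = W_base + sum(B @ A for B, A in experts)

x = rng.standard_normal(d)
y = W_eff @ x  # O(d^2) per token, independent of n_experts

Because the fold happens once, offline, a forward pass is a single matrix multiply however many experts are stacked; that is the O(1) claim in miniature.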
If this is right
- Multiple domain experts can be integrated without retraining the base model or degrading its original performance.
- Inference remains efficient as the total knowledge grows, supporting ongoing addition of new capabilities.
- Class-incremental and multi-domain adaptation tasks can proceed over extended periods with reduced forgetting.
- The same geometric stacking can be applied to new experts without requiring changes to the inference pipeline.
Where Pith is reading between the lines
- The same constraint-based folding might apply to other adapter-based models outside vision-language settings.
- Practical systems could dynamically load and combine expert modules on demand for specialized queries.
- Future tests could check whether the O(1) property holds when experts are trained on highly dissimilar data distributions.
Load-bearing premise
Imposing geometric and structural constraints on the adapter manifold ensures the foundational knowledge of the base model is preserved without loss or interference.
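The paper's constraint construction is not quoted here, but one common way such a premise is instantiated is by projecting each expert's update off a protected subspace of the base weights, so the delta acts only in directions the base model barely uses. A minimal sketch under that assumption (the projection, the subspace size, and all names are hypothetical):

import numpy as np

def project_off_base(delta, W_base, k=32):
    # U_k spans the top-k left-singular directions of the base weight,
    # treated here as the "protected" subspace carrying base knowledge.
    U, _, _ = np.linalg.svd(W_base, full_matrices=False)
    U_k = U[:, :k]
    # Remove the component of delta that lives in the protected subspace.
    return delta - U_k @ (U_k.T @ delta)

rng = np.random.default_rng(1)
d = 256
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
delta = rng.standard_normal((d, d)) * 0.01
delta_c = project_off_base(delta, W_base)

# The constrained delta has no overlap with the protected subspace:
U, _, _ = np.linalg.svd(W_base, full_matrices=False)
print(np.linalg.norm(U[:, :32].T @ delta_c))  # ~0, up to float error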
What would settle it
An experiment that adds 20 or more experts and measures either rising inference latency per sample or declining accuracy on the original base tasks would falsify the constant-time and preservation claims.
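A sketch of that settling experiment, under the same additive-folding assumption as above (timings and names illustrative; a real run would also score base-task accuracy at each k):

import time
import numpy as np

# Stack k experts, fold them offline, and record per-sample latency.
# If folding is exact, the latency curve is flat in k; any upward
# trend falsifies the O(1) claim.
d, r = 512, 8
rng = np.random.default_rng(2)
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal((1000, d))

for k in (1, 5, 10, 20, 40):
    deltas = sum(rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
                 for _ in range(k))
    W_eff = W_base + deltas  # folding cost is offline, not timed
    t0 = time.perf_counter()
    _ = x @ W_eff.T
    dt = (time.perf_counter() - t0) / len(x)
    print(f"k={k:>3d}  latency per sample: {dt*1e6:.1f} us")

A flat latency curve with stable base-task accuracy would support the claims; any trend in k would falsify them.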
Original abstract
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoStack, a modular framework for composing independently trained domain-specific adapters in Vision-Language Models. Geometric and structural constraints are imposed on the adapter manifold to preserve base-model knowledge and avoid catastrophic forgetting. A mathematical demonstration of a weight-folding property is presented that composes any number of experts into a single effective weight set with constant-time (O(1)) inference complexity. Experiments on multi-domain adaptation and class-incremental learning tasks are reported to support the claims, and code is released.
Significance. If the weight-folding property holds exactly under the stated quasi-abelian structure, the work would provide a valuable mechanism for efficient, long-term knowledge composition in VLMs without linear growth in inference cost. The open-source code release is a clear strength that supports reproducibility and further investigation.
major comments (2)
- [Mathematical demonstration section (weight-folding theorem)] Mathematical demonstration of the weight-folding property: the claim of exact O(1) inference independent of the number of experts requires that independently trained adapters satisfy algebraic closure under the quasi-abelian operation after the geometric constraints are applied. The manuscript must specify whether the constraints enforce exact group closure (with no residual terms) or only soft regularization; any deviation would produce error that accumulates with the number of experts, violating both the constant-time guarantee and the no-interference claim. A sketch of the closure-residual check this implies appears after this list.
- [Experiments section] Experimental validation of scaling: the reported results on multi-domain and incremental learning show forgetting mitigation, but no ablation or scaling plot is provided that measures both task accuracy and wall-clock inference time as the number of stacked experts increases from 1 to 10+. Such data are necessary to confirm that inference remains strictly O(1) rather than exhibiting hidden linear or super-linear costs.
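To make the first major comment concrete, here is a hedged sketch of the closure-residual check it asks for, assuming the constraint acts as a projection off a protected subspace (an assumption; the paper's constraint is not quoted). A soft penalty leaves per-expert leakage that accumulates; an exact projection does not:

import numpy as np

# If the constraint is only a soft penalty, each expert leaves a residual
# in the protected subspace, and residuals add up as experts are stacked.
# Exact projection keeps the accumulated residual at numerical zero.
rng = np.random.default_rng(3)
d, k_protect = 256, 32
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
U = np.linalg.svd(W_base, full_matrices=False)[0][:, :k_protect]

def residual_norm(total_delta):
    return np.linalg.norm(U.T @ total_delta)

soft, exact = np.zeros((d, d)), np.zeros((d, d))
for n in range(1, 21):
    delta = rng.standard_normal((d, d)) * 0.01
    proj = U @ (U.T @ delta)
    soft += delta - 0.9 * proj   # soft penalty: 10% leakage remains
    exact += delta - proj        # exact projection: no leakage
    if n % 5 == 0:
        print(f"n={n:2d}  soft residual={residual_norm(soft):.3f}  "
              f"exact residual={residual_norm(exact):.2e}")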
minor comments (2)
- [Abstract] The abstract introduces the term 'quasi-abelian' without a one-sentence definition or forward reference; adding a brief parenthetical or moving the definition to the introduction would improve readability for readers outside the immediate subfield. One plausible reading is sketched after this list.
- [Method section] Notation for the adapter manifold and the folding operator should be introduced consistently in the first method subsection and used uniformly thereafter; current usage mixes descriptive prose with symbols without an explicit table of notation.
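For the first minor comment, one plausible reading of 'quasi-abelian' (an illustration, not the paper's definition): composition of adapter deltas is commutative up to a constraint-violation term,

\[
  \Delta_i \oplus \Delta_j = \Delta_i + \Delta_j + \varepsilon_{ij},
  \qquad \varepsilon_{ij} = 0 \ \text{iff the geometric constraints hold exactly},
\]

so the fold \(\bigoplus_{i=1}^{n} \Delta_i\) is order-independent, and fully abelian behavior is recovered, exactly when every \(\varepsilon_{ij}\) vanishes.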
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work's significance and for the detailed, constructive comments. We address each major comment point by point below, indicating planned revisions to the manuscript.
Point-by-point responses
Referee: Mathematical demonstration of the weight-folding property: the claim of exact O(1) inference independent of the number of experts requires that independently trained adapters satisfy algebraic closure under the quasi-abelian operation after the geometric constraints are applied. The manuscript must specify whether the constraints enforce exact group closure (with no residual terms) or only soft regularization; any deviation would produce error that accumulates with the number of experts, violating both the constant-time guarantee and the no-interference claim.
Authors: The weight-folding theorem (Section 3) is proven under the quasi-abelian structure where the imposed geometric constraints on the adapter manifold are designed to enforce exact algebraic closure. The derivation shows that the composition yields an equivalent single weight set with no residual terms, preserving both the O(1) inference complexity and the no-interference property. We will revise the mathematical demonstration section to explicitly state that the constraints achieve exact group closure (rather than soft regularization) and include a clarifying remark on the absence of accumulating errors. revision: yes
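An illustrative form of the identity the authors appeal to, with assumed notation (\(W_0\) the base weights, \(\Delta_i\) the \(i\)-th expert's folded delta; not taken from the paper):

\[
  W_{\mathrm{eff}} \;=\; W_0 \oplus \Delta_1 \oplus \cdots \oplus \Delta_n \;=\; W_0 + \sum_{i=1}^{n} \Delta_i,
\]

computed once offline; a forward pass touches only \(W_{\mathrm{eff}}\), so per-sample cost is independent of \(n\). Any nonzero residual \(\varepsilon_{ij}\) from imperfect closure would reappear here as an \(n\)-dependent error term, which is exactly what the promised revision should rule out.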
Referee: Experimental validation of scaling: the reported results on multi-domain and incremental learning show forgetting mitigation, but no ablation or scaling plot is provided that measures both task accuracy and wall-clock inference time as the number of stacked experts increases from 1 to 10+. Such data are necessary to confirm that inference remains strictly O(1) rather than exhibiting hidden linear or super-linear costs.
Authors: We agree that direct empirical scaling evidence strengthens the theoretical claim. While the current experiments demonstrate results on multi-domain and class-incremental tasks involving multiple experts, a dedicated scaling ablation for wall-clock inference time was not included. In the revised manuscript we will add ablation experiments and plots reporting both task accuracy and measured inference time for 1 to 12 stacked experts, confirming the constant-time behavior in practice. revision: yes
Circularity Check
No circularity: weight-folding presented as consequence of imposed manifold constraints
Full rationale
The abstract states that geometric and structural constraints are imposed on the adapter manifold to preserve base knowledge, after which a weight-folding property is mathematically demonstrated to yield O(1) inference. No equations are supplied in the abstract, and the full text as excerpted contains no quoted reduction showing the folding property is defined in terms of itself or obtained by fitting parameters to the target result. The quasi-abelian structure is introduced as part of the framework construction rather than smuggled in via self-citation or a renamed empirical pattern. The central claim therefore remains independent of its inputs; the derivation chain does not collapse by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: geometric and structural constraints on the adapter manifold preserve the base model's foundational knowledge.
invented entities (1)
- GeoStack framework (no independent evidence)