GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
Pith reviewed 2026-05-08 13:12 UTC · model grok-4.3
The pith
GeoStack stacks any number of domain experts into a VLM by imposing geometric constraints on their adapters, preserving base knowledge and folding the weights for constant-time inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoStack is a framework that composes independently trained domain experts into one unified VLM by imposing geometric and structural constraints on the adapter manifold. This preserves the base model's foundational knowledge without loss or interference. The framework further demonstrates a weight-folding property that reduces inference complexity to O(1) regardless of the number of experts. Experiments on multi-domain adaptation and class-incremental learning confirm that the method supports efficient long-term composition while mitigating catastrophic forgetting.
What carries the argument
The weight-folding property, realized through geometric constraints on the adapter manifold, enables experts to be added without changing inference cost or degrading base performance.
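A minimal sketch of what folding buys, assuming LoRA-style low-rank adapters and additive composition (both assumptions; the paper's adapter parameterization is not quoted here). All names and shapes are illustrative.

import numpy as np

# Hypothetical weight folding: each expert is a low-rank delta B_i @ A_i,
# and all deltas are summed into the base weight once, offline. Inference
# then uses a single matrix, so per-sample cost does not depend on the
# number of experts.
d, r, n_experts = 512, 8, 12
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d, d)) / np.sqrt(d)
experts = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(n_experts)]

# Fold: one-time composition of all expert deltas into an effective weight.
W_eff = W_base + sum(B @ A for B, A in experts)

x = rng.standard_normal(d)
y = W_eff @ x  # O(d^2) per token, independent of n_experts

Because the fold happens once, offline, a forward pass is a single matrix multiply however many experts are stacked; that is the O(1) claim in miniature.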
If this is right
- Multiple domain experts can be integrated without retraining the base model or degrading its original performance.
- Inference remains efficient as the total knowledge grows, supporting ongoing addition of new capabilities.
- Class-incremental and multi-domain adaptation tasks can proceed over extended periods with reduced forgetting.
- The same geometric stacking can be applied to new experts without requiring changes to the inference pipeline.
Where Pith is reading between the lines
- The same constraint-based folding might apply to other adapter-based models outside vision-language settings.
- Practical systems could dynamically load and combine expert modules on demand for specialized queries.
- Future tests could check whether the O(1) property holds when experts are trained on highly dissimilar data distributions.
Load-bearing premise
Imposing geometric and structural constraints on the adapter manifold ensures the foundational knowledge of the base model is preserved without loss or interference.
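The paper's constraint construction is not quoted here, but one common way such a premise is instantiated is by projecting each expert's update off a protected subspace of the base weights, so the delta acts only in directions the base model barely uses. A minimal sketch under that assumption (the projection, the subspace size, and all names are hypothetical):

import numpy as np

def project_off_base(delta, W_base, k=32):
    # U_k spans the top-k left-singular directions of the base weight,
    # treated here as the "protected" subspace carrying base knowledge.
    U, _, _ = np.linalg.svd(W_base, full_matrices=False)
    U_k = U[:, :k]
    # Remove the component of delta that lives in the protected subspace.
    return delta - U_k @ (U_k.T @ delta)

rng = np.random.default_rng(1)
d = 256
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
delta = rng.standard_normal((d, d)) * 0.01
delta_c = project_off_base(delta, W_base)

# The constrained delta has no overlap with the protected subspace:
U, _, _ = np.linalg.svd(W_base, full_matrices=False)
print(np.linalg.norm(U[:, :32].T @ delta_c))  # ~0, up to float error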
What would settle it
An experiment that adds 20 or more experts and measures either rising inference latency per sample or declining accuracy on the original base tasks would falsify the constant-time and preservation claims.
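A sketch of that settling experiment, under the same additive-folding assumption as above (timings and names illustrative; a real run would also score base-task accuracy at each k):

import time
import numpy as np

# Stack k experts, fold them offline, and record per-sample latency.
# If folding is exact, the latency curve is flat in k; any upward
# trend falsifies the O(1) claim.
d, r = 512, 8
rng = np.random.default_rng(2)
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal((1000, d))

for k in (1, 5, 10, 20, 40):
    deltas = sum(rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
                 for _ in range(k))
    W_eff = W_base + deltas  # folding cost is offline, not timed
    t0 = time.perf_counter()
    _ = x @ W_eff.T
    dt = (time.perf_counter() - t0) / len(x)
    print(f"k={k:>3d}  latency per sample: {dt*1e6:.1f} us")

A flat latency curve with stable base-task accuracy would support the claims; any trend in k would falsify them.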
Original abstract
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoStack, a modular framework for composing independently trained domain-specific adapters in Vision-Language Models. Geometric and structural constraints are imposed on the adapter manifold to preserve base-model knowledge and avoid catastrophic forgetting. A mathematical demonstration of a weight-folding property is presented that composes any number of experts into a single effective weight set with constant-time (O(1)) inference complexity. Experiments on multi-domain adaptation and class-incremental learning tasks are reported to support the claims, and code is released.
Significance. If the weight-folding property holds exactly under the stated quasi-abelian structure, the work would provide a valuable mechanism for efficient, long-term knowledge composition in VLMs without linear growth in inference cost. The open-source code release is a clear strength that supports reproducibility and further investigation.
major comments (2)
- [Mathematical demonstration section (weight-folding theorem)] Mathematical demonstration of the weight-folding property: the claim of exact O(1) inference independent of the number of experts requires that independently trained adapters satisfy algebraic closure under the quasi-abelian operation after the geometric constraints are applied. The manuscript must specify whether the constraints enforce exact group closure (with no residual terms) or only soft regularization; any deviation would produce error that accumulates with the number of experts, violating both the constant-time guarantee and the no-interference claim. A sketch of the closure-residual check this implies appears after this list.
- [Experiments section] Experimental validation of scaling: the reported results on multi-domain and incremental learning show forgetting mitigation, but no ablation or scaling plot is provided that measures both task accuracy and wall-clock inference time as the number of stacked experts increases from 1 to 10+. Such data are necessary to confirm that inference remains strictly O(1) rather than exhibiting hidden linear or super-linear costs.
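To make the first major comment concrete, here is a hedged sketch of the closure-residual check it asks for, assuming the constraint acts as a projection off a protected subspace (an assumption; the paper's constraint is not quoted). A soft penalty leaves per-expert leakage that accumulates; an exact projection does not:

import numpy as np

# If the constraint is only a soft penalty, each expert leaves a residual
# in the protected subspace, and residuals add up as experts are stacked.
# Exact projection keeps the accumulated residual at numerical zero.
rng = np.random.default_rng(3)
d, k_protect = 256, 32
W_base = rng.standard_normal((d, d)) / np.sqrt(d)
U = np.linalg.svd(W_base, full_matrices=False)[0][:, :k_protect]

def residual_norm(total_delta):
    return np.linalg.norm(U.T @ total_delta)

soft, exact = np.zeros((d, d)), np.zeros((d, d))
for n in range(1, 21):
    delta = rng.standard_normal((d, d)) * 0.01
    proj = U @ (U.T @ delta)
    soft += delta - 0.9 * proj   # soft penalty: 10% leakage remains
    exact += delta - proj        # exact projection: no leakage
    if n % 5 == 0:
        print(f"n={n:2d}  soft residual={residual_norm(soft):.3f}  "
              f"exact residual={residual_norm(exact):.2e}")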
minor comments (2)
- [Abstract] The abstract introduces the term 'quasi-abelian' without a one-sentence definition or forward reference; adding a brief parenthetical or moving the definition to the introduction would improve readability for readers outside the immediate subfield. One plausible reading is sketched after this list.
- [Method section] Notation for the adapter manifold and the folding operator should be introduced consistently in the first method subsection and used uniformly thereafter; current usage mixes descriptive prose with symbols without an explicit table of notation.
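For the first minor comment, one plausible reading of 'quasi-abelian' (an illustration, not the paper's definition): composition of adapter deltas is commutative up to a constraint-violation term,

\[
  \Delta_i \oplus \Delta_j = \Delta_i + \Delta_j + \varepsilon_{ij},
  \qquad \varepsilon_{ij} = 0 \ \text{iff the geometric constraints hold exactly},
\]

so the fold \(\bigoplus_{i=1}^{n} \Delta_i\) is order-independent, and fully abelian behavior is recovered, exactly when every \(\varepsilon_{ij}\) vanishes.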
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work's significance and for the detailed, constructive comments. We address each major comment point by point below, indicating planned revisions to the manuscript.
Point-by-point responses
Referee: Mathematical demonstration of the weight-folding property: the claim of exact O(1) inference independent of the number of experts requires that independently trained adapters satisfy algebraic closure under the quasi-abelian operation after the geometric constraints are applied. The manuscript must specify whether the constraints enforce exact group closure (with no residual terms) or only soft regularization; any deviation would produce error that accumulates with the number of experts, violating both the constant-time guarantee and the no-interference claim.
Authors: The weight-folding theorem (Section 3) is proven under the quasi-abelian structure where the imposed geometric constraints on the adapter manifold are designed to enforce exact algebraic closure. The derivation shows that the composition yields an equivalent single weight set with no residual terms, preserving both the O(1) inference complexity and the no-interference property. We will revise the mathematical demonstration section to explicitly state that the constraints achieve exact group closure (rather than soft regularization) and include a clarifying remark on the absence of accumulating errors. revision: yes
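An illustrative form of the identity the authors appeal to, with assumed notation (\(W_0\) the base weights, \(\Delta_i\) the \(i\)-th expert's folded delta; not taken from the paper):

\[
  W_{\mathrm{eff}} \;=\; W_0 \oplus \Delta_1 \oplus \cdots \oplus \Delta_n \;=\; W_0 + \sum_{i=1}^{n} \Delta_i,
\]

computed once offline; a forward pass touches only \(W_{\mathrm{eff}}\), so per-sample cost is independent of \(n\). Any nonzero residual \(\varepsilon_{ij}\) from imperfect closure would reappear here as an \(n\)-dependent error term, which is exactly what the promised revision should rule out.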
Referee: Experimental validation of scaling: the reported results on multi-domain and incremental learning show forgetting mitigation, but no ablation or scaling plot is provided that measures both task accuracy and wall-clock inference time as the number of stacked experts increases from 1 to 10+. Such data are necessary to confirm that inference remains strictly O(1) rather than exhibiting hidden linear or super-linear costs.
Authors: We agree that direct empirical scaling evidence strengthens the theoretical claim. While the current experiments demonstrate results on multi-domain and class-incremental tasks involving multiple experts, a dedicated scaling ablation for wall-clock inference time was not included. In the revised manuscript we will add ablation experiments and plots reporting both task accuracy and measured inference time for 1 to 12 stacked experts, confirming the constant-time behavior in practice. revision: yes
Circularity Check
No circularity: weight-folding presented as consequence of imposed manifold constraints
Full rationale
The abstract states that geometric and structural constraints are imposed on the adapter manifold to preserve base knowledge, after which a weight-folding property is mathematically demonstrated to yield O(1) inference. No equations are supplied in the abstract, and the full text as excerpted contains no quoted reduction showing the folding property is defined in terms of itself or obtained by fitting parameters to the target result. The quasi-abelian structure is introduced as part of the framework construction rather than smuggled in via self-citation or a renamed empirical pattern. The central claim therefore remains independent of its inputs; the derivation chain does not collapse by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: geometric and structural constraints on the adapter manifold preserve the base model's foundational knowledge.
invented entities (1)
- GeoStack framework (no independent evidence)