Optimized Architectures for Kolmogorov-Arnold Networks
Pith reviewed 2026-05-16 22:23 UTC · model grok-4.3
The pith
Combining sparsification with depth selection in overprovisioned KANs yields smaller models with competitive accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Overprovisioned KAN architectures combined with sparsification, deep supervision, and depth selection, optimized differentiably under a minimum description length objective, allow learning compact interpretable networks that achieve competitive or superior accuracy across benchmarks without the complexity of other enhancements.
What carries the argument
Differentiable joint optimization of activations, structure, and depth under a minimum description length objective, applied to overprovisioned KANs.
If this is right
- Substantially smaller models are discovered while accuracy remains competitive or better.
- Interpretability is preserved through the principled optimization process.
- The approach outperforms sparsification alone on multiple task types.
- End-to-end optimization of model depth becomes practical for KANs.
Where Pith is reading between the lines
- Similar optimization strategies could be tested on other network types to balance size and performance.
- This may enable wider adoption of KANs in domains requiring model inspection such as physics-informed modeling.
- The method suggests a general template for making interpretable models more practical by overprovisioning then pruning.
Load-bearing premise
Differentiable mechanisms under the minimum description length objective can jointly optimize activations, structure, and depth end-to-end while preserving interpretability.
What would settle it
The central result would be falsified if, on the paper's benchmarks, the full method failed to produce smaller models whose accuracy is at least as good as that of sparsification alone.
Original abstract
Efforts to improve Kolmogorov–Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification, deep supervision, and depth selection, to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable mechanisms under a principled minimum description length objective, jointly optimizing activations, structure, and depth end-to-end. Experiments across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks demonstrate that sparsification alone is insufficient, but the combination with depth selection achieves competitive or superior accuracy while discovering substantially smaller models. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.
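The abstract names deep supervision and differentiable depth selection but does not say how they are implemented. One common way to combine the two, offered here only as a hypothetical sketch and not as the authors' method, is to attach an output head ("exit") at every depth, supervise each exit, and weight the per-exit losses by a softmax over learnable depth logits. All names below (`depth_logits`, `exit_preds`, `deep_supervised_loss`, `selected_depth`) are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def deep_supervised_loss(exit_preds, target, depth_logits):
    """Weight the loss of every exit by a soft depth choice.

    exit_preds[k] is the prediction read out after layer k + 1 (assumed setup).
    depth_logits are hypothetical learnable scores, one per candidate depth.
    Returns the weighted loss and the soft depth weights.
    """
    weights = softmax(depth_logits)
    losses = [mse(p, target) for p in exit_preds]
    return sum(w * l for w, l in zip(weights, losses)), weights

def selected_depth(depth_logits):
    """Discrete depth (1-indexed) kept after training: the highest-scoring exit."""
    return max(range(len(depth_logits)), key=lambda k: depth_logits[k]) + 1

# Usage: two exits, the second one fits the target exactly.
loss, weights = deep_supervised_loss([[0.0, 0.0], [1.0, 2.0]], [1.0, 2.0], [0.0, 0.0])
```

Because the weights are a smooth function of the logits, a gradient-based optimizer could shift mass toward shallow exits when they fit the data well enough, which is one plausible route to the smaller models the abstract reports.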
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an approach to optimize Kolmogorov-Arnold Networks (KANs) using overprovisioned architectures sparsified via differentiable mechanisms under a minimum description length (MDL) objective, combined with deep supervision and depth selection. This is claimed to yield compact, interpretable models with competitive accuracy on function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, where sparsification alone is insufficient but the full combination succeeds.
Significance. If the central claims hold, this work offers a significant contribution to scientific machine learning by providing a principled, end-to-end differentiable method to balance expressiveness and interpretability in KANs. The emphasis on MDL for joint optimization of structure and depth is a strength, potentially leading to more reliable models in applications requiring interpretability.
major comments (2)
- Abstract: The abstract reports competitive results on multiple benchmarks but provides no visible error bars, ablation details, or data exclusion criteria; this makes the central claim of 'substantially smaller models' with competitive accuracy difficult to verify from the given information.
- Abstract/Experiments narrative: The MDL objective is presented as principled for jointly optimizing activations, structure, and depth, yet the description leaves open whether structure and depth selection use the same data as the accuracy evaluation; without explicit train/validation separation this risks circularity in the 'substantially smaller models' claim.
minor comments (2)
- Abstract: The role of 'deep supervision' in the overall pipeline is mentioned but not elaborated; a brief description of its differentiable implementation would improve clarity.
- Abstract: The assumption that the differentiable MDL mechanisms preserve the original interpretability motivation of KANs is stated but would benefit from a short supporting discussion or example in the main text.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to strengthen the abstract and clarify experimental protocols. All changes will be incorporated in the next version.
- Referee: Abstract: The abstract reports competitive results on multiple benchmarks but provides no visible error bars, ablation details, or data exclusion criteria; this makes the central claim of 'substantially smaller models' with competitive accuracy difficult to verify from the given information.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will add a brief statement on error bars (computed over multiple random seeds), summarize the key ablation outcomes that isolate the contribution of depth selection, and note the data exclusion criteria used for the real-world tasks. These additions will be kept concise while making the 'substantially smaller models' claim directly verifiable from the abstract.
  Revision: yes
- Referee: Abstract/Experiments narrative: The MDL objective is presented as principled for jointly optimizing activations, structure, and depth, yet the description leaves open whether structure and depth selection use the same data as the accuracy evaluation; without explicit train/validation separation this risks circularity in the 'substantially smaller models' claim.
  Authors: We appreciate the referee highlighting this potential ambiguity. In the full experimental protocol, structure and depth selection are performed on a held-out validation split that is disjoint from both the training data and the final test sets used for accuracy reporting. We will explicitly state this separation in the revised abstract and in the experimental setup section to remove any possibility of circularity.
  Revision: yes
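The two responses above describe a protocol with disjoint train, validation, and test data and with error bars computed over random seeds. A minimal sketch of what such a protocol could look like follows; the split fractions, the function names `three_way_split` and `summarize`, and the use of the sample standard deviation as the error bar are assumptions made for illustration, not details taken from the paper.

```python
import random
import statistics

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices once and cut them into disjoint train/val/test lists.

    Structure and depth would be selected on `val`; accuracy would be
    reported on `test` only. The fractions are illustrative defaults.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

def summarize(scores):
    """Mean and sample standard deviation of per-seed test scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Usage: one split per seed, then aggregate the per-seed test metrics.
train, val, test = three_way_split(100, seed=0)
mean_score, spread = summarize([1.0, 2.0, 3.0])
```

Keeping the index lists disjoint by construction is what rules out the circularity the referee raises: no sample used to choose structure or depth can also contribute to the reported accuracy.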
Circularity Check
No significant circularity identified
The paper presents an approach using differentiable mechanisms under a minimum description length objective to jointly optimize activations, structure, and depth in overprovisioned KANs, with experimental results across benchmarks showing that combining sparsification and depth selection yields smaller models with competitive accuracy. No load-bearing derivation step in the abstract or described claims reduces by construction to a self-definition, fitted input renamed as prediction, or self-citation chain. The experimental narrative relies on ablation-style comparisons that remain independent of the optimization inputs, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparsification and depth selection under MDL preserve KAN interpretability advantages
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: $L_{\mathrm{MDL}} = L_{\mathrm{model}} + L_{\mathrm{model}\mid\mathrm{data}}$ ... $L_{\mathrm{model}} = (\log n / n)\,\lVert\theta\rVert_0$ ... with $\lVert\theta\rVert_0$ approximated via $\mathbb{E}[z]$ gate expectations under differentiable $L_0$ relaxation
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (tag: echoes)
  Passage: Overprovisioning and sparsification are synergistic... combination with depth selection achieves competitive or superior accuracy while discovering substantially smaller models
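The first passage quoted above gives the model-cost term as (log n / n) times an L0 norm that is approximated through gate expectations. A hedged sketch of how that term could be computed is shown here, assuming hard-concrete gates of the kind used in the L0-regularization literature. The stretch limits and temperature (`GAMMA`, `ZETA`, `BETA`), the per-gate parameter `log_alpha`, and the reading of "E[z]" as the probability that a gate is nonzero are all assumptions; the paper's exact parameterization is not given in this page.

```python
import math

# Hard-concrete stretch limits and temperature. These are commonly used
# defaults in the L0-regularization literature, assumed here for illustration.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def prob_gate_open(log_alpha):
    """Probability that a hard-concrete gate with location log_alpha is nonzero."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

def expected_l0(log_alphas):
    """Differentiable surrogate for the L0 norm: expected number of open gates."""
    return sum(prob_gate_open(a) for a in log_alphas)

def model_cost(log_alphas, n):
    """Model term from the quoted passage: (log n / n) * ||theta||_0, relaxed."""
    return (math.log(n) / n) * expected_l0(log_alphas)

def mdl_objective(data_term, log_alphas, n):
    """Sum of the two terms in the quoted objective.

    data_term stands for the second term of the quoted sum (for example a
    per-sample fitting loss); how the paper defines it is not specified here.
    """
    return model_cost(log_alphas, n) + data_term

# Usage: three gates, one nearly closed, on a dataset of n = 1000 samples.
total = mdl_objective(data_term=0.05, log_alphas=[4.0, 3.0, -6.0], n=1000)
```

Since `prob_gate_open` is smooth in `log_alpha`, the penalty can be minimized by gradient descent alongside the fitting loss, which matches the page's description of a differentiable L0 relaxation.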
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- KANs need curvature: penalties for compositional smoothness
  A curvature penalty for KANs, derived to respect compositional effects and equipped with a proven upper bound on full-model curvature, produces smoother activations while preserving accuracy.