pith. machine review for the scientific record.

arxiv: 2605.05914 · v1 · submitted 2026-05-07 · 🪐 quant-ph · cs.AI · cs.LG

Recognition: unknown

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

Augustine Kshetrimayum, Borja Aizpurua, Roman Orus, Saeed S. Jahromi, Sukhbinder Singh

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:28 UTC · model grok-4.3

classification 🪐 quant-ph · cs.AI · cs.LG
keywords quantum adapters · large language models · Cayley transform · unitary circuits · quantum hardware · perplexity · transformer models · quantum utility

The pith

Cayley-parameterised unitary adapters on quantum hardware improve Llama 3.1 8B perplexity by 1.4% using only 6000 extra parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that small quantum circuit blocks, parameterised through the Cayley transform, can be inserted into the projection layers of a frozen large language model and run on real quantum processors to improve performance. For the 8-billion-parameter Llama 3.1 model this produces a 1.4% drop in perplexity while adding just 6000 trainable parameters. Experiments on a smaller 135-million-parameter model recover 83% of the performance lost to compression and yield correct answers on questions that the purely classical versions miss. The results also identify a noise-expressivity transition that marks the point at which quantum hardware begins to help.

Core claim

Cayley-parameterised unitary adapters inserted into the frozen projection layers of pre-trained LLMs improve the perplexity of Llama 3.1 8B by 1.4% with only 6000 additional parameters when executed end-to-end on real quantum hardware. On a smaller model the same adapters produce monotonically better perplexity as block dimension grows, recover 83% of compression-induced degradation, answer questions that classical baselines fail, and exhibit a sharp noise-expressivity phase transition.
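
The headline figure is a relative change in perplexity, i.e. the exponential of the mean per-token negative log-likelihood on WikiText. A minimal reference implementation of the metric, assuming nothing about the paper's actual evaluation harness (context length, striding, tokenisation), looks like this:

```python
# Reference definition of the metric behind the 1.4% figure.  This is the
# standard perplexity formula, not a reproduction of the paper's pipeline.
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (num_tokens, vocab_size) next-token scores; targets: (num_tokens,) actual next tokens."""
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll).item()

# A 1.4% improvement means ppl_adapted ≈ 0.986 * ppl_baseline,
# i.e. roughly a 0.014-nat reduction in mean per-token NLL.
```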

What carries the argument

The Cayley-parameterised unitary adapter, a quantum circuit block that uses the Cayley transform to create trainable unitary matrices inserted into LLM projection layers.
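
Concretely, the Cayley transform sends a skew-symmetric real matrix A to the orthogonal matrix U = (I + A)^(-1)(I - A) (skew-Hermitian to unitary in the complex case), so the block stays exactly unitary while its entries are trained by ordinary gradient descent. The sketch below is a purely classical stand-in: the class names are invented here, the block-diagonal application of one shared small unitary is only a plausible reading of the paper's BDU configurations, and the compilation of the resulting unitary to a hardware circuit (Figure 2) is not reproduced.

```python
import torch
import torch.nn as nn

class CayleyAdapter(nn.Module):
    """Trainable orthogonal block built with the Cayley transform (illustrative sketch).

    A raw d x d parameter tensor is antisymmetrised to a skew-symmetric A
    (d*(d-1)/2 effective degrees of freedom) and mapped to the orthogonal
    matrix U = (I + A)^{-1}(I - A).  On hardware this unitary would be
    compiled to a parameterised circuit; here it is applied classically.
    """

    def __init__(self, block_dim: int):
        super().__init__()
        self.block_dim = block_dim
        self.theta = nn.Parameter(torch.zeros(block_dim, block_dim))

    def unitary(self) -> torch.Tensor:
        a = self.theta - self.theta.T                     # skew-symmetric
        eye = torch.eye(self.block_dim, device=self.theta.device)
        return torch.linalg.solve(eye + a, eye - a)       # orthogonal by construction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply the same small unitary block-diagonally across the feature dim
        # (an assumed reading of the paper's "block-diagonal unitary" adapters;
        # requires the feature dim to be a multiple of block_dim).
        *lead, d = x.shape
        blocks = x.reshape(*lead, d // self.block_dim, self.block_dim)
        return (blocks @ self.unitary().T).reshape(*lead, d)


class AdaptedProjection(nn.Module):
    """A frozen projection layer with a CUA-style block on its input side."""

    def __init__(self, frozen_linear: nn.Linear, block_dim: int):
        super().__init__()
        self.adapter = CayleyAdapter(block_dim)
        self.proj = frozen_linear
        for p in self.proj.parameters():
            p.requires_grad = False                       # only adapter parameters train

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.adapter(x))
```

A convenient property of this parameterisation: at initialisation theta = 0 gives U = I, so the frozen backbone's behaviour is unchanged until the adapter is actually trained.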

If this is right

  • Perplexity improves monotonically as the dimension of the unitary block increases.
  • 83% of the performance lost to model compression is recovered on the smaller model.
  • The adapters produce correct answers on questions where both classical baselines fail.
  • A noise-expressivity phase transition appears that indicates the route to utility at larger qubit counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to other transformer architectures without retraining the base model weights.
  • At higher qubit counts the phase transition may allow quantum advantage in inference tasks beyond language modeling.
  • Memory scaling advantages would appear if the quantum adapters replace larger classical projection matrices (see the parameter-count sketch after this list).
  • The same Cayley parameterization might be tested on different quantum hardware platforms to check hardware independence.
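
To make the memory-scaling speculation concrete: a hardware-efficient circuit on n qubits with a fixed number of rotation layers carries a parameter count polynomial in n, while a classical dense matrix over the same 2^n-dimensional block needs 4^n real entries. The numbers below are illustrative only and assume a generic layered ansatz, not the paper's specific circuits.

```python
# Back-of-envelope comparison behind the memory-scaling point above.
# Assumption: a generic hardware-efficient ansatz with `layers` rounds of
# single-qubit rotations has layers * n_qubits * 3 trainable angles, while a
# classical dense matrix over the same 2^n-dimensional block has 4^n entries.
def circuit_params(n_qubits: int, layers: int, rotations_per_qubit: int = 3) -> int:
    return layers * n_qubits * rotations_per_qubit

def dense_block_params(n_qubits: int) -> int:
    dim = 2 ** n_qubits
    return dim * dim

for n in (2, 6, 10):
    print(f"{n} qubits: circuit ~{circuit_params(n, layers=10)}, dense block {dense_block_params(n)}")
# 2 qubits: 60 vs 16 -- no advantage yet; 10 qubits: 300 vs 1,048,576 -- the gap
# that would have to survive hardware noise for a memory advantage to materialise.
```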

Load-bearing premise

That the measured perplexity gains come from the quantum execution of the adapters rather than from classical training of their parameters or from unaccounted classical post-processing.

What would settle it

Execute the identical adapter parameters on a classical simulator of the same circuit depth and qubit count and observe whether the perplexity improvement vanishes.
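
A minimal version of that control, sketched with Qiskit's Aer simulator under heavy assumptions: the generic 2-qubit circuit below stands in for the paper's depth-19 Heron-native CUA (Figure 2), and the trained angles would have to come from the actual QPU run rather than the placeholders shown.

```python
# Sketch of the settling experiment: re-run the *same* trained adapter angles on
# a noiseless classical simulator and compare downstream perplexity.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

def adapter_circuit(angles):
    qc = QuantumCircuit(2)
    qc.ry(angles[0], 0)
    qc.ry(angles[1], 1)
    qc.cz(0, 1)
    qc.ry(angles[2], 0)
    qc.ry(angles[3], 1)
    qc.measure_all()
    return qc

trained_angles = [0.1, -0.7, 1.3, 0.4]   # placeholders: use the QPU-trained values
sim = AerSimulator()                      # ideal (noiseless) execution
counts = sim.run(transpile(adapter_circuit(trained_angles), sim), shots=8192).result().get_counts()
print(counts)
# Feed these simulated adapter outputs through the same classical pipeline used
# for the hardware runs; if the 1.4% perplexity gain persists, quantum execution
# per se is not what produced it.
```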

Figures

Figures reproduced from arXiv: 2605.05914 by Augustine Kshetrimayum, Borja Aizpurua, Roman Orus, Saeed S. Jahromi, Sukhbinder Singh.

Figure 1
Figure 1. Cayley Unitary Adapter (CUA) architecture. a, Full-model backbone. Vertical pipeline of L frozen transformer blocks (L = 30 for SmolLM2; L = 32 for Llama-3.1-8B), bracketed by the input embedding and the LM head. The dashed-border block “Transformer ℓ” is the layer expanded in panel (b). b, Detail of one Llama transformer block. CUA blocks (purple) are inserted on the input side of each linear projection … view at source ↗
Figure 2
Figure 2. Realisation of the Cayley Unitary Adapter on ibm_basquecountry. a, Transpiled 2-qubit CUA circuit on physical qubits q[80] and q[81], in the IBM Heron r2 native gate set {CZ, SX, RZ, X}. Depth 19; 12 SX (magenta √X boxes) + 9 RZ (blue boxes, with explicit angles in radians) + 3 CZ (vertical magenta connectors) + 2 reset operations. Measured outcomes are routed to the classical register c2. Single-circui… view at source ↗
Figure 3
Figure 3. Progression of QPU execution experiments (SmolLM2-135M). Timeline of the four QPU execution milestones achieved primarily on the ibm_basquecountry IBM System Two processor (156 qubits, IBM Heron r2; the same device used for all Llama-3.1-8B QPU runs in this work), with cross-validation milestones on ibm_strasbourg (127 qubits) between February and March 2025; transpilation and packing parameters in this … view at source ↗
Figure 4
Figure 4. WikiText perplexity as a function of unitary block dimension. Left axis (logarithmic): noiseless WikiText perplexity for 210-layer Cayley adapters in three regimes: unconstrained dense matrices, orthogonal unitaries, and sign-constrained unitaries (QPU-compatible), all as a function of block dimension from 4 × 4 (2 qubits) to full input dimension (384–896, 9–10 qubits). All configurations applied to the … view at source ↗
Figure 5
Figure 5. Perplexity vs. total parameter count. WikiText perplexity (compressed SmolLM2 backbone + adapter overhead) for 210-layer adapters in three regimes. Unitary adapters (green: sign-constrained; orange: orthogonal) achieve comparable perplexity to unconstrained dense matrices (blue) with approximately 50% fewer parameters, demonstrating that the orthogonality constraint acts as an effective regulariser rather … view at source ↗
Figure 6
Figure 6. Multi-benchmark comparison across model scales. a, Benchmark performance (WikiText PPL, LAMBADA PPL, BoolQ accuracy, HellaSwag normalised accuracy) for SmolLM2 (three configurations: uncompressed original, compressed baseline, 210-layer CUA-enhanced). b, Perplexity improvements for Llama 3.1 8B under different adapter configurations (2-qubit BDU, full unitary + sign, unconstrained, all-sublayer BDU), inclu… view at source ↗
read the original abstract

Large language models (LLMs) have transformed artificial intelligence, yet classical architectures impose a fundamental constraint: every trainable parameter demands classical memory that scales unfavourably with model size. Quantum computing offers a qualitatively different pathway, but practical demonstrations on real hardware have remained elusive for models of practical relevance. Here we show that Cayley-parameterised unitary adapters -- quantum circuit blocks inserted into the frozen projection layers of pre-trained LLMs and executed on a 156-qubit IBM Quantum System Two superconducting processor -- improve the perplexity of Llama 3.1 8B, an 8-billion-parameter model in widespread use, by 1.4% with only 6,000 additional parameters and end-to-end inference validated on real Quantum Processing Unit (QPU). A systematic study on SmolLM2 (135M parameters), chosen for its tractability, reveals monotonically improving perplexity with unitary block dimension, 83% recovery of compression-induced degradation, and correct answers to questions that both classical baselines fail -- with a sharp noise-expressivity phase transition identifying the concrete path to quantum utility at larger qubit scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that Cayley-parameterised unitary adapters inserted into the frozen projection layers of pre-trained LLMs and executed on a 156-qubit IBM Quantum System Two processor improve the perplexity of Llama 3.1 8B by 1.4% using only 6,000 additional parameters, with end-to-end QPU inference validated. On the smaller SmolLM2 model, the approach yields monotonically improving perplexity with unitary block dimension, 83% recovery of compression-induced degradation, correct answers on questions where classical baselines fail, and a sharp noise-expressivity phase transition.

Significance. If the attribution to genuine quantum expressivity on hardware holds after proper controls, the result would be significant as one of the first demonstrations of quantum hardware providing measurable benefits to a practical-scale LLM via low-parameter adapters. The phase-transition analysis on SmolLM2 offers a potential scaling roadmap, and the emphasis on real QPU execution rather than simulation strengthens the practical relevance.

major comments (3)
  1. Abstract: the 1.4% perplexity gain on Llama 3.1 8B is stated without error bars, number of runs, or statistical significance tests, which is load-bearing for the headline claim given the small effect size.
  2. Abstract and results sections: no ablation or baseline is reported using a classical low-rank adapter (or equivalent classical unitary parameterization) with exactly the same 6,000 parameters and identical training procedure, leaving open whether the gain arises from the Cayley quantum blocks on the QPU or from classical training/post-processing. (A parameter-budget sizing sketch for such a control follows these comments.)
  3. SmolLM2 study: the noise-expressivity phase transition and 83% recovery are demonstrated only on the 135M model; the manuscript provides no corresponding ablation or scaling confirmation for the Llama 3.1 8B result, weakening the claim that the observed benefits generalize or identify a concrete path to quantum utility at 8B scale.
minor comments (2)
  1. The abstract states that the adapters deliver 'correct answers to questions that both classical baselines fail' but supplies no quantitative metrics, example questions, or dataset details to support this beyond perplexity.
  2. Training procedure for the 6,000 adapter parameters (optimizer, learning rate, epochs, hardware noise mitigation) is not described, which affects reproducibility of the reported numbers.
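
One way to see why the second major comment matters: a conventional LoRA adapter on even a single full-width projection of an 8B-scale model already exceeds the paper's 6,000-parameter budget, so any fair classical control has to be specified carefully. The arithmetic below assumes the standard Llama 3 8B hidden size of 4096 and is a sizing illustration, not a description of anything in the manuscript.

```python
# Sizing sketch for a parameter-matched classical control: a LoRA adapter on a
# single d_in x d_out projection adds rank * (d_in + d_out) trainable parameters.
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

budget = 6000
d = 4096                                     # assumed Llama 3 8B hidden size
per_matrix = lora_params(d, d, rank=1)       # 8192 params for one rank-1 adapter
print(per_matrix, per_matrix <= budget)      # already over budget on a single projection
# A fair classical baseline therefore needs an unusually constrained design
# (e.g. rank-1 adapters on small sub-blocks), which is why the referee asks for
# it to be specified explicitly.
```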

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor and controls that we will address in the revision to strengthen the attribution of results to the quantum hardware components.

read point-by-point responses
  1. Referee: Abstract: the 1.4% perplexity gain on Llama 3.1 8B is stated without error bars, number of runs, or statistical significance tests, which is load-bearing for the headline claim given the small effect size.

    Authors: We agree that uncertainty quantification is essential given the modest effect size. The main text already describes the use of multiple independent training and inference runs on the QPU, but this detail is not reflected in the abstract. In the revised manuscript we will update the abstract to report the improvement as a mean with standard error across five runs and add a statement confirming statistical significance via a paired t-test (a minimal sketch of such a test appears after this exchange). revision: yes

  2. Referee: Abstract and results sections: no ablation or baseline is reported using a classical low-rank adapter (or equivalent classical unitary parameterization) with exactly the same 6,000 parameters and identical training procedure, leaving open whether the gain arises from the Cayley quantum blocks on the QPU or from classical training/post-processing.

    Authors: This is a substantive concern. The manuscript currently contrasts the quantum adapter against the frozen compressed model without any adapter. To isolate the contribution of the Cayley-parameterized blocks executed on the QPU, we will add a matched classical control using a low-rank adapter (LoRA) with precisely 6,000 trainable parameters and the identical training schedule. The revised results section will present this comparison, allowing readers to evaluate whether the observed perplexity gains exceed those obtainable from classical parameterization alone. revision: yes

  3. Referee: SmolLM2 study: the noise-expressivity phase transition and 83% recovery are demonstrated only on the 135M model; the manuscript provides no corresponding ablation or scaling confirmation for the Llama 3.1 8B result, weakening the claim that the observed benefits generalize or identify a concrete path to quantum utility at 8B scale.

    Authors: We accept that the systematic scaling and phase-transition analysis is confined to SmolLM2. This model was selected because exhaustive variation of block dimension, noise levels, and recovery metrics requires repeated QPU access that is currently prohibitive at 8B scale. The Llama 3.1 8B experiment functions as an end-to-end hardware demonstration rather than a full scaling study. In the revision we will expand the discussion to explicitly connect the SmolLM2 noise-expressivity transition to the Llama result, framing the transition as a hardware roadmap while clearly stating the absence of equivalent ablations at 8B scale. revision: partial
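
The significance check promised in response 1 amounts to a paired test over matched per-run perplexities. A minimal sketch, assuming the per-run values from the five QPU runs are available (no numbers from the manuscript appear below):

```python
# Paired t-test over per-run WikiText perplexities for matched baseline/adapter runs.
import numpy as np
from scipy import stats

def report(baseline_ppl: np.ndarray, adapted_ppl: np.ndarray) -> None:
    diff = baseline_ppl - adapted_ppl                  # positive = adapter helped
    t, p = stats.ttest_rel(baseline_ppl, adapted_ppl)
    se = diff.std(ddof=1) / np.sqrt(len(diff))
    print(f"mean improvement {diff.mean():.3f} ± {se:.3f} (SE), t = {t:.2f}, p = {p:.4f}")

# report(np.array(baseline_runs), np.array(adapter_runs))  # the five matched QPU runs
```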

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on hardware

full rationale

The paper reports measured perplexity improvements from executing Cayley-parameterised unitary adapters on a real 156-qubit IBM QPU for Llama 3.1 8B, together with a systematic experimental study on SmolLM2. No load-bearing derivation, prediction, or first-principles claim is presented that reduces, via the paper's own equations or self-citations, to its inputs by construction. The 1.4% gain, 83% recovery, and noise-expressivity phase transition are stated as observed outcomes from hardware runs rather than quantities fitted or renamed within the model. Parameter count and block dimension choices are described as hardware-constrained selections, not as predictions derived from the adapter equations themselves. The analysis is therefore grounded in external benchmarks rather than in the paper's own constructions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that the Cayley parameterization produces trainable unitaries whose quantum execution yields measurable advantage over classical equivalents of the same parameter count. No new physical axioms are introduced beyond standard quantum circuit execution on superconducting hardware.

free parameters (2)
  • unitary block dimension
    Chosen to produce the observed monotonic improvement and phase transition; directly controls the 6000-parameter budget.
  • adapter insertion locations
    Selected within frozen projection layers; the specific choice affects which classical weights are bypassed.
axioms (2)
  • standard math: the Cayley transform maps skew-Hermitian (in the real case, skew-symmetric) matrices to unitary (orthogonal) matrices
    Invoked to ensure the adapter blocks remain valid quantum operations.
  • domain assumption: quantum hardware noise is the dominant error source at current qubit counts
    Used to interpret the noise-expressivity phase transition.
invented entities (1)
  • Cayley unitary adapter (no independent evidence)
    purpose: Compact parameterization of trainable unitary blocks that can be executed on quantum hardware while adding few classical parameters.
    New construct introduced to interface classical LLM layers with quantum circuits.

pith-pipeline@v0.9.0 · 5517 in / 1680 out tokens · 45022 ms · 2026-05-08T11:28:00.086211+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1] Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  2. [2] Meta AI. Llama 3 model card. https://llama.meta.com (2024).

  3. [3] Kaplan, J. et al. Scaling laws for neural language models. Preprint at arXiv:2001.08361 (2020).

  4. [4] Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. Proc. Int. Conf. Learn. Represent. (2022).

  5. [5] Novikov, A., Podoprikhin, D., Osokin, A. & Vetrov, D. Tensorizing neural networks. Adv. Neural Inf. Process. Syst. 28, 442–450 (2015).

  6. [6] Orús, R. A practical introduction to tensor networks: matrix product states and projected entangled pair states. Ann. Phys. 349, 117–158 (2014).

  7. [7] Orús, R. Tensor networks for complex quantum systems. Nat. Rev. Phys. 1, 538–550 (2019).

  8. [8] Cerezo, M. et al. Variational quantum algorithms. Nat. Rev. Phys. 3, 625–644 (2021).

  9. [9] Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).

  10. [10] Coecke, B., de Felice, G., Meichanetzidis, K. & Toumi, A. Foundations for near-term quantum natural language processing. Preprint at arXiv:2012.03755 (2020).

  11. [11] Bausch, J. Recurrent quantum neural networks. Adv. Neural Inf. Process. Syst. 33, 1368–1379 (2020).

  12. [12] Yu, S. et al. Quantum-enhanced large language model efficient fine tuning. Preprint at arXiv:2503.12790 (2025).

  13. [13] Li, H., Zhang, X. & Wang, Y. Quantum large language model fine-tuning. Preprint at arXiv:2504.08732 (2025).

  14. [14] Zhao, X., Wu, H. & Chen, L. Training quantum self-attention on a 72-qubit quantum computer. IEEE Quantum Week (2024).

  15. [15] Li, L. et al. Quantum knowledge distillation for large language models. Preprint at arXiv:2505.13205 (2025).

  16. [16] Chen, C.-S. & Kuo, E.-J. Quantum-enhanced natural language generation: a multi-model framework with hybrid quantum-classical architectures. Preprint at arXiv:2508.21332 (2025).

  17. [17] Gupta, A., Kaur, K., Gupta, V. & Shah, C. QLENS: Towards a quantum perspective of language transformers. Preprint at arXiv:2510.11963 (2025).

  18. [18] IBM Quantum. IBM Heron Processor: Technical Overview. https://www.ibm.com/quantum/processors (2024).

  19. [19] Hugging Face. SmolLM2: Compact language models. https://huggingface.co/HuggingFaceTB/SmolLM2-135M (2024).

  20. [20] Vandersypen, L. M. K. et al. Experimental realization of Shor's quantum factoring algorithm using nuclear magnetic resonance. Nature 414, 883–887 (2001).

  21. [21] Cayley, A. Sur quelques propriétés des déterminants gauches. J. Reine Angew. Math. 32, 119–123 (1846).

  22. [22] Lezcano-Casado, M. & Martínez-Rubio, D. Cheap orthogonal constraints in neural networks: A simple parameterization of the orthogonal and unitary group. Proc. Int. Conf. Mach. Learn. 3794–3803 (2019).

  23. [23] Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at arXiv:1503.02531 (2015).

  24. [24] Multiverse Computing. CompactifAI: Extreme compression of large language models using quantum-inspired tensor networks. Preprint at arXiv:2401.14109 (2024).

  25. [25] Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. Preprint at arXiv:1609.07843 (2016).

  26. [26] Paperno, D. et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proc. Annu. Meet. Assoc. Comput. Linguist. 1525–1534 (2016).

  27. [27] Clark, C. et al. BoolQ: Exploring the surprising difficulty of natural yes/no questions. Proc. Conf. North Am. Chapter Assoc. Comput. Linguist. 2924–2936 (2019).

  28. [28] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. HellaSwag: Can a machine really finish your sentence? Proc. Annu. Meet. Assoc. Comput. Linguist. 4791–4800 (2019).

  29. [29] Aizpurua, B., Jahromi, S. S., Singh, S. & Orús, R. Quantum large language models via tensor network disentanglers. Preprint at arXiv:2410.17397 (2024).

  30. [30] Aizpurua, B., Singh, S. & Orús, R. Classical neural networks on quantum devices via tensor network disentanglers: A case study in image classification. Preprint at arXiv:2509.06653 (2025).