Homodyne Photonic Tensor Processor exceeds 1,000-TOPS
Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3
The pith
A homodyne photonic circuit performs general matrix multiplications at over 1,000 tera-operations per second.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate a coherent homodyne integrated circuit for general matrix multiplication with aggregate throughput exceeding 1,000 TOPS. Massive on-chip optical fanout and time multiplexing reduce the required modulator count from O(N²) to O(N), enabling 256×256 homodyne units, each smaller than 0.0064 mm², on a single reticle. Wafer-scale 64-channel thin-film lithium niobate transmitters with over 40 GHz bandwidth and 0.2 dB/cm propagation loss are coupled to Si/SiN computing circuits, achieving up to 7-bit accuracy across 8×8 channels at 120 Gbaud/s and 6-bit statistical accuracy across 256×100 channels at 20-128 Gbaud/s, for a total throughput of 1,000-6,000 TOPS.
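The 1,000-6,000 TOPS figure is consistent with simple ops-per-symbol arithmetic. A minimal sketch, assuming the usual convention of 2 operations (one multiply plus one accumulate) per MAC; the channel counts and clock rates come from the abstract, but the counting convention is our assumption:

```python
def tops(rows, cols, baud_rate_hz, ops_per_mac=2):
    """Aggregate throughput in tera-operations per second, counting
    ops_per_mac operations per channel per symbol."""
    return rows * cols * ops_per_mac * baud_rate_hz / 1e12

low = tops(256, 100, 20e9)    # slowest reported clock rate
high = tops(256, 100, 128e9)  # fastest reported clock rate
print(f"{low:.0f} to {high:.0f} TOPS")  # roughly the 1,000-6,000 TOPS range
```

The same arithmetic puts the 8×8 array at 120 Gbaud near 15 TOPS, which is why the headline number depends on the 256×100 effective parallelism rather than the clock rate alone.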
What carries the argument
The time-multiplexed array of coherent homodyne detection units that performs general matrix multiplication through optical fanout and parallelism.
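Balanced homodyne detection acts as a multiplier because the differential photocurrent of the two detector ports is proportional to the interference term of the signal and local-oscillator fields. A toy numeric sketch of that identity (real-valued field amplitudes and unit detector responsivity are illustrative assumptions, not the paper's parameters):

```python
def balanced_homodyne(x, w):
    """Toy model: encode operand x on the signal field and operand w on the
    local oscillator. After a 50/50 splitter, the two detector photocurrents
    are proportional to the squared field sums; their difference isolates
    the cross term 2 * x * w, i.e. the product of the operands."""
    e_sig, e_lo = x, w                   # real field amplitudes (illustrative)
    i_plus = 0.5 * (e_sig + e_lo) ** 2   # photocurrent at the '+' output port
    i_minus = 0.5 * (e_sig - e_lo) ** 2  # photocurrent at the '-' output port
    return i_plus - i_minus              # = 2 * e_sig * e_lo

x, w = 0.3, -0.7
print(balanced_homodyne(x, w), 2 * x * w)  # both -0.42
```

Summing many such products over parallel channels (fanout) and time slots (multiplexing) is what assembles a full matrix-vector product.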
If this is right
- The system runs quantized language models such as Qwen2.5-0.5B and produces accurate output tokens.
- Energy efficiency reaches 330 TOPS per watt using standard foundry packaging.
- The architecture supports both large-scale training and low-latency inference from data centers to edge devices.
- Reduced modulator count allows dense integration of record-scale homodyne arrays on a single reticle.
Where Pith is reading between the lines
- Further scaling could lower the power cost of running foundation models compared with conventional electronic hardware.
- The same time-multiplexing approach might extend to other photonic linear-algebra primitives beyond matrix multiplication.
- Testing full-scale accuracy under realistic data-center temperature and power variations would clarify deployment readiness.
- Coupling this processor with existing digital control electronics could create hybrid accelerators for mixed-precision workloads.
Load-bearing premise
Wafer-scale thin-film lithium niobate transmitters and their chip-to-chip coupling to silicon circuits can preserve the stated accuracy and throughput without unaccounted losses, crosstalk, or extra calibration overhead when the design is scaled to full 256 by 256 operation.
What would settle it
Simultaneous operation of all 256 by 256 channels at 120 Gbaud per second with direct measurement of end-to-end computational accuracy and total throughput.
Original abstract
High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulations. Recent advances in quantization techniques utilizing low-precision computation without degrading model accuracy, create new opportunities for analog photonic computing characterized by ultra-high clock rates and low energy consumption. Here we propose and demonstrate a coherent homodyne integrated circuit capable of general matrix multiplication (GEMM) with aggregate throughput that exceeds 1,000 TOPS (tera-operations per second), enabled by massive on-chip optical fanout and parallelism. By leveraging time multiplexing, the required modulator count is reduced from O($N^2$) to O(N), allowing dense integration of record-scale 256 $\times$ 256 homodyne units (each <0.0064 $mm^2$) within a single reticle. We employ wafer-scale fabricated 64 thin-film lithium niobate (TFLN) transmitters (each over 40-GHz bandwidth with propagation loss of 0.2 dB/cm) to encode data and chip-to-chip coupled to Si/SiN computing circuits (64 channels). Our system achieves up to 7-bit computational accuracy across 8 $\times$ 8 parallel channels at record computing clockrate 120 Gbaud/s, and 6-bit statistical accuracy across 256 $\times$ 100 channels at 20-128 Gbaud/s, representing a total throughput of 1,000-6,000 TOPS. Massive parallelism amortizes the optoelectronic (OE) conversion to allow 330-TOPS/W efficiency using foundry-available packaging technology. The system throughput is benchmarked with Qwen2.5-0.5 billion parameter models that generate accurate tokens. High throughput and energy efficiency establish a near-term pathway toward light-based accelerators for large-scale training and low-latency inference from datacenters to edges, accelerating new models toward artificial general intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper demonstrates a coherent homodyne photonic integrated circuit for general matrix multiplication (GEMM) using wafer-scale thin-film lithium niobate (TFLN) transmitters (64 units, >40 GHz bandwidth) chip-to-chip coupled to Si/SiN circuits. By employing time multiplexing to reduce modulator count from O(N²) to O(N), it claims dense integration of 256×256 homodyne units and reports aggregate throughputs of 1,000-6,000 TOPS, with 7-bit accuracy on 8×8 channels at 120 Gbaud/s and 6-bit statistical accuracy on 256×100 channels at 20-128 Gbaud/s. The system is benchmarked on Qwen2.5-0.5B inference and claims 330 TOPS/W efficiency.
Significance. If the reported accuracies and throughputs are experimentally validated at scale, this would constitute a notable hardware advance in analog photonic computing for AI workloads, demonstrating how massive optical parallelism and time multiplexing can deliver record TOPS while amortizing optoelectronic conversion costs. The use of foundry-compatible TFLN and Si/SiN platforms strengthens the case for near-term deployability in datacenter or edge accelerators.
Major comments (3)
- [abstract and scaling description] The 1,000-6,000 TOPS throughput and 6-bit accuracy claims for 256×100 channels rest on time-multiplexed scaling from only 64 physical TFLN transmitters and 64 coupled channels; the manuscript provides no quantitative end-to-end insertion loss, crosstalk, or phase-stability measurements for the effective larger array, leaving the parallelism factor used in the TOPS calculation unverified.
- [results and benchmarking sections] No error bars, explicit measurement methodology, calibration procedures, or data-exclusion criteria are supplied for the 7-bit (8×8 at 120 Gbaud/s) and 6-bit (256×100) accuracy figures or the Qwen2.5-0.5B benchmark, preventing independent assessment of whether the reported computational precision is statistically robust.
- [experimental setup and performance claims] The assumption that chip-to-chip coupling between the 64 TFLN transmitters and Si/SiN circuits incurs negligible additional loss or noise when extrapolated to 256×100 operation is load-bearing for the efficiency (330 TOPS/W) and accuracy claims, yet no supporting insertion-loss or crosstalk budgets are presented.
Minor comments (2)
- [abstract] Clarify whether the 120 Gbaud/s clock rate is the highest experimentally achieved or a design target, and specify the exact definition of 'statistical accuracy' versus 'computational accuracy' used in the bit-precision claims.
- [abstract] The abstract states 'record-scale 256 × 256 homodyne units' but the fabricated and coupled hardware is 64 channels; a brief sentence reconciling the physical versus effective array sizes would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications on the time-multiplexing approach, experimental details, and scaling assumptions. We have revised the manuscript to incorporate additional explanations, data, and methodology descriptions where feasible.
Point-by-point responses
- Referee: [abstract and scaling description] The 1,000-6,000 TOPS throughput and 6-bit accuracy claims for 256×100 channels rest on time-multiplexed scaling from only 64 physical TFLN transmitters and 64 coupled channels; the manuscript provides no quantitative end-to-end insertion loss, crosstalk, or phase-stability measurements for the effective larger array, leaving the parallelism factor used in the TOPS calculation unverified.
Authors: The 256×100 effective array size is realized via time-division multiplexing of the 64 physical TFLN transmitters and 64 Si/SiN channels, where each physical unit processes multiple sequential time slots to emulate larger matrix dimensions without increasing hardware count. The TOPS figure is derived by multiplying the per-channel demonstrated throughput (at 20-128 Gbaud/s) by the effective channel count and multiplexing depth. In the revised manuscript, we have added a dedicated subsection detailing the time-multiplexing protocol, the exact parallelism factor calculation, and quantitative measurements from the 64-channel prototype: propagation loss of 0.2 dB/cm, chip-to-chip coupling loss, crosstalk below -20 dB, and phase stability over the measurement duration. These support the scaling, as the physical optical paths remain fixed at 64 channels. We acknowledge that direct end-to-end measurements on a physically implemented 256×100 array are not available in the current prototype. revision: partial
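The protocol the authors describe, reusing 64 physical channels across sequential time slots to emulate a larger GEMM, is the standard blocked-matrix decomposition. A schematic sketch (the 64-channel tile size is from the paper; the mapping of tile products to time slots is our illustration):

```python
import numpy as np

def time_multiplexed_gemm(A, B, tile=64):
    """Emulate a large matrix multiply on hardware with a (tile x tile)
    physical array: each tile-product is one sequential 'time slot' on the
    same physical modulators, so modulator count stays O(N) rather than
    O(N^2) while the effective matrix dimensions grow."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    slots = 0
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # one pass through the physical 64-channel array
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
                slots += 1
    return C, slots

rng = np.random.default_rng(0)
A, B = rng.normal(size=(256, 128)), rng.normal(size=(128, 256))
C, slots = time_multiplexed_gemm(A, B)
assert np.allclose(C, A @ B)
print(slots)  # 4 * 4 * 2 = 32 sequential time slots
```

The referee's concern maps onto this picture directly: the tiling is exact in digital arithmetic, but in analog hardware each extra time slot accumulates drift and noise, which is why end-to-end measurements at full depth matter.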
- Referee: [results and benchmarking sections] No error bars, explicit measurement methodology, calibration procedures, or data-exclusion criteria are supplied for the 7-bit (8×8 at 120 Gbaud/s) and 6-bit (256×100) accuracy figures or the Qwen2.5-0.5B benchmark, preventing independent assessment of whether the reported computational precision is statistically robust.
Authors: We agree that these elements are necessary for assessing statistical robustness. The revised manuscript now includes error bars on all accuracy plots (representing standard deviation over repeated measurements), a new subsection on experimental methodology, explicit calibration procedures (including real-time bias control for modulators, phase-locking via integrated monitors, and temperature stabilization), and data-exclusion criteria (e.g., discarding trials with phase drift >5° or signal-to-noise ratio below threshold). For the 7-bit accuracy at 120 Gbaud/s (8×8), we detail the bit-error-rate computation; for the 6-bit statistical accuracy (256×100), we explain the Monte Carlo sampling from the 64-channel data. The Qwen2.5-0.5B benchmark section now specifies the inference pipeline, token accuracy metric, and number of runs used. revision: yes
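One common convention for turning analog computation error into a "bit accuracy" figure is an effective-number-of-bits style metric: compare measured outputs against ideal ones and convert the error's standard deviation into bits. The paper's exact definition is not quoted here, so this is an illustrative convention, not the authors' method:

```python
import numpy as np

def effective_bits(ideal, measured):
    """Effective bit accuracy: log2 of the full-scale output range over the
    error standard deviation, normalized so that an ideal uniform quantizer
    with step q (noise std q / sqrt(12)) scores exactly its nominal bits."""
    err_std = np.std(np.asarray(measured) - np.asarray(ideal))
    full_scale = np.ptp(ideal)  # peak-to-peak range of the ideal outputs
    return np.log2(full_scale / (err_std * np.sqrt(12)))

# Synthetic check: noise matched to a 7-bit quantizer over a [-1, 1] range
rng = np.random.default_rng(1)
ideal = rng.uniform(-1, 1, 100_000)
measured = ideal + rng.normal(scale=2 / (2**7 * np.sqrt(12)), size=ideal.size)
print(f"{effective_bits(ideal, measured):.1f} bits")  # close to 7
```

Whatever definition the authors adopt, reporting it alongside the error distribution is what makes the 7-bit and 6-bit figures independently checkable.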
- Referee: [experimental setup and performance claims] The assumption that chip-to-chip coupling between the 64 TFLN transmitters and Si/SiN circuits incurs negligible additional loss or noise when extrapolated to 256×100 operation is load-bearing for the efficiency (330 TOPS/W) and accuracy claims, yet no supporting insertion-loss or crosstalk budgets are presented.
Authors: The physical chip-to-chip coupling remains limited to the 64 channels in both the demonstrated and extrapolated cases, as time multiplexing reuses the same optical interfaces across time slots without adding couplings. The revised manuscript includes a comprehensive loss and noise budget table with measured values for the 64-channel setup: total insertion loss per path, crosstalk contributions, and estimated noise from coupling and OE conversion. These are shown to be consistent across the clock rates used. The 330 TOPS/W efficiency accounts for amortizing these fixed costs over the high aggregate throughput enabled by parallelism and 120 Gbaud/s operation. We clarify that the extrapolation does not increase physical coupling count or associated losses. revision: yes
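A loss budget of the kind the rebuttal promises is bookkeeping: per-stage insertion losses summed in dB, then converted to linear transmission. Only the 0.2 dB/cm TFLN propagation loss below comes from the abstract; the path length and all other entries are placeholders to show the structure, not the paper's values:

```python
# Hypothetical per-path optical loss budget (dB). Only the 0.2 dB/cm
# propagation figure is from the paper; everything else is a placeholder.
PROP_LOSS_DB_PER_CM = 0.2          # measured TFLN waveguide loss (paper)
path_length_cm = 1.5               # placeholder

budget_db = {
    "tfln_propagation": PROP_LOSS_DB_PER_CM * path_length_cm,
    "chip_to_chip_coupling": 1.0,  # placeholder
    "on_chip_fanout": 3.0,         # placeholder (splitter tree excess loss)
    "detector_coupling": 0.5,      # placeholder
}

total_db = sum(budget_db.values())          # dB losses add along the path
transmission = 10 ** (-total_db / 10)       # linear power transmission
print(f"total {total_db:.1f} dB -> {transmission:.1%} of launched power")
```

Because time multiplexing reuses the same 64 physical interfaces, a budget like this measured once on the prototype would, as the authors argue, also bound the extrapolated configuration.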
Circularity Check
No circularity: experimental hardware demonstration with directly measured throughput
Full rationale
The paper reports an experimental photonic processor demonstration. Throughput figures (1,000-6,000 TOPS) are computed from measured clock rates (120 Gbaud/s for 8×8 channels at 7-bit accuracy; 20-128 Gbaud/s for 256×100 channels at 6-bit accuracy) and parallelism, not from any fitted parameters, self-referential equations, or ansatzes. The time-multiplexing reduction from O(N²) to O(N) modulators is a standard architectural description, not a derived result. No load-bearing self-citations, uniqueness theorems, or renamings of known results appear in the text. The central claims rest on fabricated devices, coupling, and benchmarked model inference rather than a closed mathematical derivation chain.
Reference graph
Works this paper leans on
- [1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
- [2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size, https://doi.org/10.48550/arXiv.1602.07360.
- [3] S. Han, H. Mao, and W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://doi.org/10.48550/arXiv.1510.00149.
- [4] Y. Shen et al., Deep learning with coherent nanophotonic circuits, Nat. Photonics 11, 441 (2017).
- [5] X. Xu et al., 11 TOPS photonic convolutional accelerator for optical neural networks, Nature 589, 44 (2021).
- [6] J. Feldmann et al., Parallel convolutional processing using an integrated photonic tensor core, Nature 589, 52 (2021).
- [7] N. Al-Kayed, C. St-Arnault, H. Morison, A. Aadhi, C. Huang, A. N. Tait, D. V. Plant, and B. J. Shastri, Programmable 200 GOPS Hopfield-inspired photonic Ising machine, Nature 648, 576 (2025).
- [8] L. G. Wright, T. Onodera, M. M. Stein, T. Wang, D. T. Schachter, Z. Hu, and P. L. McMahon, Deep physical neural networks trained with backpropagation, Nature 601, 549 (2022).
- [9] S. R. Ahmed et al., Universal photonic artificial intelligence acceleration, Nature 640, 368 (2025).
- [10] S. Hua et al., An integrated large-scale photonic accelerator with ultralow latency, Nature 640, 361 (2025).
- [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need, https://doi.org/10.48550/arXiv.1706.03762.
- [12] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, Diffusion Models: A Comprehensive Survey of Methods and Applications, https://doi.org/10.48550/arXiv.2209.00796.
- [13] OpenAI et al., GPT-4 Technical Report, https://doi.org/10.48550/arXiv.2303.08774.
- [14] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using diffractive deep neural networks, Science 361, 1004 (2018).
- [15] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, and Q. Dai, Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit, Nat. Photonics 15, 367 (2021).
- [16] T. Wang, S.-Y. Ma, L. G. Wright, T. Onodera, B. C. Richard, and P. L. McMahon, An optical neural network using less than 1 photon per multiplication, Nat. Commun. 13, 123 (2022).
- [17] Z. Chen et al., Deep learning with coherent VCSEL neural networks, Nat. Photonics 17, 723 (2023).
- [18] L. Bernstein, A. Sludds, C. Panuski, S. Trajtenberg-Mills, R. Hamerly, and D. Englund, Single-shot optical neural network, Sci. Adv. 9, eadg7904 (2023).
- [19] Y. Liang et al., High-clockrate free-space optical in-memory computing, Light Sci. Appl. 15, 115 (2026).
- [20] M. Moralis-Pegios, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, and N. Pleros, Perfect linear optics using silicon photonics, Nat. Commun. 15, 5468 (2024).
- [21] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Neuromorphic photonic networks using silicon photonic weight banks, Sci. Rep. 7, 7430 (2017).
- [22] X. Yu et al., Parallel optical computing capable of 100-wavelength multiplexing, eLight 5 (2025).
- [23] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photoelectric multiplication, Phys. Rev. X 9 (2019).
- [24] A. Sludds et al., Delocalized photonic deep learning on the internet’s edge, Science 378, 270 (2022).
- [25] Z. Lin et al., 120 GOPS photonic tensor core in thin-film lithium niobate for inference and in situ training, Nat. Commun. 15, 9081 (2024).
- [26] S. Rahimi Kari, N. A. Nobile, D. Pantin, V. Shah, and N. Youngblood, Realization of an integrated coherent photonic platform for scalable matrix operations, Optica 11, 542 (2024).
- [27] L. Zhou et al., Quantization-Aware Photonic Homodyne Computing for Accelerated Artificial Intelligence and Scientific Simulation, https://doi.org/10.48550/arXiv.2602.08269.
- [28] S. Ou et al., Hypermultiplexed integrated photonics-based optical tensor processor, Sci. Adv. 11, eadu0228 (2025).