Homodyne Photonic Tensor Processor exceeds 1,000-TOPS
Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3
The pith
A homodyne photonic circuit performs general matrix multiplications at over 1,000 tera-operations per second.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate a coherent homodyne integrated circuit for general matrix multiplication with aggregate throughput exceeding 1,000 TOPS. Massive on-chip optical fanout and time multiplexing reduce the required modulator count from O(N²) to O(N), enabling 256×256 homodyne units, each smaller than 0.0064 mm², on a single reticle. Wafer-scale 64-channel thin-film lithium niobate transmitters with over 40 GHz bandwidth and 0.2 dB/cm propagation loss are coupled to Si/SiN computing circuits, achieving up to 7-bit accuracy across 8×8 channels at 120 Gbaud/s and 6-bit statistical accuracy across 256×100 channels at 20-128 Gbaud/s, for a total throughput of 1,000-6,000 TOPS.
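The 1,000-6,000 TOPS figure is consistent with simple ops-per-symbol arithmetic. A minimal sketch, assuming the usual convention of 2 operations (one multiply plus one accumulate) per MAC; the channel counts and clock rates come from the abstract, but the counting convention is our assumption:

```python
def tops(rows, cols, baud_rate_hz, ops_per_mac=2):
    """Aggregate throughput in tera-operations per second, counting
    ops_per_mac operations per channel per symbol."""
    return rows * cols * ops_per_mac * baud_rate_hz / 1e12

low = tops(256, 100, 20e9)    # slowest reported clock rate
high = tops(256, 100, 128e9)  # fastest reported clock rate
print(f"{low:.0f} to {high:.0f} TOPS")  # roughly the 1,000-6,000 TOPS range
```

The same arithmetic puts the 8×8 array at 120 Gbaud near 15 TOPS, which is why the headline number depends on the 256×100 effective parallelism rather than the clock rate alone.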
What carries the argument
The time-multiplexed array of coherent homodyne detection units that performs general matrix multiplication through optical fanout and parallelism.
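Balanced homodyne detection acts as a multiplier because the differential photocurrent of the two detector ports is proportional to the interference term of the signal and local-oscillator fields. A toy numeric sketch of that identity (real-valued field amplitudes and unit detector responsivity are illustrative assumptions, not the paper's parameters):

```python
def balanced_homodyne(x, w):
    """Toy model: encode operand x on the signal field and operand w on the
    local oscillator. After a 50/50 splitter, the two detector photocurrents
    are proportional to the squared field sums; their difference isolates
    the cross term 2 * x * w, i.e. the product of the operands."""
    e_sig, e_lo = x, w                   # real field amplitudes (illustrative)
    i_plus = 0.5 * (e_sig + e_lo) ** 2   # photocurrent at the '+' output port
    i_minus = 0.5 * (e_sig - e_lo) ** 2  # photocurrent at the '-' output port
    return i_plus - i_minus              # = 2 * e_sig * e_lo

x, w = 0.3, -0.7
print(balanced_homodyne(x, w), 2 * x * w)  # both -0.42
```

Summing many such products over parallel channels (fanout) and time slots (multiplexing) is what assembles a full matrix-vector product.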
If this is right
- The system runs quantized language models such as Qwen2.5-0.5B and produces accurate output tokens.
- Energy efficiency reaches 330 TOPS per watt using standard foundry packaging.
- The architecture supports both large-scale training and low-latency inference from data centers to edge devices.
- Reduced modulator count allows dense integration of record-scale homodyne arrays on a single reticle.
Where Pith is reading between the lines
- Further scaling could lower the power cost of running foundation models compared with conventional electronic hardware.
- The same time-multiplexing approach might extend to other photonic linear-algebra primitives beyond matrix multiplication.
- Testing full-scale accuracy under realistic data-center temperature and power variations would clarify deployment readiness.
- Coupling this processor with existing digital control electronics could create hybrid accelerators for mixed-precision workloads.
Load-bearing premise
Wafer-scale thin-film lithium niobate transmitters and their chip-to-chip coupling to silicon circuits can preserve the stated accuracy and throughput without unaccounted losses, crosstalk, or extra calibration overhead when the design is scaled to full 256 by 256 operation.
What would settle it
Simultaneous operation of all 256 by 256 channels at 120 Gbaud per second with direct measurement of end-to-end computational accuracy and total throughput.
Original abstract
High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulations. Recent advances in quantization techniques utilizing low-precision computation without degrading model accuracy, create new opportunities for analog photonic computing characterized by ultra-high clock rates and low energy consumption. Here we propose and demonstrate a coherent homodyne integrated circuit capable of general matrix multiplication (GEMM) with aggregate throughput that exceeds 1,000 TOPS (tera-operations per second), enabled by massive on-chip optical fanout and parallelism. By leveraging time multiplexing, the required modulator count is reduced from O($N^2$) to O(N), allowing dense integration of record-scale 256 $\times$ 256 homodyne units (each <0.0064 $mm^2$) within a single reticle. We employ wafer-scale fabricated 64 thin-film lithium niobate (TFLN) transmitters (each over 40-GHz bandwidth with propagation loss of 0.2 dB/cm) to encode data and chip-to-chip coupled to Si/SiN computing circuits (64 channels). Our system achieves up to 7-bit computational accuracy across 8 $\times$ 8 parallel channels at record computing clockrate 120 Gbaud/s, and 6-bit statistical accuracy across 256 $\times$ 100 channels at 20-128 Gbaud/s, representing a total throughput of 1,000-6,000 TOPS. Massive parallelism amortizes the optoelectronic (OE) conversion to allow 330-TOPS/W efficiency using foundry-available packaging technology. The system throughput is benchmarked with Qwen2.5-0.5 billion parameter models that generate accurate tokens. High throughput and energy efficiency establish a near-term pathway toward light-based accelerators for large-scale training and low-latency inference from datacenters to edges, accelerating new models toward artificial general intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper demonstrates a coherent homodyne photonic integrated circuit for general matrix multiplication (GEMM) using wafer-scale thin-film lithium niobate (TFLN) transmitters (64 units, >40 GHz bandwidth) chip-to-chip coupled to Si/SiN circuits. By employing time multiplexing to reduce modulator count from O(N²) to O(N), it claims dense integration of 256×256 homodyne units and reports aggregate throughputs of 1,000-6,000 TOPS, with 7-bit accuracy on 8×8 channels at 120 Gbaud/s and 6-bit statistical accuracy on 256×100 channels at 20-128 Gbaud/s. The system is benchmarked on Qwen2.5-0.5B inference and claims 330 TOPS/W efficiency.
Significance. If the reported accuracies and throughputs are experimentally validated at scale, this would constitute a notable hardware advance in analog photonic computing for AI workloads, demonstrating how massive optical parallelism and time multiplexing can deliver record TOPS while amortizing optoelectronic conversion costs. The use of foundry-compatible TFLN and Si/SiN platforms strengthens the case for near-term deployability in datacenter or edge accelerators.
Major comments (3)
- [abstract and scaling description] The 1,000-6,000 TOPS throughput and 6-bit accuracy claims for 256×100 channels rest on time-multiplexed scaling from only 64 physical TFLN transmitters and 64 coupled channels; the manuscript provides no quantitative end-to-end insertion loss, crosstalk, or phase-stability measurements for the effective larger array, leaving the parallelism factor used in the TOPS calculation unverified.
- [results and benchmarking sections] No error bars, explicit measurement methodology, calibration procedures, or data-exclusion criteria are supplied for the 7-bit (8×8 at 120 Gbaud/s) and 6-bit (256×100) accuracy figures or the Qwen2.5-0.5B benchmark, preventing independent assessment of whether the reported computational precision is statistically robust.
- [experimental setup and performance claims] The assumption that chip-to-chip coupling between the 64 TFLN transmitters and Si/SiN circuits incurs negligible additional loss or noise when extrapolated to 256×100 operation is load-bearing for the efficiency (330 TOPS/W) and accuracy claims, yet no supporting insertion-loss or crosstalk budgets are presented.
Minor comments (2)
- [abstract] Clarify whether the 120 Gbaud/s clock rate is the highest experimentally achieved or a design target, and specify the exact definition of 'statistical accuracy' versus 'computational accuracy' used in the bit-precision claims.
- [abstract] The abstract states 'record-scale 256 × 256 homodyne units' but the fabricated and coupled hardware is 64 channels; a brief sentence reconciling the physical versus effective array sizes would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications on the time-multiplexing approach, experimental details, and scaling assumptions. We have revised the manuscript to incorporate additional explanations, data, and methodology descriptions where feasible.
Point-by-point responses
- Referee: [abstract and scaling description] The 1,000-6,000 TOPS throughput and 6-bit accuracy claims for 256×100 channels rest on time-multiplexed scaling from only 64 physical TFLN transmitters and 64 coupled channels; the manuscript provides no quantitative end-to-end insertion loss, crosstalk, or phase-stability measurements for the effective larger array, leaving the parallelism factor used in the TOPS calculation unverified.
Authors: The 256×100 effective array size is realized via time-division multiplexing of the 64 physical TFLN transmitters and 64 Si/SiN channels, where each physical unit processes multiple sequential time slots to emulate larger matrix dimensions without increasing hardware count. The TOPS figure is derived by multiplying the per-channel demonstrated throughput (at 20-128 Gbaud/s) by the effective channel count and multiplexing depth. In the revised manuscript, we have added a dedicated subsection detailing the time-multiplexing protocol, the exact parallelism factor calculation, and quantitative measurements from the 64-channel prototype: propagation loss of 0.2 dB/cm, chip-to-chip coupling loss, crosstalk below -20 dB, and phase stability over the measurement duration. These support the scaling, as the physical optical paths remain fixed at 64 channels. We acknowledge that direct end-to-end measurements on a physically implemented 256×100 array are not available in the current prototype. revision: partial
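The protocol the authors describe, reusing 64 physical channels across sequential time slots to emulate a larger GEMM, is the standard blocked-matrix decomposition. A schematic sketch (the 64-channel tile size is from the paper; the mapping of tile products to time slots is our illustration):

```python
import numpy as np

def time_multiplexed_gemm(A, B, tile=64):
    """Emulate a large matrix multiply on hardware with a (tile x tile)
    physical array: each tile-product is one sequential 'time slot' on the
    same physical modulators, so modulator count stays O(N) rather than
    O(N^2) while the effective matrix dimensions grow."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    slots = 0
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # one pass through the physical 64-channel array
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
                slots += 1
    return C, slots

rng = np.random.default_rng(0)
A, B = rng.normal(size=(256, 128)), rng.normal(size=(128, 256))
C, slots = time_multiplexed_gemm(A, B)
assert np.allclose(C, A @ B)
print(slots)  # 4 * 4 * 2 = 32 sequential time slots
```

The referee's concern maps onto this picture directly: the tiling is exact in digital arithmetic, but in analog hardware each extra time slot accumulates drift and noise, which is why end-to-end measurements at full depth matter.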
- Referee: [results and benchmarking sections] No error bars, explicit measurement methodology, calibration procedures, or data-exclusion criteria are supplied for the 7-bit (8×8 at 120 Gbaud/s) and 6-bit (256×100) accuracy figures or the Qwen2.5-0.5B benchmark, preventing independent assessment of whether the reported computational precision is statistically robust.
Authors: We agree that these elements are necessary for assessing statistical robustness. The revised manuscript now includes error bars on all accuracy plots (representing standard deviation over repeated measurements), a new subsection on experimental methodology, explicit calibration procedures (including real-time bias control for modulators, phase-locking via integrated monitors, and temperature stabilization), and data-exclusion criteria (e.g., discarding trials with phase drift >5° or signal-to-noise ratio below threshold). For the 7-bit accuracy at 120 Gbaud/s (8×8), we detail the bit-error-rate computation; for the 6-bit statistical accuracy (256×100), we explain the Monte Carlo sampling from the 64-channel data. The Qwen2.5-0.5B benchmark section now specifies the inference pipeline, token accuracy metric, and number of runs used. revision: yes
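One common convention for turning analog computation error into a "bit accuracy" figure is an effective-number-of-bits style metric: compare measured outputs against ideal ones and convert the error's standard deviation into bits. The paper's exact definition is not quoted here, so this is an illustrative convention, not the authors' method:

```python
import numpy as np

def effective_bits(ideal, measured):
    """Effective bit accuracy: log2 of the full-scale output range over the
    error standard deviation, normalized so that an ideal uniform quantizer
    with step q (noise std q / sqrt(12)) scores exactly its nominal bits."""
    err_std = np.std(np.asarray(measured) - np.asarray(ideal))
    full_scale = np.ptp(ideal)  # peak-to-peak range of the ideal outputs
    return np.log2(full_scale / (err_std * np.sqrt(12)))

# Synthetic check: noise matched to a 7-bit quantizer over a [-1, 1] range
rng = np.random.default_rng(1)
ideal = rng.uniform(-1, 1, 100_000)
measured = ideal + rng.normal(scale=2 / (2**7 * np.sqrt(12)), size=ideal.size)
print(f"{effective_bits(ideal, measured):.1f} bits")  # close to 7
```

Whatever definition the authors adopt, reporting it alongside the error distribution is what makes the 7-bit and 6-bit figures independently checkable.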
- Referee: [experimental setup and performance claims] The assumption that chip-to-chip coupling between the 64 TFLN transmitters and Si/SiN circuits incurs negligible additional loss or noise when extrapolated to 256×100 operation is load-bearing for the efficiency (330 TOPS/W) and accuracy claims, yet no supporting insertion-loss or crosstalk budgets are presented.
Authors: The physical chip-to-chip coupling remains limited to the 64 channels in both the demonstrated and extrapolated cases, as time multiplexing reuses the same optical interfaces across time slots without adding couplings. The revised manuscript includes a comprehensive loss and noise budget table with measured values for the 64-channel setup: total insertion loss per path, crosstalk contributions, and estimated noise from coupling and OE conversion. These are shown to be consistent across the clock rates used. The 330 TOPS/W efficiency accounts for amortizing these fixed costs over the high aggregate throughput enabled by parallelism and 120 Gbaud/s operation. We clarify that the extrapolation does not increase physical coupling count or associated losses. revision: yes
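A loss budget of the kind the rebuttal promises is bookkeeping: per-stage insertion losses summed in dB, then converted to linear transmission. Only the 0.2 dB/cm TFLN propagation loss below comes from the abstract; the path length and all other entries are placeholders to show the structure, not the paper's values:

```python
# Hypothetical per-path optical loss budget (dB). Only the 0.2 dB/cm
# propagation figure is from the paper; everything else is a placeholder.
PROP_LOSS_DB_PER_CM = 0.2          # measured TFLN waveguide loss (paper)
path_length_cm = 1.5               # placeholder

budget_db = {
    "tfln_propagation": PROP_LOSS_DB_PER_CM * path_length_cm,
    "chip_to_chip_coupling": 1.0,  # placeholder
    "on_chip_fanout": 3.0,         # placeholder (splitter tree excess loss)
    "detector_coupling": 0.5,      # placeholder
}

total_db = sum(budget_db.values())          # dB losses add along the path
transmission = 10 ** (-total_db / 10)       # linear power transmission
print(f"total {total_db:.1f} dB -> {transmission:.1%} of launched power")
```

Because time multiplexing reuses the same 64 physical interfaces, a budget like this measured once on the prototype would, as the authors argue, also bound the extrapolated configuration.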
Circularity Check
No circularity: experimental hardware demonstration with directly measured throughput
Full rationale
The paper reports an experimental photonic processor demonstration. Throughput figures (1,000-6,000 TOPS) are computed from measured clock rates (120 Gbaud/s for 8×8 channels at 7-bit accuracy; 20-128 Gbaud/s for 256×100 channels at 6-bit accuracy) and parallelism, not from any fitted parameters, self-referential equations, or ansatzes. The time-multiplexing reduction from O(N²) to O(N) modulators is a standard architectural description, not a derived result. No load-bearing self-citations, uniqueness theorems, or renamings of known results appear in the text. The central claims rest on fabricated devices, coupling, and benchmarked model inference rather than a closed mathematical derivation chain.
Reference graph
Works this paper leans on
- [1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
- [2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size, https://doi.org/10.48550/arXiv.1602.07360.
- [3] S. Han, H. Mao, and W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://doi.org/10.48550/arXiv.1510.00149.
- [4] Y. Shen et al., Deep learning with coherent nanophotonic circuits, Nat. Photonics 11, 441 (2017).
- [5] X. Xu et al., 11 TOPS photonic convolutional accelerator for optical neural networks, Nature 589, 44 (2021).
- [6] J. Feldmann et al., Parallel convolutional processing using an integrated photonic tensor core, Nature 589, 52 (2021).
- [7] N. Al-Kayed, C. St-Arnault, H. Morison, A. Aadhi, C. Huang, A. N. Tait, D. V. Plant, and B. J. Shastri, Programmable 200 GOPS Hopfield-inspired photonic Ising machine, Nature 648, 576 (2025).
- [8] L. G. Wright, T. Onodera, M. M. Stein, T. Wang, D. T. Schachter, Z. Hu, and P. L. McMahon, Deep physical neural networks trained with backpropagation, Nature 601, 549 (2022).
- [9] S. R. Ahmed et al., Universal photonic artificial intelligence acceleration, Nature 640, 368 (2025).
- [10] S. Hua et al., An integrated large-scale photonic accelerator with ultralow latency, Nature 640, 361 (2025).
- [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need, https://doi.org/10.48550/arXiv.1706.03762.
- [12] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, Diffusion Models: A Comprehensive Survey of Methods and Applications, https://doi.org/10.48550/arXiv.2209.00796.
- [13] OpenAI et al., GPT-4 Technical Report, https://doi.org/10.48550/arXiv.2303.08774.
- [14] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using diffractive deep neural networks, Science 361, 1004 (2018).
- [15] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, and Q. Dai, Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit, Nat. Photonics 15, 367 (2021).
- [16] T. Wang, S.-Y. Ma, L. G. Wright, T. Onodera, B. C. Richard, and P. L. McMahon, An optical neural network using less than 1 photon per multiplication, Nat. Commun. 13, 123 (2022).
- [17] Z. Chen et al., Deep learning with coherent VCSEL neural networks, Nat. Photonics 17, 723 (2023).
- [18] L. Bernstein, A. Sludds, C. Panuski, S. Trajtenberg-Mills, R. Hamerly, and D. Englund, Single-shot optical neural network, Sci. Adv. 9, eadg7904 (2023).
- [19] Y. Liang et al., High-clockrate free-space optical in-memory computing, Light Sci. Appl. 15, 115 (2026).
- [20] M. Moralis-Pegios, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, and N. Pleros, Perfect linear optics using silicon photonics, Nat. Commun. 15, 5468 (2024).
- [21] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Neuromorphic photonic networks using silicon photonic weight banks, Sci. Rep. 7, 7430 (2017).
- [22] X. Yu et al., Parallel optical computing capable of 100-wavelength multiplexing, eLight 5 (2025).
- [23] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photoelectric multiplication, Phys. Rev. X 9 (2019).
- [24] A. Sludds et al., Delocalized photonic deep learning on the internet’s edge, Science 378, 270 (2022).
- [25] Z. Lin et al., 120 GOPS photonic tensor core in thin-film lithium niobate for inference and in situ training, Nat. Commun. 15, 9081 (2024).
- [26] S. Rahimi Kari, N. A. Nobile, D. Pantin, V. Shah, and N. Youngblood, Realization of an integrated coherent photonic platform for scalable matrix operations, Optica 11, 542 (2024).
- [27] L. Zhou et al., Quantization-Aware Photonic Homodyne Computing for Accelerated Artificial Intelligence and Scientific Simulation, https://doi.org/10.48550/arXiv.2602.08269.
- [28] S. Ou et al., Hypermultiplexed integrated photonics-based optical tensor processor, Sci. Adv. 11, eadu0228 (2025).