pith. machine review for the scientific record. sign in

arxiv: 2605.14171 · v1 · pith:LS7YXZ3Qnew · submitted 2026-05-13 · 💻 cs.LG · cs.NI

CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision

Pith reviewed 2026-05-15 04:50 UTC · model grok-4.3

classification 💻 cs.LG cs.NI
keywords self-supervised learningWi-Fi sensingchannel state informationmasked predictionrepresentation learningubiquitous sensinglabel-efficient learning
0
0 comments X

The pith

CSI-JEPA learns reusable temporal-spectral representations from unlabeled Wi-Fi channel state information through masked prediction of latent features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a self-supervised framework that pretrains on abundant unlabeled CSI samples by predicting masked channel regions from visible context. Tokenization occurs along time and subcarrier dimensions, paired with a masking strategy that targets areas of stronger local variations. After pretraining, the encoder stays frozen while lightweight task-specific adapters handle downstream sensing tasks. This approach is evaluated across seven real-world Wi-Fi sensing tasks, delivering accuracy gains up to 10.64 percentage points over supervised Transformers and label reductions up to 98 percent at matched performance. The core value lies in shifting from task-specific labeled training to scalable, reusable representations when labeled data remains scarce.

Core claim

CSI-JEPA pretrains an encoder on unlabeled CSI by predicting latent features of masked channel-response amplitude windows from visible context, using time-subcarrier tokenization and channel variation-aware masking to respect CSI physical structure, then freezes the encoder as a backbone for multiple downstream sensing tasks with only lightweight adapters.

What carries the argument

Masked predictive coding on tokenized CSI amplitude windows, where the model predicts latent features of variation-rich masked regions from surrounding visible context.

If this is right

  • Seven diverse Wi-Fi sensing tasks achieve higher mean accuracy than fully supervised baselines while using far less labeled data.
  • The same pretrained encoder supports multiple objectives by adding separate lightweight adapters rather than retraining entire models.
  • Label budgets for new deployments can be reduced by up to 98 percent while maintaining competitive performance.
  • Temporal-spectral structure in CSI is explicitly respected during pretraining through dimension-specific tokenization and variation-aware masking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar masked-prediction pretraining could extend to other radio-frequency sensing modalities that share time-frequency structure.
  • Foundation-style encoders for sensing would allow rapid adaptation across users, devices, and rooms without repeated large-scale labeling campaigns.
  • Deployment cost models for Wi-Fi sensing systems would shift from data-collection expense to compute for one-time pretraining.
  • The approach implies that explicit modeling of channel variation during masking is more effective than uniform random masking for radio signals.

Load-bearing premise

Representations learned from masked prediction on unlabeled CSI transfer effectively to new tasks and settings using only lightweight adapters without major domain shift.

What would settle it

Measure whether accuracy on a new device or environment falls below supervised baselines when the pretrained encoder is frozen and only adapters are trained on limited labels from that setting.

Figures

Figures reproduced from arXiv: 2605.14171 by Xuanhao Luo, Yuchen Liu, Zhizhen Li.

Figure 1
Figure 1. Figure 1: Illustrative CSI amplitude examples for Fall and Non-Fall samples. Left: temporal-spectral CSI heatmaps. Right: aggregated temporal and sub￾carrier profiles obtained by averaging over subcarriers and time, respectively. The two classes exhibit visibly different structures in both the heatmap and the aggregated one-dimensional views, suggesting that discriminative sensing cues exist along both temporal and … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of CSI-JEPA. The framework performs self-supervised predictive pretraining on temporal-spectral CSI samples [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy and weighted F1-score under different label budgets on four individually defined CSI sensing tasks. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy and weighted F1-score under different label [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of pretraining epochs on downstream adaptation. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Channel state information (CSI) provides a widely available sensing modality for human and environment perception, but existing CSI sensing models usually rely on task-specific supervised training and require substantial labeled data for each task, device, user, or environment. This limits their scalability in practical deployments where unlabeled CSI is abundant but labeled data is costly to collect. In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework for label-efficient, multi-task Wi-Fi sensing. CSI-JEPA learns reusable temporal-spectral representations from unlabeled CSI samples by predicting latent features of masked channel regions from visible context. To better match the physical structure of CSI, CSI-JEPA tokenizes channel-response amplitude windows along the time and subcarrier dimensions. It then introduces a channel variation-aware masking strategy that samples predictive targets from regions with stronger local temporal and subcarrier-domain variations. After pretraining, the encoder is frozen and used as a backbone, with lightweight task-specific adapters added for downstream sensing tasks. We evaluate CSI-JEPA on seven real-world Wi-Fi sensing tasks spanning diverse objectives and deployment settings. The results show that CSI-JEPA improves downstream sensing performance over competitive baselines, achieving up to 10.64 percentage points mean accuracy gain over state-of-the-art supervised Transformer and matched-budget label savings of up to 98.0%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. CSI-JEPA introduces a self-supervised JEPA-style framework that pretrains a Transformer encoder on unlabeled CSI amplitude windows by predicting latent features of masked regions, using a channel variation-aware masking strategy on time-subcarrier tokenized inputs. After pretraining, the encoder is frozen and paired with lightweight task-specific adapters for transfer to seven real-world Wi-Fi sensing tasks, where it reports accuracy gains of up to 10.64 percentage points and label savings of up to 98% relative to supervised Transformer baselines.

Significance. If the empirical results are robust, the work provides a practical path toward foundation representations for ubiquitous CSI sensing, substantially lowering the labeled-data barrier for multi-task, multi-environment deployment. The combination of physically motivated masking and frozen-encoder transfer is a clear strength that aligns with successful self-supervised paradigms in other modalities.

major comments (3)
  1. [§4] §4 (Experiments) and Table 2: the reported maximum gains (10.64 pp accuracy, 98 % label savings) are presented as headline results without per-task breakdowns, error bars, or statistical significance tests; this information is load-bearing for the central claim that CSI-JEPA outperforms matched-budget supervised Transformers across diverse tasks.
  2. [§3.2] §3.2 (Channel variation-aware masking): the precise definition of the variation threshold, the sampling distribution over high-variation regions, and the exact masking ratio schedule are described only qualitatively; without these details the strategy cannot be reproduced or ablated, undermining claims that the masking is tailored to CSI physical structure.
  3. [§4.3] §4.3 (Ablation studies): the paper does not report an ablation isolating the contribution of the variation-aware masking versus standard random masking, which is necessary to establish that the proposed strategy is responsible for the observed transfer gains rather than the JEPA objective alone.
minor comments (2)
  1. [§3.1] Notation for the tokenized CSI windows (time-subcarrier patches) is introduced without an explicit equation; adding a compact definition (e.g., Eq. (3)) would improve clarity.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit arrows indicating the flow of visible vs. masked tokens through the predictor.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major point below and will revise the manuscript to strengthen reproducibility and empirical rigor.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and Table 2: the reported maximum gains (10.64 pp accuracy, 98 % label savings) are presented as headline results without per-task breakdowns, error bars, or statistical significance tests; this information is load-bearing for the central claim that CSI-JEPA outperforms matched-budget supervised Transformers across diverse tasks.

    Authors: We agree that per-task breakdowns with error bars and significance testing are necessary to support the central claims. In the revised manuscript we will expand Table 2 (and add a supplementary table) to report mean accuracy and standard deviation over five random seeds for each of the seven tasks, together with p-values from paired t-tests against the matched-budget supervised Transformer baselines. This will make the robustness of the reported gains transparent. revision: yes

  2. Referee: [§3.2] §3.2 (Channel variation-aware masking): the precise definition of the variation threshold, the sampling distribution over high-variation regions, and the exact masking ratio schedule are described only qualitatively; without these details the strategy cannot be reproduced or ablated, undermining claims that the masking is tailored to CSI physical structure.

    Authors: We acknowledge that §3.2 currently provides only a qualitative description. In the revision we will add the exact implementation details used in our experiments: the mathematical definition of the local variation score, the precise threshold for selecting high-variation regions, the normalized sampling distribution, and the linear masking-ratio schedule. Pseudocode will also be included to ensure full reproducibility. revision: yes

  3. Referee: [§4.3] §4.3 (Ablation studies): the paper does not report an ablation isolating the contribution of the variation-aware masking versus standard random masking, which is necessary to establish that the proposed strategy is responsible for the observed transfer gains rather than the JEPA objective alone.

    Authors: We agree that isolating the masking strategy is required. We will add a new ablation subsection in §4.3 that trains an otherwise identical JEPA model with standard random masking and reports downstream accuracy on the same seven tasks. This will quantify the incremental benefit of the channel variation-aware masking. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on held-out evaluations

full rationale

The paper describes a JEPA-style masked-prediction pretraining pipeline on tokenized CSI amplitude windows, followed by frozen-encoder transfer with lightweight adapters. All reported gains (accuracy improvements and label savings) are presented as direct outcomes of empirical evaluation across seven real-world tasks. No equations, derivations, or parameter-fitting steps are shown that reduce the target quantities to the inputs by construction. The masking strategy is motivated by CSI physical structure rather than tautological redefinition, and the transfer protocol follows standard foundation-model practice without self-referential uniqueness theorems or ansatz smuggling. Any self-citations are peripheral and do not carry the central empirical claims.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that predictive pretraining on unlabeled CSI yields transferable features; limited details available from abstract.

free parameters (1)
  • masking ratio and variation threshold
    Hyperparameters controlling which regions are masked and treated as targets; values not specified in abstract.
axioms (1)
  • domain assumption Unlabeled CSI contains sufficient latent structure for learning task-agnostic representations via masked prediction
    Core premise enabling the self-supervised stage.
invented entities (1)
  • channel variation-aware masking strategy no independent evidence
    purpose: To select predictive targets from high-variation regions in time and subcarrier dimensions
    New component introduced to align with CSI physical structure

pith-pipeline@v0.9.0 · 5537 in / 1261 out tokens · 35360 ms · 2026-05-15T04:50:16.600166+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    Deep AI enabled ubiquitous wireless sensing: A survey,

    C. Li, Z. Cao, and Y . Liu, “Deep AI enabled ubiquitous wireless sensing: A survey,”ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–35, 2021

  2. [2]

    Cross-domain WiFi sensing with channel state information: A survey,

    C. Chen, G. Zhou, and Y . Lin, “Cross-domain WiFi sensing with channel state information: A survey,”ACM Computing Surveys, vol. 55, no. 11, pp. 1–37, 2023

  3. [3]

    WiFi sensing with channel state information: A survey,

    Y . Ma, G. Zhou, and S. Wang, “WiFi sensing with channel state information: A survey,”ACM Computing Surveys (CSUR), vol. 52, no. 3, pp. 1–36, 2019

  4. [4]

    Wi-Fi can do more: Toward ubiquitous wireless sensing,

    C. Wu, B. Wang, O. C. Au, and K. R. Liu, “Wi-Fi can do more: Toward ubiquitous wireless sensing,”IEEE Communications Standards Magazine, vol. 6, no. 2, pp. 42–49, 2022

  5. [5]

    Wireless sensing for human activity: A survey,

    J. Liu, H. Liu, Y . Chen, Y . Wang, and C. Wang, “Wireless sensing for human activity: A survey,”IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1629–1645, 2019

  6. [6]

    IEEE 802.11 bf WLAN sensing procedure: Enabling the widespread adoption of WiFi sensing,

    T. Ropitault, C. R. da Silva, S. Blandino, A. Sahoo, N. Golmie, K. Yoon, C. Aldana, and C. Hu, “IEEE 802.11 bf WLAN sensing procedure: Enabling the widespread adoption of WiFi sensing,”IEEE Communications Standards Magazine, vol. 8, no. 1, pp. 58–64, 2024

  7. [7]

    An overview on IEEE 802.11 bf: WLAN sensing,

    R. Du, H. Hua, H. Xie, X. Song, Z. Lyu, M. Hu, Y . Xin, S. McCann, M. Montemurro, T. X. Hanet al., “An overview on IEEE 802.11 bf: WLAN sensing,”IEEE Communications Surveys & Tutorials, vol. 27, no. 1, pp. 184–217, 2024

  8. [8]

    Beamforming Feedback- Driven Wireless Positioning: A Transferable Vision Transformer Ap- proach,

    Z. Li, X. Luo, M. Chen, G. Li, and Y . Liu, “Beamforming Feedback- Driven Wireless Positioning: A Transferable Vision Transformer Ap- proach,”IEEE Transactions on Mobile Computing, 2026

  9. [9]

    Contactless respiration monitoring via off-the-shelf WiFi devices,

    X. Liu, J. Cao, S. Tang, J. Wen, and P. Guo, “Contactless respiration monitoring via off-the-shelf WiFi devices,”IEEE Transactions on Mo- bile Computing, vol. 15, no. 10, pp. 2466–2479, 2015

  10. [10]

    Walls have no ears: A non-intrusive WiFi-based user identification system for mobile devices,

    L. Cheng and J. Wang, “Walls have no ears: A non-intrusive WiFi-based user identification system for mobile devices,”IEEE/ACM Transactions on Networking, vol. 27, no. 1, pp. 245–257, 2019

  11. [11]

    BFMLoc: Transformer- Based Indoor Positioning Leveraging Beamforming Feedback Matri- ces,

    Z. Li, X. Luo, M. Chen, C. Xu, and Y . Liu, “BFMLoc: Transformer- Based Indoor Positioning Leveraging Beamforming Feedback Matri- ces,” inICC 2025-IEEE International Conference on Communications. IEEE, 2025, pp. 6699–6704

  12. [12]

    Inferring person-to-person proximity using WiFi signals,

    P. Sapiezynski, A. Stopczynski, D. K. Wind, J. Leskovec, and S. Lehmann, “Inferring person-to-person proximity using WiFi signals,” Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, vol. 1, no. 2, pp. 1–20, 2017

  13. [13]

    AutoFi: Toward automatic Wi-Fi human sensing via geometric self-supervised learning,

    J. Yang, X. Chen, H. Zou, D. Wang, and L. Xie, “AutoFi: Toward automatic Wi-Fi human sensing via geometric self-supervised learning,” IEEE Internet of Things Journal, vol. 10, no. 8, pp. 7416–7425, 2022

  14. [14]

    A tutorial-cum-survey on self-supervised learning for wi-fi sensing: Trends, challenges, and outlook,

    A. Y . Radwan, M. Yildirim, N. Hasanzadeh, H. Tabassum, and S. Valaee, “A tutorial-cum-survey on self-supervised learning for wi-fi sensing: Trends, challenges, and outlook,”IEEE Communications Surveys & Tutorials, 2025

  15. [15]

    AM-FM: A Foundation Model for Ambient Intelligence Through WiFi,

    G. Zhu, Y . Hu, S. Jayaweera, W. Gao, W.-H. Wang, J. Zhang, B. Wang, C. Wu, and K. Liu, “AM-FM: A Foundation Model for Ambient Intelligence Through WiFi,”arXiv preprint arXiv:2602.11200, 2026

  16. [16]

    CSI-MAE: A Masked Autoencoder-based Channel Foundation Model,

    J. Jiang, X. Ruan, and S. Xu, “CSI-MAE: A Masked Autoencoder-based Channel Foundation Model,”arXiv preprint arXiv:2601.03789, 2026

  17. [17]

    A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,

    Y . LeCunet al., “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,”Open Review, vol. 62, no. 1, pp. 1–62, 2022

  18. [18]

    Toward inte- grated sensing and communications in IEEE 802.11 bf Wi-Fi networks,

    F. Meneghello, C. Chen, C. Cordeiro, and F. Restuccia, “Toward inte- grated sensing and communications in IEEE 802.11 bf Wi-Fi networks,” IEEE Communications Magazine, vol. 61, no. 7, pp. 128–133, 2023

  19. [19]

    BFMSense: WiFi sensing using beamforming feedback matrix,

    E. Yi, D. Wu, J. Xiong, F. Zhang, K. Niu, W. Li, and D. Zhang, “BFMSense: WiFi sensing using beamforming feedback matrix,” in21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1697–1712

  20. [20]

    Sensing per- formance of the IEEE 802.11 bf protocol and its impact on data communication,

    A. Sahoo, T. Ropitault, S. Blandino, and N. Golmie, “Sensing per- formance of the IEEE 802.11 bf protocol and its impact on data communication,” in2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall). IEEE, 2024, pp. 1–7

  21. [21]

    Deep learning-based joint channel estimation and CSI feedback for RIS-assisted communications,

    H. Feng, Y . Xu, and Y . Zhao, “Deep learning-based joint channel estimation and CSI feedback for RIS-assisted communications,”IEEE Communications Letters, vol. 28, no. 8, pp. 1860–1864, 2024

  22. [22]

    Contextual combinatorial beam management via online probing for multiple ac- cess mmWave wireless networks,

    Z. Li, X. Luo, M. Chen, C. Xu, S. Mao, and Y . Liu, “Contextual combinatorial beam management via online probing for multiple ac- cess mmWave wireless networks,”IEEE Journal on Selected Areas in Communications, vol. 43, no. 3, pp. 959–972, 2025

  23. [23]

    Denoising diffusion probabilistic model for radio map estimation in generative wireless networks,

    X. Luo, Z. Li, Z. Peng, M. Chen, and Y . Liu, “Denoising diffusion probabilistic model for radio map estimation in generative wireless networks,”IEEE Transactions on Cognitive Communications and Net- working, vol. 11, no. 2, pp. 751–763, 2025

  24. [24]

    LLM4WM: Adapt- ing LLM for wireless multi-tasking,

    X. Liu, S. Gao, B. Liu, X. Cheng, and L. Yang, “LLM4WM: Adapt- ing LLM for wireless multi-tasking,”IEEE Transactions on Machine Learning in Communications and Networking, 2025

  25. [25]

    MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing,

    Z. Li, X. Luo, X. Ge, L. Zhou, X. Lin, and Y . Liu, “MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing,”arXiv preprint arXiv:2511.12305, 2025

  26. [26]

    Self-supervised learning from images with a joint-embedding predictive architecture,

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rab- bat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15 619–15 629

  27. [27]

    V-jepa: Latent video prediction for visual representation learning,

    A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas, “V-jepa: Latent video prediction for visual representation learning,” 2023

  28. [28]

    Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

    D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, Y . Bang, A. Bolourchi, Y . LeCun, and P. Fung, “Vl-jepa: Joint em- bedding predictive architecture for vision-language,”arXiv preprint arXiv:2512.10942, 2025

  29. [29]

    WirelessJEPA: A Multi- Antenna Foundation Model using Spatio-temporal Wireless Latent Pre- dictions,

    V . Chu, O. Mashaal, and H. Abou-Zeid, “WirelessJEPA: A Multi- Antenna Foundation Model using Spatio-temporal Wireless Latent Pre- dictions,”arXiv preprint arXiv:2601.20190, 2026

  30. [30]

    Learning latent wireless dynamics from channel state information,

    C. B. Chaaya, A. M. Girgis, and M. Bennis, “Learning latent wireless dynamics from channel state information,”IEEE Wireless Communica- tions Letters, vol. 14, no. 2, pp. 489–493, 2024

  31. [31]

    Structured Latent Dynamics in Wireless CSI via Homomorphic World Models,

    S. Naoumi, M. Bennis, and M. Chafii, “Structured Latent Dynamics in Wireless CSI via Homomorphic World Models,”arXiv preprint arXiv:2603.20048, 2026

  32. [32]

    JEPA-MSAC: A Joint-Embedding Predictive Architec- ture for Multimodal Sensing-Assisted Communications,

    C. Zheng, J. He, G. Cai, N. Li, M. Bennis, H. Wymeersch, and M. Debbah, “JEPA-MSAC: A Joint-Embedding Predictive Architec- ture for Multimodal Sensing-Assisted Communications,”arXiv preprint arXiv:2603.29796, 2026

  33. [33]

    SenseFi: A library and benchmark on deep-learning-empowered WiFi human sensing,

    J. Yang, X. Chen, H. Zou, C. X. Lu, D. Wang, S. Sun, and L. Xie, “SenseFi: A library and benchmark on deep-learning-empowered WiFi human sensing,”Patterns, vol. 4, no. 3, 2023

  34. [34]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  35. [35]

    CSI- bench: A large-scale in-the-wild dataset for multi-task WiFi sensing,

    G. Zhu, Y . Hu, W. Gao, W.-H. Wang, B. Wang, and K. Liu, “CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing,”arXiv preprint arXiv:2505.21866, 2025

  36. [36]

    The perceptron: a probabilistic model for information storage and organization in the brain

    F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.”Psychological review, vol. 65, no. 6, p. 386, 1958

  37. [37]

    Learning repre- sentations by back-propagating errors,

    D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre- sentations by back-propagating errors,”nature, vol. 323, no. 6088, pp. 533–536, 1986

  38. [38]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  39. [39]

    Gradient-based learning applied to document recognition,

    Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998