arxiv: 2606.24910 · v1 · pith:UFDR4WJAnew · submitted 2026-06-19 · 📡 eess.AS · cs.AI · cs.SD

End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users

Pith reviewed 2026-06-26 13:28 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD
keywords end-to-end spoken language understanding · voice intent recognition · human-drone interaction · spontaneous speech · cross-modal distillation · UAV teleoperation · self-supervised learning

The pith

An end-to-end speech model reaches 93 percent accuracy on simple drone voice commands at 7 ms latency, 29 times faster than cascade baselines, and 82 percent accuracy on the full spontaneous speech test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end spoken language understanding system for controlling drones with natural, spontaneous speech from people who have no training in drone operation. It combines a pre-trained acoustic model with a simple classifier and uses knowledge from text models to improve understanding without needing written transcripts during use. Evaluation on a new collection of real French speech recordings shows high accuracy and very low delay compared to older step-by-step systems. A sympathetic reader would care because this could make drone piloting more intuitive and accessible in everyday situations where users speak naturally and make mistakes.

Core claim

The authors establish that an end-to-end architecture using a frozen self-supervised learning (SSL) acoustic encoder and an LSTM classification head, trained with cross-modal knowledge distillation, achieves 93% accuracy on simple voice commands at 7 ms inference latency, against 79% and 202 ms for cascade baselines, and 82% accuracy on the full spontaneous speech test set from the VoiceStick corpus.

What carries the argument

The end-to-end SLU model with frozen SSL encoder, LSTM head, and cross-modal distillation objective that aligns acoustic representations to semantic embeddings.
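
One way to read this objective, as a minimal sketch: the classification loss on intent labels is combined with an alignment term that pulls a pooled acoustic embedding toward the text teacher's sentence embedding. The summary does not state the exact form of the loss, so the cosine-distance alignment term, the weight `alpha`, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import math

def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distillation_loss(logits, label, acoustic_emb, teacher_emb, alpha=0.5):
    """Hypothetical combined objective: intent loss + alpha * alignment loss.

    `teacher_emb` would come from a text model applied to the transcript and
    is only needed during training; inference uses `logits` alone.
    """
    align = cosine_distance(acoustic_emb, teacher_emb)
    return cross_entropy(logits, label) + alpha * align

# Same logits, two different acoustic embeddings (toy values).
loss_aligned = distillation_loss([2.0, 0.1, -1.0], 0, [1.0, 0.0], [1.0, 0.0])
loss_misaligned = distillation_loss([2.0, 0.1, -1.0], 0, [1.0, 0.0], [0.0, 1.0])
```

With identical logits, the misaligned embedding is penalised by exactly `alpha` more than the aligned one, which is the pressure that would push acoustic representations toward the teacher's semantic space.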

If this is right

  • On simple commands the model reaches 93 percent accuracy while running 29 times faster than cascade approaches.
  • Cross-modal distillation improves robustness on disfluent spontaneous speech without requiring transcripts at test time.
  • The architecture provides calibrated confidence estimates suitable for real-time UAV control.
  • End-to-end designs prove preferable to cascaded systems for handling naive user speech in drone teleoperation.
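
The calibration claim in the list above is commonly checked with the expected calibration error (ECE). The sketch below is a generic ECE computation on made-up data; the summary does not say which calibration metric the paper reports, so the choice of ECE, the bin count, and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# Toy data (made up): four predictions with confidence and correctness.
confs = [0.95, 0.95, 0.55, 0.55]
hits = [True, True, True, False]
ece = expected_calibration_error(confs, hits)
```

A low ECE means a confidence threshold can be used to reject uncertain commands before they reach the drone, which is where calibration matters for real-time control.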

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar end-to-end approaches might apply to voice interfaces in other robots or vehicles where users speak spontaneously.
  • The low latency could support more fluid, conversational interactions rather than single commands.
  • Testing the model on non-French languages or different acoustic environments would reveal its broader applicability.

Load-bearing premise

The VoiceStick corpus from 29 nonexpert pairs in actual drone sessions captures the kind of spontaneous and disfluent speech that occurs with naive users in general real-world drone interactions.

What would settle it

A new test set of spontaneous drone commands recorded from a different group of nonexpert users where the end-to-end model achieves less than 75 percent accuracy would indicate the approach does not generalize as claimed.

Figures

Figures reproduced from arXiv: 2606.24910 by Allan Henry (GIPSA-COPERNIC, GETALP, LPNC), Christian Graff (LPNC), Jose-Ernesto Gomez-Balderas (GIPSA-COPERNIC), Solange Rossato (GETALP), Sylvain Huet (GIPSA-COPERNIC).

Figure 1: Proposed End-to-End architecture. To handle the significant variance in duration and pacing characteristic of spontaneous speech, we implement a multi-stage aggregation process. First, an LSTM models long-range sequential dependencies across the frame-level feature sequence. An Attentive Pooling mechanism then collapses the LSTM outputs into a single segment-level representation, effectively weighting the …
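
The attentive pooling step described in the Figure 1 caption can be sketched as a softmax-weighted sum over frames. The caption is truncated in the source, so the scoring function is not visible; a single learned scoring vector `w` is assumed here, and the function name is hypothetical.

```python
import math

def attentive_pooling(frames, w):
    """Collapse a sequence of frame vectors into one segment-level vector.

    frames: list of T vectors (e.g. LSTM outputs), each of dimension D.
    w: scoring vector of dimension D (assumed, learned during training).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over time
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Three toy frames; a zero scoring vector gives uniform weights (mean pooling).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attentive_pooling(frames, w=[0.0, 0.0])
```

Because the output size does not depend on the number of frames, the same classifier can handle both short commands and long, hesitant utterances.
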
Figure 2: Confidence score distributions for correct and incorrect predictions.
Original abstract

Voice control offers an intuitive alternative to manual drone piloting, yet most existing systems rely on rigid command vocabularies that fail to handle the spontaneous, disfluent speech of naive users. This paper addresses this gap by proposing an End-to-End Spoken Language Understanding architecture for real-time human-drone interaction in French. Our model combines a frozen Self-Supervised Learning acoustic encoder with a lightweight LSTM-based classification head, augmented by a cross-modal knowledge distillation objective that aligns acoustic representations with semantic embeddings from a text teacher, without requiring transcription at inference time. We evaluate our approach on VoiceStick, a novel French corpus of spontaneous speech collected during real teleoperation sessions with 29 nonexpert dyads. On simple voice commands, our best configuration achieves 93% accuracy at 7 ms inference latency, outperforming cascade baselines (79%, 202 ms) with a 29x speedup. On the full spontaneous speech test set, our architecture reaches 82% accuracy, with crossmodal distillation consistently improving robustness across all configurations. These results demonstrate that End-to-End architectures are not only feasible but preferable for spontaneous voice-guided UAV teleoperation, combining semantic robustness, low latency, and calibrated confidence.
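
The headline ratios follow directly from the figures quoted in the abstract. A quick arithmetic check, using only those numbers (the variable names are illustrative):

```python
cascade_latency_ms = 202  # cascade baseline latency, from the abstract
e2e_latency_ms = 7        # end-to-end latency, from the abstract

speedup = cascade_latency_ms / e2e_latency_ms  # about 28.86, reported as 29x
accuracy_gain_points = 93 - 79                 # gain on simple commands, in percentage points
```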

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an end-to-end spoken language understanding architecture for spontaneous French voice commands in drone teleoperation. It freezes a self-supervised learning acoustic encoder, adds a lightweight LSTM classification head, and applies cross-modal knowledge distillation from text embeddings. The model is evaluated on the new VoiceStick corpus collected from 29 nonexpert dyads during real sessions, reporting 93% accuracy and 7 ms latency on simple commands (vs. 79% / 202 ms for cascade baselines) and 82% accuracy on the full spontaneous test set, concluding that E2E models are preferable for semantic robustness and low latency.

Significance. If the results hold under proper generalization tests, the work would demonstrate a practical advantage for end-to-end models over cascaded ASR+NLU pipelines in handling disfluent naive-user speech for UAV control, with the 29x latency reduction being a notable engineering contribution. The release of VoiceStick as a spontaneous-interaction corpus would also be a useful resource for the field, provided its coverage of disfluency and speaker variability is adequately documented.

major comments (2)
  1. [Evaluation] The central claim that E2E architectures are preferable rests on the 82% accuracy on the full spontaneous speech test set from VoiceStick. However, the manuscript provides no details on speaker-independent train/test splits, accent or lexical diversity metrics, or statistical significance tests for the reported gains over cascade baselines; without these, the 29-dyad corpus cannot securely support generalization beyond the collected sessions.
  2. [§4] The abstract describes cross-modal distillation as 'consistently improving robustness across all configurations,' yet no ablation table or section quantifies the contribution of the distillation loss versus the base E2E model on the spontaneous subset; this is load-bearing for the claim that the full architecture is required.
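
A speaker-independent split of the kind the first major comment asks about can be built by holding out whole dyads. This is a minimal sketch under assumptions: the `dyad` key, the 20% test fraction, and the function name are hypothetical, and the actual VoiceStick split procedure is not documented in the material above.

```python
import random

def split_by_dyad(utterances, test_fraction=0.2, seed=0):
    """Hold out whole dyads so no speaker appears in both train and test.

    utterances: list of dicts with a 'dyad' key (hypothetical schema).
    """
    dyads = sorted({u["dyad"] for u in utterances})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(dyads)
    n_test = max(1, round(len(dyads) * test_fraction))
    test_dyads = set(dyads[:n_test])
    train = [u for u in utterances if u["dyad"] not in test_dyads]
    test = [u for u in utterances if u["dyad"] in test_dyads]
    return train, test

# Toy corpus: 29 dyads with 3 placeholder utterances each.
corpus = [{"dyad": d, "utt": i} for d in range(29) for i in range(3)]
train, test = split_by_dyad(corpus)
```

Splitting at the utterance level instead would let the same voices appear on both sides and could inflate test accuracy.
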
minor comments (2)
  1. [Abstract] The latency comparison (7 ms vs 202 ms) should specify whether the cascade figure includes ASR decoding time or only the NLU stage, and whether both systems were timed on the same hardware, so that the 29x speedup compares like with like.
  2. [Methods] Notation for the distillation objective (e.g., the alignment loss between acoustic and text embeddings) should be introduced with an equation number in the methods section for clarity.
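
The latency concern in the first minor comment comes down to timing both pipelines with the same harness and over the same inputs. The sketch below shows one way to do that; `asr`, `nlu`, and `e2e` are stand-in functions, not the paper's models, and the median-over-calls protocol is an assumption.

```python
import statistics
import time

def median_latency_ms(fn, inputs, warmup=3):
    """Median wall-clock time per call, in milliseconds."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in stages (hypothetical): real code would call the actual models.
def asr(audio):
    return "texte"

def nlu(text):
    return "intent"

def e2e(audio):
    return "intent"

def cascade(audio):
    # The cascade timing must cover ASR decoding and NLU together.
    return nlu(asr(audio))

clips = [b"\x00" * 16000 for _ in range(20)]
cascade_ms = median_latency_ms(cascade, clips)
e2e_ms = median_latency_ms(e2e, clips)
```

Reporting the median rather than the mean limits the influence of occasional slow calls, which matters when the quantities being compared are a few milliseconds.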

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important gaps in evaluation documentation and ablation analysis. We address each point below and will revise the manuscript accordingly to strengthen the claims.

Point-by-point responses
  1. Referee: [Evaluation] The central claim that E2E architectures are preferable rests on the 82% accuracy on the full spontaneous speech test set from VoiceStick. However, the manuscript provides no details on speaker-independent train/test splits, accent or lexical diversity metrics, or statistical significance tests for the reported gains over cascade baselines; without these, the 29-dyad corpus cannot securely support generalization beyond the collected sessions.

    Authors: We agree that the manuscript should provide explicit details on these evaluation aspects to support generalization claims. The VoiceStick splits were constructed to be speaker-independent by holding out complete dyads (ensuring no speaker overlap between train and test), and we will add a dedicated subsection in the revised paper documenting the split procedure, reporting accent variation (via self-reported and phonetic metrics) and lexical diversity (type-token ratios and command variation counts across the 29 dyads), and including statistical significance tests (McNemar's test) confirming p < 0.01 for the accuracy gains over cascade baselines. These additions will be included in the next version. revision: yes

  2. Referee: [§4] The abstract describes cross-modal distillation as 'consistently improving robustness across all configurations,' yet no ablation table or section quantifies the contribution of the distillation loss versus the base E2E model on the spontaneous subset; this is load-bearing for the claim that the full architecture is required.

    Authors: We concur that a quantitative ablation is required to substantiate the distillation contribution on the spontaneous subset. The current text describes the objective but lacks a dedicated comparison. In the revision we will insert a new ablation table (and accompanying text) in §4 that reports accuracy on the spontaneous test set for the base E2E model versus the full model with distillation, across all encoder and head configurations, thereby quantifying the consistent robustness gains and directly supporting the claim that the complete architecture is beneficial. revision: yes
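
McNemar's test, named in the first response, compares two classifiers on the same utterances by looking only at the cases where they disagree. The sketch below implements the exact (binomial) variant on made-up outcomes; the rebuttal does not say which variant would be used, and the data here are illustrative, not results from the paper.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-utterance correctness."""
    # b: only model A is right; c: only model B is right.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null hypothesis, disagreements split 50/50 between b and c.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy outcomes (made up): 10 utterances only A gets right, 1 only B, 5 both.
model_a = [True] * 10 + [False] * 1 + [True] * 5
model_b = [False] * 10 + [True] * 1 + [True] * 5
b, c, p_value = mcnemar_exact(model_a, model_b)
```

Utterances that both systems classify correctly, or both get wrong, do not affect the p-value, so the test needs per-utterance predictions rather than aggregate accuracies.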

Circularity Check

0 steps flagged

No circularity: results are empirical evaluations on held-out corpus data

full rationale

The paper collects the VoiceStick corpus from 29 dyads during real sessions and reports model accuracies (93% simple commands, 82% full spontaneous set) on a held-out test set, comparing against cascade baselines. No equations, predictions, or claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The architecture (frozen SSL encoder + LSTM head + distillation) and results are independent of the reported metrics; evaluation does not rename or tautologically reproduce any input quantity. This is a standard self-contained experimental paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the VoiceStick corpus for spontaneous speech and the effectiveness of the cross-modal distillation objective in improving robustness without requiring transcription at inference.

axioms (1)
  • domain assumption: A frozen self-supervised learning acoustic encoder yields representations sufficient for intent classification in this domain.
    Invoked to enable low-latency inference while avoiding full fine-tuning or transcription.
