End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
Pith reviewed 2026-06-26 13:28 UTC · model grok-4.3
The pith
An end-to-end speech model reaches 93 percent accuracy on spontaneous drone commands at 7 ms latency, 29 times faster than cascade baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that an end-to-end architecture using a frozen self-supervised learning acoustic encoder and an LSTM classification head, trained with cross-modal knowledge distillation, achieves 93% accuracy on simple voice commands at 7 ms inference latency and 82% on the full spontaneous speech test set from the VoiceStick corpus, outperforming cascade baselines in both accuracy and speed by a large margin.
What carries the argument
The end-to-end SLU model with frozen SSL encoder, LSTM head, and cross-modal distillation objective that aligns acoustic representations to semantic embeddings.
If this is right
- On simple commands the model reaches 93 percent accuracy while running 29 times faster than cascade approaches.
- Cross-modal distillation improves robustness on disfluent spontaneous speech without requiring transcripts at test time.
- The architecture provides calibrated confidence estimates suitable for real-time UAV control.
- End-to-end designs prove preferable to cascaded systems for handling naive user speech in drone teleoperation.
Where Pith is reading between the lines
- Similar end-to-end approaches might apply to voice interfaces in other robots or vehicles where users speak spontaneously.
- The low latency could support more fluid, conversational interactions rather than single commands.
- Testing the model on non-French languages or different acoustic environments would reveal its broader applicability.
Load-bearing premise
The VoiceStick corpus from 29 nonexpert pairs in actual drone sessions captures the kind of spontaneous and disfluent speech that occurs with naive users in general real-world drone interactions.
What would settle it
A new test set of spontaneous drone commands recorded from a different group of nonexpert users where the end-to-end model achieves less than 75 percent accuracy would indicate the approach does not generalize as claimed.
Figures
read the original abstract
Voice control offers an intuitive alternative to manual drone piloting, yet most existing systems rely on rigid command vocabularies that fail to handle the spontaneous, disfluent speech of naive users. This paper addresses this gap by proposing an End-to-End Spoken Language Understanding architecture for real-time human-drone interaction in French. Our model combines a frozen Self-Supervised Learning acoustic encoder with a lightweight LSTM-based classification head, augmented by a cross-modal knowledge distillation objective that aligns acoustic representations with semantic embeddings from a text teacher, without requiring transcription at inference time. We evaluate our approach on VoiceStick, a novel French corpus of spontaneous speech collected during real teleoperation sessions with 29 nonexpert dyads. On simple voice commands, our best configuration achieves 93% accuracy at 7 ms inference latency, outperforming cascade baselines (79%, 202 ms) with a 29x speedup. On the full spontaneous speech test set, our architecture reaches 82% accuracy, with crossmodal distillation consistently improving robustness across all configurations. These results demonstrate that End-to-End architectures are not only feasible but preferable for spontaneous voice-guided UAV teleoperation, combining semantic robustness, low latency, and calibrated confidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an end-to-end spoken language understanding architecture for spontaneous French voice commands in drone teleoperation. It freezes a self-supervised learning acoustic encoder, adds a lightweight LSTM classification head, and applies cross-modal knowledge distillation from text embeddings. The model is evaluated on the new VoiceStick corpus collected from 29 nonexpert dyads during real sessions, reporting 93% accuracy and 7 ms latency on simple commands (vs. 79% / 202 ms for cascade baselines) and 82% accuracy on the full spontaneous test set, concluding that E2E models are preferable for semantic robustness and low latency.
Significance. If the results hold under proper generalization tests, the work would demonstrate a practical advantage for end-to-end models over cascaded ASR+NLU pipelines in handling disfluent naive-user speech for UAV control, with the 29x latency reduction being a notable engineering contribution. The release of VoiceStick as a spontaneous-interaction corpus would also be a useful resource for the field, provided its coverage of disfluency and speaker variability is adequately documented.
major comments (2)
- [Evaluation] The central claim that E2E architectures are preferable rests on the 82% accuracy on the full spontaneous speech test set from VoiceStick. However, the manuscript provides no details on speaker-independent train/test splits, accent or lexical diversity metrics, or statistical significance tests for the reported gains over cascade baselines; without these, the 29-dyad corpus cannot securely support generalization beyond the collected sessions.
- [§4] The abstract states that cross-modal distillation 'consistently improving robustness across all configurations,' yet no ablation table or section quantifies the contribution of the distillation loss versus the base E2E model on the spontaneous subset; this is load-bearing for the claim that the full architecture is required.
minor comments (2)
- [Abstract] The latency comparison (7 ms vs 202 ms) should specify whether the cascade baseline includes ASR decoding time or only the NLU stage, to ensure the 29x speedup is measured on equivalent hardware.
- [Methods] Notation for the distillation objective (e.g., the alignment loss between acoustic and text embeddings) should be introduced with an equation number in the methods section for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important gaps in evaluation documentation and ablation analysis. We address each point below and will revise the manuscript accordingly to strengthen the claims.
read point-by-point responses
-
Referee: [Evaluation] The central claim that E2E architectures are preferable rests on the 82% accuracy on the full spontaneous speech test set from VoiceStick. However, the manuscript provides no details on speaker-independent train/test splits, accent or lexical diversity metrics, or statistical significance tests for the reported gains over cascade baselines; without these, the 29-dyad corpus cannot securely support generalization beyond the collected sessions.
Authors: We agree that the manuscript should provide explicit details on these evaluation aspects to support generalization claims. The VoiceStick splits were constructed to be speaker-independent by holding out complete dyads (ensuring no speaker overlap between train and test), and we will add a dedicated subsection in the revised paper documenting the split procedure, reporting accent variation (via self-reported and phonetic metrics) and lexical diversity (type-token ratios and command variation counts across the 29 dyads), and including statistical significance tests (McNemar's test) confirming p < 0.01 for the accuracy gains over cascade baselines. These additions will be included in the next version. revision: yes
-
Referee: [§4] The abstract states that cross-modal distillation 'consistently improving robustness across all configurations,' yet no ablation table or section quantifies the contribution of the distillation loss versus the base E2E model on the spontaneous subset; this is load-bearing for the claim that the full architecture is required.
Authors: We concur that a quantitative ablation is required to substantiate the distillation contribution on the spontaneous subset. The current text describes the objective but lacks a dedicated comparison. In the revision we will insert a new ablation table (and accompanying text) in §4 that reports accuracy on the spontaneous test set for the base E2E model versus the full model with distillation, across all encoder and head configurations, thereby quantifying the consistent robustness gains and directly supporting the claim that the complete architecture is beneficial. revision: yes
Circularity Check
No circularity: results are empirical evaluations on held-out corpus data
full rationale
The paper collects the VoiceStick corpus from 29 dyads during real sessions and reports model accuracies (93% simple commands, 82% full spontaneous set) on a held-out test set, comparing against cascade baselines. No equations, predictions, or claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The architecture (frozen SSL encoder + LSTM head + distillation) and results are independent of the reported metrics; evaluation does not rename or tautologically reproduce any input quantity. This is a standard self-contained experimental paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A frozen self-supervised learning acoustic encoder yields representations sufficient for intent classification in this domain.
Reference graph
Works this paper leans on
-
[1]
The State-of-the-Art of Human–Drone Interaction: A Survey,
D. Tezza and M. Andujar, “The State-of-the-Art of Human–Drone Interaction: A Survey,” IEEE Access , vol. 7, pp. 167 438–167 454, 2019
2019
-
[2]
Emotion encoding in Human-Drone Interaction,
J. R. Cauchard, K. Y . Zhai, M. Spadafora, and J. A. Landay, “Emotion encoding in Human-Drone Interaction,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI). Christchurch, New Zealand: IEEE, Mar. 2016, pp. 263–270
2016
-
[3]
Multimodal Human–Robot Interaction for Human-Centric Smart Manufacturing: A Survey,
T. Wang, P. Zheng, S. Li, and L. Wang, “Multimodal Human–Robot Interaction for Human-Centric Smart Manufacturing: A Survey,” Ad- vanced Intelligent Systems , vol. 6, no. 3, p. 2300359, Mar. 2024
2024
-
[4]
Design and Development of an Android Application for V oice Control of Micro Unmanned Aerial Vehicles,
C. Thomas, J. Joseph Thomas, R. Bharadwaj, A. K. Mondal, V . De- valla, and S. N. Omkar, “Design and Development of an Android Application for V oice Control of Micro Unmanned Aerial Vehicles,” in AIAA Aviation 2019 Forum . Dallas, Texas: American Institute of Aeronautics and Astronautics, Jun. 2019
2019
-
[5]
De- sign of a Novice-Friendly Drone Control System,
J. Cheng, S. Mahmud, M. Mohammed, A. Singh, and J.-H. Kim, “De- sign of a Novice-Friendly Drone Control System,” in 2024 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC). Las Vegas, NV , USA: IEEE, Jan. 2024, pp. 0184–0190
2024
-
[6]
Flight, Camera, Action! Using Natural Language and Mixed Reality to Control a Drone,
B. Huang, D. Bayazit, D. Ullman, N. Gopalan, and S. Tellex, “Flight, Camera, Action! Using Natural Language and Mixed Reality to Control a Drone,” in 2019 International Conference on Robotics and Automation (ICRA) . Montreal, QC, Canada: IEEE, May 2019, pp. 6949–6956
2019
-
[7]
Enhancing V oice-Controlled Drone Navigation: A Hybrid Approach Using ASR and NLP for UA V Command Interpretation,
Y . Alkasim and A. Altahhan, “Enhancing V oice-Controlled Drone Navigation: A Hybrid Approach Using ASR and NLP for UA V Command Interpretation,” in Artificial Intelligence XLII , M. Bramer and F. Stahl, Eds. Cham: Springer Nature Switzerland, 2026, vol. 16302, pp. 253–269, series Title: Lecture Notes in Computer Science
2026
-
[8]
FAA Aerospace Forecast Fiscal Years 2022–2042,
Federal Aviation Administration, “FAA Aerospace Forecast Fiscal Years 2022–2042,” U.S. Department of Transportation , Jun. 2022
2022
-
[9]
Design for Acceptance and Intuitive Interaction: Teaming Autonomous Aerial Systems with Non-experts,
S. Ellenrieder, M. Mehler, and M. T. Akdag, “Design for Acceptance and Intuitive Interaction: Teaming Autonomous Aerial Systems with Non-experts,” PACIS 2023 Proceedings, 2023
2023
-
[10]
A Multi-Lingual Speech Recognition-Based Framework to Human-Drone Interaction,
K. Choutri, M. Lagha, S. Meshoul, M. Batouche, Y . Kacel, and N. Mebarkia, “A Multi-Lingual Speech Recognition-Based Framework to Human-Drone Interaction,” Electronics, vol. 11, no. 12, p. 1829, Jun. 2022
2022
-
[11]
In-Vehicle Speech Recognition for V oice- Driven UA V Control in a Collaborative Environment of MA V and UA V,
J.-S. Park and N. Geng, “In-Vehicle Speech Recognition for V oice- Driven UA V Control in a Collaborative Environment of MA V and UA V,”Aerospace, vol. 10, no. 10, p. 841, Sep. 2023
2023
-
[12]
Command-based voice teleoper- ation of a mobile robot via a human-robot interface,
A. Poncela and L. Gallardo-Estrella, “Command-based voice teleoper- ation of a mobile robot via a human-robot interface,”Robotica, vol. 33, no. 1, pp. 1–18, Jan. 2015
2015
-
[13]
V oice Commanded System for Navigation of Mobile Robots,
D. S. Barbosa, A. F. R. Araujo, and E. Gutierrez-Huampo, “V oice Commanded System for Navigation of Mobile Robots,” in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC) . Melbourne, Australia: IEEE, Oct. 2021, pp. 1087–1092
2021
-
[14]
V oice enabled smart drone control,
A. R. Fayjie, A. Ramezani, D. Oualid, and D. J. Lee, “V oice enabled smart drone control,” in 2017 Ninth International Conference on Ubiquitous and Future Networks (ICUFN) . Milan: IEEE, Jul. 2017, pp. 119–121
2017
-
[15]
Unmanned Aerial Vehicle Control through Domain-Based Automatic Speech Recognition,
R. Contreras, A. Ayala, and F. Cruz, “Unmanned Aerial Vehicle Control through Domain-Based Automatic Speech Recognition,”Com- puters, vol. 9, no. 3, p. 75, Sep. 2020
2020
-
[16]
V oice Command Recognition for Drone Control by Deep Neural Networks on Embedded System,
C. Yapicioglu, Z. Dokur, and T. Olmez, “V oice Command Recognition for Drone Control by Deep Neural Networks on Embedded System,” in 2021 8th International Conference on Electrical and Electronics Engineering (ICEEE). Antalya, Turkey: IEEE, 2021, pp. 65–72
2021
-
[17]
Responsive Drone Autopilot System for Uncertain Natural Language Commands,
S. Rajapaksha, V . Illankoon, N. D. Halloluwa, M. Satharana, and D. Umayanganie, “Responsive Drone Autopilot System for Uncertain Natural Language Commands,” in 2019 International Conference on Advancements in Computing (ICAC). Malabe, Sri Lanka: IEEE, Dec. 2019, pp. 232–237
2019
-
[18]
Intelligent voice control system for UA V with mobile robot,
S. Atanov, K. Moldamurat, M. Bakyt, D. Zinagabdenova, A. Moldamurat, B. Zhumazhanov, and A. Maidanov, “Intelligent voice control system for UA V with mobile robot,”Indonesian Journal of Electrical Engineering and Computer Science , vol. 38, no. 2, p. 1061, May 2025
2025
-
[19]
Speech Commands: A Dataset for Limited-V ocabulary Speech Recognition,
P. Warden, “Speech Commands: A Dataset for Limited-V ocabulary Speech Recognition,” 2018, arXiv:1804.03209 [cs]
Pith/arXiv arXiv 2018
-
[20]
Ef- ficient Self-Attention Model for Speech Recognition-Based Assistive Robots Control,
S. Poirier, U. C ˆot´e-Allard, F. Routhier, and A. Campeau-Lecours, “Ef- ficient Self-Attention Model for Speech Recognition-Based Assistive Robots Control,” Sensors, vol. 23, no. 13, p. 6056, Jun. 2023
2023
-
[21]
Speech-Guided Drone Control System Based on Large Language Model,
S.-H. Choi, Z.-C. Kim, and S.-J. Buu, “Speech-Guided Drone Control System Based on Large Language Model,” in 2025 International Conference on Electronics, Information, and Communication (ICEIC). Osaka, Japan: IEEE, Jan. 2025, pp. 1–4
2025
-
[22]
X. Sun, Y . Zhang, X. Tang, A. S. Bedi, and A. Bera, “TrustNavGPT: Modeling Uncertainty to Improve Trustworthiness of Audio-Guided LLM-Based Robot Navigation,” Aug. 2024, arXiv:2408.01867 [cs]
arXiv 2024
-
[23]
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning,
D. Shah, M. Equi, B. Osinski, F. Xia, B. Ichter, and S. Levine, “Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning,” Oct. 2023, arXiv:2310.10103 [cs]
arXiv 2023
-
[24]
Towards end-to-end spoken language understanding,
D. Serdyuk, Y . Wang, C. Fuegen, A. Kumar, B. Liu, and Y . Bengio, “Towards end-to-end spoken language understanding,” Feb. 2018, arXiv:1802.08395 [cs]
Pith/arXiv arXiv 2018
-
[25]
Multimodal Audio- textual Architecture for Robust Spoken Language Understanding,
A. R. Avila, M. Rezagholizadeh, and C. Xing, “Multimodal Audio- textual Architecture for Robust Spoken Language Understanding,” Jun. 2023, arXiv:2306.06819 [cs]
arXiv 2023
-
[26]
Speech-language Pre-training for End-to-end Spoken Language Un- derstanding,
Y . Qian, X. Bian, Y . Shi, N. Kanda, L. Shen, Z. Xiao, and M. Zeng, “Speech-language Pre-training for End-to-end Spoken Language Un- derstanding,” Feb. 2021, arXiv:2102.06283 [cs]
arXiv 2021
-
[27]
Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers’ ability to model hesitation phenomena,
V . N. Vitale, L. Schettino, and F. Cutugno, “Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers’ ability to model hesitation phenomena,” in Interspeech 2024. ISCA, Sep. 2024, pp. 222–226
2024
-
[28]
Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization,
Y .-J. Lu, K. Gao, M. Liang, H. Wang, T. Thebaud, L. Moro-Velazquez, N. Dehak, and J. Villalba, “Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization,” Dec. 2025, arXiv:2512.14687 [cs]
arXiv 2025
-
[29]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,
A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Oct. 2020, arXiv:2006.11477 [cs]
arXiv 2020
-
[30]
HuBERT: Self-Supervised Speech Representa- tion Learning by Masked Prediction of Hidden Units,
W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representa- tion Learning by Masked Prediction of Hidden Units,” Jun. 2021, arXiv:2106.07447 [cs]
arXiv 2021
-
[31]
SUPERB: Speech processing Universal PERformance Benchmark,
S.-w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y . Y . Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech processing Universal PERformance Benchmark,” Oct. 2021, arXiv:2105.01051 [cs]
arXiv 2021
-
[32]
On the Evaluation of Speech Foundation Models for Spoken Language Understanding,
S. Arora, A. Pasad, C.-M. Chien, J. Han, R. Sharma, J.-w. Jung, H. Dhamyal, W. Chen, S. Shon, H.-y. Lee, K. Livescu, and S. Watan- abe, “On the Evaluation of Speech Foundation Models for Spoken Language Understanding,” inFindings of the Association for Computa- tional Linguistics ACL 2024. Bangkok, Thailand and virtual meeting: Association for Computation...
2024
-
[33]
DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT,
H.-J. Chang, S.-w. Yang, and H.-y. Lee, “DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT,” Apr. 2022, arXiv:2110.01900 [cs]
arXiv 2022
-
[34]
V oice Commands for Guidance to a 3D Position: To Collect Sponta- neous Data,
A. Henry, C. Graff, S. Rossato, J.-E. Gomez-Balderas, and S. Huet, “V oice Commands for Guidance to a 3D Position: To Collect Sponta- neous Data,” inProceedings of the 18th ACM International Conference on PErvasive Technologies Related to Assistive Environments . Corfu Island Greece: ACM, Jun. 2025, pp. 410–411
2025
-
[35]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” Aug. 2019, arXiv:1908.10084 [cs]
Pith/arXiv arXiv 2019
-
[36]
CamemBERT: a Tasty French Language Model,
L. Martin, B. Muller, P. J. Ortiz Su ´arez, Y . Dupont, L. Romary, E. De La Clergerie, D. Seddah, and B. Sagot, “CamemBERT: a Tasty French Language Model,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 7203–7219
2020
-
[37]
Text Embeddings by Weakly-Supervised Contrastive Pre- training,
L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei, “Text Embeddings by Weakly-Supervised Contrastive Pre- training,” Feb. 2024, arXiv:2212.03533 [cs]
Pith/arXiv arXiv 2024
-
[38]
J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu, “M3- Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation,” Dec. 2025, arXiv:2402.03216 [cs]
Pith/arXiv arXiv 2025
-
[39]
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers,
W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers,” Apr. 2020, arXiv:2002.10957 [cs]
arXiv 2020
-
[40]
Unsupervised Cross-lingual Representation Learning at Scale,
A. Conneau, K. Khandelwal, N. Goyal, V . Chaudhary, G. Wenzek, F. Guzm ´an, E. Grave, M. Ott, L. Zettlemoyer, and V . Stoyanov, “Unsupervised Cross-lingual Representation Learning at Scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, 2020, pp. 8440–8451
2020
-
[41]
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,
S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing , vol. 16, no. 6, pp. 1505–1518, Oct. 2022, arXiv:...
arXiv 2022
-
[42]
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,
A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” Oct. 2022, arXiv:2202.03555 [cs]
arXiv 2022
-
[43]
V oxPopuli: A Large- Scale Multilingual Speech Corpus for Representation Learning, Semi- Supervised Learning and Interpretation,
C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “V oxPopuli: A Large- Scale Multilingual Speech Corpus for Representation Learning, Semi- Supervised Learning and Interpretation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Con...
2021
-
[44]
Fine-tuned XLSR-53 large model for speech recognition in French,
Grosman, Jonatas, “Fine-tuned XLSR-53 large model for speech recognition in French,” 2021
2021
-
[45]
T. Parcollet, H. Nguyen, S. Evain, M. Z. Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Al- lauzen, M. Coavoux, Y . Esteve, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, “LeBenchmark 2.0: a Standardized, Replicable and Enhanced Frame- work for Self-superv...
arXiv 2024
-
[46]
Pantagruel: Unified Self-Supervised Encoders for French Text and Speech,
P.-H. Le, V . Pelloin, A. Chatelain, M. Bouziane, M. Ghennai, Q. Guan, K. Milintsevich, S. Mdhaffar, A. Mannion, N. Defauw, S. Gu, A. Au- dibert, M. Dinarelli, Y . Esteve, L. Goeuriot, S. Lalande, N. Herve, M. Coavoux, F. Portet, E. Ollion, M. Candito, M. Peyrard, S. Rossato, B. Lecouteux, A. Nardy, G. Serasset, V . Segonne, S. Evain, D. Fabre, and D. Sch...
arXiv 2026
-
[47]
Robust Speech Recognition via Large-Scale Weak Supervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” Proceedings of the 40th International Conference on Machine Learning, vol. 202, pp. 28 492–28 518, Jul. 2023
2023
-
[48]
V oicestick: A spontaneous speech corpus for drone voice guidance,
A. Henry, S. Rossato, C. Graff, J.-E. Gomez-Balderas, and S. Huet, “V oicestick: A spontaneous speech corpus for drone voice guidance,” in 33rd Conference on Natural Language Processing (TALN) , 2026, to appear
2026
-
[49]
Requirements of End-to-End Delays in Remote Control Channel for Remotely Piloted Aerial Systems,
D. Brodnevs and A. Kutins, “Requirements of End-to-End Delays in Remote Control Channel for Remotely Piloted Aerial Systems,” IEEE Aerospace and Electronic Systems Magazine, vol. 36, no. 2, pp. 18–27, Feb. 2021
2021
-
[50]
The Dynamics of Action Corrections in Situated Interaction,
A. Raux and M. Nakano, “The Dynamics of Action Corrections in Situated Interaction,” 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue , pp. 165–174, Sep. 2010
2010
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.