Benchmarking Local Language Models for Social Robots using Edge Devices
Pith reviewed 2026-05-08 17:37 UTC · model grok-4.3
The pith
Certain 7B language models running on Raspberry Pi deliver balanced speed, energy use, and teaching quality for social robots even when general knowledge scores are moderate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic benchmarking of 25 open-source language models on Raspberry Pi 4 and 5 hardware for social-educational robots shows large differences in throughput and energy use, MMLU accuracy ranging from near-random to 57.2 percent, and no monotonic link between knowledge scores and teaching effectiveness. Granite4 Tiny Hybrid (7B) provides a practical balance at 2.5 tokens per second, 0.90 tokens per joule, and 54.6 percent MMLU accuracy with high teaching ratings; human validation on four models preserves the rank order with Pearson r of 0.967. These measurements support a proposed three-tier local inference architecture for the Robot Study Companion that trades off responsiveness and depth.
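To make those headline numbers concrete, here is a quick back-of-envelope sketch of what they imply for a single tutoring reply; the 100-token reply length is an illustrative assumption, not a figure from the paper.

```python
# Back-of-envelope arithmetic on the reported Granite4 Tiny Hybrid figures.
# The 100-token reply length is an assumed illustrative value, not from the paper.
THROUGHPUT_TPS = 2.5    # tokens per second on Raspberry Pi 4 (reported)
EFFICIENCY_TPJ = 0.90   # tokens per joule (reported)
reply_tokens = 100      # assumed length of one tutoring reply

latency_s = reply_tokens / THROUGHPUT_TPS   # ~40 s to generate the reply
energy_j = reply_tokens / EFFICIENCY_TPJ    # ~111 J consumed while generating
avg_power_w = energy_j / latency_s          # ~2.8 W average draw during decoding

print(f"latency ~{latency_s:.0f} s, energy ~{energy_j:.0f} J, power ~{avg_power_w:.1f} W")
```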
What carries the argument
Multi-dimensional evaluation on Raspberry Pi hardware that jointly tracks inference speed, energy consumption, six-category MMLU accuracy, and pedagogical quality scored by an LLM judge and by five human raters.
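As a rough illustration of how throughput and energy can be measured jointly, the sketch below counts streamed tokens while integrating sampled power draw. The paper's actual instrumentation is not described here, so the power-reading callback and the dummy token stream are placeholders.

```python
import time

def benchmark_generation(generate_stream, read_power_watts, prompt):
    """Measure throughput (tokens/s) and energy efficiency (tokens/J) for one prompt
    by counting streamed tokens while integrating sampled power draw over time.
    generate_stream(prompt) -> iterator of tokens; read_power_watts() -> instantaneous watts."""
    tokens, energy_j = 0, 0.0
    start = last = time.monotonic()
    for _ in generate_stream(prompt):
        tokens += 1
        now = time.monotonic()
        energy_j += read_power_watts() * (now - last)  # rectangle-rule integration
        last = now
    elapsed = time.monotonic() - start
    return {"tokens_per_s": tokens / elapsed,
            "tokens_per_j": tokens / energy_j if energy_j else float("nan")}

def dummy_stream(prompt):
    for tok in "Photosynthesis converts light into chemical energy .".split():
        time.sleep(0.05)  # pretend each token takes 50 ms
        yield tok

# Placeholder: a constant 3 W draw stands in for a real power-monitor reading.
print(benchmark_generation(dummy_stream, lambda: 3.0, "Explain photosynthesis"))
```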
If this is right
- A 7B hybrid model can sustain usable teaching interactions on Raspberry Pi 4 at 2.5 tokens per second while staying within modest energy limits.
- Strong performance on general-knowledge benchmarks is not required for high pedagogical ratings in this domain.
- Human raters can validate automated teaching-quality scores for a small subset of models and preserve the overall ordering.
- A three-tier local inference architecture can be deployed on the Robot Study Companion to match response speed to the demands of each interaction.
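The last point is the one most dependent on design details this summary does not give. Purely to illustrate the idea of matching model size to the demands of each interaction, a hypothetical tier-selection routine might look like the sketch below; the tier names, model identifiers, and latency budgets are assumptions, not the paper's architecture.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str            # hypothetical local model identifiers, not the paper's choices
    max_latency_s: float  # worst-case response time this tier is expected to meet

# Ordered from fastest/shallowest to slowest/deepest (all values illustrative).
TIERS = [
    Tier("reflex", "tiny-1b-instruct", 1.0),      # acknowledgements, turn-taking
    Tier("tutor", "granite4-tiny-hybrid", 40.0),  # typical teaching replies
    Tier("deep", "larger-7b-instruct", 120.0),    # long explanations, no time pressure
]

def pick_tier(requires_depth: bool, latency_budget_s: float) -> Tier:
    """Prefer the most capable tier whose response time fits the interaction's budget."""
    if requires_depth:
        return TIERS[-1]
    for tier in reversed(TIERS):
        if tier.max_latency_s <= latency_budget_s:
            return tier
    return TIERS[0]  # nothing fits: fall back to the fastest tier

print(pick_tier(requires_depth=False, latency_budget_s=60.0).name)  # -> "tutor"
```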
Where Pith is reading between the lines
- Direct trials with children would be needed to confirm whether the models ranked highest here produce measurable learning differences in real sessions.
- The same efficiency-versus-quality trade-offs could guide model selection for other battery-powered or privacy-sensitive robots beyond education.
- Energy-per-token figures offer a concrete way to compare hardware platforms when choosing between Raspberry Pi variants or adding accelerators.
- Instruction-tuned or hybrid models may warrant priority in future educational-robot benchmarks over those optimized only for broad knowledge tests.
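The third point above can be made concrete with a small calculation. The 0.90 tokens per joule is the paper's reported RPi4 figure for Granite4 Tiny Hybrid; the 50 Wh battery and the alternative platform's tokens-per-joule value are hypothetical.

```python
# Translating tokens-per-joule into generation capacity per battery charge.
# 0.90 tok/J is the paper's reported RPi4 figure; the 1.8 tok/J "accelerated
# platform" and the 50 Wh battery are hypothetical values for illustration.
BATTERY_WH = 50
BATTERY_J = BATTERY_WH * 3600  # 1 Wh = 3600 J

platforms = {"RPi4 (reported)": 0.90, "hypothetical accelerated platform": 1.8}
for name, tok_per_j in platforms.items():
    tokens_per_charge = tok_per_j * BATTERY_J  # ignores idle draw, which matters in practice
    print(f"{name}: ~{tokens_per_charge / 1000:.0f}k generated tokens per {BATTERY_WH} Wh charge")
```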
Load-bearing premise
That LLM-generated ratings of pedagogical quality, confirmed by a small set of human raters on a limited set of prompts, reliably predict how well a model would actually teach children during live robot interactions.
What would settle it
A direct study measuring children's engagement or learning gains when the same robot uses different models from the benchmark and checking whether the observed performance ordering matches the paper's automated and human-rated rankings.
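For reference, the validation the paper reports amounts to a correlation over four paired score vectors, along the lines of the sketch below; the scores shown are hypothetical placeholders, not the paper's data, and with n = 4 the uncertainty on r is large.

```python
# Hypothetical illustration of the n=4 validation: correlate LLM-judge pedagogical
# scores with mean human-rater scores for four models. Values are placeholders.
from scipy.stats import pearsonr, spearmanr

llm_judge = [8.6, 7.9, 6.1, 4.2]   # hypothetical per-model pedagogical scores
human_mean = [8.2, 7.5, 6.4, 4.0]  # hypothetical means over five human raters

r, p = pearsonr(llm_judge, human_mean)
rho, _ = spearmanr(llm_judge, human_mean)
print(f"Pearson r = {r:.3f} (p = {p:.3f}), Spearman rho = {rho:.2f}")
```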
Original abstract
Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality), validated against five independent human raters using the Raspberry Pi (RPi) 4 as the primary platform, with additional comparisons on the RPi5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks 25 open-source language models for local deployment on edge hardware (primarily Raspberry Pi 4, with comparisons to Pi 5 and laptop GPU) in the context of social-educational robots such as the Robot Study Companion. It measures inference efficiency (tokens per second and tokens per joule), general knowledge via a six-category MMLU subset, and teaching effectiveness via LLM-rated pedagogical quality, with human validation on four models showing high rank-order correlation (Pearson r=0.967). The work highlights efficiency-accuracy trade-offs, identifies Granite4 Tiny Hybrid (7B) as balanced (2.5 tokens/s, 0.90 tokens/J, 54.6% MMLU), notes that high MMLU is not required for strong teaching scores, and proposes a three-tier local inference architecture.
Significance. If the pedagogical metric is reliable, the paper supplies practical, hardware-specific benchmarking data that can guide deployment of LLMs on severely constrained platforms for interactive education. Strengths include the multi-metric evaluation across 25 models, explicit reporting of efficiency numbers on real edge devices, and the reported human-LLM rank correlation; these elements make the trade-off findings and architecture proposal potentially actionable for the robotics community.
major comments (2)
- [Human Validation subsection] The reported Pearson r=0.967 (n=4) only confirms that the LLM judge preserves automated rank order on static outputs; it supplies no evidence that higher pedagogical scores produce measurable gains in child engagement, retention, or learning during actual RSC interactions on edge hardware. This directly limits support for the non-monotonic teaching claim and the model recommendation.
- [MMLU Evaluation and Teaching Effectiveness sections] The six-category MMLU subset is presented without justification linking its content to pedagogical knowledge required for social-robot tutoring; absent this link, the observation that MMLU accuracy does not monotonically predict teaching scores cannot be confidently generalized to the target RSC application.
minor comments (2)
- [Results tables and figures] Label all efficiency metrics with the exact hardware platform (RPi4 vs. RPi5 vs. GPU) and report variance across multiple runs so readers can assess the stability of the tokens/s and tokens/J numbers.
- [Model selection paragraph] State the inclusion criteria and sources used to arrive at the final set of 25 models so that the benchmark scope can be reproduced or extended.
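On the first minor comment, a minimal harness along the lines below would produce the requested mean-and-variance numbers. It assumes a local Ollama server on its default port (Ollama appears in the paper's reference list) and reads the eval_count and eval_duration fields that the /api/generate endpoint returns for the decode phase; the model tag is a placeholder.

```python
# Sketch of the variance reporting requested above: repeat each prompt several times
# against a local Ollama server and report mean ± std decode throughput.
# Assumes Ollama on its default port; the model tag below is a placeholder.
import statistics
import requests

def decode_tokens_per_s(model: str, prompt: str, runs: int = 5):
    samples = []
    for _ in range(runs):
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        resp.raise_for_status()
        body = resp.json()  # eval_count / eval_duration (ns) cover the decode phase
        samples.append(body["eval_count"] / (body["eval_duration"] / 1e9))
    return statistics.mean(samples), statistics.stdev(samples)

mean_tps, std_tps = decode_tokens_per_s("granite4-tiny-hybrid", "Explain photosynthesis to a child.")
print(f"decode throughput: {mean_tps:.2f} ± {std_tps:.2f} tokens/s over 5 runs")
```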
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope and limitations of our benchmarking study. We address each major comment below and will revise the manuscript to better articulate what our metrics demonstrate and what they do not.
Point-by-point responses
-
Referee: [Human Validation subsection] The reported Pearson r=0.967 (n=4) only confirms that the LLM judge preserves automated rank order on static outputs; it supplies no evidence that higher pedagogical scores produce measurable gains in child engagement, retention, or learning during actual RSC interactions on edge hardware. This directly limits support for the non-monotonic teaching claim and the model recommendation.
Authors: We agree that the human validation (Pearson r=0.967 on n=4 models) only establishes that the LLM judge reliably reproduces human rank order on the static pedagogical outputs we evaluated; it does not constitute evidence of downstream effects on child engagement, retention, or learning in live RSC sessions. Our non-monotonic teaching claim and model recommendation are therefore scoped to the three measured dimensions (efficiency, MMLU, and LLM-rated pedagogical quality) rather than to proven educational outcomes. We will revise the Human Validation subsection and the Discussion to explicitly state this limitation, remove any implication of causal learning gains, and frame the Granite4 Tiny Hybrid recommendation strictly in terms of the observed efficiency–quality trade-off on the tested hardware. revision: yes
-
Referee: [MMLU Evaluation and Teaching Effectiveness sections] The six-category MMLU subset is presented without justification linking its content to pedagogical knowledge required for social-robot tutoring; absent this link, the observation that MMLU accuracy does not monotonically predict teaching scores cannot be confidently generalized to the target RSC application.
Authors: The six-category MMLU subset was selected as a standard, compact proxy for general knowledge across domains that commonly appear in educational dialogues (e.g., STEM, humanities). We did not intend it as a direct measure of pedagogical expertise. The empirical observation that MMLU accuracy does not monotonically predict teaching scores is therefore limited to the 25 models and the specific metrics we computed; we do not claim broader generalization to all RSC tutoring scenarios. We will add a short paragraph in the MMLU Evaluation section justifying the category selection and will revise the Teaching Effectiveness and Discussion sections to emphasize that the non-monotonic relationship is an observation within our benchmark rather than a general principle. revision: yes
- Direct evidence that higher pedagogical scores translate into measurable gains in child engagement, retention, or learning during live RSC interactions on edge hardware would require IRB-approved user studies with children, which lies outside the scope of this benchmarking paper.
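For context on what a six-category MMLU subset score measures operationally, a multiple-choice accuracy loop of the following shape is typical; the two categories, the prompt template, and the always-"A" dummy model are placeholders, since the paper's actual six categories and prompting protocol are not given in this summary.

```python
# Illustrative MMLU-subset scoring loop. Categories, prompt template, and the dummy
# model are placeholders; the paper's actual six categories and protocol are not shown here.
from datasets import load_dataset

LETTERS = "ABCD"
CATEGORIES = ["elementary_mathematics", "astronomy"]  # illustrative, not the paper's six

def ask_model(prompt: str) -> str:
    return "A"  # dummy model that always answers A (~25% chance-level accuracy)

def mmlu_subset_accuracy(model_fn, categories, per_category=50):
    correct = total = 0
    for cat in categories:
        subset = load_dataset("cais/mmlu", cat, split="test").select(range(per_category))
        for item in subset:
            options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
            prompt = f"{item['question']}\n{options}\nAnswer with A, B, C, or D."
            pred = model_fn(prompt).strip()[:1].upper()
            correct += int(pred == LETTERS[item["answer"]])
            total += 1
    return correct / total

print(f"subset accuracy: {mmlu_subset_accuracy(ask_model, CATEGORIES):.1%}")
```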
Circularity Check
Purely empirical benchmarking with no derivations or self-referential claims
Full rationale
The paper performs direct empirical measurements of 25 models on edge hardware across inference speed, energy use, MMLU accuracy, and LLM-rated pedagogical quality, with human rater validation on a subset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the text; results rest on external benchmarks and independent human ratings without any reduction of claims to their own definitions or prior author work by construction.
Reference graph
Works this paper leans on
-
[1]
Privacy issues in Large Language Models: A survey,
H. Kibriya, W. Z. Khan, A. Siddiqa, and M. K. Khan, “Privacy issues in Large Language Models: A survey,” Computers and Electrical Engineering, vol. 120, p. 109698, 2024
2024
-
[2]
Locally-deployed open-source LLMs for code generation: Promises and challenges,
T. Kechaoui, M. W. Ouhab, B. Djamaa, and M. R. Senouci, “Locally-deployed open-source LLMs for code generation: Promises and challenges,” in 2025 7th International Conference on Pattern Analysis and Intelligent Systems (PAIS), 2025, pp. 1–6
2025
-
[3]
Open-Source Robotic Study Companion with Multimodal Human–Robot Interaction to Improve the Learning Experience of University Students,
F. Baksh, M. B. Zorec, and K. Kruusamäe, “Open-Source Robotic Study Companion with Multimodal Human–Robot Interaction to Improve the Learning Experience of University Students,” Applied Sciences, vol. 14, no. 13, 2024
2024
-
[4]
Human-robot interaction in higher education: A literature review,
S. Matus and S. Cano, “Human-robot interaction in higher education: A literature review,” in Social Computing and Social Media, A. Coman and S. Vasilache, Eds. Cham: Springer Nature Switzerland, 2025, pp. 236–256
2025
-
[5]
University Students’ Acceptance of a Robot Study Companion,
F. Baksh, I. Jackson, I. Jackson, and M. B. Zorec, “University Students’ Acceptance of a Robot Study Companion,” in Robotics in Education, R. Balogh, D. Obdržálek, and N. Fachantidis, Eds. Cham: Springer Nature Switzerland, 2025, pp. 38–50
2025
-
[6]
Adapting a teachable robot’s dialog responses using reinforcement learning: Cross-cultural user study exploring effect on engagement,
R. Love, P. R. Cohen, G. Venture, and D. Kulić, “Adapting a teachable robot’s dialog responses using reinforcement learning: Cross-cultural user study exploring effect on engagement,” ACM Transactions on Human-Robot Interaction, vol. 14, no. 4, pp. 1–34, 2025
2025
-
[7]
Investigating Adaptive Robot Tutoring in a Long-Term Interaction in Higher Education,
M. Donnermann, P. Schaper, and B. Lugrin, “Investigating Adaptive Robot Tutoring in a Long-Term Interaction in Higher Education,” in 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Aug. 2022, pp. 171–178
2022
-
[8]
A survey on privacy risks and protection in large language models,
K. Chen, X. Zhou, Y. Lin, S. Feng, L. Shen, and P. Wu, “A survey on privacy risks and protection in large language models,” Journal of King Saud University Computer and Information Sciences, vol. 37, 2025. [Online]. Available: https://doi.org/10.1007/s44443-025-00177-1
-
[9]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” Jan. 2021, arXiv:2009.03300 [cs]. [Online]. Available: http://arxiv.org/abs/2009.03300
2021
-
[10]
GLUE: A multi-task benchmark and analysis platform for natural language understanding,
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, and A. Alishahi, Eds. Brussels, Belgium: Association for Computational Lin...
2018
-
[11]
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios,
B. Xu, Y. Bai, H. Sun, Y. Lin, S. Liu, X. Liang, Y. Li, Z. Dong, J. Zhang, Y. Deng, X. Zou, Y. Gao, and H. Huang, “EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios,” Jan. 2026, arXiv:2505.16160 [cs]. [Online]. Available: http://arxiv.org/abs/2505.16160
2026
-
[13]
[Online]. Available: https://arxiv.org/pdf/2511.07425
-
[14]
LLMPi: Optimizing LLMs for high-throughput on Raspberry Pi,
M. Ardakani, J. Malekar, and R. Zand, “LLMPi: Optimizing LLMs for high-throughput on Raspberry Pi,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. CVF, 2025, p. 6378. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2025W/EDGE/papers/Arda kani LLMPi Optimizing LLMs for High-Throughput on...
2025
-
[15]
LiteRT-optimized INT8 LLM for Raspberry Pi 4 deployment,
K. Yoon, H.-C. Moon, A. Kim, S. Kim, S.-S. Lee, S.-J. Jang, G. Gankhuyag, and J. Jeong, “LiteRT-optimized INT8 LLM for Raspberry Pi 4 deployment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, p. 5683. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2025 W/AIM/html/Yoon LiteRT-Optimized INT...
2025
-
[16]
Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency,
E. J. Husom, A. Goknil, M. Astekin, L. K. Shar, A. Kåsen, S. Sen, B. A. Mithassel, and A. Soylu, “Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency,” ACM Transactions on Internet of Things, vol. 6, no. 4, November 2025
2025
-
[17]
deepeval,
J. Ip and K. Vongthongsri, “deepeval,” Jan. 2026. [Online]. Available: https://github.com/confident-ai/deepeval
2026
-
[18]
Supplemental materials to ’benchmarking local language models for social robots using edge devices’,
D. Lamouille, M. B. Zorec, F. Baksh, and K. Kruusamäe, “Supplemental materials to ’Benchmarking local language models for social robots using edge devices’,” Zenodo, 2026. [Online]. Available: https://doi.org/10.5281/zenodo.19643021
-
[19]
Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting,
M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting,” in The Twelfth International Conference on Learning Representations, 2024, arXiv:2310.11324. [Online]. Available: https://openreview.net/forum?id=RIu5lyNXjT
2024
-
[20]
Does prompt formatting have any impact on LLM performance?
J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan, “Does prompt formatting have any impact on LLM performance?” arXiv preprint arXiv:2411.10541, 2024
-
[21]
Ollama,
Ollama Contributors, “Ollama,” 2024, accessed: 2026-02-03. [Online]. Available: https://github.com/ollama/ollama
2024
-
[22]
Computing Krippendorff’s alpha-reliability,
K. Krippendorff, “Computing Krippendorff’s alpha-reliability,” University of Pennsylvania, Annenberg School for Communication, Tech. Rep. [Online]. Available: https://repository.upenn.edu/asc_papers/43