Benchmarking Local Language Models for Social Robots using Edge Devices
Pith reviewed 2026-05-08 17:37 UTC · model grok-4.3
The pith
Certain 7B language models running on Raspberry Pi deliver balanced speed, energy use, and teaching quality for social robots even when general knowledge scores are moderate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic benchmarking of 25 open-source language models on Raspberry Pi 4 and 5 hardware for social-educational robots shows large differences in throughput and energy use, MMLU accuracy ranging from near-random to 57.2 percent, and no monotonic link between knowledge scores and teaching effectiveness. Granite4 Tiny Hybrid (7B) provides a practical balance at 2.5 tokens per second, 0.90 tokens per joule, and 54.6 percent MMLU accuracy with high teaching ratings; human validation on four models preserves the rank order with Pearson r of 0.967. These measurements support a proposed three-tier local inference architecture for the Robot Study Companion that trades off responsiveness and depth.
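To make those headline numbers concrete, here is a quick back-of-envelope sketch of what they imply for a single tutoring reply; the 100-token reply length is an illustrative assumption, not a figure from the paper.

```python
# Back-of-envelope arithmetic on the reported Granite4 Tiny Hybrid figures.
# The 100-token reply length is an assumed illustrative value, not from the paper.
THROUGHPUT_TPS = 2.5    # tokens per second on Raspberry Pi 4 (reported)
EFFICIENCY_TPJ = 0.90   # tokens per joule (reported)
reply_tokens = 100      # assumed length of one tutoring reply

latency_s = reply_tokens / THROUGHPUT_TPS   # ~40 s to generate the reply
energy_j = reply_tokens / EFFICIENCY_TPJ    # ~111 J consumed while generating
avg_power_w = energy_j / latency_s          # ~2.8 W average draw during decoding

print(f"latency ~{latency_s:.0f} s, energy ~{energy_j:.0f} J, power ~{avg_power_w:.1f} W")
```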
What carries the argument
Multi-dimensional evaluation on Raspberry Pi hardware that jointly tracks inference speed, energy consumption, six-category MMLU accuracy, and pedagogical quality scored by an LLM judge and by five human raters.
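As a rough illustration of how throughput and energy can be measured jointly, the sketch below counts streamed tokens while integrating sampled power draw. The paper's actual instrumentation is not described here, so the power-reading callback and the dummy token stream are placeholders.

```python
import time

def benchmark_generation(generate_stream, read_power_watts, prompt):
    """Measure throughput (tokens/s) and energy efficiency (tokens/J) for one prompt
    by counting streamed tokens while integrating sampled power draw over time.
    generate_stream(prompt) -> iterator of tokens; read_power_watts() -> instantaneous watts."""
    tokens, energy_j = 0, 0.0
    start = last = time.monotonic()
    for _ in generate_stream(prompt):
        tokens += 1
        now = time.monotonic()
        energy_j += read_power_watts() * (now - last)  # rectangle-rule integration
        last = now
    elapsed = time.monotonic() - start
    return {"tokens_per_s": tokens / elapsed,
            "tokens_per_j": tokens / energy_j if energy_j else float("nan")}

def dummy_stream(prompt):
    for tok in "Photosynthesis converts light into chemical energy .".split():
        time.sleep(0.05)  # pretend each token takes 50 ms
        yield tok

# Placeholder: a constant 3 W draw stands in for a real power-monitor reading.
print(benchmark_generation(dummy_stream, lambda: 3.0, "Explain photosynthesis"))
```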
If this is right
- A 7B hybrid model can sustain usable teaching interactions on Raspberry Pi 4 at 2.5 tokens per second while staying within modest energy limits.
- Strong performance on general-knowledge benchmarks is not required for high pedagogical ratings in this domain.
- Human raters can validate automated teaching-quality scores for a small subset of models and preserve the overall ordering.
- A three-tier local inference architecture can be deployed on the Robot Study Companion to match response speed to the demands of each interaction.
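The last point is the one most dependent on design details this summary does not give. Purely to illustrate the idea of matching model size to the demands of each interaction, a hypothetical tier-selection routine might look like the sketch below; the tier names, model identifiers, and latency budgets are assumptions, not the paper's architecture.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str            # hypothetical local model identifiers, not the paper's choices
    max_latency_s: float  # worst-case response time this tier is expected to meet

# Ordered from fastest/shallowest to slowest/deepest (all values illustrative).
TIERS = [
    Tier("reflex", "tiny-1b-instruct", 1.0),      # acknowledgements, turn-taking
    Tier("tutor", "granite4-tiny-hybrid", 40.0),  # typical teaching replies
    Tier("deep", "larger-7b-instruct", 120.0),    # long explanations, no time pressure
]

def pick_tier(requires_depth: bool, latency_budget_s: float) -> Tier:
    """Prefer the most capable tier whose response time fits the interaction's budget."""
    if requires_depth:
        return TIERS[-1]
    for tier in reversed(TIERS):
        if tier.max_latency_s <= latency_budget_s:
            return tier
    return TIERS[0]  # nothing fits: fall back to the fastest tier

print(pick_tier(requires_depth=False, latency_budget_s=60.0).name)  # -> "tutor"
```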
Where Pith is reading between the lines
- Direct trials with children would be needed to confirm whether the models ranked highest here produce measurable learning differences in real sessions.
- The same efficiency-versus-quality trade-offs could guide model selection for other battery-powered or privacy-sensitive robots beyond education.
- Energy-per-token figures offer a concrete way to compare hardware platforms when choosing between Raspberry Pi variants or adding accelerators.
- Instruction-tuned or hybrid models may warrant priority in future educational-robot benchmarks over those optimized only for broad knowledge tests.
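The third point above can be made concrete with a small calculation. The 0.90 tokens per joule is the paper's reported RPi4 figure for Granite4 Tiny Hybrid; the 50 Wh battery and the alternative platform's tokens-per-joule value are hypothetical.

```python
# Translating tokens-per-joule into generation capacity per battery charge.
# 0.90 tok/J is the paper's reported RPi4 figure; the 1.8 tok/J "accelerated
# platform" and the 50 Wh battery are hypothetical values for illustration.
BATTERY_WH = 50
BATTERY_J = BATTERY_WH * 3600  # 1 Wh = 3600 J

platforms = {"RPi4 (reported)": 0.90, "hypothetical accelerated platform": 1.8}
for name, tok_per_j in platforms.items():
    tokens_per_charge = tok_per_j * BATTERY_J  # ignores idle draw, which matters in practice
    print(f"{name}: ~{tokens_per_charge / 1000:.0f}k generated tokens per {BATTERY_WH} Wh charge")
```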
Load-bearing premise
That LLM-generated ratings of pedagogical quality, confirmed by a small set of human raters on a limited set of prompts, reliably predict how well a model would actually teach children during live robot interactions.
What would settle it
A direct study measuring children's engagement or learning gains when the same robot uses different models from the benchmark and checking whether the observed performance ordering matches the paper's automated and human-rated rankings.
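For reference, the validation the paper reports amounts to a correlation over four paired score vectors, along the lines of the sketch below; the scores shown are hypothetical placeholders, not the paper's data, and with n = 4 the uncertainty on r is large.

```python
# Hypothetical illustration of the n=4 validation: correlate LLM-judge pedagogical
# scores with mean human-rater scores for four models. Values are placeholders.
from scipy.stats import pearsonr, spearmanr

llm_judge = [8.6, 7.9, 6.1, 4.2]   # hypothetical per-model pedagogical scores
human_mean = [8.2, 7.5, 6.4, 4.0]  # hypothetical means over five human raters

r, p = pearsonr(llm_judge, human_mean)
rho, _ = spearmanr(llm_judge, human_mean)
print(f"Pearson r = {r:.3f} (p = {p:.3f}), Spearman rho = {rho:.2f}")
```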
Original abstract
Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality), validated against five independent human raters using the Raspberry Pi (RPi) 4 as the primary platform, with additional comparisons on the RPi5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks 25 open-source language models for local deployment on edge hardware (primarily Raspberry Pi 4, with comparisons to Pi 5 and laptop GPU) in the context of social-educational robots such as the Robot Study Companion. It measures inference efficiency (tokens per second and tokens per joule), general knowledge via a six-category MMLU subset, and teaching effectiveness via LLM-rated pedagogical quality, with human validation on four models showing high rank-order correlation (Pearson r=0.967). The work highlights efficiency-accuracy trade-offs, identifies Granite4 Tiny Hybrid (7B) as balanced (2.5 tokens/s, 0.90 tokens/J, 54.6% MMLU), notes that high MMLU is not required for strong teaching scores, and proposes a three-tier local inference architecture.
Significance. If the pedagogical metric is reliable, the paper supplies practical, hardware-specific benchmarking data that can guide deployment of LLMs on severely constrained platforms for interactive education. Strengths include the multi-metric evaluation across 25 models, explicit reporting of efficiency numbers on real edge devices, and the reported human-LLM rank correlation; these elements make the trade-off findings and architecture proposal potentially actionable for the robotics community.
major comments (2)
- [Human Validation subsection] The reported Pearson r=0.967 (n=4) only confirms that the LLM judge preserves automated rank order on static outputs; it supplies no evidence that higher pedagogical scores produce measurable gains in child engagement, retention, or learning during actual RSC interactions on edge hardware. This directly limits support for the non-monotonic teaching claim and the model recommendation.
- [MMLU Evaluation and Teaching Effectiveness sections] The six-category MMLU subset is presented without justification linking its content to pedagogical knowledge required for social-robot tutoring; absent this link, the observation that MMLU accuracy does not monotonically predict teaching scores cannot be confidently generalized to the target RSC application.
minor comments (2)
- [Results tables and figures] Label all efficiency metrics with the exact hardware platform (RPi4 vs. RPi5 vs. GPU) and report variance across multiple runs so readers can assess the stability of the tokens/s and tokens/J numbers.
- [Model selection paragraph] State the inclusion criteria and sources used to arrive at the final set of 25 models so that the benchmark scope can be reproduced or extended.
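On the first minor comment, a minimal harness along the lines below would produce the requested mean-and-variance numbers. It assumes a local Ollama server on its default port (Ollama appears in the paper's reference list) and reads the eval_count and eval_duration fields that the /api/generate endpoint returns for the decode phase; the model tag is a placeholder.

```python
# Sketch of the variance reporting requested above: repeat each prompt several times
# against a local Ollama server and report mean ± std decode throughput.
# Assumes Ollama on its default port; the model tag below is a placeholder.
import statistics
import requests

def decode_tokens_per_s(model: str, prompt: str, runs: int = 5):
    samples = []
    for _ in range(runs):
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        resp.raise_for_status()
        body = resp.json()  # eval_count / eval_duration (ns) cover the decode phase
        samples.append(body["eval_count"] / (body["eval_duration"] / 1e9))
    return statistics.mean(samples), statistics.stdev(samples)

mean_tps, std_tps = decode_tokens_per_s("granite4-tiny-hybrid", "Explain photosynthesis to a child.")
print(f"decode throughput: {mean_tps:.2f} ± {std_tps:.2f} tokens/s over 5 runs")
```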
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope and limitations of our benchmarking study. We address each major comment below and will revise the manuscript to better articulate what our metrics demonstrate and what they do not.
Point-by-point responses
-
Referee: [Human Validation subsection] The reported Pearson r=0.967 (n=4) only confirms that the LLM judge preserves automated rank order on static outputs; it supplies no evidence that higher pedagogical scores produce measurable gains in child engagement, retention, or learning during actual RSC interactions on edge hardware. This directly limits support for the non-monotonic teaching claim and the model recommendation.
Authors: We agree that the human validation (Pearson r=0.967 on n=4 models) only establishes that the LLM judge reliably reproduces human rank order on the static pedagogical outputs we evaluated; it does not constitute evidence of downstream effects on child engagement, retention, or learning in live RSC sessions. Our non-monotonic teaching claim and model recommendation are therefore scoped to the three measured dimensions (efficiency, MMLU, and LLM-rated pedagogical quality) rather than to proven educational outcomes. We will revise the Human Validation subsection and the Discussion to explicitly state this limitation, remove any implication of causal learning gains, and frame the Granite4 Tiny Hybrid recommendation strictly in terms of the observed efficiency–quality trade-off on the tested hardware. revision: yes
-
Referee: [MMLU Evaluation and Teaching Effectiveness sections] The six-category MMLU subset is presented without justification linking its content to pedagogical knowledge required for social-robot tutoring; absent this link, the observation that MMLU accuracy does not monotonically predict teaching scores cannot be confidently generalized to the target RSC application.
Authors: The six-category MMLU subset was selected as a standard, compact proxy for general knowledge across domains that commonly appear in educational dialogues (e.g., STEM, humanities). We did not intend it as a direct measure of pedagogical expertise. The empirical observation that MMLU accuracy does not monotonically predict teaching scores is therefore limited to the 25 models and the specific metrics we computed; we do not claim broader generalization to all RSC tutoring scenarios. We will add a short paragraph in the MMLU Evaluation section justifying the category selection and will revise the Teaching Effectiveness and Discussion sections to emphasize that the non-monotonic relationship is an observation within our benchmark rather than a general principle. revision: yes
- Direct evidence that higher pedagogical scores translate into measurable gains in child engagement, retention, or learning during live RSC interactions on edge hardware would require IRB-approved user studies with children, which lies outside the scope of this benchmarking paper.
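For context on what a six-category MMLU subset score measures operationally, a multiple-choice accuracy loop of the following shape is typical; the two categories, the prompt template, and the always-"A" dummy model are placeholders, since the paper's actual six categories and prompting protocol are not given in this summary.

```python
# Illustrative MMLU-subset scoring loop. Categories, prompt template, and the dummy
# model are placeholders; the paper's actual six categories and protocol are not shown here.
from datasets import load_dataset

LETTERS = "ABCD"
CATEGORIES = ["elementary_mathematics", "astronomy"]  # illustrative, not the paper's six

def ask_model(prompt: str) -> str:
    return "A"  # dummy model that always answers A (~25% chance-level accuracy)

def mmlu_subset_accuracy(model_fn, categories, per_category=50):
    correct = total = 0
    for cat in categories:
        subset = load_dataset("cais/mmlu", cat, split="test").select(range(per_category))
        for item in subset:
            options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
            prompt = f"{item['question']}\n{options}\nAnswer with A, B, C, or D."
            pred = model_fn(prompt).strip()[:1].upper()
            correct += int(pred == LETTERS[item["answer"]])
            total += 1
    return correct / total

print(f"subset accuracy: {mmlu_subset_accuracy(ask_model, CATEGORIES):.1%}")
```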
Circularity Check
Purely empirical benchmarking with no derivations or self-referential claims
Full rationale
The paper performs direct empirical measurements of 25 models on edge hardware across inference speed, energy use, MMLU accuracy, and LLM-rated pedagogical quality, with human rater validation on a subset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the text; results rest on external benchmarks and independent human ratings without any reduction of claims to their own definitions or prior author work by construction.
Reference graph
Works this paper leans on
-
[1]
Privacy issues in Large Language Models: A survey,
H. Kibriya, W. Z. Khan, A. Siddiqa, and M. K. Khan, “Privacy issues in Large Language Models: A survey,” Computers and Electrical Engineering, vol. 120, p. 109698, 2024
2024
-
[2]
Locally-deployed open-source LLMs for code generation: Promises and challenges,
T. Kechaoui, M. W. Ouhab, B. Djamaa, and M. R. Senouci, “Locally-deployed open-source LLMs for code generation: Promises and challenges,” in 2025 7th International Conference on Pattern Analysis and Intelligent Systems (PAIS), 2025, pp. 1–6
2025
-
[3]
Open-Source Robotic Study Companion with Multimodal Human–Robot Interaction to Improve the Learning Experience of University Students,
F. Baksh, M. B. Zorec, and K. Kruusamäe, “Open-Source Robotic Study Companion with Multimodal Human–Robot Interaction to Improve the Learning Experience of University Students,” Applied Sciences, vol. 14, no. 13, 2024
2024
-
[4]
Human-robot interaction in higher education: A literature review,
S. Matus and S. Cano, “Human-robot interaction in higher education: A literature review,” in Social Computing and Social Media, A. Coman and S. Vasilache, Eds. Cham: Springer Nature Switzerland, 2025, pp. 236–256
2025
-
[5]
University Students’ Acceptance of a Robot Study Companion,
F. Baksh, I. Jackson, I. Jackson, and M. B. Zorec, “University Students’ Acceptance of a Robot Study Companion,” in Robotics in Education, R. Balogh, D. Obdržálek, and N. Fachantidis, Eds. Cham: Springer Nature Switzerland, 2025, pp. 38–50
2025
-
[6]
Adapting a teachable robot’s dialog responses using reinforcement learning: Cross-cultural user study exploring effect on engagement,
R. Love, P. R. Cohen, G. Venture, and D. Kulić, “Adapting a teachable robot’s dialog responses using reinforcement learning: Cross-cultural user study exploring effect on engagement,” ACM Transactions on Human-Robot Interaction, vol. 14, no. 4, pp. 1–34, 2025
2025
-
[7]
Investigating Adaptive Robot Tutoring in a Long-Term Interaction in Higher Education,
M. Donnermann, P. Schaper, and B. Lugrin, “Investigating Adaptive Robot Tutoring in a Long-Term Interaction in Higher Education,” in 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Aug. 2022, pp. 171–178
2022
-
[8]
A survey on privacy risks and protection in large language models,
K. Chen, X. Zhou, Y. Lin, S. Feng, L. Shen, and P. Wu, “A survey on privacy risks and protection in large language models,” Journal of King Saud University Computer and Information Sciences, vol. 37, 2025. [Online]. Available: https://doi.org/10.1007/s44443-025-00177-1
-
[9]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” Jan. 2021, arXiv:2009.03300 [cs]. [Online]. Available: http://arxiv.org/abs/2009.03300
2021
-
[10]
GLUE: A multi-task benchmark and analysis platform for natural language understanding,
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, and A. Alishahi, Eds. Brussels, Belgium: Association for Computational Lin...
2018
-
[11]
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios,
B. Xu, Y. Bai, H. Sun, Y. Lin, S. Liu, X. Liang, Y. Li, Z. Dong, J. Zhang, Y. Deng, X. Zou, Y. Gao, and H. Huang, “EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios,” Jan. 2026, arXiv:2505.16160 [cs]. [Online]. Available: http://arxiv.org/abs/2505.16160
2026
-
[13]
[Online]. Available: https://arxiv.org/pdf/2511.07425
-
[14]
LLMPi: Optimizing LLMs for high-throughput on Raspberry Pi,
M. Ardakani, J. Malekar, and R. Zand, “LLMPi: Optimizing LLMs for high-throughput on Raspberry Pi,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. CVF, 2025, p. 6378. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2025W/EDGE/papers/Arda kani LLMPi Optimizing LLMs for High-Throughput on...
2025
-
[15]
LiteRT-optimized INT8 LLM for Raspberry Pi 4 deployment,
K. Yoon, H.-C. Moon, A. Kim, S. Kim, S.-S. Lee, S.-J. Jang, G. Gankhuyag, and J. Jeong, “LiteRT-optimized INT8 LLM for Raspberry Pi 4 deployment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, p. 5683. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2025 W/AIM/html/Yoon LiteRT-Optimized INT...
2025
-
[16]
Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency,
E. J. Husom, A. Goknil, M. Astekin, L. K. Shar, A. Kåsen, S. Sen, B. A. Mithassel, and A. Soylu, “Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency,” ACM Transactions on Internet of Things, vol. 6, no. 4, November 2025
2025
-
[17]
deepeval,
J. Ip and K. Vongthongsri, “deepeval,” Jan. 2026. [Online]. Available: https://github.com/confident-ai/deepeval
2026
-
[18]
Supplemental materials to ’benchmarking local language models for social robots using edge devices’,
D. Lamouille, M. B. Zorec, F. Baksh, and K. Kruusamäe, “Supplemental materials to ’Benchmarking local language models for social robots using edge devices’,” Zenodo, 2026. [Online]. Available: https://doi.org/10.5281/zenodo.19643021
-
[19]
Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting,
M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting,” in The Twelfth International Conference on Learning Representations, 2024, arXiv:2310.11324. [Online]. Available: https://openreview.net/forum?id=RIu5lyNXjT
2024
-
[20]
Does prompt formatting have any impact on LLM performance?
J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan, “Does prompt formatting have any impact on LLM performance?” arXiv preprint arXiv:2411.10541, 2024
-
[21]
Ollama,
Ollama Contributors, “Ollama,” 2024, accessed: 2026-02-03. [Online]. Available: https://github.com/ollama/ollama
2024
-
[22]
Computing Krippendorff’s alpha-reliability,
K. Krippendorff, “Computing Krippendorff’s alpha-reliability,” University of Pennsylvania, Annenberg School for Communication, Tech. Rep. [Online]. Available: https://repository.upenn.edu/asc_papers/43