Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Bin Chen; Hong Jia; Kai Bian; Lingyan Ruan; Ting Dang; Xucheng Guo; Yiran Shen

arxiv: 2605.29299 · v3 · pith:ZTGWVQO6new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

Pith reviewed 2026-06-29 08:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords dental image understanding · vision-language models · on-device inference · multimodal question answering · model efficiency · clinical prescreening · compact VLMs

The pith

Compact 2B vision-language models match larger VLMs on dental tasks after lightweight adaptation while using far less compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an efficiency-aware benchmark called Pocket-Dentist that gathers three dental datasets covering roughly 1,159 patients, five task types, and seven metrics to test multimodal question answering. Across 14 vision-language models it finds that 2B-parameter models, once lightly adapted, reach performance close to much larger models on most dental metrics yet run at substantially lower computational cost. This matters for moving dental prescreening out of specialist centers onto ordinary phones where inference must stay fast, private, and local.

Core claim

Compact VLMs such as 2B-parameter models become competitive with much larger VLMs on most metrics after lightweight adaptation while requiring substantially lower computational costs in dental image understanding. The Pocket-Dentist-2B model, when run locally on an iPhone 17 Pro, processed each sample in 4.31 seconds, cutting latency by 4.9 times and memory use by 2.3 times relative to a 7B baseline.
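The reduction factors quoted above imply a baseline figure that this summary does not state directly. The short sketch below works through that arithmetic. It assumes that "4.9 times" and "2.3 times" are ratios of the 7B baseline to the compact model, measured per sample on the same device; the variable names are illustrative and the derived values are not numbers reported by the paper.

```python
# Figures quoted in the core claim.
pocket_latency_s = 4.31   # per-sample latency of Pocket-Dentist-2B on the phone
latency_reduction = 4.9   # "cutting latency by 4.9 times"
memory_reduction = 2.3    # "memory use by 2.3 times"

# Assumption: each factor is (7B baseline value) / (compact model value).
implied_baseline_latency_s = pocket_latency_s * latency_reduction

# Compact model memory expressed as a fraction of the 7B baseline.
relative_memory = 1.0 / memory_reduction

print(f"implied 7B latency: {implied_baseline_latency_s:.1f} s per sample")
print(f"compact model memory: {relative_memory:.0%} of the 7B baseline")
```

Under that reading, the 7B baseline would take roughly 21 seconds per sample, and the compact model would need a little under half of the baseline's memory.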

What carries the argument

The Pocket-Dentist benchmark that unifies three patient datasets, five task types, and seven metrics to measure both accuracy and computational cost for dental multimodal question answering, paired with lightweight adaptation of compact VLMs.
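The Figure 1 caption lists LoRA among the evaluated settings, which is one way adaptation can stay lightweight. The sketch below counts the trainable parameters that a low-rank adapter adds to a single weight matrix. The layer width and rank are hypothetical values chosen for illustration; this summary does not give the paper's actual LoRA configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical sizes, NOT taken from the paper.
d_in = d_out = 2048
rank = 16

lora_params = lora_trainable_params(d_in, d_out, rank)
full_params = d_in * d_out          # parameters in the frozen dense matrix
fraction = lora_params / full_params

print(f"LoRA adds {lora_params} trainable parameters")
print(f"that is {fraction:.2%} of the {full_params} frozen weights in this matrix")
```

With these illustrative sizes, the adapter trains under 2% as many parameters as the matrix it modifies, which is the general reason low-rank adaptation is cheap relative to full finetuning.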

If this is right

On-device inference on consumer phones becomes feasible for dental image tasks.
Patient images can stay local, supporting privacy-preserving prescreening.
Per-sample latency drops to 4.31 seconds, a 4.9x reduction, and memory use falls by 2.3x relative to a 7B baseline.
Timely screening becomes available in settings without cloud access or specialist hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lightweight adaptation pattern may transfer to other medical image domains that need on-device processing.
Consumer apps for preliminary dental checks could run entirely on phones without sending data elsewhere.
Lower memory and latency could reduce battery drain in mobile health tools.

Load-bearing premise

The three datasets spanning about 1,159 patients together with the five tasks and seven metrics adequately represent the needs of practical dental prescreening outside specialist centers.

What would settle it

A new, more diverse dental image collection where the adapted 2B model falls well behind larger models on accuracy metrics even though its compute advantage remains.

Figures

Figures reproduced from arXiv: 2605.29299 by Bin Chen, Hong Jia, Kai Bian, Lingyan Ruan, Ting Dang, Xucheng Guo, Yiran Shen.

**Figure 1.** Overview of the Pocket-Dentist pipeline. We benchmark 14 VLMs across three dental datasets under zero-shot, few-shot, and LoRA settings, identify InternVL3.5-2B as the best-performing compact VLM, and deploy it on an iPhone 17 Pro for on-device local inference.

[…] by curating three heterogeneous dental datasets into standardized prompt–response QA pairs through task-specific reformulation, LLM-assisted conve…

**Figure 2.** Pocket-Dentist iOS app running PocketDentist-2B on an iPhone 17 Pro.

Following MobileAIBench (Murthy et al., 2024), we adopt:
• TTFT (s): time-to-first-token, the latency from prompt submission to first output token.
• ITPS (t/s): input tokens per second, the prompt processing throughput.
• OTPS (t/s): output tokens per second, the generation throughput.
• OET (s): output evaluation time, the wall-clock time for complete re…
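The four efficiency metrics listed with Figure 2 can be computed from a few timestamps and token counts. The sketch below is one plausible way to do that, not the paper's or MobileAIBench's implementation. The `GenerationTrace` type and its field names are hypothetical. Two interpretations are assumptions: prompt processing time is approximated by TTFT, and OET, whose definition is truncated in the source, is taken to be the time from first to last output token.

```python
from dataclasses import dataclass


@dataclass
class GenerationTrace:
    """Hypothetical record of one inference run (times in seconds)."""
    submit_time: float       # prompt submitted
    first_token_time: float  # first output token emitted
    end_time: float          # last output token emitted
    input_tokens: int
    output_tokens: int


def efficiency_metrics(trace: GenerationTrace) -> dict:
    ttft = trace.first_token_time - trace.submit_time
    decode_time = trace.end_time - trace.first_token_time
    if ttft <= 0 or decode_time <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return {
        "TTFT_s": ttft,
        # Assumption: prompt processing takes roughly the whole TTFT window.
        "ITPS": trace.input_tokens / ttft,
        # Simplification: all output tokens are attributed to the decode window.
        "OTPS": trace.output_tokens / decode_time,
        # Assumption: OET is the first-to-last-token wall-clock time.
        "OET_s": decode_time,
    }


# Made-up example values, not measurements from the paper.
example = GenerationTrace(
    submit_time=0.0, first_token_time=0.5, end_time=4.5,
    input_tokens=400, output_tokens=80,
)
metrics = efficiency_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Keeping the raw timestamps in one record makes it easy to swap in a different OET definition once the full text is consulted.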

Original abstract

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients from BRAR and MetaDent, five task types and seven metrics. Across 14 typical VLMs, our results reveal an interesting observation: compact VLMs, such as 2B-parameter models, become competitive with much larger VLMs on most metrics after lightweight adaptation while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9x and memory use by 2.3x compared with a 7B baseline. Our project page is available at https://2026-icml.github.io/pocket-dentist-icml.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pocket-Dentist combines dental datasets into a benchmark and shows 2B VLMs can match larger ones after adaptation with solid iPhone latency numbers, though dataset coverage is the main open question.

The paper's main new contribution is the Pocket-Dentist benchmark, which pulls together three datasets from BRAR and MetaDent covering about 1,159 patients, defines five task types and seven metrics, and runs 14 VLMs on them. The authors report that 2B models reach scores competitive with much larger ones after lightweight adaptation, and they give concrete on-device results: 4.31 seconds per sample on an iPhone 17 Pro, with 4.9x lower latency and 2.3x lower memory use than a 7B baseline.

What stands out is the deployment data. Those latency and memory figures are the sort of practical measurements that matter for privacy-preserving screening on consumer hardware, and the unified benchmark addresses the fragmentation the abstract mentions.

The soft spot is the data foundation. The three datasets may not cover enough real-world variation in demographics, rare conditions, or imaging setups outside controlled specialist environments. If the cases skew common or easy, the claim that compact models become competitive for practical prescreening rests on narrower ground than the abstract suggests. The abstract also gives high-level competitiveness statements without per-metric tables, error bars, or full adaptation details, so the strength of the numbers is hard to judge from the summary alone.

This is for people working on efficient VLMs or on-device medical imaging. A reader focused on those areas would find the benchmark and the phone numbers useful. The work shows clear empirical engagement and deserves a serious referee to check the full tables, adaptation procedures, and data coverage rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering. It aggregates three datasets from BRAR and MetaDent covering approximately 1,159 patients, defines five task types and seven metrics, and evaluates 14 VLMs. The key finding is that compact 2B-parameter VLMs, after lightweight adaptation, become competitive with much larger VLMs on most metrics while requiring substantially lower computational costs. The work also demonstrates on-device deployment of a finetuned 2B model on an iPhone 17 Pro, achieving 4.31 seconds per sample, 4.9x latency reduction, and 2.3x memory reduction compared to a 7B baseline.

Significance. If the results hold, this benchmark could promote the use of efficient, on-device VLMs for dental prescreening, addressing privacy and hardware constraints outside specialist centers. The unification of datasets and focus on efficiency is valuable. The on-device results provide concrete evidence of practicality. However, the significance hinges on whether the datasets adequately represent real-world variability in dental imaging.

major comments (2)

[Abstract] Abstract and evaluation setup: The abstract states high-level competitiveness and latency numbers without detailed per-metric tables, error bars, statistical tests, or full adaptation procedures; the central claim that 2B models become competitive on most metrics cannot be verified from the provided text alone.
[Datasets] Datasets: The three datasets spanning approximately 1,159 patients from BRAR and MetaDent are aggregated without detailing demographic coverage, pathology rarity, imaging device variation, or out-of-distribution cases typical in non-specialist settings. Since the central claim requires that the five task types and seven metrics on these datasets suffice to establish competitiveness for practical dental prescreening, this is load-bearing and risks that the observed 4.9x latency and 2.3x memory gains may not generalize.

minor comments (1)

[Abstract] The project page URL references 2026-ICML, which appears inconsistent with current conference timelines.
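The first major comment asks for error bars on the per-metric results. One common way to obtain them is a percentile bootstrap over per-sample scores, sketched below. This is a generic illustration and not the procedure used in the paper. The function name, the resample count, and the toy scores are all assumptions.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


# Toy per-sample correctness scores (1 = correct), NOT data from the paper.
toy_scores = [1] * 30 + [0] * 10
mean, lower, upper = bootstrap_ci(toy_scores)
print(f"accuracy {mean:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Reporting an interval of this kind for each model and metric would let a reader judge whether a gap between a 2B and a 7B model is larger than the sampling noise.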

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the manuscript content and indicating revisions where they strengthen the work without altering its core claims.

Referee: [Abstract] Abstract and evaluation setup: The abstract states high-level competitiveness and latency numbers without detailed per-metric tables, error bars, statistical tests, or full adaptation procedures; the central claim that 2B models become competitive on most metrics cannot be verified from the provided text alone.

Authors: The abstract is intentionally concise to highlight the main contributions and on-device results. The full manuscript contains the requested details: Section 4 presents per-metric tables with error bars across all 14 VLMs and five task types; Section 3.2 describes the lightweight adaptation procedures (including hyperparameters and training protocol); and statistical comparisons are reported via paired t-tests in the supplementary material. We have revised the abstract to explicitly reference these sections and tables so readers can locate the supporting evidence immediately. revision: partial
Referee: [Datasets] Datasets: The three datasets spanning approximately 1,159 patients from BRAR and MetaDent are aggregated without detailing demographic coverage, pathology rarity, imaging device variation, or out-of-distribution cases typical in non-specialist settings. Since the central claim requires that the five task types and seven metrics on these datasets suffice to establish competitiveness for practical dental prescreening, this is load-bearing and risks that the observed 4.9x latency and 2.3x memory gains may not generalize.

Authors: We agree that explicit characterization of the aggregated data strengthens the benchmark. We have added a new subsection (Section 3.1.1) that summarizes available demographic information, imaging device types, and pathology distributions drawn from the original BRAR and MetaDent publications. We also expanded the limitations paragraph in Section 5 to acknowledge that the current collection may under-represent certain out-of-distribution cases encountered in non-specialist settings and that further multi-center validation would be valuable. The efficiency gains themselves are measured on the same data distribution used for evaluation, so they remain valid within the reported scope; we do not claim universal generalization beyond the benchmark. revision: yes
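The simulated rebuttal refers to paired t-tests for comparing models. The sketch below shows how the test statistic is formed from matched per-item scores, using only the standard library. It is an illustration under assumed inputs, not the authors' analysis code, and the score lists are invented. Converting the statistic to a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for matched score pairs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t_stat = statistics.fmean(diffs) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Invented per-task scores for two models, NOT results from the paper.
compact_model = [0.80, 0.75, 0.90, 0.85]
larger_model = [0.78, 0.76, 0.86, 0.84]
t_stat, dof = paired_t_statistic(compact_model, larger_model)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The pairing matters because both models are scored on the same items, so the test works on per-item differences rather than on two independent samples.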

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivations or self-referential reductions

The paper consists entirely of empirical benchmarking: it aggregates three existing datasets (~1,159 patients), defines five task types and seven metrics, evaluates 14 VLMs (including lightweight adaptation of 2B models), and reports measured latency/memory on an iPhone 17 Pro. No equations, fitted parameters renamed as predictions, uniqueness theorems, ansatzes, or derivation chains appear in the provided text. The central observation (compact VLMs competitive after adaptation) is a direct empirical finding from the benchmark runs, not a reduction to prior inputs by construction. Self-citations, if present, are not load-bearing for any mathematical claim. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen datasets and metrics are representative for dental prescreening; no free parameters or invented entities are described.

axioms (1)

Domain assumption: The three datasets from BRAR and MetaDent with approximately 1,159 patients provide sufficient coverage for the five task types in dental image understanding.
Benchmark validity depends on these datasets representing real clinical variation.
