Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: unverdicted (2)
citing papers explorer
- NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
- MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
MSAO cuts end-to-end latency by 30% and resource overhead by 30-65% for multimodal LLM inference through sparsity-aware edge-cloud offloading while preserving accuracy.