ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

Deepak Ganesan; Hui Guan; Lijun Zhang; Xiao Liu

arxiv: 2505.19342 · v2 · pith:OTAJLRKWnew · submitted 2025-05-25 · 💻 cs.LG · cs.AI

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

keywords: astra, inference, multi-device, attention, communication-efficient, models, remains, times

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
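To make the bandwidth argument in the abstract concrete, the following is a minimal sketch of why sending vector-quantized codes instead of full-precision embeddings shrinks the inter-device payload. It is not ASTRA's implementation: the function names, the toy codebook, and every number used (98 non-local tokens, 768 dimensions, 32-bit values, a 1024-entry codebook, one code per token) are illustrative assumptions, and Noise-Augmented Quantization and Distributed Class Tokens are not modeled. Only the 10 Mbps figure comes from the abstract.

```python
import math


def vq_encode(vectors, codebook):
    """Map each vector to the index of its nearest codeword (squared L2 distance)."""
    codes = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def vq_decode(codes, codebook):
    """Reconstruct approximate vectors on the receiving device from the codes."""
    return [list(codebook[i]) for i in codes]


def payload_bits_full(num_tokens, dim, bits_per_value=32):
    """Bits needed to send token embeddings at full precision."""
    return num_tokens * dim * bits_per_value


def payload_bits_vq(num_tokens, codebook_size, codes_per_token=1):
    """Bits needed to send only codebook indices (hypothetical layout)."""
    return num_tokens * codes_per_token * math.ceil(math.log2(codebook_size))


def transfer_ms(bits, bandwidth_mbps):
    """Idealized transfer time in milliseconds, ignoring latency and protocol overhead."""
    return bits / (bandwidth_mbps * 1e6) * 1e3


# Toy 2-D example: three token embeddings, three codewords (all values made up).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
tokens = [(0.9, 1.2), (-0.8, 0.7), (0.1, -0.2)]
codes = vq_encode(tokens, codebook)      # indices sent over the network
approx = vq_decode(codes, codebook)      # what the remote device attends to

# Payload comparison under the assumed sizes, at the 10 Mbps cited in the abstract.
full_bits = payload_bits_full(num_tokens=98, dim=768)
vq_bits = payload_bits_vq(num_tokens=98, codebook_size=1024)
full_ms = transfer_ms(full_bits, bandwidth_mbps=10)
vq_ms = transfer_ms(vq_bits, bandwidth_mbps=10)
```

Under these assumed sizes the full-precision payload is about 2.4 Mbit per exchange against under 1 kbit for the codes, which is the kind of gap that makes low-bandwidth links workable. In the scheme the abstract describes, each device would still attend to its own tokens at full precision and use the decoded approximations only for tokens held by other devices.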

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices
cs.LG 2026-05 conditional novelty 7.0

CATS enables collaborative transformer inference on up to 16 ultra-low-power wireless devices, supporting models up to 14 times larger than a single device can run via SomeGather pruning and message-dropout robustness.