Multiplexing Dynamic Deep Learning Workloads with SLO-Awareness in GPU Clusters
5 Pith papers cite this work.
Citation roles: background (2)
Citation polarities: background (2)
Citing papers
- PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
  PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
- FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
  FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
- CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters
  CoLLM unifies FL PEFT and inference on shared edge replicas via intra-replica model sharing and two-timescale inter-replica coordination, achieving up to 3x higher goodput than prior LLM systems.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- KubePACS: Kubernetes Cluster Using Performant, Highly Available, and Cost Efficient Spot Instances