Multiplexing dynamic deep learning workloads with SLO-awareness in GPU clusters

Lingfan Yu, Jinkun Lin, Jinyang Li · 2025 · DOI 10.1145/3689031

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

cs.DC · 2025-10-13 · unverdicted · novelty 5.0

FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

cs.DC · 2026-03-31 · unverdicted · novelty 4.0

CoLLM unifies FL PEFT and inference on shared edge replicas via intra-replica model sharing and two-timescale inter-replica coordination, achieving up to 3x higher goodput than prior LLM systems.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

KubePACS: Kubernetes Cluster Using Performant, Highly Available, and Cost Efficient Spot Instances

cs.DC · 2026-04-27

citing papers explorer

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems · cs.CR · 2026-03-11 · unverdicted · none · ref 73
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters · cs.DC · 2025-10-13 · unverdicted · none · ref 56
CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters · cs.DC · 2026-03-31 · unverdicted · none · ref 12
Seed1.5-VL Technical Report · cs.CV · 2025-05-11 · unverdicted · none · ref 123
KubePACS: Kubernetes Cluster Using Performant, Highly Available, and Cost Efficient Spot Instances · cs.DC · 2026-04-27 · unreviewed · ref 11
