Barbarians at the gate: How AI is upending systems research
8 Pith papers cite this work.

8 representative citing papers (2026)

CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering gains of up to 53% (34% on average) over prior LLM serving systems under runtime dynamics.
DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8% on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.
An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.
R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
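Several of the summaries above (Autopoiesis, R^3, the database co-evolution work, PACEvolve++) share one pattern: an outer evolutionary loop in which an LLM proposes candidate policies or programs, each candidate is scored against a workload, and the best survive. The following is a minimal sketch of that loop only. The `propose` function is a random-perturbation stand-in for an LLM synthesizer, and `evaluate` is a toy objective; both names and the greedy selection rule are illustrative assumptions, not any cited paper's actual interface.

```python
import random

def propose(parent, rng):
    """Stand-in for an LLM proposing a variant of `parent` (here: Gaussian noise)."""
    return [g + rng.gauss(0, 0.1) for g in parent]

def evaluate(candidate):
    """Toy objective: negative squared distance from an optimum at all-ones."""
    return -sum((g - 1.0) ** 2 for g in candidate)

def evolve(generations=50, population=8, seed=0):
    """Greedy (1+lambda)-style evolutionary search: keep the best candidate seen."""
    rng = random.Random(seed)
    best = [0.0, 0.0]            # initial policy parameters
    best_score = evaluate(best)
    for _ in range(generations):
        for _ in range(population):
            child = propose(best, rng)
            score = evaluate(child)
            if score > best_score:   # greedy selection: accept only improvements
                best, best_score = child, score
    return best, best_score
```

In the cited systems the expensive steps are the two we stubbed out: `propose` becomes an LLM call that rewrites a policy program, and `evaluate` becomes a real deployment or simulation run, which is why frameworks like PACEvolve++ focus on deciding *which* hypotheses are worth executing.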
Citing papers explorer

- CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
  CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.
- Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
  Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering gains of up to 53% (34% on average) over prior LLM serving systems under runtime dynamics.
- DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
  DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8% on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.
- Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
  R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.