archive
Every paper Pith has read. Search by title, abstract, or pith.
595 papers in cs.DC · page 1
-
APWA scales agent workflows by parallelizing non-communicating subproblems
APWA: A Distributed Architecture for Parallelizable Agentic Workflows
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
Wi-Fi logs build hierarchical mobility models with lower complexity
Analysis of wireless network access logs for a hierarchical characterization of user mobility
-
Unified GPU solver gives exact gradients for stiff heterogeneous soft bodies
DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration
-
Exploration fails above ceil(k/(n-2))-1 deactivations per round
Semi-Synchronous Exploration in Dynamic Graphs
-
Distributed Sumcheck gives statistical zero-knowledge for graph problems
Distributed Statistical Zero-Knowledge Proofs via Sumcheck
-
EMA cuts model adaptation costs 15-42% in shifting environments
EMA: Efficient Model Adaptation for Learning-based Systems
-
MinT manages millions of LoRA policies over shared 1T models
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
-
Federated fine-tuning matches centralized LLM training on private data
Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning
-
Adaptive KV compression speeds disaggregated LLM serving up to 9x
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
-
Client committee speeds secure aggregation 4.6x
DisAgg: Distributed Aggregators for Efficient Secure Aggregation in Federated Learning
-
Multi-agent RL cuts LLM carbon by 33% and water by 43%
MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters
-
Hybrid method cuts graph scheduling violations 45%
Sustainable Graph Analytics Workload Scheduling with Evolutionary Reinforcement Learning in Edge-Cloud Systems
-
Rescaled stepsizes remove bias in async SGD
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
-
TurboGR trains 0.2B-param generative recommenders at 54.71% MFU
TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation
-
FPGA lock agents boost OLTP throughput 51x over CPUs
FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration
-
One rule unifies voting, proposals and constitutional amendment in metric spaces
Constitutional Governance in Metric Spaces
-
Hierarchical transformer preconditioner reaches 21 fps on stiff Poisson systems
Hierarchical Transformer Preconditioning for Interactive Physics Simulation
-
Drone swarms adapt composition to deliver lower latency connectivity
Swarm Network-as-a-Service (SNaaS)
-
Pipeline overlap speeds cloud-edge LLM inference 1.16-2.16x
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
-
Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
-
Dynamic pricing stabilizes mempool volume at target capacity
Dynamic Transaction Scheduling and Pricing in the Ethereum Mempool
-
LCL complexity on trees shifts without exact n knowledge
The Distributed Complexity Landscape on Trees Depends on the Knowledge About the Network Size
-
Overdecomposition supported efficiently on mixed GPGPU clusters
Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms
-
Parallel training lets RNNs learn from sequences over 10,000 steps
Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction
-
Decoupled compression speeds GPU collectives up to 9.65x
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
-
Link failures cap LEO capacity scalability at O(1/n)
Capacity Scalability of LEO Constellations With Dynamic Link Failures
-
Per-head adaptive blocks improve sparse attention accuracy by 5.43%
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
-
Node failures scale wireless capacity and delay with sqrt of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures
-
Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
-
Overlays trade reliability against overhead for AI agent discovery
Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum
-
LLM inference should be measured in joules per token at scale
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
-
GraphFlash hits 127x speedup in serverless graph processing
GraphFlash: Enabling Fast and Elastic Graph Processing on Serverless Infrastructure
-
NAVIS speeds on-SSD vector inserts up to 2.74x
NAVIS: Concurrent Search and Update with Low Position-Seeking Overhead in On-SSD Graph-Based Vector Search
-
Off-chain twins let DeFi agents simulate trades without waiting for blocks
State Twins: An Off-Chain Substrate for Agentic Reasoning over Decentralized Finance Protocols
-
Storage offloading breaks memory wall for full-graph GNN training
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
-
Task runtime dispatches QIR programs to multiple quantum processors
Classic and Quantum Task-Based Intelligent Runtime for QIRs Running on Multiple QPUs
-
Kairos cuts physical AI task latency by 32-66%
Kairos: A Scalable Serving System for Physical AI
-
Chunked prefetching speeds DiT steps up to 1.28x with 49% less GPU memory
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
-
Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
Directed graphs support Byzantine consensus only under specific connectivity
Byzantine Consensus in Directed Graphs with Message Authentication
-
ReCoVer preserves exact training trajectory after GPU losses
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
-
ShardTensor scales SciML to arbitrary spatial resolutions
ShardTensor: Domain Parallelism for Scientific Machine Learning
-
GCC 15 outperforms LLVM 21 in four of six RISC-V vector apps
Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors
-
Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
-
Mutable membership lets MoE survive rank faults without restarts
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
-
Bidirectional review systematizes peer-reviewed studies on AI and DLT convergence
SoK: A Systematic Bidirectional Literature Review of AI & DLT Convergence