superword
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
By year: 2026 (2)
Representative citing papers
Proxy compression trains language models on both raw bytes and compressed sequences to enable efficient training with raw-byte inference at test time.
Citing papers explorer
- FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training
  FlashCP introduces Whole-Doc sharding, sharding-aware KV communication, and a heuristic for mixed sharding plans, claiming up to 1.63x speedup over prior CP methods for LLM training.
- Proxy Compression for Language Modeling
  Proxy compression trains language models on both raw bytes and compressed sequences to enable efficient training with raw-byte inference at test time.