{"total":14,"items":[{"citing_arxiv_id":"2607.01817","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention","primary_cat":"cs.DC","submitted_at":"2026-07-02T07:30:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01701","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Arachne: Orchestrating Cascades for Efficient Text-to-Video Model Training","primary_cat":"cs.DC","submitted_at":"2026-07-02T04:50:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Arachne orchestrates cascades for distributed T2V training and reports up to 65% lower iteration time with improving gains at larger scales compared to static bucketing approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10493","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design","primary_cat":"cs.DC","submitted_at":"2026-06-09T07:17:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08587","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory, 2024. URLhttps://arxiv.org/abs/2405.04517. [4] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers, 2023. URLhttps://arxiv.org/abs/2311.09431. [5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2022. URL https: //arxiv.org/abs/2009.14794. [6] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023."},{"citing_arxiv_id":"2605.07569","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware","primary_cat":"cs.DC","submitted_at":"2026-05-08T10:41:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Simulations on 32 to 128 GPUs further show an average gain of 1.36×, and FLOP-equivalent comparisons show that HEXISEQcan approach the throughput of the strongest homogeneous baseline while running on heterogeneous clusters. References [1] Anthropic. What's new in claude opus 4.7, 2026. URLhttps://platform.claude.com/docs/en/about-claude/ models/whats-new-claude-4-7. [2] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan- Kelley . Striped attention: Faster ring attention for causal transformers.arXiv preprint arXiv:2311.09431, 2023. [3] Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, et al."},{"citing_arxiv_id":"2606.17059","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Distributed Inference of LLMs on a P2P Network","primary_cat":"cs.DC","submitted_at":"2026-05-07T15:40:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A decentralized prefix-cache-aware routing scheme for P2P LLM serving improves simulated latency under low-delay skewed workloads but is limited by network latency and hotspots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14561","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism","primary_cat":"cs.DC","submitted_at":"2026-04-16T02:43:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27960","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies","primary_cat":"cs.LG","submitted_at":"2026-03-30T02:23:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18830","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training","primary_cat":"cs.CL","submitted_at":"2025-10-21T17:25:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21275","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training","primary_cat":"cs.DC","submitted_at":"2025-09-25T15:01:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.08608","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision","primary_cat":"cs.LG","submitted_at":"2024-07-11T15:44:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.21060","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","primary_cat":"cs.LG","submitted_at":"2024-05-31T17:50:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ers for Efficient Open Language Models\". In:arXiv preprint arXiv:2404.07839 (2024). [15] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015. [16] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. \"Quasi-recurrent Neural Networks\". In: arXiv preprint arXiv:1611.01576 (2016). [17] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan- Kelley. \"Striped attention: Faster ring attention for causal transformers\". In:arXiv preprint arXiv:2311.09431 (2023). [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al."},{"citing_arxiv_id":"2402.08268","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Model on Million-Length Video And Language With Blockwise RingAttention","primary_cat":"cs.LG","submitted_at":"2024-02-13T07:47:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.06635","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Linear Attention Transformers with Hardware-Efficient Training","primary_cat":"cs.LG","submitted_at":"2023-12-11T18:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. Gers, F. A., Schmidhuber, J., and Cummins, F. A. Learning to forget: Continual prediction with LSTM. Neural Comput., 12(10):2451-2471, 2000. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. 2023. Gu, A., Goel, K., and R'e, C. Efficiently modeling long sequences with structured state spaces. International Conference On Learning Representations, 2021a. Gu, A., Johnson, I., Goel, K., Saab, K. K., Dao, T., Rudra, A., and R'e, C. Combining recurrent, convolutional, and"}],"limit":50,"offset":0}