Rabe and Charles Staats
2 Pith papers cite this work.
Fields: cs.LG (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes
  BSA-TNP is a new neural process model with KRBlocks and biased scan attention that claims to match top accuracy while scaling inference to over 1M points in under a minute on a single GPU and supporting translation invariance.
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
  SARATHI uses chunked prefills and decode-maximal batching to let decode steps ride along with prefill compute, delivering up to 10x higher decode throughput and 1.91x end-to-end throughput on models including LLaMA-13B and GPT-3.