TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.
InProceedings of the International Conference on Parallel Architectures and Compilation Techniques(New York, NY, USA, 2022), ACM, pp
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.