Emer, and Saman P
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
citing papers explorer
- LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
  LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and a Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
- Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis
  Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
- Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
  Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
- Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
  Fine-grained fusion and adaptive scheduling in SSMs deliver up to 4.8x speedup and 10x lower on-chip memory, enabling a fusion-aware accelerator with 1.78x higher performance than MARCA at equal area.
- FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
  FILCO introduces a real-time reconfigurable composing architecture for DNN acceleration that achieves 1.3x-5x better throughput and hardware efficiency than prior designs on diverse workloads via an analytical model and two-stage design space exploration.
- The Landscape of GPU-Centric Communication
  A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.
- The EDGE Language: Extended General Einsums for Graph Algorithms