FlexGen: High-throughput generative inference of large language models with a single GPU

· 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference

cs.AR · 2026-04-08 · unverdicted · novelty 5.0

An RL agent using Soft Actor-Critic with Mixture-of-Experts jointly optimizes ASIC architecture, memory hierarchy, and partitioning for AI inference, achieving 29809 tokens/s for Llama 3.1 at 3nm and under 13mW for SmolVLM across 3-28nm nodes without manual retuning.

citing papers explorer

Showing 1 of 1 citing paper.

From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference

cs.AR · 2026-04-08 · unverdicted · none · ref 34
