Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
Pith reviewed 2026-05-09 22:59 UTC · model grok-4.3
The pith
A multi-layered hardware-software methodology accelerates multimodal foundation models by combining quantization, pruning, speculative decoding, and custom accelerators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a multi-layered methodology can reduce the computational and memory requirements of multimodal foundation models while supporting domain-specific fine-tuning. The methodology combines hardware-software co-design; model compression through hierarchy-aware mixed-precision quantization and structural pruning of transformer blocks and MLP channels; operation optimization via speculative decoding and model cascading with lightweight self-tests; co-optimization of sequence length, visual resolution, and stride, plus graph-level operator fusion; optimized dataflow with memory-efficient attention; and specialized hardware accelerators.
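The model-cascading element of this claim can be illustrated with a toy sketch: each query is routed through models ordered small to large, escalating only when a lightweight self-test rejects the cheaper answer. The models and the acceptance check below are hypothetical stand-ins, not the paper's components.

```python
# Minimal sketch of small-to-large model cascading with a lightweight
# self-test. Illustrative only; not the paper's implementation.

def cascade(query, models, self_test):
    """`models` is ordered small -> large; `self_test(query, answer)` is a
    cheap acceptance check (e.g. running unit tests on generated code)."""
    for model in models[:-1]:
        answer = model(query)
        if self_test(query, answer):
            return answer  # the cheap model sufficed; skip escalation
    return models[-1](query)  # the largest model answers unconditionally

# Toy usage: pretend the small model is only trusted on short queries.
small = lambda q: q.upper()
large = lambda q: q.upper() + "!"
accept = lambda q, a: len(q) < 5  # stand-in self-test
print(cascade("hi", [small, large], accept))            # prints "HI"
print(cascade("longer query", [small, large], accept))  # prints "LONGER QUERY!"
```

The average cost of such a pipeline depends on how often the self-test accepts the small model's answer, which is exactly the trade-off the review questions below probe.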
What carries the argument
The multi-layered hardware-software co-design methodology for transformer blocks, which integrates compression, inference optimization, dataflow tuning, and accelerators to meet on-chip bandwidth and latency budgets.
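The memory-efficient attention that this dataflow leans on is typically realized with the online-softmax trick: keys and values are streamed in blocks while only a running max, denominator, and weighted sum are kept, so the full score matrix is never materialized. The kernel below is a pure-Python illustration (single query, scalar values), not the paper's kernel.

```python
import math

# Sketch of memory-efficient attention via online softmax. Keys/values are
# consumed block by block; only O(1) running state is stored per query.

def tiled_attention(q, keys, vals, block=2):
    m = float("-inf")  # running max of scores (numerical stability)
    denom = 0.0        # running softmax denominator
    out = 0.0          # running weighted sum (scalar values for simplicity)
    for i in range(0, len(keys), block):
        for k, v in zip(keys[i:i + block], vals[i:i + block]):
            s = sum(a * b for a, b in zip(q, k))  # dot-product score
            new_m = max(m, s)
            rescale = math.exp(m - new_m) if m > float("-inf") else 0.0
            denom = denom * rescale + math.exp(s - new_m)
            out = out * rescale + math.exp(s - new_m) * v
            m = new_m
    return out / denom

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
vals = [1.0, 2.0, 3.0, 4.0]
print(round(tiled_attention(q, keys, vals), 6))
```

Changing `block` only alters the streaming schedule, not the result, which matches a direct softmax over all scores; that equivalence is what lets the dataflow be tuned to on-chip bandwidth without changing model outputs.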
If this is right
- The techniques reduce computational and memory requirements for running multimodal foundation models.
- Fine-tuning enables domain-specific adaptation during model development.
- Effectiveness is shown for medical multimodal models and code generation tasks.
- The methodology supports extensions to energy-efficient spiking multimodal models.
- Hardware accelerators can be created via expert design or LLM-aided approaches.
Where Pith is reading between the lines
- Similar co-design patterns might transfer to accelerating other large transformer-based systems beyond multimodal cases.
- The model cascading approach could generalize to reduce average compute in any cascaded inference pipeline.
- LLM-aided hardware design lowers the barrier for creating custom accelerators for specific workloads.
Load-bearing premise
The listed techniques of hierarchy-aware mixed-precision quantization, structural pruning, speculative decoding, model cascading, and hardware accelerators can be co-optimized and applied together to multimodal models without unacceptable accuracy or latency trade-offs.
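Among the listed techniques, speculative decoding is the one whose accuracy/latency coupling is easiest to make concrete: a cheap draft model proposes several tokens and the target model keeps the longest agreeing prefix, so speed depends entirely on how often the draft agrees. The greedy toy below uses hypothetical stand-in "models"; real schemes verify in one batched pass and accept stochastically.

```python
# Toy sketch of greedy speculative decoding. `draft_next` and `target_next`
# are hypothetical next-token functions, not the paper's models.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted in one decode step (always at least one)."""
    # Draft phase: the small model proposes k tokens autoregressively.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # Verify phase: keep the longest prefix the target model agrees with.
    ctx = list(prefix)
    accepted = []
    for t in drafted:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    if not accepted:  # draft was wrong immediately: emit one target token
        accepted.append(target_next(list(prefix)))
    return accepted

# Toy "models" over a fixed string: the draft disagrees at position 2.
target = lambda ctx: "abcdef"[len(ctx)]
draft = lambda ctx: "abXdef"[len(ctx)]
print(speculative_step([], draft, target, k=4))  # prints ['a', 'b']
```

In the greedy variant the output is identical to the target model's alone; only latency changes, which is why the premise's real risk sits with the lossy techniques (quantization, pruning, cascading) rather than this one.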
What would settle it
Applying the full set of techniques together to a medical multimodal foundation model and observing whether accuracy falls below task requirements or end-to-end latency exceeds real-time budgets on standard benchmarks.
Original abstract
This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
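The compression step the abstract names first can be sketched with mixed-precision uniform quantization: each layer gets its own bit-width, with sensitive layers kept wide and tolerant ones narrowed. The per-layer assignment below is hypothetical; the paper's hierarchy-aware assignment (not shown) would come from a sensitivity analysis.

```python
# Minimal sketch of mixed-precision symmetric uniform quantization.
# Pure-Python stand-in, not the paper's method.

def quantize(weights, bits):
    """Quantize a list of floats to signed `bits`-bit integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Hypothetical per-layer assignment: a sensitive attention projection keeps
# 8 bits while a tolerant MLP layer drops to 4 bits.
w = [0.12, -0.5, 0.33]
for name, bits in [("attn.qkv", 8), ("mlp.fc1", 4)]:
    q, s = quantize(w, bits)
    err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
    print(f"{name}: {bits}-bit, max abs error {err:.4f}")
```

The 4-bit layer reconstructs with visibly larger error than the 8-bit one, which is the accuracy/footprint trade-off the mixed-precision assignment is meant to balance.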
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a multi-layered hardware-software co-design methodology for accelerating multimodal foundation models (MFMs). It describes techniques including hierarchy-aware mixed-precision quantization, structural pruning of transformer blocks and MLP channels, speculative decoding, model cascading with lightweight self-tests, co-optimization of sequence length/visual resolution/stride, graph-level operator fusion, memory-efficient attention, and specialized hardware accelerators (expert or LLM-aided design). The paper claims to demonstrate the effectiveness of this methodology on medical-MFMs and code generation tasks and concludes with extensions toward energy-efficient spiking-MFMs.
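Of the enumerated techniques, structural pruning of MLP channels is perhaps the simplest to make concrete: whole hidden channels are ranked by a saliency score and the weakest fraction is removed, shrinking both weight matrices. The magnitude-based criterion and toy layer shapes below are assumptions for illustration, not the manuscript's procedure.

```python
# Hypothetical sketch of structural channel pruning for a two-matrix MLP.
# Channels are ranked by the L2 norm of their fan-in weights; pruning a
# channel drops a row of w_in and the matching column of w_out.

def prune_mlp_channels(w_in, w_out, keep_ratio=0.5):
    """w_in: rows = hidden channels; w_out: columns align with the same
    channels. Returns pruned copies of both matrices."""
    norms = [sum(x * x for x in row) ** 0.5 for row in w_in]
    k = max(1, int(len(w_in) * keep_ratio))
    keep = sorted(sorted(range(len(w_in)), key=lambda i: -norms[i])[:k])
    w_in_p = [w_in[i] for i in keep]
    w_out_p = [[row[i] for i in keep] for row in w_out]
    return w_in_p, w_out_p

# Toy layer: 4 hidden channels; channels 0 and 2 carry little weight.
w_in = [[0.1, 0.1], [1.0, 2.0], [0.01, 0.0], [0.5, 0.5]]
w_out = [[1.0, 2.0, 3.0, 4.0]]
w_in_p, w_out_p = prune_mlp_channels(w_in, w_out, keep_ratio=0.5)
print(len(w_in_p), [len(r) for r in w_out_p])  # prints 2 [2]
```

Unlike unstructured sparsity, this removes whole rows and columns, so the smaller dense matrices map directly onto standard hardware — which is why the referee's question about accuracy cost under such pruning is the substantive one.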
Significance. If the claimed demonstrations were supported by empirical evidence, the work could be significant as a comprehensive framework for co-optimizing multiple acceleration techniques on real multimodal models, with relevance to resource-constrained domains such as medical imaging and code generation. The integration of compression, inference optimizations, and hardware specialization addresses a timely challenge in deploying large MFMs efficiently.
Major comments (2)
- [Abstract] Abstract: The central claim that 'We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks' is unsupported, as the manuscript contains no experimental results, benchmarks, latency/energy metrics, accuracy deltas, ablation studies, implementation details, or comparisons to baselines.
- [Full text] Manuscript body: The text functions as a high-level descriptive overview that enumerates techniques (quantization, pruning, speculative decoding, cascading, accelerators) without any quantitative evaluation, tables, figures, or task-specific results to substantiate co-optimization feasibility or acceptable accuracy/latency trade-offs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The manuscript is a high-level overview and compilation of hardware-software co-design techniques for multimodal foundation models, proposing a multi-layered methodology without presenting new empirical experiments or benchmarks. We will revise the abstract and body to remove any implication of unsupported demonstrations and clarify the paper's scope as a survey with illustrative discussion of applications.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks' is unsupported, as the manuscript contains no experimental results, benchmarks, latency/energy metrics, accuracy deltas, ablation studies, implementation details, or comparisons to baselines.
Authors: We agree that the claim is unsupported by new results. The manuscript compiles and organizes existing techniques into a co-design methodology, using medical and code domains as illustrative contexts drawn from prior literature rather than new experiments. We will revise the abstract to state that we discuss the application of the methodology to these tasks, removing the word 'demonstrate' and any implication of original empirical validation. Revision: yes.
-
Referee: [Full text] Manuscript body: The text functions as a high-level descriptive overview that enumerates techniques (quantization, pruning, speculative decoding, cascading, accelerators) without any quantitative evaluation, tables, figures, or task-specific results to substantiate co-optimization feasibility or acceptable accuracy/latency trade-offs.
Authors: This characterization is correct. The paper presents a conceptual framework and enumeration of techniques with co-optimization strategies, without new quantitative results, tables, or figures. We will add explicit statements in the introduction and conclusion clarifying that the work is an overview paper and that empirical validation of the full co-optimized pipeline is beyond the current scope (or cite relevant prior studies for individual techniques). No new experiments will be added. Revision: partial.
Circularity Check
No circularity: descriptive methodology overview with no derivations or equations
Full rationale
The manuscript is a high-level survey-style description of hardware/software co-design techniques for MFMs. It enumerates methods (quantization, pruning, speculative decoding, cascading, accelerators) and asserts demonstrations on medical-MFMs and code generation, but supplies no equations, fitted parameters, predictions, or self-citation chains. The demonstration claim lacks supporting results, yet this is an evidentiary gap rather than circular reduction of any derivation to its inputs. No load-bearing steps exist that could be self-definitional, fitted-input predictions, or uniqueness imports.
Axiom & Free-Parameter Ledger
No entries: the manuscript introduces no equations, axioms, or fitted parameters (see Circularity Check).
Reference graph
Works this paper leans on
- [1] M. Xu et al., "Resource-efficient algorithms and systems of foundation models: A survey," ACM CSUR, vol. 57, no. 5, Jan. 2025.
- [2] M. Xu et al., "A survey of resource-efficient LLM and multimodal foundation models," arXiv preprint arXiv:2401.08092, 2024.
- [3] A. Ramesh et al., "Zero-shot text-to-image generation," in ICML, 2021.
- [4] L. Khachatryan et al., "Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators," in ICCV, 2023, pp. 15954–15964.
- [5] N. Carion et al., "SAM 3: Segment anything with concepts," arXiv preprint arXiv:2511.16719, 2025.
- [6] H. Liu et al., "Visual instruction tuning," NeurIPS, vol. 36, 2023.
- [7] A. Mir et al., "Advancing healthcare in low-resource environments through an optimization and deployment framework for medical multimodal large language models," Nov. 2024, pp. 1–8.
- [8] K. Kawaharazuka et al., "Vision-language-action models for robotics: A review towards real-world applications," IEEE Access, vol. 13, 2025.
- [9] T. Wang et al., "Dataset distillation," arXiv preprint arXiv:1811.10959, 2018.
- [10] G. Chen et al., "MPrompt: Exploring multi-level prompt tuning for machine reading comprehension," in EMNLP, 2023.
- [11] J. Li et al., "Prefix propagation: Parameter-efficient tuning for long sequences," in ACL (Volume 2: Short Papers), 2023, pp. 1408–1419.
- [12] Z. Fu et al., "On the effectiveness of parameter-efficient fine-tuning," in AAAI, vol. 37, no. 11, 2023, pp. 12799–12807.
- [13] K. Huang et al., "Towards green AI in fine-tuning large language models via adaptive backpropagation," in ICLR, 2024.
- [14] Y. He et al., "EfficientDM: Efficient quantization-aware fine-tuning of low-bit diffusion models," in ICLR, 2024.
- [15] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," in ICLR, 2022.
- [16] K. Li, R. Yang, and X. Hu, "An efficient encoder-decoder architecture with top-down attention for speech separation," in ICLR, 2023.
- [17] X. Ma et al., "Mega: Moving average equipped gated attention," in ICLR, 2023.
- [18] B. Peng et al., "RWKV: Reinventing RNNs for the transformer era," in EMNLP, 2023.
- [19] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," in COLM, 2024.
- [20] C. Riquelme et al., "Scaling vision with sparse mixture of experts," NeurIPS, vol. 34, pp. 8583–8595, 2021.
- [21] S. Bae et al., "Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding," in EMNLP, 2023.
- [22] J. Song et al., "Denoising diffusion implicit models," in ICLR, 2021.
- [23] R. Rombach et al., "High-resolution image synthesis with latent diffusion models," in CVPR, 2022, pp. 10684–10695.
- [24] Y. He et al., "ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models," in ICLR, 2023.
- [25] E. Frantar and D. Alistarh, "SparseGPT: Massive language models can be accurately pruned in one-shot," in ICML, 2023.
- [26] M. Sun et al., "A simple and effective pruning approach for large language models," in ICLR, 2024.
- [27] Y. Zhang et al., "Plug-and-play: An efficient post-training pruning method for large language models," in ICLR, 2024.
- [28] X. Ma, G. Fang, and X. Wang, "LLM-Pruner: On the structural pruning of large language models," in NeurIPS, vol. 36, 2023, pp. 21702–21720.
- [29] J. Song et al., "SLEB: Streamlining LLMs through redundancy verification and elimination of transformer blocks," in ICML, 2024.
- [30] M. Xia et al., "Sheared LLaMA: Accelerating language model pre-training via structured pruning," in ICLR, 2024.
- [31] G. Xiao et al., "SmoothQuant: Accurate and efficient post-training quantization for large language models," in ICML, 2023.
- [32] J. Lin et al., "AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration," in MLSys, vol. 6, 2024, pp. 87–100.
- [33] W. Shao et al., "OmniQuant: Omnidirectionally calibrated quantization for large language models," in ICLR, 2024.
- [34] Z. Liu et al., "SpinQuant: LLM quantization with learned rotations," in ICLR, 2025.
- [35] Y. Gu et al., "MiniLLM: Knowledge distillation of large language models," in ICLR, 2024.
- [36] Y. Li et al., "LoSparse: Structured compression of large language models based on low-rank and sparse approximation," in ICML, 2023.
- [37] A. Kaushal, T. Vaidhya, and I. Rish, "LoRD: Low-rank decomposition of monolingual code LLMs for one-shot compression," in ICML 2024 Workshop on Foundation Models in the Wild, 2024.
- [38] H. Guo, Y. Li, and L. Benini, "Optimal brain restoration for joint quantization and sparsification of LLMs," arXiv preprint arXiv:2509.11177, 2025.
- [39] Z. Liu et al., "Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time," in NeurIPS, 2023.
- [40] Z. Sun et al., "SpecTr: Fast speculative decoding via optimal transport," in NeurIPS, 2023.
- [41] H. Jiang et al., "LLMLingua: Compressing prompts for accelerated inference of large language models," in EMNLP, 2023.
- [42] S. Anagnostidis et al., "Dynamic context pruning for efficient and interpretable autoregressive transformers," in NeurIPS, 2023.
- [43] B. Chen et al., "Model cascading for code: A cascaded black-box multi-model framework for cost-efficient code completion with self-testing," in IJCNN, 2025, pp. 1–9.
- [44] G. Xiao et al., "Efficient streaming language models with attention sinks," in ICLR, 2024.
- [45] R. V. W. Putra et al., "ROMANet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network accelerators," IEEE TVLSI, vol. 29, no. 4, pp. 702–715, 2021.
- [46] R. V. W. Putra, M. A. Hanif, and M. Shafique, "DRMap: A generic DRAM data mapping policy for energy-efficient processing of convolutional neural networks," in DAC, 2020, pp. 1–6.
- [47] J. Li et al., "Large language model inference acceleration: A comprehensive hardware perspective," arXiv preprint arXiv:2410.04466, 2024.
- [48] T. J. Ham et al., "A3: Accelerating attention mechanisms in neural networks with approximation," in HPCA, 2020, pp. 328–341.
- [49] Y. Lin et al., "Towards fully 8-bit integer inference for the transformer model," in IJCAI, 2020, pp. 3759–3765.
- [50] A. Marchisio et al., "SwiftTron: An efficient hardware accelerator for quantized transformers," in IJCNN, 2023, pp. 1–9.
- [51] C. Kachris, "A survey on hardware accelerators for large language models," Applied Sciences, vol. 15, no. 2, p. 586, 2025.
- [52] J. Pan et al., "A survey of research in large language models for electronic design automation," ACM TODAES, vol. 30, no. 3, 2025.
- [53] M. Shao et al., "Survey of different large language model architectures: Trends, benchmarks, and challenges," IEEE Access, 2024.
- [54] P. E. Calzada et al., "VerilogDB: The largest, highest-quality dataset with a preprocessing framework for LLM-based RTL generation," arXiv preprint arXiv:2507.13369, 2025.
- [55] M. Liu, N. Pinckney, B. Khailany, and H. Ren, "VerilogEval: Evaluating large language models for Verilog code generation," in ICCAD, 2023.
- [56] Y. Lu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," in ASP-DAC, 2024, pp. 722–727.
- [57] M. Liu et al., "ChipNeMo: Domain-adapted LLMs for chip design," arXiv preprint arXiv:2311.00176, 2023.
- [58] S. Liu et al., "RTLCoder: Fully open-source and efficient LLM-assisted RTL code generation technique," IEEE TCAD, 2024.
- [59] H. Huang et al., "Towards LLM-powered Verilog RTL assistant: Self-verification and self-correction," arXiv preprint arXiv:2406.00115, 2024.
- [60] Y. Zhao, H. Zhang, H. Huang, Z. Yu, and J. Zhao, "MAGE: A multi-agent engine for automated RTL code generation," in DAC, 2025, pp. 1–7.
- [61] H. Wu et al., "ChatEDA: A large language model powered autonomous agent for EDA," IEEE TCAD, vol. 43, no. 10, pp. 3184–3197, 2024.
- [62] Z. Wang et al., "VeriDispatcher: Multi-model dispatching through pre-inference difficulty prediction for RTL generation optimization," arXiv preprint arXiv:2511.22749, 2025.
- [63] Z. Wang et al., "NetDetox: Adversarial and efficient evasion of hardware-security GNNs via RL-LLM orchestration," arXiv preprint arXiv:2512.00119, 2025.
- [64] Z. Wang et al., "LLMs and the future of chip design: Unveiling security risks and building trust," in ISVLSI, 2024, pp. 385–390.
- [65] W. Xiao et al., "TrojanLoC: LLM-based framework for RTL Trojan localization," arXiv preprint arXiv:2512.00591, 2025.
- [66] S. Tarek et al., "BugWhisperer: Fine-tuning LLMs for SoC hardware vulnerability detection," in VTS, 2025, pp. 1–5.
- [67] Z. Wang et al., "VeriContaminated: Assessing LLM-driven Verilog coding for data contamination," arXiv preprint arXiv:2503.13572, 2025.
- [68] Z. Wang et al., "VeriLeaky: Navigating IP protection vs utility in fine-tuning for LLM-driven Verilog coding," arXiv preprint arXiv:2503.13116, 2025.
- [69] Z. Wang et al., "SALAD: Systematic assessment of machine unlearning on LLM-aided hardware design," arXiv preprint arXiv:2506.02089, 2025.
- [70] B. Zhou et al., "TinyLLaVA: A framework of small-scale large multimodal models," arXiv preprint arXiv:2402.14289, 2024.
- [71] S. Zhang et al., "BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs," 2025.
- [72] C. Li et al., "LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day," 2023.
- [73] A. Basit et al., "ROVER: Autonomous open-vocabulary object searching in unexplored environments using VLM-driven scene understanding," in IJCNN, 2025, pp. 1–8.
- [74] R. V. W. Putra and M. Shafique, "SpikeNAS: A fast memory-aware neural architecture search framework for spiking neural network-based embedded AI systems," IEEE TAI, pp. 1–12, 2025.
- [75] R. V. W. Putra, P. Wickramasinghe, and M. Shafique, "QSLM: A performance- and memory-aware quantization framework with tiered search strategy for spike-driven language models," arXiv preprint arXiv:2601.00679, 2026.
- [76] R.-J. Zhu et al., "SpikeGPT: Generative pre-trained language model with spiking neural networks," TMLR, 2024.
- [77] R. V. W. Putra, S. Iftikhar, and M. Shafique, "QSViT: A methodology for quantizing spiking vision transformers," in IJCNN, 2025, pp. 1–8.
- [78] M. Yao et al., "Spike-driven transformer v2: Meta spiking neural network architecture inspiring the design of next-generation neuromorphic chips," in ICLR, 2024.