SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.
Ice: An intelligent cognition engine with 3d nand-based in-memory computing for vector similarity search acceleration
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
COMPOSE is a timing-driven composable CGRA architecture that fuses cross-iteration operations and defers registration to deliver 1.6x performance and 2.9x EDP gains over prior CGRA designs for recurrence-bound loops.
NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.
QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.
A two-level decoder scheduling framework reduces classical processing requirements for quantum error correction by 10-40% on fault-tolerant benchmarks by managing bursty workloads as shared resources.
citing papers explorer
-
SegFold: Accelerating Sparse GEMM with a Fine-Grained Dynamic Dataflow
SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.
-
Enhancing Instruction Prefetching via Cache and TLB Management
IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
-
COMPOSE: Static Timing-driven Composable Reconfigurable Architecture for Accelerating Recurrence-Bound Loops
COMPOSE is a timing-driven composable CGRA architecture that fuses cross-iteration operations and defers registration to deliver 1.6x performance and 2.9x EDP gains over prior CGRA designs for recurrence-bound loops.
-
NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.
-
Proxics: an efficient programming model for far memory accelerators
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
-
PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores
Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.
-
Learning-Optimized Qubit Mapping and Reuse to Minimize Inter-Core Communication in Modular Quantum Architectures
QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.
-
Managing Classical Processing Requirements for Quantum Error Correction
A two-level decoder scheduling framework reduces classical processing requirements for quantum error correction by 10-40% on fault-tolerant benchmarks by managing bursty workloads as shared resources.
- The EDGE Language: Extended General Einsums for Graph Algorithms