arxiv: 2605.12433 · v1 · submitted 2026-05-12 · 💻 cs.AR · cs.PF

Enhancing Instruction Prefetching via Cache and TLB Management

Alexandre Valentin Jamet, Dimitrios Chasapis, Georgios Vavouliotis, Marc Casas, Marti Torrents

Pith reviewed 2026-05-13 03:02 UTC · model grok-4.3

keywords instruction prefetching · TLB management · cache replacement · server workloads · microarchitecture · address translation · reuse patterns · front-end performance

The pith

IP-CaT jointly manages TLB translations and L2 cache replacements to enhance L1 instruction prefetching, delivering an 8.7% geomean speedup on server workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern server workloads feature large instruction footprints that create frequent front-end stalls, and existing L1I prefetchers fall short because page-crossing prefetches incur translation delays while many prefetched lines see little or no reuse. The paper introduces IP-CaT, a framework with two targeted components that operate without changing the prefetchers themselves: a compact translation prefetch buffer paired with the second-level TLB, plus a decision-tree replacement policy for the L2 cache that distinguishes among prefetched lines based on reuse. These mechanisms improve prefetch timeliness and retain useful code while discarding dead lines. When added to three different state-of-the-art prefetchers and tested on 105 workloads, the changes produce consistent speedups and beat prior specialized TLB and cache policies. If the results hold, processors could extract substantially more performance from current prefetching hardware.

Core claim

The paper introduces Instruction Prefetch-Centric Cache and TLB Management (IP-CaT), consisting of the translation Prefetch Buffer (tPB) colocated with the second-level TLB to store page table entries for page-crossing L1I prefetches, reducing translation overheads, and the Trimodal Instruction Prefetch Replacement Policy (TIPRP), a decision-tree-based L2 cache replacement policy specialized for lines fetched by L1I prefetchers. When applied to existing prefetchers including EPI, FNL+MMA, and Barca across 105 contemporary server workloads, IP-CaT consistently improves performance, for example achieving an 8.7% geomean speedup over EPI alone, while also outperforming state-of-the-art TLB and cache management policies.

What carries the argument

The translation Prefetch Buffer (tPB), colocated with the second-level TLB, reduces translation latency for page-crossing prefetches, while the Trimodal Instruction Prefetch Replacement Policy (TIPRP) manages the heterogeneous reuse of prefetched L2 lines.

If this is right

IP-CaT improves the effectiveness of multiple existing L1I prefetchers without requiring changes to those prefetchers.
Page-crossing L1I prefetches become timelier because the tPB supplies translations without full page-table walk delays.
L2 cache space is used more efficiently by retaining high-reuse prefetched lines and evicting dead-on-arrival ones via the TIPRP.
IP-CaT surpasses prior instruction TLB prefetching methods, advanced TLB replacement policies such as CHiRP, and multiple cache replacement policies including Emissary, SHiP++, and Mockingjay.
Performance gains appear consistently across three different state-of-the-art L1I prefetchers on a broad set of contemporary server workloads.
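
The translation path described above can be illustrated with a toy model. This is a minimal sketch, not the paper's design: the class name `TranslationPath`, the LRU eviction, the 16-entry default, the 4 KB page size, and the promotion of a tPB hit into the sTLB on a demand fetch are all assumptions made for illustration. The only behavior taken from the summary is that translations fetched by page-crossing L1I prefetches are held in a small buffer beside the sTLB.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KB pages


class TranslationPath:
    """Toy model: an sTLB plus a small LRU tPB filled by page-crossing prefetches."""

    def __init__(self, page_table, tpb_entries=16):
        self.page_table = page_table   # vpn -> pfn, stands in for the page walker
        self.stlb = {}                 # sTLB capacity is not modeled here
        self.tpb = OrderedDict()       # vpn -> pfn, oldest entry first
        self.tpb_entries = tpb_entries
        self.walks = 0                 # number of page-table walks performed

    def _walk(self, vpn):
        self.walks += 1
        return self.page_table[vpn]

    def prefetch_translate(self, vaddr):
        """Translate a page-crossing L1I prefetch; a miss fills the tPB, not the sTLB."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.stlb:
            return self.stlb[vpn]
        if vpn in self.tpb:
            self.tpb.move_to_end(vpn)          # refresh LRU position
            return self.tpb[vpn]
        pfn = self._walk(vpn)
        self.tpb[vpn] = pfn
        if len(self.tpb) > self.tpb_entries:
            self.tpb.popitem(last=False)       # evict the least recently used entry
        return pfn

    def demand_translate(self, vaddr):
        """Translate a demand fetch; a tPB hit is moved into the sTLB (assumption)."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.stlb:
            return self.stlb[vpn]
        pfn = self.tpb.pop(vpn) if vpn in self.tpb else self._walk(vpn)
        self.stlb[vpn] = pfn
        return pfn


# Usage: the prefetch pays for the walk, the later demand fetch does not.
path = TranslationPath({0x10: 0xA0, 0x11: 0xA1}, tpb_entries=2)
path.prefetch_translate(0x11040)
path.demand_translate(0x11080)
assert path.walks == 1
```

Keeping prefetch-triggered translations out of the sTLB until a demand access confirms them is one plausible reading of why a separate buffer helps; the paper's actual fill and promotion rules may differ.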

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Joint TLB and cache management for prefetchers could be extended to data prefetchers if similar page-boundary and reuse heterogeneity appear in data streams.
Lightweight classifiers like the decision tree in TIPRP might be adapted for other microarchitectural decisions that face heterogeneous access patterns.
As instruction footprints grow further in cloud applications, structures such as tPB could help front-end performance scale without proportional increases in TLB size.

Load-bearing premise

The tPB and TIPRP incur negligible area, power, and latency overheads while the heterogeneous reuse patterns observed in the 105 workloads generalize to future server applications.

What would settle it

Hardware implementation showing that tPB accesses add measurable latency or power, or an evaluation on new workloads where the geomean speedup over baseline prefetchers falls to zero.
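
For readers checking what a "geomean speedup" figure means, the sketch below computes it from per-workload IPC values. The function name `geomean_speedup` and the IPC numbers are hypothetical; the paper's own measurement scripts are not shown in this summary.

```python
import math


def geomean_speedup(baseline_ipc, improved_ipc):
    """Geometric mean of per-workload IPC ratios, returned as a percentage gain."""
    if len(baseline_ipc) != len(improved_ipc) or not baseline_ipc:
        raise ValueError("need two non-empty lists of equal length")
    # Averaging logs avoids overflow and is the standard way to form a geomean.
    log_sum = sum(math.log(new / old) for old, new in zip(baseline_ipc, improved_ipc))
    return (math.exp(log_sum / len(baseline_ipc)) - 1.0) * 100.0


# Made-up IPC values for three hypothetical workloads.
base = [1.00, 2.00, 0.50]
with_ipcat = [1.10, 2.00, 0.55]
print(f"{geomean_speedup(base, with_ipcat):.2f}%")
```

Under this definition, the falsification condition above corresponds to the returned value dropping to roughly zero on a new workload set.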

Figures

Figures reproduced from arXiv: 2605.12433 by Alexandre Valentin Jamet, Dimitrios Chasapis, Georgios Vavouliotis, Marc Casas, Marti Torrents.

**Figure 1.** Comparison of IP-CaT with i) combinations of the state-of-the-art TLB replacement policy (CHiRP [5]), instruction TLB prefetcher (Morrigan [4]), and code-aware cache replacement policy (Emissary [6]) and ii) an idealized upper bound combining optimized TLB and cache management for L1I prefetches. This comparison considers EPI [1] as the L1I prefetcher and 105 server workloads.

**Figure 2.** Geomean speedups when page-cross prefetches are discarded (No Page Cross), page-cross prefetching is permitted (Permit Page Cross), and an optimal scenario forcing sTLB hits for all L1I page-cross prefetches (Free Trans L1I Pref).

**Figure 3.** Geomean speedups of two ideal scenarios forcing L2C hits for lines fetched by i) L1I page-cross prefetches (Ideal L2C (PGC Pref)) and ii) L1I prefetches (Ideal L2C (All Pref)). In Ideal L2C (PGC Pref), code lines fetched by page-cross prefetches are not inserted in L2C until a demand L2C access requests them; these entries are instead placed in an infinite buffer located alongside the L2C. …

**Figure 5.** Organization and operation of tPB. Integrating tPB in sTLB: Section IV-A presents tPB as a standalone structure to make its design and operation more transparent. In practical implementations, tPB can be seamlessly integrated into the sTLB by matching its associativity. Under this design, the sTLB is augmented with additional sets that are logically designated for tPB entries. Section VI-E evaluates mu…

**Figure 6.** (a) Overview of TIPRP and the implementation of its adaptive selection logic that dynamically selects between NPIP …

**Figure 7.** Operation of IP-CaT integrated in a standard microarchitecture. … program phase, it is better to aggressively start with BIP and gradually fall into NPIP if needed. If a demand L2C request is served by a Leader Set of NPIP, PSEL1 and PSEL2 are updated only when the line that served the access has not been fetched into L2C by an L1I prefetch request (pb=0 in …

**Figure 8.** Evaluation considering either EPI (top), Barça (middle), or FNL+MMA (bottom) as L1I prefetcher.

**Figure 13.** Comparing IP-CaT against an augmented sTLB. … lines and code translations in L2C and sTLB, respectively. For example, when EPI is used, IP-CaT outperforms Mockingjay, PACIPV, SHiP++, SRRIP, and DRRIP by 8.0%, 3.0%, 5.3%, 4.4%, and 3.6%, respectively.

**Figure 14.** Evaluation of IP-CaT considering a baseline with TAGE-SC-L, and a scenario using TAGE-SC-L with ITTAGE. Impact of Indirect Branch Target Prediction: Our baseline uses TAGE-SC-L as conditional branch predictor. This section evaluates IP-CaT with and without the state-of-the-art indirect branch predictor ITTAGE [71] to quantify its impact on the performance of our proposal. We consider two configurations…

**Figure 12.** Comparison against L2C variants of the state-of-the-art replacement policies. Comparison with L2C Variants of Prior Policies: This section compares TIPRP and IP-CaT with the state-of-the-art policies of Table II which are originally designed for the LLC (Mockingjay, PACIPV, SHiP++, SRRIP, and DRRIP), now applied to L2C. We exclude PACMAN from this study due to its inferior performance …

**Figure 15.** Performance breakdown of IP-CaT (y-axis: % Speedup; series: tPB, TIPRP, IP-CaT).

**Figure 17.** Comparison to a variation of IP-CaT applying TIPRP to both demand and prefetch code lines in L2C (IP-CaT D+P). Applying TIPRP to Demand Instruction Accesses: Figure 17 compares IP-CaT to a variation of IP-CaT that applies the TIPRP replacement policy not only to lines fetched by L1I prefetches but also to demand instruction accesses; we refer to this scheme as IP-CaT D+P …

**Figure 20.** Comparing IP-CaT with a 64-entry fully-associative tPB against augmenting sTLB by 4 and 8 sets of 12 ways.

**Figure 21.** Sensitivity of IP-CaT’s performance to the LLC size. … larger LLCs (e.g., 4MB), IP-CaT continues to provide significant performance improvements. Similar trends are observed for FNL+MMA and Barça. An additional observation from …

**Figure 22.** Performance improvement of all considered scenarios (Table II), excluding PACMAN due to inferior performance, when the baseline uses both 4KB and 2MB pages, as explained in Section V. The top, middle, and bottom plots show results for EPI, Barça, and FNL+MMA, respectively. The x-axis reports the proportion of the memory footprint mapped in large pages as compared to small pages (e.g., 5% refers…

**Figure 23.** Performance evaluation in 4-core context.

**Figure 24.** Performance evaluation using SMT workloads. … presented in Section V. We observe that IP-CaT outperforms all competing schemes across all considered prefetchers. The results show similar trends as in the single-thread evaluation, but with larger absolute speedups due to increased contention for structures such as the sTLB and L2C. For example, when considering EPI, IP-CaT outperforms CLIP, PACIPV, and P…

Original abstract

Modern server workloads exhibit massive instruction footprints that heavily pressure the processor front-end, making L1 instruction (L1I) prefetching critical for sustaining performance. However, this paper shows that current L1I prefetchers fail to reach their full potential due to two key limitations. First, L1I prefetches crossing page boundaries require address translation before issuance, and translation latency reduces prefetch timeliness. Second, the reuse behavior of code lines fetched by L1I prefetches is highly heterogeneous: while some lines are reused many times, others are dead-on-arrival. This paper introduces Instruction Prefetch-Centric Cache and TLB Management (IP-CaT), the first microarchitectural framework jointly optimizing TLB and cache management for L1I prefetching. IP-CaT consists of two components: (i) the translation Prefetch Buffer (tPB), a small structure colocated with the second-level TLB (sTLB) that stores page table entries fetched by page-crossing L1I prefetches, reducing translation overheads; and (ii) the Trimodal Instruction Prefetch Replacement Policy (TIPRP), a decision-tree-based L2 cache replacement policy specialized for lines fetched by L1I prefetches. We evaluate IP-CaT with three state-of-the-art L1I prefetchers: EPI, FNL+MMA, and Barca. Across 105 contemporary server workloads, IP-CaT consistently improves performance. For example, IP-CaT+EPI achieves an 8.7% geomean speedup over EPI alone. We further show that IP-CaT outperforms state-of-the-art instruction TLB prefetching, advanced TLB replacement (CHiRP), and state-of-the-art code-aware, prefetch-aware, and general-purpose cache replacement policies, including Emissary, SHiP++, and Mockingjay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IP-CaT, a framework to enhance L1I prefetching by jointly optimizing TLB and cache management. It introduces the translation Prefetch Buffer (tPB) colocated with the sTLB to cache page table entries for page-crossing prefetches and reduce translation latency, plus the Trimodal Instruction Prefetch Replacement Policy (TIPRP), a decision-tree L2 replacement policy specialized for lines fetched by L1I prefetchers. Evaluated with EPI, FNL+MMA, and Barca across 105 server workloads, IP-CaT+EPI delivers an 8.7% geomean speedup over EPI alone and outperforms instruction TLB prefetching, CHiRP, Emissary, SHiP++, and Mockingjay.

Significance. If the central performance claims hold after explicit overhead accounting, the work would be significant for server processor front-end design. Large instruction footprints make L1I prefetch timeliness and reuse heterogeneity key bottlenecks; a lightweight joint TLB-cache approach that demonstrably improves multiple prefetchers without offsetting costs would be a practical contribution with broad applicability.

major comments (3)

[Abstract] The 8.7% geomean speedup for IP-CaT+EPI is reported without any cycle-accurate latency, power, or area numbers for the tPB on the translation path or for TIPRP decision logic; these overheads are load-bearing because even small added latency on page-crossing prefetches or extra dynamic power could erase or reverse the net gain.
[Evaluation] No error bars, no workload selection criteria or cross-validation of TIPRP on held-out workloads, and no sensitivity analysis to tPB size are provided; this leaves open whether the reported outperformance over Emissary/SHiP++/Mockingjay is robust or corpus-specific.
[Design of tPB] The claim that colocating tPB with sTLB incurs negligible latency requires explicit modeling of the lookup on the critical path for page-crossing L1I prefetches; without it the timeliness benefit cannot be verified.
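
The request for error bars can be made concrete. The following is a minimal sketch of a percentile bootstrap interval for a geomean speedup, resampling workloads with replacement; the function names, the 2000-resample default, and the example ratios are assumptions for illustration, not anything the paper or the referee specifies.

```python
import math
import random


def geomean(ratios):
    """Geometric mean of positive speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def bootstrap_geomean_ci(ratios, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the geomean of per-workload speedup ratios."""
    rng = random.Random(seed)          # fixed seed keeps the interval reproducible
    n = len(ratios)
    stats = sorted(
        geomean([ratios[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )
    lo = stats[int((alpha / 2) * resamples)]
    hi = stats[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lo, hi


# Hypothetical per-workload speedup ratios (improved IPC / baseline IPC).
ratios = [1.05, 1.10, 1.02, 1.20, 0.98, 1.08]
low, high = bootstrap_geomean_ci(ratios)
print(f"geomean {geomean(ratios):.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

An interval that excludes 1.0 would support the claim that the gain is not an artifact of a few favorable workloads, although it says nothing about workloads outside the sampled corpus.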

minor comments (2)

[Abstract] The sizes of tPB and the decision-tree depth of TIPRP are described only as 'small' and 'specialized'; quantitative parameters would improve reproducibility.
[Design] The paper could add a short table summarizing tPB and TIPRP hardware costs (entries, bits, comparators) for direct comparison with baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments on overhead accounting, evaluation robustness, and tPB timing analysis are well-taken. We address each major comment below and will make targeted revisions to strengthen the manuscript.

Referee: [Abstract] The 8.7% geomean speedup for IP-CaT+EPI is reported without any cycle-accurate latency, power, or area numbers for the tPB on the translation path or for TIPRP decision logic; these overheads are load-bearing because even small added latency on page-crossing prefetches or extra dynamic power could erase or reverse the net gain.

Authors: We agree that the abstract would benefit from explicit context on overheads to support the net speedup claim. The full manuscript already includes area, power, and latency estimates for the tPB (small buffer colocated with sTLB) and TIPRP decision logic in the hardware overhead and evaluation sections, showing tPB area under 1 KB with negligible dynamic power and no added critical-path latency in our models. To directly address the concern, we will revise the abstract to note that these overheads were modeled and do not offset the reported gains, with a pointer to the detailed analysis. This is a partial revision focused on visibility rather than new data. revision: partial
Referee: [Evaluation] No error bars, no workload selection criteria or cross-validation of TIPRP on held-out workloads, and no sensitivity analysis to tPB size are provided; this leaves open whether the reported outperformance over Emissary/SHiP++/Mockingjay is robust or corpus-specific.

Authors: We appreciate the call for greater statistical rigor and validation. The 105 workloads were drawn from standard server suites (SPEC CPU, CloudSuite, and production server traces) selected specifically for large instruction footprints; we will add explicit selection criteria and workload characteristics to the evaluation section. We will also add error bars to all geomean and per-workload figures. To test robustness, we will include a new sensitivity study on tPB size (varying from 4 to 32 entries) and, for TIPRP, a cross-validation experiment that partitions workloads into training and test sets to confirm that the decision-tree policy generalizes and that the gains over Emissary, SHiP++, and Mockingjay are not corpus-specific. These additions will be incorporated in the revised manuscript. revision: yes
Referee: [Design of tPB] The claim that colocating tPB with sTLB incurs negligible latency requires explicit modeling of the lookup on the critical path for page-crossing L1I prefetches; without it the timeliness benefit cannot be verified.

Authors: We acknowledge that the current description of tPB colocation would be strengthened by explicit critical-path modeling. The tPB is a small structure (8-16 entries) placed adjacent to the sTLB to enable parallel or overlapped lookup for page-crossing prefetches. Our cycle-level simulations already account for this access and show no additional stalls. In the revision, we will add a dedicated timing diagram and pipeline analysis subsection that models the exact lookup sequence on the translation path, confirming the added latency remains hidden within existing TLB access cycles. This will make the timeliness benefit fully verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external baselines with no derivations or self-referential reductions

The paper presents a microarchitectural proposal (tPB colocated with sTLB and TIPRP decision-tree policy) evaluated via simulation on 105 server workloads. All performance claims (e.g., 8.7% geomean speedup of IP-CaT+EPI over EPI) are reported as measured speedups relative to independent prior prefetchers and replacement policies (EPI, FNL+MMA, Barca, Emissary, SHiP++, Mockingjay, CHiRP). No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central results are therefore falsifiable by re-running the same simulators on the same workloads rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review provides no equations or detailed methods, so the ledger is limited to the two new structures introduced; no free parameters or standard axioms are identifiable from the given text.

invented entities (2)

translation Prefetch Buffer (tPB) · no independent evidence
purpose: Stores page table entries fetched by page-crossing L1I prefetches to reduce translation latency
New structure colocated with sTLB, proposed to address translation overhead for crossing prefetches

Trimodal Instruction Prefetch Replacement Policy (TIPRP) · no independent evidence
purpose: Decision-tree-based L2 cache replacement specialized for lines fetched by L1I prefetches
New policy to handle heterogeneous reuse of prefetched code lines
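
To give a feel for how a "trimodal", decision-tree-style selector could be built from dueling counters, here is a minimal sketch of two-level set dueling over three insertion modes. It is an illustration under heavy assumptions, not TIPRP itself: the Figure 7 excerpt mentions BIP, NPIP, leader sets, PSEL1, and PSEL2, but the third mode (called `THIRD` here as a placeholder), the leader-set layout, the counter width, and the update rule on misses are all invented for this example. The excerpt's rule about skipping updates for prefetched lines (the pb bit) is not modeled.

```python
class SaturatingCounter:
    """Counter that clamps to [0, 2**bits - 1] and starts at the midpoint."""

    def __init__(self, bits=10):
        self.max = (1 << bits) - 1
        self.value = self.max // 2

    def add(self, delta):
        self.value = max(0, min(self.max, self.value + delta))

    def high(self):
        return self.value > self.max // 2


class TrimodalSelector:
    """Two dueling counters arranged as a two-level decision tree over three modes."""

    MODES = ("BIP", "NPIP", "THIRD")  # "THIRD" is a placeholder, not a name from the paper

    def __init__(self, stride=32, bits=10):
        self.stride = stride                   # hypothetical leader-set spacing
        self.psel1 = SaturatingCounter(bits)   # duel: BIP vs NPIP
        self.psel2 = SaturatingCounter(bits)   # duel: NPIP vs THIRD

    def leader_mode(self, set_index):
        """Return the mode a leader set is pinned to, or None for follower sets."""
        slot = set_index % self.stride
        return self.MODES[slot] if slot < len(self.MODES) else None

    def record_miss(self, set_index):
        """A miss in a leader set counts against that leader's mode."""
        mode = self.leader_mode(set_index)
        if mode == "BIP":
            self.psel1.add(+1)
        elif mode == "NPIP":
            self.psel1.add(-1)
            self.psel2.add(+1)
        elif mode == "THIRD":
            self.psel2.add(-1)

    def mode_for(self, set_index):
        """Leader sets keep their pinned mode; followers walk the two-level tree."""
        pinned = self.leader_mode(set_index)
        if pinned is not None:
            return pinned
        if not self.psel1.high():   # BIP leaders are not missing more than NPIP leaders
            return "BIP"
        return "THIRD" if self.psel2.high() else "NPIP"
```

With both counters at their midpoint, follower sets begin in BIP and move to the other modes only as BIP leader sets accumulate misses, which loosely mirrors the excerpt's remark about starting with BIP and falling back to NPIP if needed.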

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
IP-CaT comprises two modules: i) the translation Prefetch Buffer (tPB), a small buffer located alongside the last-level TLB (sTLB) ... and ii) the Trimodal Instruction Prefetch Replacement Policy (TIPRP), a decision-tree based replacement policy for the L2 cache
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We evaluate IP-CaT with three state-of-the-art L1I prefetchers ... across 105 contemporary server workloads

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages

[1] A. Ros and A. Jimborean, “A Cost-Effective Entangling Prefetcher for Instructions,” in Proceedings of the 48th International Symposium on Computer Architecture, ser. ISCA ’21, 2021, pp. 99–111. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00017
[2] A. Seznec, “The FNL+MMA Instruction Cache Prefetcher,” https://hal.inria.fr/hal-02884880/document
[3] P. Gratz, D. A. Jiménez, N. Gober, and G. Chacon, “Barca: Branch-agnostic region searching algorithm,” in Proceedings of the First Instruction Prefetching Championship (IPC), 2020
[4] G. Vavouliotis, L. Alvarez, B. Grot, D. Jiménez, and M. Casas, “Morrigan: A Composite Instruction TLB Prefetcher,” in Proceedings of the 54th International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 1138–1153. [Online]. Available: https://doi.org/10.1145/3466752.3480049
[5] S. Mirbagher-Ajorpaz, E. Garza, G. Pokam, and D. A. Jiménez, “CHiRP: Control-Flow History Reuse Prediction,” in Proceedings of the 2020 53rd International Symposium on Microarchitecture, ser. MICRO ’16, 2020, pp. 131–145. [Online]. Available: https://doi.org/10.1109/MICRO50266.2020.00023
[6] N. P. Nagendra, B. R. Godala, I. Chaturvedi, A. Patel, S. Kanev, T. Moseley, J. Stark, G. A. Pokam, S. Campanoni, and D. I. August, “Emissary: Enhanced miss awareness replacement policy for l2 instruction caching,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. New York, NY, USA: Association for Computin...
[7] V. Young, C.-C. Chou, A. Jaleel, and M. K. Qureshi, “SHiP++: Enhancing Signature-Based Hit Predictor for Improved Cache Performance,” in 2nd Cache Replacement Championship (CRC-2), in conjunction with ISCA 2017, Jun. 2017. [Online]. Available: https://crc2.ece.tamu.edu/?page_id=53
[8] I. Shah, A. Jain, and C. Lin, “Effective mimicry of belady’s min policy,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), April 2022, pp. 558–572. [Online]. Available: https://doi.org/10.1109/HPCA53966.2022.00048
[9] N. P. Nagendra, G. Ayers, D. I. August, H. K. Cho, S. Kanev, C. Kozyrakis, T. Krishnamurthy, H. Litz, T. Moseley, and P. Ranganathan, “AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers,” IEEE Micro, vol. 40, no. 3, pp. 56–63, 2020. [Online]. Available: https://doi.org/10.1145/3307650.3322234
[10] M. Panchenko, R. Auler, B. Nell, and G. Ottoni, “BOLT: A Practical Binary Optimizer for Data Centers and Beyond,” in Proceedings of the 2019 International Symposium on Code Generation and Optimization, ser. CGO ’19. IEEE Press, 2019, pp. 2–14. [Online]. Available: https://doi.org/10.1109/CGO.2019.8661201
[11] “Hot Chips 2023: Arm’s Neoverse V2,” https://chipsandcheese.com/2023/09/11/hot-chips-2023-arms-neoverse-v2/
[12] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware,” in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’12. New...
[13] [Online]. Available: http://doi.acm.org/10.1145/2150976.2150982
[14] G. Reinman, B. Calder, and T. Austin, “Fetch Directed Instruction Prefetching,” in Proceedings of the 32nd International Symposium on Microarchitecture, ser. MICRO ’99, 1999, pp. 16–27. [Online]. Available: https://doi.org/10.1109/MICRO.1999.809439
[15] A. Basu, M. D. Hill, and M. M. Swift, “Reducing Memory Reference Energy with Opportunistic Virtual Caching,” in Proceedings of the 39th International Symposium on Computer Architecture, ser. ISCA ’12, 2012, pp. 297–308. [Online]. Available: https://doi.org/10.1109/ISCA.2012.6237026
[16] Abhishek Bhattacharjee, “Advanced Concepts on Address Translation, Appendix L in “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson,” http://www.cs.yale.edu/homes/abhishek/abhishek-appendix-l.pdf
[17] “ARM Cortex-A55 Core Technical Reference Manual r1p0,” https://developer.arm.com/documentation/100442/0100/functional-description/level-1-memory-system/data-prefetching?lang=en
[18] G. Vavouliotis, G. Chacon, L. Alvarez, P. V. Gratz, D. A. Jiménez, and M. Casas, “Page Size Aware Cache Prefetching,” in Proceedings of the 55th International Symposium on Microarchitecture, ser. MICRO ’22, 2022, pp. 956–974. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00070
[19] G. Vavouliotis, M. Torrents, B. Grot, K. Kalaitzidis, L. Peled, and M. Casas, “To cross, or not to cross pages for prefetching?” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2025, pp. 188–203. [Online]. Available: https://doi.org/10.1109/HPCA61900.2025.00025
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: Association for Computing Machinery, 2007, p. 381–391. [Online]. Available: https://doi.org/10.1145/1250662.1250709
[21] A. Jaleel, J. Nuzman, A. Moga, S. C. Steely, and J. Emer, “High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 343–353. [Online]. Available: https://doi.org/10.1109/HPCA.2015.7056045
[22] C.-J. Wu, A. Jaleel, M. Martonosi, S. C. Steely, and J. Emer, “Pacman: Prefetch-aware cache management for high performance caching,” in 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 442–453. [Online]. Available: https://doi.org/10.1145/2155620.215567
[23] S. Mostofi, S. Gupta, A. Hassani, K. Tibrewala, E. Teran, P. V. Gratz, and D. A. Jiménez, “Light-weight cache replacement for instruction heavy workloads,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 1005–1019. [Online]. Available: ht...
[24] H. Gao and C. Wilkerson, “A Dueling Segmented LRU Replacement Algorithm with Adaptive Bypassing,” in 1st JILP Workshop on Computer Architecture Competitions (JWAC-1): Cache Replacement Championship, Jun. 2010. [Online]. Available: https://jilp.org/jwac-1/online/papers/005_gao.pdf
[25] A. Jaleel, K. B. Theobald, S. C. Steely, and J. Emer, “High performance cache replacement using re-reference interval prediction (rrip),” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 60–71. [Online]. Available: https://doi.org/10.1145/181...
[26] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies,” in Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’99. New York, NY, U...
[27] E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The lru-k page replacement algorithm for database disk buffering,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’93. New York, NY, USA: Association for Computing Machinery, 1993, p. 297–306. [Online]. Available: https://doi.org/10.1145/170035.170081
[28] R. Subramanian, Y. Smaragdakis, and G. H. Loh, “Adaptive caches: Effective shaping of cache behavior to workloads,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), Dec 2006, pp. 385–396. [Online]. Available: https://doi.org/10.1109/MICRO.2006.7
[29]

Modified lru policies for improving second-level cache behavior,

W. Wong and J.-L. Baer, “Modified lru policies for improving second-level cache behavior,” in Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), 2000, pp. 49–60. [Online]. Available: https://doi.org/10.1109/HPCA.2000.824338

work page doi:10.1109/hpca.2000.824338 2000
[30]

Ship: signature-based hit predictor for high performance caching,

C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, and J. Emer, “Ship: signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: Association for Computing Machinery, 2011, pp. 430–441. [Online]. Available: https://doi....

work page doi:10.1145/2155620.2155671 2011
[31]

Multiperspective Reuse Prediction,

D. A. Jiménez and E. Teran, “Multiperspective Reuse Prediction,” in Proceedings of the 50th International Symposium on Microarchitecture, ser. MICRO ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 436–448. [Online]. Available: https://doi.org/10.1145/3123939.3123942

work page arXiv 2017
[32]

Perceptron learning for reuse prediction,

E. Teran, Z. Wang, and D. A. Jiménez, “Perceptron learning for reuse prediction,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12. [Online]. Available: https://doi.org/10.1109/MICRO.2016.7783705

work page doi:10.1109/micro.2016.7783705 2016
[33]

Sampling dead block prediction for last-level caches,

S. M. Khan, Y. Tian, and D. A. Jiménez, “Sampling dead block prediction for last-level caches,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 175–186. [Online]. Available: https://doi.org/10.1109/MICRO.2010.24

work page doi:10.1109/micro.2010.24 2010
[34]

Applying deep learning to the cache replacement problem,

Z. Shi, X. Huang, A. Jain, and C. Lin, “Applying deep learning to the cache replacement problem,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 413–425. [Online]. Available: https://doi.org/10.1145/3352460.3358319

work page doi:10.1145/3352460.3358319 2019
[35]

Back to the future: Leveraging belady’s algorithm for improved cache replacement,

A. Jain and C. Lin, “Back to the future: Leveraging belady’s algorithm for improved cache replacement,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 78–89. [Online]. Available: https://doi.org/10.1109/ISCA.2016.17

work page doi:10.1109/isca.2016.17 2016
[36]

Improving cache management policies using dynamic reuse distances,

N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, “Improving cache management policies using dynamic reuse distances,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. USA: IEEE Computer Society, 2012, pp. 389–400. [Online]. Available: https://doi.org/10.1109/MICRO.2012.43

work page doi:10.1109/micro.2012.43 2012
[37]

Timekeeping in the memory system: Predicting and optimizing memory behavior,

Z. Hu, S. Kaxiras, and M. Martonosi, “Timekeeping in the memory system: Predicting and optimizing memory behavior,” in Proceedings of the 29th Annual International Symposium on Computer Architecture, ser. ISCA ’02. USA: IEEE Computer Society, 2002, pp. 209–220. [Online]. Available: https://doi.org/10.1145/545214.545239

work page doi:10.1145/545214.545239 2002
[38]

Cache replacement based on reuse-distance prediction,

G. Keramidas, P. Petoumenos, and S. Kaxiras, “Cache replacement based on reuse-distance prediction,” in 2007 25th International Conference on Computer Design, 2007, pp. 245–250. [Online]. Available: https://doi.org/10.1109/ICCD.2007.4601909

work page doi:10.1109/iccd.2007.4601909 2007
[39]

Counter-based cache replacement and bypassing algorithms,

M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 433–447, 2008. [Online]. Available: https://doi.org/10.1109/TC.2007.70816

work page arXiv 2008
[40]

Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency,

H. Liu, M. Ferdman, J. Huh, and D. Burger, “Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency,” in 2008 41st IEEE/ACM International Symposium on Microarchitecture, Nov 2008, pp. 222–233. [Online]. Available: https://doi.org/10.1109/MICRO.2008.4771793

work page doi:10.1109/micro.2008.4771793 2008
[41]

Leeway: Addressing variability in dead-block prediction for last-level caches,

P. Faldu and B. Grot, “Leeway: Addressing variability in dead-block prediction for last-level caches,” in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017, pp. 180–193. [Online]. Available: https://doi.org/10.1109/PACT.2017.32

work page doi:10.1109/pact.2017.32 2017
[42]

Insertion and promotion for tree-based pseudolru last-level caches,

D. A. Jiménez, “Insertion and promotion for tree-based pseudolru last-level caches,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: Association for Computing Machinery, 2013, pp. 284–296. [Online]. Available: https://doi.org/10.1145/2540708.2540733

work page doi:10.1145/2540708.2540733 2013
[43]

The last-level branch predictor,

D. Schall, A. Sandberg, and B. Grot, “The last-level branch predictor,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 464–479. [Online]. Available: https://doi.org/10.1109/MICRO61859.2024.00042

work page doi:10.1109/micro61859.2024.00042 2024
[44]

Profiling a Warehouse-scale Computer,

S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a Warehouse-scale Computer,” in Proceedings of the 42nd International Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 158– [Online]. Available: http://doi.acm.org/10.1145/2749469.2750392

work page doi:10.1145/2749469.2750392 2015
[46]

Inside 6th-generation intel core: New microarchitecture code-named skylake,

J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-generation intel core: New microarchitecture code-named skylake,” IEEE Micro, vol. 37, no. 2, pp. 52–62, 2017. [Online]. Available: https://doi.org/10.1109/MM.2017.38

work page doi:10.1109/mm.2017.38 2017
[47]

Adapting cache partitioning algorithms to pseudo-lru replacement policies,

K. Kedzierski, M. Moreto, F. J. Cazorla, and M. Valero, “Adapting cache partitioning algorithms to pseudo-lru replacement policies,” in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010, pp. 1–12. [Online]. Available: https://doi.org/10.1109/IPDPS.2010.5470352

work page doi:10.1109/ipdps.2010.5470352 2010
[48]

Implementation of a pseudo-LRU algorithm in a partitioned cache,

W.-T. T. Chen, P. P. Liu, and K. C. Stelzer, “Implementation of a pseudo-LRU algorithm in a partitioned cache,” US Patent US7 069 390B2, Jun. 2006. [Online]. Available: https://patents.google.com/patent/US7069390B2/en

work page 2006
[49]

Context-aware set dueling for dynamic policy arbitration,

D. Patsidis and G. Vavouliotis, “Context-aware set dueling for dynamic policy arbitration,” IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 301–304, 2025. [Online]. Available: https://doi.org/10.1109/LCA.2025.3617159

work page doi:10.1109/lca.2025.3617159 2025
[50]

Processing metadata, policies, and composite tags (prefetch flag in cache tag),

Advanced Micro Devices, Inc., “Processing metadata, policies, and composite tags (prefetch flag in cache tag),” U.S. Patent US11 635 960B2, 2023, describes a metadata flag in cache tags used to track or prevent prefetching into caches. [Online]. Available: https://patents.google.com/patent/US11635960B2 [49] Intel® 64 and IA-32 Architectures Software Devel...

work page 2023
[51]

A case for (partially) tagged geometric history length branch prediction,

A. Seznec and P. Michaud, “A case for (partially) tagged geometric history length branch prediction,” Journal of Instruction-Level Parallelism, vol. 8, 2006, special issue on Branch Prediction. [Online]. Available: http://www.jilp.org/vol8/v8paper1.pdf

work page 2006
[52]

A new case for the tage branch predictor,

A. Seznec, “A new case for the tage branch predictor,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: Association for Computing Machinery, 2011, pp. 117–127. [Online]. Available: https://doi.org/10.1145/2155620.2155635

work page doi:10.1145/2155620.2155635 2011
[53]

Berti: an accurate local-delta data prefetcher,

A. Navarro-Torres, B. Panda, J. Alastruey-Benedé, P. Ibáñez, V. Viñals-Yúfera, and A. Ros, “Berti: an accurate local-delta data prefetcher,” in Proceedings of the 55th International Symposium on Microarchitecture, ser. MICRO ’22, 2022, pp. 975–991. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00072

work page doi:10.1109/micro56248.2022.00072 2022
[54]

ChampSim,

“ChampSim,” https://crc2.ece.tamu.edu/, accessed: 17-04-2024

work page 2024
[55]

The championship simulator: Architectural simulation for education and competition,

N. Gober, G. Chacon, L. Wang, P. V. Gratz, D. A. Jimenez, E. Teran, S. Pugsley, and J. Kim, “The championship simulator: Architectural simulation for education and competition,” 2022. [Online]. Available: https://arxiv.org/abs/2210.14324

work page arXiv 2022
[56]

Re-establishing fetch-directed instruction prefetching: An industry perspective,

Y. Ishii, J. Lee, K. Nathella, and D. Sunwoo, “Re-establishing fetch-directed instruction prefetching: An industry perspective,” in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021, pp. 172–182. [Online]. Available: https://doi.org/10.1109/ISPASS51385.2021.00034

work page doi:10.1109/ispass51385.2021.00034 2021
[57]

D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990

work page 1990
[58]

Cascade lake - microarchitectures - intel - WikiChip,

Cascade lake - microarchitectures - intel - WikiChip. [Online]. Available: https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#Memory_Hierarchy

work page
[59]

A two level neural approach combining off-chip prediction with adaptive prefetch filtering,

A. V. Jamet, G. Vavouliotis, D. A. Jiménez, L. Alvarez, and M. Casas, “A two level neural approach combining off-chip prediction with adaptive prefetch filtering,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 528–542. [Online]. Available: https://doi.org/10.1109/HPCA57654.2024.00046

work page doi:10.1109/hpca57654.2024.00046 2024
[60]

Instruction-aware cooperative tlb and cache replacement policies,

D. Chasapis, G. Vavouliotis, D. A. Jiménez, and M. Casas, “Instruction-aware cooperative tlb and cache replacement policies,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 619–636...

work page doi:10.1145/3669940.3707247 2025
[61]

Beyond malloc Efficiency to Fleet Efficiency: a Hugepage-aware Memory Allocator,

A. Hunter, C. Kennelly, P. Turner, D. Gove, T. Moseley, and P. Ranganathan, “Beyond malloc Efficiency to Fleet Efficiency: a Hugepage-aware Memory Allocator,” in Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’21. USENIX Association, Jul. 2021, pp. 257–273. [Online]. Available: https://www.usenix.org/confe...

work page 2021
[62]

Translation Ranger: Operating System Support for Contiguity-Aware TLBs,

Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Translation Ranger: Operating System Support for Contiguity-Aware TLBs,” in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 698–710. [Online]. Available: https://doi.org/10.1145/3307650.3322223

work page doi:10.1145/3307650.3322223 2019
[63]

Championship Value Prediction (CVP),

“Championship Value Prediction (CVP),” https://www.microarch.org/cvp1/, accessed: 17-04-2024

work page 2024
[64]

The 1st Instruction Prefetching Championship,

“The 1st Instruction Prefetching Championship,” https://research.ece.ncsu.edu/ipc/, accessed: 17-04-2024

work page 2024
[65]

Exploiting page table locality for agile tlb prefetching,

G. Vavouliotis, L. Alvarez, V. Karakostas, K. Nikas, N. Koziris, D. A. Jiménez, and M. Casas, “Exploiting page table locality for agile tlb prefetching,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 85–98. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00016

work page doi:10.1109/isca52012.2021.00016 2021
[66]

Thermometer: profile-guided btb replacement for data center applications,

S. Song, T. A. Khan, S. M. Shahri, A. Sriraman, N. K. Soundararajan, S. Subramoney, D. A. Jiménez, H. Litz, and B. Kasikci, “Thermometer: profile-guided btb replacement for data center applications,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machinery,...

work page doi:10.1145/3470496.3527430 2022
[67]

Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction,

R. Bera, K. Kanellopoulos, S. Balachandran, D. Novo, A. Olgun, M. Sadrosadat, and O. Mutlu, “Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2022, pp. 1–18. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00015

work page arXiv 2022
[68]

Practically tackling memory bottlenecks of graph-processing workloads,

A. V. Jamet, G. Vavouliotis, D. A. Jiménez, L. Alvarez, and M. Casas, “Practically tackling memory bottlenecks of graph-processing workloads,” in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024, pp. 1034–1045. [Online]. Available: https://doi.org/10.1109/IPDPS57955.2024.00096

work page doi:10.1109/ipdps57955.2024.00096 2024
[69]

Multi-program benchmark definition,

A. N. Jacobvitz, A. D. Hilton, and D. J. Sorin, “Multi-program benchmark definition,” in 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2015, pp. 72–82. [Online]. Available: https://doi.org/10.1109/ISPASS.2015.7095786

work page doi:10.1109/ispass.2015.7095786 2015
[70]

Multiperspective reuse prediction,

D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), ser. MICRO-50 ’17, IEEE. New York, NY, USA: Association for Computing Machinery, 2017, pp. 436–448. [Online]. Available: https://doi.org/10.1145/3123939.3123942

work page doi:10.1145/3123939.3123942 2017
[71]

A 64-Kbytes ITTAGE indirect branch predictor,

A. Seznec, “A 64-Kbytes ITTAGE indirect branch predictor,” in JWAC-2: Championship Branch Prediction. San Jose, United States: JILP, Jun 2011. [Online]. Available: https://inria.hal.science/hal-00639041

work page 2011
[72]

Micro-armed bandit: Lightweight & reusable reinforcement learning for microarchitecture decision-making,

G. Gerogiannis and J. Torrellas, “Micro-armed bandit: Lightweight & reusable reinforcement learning for microarchitecture decision-making,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 698–713. [Online]. Available: https://doi.org/1...

work page doi:10.1145/3613424.3623780 2023
[73]

Micro-mama: Multi-agent reinforcement learning for multicore prefetching,

C. Block, G. Gerogiannis, and J. Torrellas, “Micro-mama: Multi-agent reinforcement learning for multicore prefetching,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 884–898. [Online]. Available: https://doi.org/10.1145/3725843.3756096

work page arXiv 2025
[74]

Going the Distance for TLB Prefetching: An Application-driven Study,

G. B. Kandiraju and A. Sivasubramaniam, “Going the Distance for TLB Prefetching: An Application-driven Study,” in Proceedings of the 29th International Symposium on Computer Architecture, ser. ISCA ’02. Washington, DC, USA: IEEE Computer Society, 2002, pp. 195–206. [Online]. Available: http://dl.acm.org/citation.cfm?id=545215.545237

work page arXiv 2002
[75]

Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?

B. Pham, J. Veselý, G. H. Loh, and A. Bhattacharjee, “Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?” in Proceedings of the 48th International Symposium on Microarchitecture, ser. MICRO ’15. New York, NY, USA: ACM, 2015, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2830772.2830773

work page doi:10.1145/2830772.2830773 2015
[76]

Effective Hardware-Based Data Prefetching for High-Performance Processors,

J.-L. Baer and T.-F. Chen, “Effective Hardware-Based Data Prefetching for High-Performance Processors,” IEEE Trans. Comput., vol. 44, no. 5, pp. 609–623, May 1995. [Online]. Available: https://doi.org/10.1109/12.381947

work page 1995
[77]

Pushing the envelope on free tlb prefetching,

G. Vavouliotis, L. Alvarez, and M. Casas, “Pushing the envelope on free tlb prefetching,” Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), Tech. Rep., 2021. [Online]. Available: https://upcommons.upc.edu/entities/publication/198eaf18-44e5-4ed3-8bfa-bd2f3e96e154

work page 2021
[78]

Advanced hardware prefetching in virtual memory systems,

G. Vavouliotis, “Advanced hardware prefetching in virtual memory systems,” Ph.D. dissertation, Universitat Politècnica de Catalunya (UPC), 2023. [Online]. Available: https://upcommons.upc.edu/entities/publication/f16d637b-ad15-4f69-b830-3471b7f2fb84

work page 2023
[79]

Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB,

J. H. Ryoo, N. Gulur, S. Song, and L. K. John, “Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB,” in Proceedings of the 44th International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 469– [Online]. Available: http://doi.acm.org/10.1145/3079856.3080210

work page doi:10.1145/3079856.3080210 2017
