pith. machine review for the scientific record.

arxiv: 2605.01419 · v1 · submitted 2026-05-02 · 💻 cs.AR

Recognition: unknown

Understanding Simulated Architecture via gem5 Call-Stack Profiling

Johan Söderström (1), Rashid Aligholipour (1), Yuan Yao (1) ((1) Uppsala University)


Pith reviewed 2026-05-09 18:10 UTC · model grok-4.3

classification 💻 cs.AR
keywords gem5 · call-stack profiling · architectural simulation · CPU models · cache coherence · perf_event · simulation analysis · hierarchical call-tree

The pith

Call-stack profiling of gem5 directly reflects simulated system activity and uncovers behaviors missed by standard statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that examining the call stacks within the gem5 simulator itself provides a direct window into the inner workings of the simulated computer architecture. Conventional gem5 statistics offer only an indirect and limited view, often missing key details about inefficiencies or problematic behaviors in the models. By introducing a lightweight profiling tool that runs as a separate process and samples the simulator's call stacks via Linux's perf_event interface, the authors demonstrate how this approach can reveal specific issues, such as unexpected slowdowns in simple CPU models and hard-to-detect problems in memory coherence protocols. The method supports both broad structural analysis and targeted examination of individual components without modifying the simulator code.

Core claim

Call-stack profiling of gem5 itself offers a powerful yet underutilized perspective: the simulator's own call-stack directly reflects the activity of the simulated system, exposing insights that conventional statistics may overlook. The profiling framework samples gem5's runtime call-stacks, resolves symbols on the fly, and merges them into a hierarchical call-tree for analysis of CPU models and the Ruby memory system.

What carries the argument

A specialized lightweight profiling framework using Linux's perf_event interface to sample gem5's call-stacks in a separate process, resolve symbols, and build a hierarchical call-tree representation.
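The merging step described here can be sketched in a few lines. This is an illustration of the general folded-stack technique, not the authors' implementation; the frame names and sample counts below are invented for the sketch.

```python
# Merge sampled call-stacks into a hierarchical call-tree.
# Each input line uses the "folded" convention: "frame0;frame1;... count".

class CallNode:
    def __init__(self, name):
        self.name = name
        self.samples = 0      # samples falling anywhere in this subtree
        self.children = {}    # frame name -> CallNode

    def merge(self, frames, count):
        self.samples += count
        if frames:
            head, rest = frames[0], frames[1:]
            child = self.children.setdefault(head, CallNode(head))
            child.merge(rest, count)

def build_tree(folded_lines):
    root = CallNode("gem5")
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        root.merge(stack.split(";"), int(count))
    return root

# Hypothetical samples: frame names only mimic gem5's style.
samples = [
    "main;simulate;EventQueue::serviceOne;TimingSimpleCPU::tick 40",
    "main;simulate;EventQueue::serviceOne;TimingSimpleCPU::tick;sendTimingReq 25",
    "main;simulate;EventQueue::serviceOne;RubyPort::MemResponsePort::hitCallback 35",
]
tree = build_tree(samples)
```

Because every sample is accumulated along its whole path, the tree supports both the high-level structural view (totals near the root) and focused, component-specific views (subtrees under a chosen frame) that the framework advertises.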

If this is right

  • TimingSimpleCPU proves inefficient due to its lockup-cache model and does not run simulations faster than a full out-of-order core.
  • Cache coherence protocol deadlocks and livelocks become straightforward to detect, even when the simulation appears normal or ends abruptly.
  • Architectural insights into complex systems like integrated CPU and memory models become more accessible through hierarchical views of simulator activity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar call-stack profiling could be adapted to other architecture simulators to gain comparable internal visibility.
  • Developers might use these profiles to optimize simulator performance itself by identifying hot paths in the code.
  • This approach could complement traditional tracing methods in hardware-software co-design studies.

Load-bearing premise

That sampling gem5's call-stacks provides an accurate and undistorted reflection of simulated system activity without meaningful interference from the simulator's layered design or the profiling overhead itself.
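Why this premise needs checking can be made concrete: a time-based sampler weights each event handler by host time consumed, not by how often its event fires, so sample counts track handler duration rather than event frequency. A toy model (handler names, call counts, and timings all invented):

```python
# Expected sample counts under uniform time-based sampling:
# a handler's expected samples = (total host time in it) / sample period.

def expected_samples(handlers, sample_period_us):
    """handlers: list of (name, calls, microseconds_per_call)."""
    return {
        name: calls * us_per_call / sample_period_us
        for name, calls, us_per_call in handlers
    }

# A rare but slow handler (think a blocking-cache stall path) dominates
# the profile even though it fires 100x less often than the fast one.
profile = expected_samples(
    [("frequent_fast", 100_000, 1), ("rare_slow", 1_000, 500)],
    sample_period_us=100,
)
```

This is exactly the property that makes the profile informative about where host time goes, and exactly why sample shares cannot be read directly as simulated-event frequencies without the validation the referee asks for.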

What would settle it

A direct comparison of call-stack samples against known simulation events: if the samples fail to align with those events, or fail to identify the cache-coherence deadlocks described in the case studies, the core claim falls.

Figures

Figures reproduced from arXiv:2605.01419 by Johan Söderström, Rashid Aligholipour, and Yuan Yao (Uppsala University).

Figure 1. Committed instructions per host-machine-second for …

Figure 2. gem5 call-stack depth across CPU models (with …

Figure 3. Interaction of AtomicSimpleCPU and Ruby, where the execution flows from a core into Ruby as a function call (no timing modeled). AS-CPU uses a single entry function, the tick function, to drive simulation. On each call, tick sequentially advances all core stages, one after the other: instruction fetch, issuing a memory request to Ruby to fetch the instruction from the I-cache; pre-execute, decoding the instr…

Figure 4. Interaction of TimingSimpleCPU and Ruby, where the execution flow between a core and Ruby is decoupled by the Ruby EventQueue.

Figure 7. Call-stack merging and flexible view-control in the call …

Figure 9. TS-CPU runtime breakdown results.

Figure 10. TS-CPU runtime breakdowns for L1/L2 cache.

Figure 12. O3-CPU fetch and L1 controller runtime.

Figure 13. O3-CPU L1 runtime under deadlock.
Original abstract

Understanding the behavior of simulated architectures in gem5 is critical for studying complex, deeply integrated computing systems. However, conventional analysis methods provide only an indirect view of the simulated system internals. In this work, we show that call-stack profiling of gem5 itself offers a powerful yet underutilized perspective: the simulator's own call-stack directly reflects the activity of the simulated system, exposing insights that conventional statistics may overlook. Profiling gem5's call-stacks is challenging due to its highly layered and complex software design patterns. To address this, we introduce a specialized, lightweight profiling framework built on Linux's perf_event interface which samples gem5's runtime call-stacks throughout the simulation, resolves symbols on the fly, and merges samples into a hierarchical call-tree representation supporting both high-level structural views and focused, user-defined, component-specific analysis. Moreover, all profiling is performed in a separate process running alongside the main gem5 process, avoiding intrusive changes and overheads to the simulation itself. We apply our framework to gem5's three major CPU models -- AtomicSimpleCPU, TimingSimpleCPU, and O3CPU -- together with the Ruby memory system, and uncover behaviors that are not easily observable in conventional gem5 statistics. Our case studies reveal, for example, that TimingSimpleCPU is inefficient due to its use of a lockup-cache model and, despite its conceptual simplicity, does not simulate faster than a full out-of-order core. In addition, our tool makes it straightforward to detect cache coherence protocol deadlock and livelock -- issues that are otherwise difficult to identify, since the simulation either appears to run normally or terminates abruptly, making it hard to pinpoint when these conditions occur.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a lightweight, separate-process call-stack profiling framework for gem5 based on Linux perf_event. It samples gem5's runtime call-stacks, resolves symbols, and builds hierarchical call-trees to analyze the simulator's execution. The authors apply the tool to AtomicSimpleCPU, TimingSimpleCPU, O3CPU, and the Ruby memory system, claiming it reveals non-obvious behaviors such as TimingSimpleCPU's inefficiency from its lockup-cache model (despite conceptual simplicity) and facilitates detection of cache-coherence deadlocks or livelocks that are hard to spot with standard gem5 statistics.

Significance. If the mapping from gem5 call-stacks to simulated-component activity can be shown to be reliable and low-distortion, the framework would supply a useful complementary diagnostic for gem5 users working on complex CPU-memory interactions. The non-intrusive, separate-process design is a clear engineering strength that avoids modifying the simulator core.

major comments (2)
  1. [Abstract] Abstract and §3 (framework description): the central claim that 'the simulator's own call-stack directly reflects the activity of the simulated system' is not accompanied by any quantitative validation or error analysis. Because gem5 is a discrete-event simulator, sampled stacks capture the current event handler plus its C++ call chain; the paper does not measure or bound the distortion introduced by event-queue ordering, virtual dispatch, or long-running handlers (e.g., lockup-cache stalls in TimingSimpleCPU).
  2. [Case Studies] Case-study sections (CPU models and Ruby): the reported observations (TimingSimpleCPU slower than O3CPU, deadlock detection) are presented qualitatively with no numerical data, no comparison against conventional gem5 stats, and no verification that the call-tree hotspots correspond to the claimed simulated-component activity. Without such evidence the claim that the method 'exposes insights that conventional statistics may overlook' remains unsupported.
minor comments (2)
  1. [Framework] The manuscript would benefit from an explicit description of the call-tree merging algorithm and any sampling-rate or symbol-resolution overhead measurements.
  2. [Figures] Figure captions and axis labels in the call-tree visualizations should state the sampling interval and total number of samples collected so readers can assess statistical significance.
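The statistical point in the second minor comment can be made concrete: once the sampling interval and total sample count are reported, a reader can attach a confidence interval to any component's estimated runtime share. A minimal sketch using the normal-approximation binomial interval (sample counts are illustrative, not from the paper):

```python
import math

def share_with_ci(component_samples, total_samples, z=1.96):
    """Estimated runtime share of a component and the half-width of its
    ~95% confidence interval, under independent time-based sampling."""
    p = component_samples / total_samples
    half_width = z * math.sqrt(p * (1 - p) / total_samples)
    return p, half_width

# If 3,000 of 10,000 samples fall in (say) the Ruby subtree, the share
# is 30% with roughly +/- 0.9% uncertainty.
p, hw = share_with_ci(3_000, 10_000)
```

The half-width shrinks with the square root of the sample count, which is why captions should state both the sampling interval and the total number of samples.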

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the non-intrusive design of the profiling framework. We address each major comment below and will revise the manuscript to incorporate quantitative validation and supporting data.

Point-by-point responses
  1. Referee: [Abstract] Abstract and §3 (framework description): the central claim that 'the simulator's own call-stack directly reflects the activity of the simulated system' is not accompanied by any quantitative validation or error analysis. Because gem5 is a discrete-event simulator, sampled stacks capture the current event handler plus its C++ call chain; the paper does not measure or bound the distortion introduced by event-queue ordering, virtual dispatch, or long-running handlers (e.g., lockup-cache stalls in TimingSimpleCPU).

    Authors: We agree that the manuscript would benefit from explicit quantitative validation of the mapping. The central claim follows from gem5's discrete-event architecture, in which the active call-stack at each sample corresponds to the handler of the currently processed event and thereby to the simulated component's activity. We acknowledge that event-queue ordering, virtual dispatch, and handler duration can introduce indirection. In the revised manuscript we will add a dedicated subsection to §3 that discusses these potential sources of distortion and supplies empirical bounds derived from our existing profiling runs, including direct comparisons of stack-sample frequencies against gem5's internal event counters and targeted measurements for the lockup-cache stalls in TimingSimpleCPU. revision: yes

  2. Referee: [Case Studies] Case-study sections (CPU models and Ruby): the reported observations (TimingSimpleCPU slower than O3CPU, deadlock detection) are presented qualitatively with no numerical data, no comparison against conventional gem5 stats, and no verification that the call-tree hotspots correspond to the claimed simulated-component activity. Without such evidence the claim that the method 'exposes insights that conventional statistics may overlook' remains unsupported.

    Authors: The case studies were written to illustrate diagnostic capabilities through representative call-tree visualizations. We did record numerical data (simulation wall-clock times, per-component sample counts, and stack distributions) during the experiments. To strengthen the presentation we will expand the CPU-model and Ruby sections with quantitative tables that (a) report simulation performance metrics across AtomicSimpleCPU, TimingSimpleCPU, and O3CPU, (b) compare call-tree hotspot frequencies against conventional gem5 statistics, and (c) verify that the dominant lockup-cache activity in TimingSimpleCPU is not captured by standard stats. For the cache-coherence deadlock example we will add a step-by-step trace with timing information showing how the call-tree identifies the livelock condition. revision: yes
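A coarse version of the cross-check promised in response (b) might compare per-component sample shares against shares derived from gem5's internal event counters. Exact agreement is not expected, since time-based samples weight handlers by duration rather than frequency, but large gaps would flag distortion worth explaining. A sketch with hypothetical component names and counts:

```python
# Compare profiler sample shares with gem5 event-counter shares and
# report the largest per-component disagreement.

def relative_shares(counts):
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

def max_share_gap(sample_counts, event_counts):
    s = relative_shares(sample_counts)
    e = relative_shares(event_counts)
    return max(abs(s[name] - e[name]) for name in s)

# Hypothetical data: 1,000 stack samples vs. 100,000 gem5 events.
samples = {"icache": 480, "dcache": 320, "ruby_net": 200}
events  = {"icache": 50_000, "dcache": 33_000, "ruby_net": 17_000}
gap = max_share_gap(samples, events)
```

A table of such gaps per component, alongside the raw counts, would directly support the claim that call-tree hotspots track simulated-component activity.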

Circularity Check

0 steps flagged

No circularity: tool-building paper with no derivations or self-referential reductions

full rationale

The manuscript presents a profiling framework for gem5 and applies it to case studies on CPU models and Ruby. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described content. Claims rest on direct observation of call-stack samples and conventional statistics comparison, without any step that reduces by construction to its own inputs or prior self-citations. The work is self-contained as an engineering description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that gem5 call-stacks faithfully mirror simulated behavior; no free parameters, invented entities, or additional axioms are introduced beyond standard Linux profiling interfaces.

axioms (1)
  • domain assumption: gem5's call-stack directly reflects the activity of the simulated system
    Core premise stated in the abstract as the basis for the profiling approach.

pith-pipeline@v0.9.0 · 5621 in / 1181 out tokens · 62891 ms · 2026-05-09T18:10:47.219802+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages

  1. [1] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 Simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

  2. [2] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj et al., “The gem5 Simulator: Version 20.0+,” arXiv preprint arXiv:2007.03152, 2020.

  3. [3] J. Umeike, N. Patel, A. Manley, A. Mamandipoor, H. Yun, and M. Alian, “Profiling gem5 Simulator,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 103–113.

  4. [4] J. M. Cebrián González, A. Barredo, H. Caminal, M. Moretó, M. Casas, and M. Valero, “Semi-Automatic Validation of Cycle-Accurate Simulation Infrastructures: The Case for gem5-x86,” Future Generation Computer Systems, vol. 112, pp. 832–847, 2020.

  5. [5] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: A Call Graph Execution Profiler,” in SIGPLAN Symposium on Compiler Construction, 1982, pp. 120–126.

  6. [6] Intel, “Intel VTune Profiler,” https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html, accessed: 2025-11-25.

  7. [7] A. Alameldeen and D. Wood, “IPC Considered Harmful for Multiprocessor Workloads,” IEEE Micro, vol. 26, no. 4, pp. 8–17, 2006.

  8. [8] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, “GARNET: A Detailed On-Chip Network Model Inside a Full-System Simulator,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 33–42.

  9. [9] S. Beamer, K. Asanović, and D. Patterson, “The GAP Benchmark Suite,” arXiv preprint arXiv:1508.03619, 2015.

  10. [10] X. Zhan, Y. Bao, C. Bienia, and K. Li, “PARSEC 3.0: A Multicore Benchmark Suite with Network Stacks and SPLASH-2X,” ACM SIGARCH Computer Architecture News, vol. 44, no. 5, pp. 1–16, 2017.

  11. [11] SPEC, “SPEC CPU2017 Documentation,” 2017, accessed: 2024-10-29. [Online]. Available: https://www.spec.org/cpu2017/Docs/

  12. [12] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Characterizing Large Scale Program Behavior,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002, pp. 45–57.

  13. [13] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s General Execution-Driven Multiprocessor Simulator (GEMS) Toolset,” SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, 2005.

  14. [14] M. Martin, M. Hill, and D. Wood, “Token Coherence: Decoupling Performance and Correctness,” in International Symposium on Computer Architecture (ISCA), 2003, pp. 182–193.

  18. [18] “Anatomy of the gem5 Simulator: AtomicSimpleCPU, TimingSimpleCPU, O3CPU, and Their Interaction with the Ruby Memory System.”