pith. machine review for the scientific record. sign in

arxiv: 2604.15978 · v1 · submitted 2026-04-17 · 💻 cs.PL

Recognition: unknown

jMT: Testing Correctness of Java Memory Models (Extended Version)

Heike Wehrheim, Lukas Panneke

Pith reviewed 2026-05-10 07:31 UTC · model grok-4.3

classification 💻 cs.PL
keywords Java memory modelJMMmulti-execution semanticscausality checkingexecution graphslitmus testsconcurrent Javacompiler conformance
0
0 comments X

The pith

jMT builds multi-execution semantics for Java programs from single-execution graphs via causality checking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

jMT is a tool that takes definitions of well-formed execution graphs from single-execution models and applies causality checking to identify the valid behaviors under a proposed Java memory model. This construction supports testing new JMM proposals on individual programs to detect unintuitive results or restrictions on standard optimizations. The approach also checks whether a memory model aligns with existing compilers and compilation schemes. Evaluation across 169 litmus tests produces concrete observations about current JMMs.

Core claim

jMT relies on single-execution models defining well-formed execution graphs, based on which it builds valid multi-execution semantics via causality checking. Thereby, jMT supports evaluating new proposals of Java memory models (JMMs) on a per-program basis. jMT can furthermore be employed for testing the conformance of JMMs to existing compilation schemes and compilers.

What carries the argument

Causality checking on well-formed execution graphs produced by single-execution models to derive multi-execution semantics.

If this is right

  • New JMM proposals can be checked on a per-program basis for unexpected behaviors or overly restrictive rules.
  • jMT can verify whether a given JMM permits the optimizations performed by existing compilers.
  • The tool was run on 169 litmus tests and produced specific insights into the behavior of existing JMM definitions.
  • Multi-execution validation becomes feasible without enumerating every possible execution by hand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-based causality method could be adapted to test memory models for other languages such as C++.
  • Integration with concrete compiler pipelines would allow direct comparison of model predictions against actual generated code.
  • If the check scales well, it could support automated search for minimal counterexample programs that expose model flaws.

Load-bearing premise

That causality checking on single-execution well-formed graphs is sufficient to correctly identify all and only the valid multi-execution behaviors for any proposed JMM.

What would settle it

A litmus test program where the causality check on its single-execution graphs accepts a behavior that violates the intended rules of the JMM being tested, or rejects a behavior produced by all conforming compilers.

Figures

Figures reproduced from arXiv: 2604.15978 by Heike Wehrheim, Lukas Panneke.

Figure 1
Figure 1. Figure 1: Syntax of sequential programs. knowledge) the first to formalize the cross-execution event equivalence, a key but underspecified part of JLS04’s MEM. (3) We show that JAM21 is flawed via counterexamples, including mismatches between intended and actual behavior, and a counterexample to its claimed compilation correctness to x86. (4) We iden￾tify a contradiction in Lea’s VarHandle documentation [16], the de… view at source ↗
Figure 2
Figure 2. Figure 2: An example of stages 1–3 of the algorithm, constructing the symbolic [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Justification sequence for the execution graph [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of jMT (components from others in blue, ours in green). maps to EAX of T2. For registers appearing in behavior assertions, this mapping is bijective, hence invertible. Note 1. One might notice that the behavior assertion should not matter here, but herd in its listing of allowed behaviors only outputs the values of registers which appear in the behavior assertion. Therefore, we will only obtain an… view at source ↗
Figure 5
Figure 5. Figure 5: A program (a) which when compiled to x86 (b) using the compilation [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 1
Figure 1. Figure 1 [PITH_FULL_IMAGE:figures/full_fig_p023_1.png] view at source ↗
read the original abstract

Folklore is often saying "The Java memory model is broken." Therefore, several approaches have proposed repairs, only to find new programs exhibiting unexpected, unintuitive behavior or the model forbidding standard compiler optimizations. The complexity of defining a memory model for concurrent Java lies in the fact that it requires a multi-execution model. Multi-execution models need to inspect the many potential executions of a program in order to find the valid ones. Tools automatically validating novel proposals of Java memory models are, however, largely lacking. To alleviate this problem, we introduce jMT, a novel tool for constructing multi-execution semantics for concurrent Java programs. jMT relies on single-execution models defining well-formed execution graphs, based on which it builds valid multi-execution semantics via causality checking. Thereby, jMT supports evaluating new proposals of Java memory models (JMMs) on a per-program basis. jMT can furthermore be employed for testing the conformance of JMMs to existing compilation schemes and compilers. Our evaluation of jMT on 169 litmus tests reveals a number of interesting insights into existing JMMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents jMT, a tool that constructs multi-execution semantics for concurrent Java programs by applying causality checking to well-formed execution graphs obtained from single-execution models. It claims this enables evaluating novel Java memory model proposals on a per-program basis and testing their conformance to compilation schemes and compilers, with supporting evidence from an evaluation involving 169 litmus tests that yields insights into existing JMMs.

Significance. If the proposed causality checking mechanism accurately captures the valid behaviors under multi-execution semantics, jMT would offer a valuable automated framework for memory model designers in the programming languages community. This could help mitigate the issues with current JMMs by facilitating rapid prototyping and validation against litmus tests and compiler behaviors. The scale of the evaluation (169 tests) provides a solid empirical foundation for the tool's utility.

major comments (1)
  1. [Abstract and §1] The reduction from multi-execution JMM semantics to causality checks over single-execution well-formed graphs is described at a high level (Abstract and §1) without a formal proof establishing its soundness and completeness. This is particularly concerning for the central claim, as the causality predicate must correctly identify all and only valid behaviors, including out-of-thin-air executions and causality loops, to support per-program evaluation of arbitrary new JMM proposals.
minor comments (1)
  1. [Evaluation] The results on the 169 litmus tests are described as providing 'interesting insights' but lack quantitative analysis, such as precision/recall metrics or comparison to an oracle, which would strengthen the presentation of the tool's effectiveness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for recognizing the potential utility of jMT. We address the sole major comment below.

read point-by-point responses
  1. Referee: [Abstract and §1] The reduction from multi-execution JMM semantics to causality checks over single-execution well-formed graphs is described at a high level (Abstract and §1) without a formal proof establishing its soundness and completeness. This is particularly concerning for the central claim, as the causality predicate must correctly identify all and only valid behaviors, including out-of-thin-air executions and causality loops, to support per-program evaluation of arbitrary new JMM proposals.

    Authors: We agree that a formal proof of soundness and completeness for the causality-checking reduction is desirable to fully support the central claim. Section 3 of the manuscript already gives a precise definition of the causality predicate over well-formed single-execution graphs and explains how it rejects causality loops and out-of-thin-air behaviors. Nevertheless, the original submission presents the reduction at a descriptive level without an explicit proof. In the revised version we will add a proof sketch (as a new subsection or appendix) that argues: (i) every multi-execution behavior accepted by the predicate is derivable from some single-execution model, and (ii) every behavior rejected by the predicate either violates well-formedness or introduces an invalid causal dependency. This will directly address the referee's concern for arbitrary new JMM proposals. revision: yes

Circularity Check

0 steps flagged

No circularity detected in jMT's construction of multi-execution semantics

full rationale

The paper presents jMT as a tool that takes single-execution well-formed graphs (defined by existing models) and applies causality checking to produce multi-execution semantics. No equations, definitions, or steps are shown that reduce the claimed output to the inputs by construction, rename fitted parameters as predictions, or rely on load-bearing self-citations whose validity depends on the present work. The derivation is self-contained against external benchmarks of execution graphs and causality, with the 169-litmus evaluation serving as empirical illustration rather than a tautological fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that single-execution models of well-formed graphs plus causality checking suffice to define valid multi-execution semantics; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Single-execution models define well-formed execution graphs
    Invoked as the base input for building multi-execution semantics via causality checking.

pith-pipeline@v0.9.0 · 5491 in / 1227 out tokens · 39575 ms · 2026-05-10T07:31:57.111496+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 14 canonical work pages

  1. [1]

    ACM Trans

    Alglave, J., Maranget, L., Tautschnig, M.: Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst. 36(2), 7:1–7:74 (2014).https://doi.org/10.1145/2627752

  2. [2]

    In: Schneider, K., Brandt, J

    Aspinall, D., ˇSevˇ c´ ık, J.: Formalising Java’s data race free guarantee. In: Schneider, K., Brandt, J. (eds.) TPHOL’07. pp. 22–37. LNCS 4732, Springer (2007).https: //doi.org/10.1007/978-3-540-74591-4_4

  3. [3]

    In: ESOP

    Batty, M., Memarian, K., Nienhuis, K., Pichon-Pharabod, J., Sewell, P.: The prob- lem of programming language concurrency semantics. In: ESOP. Lecture Notes in Computer Science, vol. 9032, pp. 283–307. Springer (2015)

  4. [4]

    Bender, J., Palsberg, J.: A formalization of Java’s concurrent access modes. Proc. ACM Program. Lang.3(OOPSLA), 142:1–142:28 (2019)

  5. [5]

    In: ESOP

    Cenciarelli, P., Knapp, A., Sibilio, E.: The Java memory model: Operationally, denotationally, axiomatically. In: ESOP. Lecture Notes in Computer Science, vol. 4421, pp. 331–346. Springer (2007)

  6. [6]

    In: Dillig, I., Tasiran, S

    Gavrilenko, N., de Le´ on, H.P., Furbach, F., Heljanko, K., Meyer, R.: BMC for weak memory models: Relation analysis for compact SMT encodings. In: Dillig, I., Tasiran, S. (eds.) CAV’19’. pp. 355–365. LNCS 11561, Springer (2019).https: //doi.org/10.1007/978-3-030-25540-4_19

  7. [7]

    Geeson, L., Smith, L.: Compiler testing with relaxed memory models. In: CGO. pp. 334–348. IEEE (2024)

  8. [8]

    Gosling, J., Joy, B., Steele, G.: The Java language specification (1996)

  9. [9]

    threads and locks - 17.4

    Gosling, J., Joy, B., Steele, G., Bracha, G., Buckley, A., Smith, D., Bierman, G.: Java language specification - Chapter 17. threads and locks - 17.4. memory model,https://docs.oracle.com/javase/specs/jls/se24/html/jls-17.html# jls-17.4

  10. [10]

    Jeffrey, A., Riely, J., Batty, M., Cooksey, S., Kaysin, I., Podkopaev, A.: The leaky semicolon: compositional semantic dependencies for relaxed-memory concurrency. Proc. ACM Program. Lang.6(POPL), 1–30 (2022).https://doi.org/10.1145/ 3498716

  11. [11]

    In: Bouyer, P., van de Pol, J

    Jorshari, M.H.K., Kokologiannakis, M., Majumdar, R., Nagendra, S.: Optimal con- colic dynamic partial order reduction. In: Bouyer, P., van de Pol, J. (eds.) CON- CUR’25. LIPIcs, vol. 348, pp. 26:1–26:22. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2025).https://doi.org/10.4230/LIPICS.CONCUR.2025.26

  12. [12]

    In: Silva, A., Leino, K.R.M

    Kokologiannakis, M., Vafeiadis, V.: Genmc: A model checker for weak memory models. In: Silva, A., Leino, K.R.M. (eds.) CAV’21. pp. 427–440. LNCS 12759, Springer (2021).https://doi.org/10.1007/978-3-030-81685-8_20

  13. [13]

    Lahav, O.: A case against semantic dependencies. In: The Future of Weak Memory 2024.https://popl24.sigplan.org/details/fowm-2024-papers/ 8/A-case-against-semantic-dependencies, recording of talk avaliable athttps: //youtu.be/7VZtwIvZ6qE

  14. [14]

    Millstein, and David Walker

    Lahav, O., Boker, U.: Decidable verification under a causally consistent shared memory. In: Donaldson, A.F., Torlak, E. (eds.) PLDI’20. pp. 211–226. ACM (2020). https://doi.org/10.1145/3385412.3385966

  15. [15]

    PLDI 2017 , year =

    Lahav, O., Vafeiadis, V., Kang, J., Hur, C., Dreyer, D.: Repairing sequential con- sistency in C/C++11. In: Cohen, A., Vechev, M.T. (eds.) PLDI’17. pp. 618–632. ACM (2017).https://doi.org/10.1145/3062341.3062352

  16. [16]

    Lea, D.: Using JDK 9 memory order modes,https://gee.cs.oswego.edu/dl/ html/j9mm.html jMT: Testing Correctness of Java Memory Models 19

  17. [17]

    In: ECOOP

    Liu, S., Bender, J., Palsberg, J.: Compiling volatile correctly in Java. In: ECOOP. LIPIcs, vol. 222, pp. 6:1–6:26. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2022)

  18. [18]

    In: Ali, K., Vitek, J

    Liu, S., Bender, J., Palsberg, J.: Compiling volatile correctly in Java. In: Ali, K., Vitek, J. (eds.) ECOOP’22. pp. 6:1–26. LIPIcs 222, Schloss Dagstuhl - Leibniz- Zentrum f¨ ur Informatik (2022).https://doi.org/10.4230/LIPICS.ECOOP.2022.6

  19. [19]

    blogspot.com/2007/03/volatile-fields-and-synchronization.html

    Manson, J.: Volatile fields and synchronization,https://jeremymanson. blogspot.com/2007/03/volatile-fields-and-synchronization.html

  20. [20]

    Manson, J.: The Java Memory Model. Ph.D. thesis, University of Maryland, Col- lege Park, MD, USA (2004),https://hdl.handle.net/1903/1949

  21. [21]

    In: FTfJP’02 (2002)

    Manson, J., Pugh, W.: The Java memory model simulator. In: FTfJP’02 (2002)

  22. [22]

    Manson, J., Pugh, W., Adve, S.: The Java memory modelhttps: //web.archive.org/web/20060303190317/http://www.cs.umd.edu/users/ jmanson/java/journal.pdf, original URL:http://www.cs.umd.edu/users/ jmanson/java/journal.pdf, archived 2006-03-03

  23. [23]

    In: POPL

    Manson, J., Pugh, W.W., Adve, S.V.: The Java memory model. In: POPL. pp. 378–391. ACM (2005)

  24. [24]

    Moiseenko, E., Kokologiannakis, M., Vafeiadis, V.: Model checking for a multi- execution memory model. Proc. ACM Program. Lang.6(OOPSLA2), 758–785 (2022).https://doi.org/10.1145/3563315

  25. [25]

    Panneke, L., Wehrheim, H.: jMT: Testing correctness of Java memory models (ar- tifact) (Oct 2025).https://doi.org/10.5281/zenodo.17634679

  26. [26]

    In: M¨ uller, P

    Paviotti, M., Cooksey, S., Paradis, A., Wright, D., Owens, S., Batty, M.: Mod- ular relaxed dependencies in weak memory concurrency. In: M¨ uller, P. (ed.) ESOP20. pp. 599–625. LNCS 12075, Springer (2020).https://doi.org/10.1007/ 978-3-030-44914-8_22

  27. [27]

    In: Boyland, J.T

    Petri, G., Vitek, J., Jagannathan, S.: Cooking the books: Formalizing JMM imple- mentation recipes. In: Boyland, J.T. (ed.) ECOOP’15. LIPIcs, vol. 37, pp. 445–469. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2015).https://doi.org/10. 4230/LIPICS.ECOOP.2015.445

  28. [28]

    Pugh, W.: Causality test cases,https://www.cs.umd.edu/ ~pugh/java/ memoryModel/CausalityTestCases.html

  29. [29]

    In: Java Grande

    Pugh, W.W.: Fixing the Java memory model. In: Java Grande. pp. 89–98. ACM (1999)

  30. [30]

    Richards, J., Wright, D., Cooksey, S., Batty, M.: Symbolic MRD: Dynamic memory, undefined behaviour, and extrinsic choice. Proc. ACM Program. Lang. 9(OOPSLA1), 1858–1882 (2025).https://doi.org/10.1145/3721089

  31. [31]

    In: Vitek, J

    ˇSevˇ c´ ık, J., Aspinall, D.: On validity of program transformations in the Java memory model. In: Vitek, J. (ed.) ECOOP’08. pp. 27–51. LNCS 5142, Springer (2008). https://doi.org/10.1007/978-3-540-70592-5_3

  32. [32]

    net/blog/2016/close-encounters-of-jmm-kind/

    Shipil¨ ev, A.: Close encounters of the Java memory model kind,https://shipilev. net/blog/2016/close-encounters-of-jmm-kind/

  33. [33]

    com/openjdk/jcstress/

    Shipil¨ ev, A., et al.: Java concurrency stress (jcstress) (2013),https://github. com/openjdk/jcstress/

  34. [34]

    Java Language S p e c i f i c a t i o n

    Trippel, C., Manerkar, Y.A., Lustig, D., Pellauer, M., Martonosi, M.: TriCheck: Memory model verification at the trisection of software, hardware, and ISA. In: Chen, Y., Temam, O., Carter, J. (eds.) ASPLOS’17. pp. 119–133. ACM (2017). https://doi.org/10.1145/3037697.3037719 20 L. Panneke and H. Wehrheim A The Java language specification’s SEM 1" Java Lang...