arxiv: 2604.15978 · v1 · submitted 2026-04-17 · 💻 cs.PL

Recognition: unknown

jMT: Testing Correctness of Java Memory Models (Extended Version)

Heike Wehrheim, Lukas Panneke

Pith reviewed 2026-05-10 07:31 UTC · model grok-4.3

classification 💻 cs.PL

keywords Java memory modelJMMmulti-execution semanticscausality checkingexecution graphslitmus testsconcurrent Javacompiler conformance

0 comments

The pith

jMT builds multi-execution semantics for Java programs from single-execution graphs via causality checking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

jMT is a tool that takes definitions of well-formed execution graphs from single-execution models and applies causality checking to identify the valid behaviors under a proposed Java memory model. This construction supports testing new JMM proposals on individual programs to detect unintuitive results or restrictions on standard optimizations. The approach also checks whether a memory model aligns with existing compilers and compilation schemes. Evaluation across 169 litmus tests produces concrete observations about current JMMs.

Core claim

jMT relies on single-execution models defining well-formed execution graphs, based on which it builds valid multi-execution semantics via causality checking. Thereby, jMT supports evaluating new proposals of Java memory models (JMMs) on a per-program basis. jMT can furthermore be employed for testing the conformance of JMMs to existing compilation schemes and compilers.

What carries the argument

Causality checking on well-formed execution graphs produced by single-execution models to derive multi-execution semantics.

If this is right

New JMM proposals can be checked on a per-program basis for unexpected behaviors or overly restrictive rules.
jMT can verify whether a given JMM permits the optimizations performed by existing compilers.
The tool was run on 169 litmus tests and produced specific insights into the behavior of existing JMM definitions.
Multi-execution validation becomes feasible without enumerating every possible execution by hand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-based causality method could be adapted to test memory models for other languages such as C++.
Integration with concrete compiler pipelines would allow direct comparison of model predictions against actual generated code.
If the check scales well, it could support automated search for minimal counterexample programs that expose model flaws.

Load-bearing premise

That causality checking on single-execution well-formed graphs is sufficient to correctly identify all and only the valid multi-execution behaviors for any proposed JMM.

What would settle it

A litmus test program where the causality check on its single-execution graphs accepts a behavior that violates the intended rules of the JMM being tested, or rejects a behavior produced by all conforming compilers.

Figures

Figures reproduced from arXiv: 2604.15978 by Heike Wehrheim, Lukas Panneke.

**Figure 1.** Figure 1: Syntax of sequential programs. knowledge) the first to formalize the cross-execution event equivalence, a key but underspecified part of JLS04’s MEM. (3) We show that JAM21 is flawed via counterexamples, including mismatches between intended and actual behavior, and a counterexample to its claimed compilation correctness to x86. (4) We identify a contradiction in Lea’s VarHandle documentation [16], the de… view at source ↗

**Figure 2.** Figure 2: An example of stages 1–3 of the algorithm, constructing the symbolic [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Justification sequence for the execution graph [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of jMT (components from others in blue, ours in green). maps to EAX of T2. For registers appearing in behavior assertions, this mapping is bijective, hence invertible. Note 1. One might notice that the behavior assertion should not matter here, but herd in its listing of allowed behaviors only outputs the values of registers which appear in the behavior assertion. Therefore, we will only obtain an… view at source ↗

**Figure 5.** Figure 5: A program (a) which when compiled to x86 (b) using the compilation [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 1.** Figure 1 [PITH_FULL_IMAGE:figures/full_fig_p023_1.png] view at source ↗

read the original abstract

Folklore is often saying "The Java memory model is broken." Therefore, several approaches have proposed repairs, only to find new programs exhibiting unexpected, unintuitive behavior or the model forbidding standard compiler optimizations. The complexity of defining a memory model for concurrent Java lies in the fact that it requires a multi-execution model. Multi-execution models need to inspect the many potential executions of a program in order to find the valid ones. Tools automatically validating novel proposals of Java memory models are, however, largely lacking. To alleviate this problem, we introduce jMT, a novel tool for constructing multi-execution semantics for concurrent Java programs. jMT relies on single-execution models defining well-formed execution graphs, based on which it builds valid multi-execution semantics via causality checking. Thereby, jMT supports evaluating new proposals of Java memory models (JMMs) on a per-program basis. jMT can furthermore be employed for testing the conformance of JMMs to existing compilation schemes and compilers. Our evaluation of jMT on 169 litmus tests reveals a number of interesting insights into existing JMMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

jMT turns single-execution well-formed graphs into multi-execution JMM semantics via causality checks, giving a practical per-program validator, though the reduction's soundness still needs concrete evidence.

read the letter

The paper introduces jMT as a tool that starts from single-execution models of well-formed graphs and applies causality checking to decide which multi-execution behaviors count as valid under a proposed Java memory model. This setup lets you test a new JMM proposal against individual programs and also check whether the model matches what existing compilers actually allow. That combination is the main new piece; prior work on JMMs has mostly been manual or limited to fixed models rather than a reusable construction for arbitrary proposals. The evaluation on 169 litmus tests is a solid start and apparently surfaces some concrete differences among existing models, which is useful for the subfield. The approach avoids circular definitions and sticks to standard execution-graph ideas, so the basic framing looks clean. The soft spot is that the central step—whether causality checking over single executions really captures exactly the allowed multi-execution behaviors, including out-of-thin-air cases and reorderings—rests on a high-level description without a shown equivalence argument, counter-example search, or quantified error analysis against an oracle. The litmus results are summarized as “interesting insights” rather than broken down by false-positive or false-negative rates. This is aimed at researchers and compiler people who work on memory-model proposals and need a way to run quick per-program checks. A reader already familiar with litmus testing and causality notions will get the most out of it. The work is coherent enough on its own terms to deserve a serious referee who can look at the full formal details and any released code.

Referee Report

1 major / 1 minor

Summary. The manuscript presents jMT, a tool that constructs multi-execution semantics for concurrent Java programs by applying causality checking to well-formed execution graphs obtained from single-execution models. It claims this enables evaluating novel Java memory model proposals on a per-program basis and testing their conformance to compilation schemes and compilers, with supporting evidence from an evaluation involving 169 litmus tests that yields insights into existing JMMs.

Significance. If the proposed causality checking mechanism accurately captures the valid behaviors under multi-execution semantics, jMT would offer a valuable automated framework for memory model designers in the programming languages community. This could help mitigate the issues with current JMMs by facilitating rapid prototyping and validation against litmus tests and compiler behaviors. The scale of the evaluation (169 tests) provides a solid empirical foundation for the tool's utility.

major comments (1)

[Abstract and §1] The reduction from multi-execution JMM semantics to causality checks over single-execution well-formed graphs is described at a high level (Abstract and §1) without a formal proof establishing its soundness and completeness. This is particularly concerning for the central claim, as the causality predicate must correctly identify all and only valid behaviors, including out-of-thin-air executions and causality loops, to support per-program evaluation of arbitrary new JMM proposals.

minor comments (1)

[Evaluation] The results on the 169 litmus tests are described as providing 'interesting insights' but lack quantitative analysis, such as precision/recall metrics or comparison to an oracle, which would strengthen the presentation of the tool's effectiveness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for recognizing the potential utility of jMT. We address the sole major comment below.

read point-by-point responses

Referee: [Abstract and §1] The reduction from multi-execution JMM semantics to causality checks over single-execution well-formed graphs is described at a high level (Abstract and §1) without a formal proof establishing its soundness and completeness. This is particularly concerning for the central claim, as the causality predicate must correctly identify all and only valid behaviors, including out-of-thin-air executions and causality loops, to support per-program evaluation of arbitrary new JMM proposals.

Authors: We agree that a formal proof of soundness and completeness for the causality-checking reduction is desirable to fully support the central claim. Section 3 of the manuscript already gives a precise definition of the causality predicate over well-formed single-execution graphs and explains how it rejects causality loops and out-of-thin-air behaviors. Nevertheless, the original submission presents the reduction at a descriptive level without an explicit proof. In the revised version we will add a proof sketch (as a new subsection or appendix) that argues: (i) every multi-execution behavior accepted by the predicate is derivable from some single-execution model, and (ii) every behavior rejected by the predicate either violates well-formedness or introduces an invalid causal dependency. This will directly address the referee's concern for arbitrary new JMM proposals. revision: yes

Circularity Check

0 steps flagged

No circularity detected in jMT's construction of multi-execution semantics

full rationale

The paper presents jMT as a tool that takes single-execution well-formed graphs (defined by existing models) and applies causality checking to produce multi-execution semantics. No equations, definitions, or steps are shown that reduce the claimed output to the inputs by construction, rename fitted parameters as predictions, or rely on load-bearing self-citations whose validity depends on the present work. The derivation is self-contained against external benchmarks of execution graphs and causality, with the 169-litmus evaluation serving as empirical illustration rather than a tautological fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that single-execution models of well-formed graphs plus causality checking suffice to define valid multi-execution semantics; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Single-execution models define well-formed execution graphs
Invoked as the base input for building multi-execution semantics via causality checking.

pith-pipeline@v0.9.0 · 5491 in / 1227 out tokens · 39575 ms · 2026-05-10T07:31:57.111496+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 14 canonical work pages

[1]

ACM Trans

Alglave, J., Maranget, L., Tautschnig, M.: Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst. 36(2), 7:1–7:74 (2014).https://doi.org/10.1145/2627752

work page doi:10.1145/2627752 2014
[2]

In: Schneider, K., Brandt, J

Aspinall, D., ˇSevˇ c´ ık, J.: Formalising Java’s data race free guarantee. In: Schneider, K., Brandt, J. (eds.) TPHOL’07. pp. 22–37. LNCS 4732, Springer (2007).https: //doi.org/10.1007/978-3-540-74591-4_4

work page doi:10.1007/978-3-540-74591-4_4 2007
[3]

In: ESOP

Batty, M., Memarian, K., Nienhuis, K., Pichon-Pharabod, J., Sewell, P.: The prob- lem of programming language concurrency semantics. In: ESOP. Lecture Notes in Computer Science, vol. 9032, pp. 283–307. Springer (2015)

2015
[4]

Bender, J., Palsberg, J.: A formalization of Java’s concurrent access modes. Proc. ACM Program. Lang.3(OOPSLA), 142:1–142:28 (2019)

2019
[5]

In: ESOP

Cenciarelli, P., Knapp, A., Sibilio, E.: The Java memory model: Operationally, denotationally, axiomatically. In: ESOP. Lecture Notes in Computer Science, vol. 4421, pp. 331–346. Springer (2007)

2007
[6]

In: Dillig, I., Tasiran, S

Gavrilenko, N., de Le´ on, H.P., Furbach, F., Heljanko, K., Meyer, R.: BMC for weak memory models: Relation analysis for compact SMT encodings. In: Dillig, I., Tasiran, S. (eds.) CAV’19’. pp. 355–365. LNCS 11561, Springer (2019).https: //doi.org/10.1007/978-3-030-25540-4_19

work page doi:10.1007/978-3-030-25540-4_19 2019
[7]

Geeson, L., Smith, L.: Compiler testing with relaxed memory models. In: CGO. pp. 334–348. IEEE (2024)

2024
[8]

Gosling, J., Joy, B., Steele, G.: The Java language specification (1996)

1996
[9]

threads and locks - 17.4

Gosling, J., Joy, B., Steele, G., Bracha, G., Buckley, A., Smith, D., Bierman, G.: Java language specification - Chapter 17. threads and locks - 17.4. memory model,https://docs.oracle.com/javase/specs/jls/se24/html/jls-17.html# jls-17.4
[10]

Jeffrey, A., Riely, J., Batty, M., Cooksey, S., Kaysin, I., Podkopaev, A.: The leaky semicolon: compositional semantic dependencies for relaxed-memory concurrency. Proc. ACM Program. Lang.6(POPL), 1–30 (2022).https://doi.org/10.1145/ 3498716

2022
[11]

In: Bouyer, P., van de Pol, J

Jorshari, M.H.K., Kokologiannakis, M., Majumdar, R., Nagendra, S.: Optimal con- colic dynamic partial order reduction. In: Bouyer, P., van de Pol, J. (eds.) CON- CUR’25. LIPIcs, vol. 348, pp. 26:1–26:22. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2025).https://doi.org/10.4230/LIPICS.CONCUR.2025.26

work page doi:10.4230/lipics.concur.2025.26 2025
[12]

In: Silva, A., Leino, K.R.M

Kokologiannakis, M., Vafeiadis, V.: Genmc: A model checker for weak memory models. In: Silva, A., Leino, K.R.M. (eds.) CAV’21. pp. 427–440. LNCS 12759, Springer (2021).https://doi.org/10.1007/978-3-030-81685-8_20

work page doi:10.1007/978-3-030-81685-8_20 2021
[13]

Lahav, O.: A case against semantic dependencies. In: The Future of Weak Memory 2024.https://popl24.sigplan.org/details/fowm-2024-papers/ 8/A-case-against-semantic-dependencies, recording of talk avaliable athttps: //youtu.be/7VZtwIvZ6qE

2024
[14]

Millstein, and David Walker

Lahav, O., Boker, U.: Decidable verification under a causally consistent shared memory. In: Donaldson, A.F., Torlak, E. (eds.) PLDI’20. pp. 211–226. ACM (2020). https://doi.org/10.1145/3385412.3385966

work page doi:10.1145/3385412.3385966 2020
[15]

PLDI 2017 , year =

Lahav, O., Vafeiadis, V., Kang, J., Hur, C., Dreyer, D.: Repairing sequential con- sistency in C/C++11. In: Cohen, A., Vechev, M.T. (eds.) PLDI’17. pp. 618–632. ACM (2017).https://doi.org/10.1145/3062341.3062352

work page doi:10.1145/3062341.3062352 2017
[16]

Lea, D.: Using JDK 9 memory order modes,https://gee.cs.oswego.edu/dl/ html/j9mm.html jMT: Testing Correctness of Java Memory Models 19
[17]

In: ECOOP

Liu, S., Bender, J., Palsberg, J.: Compiling volatile correctly in Java. In: ECOOP. LIPIcs, vol. 222, pp. 6:1–6:26. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2022)

2022
[18]

In: Ali, K., Vitek, J

Liu, S., Bender, J., Palsberg, J.: Compiling volatile correctly in Java. In: Ali, K., Vitek, J. (eds.) ECOOP’22. pp. 6:1–26. LIPIcs 222, Schloss Dagstuhl - Leibniz- Zentrum f¨ ur Informatik (2022).https://doi.org/10.4230/LIPICS.ECOOP.2022.6

work page doi:10.4230/lipics.ecoop.2022.6 2022
[19]

blogspot.com/2007/03/volatile-fields-and-synchronization.html

Manson, J.: Volatile fields and synchronization,https://jeremymanson. blogspot.com/2007/03/volatile-fields-and-synchronization.html

2007
[20]

Manson, J.: The Java Memory Model. Ph.D. thesis, University of Maryland, Col- lege Park, MD, USA (2004),https://hdl.handle.net/1903/1949

2004
[21]

In: FTfJP’02 (2002)

Manson, J., Pugh, W.: The Java memory model simulator. In: FTfJP’02 (2002)

2002
[22]

Manson, J., Pugh, W., Adve, S.: The Java memory modelhttps: //web.archive.org/web/20060303190317/http://www.cs.umd.edu/users/ jmanson/java/journal.pdf, original URL:http://www.cs.umd.edu/users/ jmanson/java/journal.pdf, archived 2006-03-03

work page arXiv 2006
[23]

In: POPL

Manson, J., Pugh, W.W., Adve, S.V.: The Java memory model. In: POPL. pp. 378–391. ACM (2005)

2005
[24]

Moiseenko, E., Kokologiannakis, M., Vafeiadis, V.: Model checking for a multi- execution memory model. Proc. ACM Program. Lang.6(OOPSLA2), 758–785 (2022).https://doi.org/10.1145/3563315

work page doi:10.1145/3563315 2022
[25]

Panneke, L., Wehrheim, H.: jMT: Testing correctness of Java memory models (ar- tifact) (Oct 2025).https://doi.org/10.5281/zenodo.17634679

work page doi:10.5281/zenodo.17634679 2025
[26]

In: M¨ uller, P

Paviotti, M., Cooksey, S., Paradis, A., Wright, D., Owens, S., Batty, M.: Mod- ular relaxed dependencies in weak memory concurrency. In: M¨ uller, P. (ed.) ESOP20. pp. 599–625. LNCS 12075, Springer (2020).https://doi.org/10.1007/ 978-3-030-44914-8_22

2020
[27]

In: Boyland, J.T

Petri, G., Vitek, J., Jagannathan, S.: Cooking the books: Formalizing JMM imple- mentation recipes. In: Boyland, J.T. (ed.) ECOOP’15. LIPIcs, vol. 37, pp. 445–469. Schloss Dagstuhl - Leibniz-Zentrum f¨ ur Informatik (2015).https://doi.org/10. 4230/LIPICS.ECOOP.2015.445

2015
[28]

Pugh, W.: Causality test cases,https://www.cs.umd.edu/ ~pugh/java/ memoryModel/CausalityTestCases.html
[29]

In: Java Grande

Pugh, W.W.: Fixing the Java memory model. In: Java Grande. pp. 89–98. ACM (1999)

1999
[30]

Richards, J., Wright, D., Cooksey, S., Batty, M.: Symbolic MRD: Dynamic memory, undefined behaviour, and extrinsic choice. Proc. ACM Program. Lang. 9(OOPSLA1), 1858–1882 (2025).https://doi.org/10.1145/3721089

work page doi:10.1145/3721089 2025
[31]

In: Vitek, J

ˇSevˇ c´ ık, J., Aspinall, D.: On validity of program transformations in the Java memory model. In: Vitek, J. (ed.) ECOOP’08. pp. 27–51. LNCS 5142, Springer (2008). https://doi.org/10.1007/978-3-540-70592-5_3

work page doi:10.1007/978-3-540-70592-5_3 2008
[32]

net/blog/2016/close-encounters-of-jmm-kind/

Shipil¨ ev, A.: Close encounters of the Java memory model kind,https://shipilev. net/blog/2016/close-encounters-of-jmm-kind/

2016
[33]

com/openjdk/jcstress/

Shipil¨ ev, A., et al.: Java concurrency stress (jcstress) (2013),https://github. com/openjdk/jcstress/

2013
[34]

Java Language S p e c i f i c a t i o n

Trippel, C., Manerkar, Y.A., Lustig, D., Pellauer, M., Martonosi, M.: TriCheck: Memory model verification at the trisection of software, hardware, and ISA. In: Chen, Y., Temam, O., Carter, J. (eds.) ASPLOS’17. pp. 119–133. ACM (2017). https://doi.org/10.1145/3037697.3037719 20 L. Panneke and H. Wehrheim A The Java language specification’s SEM 1" Java Lang...

work page doi:10.1145/3037697.3037719 2017