pith. machine review for the scientific record.

arxiv: 2605.07881 · v1 · submitted 2026-05-08 · 💻 cs.AR

Recognition: 2 Lean theorem links

AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:04 UTC · model grok-4.3

classification 💻 cs.AR
keywords synchronization verification · accelerator pipelines · data race detection · static analysis · AI accelerators · barrier sufficiency · hardware event semantics

The pith

Synchronization hazards in AI accelerator pipelines are decidable by checking whether barriers order every cross-unit write-read pair on shared buffers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI accelerator programs run as concurrent pipelines across DMA, vector, matrix, and scalar units that share on-chip buffers, so a missing synchronization point can allow one unit to read stale data written by another. The paper models this behavior with events governed by program order, synchronization order, and barrier order, then defines barrier sufficiency as the property that the resulting happens-before relation covers every cross-unit write-read pair. It proves that deciding this property is possible in quadratic time in the number of events and that the decision procedure is sound and complete under the model. The resulting static checker can therefore scan production kernels without running them and surface races that simulation and testing miss because they do not capture the accelerator's cross-unit visibility rules.
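The core check described above can be sketched in a few lines. This is our minimal reconstruction, not AccelSync's implementation: the event encoding and function names are invented, and the per-event graph traversal attains the paper's O(|E|^2) bound only under the assumption that the number of ordering edges is linear in the number of events.

```python
from itertools import product

def happens_before(events, edges):
    """Transitive closure of po ∪ so ∪ bo via one DFS per event.

    With a linear number of ordering edges this is O(|E|^2) overall,
    matching the paper's stated bound.
    """
    succ = {e["id"]: [] for e in events}
    for a, b in edges:           # edges already pool po, so, and bo
        succ[a].append(b)
    hb = {e["id"]: set() for e in events}
    for e in events:
        stack = list(succ[e["id"]])
        while stack:
            n = stack.pop()
            if n not in hb[e["id"]]:
                hb[e["id"]].add(n)
                stack.extend(succ[n])
    return hb

def barrier_sufficient(events, edges):
    """True iff every cross-unit write-read pair on a shared buffer
    is ordered by happens-before; otherwise returns the unordered pairs."""
    hb = happens_before(events, edges)
    hazards = []
    for w, r in product(events, events):
        if (w["kind"] == "W" and r["kind"] == "R"
                and w["buffer"] == r["buffer"] and w["unit"] != r["unit"]
                and r["id"] not in hb[w["id"]] and w["id"] not in hb[r["id"]]):
            hazards.append((w["id"], r["id"]))
    return not hazards, hazards

# Toy kernel: DMA writes buffer UB, the vector unit reads it.
events = [
    {"id": "w1", "unit": "DMA", "kind": "W", "buffer": "UB"},
    {"id": "b1", "unit": "DMA", "kind": "B", "buffer": None},  # barrier event
    {"id": "r1", "unit": "VEC", "kind": "R", "buffer": "UB"},
]
with_barrier = [("w1", "b1"), ("b1", "r1")]  # po edge into barrier, bo edge out
without_barrier = []                          # barrier dropped: pair unordered

print(barrier_sufficient(events, with_barrier))     # → (True, [])
print(barrier_sufficient(events, without_barrier))  # → (False, [('w1', 'r1')])
```

Deleting the barrier edges is exactly the kind of mutation the paper's mutation study injects; the unordered cross-unit pair is what the checker reports.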

Core claim

We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations—program order, synchronization order, and barrier order—and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. We prove that barrier sufficiency is decidable in O(|E|^2) time and that our checker is both sound and complete under the modeled semantics.

What carries the argument

Parameterized hardware event semantics that combine program order, synchronization order, and barrier order to compute the happens-before relation and test whether it orders all cross-unit accesses on each buffer.

If this is right

  • The same decision procedure applies to any accelerator whose visibility semantics can be expressed by adjusting the three ordering relations in the model.
  • The quadratic-time algorithm scales to thousands of production kernels and runs at far lower cost than runtime sanitizers.
  • Every non-equivalent mutant that introduces a missing barrier is detected, confirming completeness on the test suite.
  • The checker reports a 19.2 percent defect rate on LLM-generated kernels, indicating that generated code frequently omits necessary synchronization points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Compiler back-ends could invoke the checker after code generation to reject kernels that lack sufficient barriers before they reach hardware.
  • The same event-order model could be reused to verify other ordering properties such as fence placement or memory consistency in multi-unit accelerators.
  • Early adoption in operator libraries might reduce the number of nondeterministic bugs that only appear under specific driver or toolkit versions.

Load-bearing premise

The three ordering relations accurately capture the visibility rules between execution units on the modeled accelerators.

What would settle it

A kernel that the checker reports as barrier-sufficient yet produces nondeterministic results on the actual hardware due to an unobserved cross-unit race, or a kernel with a genuine race that the checker incorrectly marks as safe.

Figures

Figures reproduced from arXiv: 2605.07881 by Depei Qian, Hangcheng An, Rui Wang.

Figure 1. Ascend 910B2 pipeline architecture. Three stages communicate via FIFO queues (TQue). Within the …
Figure 2. From accelerator pipeline kernel structure to event graph. The frontend extracts typed events (Write, …
Figure 3. Synchronization coverage (barrier sufficiency) illustrated at the abstract event-graph level.
Figure 4. AccelSync implementation pipeline. Two frontends (source-level and IR-level) feed the same event …
Figure 5. Verification time as a function of event count. The empirical trend matches the predicted practical …
read the original abstract

AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces hardware-visible data races that escape both simulation and golden testing, because neither models the accelerator's cross-unit visibility semantics. We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations -- program order, synchronization order, and barrier order -- and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. Here "barrier" denotes an abstract ordering primitive in the model, covering vendor pipe barriers, hard-event synchronization, and equivalent frontend-normalized synchronization points. We prove that barrier sufficiency is decidable in $O(|E|^2)$ time and that our checker is both sound and complete under the modeled semantics. We implement AccelSync, a static verification tool instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, AccelSync identifies 3 previously unknown synchronization hazards -- one matching a hazard class for which we observed nondeterministic outputs on Ascend 910B2 under a specific toolkit/driver configuration (CANN 8.0.RC3), though this observation was not reproducible after a subsequent driver upgrade -- and on 120 LLM-generated kernels it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and a head-to-head comparison shows AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents AccelSync, a static verification tool for synchronization hazards in AI accelerator pipeline programs (DMA/vector/matrix/scalar units on shared buffers). It formalizes programs as a restricted concurrent language, defines a parameterized hardware event semantics using program order, synchronization order, and barrier order, and reduces correctness to barrier sufficiency (every cross-unit write-read pair on the same buffer must be ordered by happens-before). The authors prove decidability in O(|E|^2) time with soundness and completeness under the model, implement the checker for Ascend 910B2 and Cambricon MLU370 (by swapping only the hardware model), and evaluate on 6292 CANN kernels (finding 3 hazards, one matching a non-reproducible observed issue) plus 120 LLM-generated kernels (19.2% defect rate) and 688 mutants (100% detection), outperforming msSanitizer at lower cost.

Significance. If the model holds, the quadratic decidability proof and sound/complete checker provide a practical, scalable foundation for catching races missed by simulation/testing in production accelerator code. The large-scale evaluation on real kernels, 100% mutant detection, and direct comparison to an existing sanitizer are concrete strengths that support deployability. The parameterized design enabling reuse across vendors is a notable engineering contribution.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (semantics definition): The practical claims for Ascend 910B2 and Cambricon MLU370 rest on the three-relation parameterized semantics (program order, synchronization order, barrier order) accurately capturing all cross-unit visibility, including DMA/matrix/scalar interactions on shared buffers. The paper supports this only via one non-reproducible real-hazard match and an internal mutation study; no hardware trace validation, formal correspondence to vendor specs, or discussion of omitted behaviors (e.g., implicit unit-specific visibility or non-barrier effects) is provided. This is load-bearing for the applicability and hazard-detection claims.
  2. [Evaluation section] Evaluation section (production kernels and mutation study): The 3 identified hazards and 19.2% defect rate on LLM kernels are reported under the model, but the single observed hazard on Ascend 910B2 (CANN 8.0.RC3) was not reproducible after a driver upgrade, and the 100% mutant detection was performed inside the model rather than against independent hardware traces. This weakens the evidence that the checker reliably detects real silicon hazards.
minor comments (2)
  1. [Implementation] The description of how the hardware model is instantiated (changing only the model for each accelerator) is high-level; adding a small example of the parameterization for one barrier primitive would improve clarity.
  2. [Evaluation] The table or figure reporting the 6292-kernel results should explicitly list the three detected hazards with their buffer and unit details for reproducibility.
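The parameterization asked for in the first minor comment might look like the following sketch. The primitive names are best-effort readings of the vendor toolkits, and the table-of-mappings format is our assumption, not AccelSync's actual hardware-model interface.

```python
# Hypothetical per-accelerator hardware models: each concrete synchronization
# primitive maps to the abstract ordering relation it induces in the event
# semantics (so = synchronization order, bo = barrier order) plus its scope.
ASCEND_910B2 = {
    "pipe_barrier(PIPE_ALL)": ("bo", "all-units"),
    "set_flag/wait_flag":     ("so", "event-pair"),
}

MLU370 = {
    "__sync_all()": ("bo", "all-units"),
    "__sync_io()":  ("so", "dma-compute"),
}

def normalize(primitive, model):
    """Frontend step: map a concrete synchronization call onto the abstract
    relation consumed by the shared decision procedure; unknown calls
    contribute no ordering edge."""
    return model.get(primitive, ("none", None))

print(normalize("pipe_barrier(PIPE_ALL)", ASCEND_910B2))  # → ('bo', 'all-units')
print(normalize("__sync_io()", MLU370))                   # → ('so', 'dma-compute')
```

Under this reading, retargeting the checker to a new accelerator means swapping the mapping table while the decision procedure stays fixed, which is the portability claim the paper makes.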

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on the semantics and evaluation. We clarify the model's scope and evidence, and will revise to add discussion of limitations where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (semantics definition): The practical claims for Ascend 910B2 and Cambricon MLU370 rest on the three-relation parameterized semantics (program order, synchronization order, barrier order) accurately capturing all cross-unit visibility, including DMA/matrix/scalar interactions on shared buffers. The paper supports this only via one non-reproducible real-hazard match and an internal mutation study; no hardware trace validation, formal correspondence to vendor specs, or discussion of omitted behaviors (e.g., implicit unit-specific visibility or non-barrier effects) is provided. This is load-bearing for the applicability and hazard-detection claims.

    Authors: We agree the semantics is an abstraction and does not claim to model every hardware detail. It focuses on the three ordering relations to capture cross-unit visibility for synchronization hazards, with soundness and completeness proven relative to this model. We will revise §3 to add explicit discussion of omitted behaviors such as implicit unit-specific visibility or non-barrier effects. Formal correspondence to vendor specs and hardware trace validation are not feasible, as these details are proprietary and not publicly available; our evidence instead rests on the mutation study and practical detection in production kernels. revision: partial

  2. Referee: [Evaluation section] Evaluation section (production kernels and mutation study): The 3 identified hazards and 19.2% defect rate on LLM kernels are reported under the model, but the single observed hazard on Ascend 910B2 (CANN 8.0.RC3) was not reproducible after a driver upgrade, and the 100% mutant detection was performed inside the model rather than against independent hardware traces. This weakens the evidence that the checker reliably detects real silicon hazards.

    Authors: We already disclose in the manuscript that the observed hazard was not reproducible after the driver upgrade. The 100% mutant detection is model-based, as independent hardware traces for 688 mutants would require impractical hardware access and resources. The contribution centers on decidable verification under a formal semantics that catches issues missed by tools like msSanitizer. We do not claim exhaustive real-silicon validation but demonstrate practical utility on thousands of kernels. revision: no

standing simulated objections not resolved
  • No hardware trace validation or formal correspondence to proprietary vendor specifications is available or feasible to provide.

Circularity Check

0 steps flagged

No significant circularity; the formal proof is self-contained within the defined model.

full rationale

The paper defines a parameterized hardware event semantics using program order, synchronization order, and barrier order, then reduces barrier sufficiency to a happens-before check and proves decidability in O(|E|^2) time along with soundness and completeness under that semantics. This is a standard formal derivation that does not reduce any claim to its inputs by construction, nor rely on self-citations or fitted parameters for the core result. Empirical validation on kernels and mutation studies is separate from the proof and does not create circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard concurrency theory plus hardware-specific modeling assumptions; no free parameters or new invented physical entities.

axioms (2)
  • domain assumption The three ordering relations (program order, synchronization order, barrier order) together define a sound happens-before relation for the accelerator's cross-unit visibility.
    Invoked when reducing correctness to barrier sufficiency.
  • domain assumption The restricted concurrent language faithfully represents real accelerator pipeline programs compiled from AI operators.
    Required for the checker to apply to production kernels.

pith-pipeline@v0.9.0 · 5634 in / 1349 out tokens · 39845 ms · 2026-05-11T02:04:44.016154+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
    Relation between the paper passage and the cited Recognition theorem.

    We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations—program order, synchronization order, and barrier order—and reduce the correctness question to barrier sufficiency... We prove that barrier sufficiency is decidable in O(|E|^2) time and that our checker is both sound and complete under the modeled semantics.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
    Relation between the paper passage and the cited Recognition theorem.

    The happens-before relation is the transitive closure of the union of all three orderings: →hb = (→po ∪ →so ∪ →bo)+

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

  1. [1] Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM TOPLAS 36, 2 (2014).
  2. [2] Jai Arora, Sirui Lu, Devansh Jain, Tianfan Xu, Farzin Houshmand, Phitchaya Mangpo Phothilimthana, Mohsen Lesani, Praveen Narayanan, Karthik Srinivasa Murthy, Rastislav Bodík, Amit Sabne, and Charith Mendis. 2025. TensorRight: Automated Verification of Tensor Graph Rewrites. Proc. ACM Program. Lang. 9, POPL (2025), 832–863. https://doi.org/10.1145/3704865
  3. [3] ascend-rs contributors. 2025. ascend-rs: Memory-Safe NPU Kernel Programming in Rust. https://ascend-rs.org. Appendix C: MultiKernelBench vulnerability audit of 300 AscendC kernels.
  4. [5] The Design and Implementation of a Verification Technique for GPU Kernels. ACM Trans. Program. Lang. Syst. 37, 3 (2015), 10:1–10:49. https://doi.org/10.1145/2743017
  5. [6] Adam Betts, Nathan Chong, Alastair F. Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson, and John Wickerson.
  6. [7] The Design and Implementation of a Verification Technique for GPU Kernels. In PLDI.
  7. [8] Adam Betts, Nathan Chong, Alastair F. Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: A Verifier for GPU Kernels. In OOPSLA.
  8. [9] Cambricon Technologies. 2023. BANG C Programming Guide. Cambricon Developer Documentation.
  9. [10] Cambricon Technologies. 2024. BANG C Programming Practice: Pipeline and Synchronization Examples. https://github.com/Cambricon/bang-c-practice
  10. [11] Cambricon Technologies. 2024. mlu-ops: Cambricon Open-Source Machine Learning Operator Library. https://github.com/Cambricon/mlu-ops
  11. [12] Bodhisatwa Chatterjee, Drew Zagieboylo, Sana Damani, Siva Hari, and Christos Kozyrakis. 2025. ProofWright: Towards Agentic Formal Verification of CUDA. CoRR abs/2511.12294 (2025). https://doi.org/10.48550/ARXIV.2511.12294
  12. [13] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In OSDI.
  13. [14] Edmund Clarke, Daniel Kroening, and Flavio Lerda. 2004. A Tool for Checking ANSI-C Programs. In TACAS. 168–176.
  14. [15] Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. (1977).
  15. [16] Peng Di, Ding Ye, Yu Su, Yulei Sui, and Jingling Xue. 2018. Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Synchronization. In FMCAD.
  16. [17] Haoran Ding, Zhaoguo Wang, and Haibo Chen. 2026. FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning. CoRR abs/2604.11556 (2026). https://doi.org/10.48550/ARXIV.2604.11556
  17. [18] Haoran Ding, Zhaoguo Wang, Zhuohao Shen, Rong Chen, and Haibo Chen. 2023. Automated Verification of Idempotence for Stateful Serverless Applications. In OSDI.
  18. [19] Cormac Flanagan and Stephen N. Freund. 2009. FastTrack: Efficient and Precise Dynamic Race Detection. In PLDI.
  19. [20] Google. 2024. XLA: Optimizing Compiler for Machine Learning. Developer Documentation.
  20. [21] Yann Herklotz, James D. Pollard, Nadesh Ramanathan, and John Wickerson. 2021. Formal Verification of High-Level Synthesis. In OOPSLA.
  21. [22] Gerard J. Holzmann. 1997. The Model Checker SPIN. IEEE Transactions on Software Engineering 23, 5 (1997), 279–295.
  22. [23] Huawei. 2023. Ascend C Programming Guide. Huawei Developer Documentation.
  23. [24] Huawei Technologies. 2025. msSanitizer: Anomaly Detection for AscendC Operators. CANN 8.0.RC3 Developer Documentation. https://www.hiascend.com/document/detail/zh/canncommercial/80RC3/devaids/opdev/optool/atlasopdev_16_0039.html
  24. [25] Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE TSE 37, 5 (2011).
  25. [26] Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. 2019. Dissecting the Graphcore IPU Architecture via Microbenchmarking. CoRR abs/1912.03413.
  26. [27] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. (2017).
  27. [28] Leslie Lamport. 1978. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM 21, 7 (1978).
  28. [29] Leslie Lamport. 1994. The Temporal Logic of Actions. ACM TOPLAS 16, 3, 872–923.
  29. [30] Xavier Leroy. 2009. Formal Verification of a Realistic Compiler. Commun. ACM (2009).
  30. [31] Guodong Li and Ganesh Gopalakrishnan. 2010. Scalable SMT-Based Verification of GPU Kernel Functions. In FSE.
  31. [32] Guodong Li, Peng Li, Ganesh Gopalakrishnan, Shan Lu, and Jeremy Cong. 2012. GKLEE: Concolic Verification and Test Generation for GPUs. In PLDI.
  32. [33] Peng Li, Ganesh Gopalakrishnan, et al. 2014. Effective Random Testing for Detecting Concurrency Bugs in GPU Programs. In ASE.
  33. [34] Peng Li, Chang Liu, and Ganesh Gopalakrishnan. 2014. Practical Symbolic Race Checking of GPU Programs. In SC.
  34. [35] Ying Li, Xiang Gao, et al. 2020. Simulee: Detecting CUDA Synchronization Bugs via Dynamic Analysis. In ICSE.
  35. [36] Nuno P. Lopes, David Menendez, Santosh Nagarakatte, and John Regehr. 2015. Provably Correct Peephole Optimizations with Alive. In PLDI.
  36. [37] Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In ASPLOS.
  37. [38] Antoine Miné. 2006. The Octagon Abstract Domain. In Higher-Order and Symbolic Computation, Vol. 19. 31–100.
  38. [39] Robert H. B. Netzer and Barton P. Miller. 1992. What Are Race Conditions? Some Issues and Formalizations. ACM Lett. Program. Lang. Syst. 1, 1 (1992), 74–88.
  39. [40] NVIDIA. 2024. CUDA-MEMCHECK Racecheck Tool Documentation. NVIDIA Developer Documentation.
  40. [41] NVIDIA. 2024. TensorRT Developer Guide. Developer Documentation.
  41. [42] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2013. Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. In PLDI.
  42. [43] Thomas Reps, Susan Horwitz, and Mooly Sagiv. 1995. Precise Interprocedural Dataflow Analysis via Graph Reachability. (1995).
  43. [44] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Trans. Comput. Syst. 15, 4, 391–411.
  44. [45] Konstantin Serebryany and Timur Iskhodzhanov. 2009. ThreadSanitizer: Data Race Detection in Practice. In WBIA (Workshop on Binary Instrumentation and Applications).
  45. [46] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In MLSys.
  46. [47] John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically Comparing Memory Consistency Models. In POPL.
  47. [48] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations. In PLDI.
  48. [49] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In OSDI.
  49. [50] Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In OSDI.