Recognition: 2 theorem links
· Lean TheoremAccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs
Pith reviewed 2026-05-11 02:04 UTC · model grok-4.3
The pith
Synchronization hazards in AI accelerator pipelines are decidable by checking whether barriers order every cross-unit write-read pair on shared buffers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations—program order, synchronization order, and barrier order—and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. We prove that barrier sufficiency is decidable in O(|E|^2) time and that our checker is both sound and complete under the modeled semantics.
What carries the argument
Parameterized hardware event semantics that combine program order, synchronization order, and barrier order to compute the happens-before relation and test whether it orders all cross-unit accesses on each buffer.
If this is right
- The same decision procedure applies to any accelerator whose visibility semantics can be expressed by adjusting the three ordering relations in the model.
- The quadratic-time algorithm scales to thousands of production kernels and runs at far lower cost than runtime sanitizers.
- Every non-equivalent mutant that introduces a missing barrier is detected, confirming completeness on the test suite.
- The checker reports a 19 percent defect rate on LLM-generated kernels, indicating that generated code frequently omits necessary synchronization points.
Where Pith is reading between the lines
- Compiler back-ends could invoke the checker after code generation to reject kernels that lack sufficient barriers before they reach hardware.
- The same event-order model could be reused to verify other ordering properties such as fence placement or memory consistency in multi-unit accelerators.
- Early adoption in operator libraries might reduce the number of nondeterministic bugs that only appear under specific driver or toolkit versions.
Load-bearing premise
The three ordering relations accurately capture the visibility rules between execution units on the modeled accelerators.
What would settle it
A kernel that the checker reports as barrier-sufficient yet produces nondeterministic results on the actual hardware due to an unobserved cross-unit race, or a kernel with a genuine race that the checker incorrectly marks as safe.
Figures
read the original abstract
AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces hardware-visible data races that escape both simulation and golden testing, because neither models the accelerator's cross-unit visibility semantics. We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations -- program order, synchronization order, and barrier order -- and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. Here "barrier" denotes an abstract ordering primitive in the model, covering vendor pipe barriers, hard-event synchronization, and equivalent frontend-normalized synchronization points. We prove that barrier sufficiency is decidable in $O(|E|^2)$ time and that our checker is both sound and complete under the modeled semantics. We implement AccelSync, a static verification tool instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, AccelSync identifies 3 previously unknown synchronization hazards -- one matching a hazard class for which we observed nondeterministic outputs on Ascend 910B2 under a specific toolkit/driver configuration (CANN 8.0.RC3), though this observation was not reproducible after a subsequent driver upgrade -- and on 120 LLM-generated kernels it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and a head-to-head comparison shows AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AccelSync, a static verification tool for synchronization hazards in AI accelerator pipeline programs (DMA/vector/matrix/scalar units on shared buffers). It formalizes programs as a restricted concurrent language, defines a parameterized hardware event semantics using program order, synchronization order, and barrier order, and reduces correctness to barrier sufficiency (every cross-unit write-read pair on the same buffer must be ordered by happens-before). The authors prove decidability in O(|E|^2) time with soundness and completeness under the model, implement the checker for Ascend 910B2 and Cambricon MLU370 (by swapping only the hardware model), and evaluate on 6292 CANN kernels (finding 3 hazards, one matching a non-reproducible observed issue) plus 120 LLM-generated kernels (19.2% defect rate) and 688 mutants (100% detection), outperforming msSanitizer at lower cost.
Significance. If the model holds, the quadratic decidability proof and sound/complete checker provide a practical, scalable foundation for catching races missed by simulation/testing in production accelerator code. The large-scale evaluation on real kernels, 100% mutant detection, and direct comparison to an existing sanitizer are concrete strengths that support deployability. The parameterized design enabling reuse across vendors is a notable engineering contribution.
major comments (2)
- [Abstract and §3] Abstract and §3 (semantics definition): The practical claims for Ascend 910B2 and Cambricon MLU370 rest on the three-relation parameterized semantics (program order, synchronization order, barrier order) accurately capturing all cross-unit visibility, including DMA/matrix/scalar interactions on shared buffers. The paper supports this only via one non-reproducible real-hazard match and an internal mutation study; no hardware trace validation, formal correspondence to vendor specs, or discussion of omitted behaviors (e.g., implicit unit-specific visibility or non-barrier effects) is provided. This is load-bearing for the applicability and hazard-detection claims.
- [Evaluation section] Evaluation section (production kernels and mutation study): The 3 identified hazards and 19.2% defect rate on LLM kernels are reported under the model, but the single observed hazard on Ascend 910B2 (CANN 8.0.RC3) was not reproducible after a driver upgrade, and the 100% mutant detection was performed inside the model rather than against independent hardware traces. This weakens the evidence that the checker reliably detects real silicon hazards.
minor comments (2)
- [Implementation] The description of how the hardware model is instantiated (changing only the model for each accelerator) is high-level; adding a small example of the parameterization for one barrier primitive would improve clarity.
- [Evaluation] Table or figure reporting the 6292-kernel results could explicitly list the three detected hazards with their buffer and unit details for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the semantics and evaluation. We clarify the model's scope and evidence, and will revise to add discussion of limitations where appropriate.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (semantics definition): The practical claims for Ascend 910B2 and Cambricon MLU370 rest on the three-relation parameterized semantics (program order, synchronization order, barrier order) accurately capturing all cross-unit visibility, including DMA/matrix/scalar interactions on shared buffers. The paper supports this only via one non-reproducible real-hazard match and an internal mutation study; no hardware trace validation, formal correspondence to vendor specs, or discussion of omitted behaviors (e.g., implicit unit-specific visibility or non-barrier effects) is provided. This is load-bearing for the applicability and hazard-detection claims.
Authors: We agree the semantics is an abstraction and does not claim to model every hardware detail. It focuses on the three ordering relations to capture cross-unit visibility for synchronization hazards, with soundness and completeness proven relative to this model. We will revise §3 to add explicit discussion of omitted behaviors such as implicit unit-specific visibility or non-barrier effects. Formal correspondence to vendor specs and hardware trace validation are not feasible, as these details are proprietary and not publicly available; our evidence instead rests on the mutation study and practical detection in production kernels. revision: partial
-
Referee: [Evaluation section] Evaluation section (production kernels and mutation study): The 3 identified hazards and 19.2% defect rate on LLM kernels are reported under the model, but the single observed hazard on Ascend 910B2 (CANN 8.0.RC3) was not reproducible after a driver upgrade, and the 100% mutant detection was performed inside the model rather than against independent hardware traces. This weakens the evidence that the checker reliably detects real silicon hazards.
Authors: We already disclose in the manuscript that the observed hazard was not reproducible after the driver upgrade. The 100% mutant detection is model-based, as independent hardware traces for 688 mutants would require impractical hardware access and resources. The contribution centers on decidable verification under a formal semantics that catches issues missed by tools like msSanitizer. We do not claim exhaustive real-silicon validation but demonstrate practical utility on thousands of kernels. revision: no
- No hardware trace validation or formal correspondence to proprietary vendor specifications is available or feasible to provide.
Circularity Check
No significant circularity; formal proof is self-contained within defined model
full rationale
The paper defines a parameterized hardware event semantics using program order, synchronization order, and barrier order, then reduces barrier sufficiency to a happens-before check and proves decidability in O(|E|^2) time along with soundness and completeness under that semantics. This is a standard formal derivation that does not reduce any claim to its inputs by construction, nor rely on self-citations or fitted parameters for the core result. Empirical validation on kernels and mutation studies is separate from the proof and does not create circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The three ordering relations (program order, synchronization order, barrier order) together define a sound happens-before relation for the accelerator's cross-unit visibility.
- domain assumption The restricted concurrent language faithfully represents real accelerator pipeline programs compiled from AI operators.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations—program order, synchronization order, and barrier order—and reduce the correctness question to barrier sufficiency... We prove that barrier sufficiency is decidable in O(|E|^2) time and that our checker is both sound and complete under the modeled semantics.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The happens-before relation is the transitive closure of the union of all three orderings: →hb = (→po ∪ →so ∪ →bo)+
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory.ACM TOPLAS36, 2 (2014)
work page 2014
-
[2]
Jai Arora, Sirui Lu, Devansh Jain, Tianfan Xu, Farzin Houshmand, Phitchaya Mangpo Phothilimthana, Mohsen Lesani, Praveen Narayanan, Karthik Srinivasa Murthy, Rastislav Bodík, Amit Sabne, and Charith Mendis. 2025. TensorRight: Automated Verification of Tensor Graph Rewrites.Proc. ACM Program. Lang.9, POPL (2025), 832–863. https://doi.org/10.1145/3704865
-
[3]
ascend-rs contributors. 2025. ascend-rs: Memory-Safe NPU Kernel Programming in Rust. https://ascend-rs.org. Appendix C: MultiKernelBench vulnerability audit of 300 AscendC kernels
work page 2025
-
[5]
The Design and Implementation of a Verification Technique for GPU Kernels.ACM Trans. Program. Lang. Syst. 37, 3 (2015), 10:1–10:49. https://doi.org/10.1145/2743017
-
[6]
Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson, and John Wickerson
Adam Betts, Nathan Chong, Alastair F. Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson, and John Wickerson
-
[7]
The Design and Implementation of a Verification Technique for GPU Kernels. InPLDI
-
[8]
Donaldson, Shaz Qadeer, and Paul Thomson
Adam Betts, Nathan Chong, Alastair F. Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: A Verifier for GPU Kernels. InOOPSLA
work page 2012
-
[9]
Cambricon Technologies. 2023. BANG C Programming Guide. Cambricon Developer Documentation
work page 2023
-
[10]
Cambricon Technologies. 2024. BANG C Programming Practice: Pipeline and Synchronization Examples. https: //github.com/Cambricon/bang-c-practice
work page 2024
-
[11]
Cambricon Technologies. 2024. mlu-ops: Cambricon Open-Source Machine Learning Operator Library. https: //github.com/Cambricon/mlu-ops
work page 2024
-
[12]
Bodhisatwa Chatterjee, Drew Zagieboylo, Sana Damani, Siva Hari, and Christos Kozyrakis. 2025. ProofWright: Towards Agentic Formal Verification of CUDA.CoRRabs/2511.12294 (2025). https://doi.org/10.48550/ARXIV.2511.12294
-
[13]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. InOSDI
work page 2018
-
[14]
Edmund Clarke, Daniel Kroening, and Flavio Lerda. 2004. A Tool for Checking ANSI-C Programs. InTACAS. 168–176
work page 2004
-
[15]
Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. (1977)
work page 1977
-
[16]
Peng Di, Ding Ye, Yu Su, Yulei Sui, and Jingling Xue. 2018. Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Synchronization. InFMCAD
work page 2018
-
[17]
Haoran Ding, Zhaoguo Wang, and Haibo Chen. 2026. FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning.CoRRabs/2604.11556 (2026). https://doi.org/10.48550/ARXIV.2604.11556
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.11556 2026
-
[18]
Haoran Ding, Zhaoguo Wang, Zhuohao Shen, Rong Chen, and Haibo Chen. 2023. Automated Verification of Idempo- tence for Stateful Serverless Applications. InOSDI
work page 2023
-
[19]
Cormac Flanagan and Stephen N. Freund. 2009. FastTrack: Efficient and Precise Dynamic Race Detection. InPLDI
work page 2009
-
[20]
Google. 2024. XLA: Optimizing Compiler for Machine Learning. Developer Documentation
work page 2024
-
[21]
Pollard, Nadesh Ramanathan, and John Wickerson
Yann Herklotz, James D. Pollard, Nadesh Ramanathan, and John Wickerson. 2021. Formal Verification of High-Level Synthesis. InOOPSLA
work page 2021
- [22]
-
[23]
Huawei. 2023. Ascend C Programming Guide. Huawei Developer Documentation
work page 2023
-
[24]
Huawei Technologies. 2025. msSanitizer: Anomaly Detection for AscendC Operators. CANN 8.0.RC3 Developer Documentation. https://www.hiascend.com/document/detail/zh/canncommercial/80RC3/devaids/opdev/optool/ atlasopdev_16_0039.html
work page 2025
-
[25]
Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing.IEEE TSE37, 5 (2011)
work page 2011
- [26]
-
[27]
Jouppi, Cliff Young, Nishant Patil, David Patterson, et al
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. (2017). AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs 27
work page 2017
-
[28]
Leslie Lamport. 1978. Time, Clocks, and the Ordering of Events in a Distributed System.Commun. ACM21, 7 (1978)
work page 1978
-
[29]
Leslie Lamport. 1994. The Temporal Logic of Actions.ACM TOPLAS16, 3, 872–923
work page 1994
-
[30]
Xavier Leroy. 2009. Formal Verification of a Realistic Compiler.Commun. ACM(2009)
work page 2009
-
[31]
Guodong Li and Ganesh Gopalakrishnan. 2010. Scalable SMT-Based Verification of GPU Kernel Functions. InFSE
work page 2010
-
[32]
Guodong Li, Peng Li, Ganesh Gopalakrishnan, Shan Lu, and Jeremy Cong. 2012. GKLEE: Concolic Verification and Test Generation for GPUs. InPLDI
work page 2012
-
[33]
Peng Li, Ganesh Gopalakrishnan, et al . 2014. Effective Random Testing for Detecting Concurrency Bugs in GPU Programs. InASE
work page 2014
-
[34]
Peng Li, Chang Liu, and Ganesh Gopalakrishnan. 2014. Practical Symbolic Race Checking of GPU Programs. InSC
work page 2014
-
[35]
Ying Li, Xiang Gao, et al. 2020. Simulee: Detecting CUDA Synchronization Bugs via Dynamic Analysis. InICSE
work page 2020
-
[36]
Lopes, David Menendez, Santosh Nagarakatte, and John Regehr
Nuno P. Lopes, David Menendez, Santosh Nagarakatte, and John Regehr. 2015. Provably Correct Peephole Optimizations with Alive. InPLDI
work page 2015
-
[37]
Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. InASPLOS
work page 2019
-
[38]
Antoine Miné. 2006. The Octagon Abstract Domain. InHigher-Order and Symbolic Computation, Vol. 19. 31–100
work page 2006
-
[39]
Robert H. B. Netzer and Barton P. Miller. 1992. What Are Race Conditions? Some Issues and Formalizations.ACM Lett. Program. Lang. Syst.1, 1 (1992), 74–88
work page 1992
-
[40]
NVIDIA. 2024. CUDA-MEMCHECK Racecheck Tool Documentation. NVIDIA Developer Documentation
work page 2024
-
[41]
NVIDIA. 2024. TensorRT Developer Guide. Developer Documentation
work page 2024
-
[42]
Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2013. Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. InPLDI
work page 2013
-
[43]
Thomas Reps, Susan Horwitz, and Mooly Sagiv. 1995. Precise Interprocedural Dataflow Analysis via Graph Reachability. (1995)
work page 1995
-
[44]
Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A Dynamic Data Race Detector for Multithreaded Programs.ACM Trans. Comput. Syst.15, 4, 391–411
work page 1997
-
[45]
Konstantin Serebryany and Timur Iskhodzhanov. 2009. ThreadSanitizer: Data Race Detection in Practice. InWBIA (Workshop on Binary Instrumentation and Applications)
work page 2009
-
[46]
Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. InMLSys
work page 2019
-
[47]
John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically Comparing Memory Consistency Models. InPOPL
work page 2017
-
[48]
Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations. InPLDI
work page 2021
-
[49]
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. InOSDI
work page 2020
-
[50]
Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. InOSDI
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.