Recognition: unknown
Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations
Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3
The pith
A recursive multi-level deduplication engine cuts memory use up to 18x and runtime 8x in flow- and context-sensitive pointer analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a Multi-level Deduplication Engine recursively augments data representations through de-duplication and assignment of unique identifiers, allowing many low-level operations to be trivialized or memoized while capturing structural redundancy that non-recursive methods miss, thereby scaling precise flow- and context-sensitive pointer analysis without loss of information.
What carries the argument
Multi-level Deduplication Engine (MDE), a recursive mechanism that de-duplicates values, assigns unique identifiers, and enables operation trivialization and memoization.
If this is right
- Precise pointer analysis can handle substantially larger programs without precision or soundness trade-offs.
- Representation choices become a primary lever for scaling analyses that involve heavy data propagation.
- Performance gains grow as program size increases, favoring use on real-world codebases.
- MDE can be added as a drop-in C++ library to existing analysis frameworks.
Where Pith is reading between the lines
- The same recursive deduplication pattern could apply to other static analyses that repeatedly combine sets or propagate data.
- Analysis frameworks might benefit from built-in support for detecting and exploiting such redundancy automatically.
- Deeper structural similarities across different analysis domains could be captured by extending the recursion levels.
Load-bearing premise
The high rate of redundant set-union operations observed on SPEC benchmarks is representative of pointer analysis workloads in general, and the deduplication process introduces no precision loss or soundness issues.
What would settle it
Running the same flow- and context-sensitive pointer analysis implementation on the same SPEC benchmarks both with and without MDE integration, then checking whether the reported memory and runtime reductions appear and whether the computed points-to sets remain identical.
read the original abstract
Precise pointer analysis is a foundational component of many client analyses and optimizations. Scaling flow- and context-sensitive pointer analysis has been a long-standing challenge, suffering from combinatorial growth in both memory usage and runtime. Existing approaches address this primarily by reducing the amount of information tracked often, at the cost of precision and soundness. In our experience a significant proportion of this cost comes from propagation of duplicate data and low-level data structure operations being repeated a large number of times. Our measurements on SPEC benchmarks show that more than 90% of all set-union operations performed can be redundant. We present Multi-level Deduplication Engine (MDE), a mechanism that recursively augments the representation of data through de-duplication and the assignment of unique identifiers to values to eliminate redundancy. This allows MDE to trivialize many operations, and memoize operations enabling their future reuse. MDE's recursive structure allows it to represent de-duplicated values that themselves are constructed from other de-deuplicated values, capturing structural redundancy not easily possible with non-recursive techniques. We provide a full C++ implementation of MDE as a library and integrate it into an existing implementation of a flow- and context-sensitive pointer analysis. Evaluation on selected SPEC benchmarks shows a reduction up to 18.10x in peak memory usage and 8.15x in runtime. More notably, MDE exhibits an upward trend of effectiveness with the increase in benchmark size. Besides performance improvements, this work highlights the importance of representation design and suggests new opportunities for bringing efficiency to future analyses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Multi-level Deduplication Engine (MDE), a recursive mechanism that augments data representations in points-to analysis via deduplication and unique identifier assignment to eliminate redundancy in set-union operations and enable memoization. It reports that over 90% of set-union operations are redundant on SPEC benchmarks and claims that integrating MDE into a flow- and context-sensitive pointer analysis yields up to 18.10x peak memory reduction and 8.15x runtime improvement, with effectiveness increasing for larger benchmarks.
Significance. If the integration preserves exact points-to results, MDE offers a valuable alternative to precision-reducing approximations for scaling pointer analysis by targeting structural and operational redundancy. The provision of a complete C++ library implementation strengthens reproducibility and potential reuse in other analyses.
major comments (2)
- [Evaluation] The evaluation reports concrete speedups and memory reductions but provides no comparison of final points-to sets, no output-equivalence testing, and no argument that unique-identifier assignment plus recursive deduplication leaves propagation behavior unchanged. This equivalence is load-bearing for the central claim that MDE can be integrated without affecting analysis soundness or precision.
- [Abstract] The claim that more than 90% of set-union operations are redundant is presented without details on measurement methodology (e.g., how operations were instrumented or counted across client analyses), making it impossible to assess whether the figure is representative or reproducible.
minor comments (2)
- [Evaluation] The upward trend of effectiveness with benchmark size is asserted but not supported by a table or figure showing per-benchmark sizes, speedups, and memory reductions.
- [Evaluation] No error bars, variance across runs, or details on the SPEC benchmark selection and configuration are provided for the reported 18.10x and 8.15x factors.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the paper.
read point-by-point responses
-
Referee: [Evaluation] The evaluation reports concrete speedups and memory reductions but provides no comparison of final points-to sets, no output-equivalence testing, and no argument that unique-identifier assignment plus recursive deduplication leaves propagation behavior unchanged. This equivalence is load-bearing for the central claim that MDE can be integrated without affecting analysis soundness or precision.
Authors: We agree that providing evidence for semantic equivalence is essential to support our claims. Although the MDE library is designed to provide equivalent set operations through canonical representations and memoized unions that compute the same results, we did not include explicit verification in the submitted manuscript. In the revised version, we will add both a theoretical argument explaining why the recursive deduplication and unique identifier assignment preserve the exact behavior of the original set-union and propagation (as MDE acts as a drop-in replacement for the data structures without altering their semantics), and empirical results comparing the final points-to sets produced with and without MDE to confirm they are identical. We will also report any minor differences if they arise due to implementation details. revision: yes
-
Referee: [Abstract] The claim that more than 90% of set-union operations are redundant is presented without details on measurement methodology (e.g., how operations were instrumented or counted across client analyses), making it impossible to assess whether the figure is representative or reproducible.
Authors: The referee is correct that the measurement methodology for the 90% redundancy figure is not detailed in the abstract or the provided summary. In the full manuscript, we describe the instrumentation, but to address this, we will revise the abstract and add a dedicated subsection in the evaluation or methodology section detailing how the set-union operations were counted, including the specific client analyses used, the instrumentation approach, and the benchmarks. This will enhance reproducibility and allow assessment of representativeness. revision: yes
Circularity Check
No circularity; empirical performance claims rest on external benchmarks and implementation, not self-referential definitions or fitted predictions
full rationale
The paper describes a new MDE mechanism for deduplicating data structures in pointer analysis, provides a C++ library implementation, integrates it into an existing flow- and context-sensitive analysis, and reports measured speedups and memory reductions on SPEC benchmarks. No equations, first-principles derivations, or 'predictions' appear in the provided text. The central claims are empirical measurements of redundancy (>90% redundant set-unions) and resulting performance gains (up to 18.10x memory, 8.15x runtime). These quantities are obtained from direct execution on external benchmarks rather than being defined in terms of themselves or obtained by fitting parameters to a subset and renaming the fit as a prediction. Potential concerns about output equivalence or precision preservation are correctness issues outside the scope of circularity analysis; they do not reduce any claimed derivation to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
LL VM Alias Analysis Infrastructure.https://llvm.org/docs/AliasAnalysis.html; 2026
LL VM Contributors . LL VM Alias Analysis Infrastructure.https://llvm.org/docs/AliasAnalysis.html; 2026. [Accessed 2026-03-17]
2026
-
[2]
In defense of soundiness: a manifesto
Livshits B, Sridharan M, Smaragdakis Y , et al. In defense of soundiness: a manifesto. Commun. ACM. 2015;58(2):4446. doi: 10.1145/2644805
-
[3]
PhD thesis
Andersen LO, Lee P .Program Analysis and Specialization for the C Programming Language . PhD thesis. University of Copenhagen, Nørregade 10 DK-1165 Copenhagen K; 2005
2005
-
[4]
Points-to analysis in almost linear time
Steensgaard B. Points-to analysis in almost linear time. In: Association for Computing Machinery. 1996:32–41
1996
-
[5]
Using Datalog with Binary Decision Diagrams for Program Analysis
Whaley J, Avots D, Carbin M, Lam MS. Using Datalog with Binary Decision Diagrams for Program Analysis. In: Yi K. , ed. Programming Languages and SystemsSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2005; Berlin, Heidelberg:97–118
2005
-
[6]
Wang K, Hussain A, Zuo Z, Xu G, Amiri Sani A. Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code. SIGARCH Comput. Archit. News. 2017;45(1):389404. doi: 10.1145/3093337.3037744
-
[7]
Flow-sensitive pointer analysis for millions of lines of code
Hardekopf B, Lin C. Flow-sensitive pointer analysis for millions of lines of code. In: IEEE. 2011:289-298
2011
-
[8]
SVF: interprocedural static value-flow analysis in LL VM
Sui Y , Xue J. SVF: interprocedural static value-flow analysis in LL VM. In: CC ’16. Association for Computing Machinery. Association for Computing Machinery 2016; New Y ork, NY , USA:265266
2016
-
[9]
Liveness-based pointer analysis
Khedker UP , Mycroft A, Rawat PS. Liveness-based pointer analysis. In: Springer. 2012:265–282
2012
-
[10]
Interprocedural data flow analysis in Soot using value contexts
Padhye R, Khedker UP . Interprocedural data flow analysis in Soot using value contexts. In: SOAP ’13. Association for Computing Machinery. Association for Computing Machinery 2013; New Y ork, NY , USA:3136
2013
-
[11]
Flow- and Context-Sensitive Points-To Analysis Using Generalized Points-To Graphs
Gharat PM, Khedker UP , Mycroft A. Flow- and Context-Sensitive Points-To Analysis Using Generalized Points-To Graphs. In: Rival X. , ed. Static AnalysisSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2016; Berlin, Heidelberg:212–236
2016
-
[12]
Precise interprocedural dataflow analysis via graph reachability
Reps T, Horwitz S, Sagiv M. Precise interprocedural dataflow analysis via graph reachability. In: POPL ’95. Association for Computing Machinery. Association for Computing Machinery 1995; New Y ork, NY , USA:4961
1995
-
[13]
Precise interprocedural dataflow analysis with applications to constant propagation
Sagiv M, Reps T, Horwitz S. Precise interprocedural dataflow analysis with applications to constant propagation. Theoretical Computer Science. 1996;167(1):131-170. doi: 10.1016/0304-3975(96)00072-2
-
[14]
Semi-sparse flow-sensitive pointer analysis
Hardekopf B, Lin C. Semi-sparse flow-sensitive pointer analysis. In: POPL ’09. Association for Computing Machinery. Association for Computing Machinery 2009; New Y ork, NY , USA:226238
2009
-
[15]
Using ZBDDs in Points-to Analysis
Lhoták O, Curial S, Amaral JN. Using ZBDDs in Points-to Analysis. In: Adve V , Garzarán MJ, Petersen P ., eds. Languages and Compilers for Parallel ComputingSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2008; Berlin, Heidelberg:338–352
2008
-
[16]
Hash consed points-to sets
Barbar M, Sui Y . Hash consed points-to sets. In: Springer. 2021:25–48
2021
-
[17]
SPEC CPU2006 benchmark descriptions
Henning JL. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News. 2006;34(4):117. doi: 10.1145/1186736.1186737
-
[18]
Program Flow Analysis: Theory and Application
Muchnick SS, Jones ND. Program Flow Analysis: Theory and Application . Prentice Hall Professional Technical Reference, 1981
1981
-
[19]
https://en.cppreference.com/w/cpp/algorithm/set_union.html; 2025
std::set_union - cppreference.com. https://en.cppreference.com/w/cpp/algorithm/set_union.html; 2025. [Accessed 2025-12-21]
2025
-
[20]
Merge sort — Wikipedia, The Free Encyclopedia
Wikipedia contributors . Merge sort — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Merge_sort&oldid= 1325228601; 2025. [Online; accessed 21-December-2025]. Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations 25
2025
-
[21]
Flow-sensitive pointer analysis for millions of lines of code
Hardekopf B, Lin C. Flow-sensitive pointer analysis for millions of lines of code. In: IEEE. 2011:289–298
2011
-
[22]
Github - SVF-tools/SVF: Static V alue-Flow Analysis Framework for Source Code
Sui Y , Xue J. Github - SVF-tools/SVF: Static V alue-Flow Analysis Framework for Source Code. https://github.com/SVF-tools/SVF; 2026. [Accessed 2026-02-19]
2026
-
[23]
llvm::SparseBitV ector< ElementSize > Class Template Reference.https://llvm.org/doxygen/classllvm_1_1SparseBitV ector
LL VM Contributors . llvm::SparseBitV ector< ElementSize > Class Template Reference.https://llvm.org/doxygen/classllvm_1_1SparseBitV ector. html; 2026. [Accessed 2026-02-20]
2026
-
[24]
LL VM Programmers Manual.https://llvm.org/docs/ProgrammersManual.html; 2026
LL VM Contributors . LL VM Programmers Manual.https://llvm.org/docs/ProgrammersManual.html; 2026. [Accessed 2026-02-20]
2026
-
[25]
Logic in Computer Science: Modelling and reasoning about systems
Huth M, Ryan M. Logic in Computer Science: Modelling and reasoning about systems . Cambridge university press, 2004
2004
-
[26]
Introduction to Zero-Suppressed Decision Diagrams:1–33; Cham: Springer International Publishing
Mishchenko A. Introduction to Zero-Suppressed Decision Diagrams:1–33; Cham: Springer International Publishing . 2015
2015
-
[27]
CUDD: CU decision diagram package release 3.0.0
Somenzi F. CUDD: CU decision diagram package release 3.0.0. University of Colorado at Boulder . 1998. [Accessed 2026-02-20]
1998
-
[28]
Ghorui A, Raste A, Khedker U. Artifact for Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations. https://doi.org/10.5281/zenodo.19437315; 2026
-
[29]
The Often Misunderstood GEP Instruction
LL VM Contributors . The Often Misunderstood GEP Instruction. https://llvm.org/docs/GetElementPtr.html; 2026. [Accessed 2026-03-17]
2026
-
[30]
Compiler Analysis of the V alue Ranges for V ariables.IEEE Transactions on Software Engineering
Harrison WH. Compiler Analysis of the V alue Ranges for V ariables.IEEE Transactions on Software Engineering. 1977;3(3):243-250. Copyright - Copyright IEEE Computer Society May 1977; Last updated - 2026-02-09; CODEN - IESEDJ
1977
-
[31]
MIT Press, 2021
Cousot P .Principles of abstract interpretation. MIT Press, 2021
2021
-
[32]
https://www.boost.org/; 2026
Boost C++ Library Collection. https://www.boost.org/; 2026. [Accessed 2026-03-17]
2026
-
[33]
https://abseil.io/about/; 2026
Abseil C++ Library Collection. https://abseil.io/about/; 2026. [Accessed 2026-03-17]
2026
-
[34]
Hasse diagram — Wikipedia, The Free Encyclopedia
Wikipedia contributors . Hasse diagram — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Hasse_diagram& oldid=1339943720; 2026. [Online; accessed 14-March-2026]
2026
-
[35]
Pthreads — Wikipedia, The Free Encyclopedia
Wikipedia contributors . Pthreads — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Pthreads&oldid= 1342501295; 2026. [Online; accessed 16-March-2026]
2026
-
[36]
www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html; 2025
Intel(R) OneAPI Threading Building Blocks. www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html; 2025. [Accessed 2025-09- 22]
2025
-
[37]
LL VM Language Reference Manual.https://llvm.org/docs/LangRef.html; 2026
LL VM Contributors . LL VM Language Reference Manual.https://llvm.org/docs/LangRef.html; 2026. [Accessed 2026-02-20]
2026
-
[38]
Khedker UP , Sanyal A, Karkare A. Heap reference analysis using access graphs. ACM Trans. Program. Lang. Syst.. 2007;30(1):1es. doi: 10.1145/1290520.1290521 SUPPORTING INFORMATION The data that supports the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.19437315, reference number 19437315. How to cite this article: ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.