arxiv: 2604.10445 · v1 · submitted 2026-04-12 · 💻 cs.PL

Recognition: unknown

Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations

Aditi Raste, Anamitra Ghorui, Uday P. Khedker

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.PL

keywords pointer analysispoints-to analysisdeduplicationmemory optimizationcontext-sensitive analysisflow-sensitive analysisdata redundancyperformance scaling

0 comments

The pith

A recursive multi-level deduplication engine cuts memory use up to 18x and runtime 8x in flow- and context-sensitive pointer analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pointer analysis repeatedly performs the same set operations and propagates duplicate data, wasting most of its effort. Measurements show over 90 percent of set-union operations can be redundant. The paper introduces a Multi-level Deduplication Engine that builds unique representations recursively from other deduplicated values and memoizes operations. This turns repeated work into constant-time lookups. When integrated into an existing pointer analysis, the approach delivers large reductions in peak memory and runtime that increase with larger input programs.

Core claim

The authors establish that a Multi-level Deduplication Engine recursively augments data representations through de-duplication and assignment of unique identifiers, allowing many low-level operations to be trivialized or memoized while capturing structural redundancy that non-recursive methods miss, thereby scaling precise flow- and context-sensitive pointer analysis without loss of information.

What carries the argument

Multi-level Deduplication Engine (MDE), a recursive mechanism that de-duplicates values, assigns unique identifiers, and enables operation trivialization and memoization.

If this is right

Precise pointer analysis can handle substantially larger programs without precision or soundness trade-offs.
Representation choices become a primary lever for scaling analyses that involve heavy data propagation.
Performance gains grow as program size increases, favoring use on real-world codebases.
MDE can be added as a drop-in C++ library to existing analysis frameworks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recursive deduplication pattern could apply to other static analyses that repeatedly combine sets or propagate data.
Analysis frameworks might benefit from built-in support for detecting and exploiting such redundancy automatically.
Deeper structural similarities across different analysis domains could be captured by extending the recursion levels.

Load-bearing premise

The high rate of redundant set-union operations observed on SPEC benchmarks is representative of pointer analysis workloads in general, and the deduplication process introduces no precision loss or soundness issues.

What would settle it

Running the same flow- and context-sensitive pointer analysis implementation on the same SPEC benchmarks both with and without MDE integration, then checking whether the reported memory and runtime reductions appear and whether the computed points-to sets remain identical.

read the original abstract

Precise pointer analysis is a foundational component of many client analyses and optimizations. Scaling flow- and context-sensitive pointer analysis has been a long-standing challenge, suffering from combinatorial growth in both memory usage and runtime. Existing approaches address this primarily by reducing the amount of information tracked often, at the cost of precision and soundness. In our experience a significant proportion of this cost comes from propagation of duplicate data and low-level data structure operations being repeated a large number of times. Our measurements on SPEC benchmarks show that more than 90% of all set-union operations performed can be redundant. We present Multi-level Deduplication Engine (MDE), a mechanism that recursively augments the representation of data through de-duplication and the assignment of unique identifiers to values to eliminate redundancy. This allows MDE to trivialize many operations, and memoize operations enabling their future reuse. MDE's recursive structure allows it to represent de-duplicated values that themselves are constructed from other de-deuplicated values, capturing structural redundancy not easily possible with non-recursive techniques. We provide a full C++ implementation of MDE as a library and integrate it into an existing implementation of a flow- and context-sensitive pointer analysis. Evaluation on selected SPEC benchmarks shows a reduction up to 18.10x in peak memory usage and 8.15x in runtime. More notably, MDE exhibits an upward trend of effectiveness with the increase in benchmark size. Besides performance improvements, this work highlights the importance of representation design and suggests new opportunities for bringing efficiency to future analyses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MDE gives a recursive dedup trick for points-to sets that cuts memory and time on SPEC with bigger gains on larger programs, but the paper still needs to show the final results match the original analysis exactly.

read the letter

The punchline is that this work takes the known redundancy in set operations during pointer analysis and attacks it with a recursive multi-level deduplication engine that assigns unique IDs and memoizes results. On selected SPEC benchmarks they see more than 90 percent of unions become trivial, delivering up to 18x lower peak memory and 8x faster runs, and the savings grow with program size. That scaling observation is useful because pointer analysis pain is worst on big codebases. They ship a complete C++ library and plug it into a real flow- and context-sensitive implementation, which is concrete engineering that others can actually try. The recursive structure is the part that feels new; it lets deduplicated values contain other deduplicated values, something flat memoization usually misses. The implementation detail and the trend with benchmark size are the strongest parts of the paper. The soft spot is the missing check that the transformed sets produce identical points-to results. The abstract and evaluation focus only on speed and memory; there is no reported comparison of final sets, no equivalence test, and no argument that the ID assignment and memoization cannot change propagation order or reachability in some client analysis. Without that, the performance numbers are interesting but not yet convincing for soundness-critical use. Measurement methodology for the 90 percent figure and the speedups is also thin. This paper is for people who build or tune static analyses and care about data-structure cost inside them. A reader who needs practical ways to shrink memory in dataflow problems will get concrete ideas and numbers to build on. It deserves a serious referee because the core mechanism is new in this setting and the empirical results are specific enough to review, even if the soundness gap needs to be closed in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Multi-level Deduplication Engine (MDE), a recursive mechanism that augments data representations in points-to analysis via deduplication and unique identifier assignment to eliminate redundancy in set-union operations and enable memoization. It reports that over 90% of set-union operations are redundant on SPEC benchmarks and claims that integrating MDE into a flow- and context-sensitive pointer analysis yields up to 18.10x peak memory reduction and 8.15x runtime improvement, with effectiveness increasing for larger benchmarks.

Significance. If the integration preserves exact points-to results, MDE offers a valuable alternative to precision-reducing approximations for scaling pointer analysis by targeting structural and operational redundancy. The provision of a complete C++ library implementation strengthens reproducibility and potential reuse in other analyses.

major comments (2)

[Evaluation] The evaluation reports concrete speedups and memory reductions but provides no comparison of final points-to sets, no output-equivalence testing, and no argument that unique-identifier assignment plus recursive deduplication leaves propagation behavior unchanged. This equivalence is load-bearing for the central claim that MDE can be integrated without affecting analysis soundness or precision.
[Abstract] The claim that more than 90% of set-union operations are redundant is presented without details on measurement methodology (e.g., how operations were instrumented or counted across client analyses), making it impossible to assess whether the figure is representative or reproducible.

minor comments (2)

[Evaluation] The upward trend of effectiveness with benchmark size is asserted but not supported by a table or figure showing per-benchmark sizes, speedups, and memory reductions.
[Evaluation] No error bars, variance across runs, or details on the SPEC benchmark selection and configuration are provided for the reported 18.10x and 8.15x factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the paper.

read point-by-point responses

Referee: [Evaluation] The evaluation reports concrete speedups and memory reductions but provides no comparison of final points-to sets, no output-equivalence testing, and no argument that unique-identifier assignment plus recursive deduplication leaves propagation behavior unchanged. This equivalence is load-bearing for the central claim that MDE can be integrated without affecting analysis soundness or precision.

Authors: We agree that providing evidence for semantic equivalence is essential to support our claims. Although the MDE library is designed to provide equivalent set operations through canonical representations and memoized unions that compute the same results, we did not include explicit verification in the submitted manuscript. In the revised version, we will add both a theoretical argument explaining why the recursive deduplication and unique identifier assignment preserve the exact behavior of the original set-union and propagation (as MDE acts as a drop-in replacement for the data structures without altering their semantics), and empirical results comparing the final points-to sets produced with and without MDE to confirm they are identical. We will also report any minor differences if they arise due to implementation details. revision: yes
Referee: [Abstract] The claim that more than 90% of set-union operations are redundant is presented without details on measurement methodology (e.g., how operations were instrumented or counted across client analyses), making it impossible to assess whether the figure is representative or reproducible.

Authors: The referee is correct that the measurement methodology for the 90% redundancy figure is not detailed in the abstract or the provided summary. In the full manuscript, we describe the instrumentation, but to address this, we will revise the abstract and add a dedicated subsection in the evaluation or methodology section detailing how the set-union operations were counted, including the specific client analyses used, the instrumentation approach, and the benchmarks. This will enhance reproducibility and allow assessment of representativeness. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance claims rest on external benchmarks and implementation, not self-referential definitions or fitted predictions

full rationale

The paper describes a new MDE mechanism for deduplicating data structures in pointer analysis, provides a C++ library implementation, integrates it into an existing flow- and context-sensitive analysis, and reports measured speedups and memory reductions on SPEC benchmarks. No equations, first-principles derivations, or 'predictions' appear in the provided text. The central claims are empirical measurements of redundancy (>90% redundant set-unions) and resulting performance gains (up to 18.10x memory, 8.15x runtime). These quantities are obtained from direct execution on external benchmarks rather than being defined in terms of themselves or obtained by fitting parameters to a subset and renaming the fit as a prediction. Potential concerns about output equivalence or precision preservation are correctness issues outside the scope of circularity analysis; they do not reduce any claimed derivation to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the MDE technique itself; the central claim rests on the empirical observation of redundancy and the engineering integration of the library.

pith-pipeline@v0.9.0 · 5597 in / 1252 out tokens · 48182 ms · 2026-05-10T16:23:17.632984+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 6 canonical work pages

[1]

LL VM Alias Analysis Infrastructure.https://llvm.org/docs/AliasAnalysis.html; 2026

LL VM Contributors . LL VM Alias Analysis Infrastructure.https://llvm.org/docs/AliasAnalysis.html; 2026. [Accessed 2026-03-17]

2026
[2]

In defense of soundiness: a manifesto

Livshits B, Sridharan M, Smaragdakis Y , et al. In defense of soundiness: a manifesto. Commun. ACM. 2015;58(2):4446. doi: 10.1145/2644805

work page doi:10.1145/2644805 2015
[3]

PhD thesis

Andersen LO, Lee P .Program Analysis and Specialization for the C Programming Language . PhD thesis. University of Copenhagen, Nørregade 10 DK-1165 Copenhagen K; 2005

2005
[4]

Points-to analysis in almost linear time

Steensgaard B. Points-to analysis in almost linear time. In: Association for Computing Machinery. 1996:32–41

1996
[5]

Using Datalog with Binary Decision Diagrams for Program Analysis

Whaley J, Avots D, Carbin M, Lam MS. Using Datalog with Binary Decision Diagrams for Program Analysis. In: Yi K. , ed. Programming Languages and SystemsSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2005; Berlin, Heidelberg:97–118

2005
[6]

Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code

Wang K, Hussain A, Zuo Z, Xu G, Amiri Sani A. Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code. SIGARCH Comput. Archit. News. 2017;45(1):389404. doi: 10.1145/3093337.3037744

work page doi:10.1145/3093337.3037744 2017
[7]

Flow-sensitive pointer analysis for millions of lines of code

Hardekopf B, Lin C. Flow-sensitive pointer analysis for millions of lines of code. In: IEEE. 2011:289-298

2011
[8]

SVF: interprocedural static value-ﬂow analysis in LL VM

Sui Y , Xue J. SVF: interprocedural static value-ﬂow analysis in LL VM. In: CC ’16. Association for Computing Machinery. Association for Computing Machinery 2016; New Y ork, NY , USA:265266

2016
[9]

Liveness-based pointer analysis

Khedker UP , Mycroft A, Rawat PS. Liveness-based pointer analysis. In: Springer. 2012:265–282

2012
[10]

Interprocedural data ﬂow analysis in Soot using value contexts

Padhye R, Khedker UP . Interprocedural data ﬂow analysis in Soot using value contexts. In: SOAP ’13. Association for Computing Machinery. Association for Computing Machinery 2013; New Y ork, NY , USA:3136

2013
[11]

Flow- and Context-Sensitive Points-To Analysis Using Generalized Points-To Graphs

Gharat PM, Khedker UP , Mycroft A. Flow- and Context-Sensitive Points-To Analysis Using Generalized Points-To Graphs. In: Rival X. , ed. Static AnalysisSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2016; Berlin, Heidelberg:212–236

2016
[12]

Precise interprocedural dataﬂow analysis via graph reachability

Reps T, Horwitz S, Sagiv M. Precise interprocedural dataﬂow analysis via graph reachability. In: POPL ’95. Association for Computing Machinery. Association for Computing Machinery 1995; New Y ork, NY , USA:4961

1995
[13]

Precise interprocedural dataﬂow analysis with applications to constant propagation

Sagiv M, Reps T, Horwitz S. Precise interprocedural dataﬂow analysis with applications to constant propagation. Theoretical Computer Science. 1996;167(1):131-170. doi: 10.1016/0304-3975(96)00072-2

work page doi:10.1016/0304-3975(96)00072-2 1996
[14]

Semi-sparse ﬂow-sensitive pointer analysis

Hardekopf B, Lin C. Semi-sparse ﬂow-sensitive pointer analysis. In: POPL ’09. Association for Computing Machinery. Association for Computing Machinery 2009; New Y ork, NY , USA:226238

2009
[15]

Using ZBDDs in Points-to Analysis

Lhoták O, Curial S, Amaral JN. Using ZBDDs in Points-to Analysis. In: Adve V , Garzarán MJ, Petersen P ., eds. Languages and Compilers for Parallel ComputingSpringer Berlin Heidelberg. Springer Berlin Heidelberg 2008; Berlin, Heidelberg:338–352

2008
[16]

Hash consed points-to sets

Barbar M, Sui Y . Hash consed points-to sets. In: Springer. 2021:25–48

2021
[17]

SPEC CPU2006 benchmark descriptions

Henning JL. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News. 2006;34(4):117. doi: 10.1145/1186736.1186737

work page doi:10.1145/1186736.1186737 2006
[18]

Program Flow Analysis: Theory and Application

Muchnick SS, Jones ND. Program Flow Analysis: Theory and Application . Prentice Hall Professional Technical Reference, 1981

1981
[19]

https://en.cppreference.com/w/cpp/algorithm/set_union.html; 2025

std::set_union - cppreference.com. https://en.cppreference.com/w/cpp/algorithm/set_union.html; 2025. [Accessed 2025-12-21]

2025
[20]

Merge sort — Wikipedia, The Free Encyclopedia

Wikipedia contributors . Merge sort — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Merge_sort&oldid= 1325228601; 2025. [Online; accessed 21-December-2025]. Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations 25

2025
[21]

Flow-sensitive pointer analysis for millions of lines of code

Hardekopf B, Lin C. Flow-sensitive pointer analysis for millions of lines of code. In: IEEE. 2011:289–298

2011
[22]

Github - SVF-tools/SVF: Static V alue-Flow Analysis Framework for Source Code

Sui Y , Xue J. Github - SVF-tools/SVF: Static V alue-Flow Analysis Framework for Source Code. https://github.com/SVF-tools/SVF; 2026. [Accessed 2026-02-19]

2026
[23]

llvm::SparseBitV ector< ElementSize > Class Template Reference.https://llvm.org/doxygen/classllvm_1_1SparseBitV ector

LL VM Contributors . llvm::SparseBitV ector< ElementSize > Class Template Reference.https://llvm.org/doxygen/classllvm_1_1SparseBitV ector. html; 2026. [Accessed 2026-02-20]

2026
[24]

LL VM Programmers Manual.https://llvm.org/docs/ProgrammersManual.html; 2026

LL VM Contributors . LL VM Programmers Manual.https://llvm.org/docs/ProgrammersManual.html; 2026. [Accessed 2026-02-20]

2026
[25]

Logic in Computer Science: Modelling and reasoning about systems

Huth M, Ryan M. Logic in Computer Science: Modelling and reasoning about systems . Cambridge university press, 2004

2004
[26]

Introduction to Zero-Suppressed Decision Diagrams:1–33; Cham: Springer International Publishing

Mishchenko A. Introduction to Zero-Suppressed Decision Diagrams:1–33; Cham: Springer International Publishing . 2015

2015
[27]

CUDD: CU decision diagram package release 3.0.0

Somenzi F. CUDD: CU decision diagram package release 3.0.0. University of Colorado at Boulder . 1998. [Accessed 2026-02-20]

1998
[28]

Artifact for Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations

Ghorui A, Raste A, Khedker U. Artifact for Points-to Analysis Using MDE: A Multi-level Deduplication Engine for Repetitive Data and Operations. https://doi.org/10.5281/zenodo.19437315; 2026

work page doi:10.5281/zenodo.19437315 2026
[29]

The Often Misunderstood GEP Instruction

LL VM Contributors . The Often Misunderstood GEP Instruction. https://llvm.org/docs/GetElementPtr.html; 2026. [Accessed 2026-03-17]

2026
[30]

Compiler Analysis of the V alue Ranges for V ariables.IEEE Transactions on Software Engineering

Harrison WH. Compiler Analysis of the V alue Ranges for V ariables.IEEE Transactions on Software Engineering. 1977;3(3):243-250. Copyright - Copyright IEEE Computer Society May 1977; Last updated - 2026-02-09; CODEN - IESEDJ

1977
[31]

MIT Press, 2021

Cousot P .Principles of abstract interpretation. MIT Press, 2021

2021
[32]

https://www.boost.org/; 2026

Boost C++ Library Collection. https://www.boost.org/; 2026. [Accessed 2026-03-17]

2026
[33]

https://abseil.io/about/; 2026

Abseil C++ Library Collection. https://abseil.io/about/; 2026. [Accessed 2026-03-17]

2026
[34]

Hasse diagram — Wikipedia, The Free Encyclopedia

Wikipedia contributors . Hasse diagram — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Hasse_diagram& oldid=1339943720; 2026. [Online; accessed 14-March-2026]

2026
[35]

Pthreads — Wikipedia, The Free Encyclopedia

Wikipedia contributors . Pthreads — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Pthreads&oldid= 1342501295; 2026. [Online; accessed 16-March-2026]

2026
[36]

www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html; 2025

Intel(R) OneAPI Threading Building Blocks. www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html; 2025. [Accessed 2025-09- 22]

2025
[37]

LL VM Language Reference Manual.https://llvm.org/docs/LangRef.html; 2026

LL VM Contributors . LL VM Language Reference Manual.https://llvm.org/docs/LangRef.html; 2026. [Accessed 2026-02-20]

2026
[38]

edge miss

Khedker UP , Sanyal A, Karkare A. Heap reference analysis using access graphs. ACM Trans. Program. Lang. Syst.. 2007;30(1):1es. doi: 10.1145/1290520.1290521 SUPPORTING INFORMATION The data that supports the ﬁndings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.19437315, reference number 19437315. How to cite this article: ...

work page doi:10.1145/1290520.1290521 2007