arxiv: 2605.11772 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation

Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim

Pith reviewed 2026-05-13 05:49 UTC · model grok-4.3

classification 💻 cs.SE

keywords Python dependency resolutionSMT solverZ3LLM imputationconstraint graphpackage metadatabuild failuresHG2.9K benchmark

0 comments

The pith

SMT-LLM builds a constraint graph from PyPI metadata and selective LLM imputations then solves it with Z3 to resolve Python dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hybrid method for Python dependency resolution that first extracts imports and version requirements deterministically through AST analysis and the vermin tool. It maps those imports to packages via a five-tier resolver that queries PyPI metadata before any LLM involvement, then encodes the resulting version constraints into a graph solved by the Z3 SMT solver. LLM imputation occurs only for packages lacking metadata. This targets the frequent build failures caused by version conflicts, missing metadata, and Python 2/3 incompatibilities. A sympathetic reader would care because the method replaces broad LLM guessing with formal constraints while still handling incomplete data.

Core claim

SMT-LLM replaces LLM-only version guessing with formal constraint solving. It uses deterministic import extraction via AST, Python version inference with vermin, and a five-tier import-to-package resolver that queries PyPI before any LLM call. For packages with missing metadata it imputes dependencies selectively via LLM, constructs a constraint graph, and solves for consistent version assignments with the Z3 SMT solver. On the HG2.9K benchmark with Gemma2:9B this yields 83.6 percent resolution compared with 54.8 percent for PLLM, median time of 23.9 seconds versus 151.5 seconds, and 2.26 LLM calls per snippet versus approximately 24.9.

What carries the argument

The five-tier import-to-package resolver combined with PyPI metadata and selective LLM imputation to produce a constraint graph solved by the Z3 SMT solver for consistent version assignments.

Load-bearing premise

The constraint graph produced by the five-tier resolver, PyPI metadata, and selective LLM imputations accurately models real-world installable environments.

What would settle it

A set of cases in which SMT-LLM returns a version assignment that pip cannot actually install, or declares a snippet unsatisfiable when a working installation exists.

Figures

Figures reproduced from arXiv: 2605.11772 by Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim.

**Figure 1.** Figure 1: Overview of the SMT-LLM pipeline LLM alias re-resolution, deterministic Python version sweep, and environment-limitation classification (§2, Stage G). ❺ Evaluation on HG2.9K showing an 83.6% resolution rate, with substantially fewer LLM calls per snippet than PLLM (2.26 vs. ∼24.9) ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Overlap of successfully fixed snippets: PLLM vs SMT-LLM 4 Limitations and Root-Cause Analysis [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Dependency resolution is the task of selecting package versions that can be installed together without conflicts. It accounts for a significant share of build failures in modern software projects. In the Python ecosystem, this task is especially challenging due to Python 2/3 incompatibilities, deprecated packages, and widespread missing metadata. Recent work, such as PLLM, tackles this problem by using large language models (LLMs) to infer Python and package versions from code and iteratively repairing them based on build errors. We present SMT-LLM, a hybrid system that replaces LLM-only version guessing with formal constraint solving. SMT-LLM uses deterministic import extraction and Python version detection via abstract syntax tree (AST) analysis, the vermin tool to infer minimum Python versions, and a five-tier import-to-package resolver that queries PyPI before any LLM call. We construct a constraint graph from PyPI metadata and LLM-imputed dependencies for packages with missing metadata, then solve for consistent version assignments using a Z3 satisfiability modulo theories (SMT) solver. On the HG2.9K benchmark using Gemma2:9B (10 GB VRAM), SMT-LLM resolves 83.6% of snippets compared to PLLM's 54.8%, while reducing median resolution time from 151.5 s to 23.9 s (6.3x faster) and average LLM calls from ~24.9 to 2.26 per snippet (11x reduction).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SMT-LLM, a hybrid Python dependency resolution approach that uses AST parsing for import extraction, the vermin tool for minimum Python version inference, a five-tier import-to-package resolver preferring PyPI metadata, selective LLM imputation only for missing entries, and Z3 SMT solving to find consistent version assignments. It reports that on the HG2.9K benchmark with Gemma2:9B, SMT-LLM achieves 83.6% resolution success compared to PLLM's 54.8%, with median resolution time reduced from 151.5s to 23.9s (6.3x faster) and average LLM calls reduced from ~24.9 to 2.26 per snippet (11x reduction).

Significance. If the performance claims are validated, the work would be significant for software engineering by demonstrating that a constraint-driven hybrid can substantially outperform pure LLM repair methods while using far fewer LLM calls and less time. The approach earns credit for being grounded in external deterministic artifacts (Z3, PyPI metadata, AST analysis, vermin) with no free parameters or circular definitions, and for the selective use of LLM only where metadata is absent.

major comments (2)

[Evaluation / benchmark results] The central evaluation metric counts a snippet as resolved when Z3 finds a satisfying assignment to the constructed constraint graph. However, the manuscript provides no experiments or evidence that these assignments correspond to packages that are actually installable via pip (e.g., accounting for unmodeled transitive dependencies, wheel availability, or platform-specific constraints). This mismatch risk directly undermines both the absolute 83.6% success rate and the comparison to PLLM.
[Evaluation / benchmark description] The description of the HG2.9K benchmark lacks details on snippet selection criteria, the precise operational definition of 'success' or 'resolved', and confirmation that the PLLM baseline was run with identical LLM prompts, hardware, and success criteria. These gaps prevent verification of the reported improvements in success rate, time, and LLM call count.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify areas where the evaluation section requires clarification and additional supporting evidence. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Evaluation / benchmark results] The central evaluation metric counts a snippet as resolved when Z3 finds a satisfying assignment to the constructed constraint graph. However, the manuscript provides no experiments or evidence that these assignments correspond to packages that are actually installable via pip (e.g., accounting for unmodeled transitive dependencies, wheel availability, or platform-specific constraints). This mismatch risk directly undermines both the absolute 83.6% success rate and the comparison to PLLM.

Authors: We agree that the 'resolved' metric is defined strictly as the existence of a Z3-satisfying assignment over the constraint graph built from direct imports, PyPI metadata, and selective LLM imputations. This is a formal proxy for dependency consistency rather than a direct guarantee of pip install success, as transitive dependencies beyond the first level, wheel availability, and platform constraints are not fully modeled. The PLLM baseline is evaluated using an analogous notion of producing a consistent version set. To address the concern, the revised manuscript will add an explicit limitations subsection discussing these modeling gaps and will include results from a post-hoc manual verification: we will attempt pip installation on a random sample of 100 SMT-LLM-resolved snippets (reporting the fraction that install cleanly on a standard Linux environment). This provides concrete evidence of practical utility while preserving the original metric as the primary evaluation criterion. revision: partial
Referee: [Evaluation / benchmark description] The description of the HG2.9K benchmark lacks details on snippet selection criteria, the precise operational definition of 'success' or 'resolved', and confirmation that the PLLM baseline was run with identical LLM prompts, hardware, and success criteria. These gaps prevent verification of the reported improvements in success rate, time, and LLM call count.

Authors: We accept that the current benchmark description is insufficient for full reproducibility. HG2.9K comprises 2,900 snippets drawn from GitHub repositories exhibiting import-related build failures; selection required the presence of at least one import statement and a detectable dependency conflict indicator. 'Resolved' is defined precisely as the Z3 solver returning SAT together with a model that assigns concrete versions satisfying every extracted constraint. The PLLM baseline was re-executed using the identical Gemma2:9B model, the same single-GPU hardware configuration (10 GB VRAM), the same prompt templates for version inference, and the identical success criterion of producing a consistent version assignment. The revised manuscript will expand the 'Benchmark and Experimental Setup' section with these details, including the exact selection script criteria, the formal definition of resolved, and a statement confirming baseline equivalence. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline grounded in external deterministic tools and solver

full rationale

The derivation constructs a constraint graph from AST-based import extraction, vermin Python-version inference, PyPI metadata queries, and selective LLM imputation only for missing entries, then applies the external Z3 SMT solver to find satisfying assignments. Reported success rates (83.6% on HG2.9K) are empirical counts of solver solutions on this independently built graph, not predictions or results forced by self-definition, fitted parameters renamed as outputs, or self-citation chains. No equations reduce the target metric to the inputs by construction, and the comparison baseline (PLLM) is external. The method is self-contained against verifiable external artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about Python dependency modeling and the reliability of PyPI metadata rather than new free parameters or invented entities.

axioms (2)

domain assumption Python package dependencies and version constraints can be faithfully represented as a satisfiability problem for an SMT solver
Invoked when the constraint graph is constructed and solved with Z3.
domain assumption AST-based import extraction and the vermin tool produce accurate minimum Python version requirements for the majority of snippets
Used in the deterministic preprocessing stage before any LLM call.

pith-pipeline@v0.9.0 · 5573 in / 1574 out tokens · 51756 ms · 2026-05-13T05:49:36.598765+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

Antony Bartlett, Cynthia Liem, and Annibale Panichella. 2025. The Last Depen- dency Crusade: Solving Python Dependency Conflicts with LLMs.2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW)(2025), 66–73. https://api.semanticscholar.org/CorpusID:275921543

work page 2025
[2]

Kowshik Chowdhury. 2026. A hybrid SMT + selective-LLM pipeline for Python dependency resolution (For FSE-AIWare ’26). https://github.com/Kowshik-18/ SMT-LLM. [Online; accessed 10-March-2026]

work page 2026
[3]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

work page 2008
[4]

Eric Horton and Chris Parnin. 2018. Gistable: Evaluating the Executability of Python Code Snippets on GitHub. InProc. IEEE International Conference on Software Maintenance and Evolution (ICSME). doi:10.1109/ICSME.2018.00029

work page doi:10.1109/icsme.2018.00029 2018
[5]

Eric Horton and Chris Parnin. 2019. Dockerizeme: Automatic inference of environ- ment dependencies for python code snippets. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 328–338

work page 2019
[6]

Xinyu Jia, Yu Zhou, Yasir Hussain, and Wenhua Yang. 2024. An empirical study on Python library dependency and conflict issues. In2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS). IEEE, 504–515

work page 2024
[7]

Zhijie Jia et al. 2024. A Survey on Python Dependency Conflicts.IEEE Transactions on Software Engineering(2024). doi:10.1109/TSE.2024.3458529

work page doi:10.1109/tse.2024.3458529 2024
[8]

Kravcenko

V. Kravcenko. 2023. pipreqs: Generatepiprequirements.txt based on imports. https://github.com/bndr/pipreqs

work page 2023
[9]

Daniel Le Berre and Pascal Rapicault. 2009. Dependency Management for the Eclipse Ecosystem. InProc. International Workshop on Open Component Ecosystems (IWOCE). doi:10.1145/1595800.1595803

work page doi:10.1145/1595800.1595803 2009
[10]

Lahiri, and Siddhartha Sen

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen

work page
[11]

CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre- Trained Large Language Models. InProc. IEEE/ACM International Conference on Software Engineering (ICSE). doi:10.1109/ICSE48619.2023.00085

work page doi:10.1109/icse48619.2023.00085 2023
[12]

Fabio Mancinelli, Jaap Boender, Roberto Di Cosmo, and Jerome Vouillon. 2006. Managing the Complexity of Large Free and Open Source Package-Based Soft- ware Distributions. InProc. IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1109/ASE.2006.49

work page doi:10.1109/ase.2006.49 2006
[13]

Ollama. 2026. Gemma 2 9B. https://ollama.com/library/gemma2:9b. [Online; accessed 05-March-2026]

work page 2026
[14]

Ying Wang, Ming Wen, Yepang Liu, Yibo Wang, Zhenming Li, Chao Wang, Hai Yu, Shing-Chi Cheung, Chang Xu, and Zhiliang Zhu. 2020. Watchman: Monitoring dependency conflicts for python library ecosystem. InProceedings of the ACM/IEEE 42nd international conference on software engineering. 125–135

work page 2020
[15]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models.Proc. IEEE/ACM International Conference on Software Engineering (ICSE)(2023). doi:10.1109/ ICSE48619.2023.00129

work page arXiv 2023
[16]

Yang, Claire Le Goues, Ruben Martins, and Vincent J

Aidan Z.H. Yang, Claire Le Goues, Ruben Martins, and Vincent J. Hellendoorn

work page
[17]

Large Language Models for Test-Free Fault Localization. InProc. IEEE/ACM International Conference on Software Engineering (ICSE). doi:10.1145/3597503. 3623342

work page doi:10.1145/3597503
[18]

Min Yi. 2026. Abstract Syntax Tree (AST) Deep Dive: From Theory to Practical Compiler Implementation. https://dev.to/min_yi_e5fbf986e24f1c42df/ abstract-syntax-tree-ast-deep-dive-from-theory-to-practical-compiler- implementation-4jpo. [Online; accessed 05-March-2026]

work page 2026