
arxiv: 2606.07792 · v1 · pith:XPKUYQC4new · submitted 2026-06-05 · 💻 cs.CR · cs.LG · cs.SE

MOLOT System Card: Malicious Operational Logic Observation Transformer

Pith reviewed 2026-06-27 21:37 UTC · model grok-4.3

classification 💻 cs.CR cs.LG cs.SE
keywords malicious code detection, static analysis, call graphs, behavior sequences, explainable detection, PyPI, npm, DevSecOps

The pith

MOLOT detects malicious packages by modeling behavior sequences from static call graphs and mapping suspicious activities back to source locations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MOLOT as a static analysis system for identifying malicious code in software packages when metadata, maintainer history, and runtime traces are unavailable. It extracts behavior sequences from call graphs, feeds them to a transformer for classification, and adds an explanation stage that ranks suspicious behaviors and ties them to specific code sites. Evaluation covers Python and JavaScript packages from PyPI and npm, with comparisons to existing tools and checks against real moderation constraints on speed, memory, and false positives. The work also releases the Open Malicious-Code Bench for public testing. The central result is that static sequence modeling alone can yield accurate, explainable detection suitable for DevSecOps pipelines.

Core claim

MOLOT represents source code as behavior sequences derived from static call graphs, classifies them via a transformer to separate malicious from benign packages, and supplies explanations by ranking suspicious behavior activities and mapping them to concrete source-code locations. On PyPI and npm data the system meets accuracy, runtime, memory, and false-positive targets observed in production moderation workflows.

What carries the argument

Behavior sequences extracted from static call graphs, processed by the Malicious Operational Logic Observation Transformer, together with a ranking-based explanation stage that links flagged activities back to source locations.
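To make the idea of a behavior sequence tied to source locations concrete, here is a minimal sketch using Python's standard `ast` module. It is a flat scan of call sites, not a call graph, and the `BEHAVIOR_LABELS` vocabulary and the function names are assumptions made for illustration; the paper's actual activity vocabulary and extractor are not described in this summary.

```python
import ast

# Hypothetical mapping from call names to coarse behavior labels.
# The real activity vocabulary used by MOLOT is not given here.
BEHAVIOR_LABELS = {
    "os.system": "process.spawn",
    "subprocess.run": "process.spawn",
    "urllib.request.urlopen": "network.request",
    "base64.b64decode": "encoding.decode",
    "open": "file.access",
}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def behavior_sequence(source):
    """List (label, line) pairs for recognized calls, in source order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in BEHAVIOR_LABELS:
                found.append((node.lineno, node.col_offset, BEHAVIOR_LABELS[name]))
    found.sort()  # order by position in the file
    return [(label, line) for line, _, label in found]


SAMPLE = (
    "import os, base64\n"
    "payload = base64.b64decode('aWQ=')\n"
    "os.system(payload.decode())\n"
)
print(behavior_sequence(SAMPLE))
```

Keeping the line number next to each label is what lets a later explanation stage point a reviewer at a specific call site.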

If this is right

  • Detection becomes possible inside SAST tools that lack package metadata or execution traces.
  • Explanations can directly support human review by pointing to the exact code locations driving the flag.
  • The released Open Malicious-Code Bench supplies a shared test set for comparing future static detectors.
  • Performance under measured runtime and memory limits allows direct insertion into existing DevSecOps moderation queues.
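The second point, explanations that lead a reviewer to a code location, can be illustrated with a small ranking sketch. The `Activity` record, the `rank_findings` helper, and the example scores are all hypothetical; in the paper the scores would come from the SHAP-based explanation stage, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One behavior activity and the source location it was extracted from."""
    label: str
    path: str
    line: int


def rank_findings(activities, scores, top_k=3):
    """Pair each activity with its attribution score and report the top_k highest."""
    ranked = sorted(zip(activities, scores), key=lambda pair: pair[1], reverse=True)
    return [f"{a.path}:{a.line} {a.label} ({s:+.2f})" for a, s in ranked[:top_k]]


# Illustrative inputs only; these are not results from the paper.
activities = [
    Activity("env.read", "setup.py", 4),
    Activity("network.request", "setup.py", 9),
    Activity("file.access", "pkg/util.py", 2),
]
scores = [0.31, 0.62, -0.05]

for finding in rank_findings(activities, scores, top_k=2):
    print(finding)
```

A `path:line` prefix is a common convention because most editors and code-review tools can jump straight to it.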

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequence representation could be combined with dynamic traces when those become available, potentially raising detection rates further.
  • Extending the call-graph extraction to additional languages would test whether the approach generalizes beyond Python and JavaScript.
  • The public benchmark may encourage development of lighter-weight models that still retain the explanation feature.
  • If the sequences prove stable across package versions, the method could support continuous monitoring of supply-chain updates.

Load-bearing premise

Behavior sequences taken from static call graphs alone are enough to tell malicious packages from benign ones when metadata and dynamic traces cannot be used.

What would settle it

A production run on newly arriving PyPI or npm packages in which MOLOT either misses confirmed malicious samples or exceeds the false-positive rate tolerated by the moderation team.
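Such a check reduces to two rates computed over a labeled production sample. The sketch below assumes parallel lists of ground-truth labels and detector flags; the `FP_BUDGET` value is a placeholder, since the rate tolerated by the moderation team is not stated in this summary.

```python
# Placeholder budget; the real tolerance is set by the moderation team.
FP_BUDGET = 0.30


def triage_rates(labels, flags):
    """Return (miss_rate, false_positive_rate) from parallel boolean lists.

    labels[i] is True when package i is confirmed malicious;
    flags[i] is True when the detector flagged package i.
    """
    malicious = [flag for label, flag in zip(labels, flags) if label]
    benign = [flag for label, flag in zip(labels, flags) if not label]
    miss_rate = malicious.count(False) / len(malicious) if malicious else 0.0
    fp_rate = benign.count(True) / len(benign) if benign else 0.0
    return miss_rate, fp_rate


# Illustrative data only: two malicious and four benign packages.
labels = [True, True, False, False, False, False]
flags = [True, False, True, False, False, False]

miss_rate, fp_rate = triage_rates(labels, flags)
print(f"miss rate {miss_rate:.2f}, false-positive rate {fp_rate:.2f}")
print("within budget" if fp_rate <= FP_BUDGET else "over budget")
```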

Figures

Figures reproduced from arXiv: 2606.07792 by Aleksandr Khalikov, Daniil Lopatkin, Maksim Mitrofanov, Stanislav Rakovsky.

Figure 1: The MOLOT pipeline: call-graph extraction, traversal into activity chains, textual rendering, BERT classification, and SHAP-based explanation.
Figure 2: SHAP attribution before retraining: the substring...
Figure 3: SHAP attribution after retraining: behavioral tokens (...
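The middle stages named in the Figure 1 caption (traversal into activity chains, then textual rendering) can be sketched roughly as follows. The call graph, the `ACTIVITIES` vocabulary, the depth-first ordering, and the arrow-separated rendering are all assumptions for illustration; the paper's own traversal and rendering rules are not given in this summary, and the BERT and SHAP stages are omitted.

```python
# Hypothetical call graph: function name -> callees, in call order.
CALL_GRAPH = {
    "setup": ["read_env", "send"],
    "read_env": ["os.getenv"],
    "send": ["base64.b64encode", "urllib.request.urlopen"],
}

# Hypothetical activity vocabulary for library calls (the graph's leaves).
ACTIVITIES = {
    "os.getenv": "env.read",
    "base64.b64encode": "encoding.encode",
    "urllib.request.urlopen": "network.request",
}


def activity_chain(graph, entrypoint):
    """Depth-first traversal from entrypoint, collecting leaf activities in call order."""
    chain = []
    on_path = set()

    def visit(fn):
        if fn in ACTIVITIES:
            chain.append(ACTIVITIES[fn])
            return
        if fn in on_path:  # already on the current path: skip recursive cycles
            return
        on_path.add(fn)
        for callee in graph.get(fn, []):
            visit(callee)
        on_path.discard(fn)

    visit(entrypoint)
    return chain


def render(chain):
    """Render a chain as plain text that a tokenizer could consume."""
    return " -> ".join(chain)


print(render(activity_chain(CALL_GRAPH, "setup")))
```

The rendered text contains only activity labels, not function or file names, so a downstream classifier in this sketch sees behavior rather than identifiers.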
Original abstract

MOLOT (Malicious Operational Logic Observation Transformer) is a static malicious-code detection system designed for SAST setups where package metadata, maintainer history, and dynamic execution traces may be unavailable or unreliable. The system represents source code as behavior sequences derived from static call graphs and includes an explanation stage that ranks suspicious behavior activities and maps them back to source-code locations. The approach is evaluated on Python and JavaScript packages from PyPI and npm, compared with open-source detection tools, and validated under product constraints including runtime, memory use, and false-positive rates observed in a real moderation workflow. We also release Open Malicious-Code Bench, a public benchmark for reproducible evaluation of malicious-package detection methods. The results show that static behavior-sequence modeling can provide accurate, explainable, and deployable malicious-code detection for modern DevSecOps workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. MOLOT is a static malicious-code detection system for SAST scenarios where package metadata, maintainer history, and dynamic traces are unavailable. It represents code as behavior sequences from static call graphs, includes an explanation stage that ranks suspicious activities and maps them to source locations, evaluates on Python/JS packages from PyPI and npm against open-source tools, validates under product constraints (runtime, memory, false-positive rates in a real moderation workflow), and releases the Open Malicious-Code Bench benchmark. The central claim is that static behavior-sequence modeling yields accurate, explainable, and deployable detection for modern DevSecOps.

Significance. If the evaluation results hold, the work supplies a practical static-analysis method for malicious-package detection when dynamic execution or metadata are unreliable, together with a public benchmark that enables reproducible comparison. The release of Open Malicious-Code Bench is a concrete contribution to the field.

major comments (1)
  1. [call-graph construction] § on call-graph construction: the central claim requires that behavior sequences extracted solely from static call graphs suffice for accurate detection when metadata and dynamic traces are unavailable. In Python/JS, common malicious patterns (runtime eval/exec, __import__ indirection, string-based dispatch, or minified/obfuscated call sites) produce incomplete or identical static graphs for malicious and benign code. If the Open Malicious-Code Bench or the reported experiments do not contain a representative fraction of such cases, or if the sequence extraction collapses these patterns, the accuracy and deployability results do not generalize to the stated threat model.
minor comments (1)
  1. [Abstract] Abstract: asserts evaluation results and deployability but supplies no quantitative metrics, baselines, dataset sizes, or methodology details; these should be summarized even at the abstract level for a system card.
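The referee's major comment is easy to reproduce. The sketch below uses a deliberately naive extractor (an assumption for illustration, not MOLOT's extractor) that records only call targets written as plain identifiers or attribute accesses. A direct `os.system` call is visible to it, while the same call routed through `__import__` and `getattr` with concatenated strings is not.

```python
import ast

# Same behavior, two spellings: a direct call and an indirected one.
DIRECT = "import os\nos.system('id')\n"
INDIRECT = "mod = __import__('o' + 's')\ngetattr(mod, 'sys' + 'tem')('id')\n"


def static_call_names(source):
    """Names of call targets that are plain identifiers or attribute accesses."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
            # Calls whose target is itself an expression, e.g. getattr(...)(...),
            # have no static name and are skipped here.
    return names


print(static_call_names(DIRECT))
print(static_call_names(INDIRECT))
```

In the indirect version the extractor sees only `__import__` and `getattr`, which also appear in plenty of benign code, so the resulting sequence carries little signal about the process spawn.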

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment point-by-point below.

Point-by-point responses
  1. Referee: [call-graph construction] § on call-graph construction: the central claim requires that behavior sequences extracted solely from static call graphs suffice for accurate detection when metadata and dynamic traces are unavailable. In Python/JS, common malicious patterns (runtime eval/exec, __import__ indirection, string-based dispatch, or minified/obfuscated call sites) produce incomplete or identical static graphs for malicious and benign code. If the Open Malicious-Code Bench or the reported experiments do not contain a representative fraction of such cases, or if the sequence extraction collapses these patterns, the accuracy and deployability results do not generalize to the stated threat model.

    Authors: We agree that this is a central validity concern for any static-analysis claim. The Open Malicious-Code Bench was deliberately seeded with malicious packages exhibiting runtime eval/exec, __import__ indirection, string-based dispatch, and minification/obfuscation drawn from real PyPI and npm incidents; the static extractor incorporates name-resolution and limited constant-propagation heuristics to recover indirect targets where possible. Nevertheless, we acknowledge that highly adversarial obfuscation can still produce incomplete or colliding graphs. We will revise the manuscript to (1) report detection metrics stratified by obfuscation level within the benchmark, (2) add an explicit limitations subsection quantifying the fraction of samples where static graphs become indistinguishable, and (3) clarify the precise threat model under which the reported accuracy and false-positive rates are claimed to hold. These changes will be reflected in the next version.

    Revision: yes
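As an illustration of what a "limited constant-propagation heuristic" might look like, the sketch below folds string literals joined with `+` and uses the result to recover the target of a `getattr(__import__(...), ...)(...)` call. This is one assumed pattern chosen for the example; the rebuttal does not specify which patterns the actual extractor handles, and the helper names are hypothetical.

```python
import ast


def fold_str(node):
    """Fold a string literal or a '+' concatenation of literals; None if not constant."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = fold_str(node.left), fold_str(node.right)
        if left is not None and right is not None:
            return left + right
    return None


def resolve_indirect_calls(source):
    """Recover 'module.attr' targets from getattr(__import__(<const>), <const>)(...)."""
    targets = []
    for node in ast.walk(ast.parse(source)):
        # Look for a call whose target is itself a call: <inner>(...)(...)
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Call)):
            continue
        inner = node.func
        if not (isinstance(inner.func, ast.Name)
                and inner.func.id == "getattr"
                and len(inner.args) >= 2):
            continue
        obj, attr = inner.args[0], fold_str(inner.args[1])
        if (isinstance(obj, ast.Call)
                and isinstance(obj.func, ast.Name)
                and obj.func.id == "__import__"
                and obj.args):
            module = fold_str(obj.args[0])
            if module is not None and attr is not None:
                targets.append(f"{module}.{attr}")
    return targets


OBFUSCATED = "getattr(__import__('o' + 's'), 'sys' + 'tem')('id')\n"
RUNTIME_ONLY = "getattr(__import__(name), attr)('id')\n"

print(resolve_indirect_calls(OBFUSCATED))
print(resolve_indirect_calls(RUNTIME_ONLY))
```

The second input shows the limit the authors concede: when the module or attribute name is only known at run time, folding has nothing to work with and the target stays unresolved.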

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes a static detection system using behavior sequences from call graphs, with evaluation on external PyPI/npm packages and comparison to open-source tools. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations are present in the abstract or description. Claims rest on external benchmarks and product constraints rather than internal reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete equations, model architecture details, or training procedures, so no specific free parameters, axioms, or invented entities can be identified.



Appendix A: Leakage of file identifiers in early activity chains

Symptom. In early versions of the pipeline, activity chains contained the entrypoint identifier, specifically,...