MOLOT System Card: Malicious Operational Logic Observation Transformer

Aleksandr Khalikov; Daniil Lopatkin; Maksim Mitrofanov; Stanislav Rakovsky

arXiv: 2606.07792 · v1 · pith:XPKUYQC4new · submitted 2026-06-05 · cs.CR · cs.LG · cs.SE

Pith reviewed 2026-06-27 21:37 UTC · model grok-4.3

keywords: malicious code detection, static analysis, call graphs, behavior sequences, explainable detection, PyPI, npm, DevSecOps

The pith

MOLOT detects malicious packages by modeling behavior sequences from static call graphs and mapping suspicious activities back to source locations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MOLOT as a static analysis system for identifying malicious code in software packages when metadata, maintainer history, and runtime traces are unavailable. It extracts behavior sequences from call graphs, feeds them to a transformer for classification, and adds an explanation stage that ranks suspicious behaviors and ties them to specific code sites. Evaluation covers Python and JavaScript packages from PyPI and npm, with comparisons to existing tools and checks against real moderation constraints on speed, memory, and false positives. The work also releases the Open Malicious-Code Bench for public testing. The central result is that static sequence modeling alone can yield accurate, explainable detection suitable for DevSecOps pipelines.

Core claim

MOLOT represents source code as behavior sequences derived from static call graphs, classifies them via a transformer to separate malicious from benign packages, and supplies explanations by ranking suspicious behavior activities and mapping them to concrete source-code locations. On PyPI and npm data the system meets accuracy, runtime, memory, and false-positive targets observed in production moderation workflows.

What carries the argument

Behavior sequences extracted from static call graphs, processed by the Malicious Operational Logic Observation Transformer, together with a ranking-based explanation stage that links flagged activities back to source locations.
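
To make the idea of a behavior sequence concrete, here is a minimal sketch for Python source using only the standard `ast` module. It is an editorial illustration, not the paper's extractor: the `ACTIVITY_MAP` table, the activity labels, and the function names are all assumptions, and a real call-graph builder would also resolve imports, aliases, and cross-file calls.

```python
import ast

# Hypothetical mapping from dotted call names to coarse activity labels.
# The labels and the table itself are assumptions, not taken from the paper.
ACTIVITY_MAP = {
    "os.system": "process.spawn",
    "subprocess.run": "process.spawn",
    "urllib.request.urlopen": "network.request",
    "base64.b64decode": "encoding.decode",
    "open": "file.access",
}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None if not statically nameable."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def behavior_sequence(source):
    """List (activity, line) pairs for recognised calls, in source order."""
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    calls.sort(key=lambda n: (n.lineno, n.col_offset))
    sequence = []
    for call in calls:
        name = dotted_name(call.func)
        if name in ACTIVITY_MAP:
            sequence.append((ACTIVITY_MAP[name], call.lineno))
    return sequence


SAMPLE = """import base64, os
payload = base64.b64decode("ZWNobyBoaQ==")
os.system(payload.decode())
"""

print(behavior_sequence(SAMPLE))
# [('encoding.decode', 2), ('process.spawn', 3)]
```

Keeping the line number next to each activity is what later allows a flagged activity to be tied back to a source location.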

If this is right

Detection becomes possible inside SAST tools that lack package metadata or execution traces.
Explanations can directly support human review by pointing to the exact code locations driving the flag.
The released Open Malicious-Code Bench supplies a shared test set for comparing future static detectors.
Performance under measured runtime and memory limits allows direct insertion into existing DevSecOps moderation queues.
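
The review-support point can be sketched as a simple ranking step. The example below assumes each activity already carries a numeric suspicion score (standing in for whatever attribution the explanation stage produces) plus a file path and line number; the `Activity` type, its field names, and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str     # activity label, e.g. "process.spawn"
    path: str     # file in which the call site was found
    line: int     # 1-based line number of the call site
    score: float  # hypothetical attribution score; higher means more suspicious


def top_findings(activities, k=3):
    """Rank activities by score and render the top k as 'path:line name (score=...)'."""
    ranked = sorted(activities, key=lambda a: a.score, reverse=True)
    return [f"{a.path}:{a.line} {a.name} (score={a.score:.2f})" for a in ranked[:k]]


FOUND = [
    Activity("file.access", "pkg/util.py", 12, 0.08),
    Activity("process.spawn", "setup.py", 31, 0.91),
    Activity("encoding.decode", "setup.py", 30, 0.64),
]

for finding in top_findings(FOUND, k=2):
    print(finding)
# setup.py:31 process.spawn (score=0.91)
# setup.py:30 encoding.decode (score=0.64)
```

A reviewer reading this output can open the named file at the named line instead of searching the whole package.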

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same sequence representation could be combined with dynamic traces when those become available, potentially raising detection rates further.
Extending the call-graph extraction to additional languages would test whether the approach generalizes beyond Python and JavaScript.
The public benchmark may encourage development of lighter-weight models that still retain the explanation feature.
If the sequences prove stable across package versions, the method could support continuous monitoring of supply-chain updates.

Load-bearing premise

Behavior sequences taken from static call graphs alone are enough to tell malicious packages from benign ones when metadata and dynamic traces cannot be used.

What would settle it

A production run on newly arriving PyPI or npm packages in which MOLOT either misses confirmed malicious samples or exceeds the false-positive rate tolerated by the moderation team.
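
The test described above reduces to two numbers from a labeled production run. The sketch below computes them and applies the pass condition stated in the sentence; the `max_fpr` threshold and the sample data are assumptions, since the page does not give the moderation team's tolerated rate.

```python
def moderation_check(results, max_fpr):
    """Summarise a labeled run given (is_malicious, flagged) pairs.

    The run passes only if no confirmed malicious sample was missed and the
    false-positive rate stays at or below max_fpr (a hypothetical threshold).
    """
    tp = fp = tn = fn = 0
    for is_malicious, flagged in results:
        if is_malicious and flagged:
            tp += 1
        elif is_malicious:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    miss_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        "fpr": fpr,
        "miss_rate": miss_rate,
        "missed": fn,
        "passes": fn == 0 and fpr <= max_fpr,
    }


# Made-up run: two malicious packages (one missed), four benign (one flagged).
RUN = [
    (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]

print(moderation_check(RUN, max_fpr=0.30))
# {'fpr': 0.25, 'miss_rate': 0.5, 'missed': 1, 'passes': False}
```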

Figures

Figures reproduced from arXiv: 2606.07792 by Aleksandr Khalikov, Daniil Lopatkin, Maksim Mitrofanov, Stanislav Rakovsky.

**Figure 1.** The MOLOT pipeline: call-graph extraction, traversal into activity chains, textual rendering, BERT classification, and SHAP-based explanation.

**Figure 2.** SHAP attribution before retraining: the substring ...

**Figure 3.** SHAP attribution after retraining: behavioral tokens ( ...
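
The traversal and textual-rendering steps named in the Figure 1 caption can be sketched as a depth-first walk over a call graph stored as an adjacency mapping. This is a minimal illustration under stated assumptions: the graph contents, the `max_depth` cutoff, the cycle-skipping rule, and the ` -> ` separator are all hypothetical, and the paper's actual traversal and rendering format are not given on this page.

```python
def activity_chains(graph, entry, max_depth=8):
    """Enumerate call paths from entry by depth-first traversal.

    graph maps a caller to the list of its callees. A callee already on the
    current path is skipped, so recursive call cycles terminate.
    """
    chains = []

    def walk(node, path):
        path = path + [node]
        callees = [c for c in graph.get(node, []) if c not in path]
        if not callees or len(path) >= max_depth:
            chains.append(path)
            return
        for callee in callees:
            walk(callee, path)

    walk(entry, [])
    return chains


def render(chain):
    """Render one chain as a single line of text for a sequence classifier."""
    return " -> ".join(chain)


# Hypothetical call graph; 'fetch' calls back into 'install_hook' (a cycle).
GRAPH = {
    "entry": ["install_hook"],
    "install_hook": ["base64.b64decode", "fetch"],
    "fetch": ["urllib.request.urlopen", "install_hook"],
}

for chain in activity_chains(GRAPH, "entry"):
    print(render(chain))
# entry -> install_hook -> base64.b64decode
# entry -> install_hook -> fetch -> urllib.request.urlopen
```

Each rendered line is plain text, which is the form a BERT-style tokenizer expects as input.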

Original abstract

MOLOT (Malicious Operational Logic Observation Transformer) is a static malicious-code detection system designed for SAST setups where package metadata, maintainer history, and dynamic execution traces may be unavailable or unreliable. The system represents source code as behavior sequences derived from static call graphs and includes an explanation stage that ranks suspicious behavior activities and maps them back to source-code locations. The approach is evaluated on Python and JavaScript packages from PyPI and npm, compared with open-source detection tools, and validated under product constraints including runtime, memory use, and false-positive rates observed in a real moderation workflow. We also release Open Malicious-Code Bench, a public benchmark for reproducible evaluation of malicious-package detection methods. The results show that static behavior-sequence modeling can provide accurate, explainable, and deployable malicious-code detection for modern DevSecOps workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MOLOT adds a transformer on static call-graph sequences plus a new benchmark for malicious package detection, but the abstract supplies no metrics and the static-only method looks vulnerable to common obfuscation patterns.

Hey,

The core of this paper is a system called MOLOT that turns static call graphs into behavior sequences, runs them through a transformer, and adds an explanation stage that ranks suspicious activities and points back to source locations. They also release Open Malicious-Code Bench for Python and JavaScript packages.

What stands out as useful is the focus on static detection when metadata and dynamic traces are off the table, which matches a real constraint in supply-chain checks. The explanation component is a practical addition for moderation workflows, and putting out the benchmark is the clearest contribution—reproducible datasets help the area move forward.

The soft spots are straightforward. The abstract claims accurate, explainable, and deployable results but gives no accuracy numbers, baselines, dataset sizes, or runtime details, so the performance claims cannot be checked. The stress-test point about static graphs missing dynamic resolution or obfuscation holds: patterns like eval, string-based dispatch, or minified call sites often produce incomplete or identical graphs for malicious and benign code. If the benchmark and experiments do not cover a representative share of those cases, the deployability argument weakens.

This paper is aimed at applied security engineers and tool builders who need static signals for PyPI and npm moderation. Someone looking for a new benchmark or sequence-modeling ideas in security could extract value from the released data.

I would send it for peer review because the benchmark is new and the operational setting is relevant, even though the current version needs concrete evaluation numbers and targeted tests against obfuscation to hold up.

Referee Report

1 major / 1 minor

Summary. MOLOT is a static malicious-code detection system for SAST scenarios where package metadata, maintainer history, and dynamic traces are unavailable. It represents code as behavior sequences from static call graphs, includes an explanation stage that ranks suspicious activities and maps them to source locations, evaluates on Python/JS packages from PyPI and npm against open-source tools, validates under product constraints (runtime, memory, false-positive rates in a real moderation workflow), and releases the Open Malicious-Code Bench benchmark. The central claim is that static behavior-sequence modeling yields accurate, explainable, and deployable detection for modern DevSecOps.

Significance. If the evaluation results hold, the work supplies a practical static-analysis method for malicious-package detection when dynamic execution or metadata are unreliable, together with a public benchmark that enables reproducible comparison. The release of Open Malicious-Code Bench is a concrete contribution to the field.

major comments (1)

[Call-graph construction] The central claim requires that behavior sequences extracted solely from static call graphs suffice for accurate detection when metadata and dynamic traces are unavailable. In Python/JS, common malicious patterns (runtime eval/exec, __import__ indirection, string-based dispatch, or minified/obfuscated call sites) produce incomplete or identical static graphs for malicious and benign code. If the Open Malicious-Code Bench or the reported experiments do not contain a representative fraction of such cases, or if the sequence extraction collapses these patterns, the accuracy and deployability results do not generalize to the stated threat model.
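
The "identical static graphs" concern can be shown in a few lines. The sketch below parses two snippets with the standard `ast` module and lists the names a naive static pass would see at each call site; nothing is executed. The snippets and the `static_call_names` helper are illustrative assumptions, not samples from the benchmark.

```python
import ast


def static_call_names(source):
    """Return the sorted call-site names visible without running the code."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.append(func.id)
            elif isinstance(func, ast.Attribute):
                names.append(func.attr)
            else:
                names.append("<dynamic>")
    return sorted(names)


# String-based dispatch: the real target is assembled at runtime.
BENIGN = 'import math\nfn = getattr(math, "sq" + "rt")\nfn(4)\n'
MALICIOUS = 'import os\nfn = getattr(os, "sys" + "tem")\nfn("id")\n'

print(static_call_names(BENIGN))     # ['fn', 'getattr']
print(static_call_names(MALICIOUS))  # ['fn', 'getattr']
```

Both snippets yield the same call names, so a sequence built only from these names cannot separate `math.sqrt` from `os.system` unless the extractor also resolves the string arguments.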

minor comments (1)

[Abstract] The abstract asserts evaluation results and deployability but supplies no quantitative metrics, baselines, dataset sizes, or methodology details; these should be summarized even at the abstract level for a system card.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment point-by-point below.

Referee: [Call-graph construction] The central claim requires that behavior sequences extracted solely from static call graphs suffice for accurate detection when metadata and dynamic traces are unavailable. In Python/JS, common malicious patterns (runtime eval/exec, __import__ indirection, string-based dispatch, or minified/obfuscated call sites) produce incomplete or identical static graphs for malicious and benign code. If the Open Malicious-Code Bench or the reported experiments do not contain a representative fraction of such cases, or if the sequence extraction collapses these patterns, the accuracy and deployability results do not generalize to the stated threat model.

Authors: We agree that this is a central validity concern for any static-analysis claim. The Open Malicious-Code Bench was deliberately seeded with malicious packages exhibiting runtime eval/exec, __import__ indirection, string-based dispatch, and minification/obfuscation drawn from real PyPI and npm incidents; the static extractor incorporates name-resolution and limited constant-propagation heuristics to recover indirect targets where possible. Nevertheless, we acknowledge that highly adversarial obfuscation can still produce incomplete or colliding graphs. We will revise the manuscript to (1) report detection metrics stratified by obfuscation level within the benchmark, (2) add an explicit limitations subsection quantifying the fraction of samples where static graphs become indistinguishable, and (3) clarify the precise threat model under which the reported accuracy and false-positive rates are claimed to hold. These changes will be reflected in the next version.

Revision: yes
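
A limited constant-propagation heuristic of the kind the simulated rebuttal mentions might look like the sketch below. It folds `+` concatenations of string literals inside `__import__` and `getattr` calls and reports `None` when the target depends on runtime data. The function names and the scope of the folding are assumptions for illustration; this is not the extractor described in the paper.

```python
import ast


def fold_str(node):
    """Return the value of a string literal or a '+' chain of literals, else None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = fold_str(node.left), fold_str(node.right)
        if left is not None and right is not None:
            return left + right
    return None


def resolved_dynamic_targets(source):
    """List (line, target) for __import__/getattr calls; target is None if unresolved."""
    targets = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        if node.func.id == "__import__" and node.args:
            targets.append((node.lineno, fold_str(node.args[0])))
        elif node.func.id == "getattr" and len(node.args) >= 2:
            attr = fold_str(node.args[1])
            first = node.args[0]
            base = first.id if isinstance(first, ast.Name) else None
            targets.append((node.lineno, f"{base}.{attr}" if base and attr else None))
    return sorted(targets, key=lambda t: t[0])


SAMPLE = (
    'mod = __import__("o" + "s")\n'
    'run = getattr(mod, "sys" + "tem")\n'
    'name = input()\n'
    'other = getattr(mod, name)\n'
)

print(resolved_dynamic_targets(SAMPLE))
# [(1, 'os'), (2, 'mod.system'), (4, None)]
```

The last entry shows the residual gap the rebuttal concedes: once the attribute name comes from runtime input, folding constants recovers nothing.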

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes a static detection system using behavior sequences from call graphs, with evaluation on external PyPI/npm packages and comparison to open-source tools. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations are present in the abstract or description. Claims rest on external benchmarks and product constraints rather than internal reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete equations, model architecture details, or training procedures, so no specific free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5683 in / 976 out tokens · 22306 ms · 2026-06-27T21:37:17.218221+00:00 · methodology

