LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.
Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs
5 Pith papers cite this work. Polarity classification is still indexing.
abstract
Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.
PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73 vulnerable plus 288 benign contracts.
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
citing papers explorer
-
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.
-
NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.
-
LLM-Powered Detection of Price Manipulation in DeFi
PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73 vulnerable plus 288 benign contracts.
-
A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
-
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.