AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.
LLM-Enabled Open-Source Systems in the Wild: An Empirical Study of Vulnerabilities in GitHub Security Advisories
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Large language models (LLMs) are increasingly embedded in open-source software (OSS) ecosystems, creating complex interactions among natural language prompts, probabilistic model outputs, and execution-capable components. However, it remains unclear whether traditional vulnerability disclosure frameworks adequately capture these model-mediated risks. To investigate this, we analyze 295 GitHub Security Advisories published between January 2025 and January 2026 that reference LLM-related components, and we manually annotate a sample of 100 advisories using the OWASP Top 10 for LLM Applications 2025. We find no evidence of new implementation-level weakness classes specific to LLM systems. Most advisories map to established CWEs, particularly injection and deserialization weaknesses. At the same time, the OWASP-based analysis reveals recurring architectural risk patterns, especially Supply Chain, Excessive Agency, and Prompt Injection, which often co-occur across multiple stages of execution. These results suggest that existing advisory metadata captures code-level defects but underrepresents model-mediated exposure. We conclude that combining the CWE and OWASP perspectives provides a more complete and necessary view of vulnerabilities in LLM-integrated systems.
fields
cs.SE 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs
AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.