pith. machine review for the scientific record.

arxiv: 2605.10074 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.SE

Recognition: no theorem link

Agentic Fuzzing: Opportunities and Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:05 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords agentic fuzzing · LLM agents · logic bugs · fuzzing · JavaScript engines · root cause analysis · bug variants · V8

The pith

LLM agents acting as reasoning engines can find logic bug variants in mature codebases by analyzing known bugs and testing new scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces agentic fuzzing, a method that uses large language model agents to directly perform the reasoning needed to discover logic bugs. Rather than relying on traditional fuzzers that need execution feedback or static patterns, the agents start from historical bugs, determine their root causes, hypothesize new triggering scenarios in the code, and generate proof-of-concept tests to check them. This allows finding bugs whose execution paths or code structure differ significantly from the originals. The authors implement this in AFuzz, which avoids redundant work through scenario coverage and schedules seeds with a diversity-based method, leading to substantial bug findings in JavaScript engines.

Core claim

Given a reference bug, an LLM agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof-of-concept code. This process finds variants that differ completely in trigger path or code structure from the reference, enabling the discovery of logic bugs that traditional methods struggle with in mature codebases.
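The verification step — generating and running proof-of-concept code — can be sketched against a JavaScript engine shell. This is a minimal sketch, not AFuzz's actual harness: the `d8` binary, the `--allow-natives-syntax` flag, and the checked output strings are assumptions about a local V8 debug build.

```python
# Sketch of hypothesis verification: run a generated PoC in an engine
# shell and classify the outcome. Paths, flags, and signal strings are
# illustrative assumptions, not the paper's interface.
import subprocess

def classify(returncode: int, stderr: str) -> str:
    """Map one engine run's outcome to a coarse verdict."""
    if "DCHECK failure" in stderr or "Fatal error" in stderr:
        return "assertion"   # debug-check violation: likely a real logic bug
    if returncode < 0:
        return "crash"       # killed by a signal (e.g. SIGSEGV)
    if returncode != 0:
        return "error"       # uncaught JS exception, syntax error, ...
    return "clean"           # hypothesis not confirmed by this PoC

def verify_poc(poc_path: str, shell: str = "d8") -> str:
    """Run one PoC in the engine shell (shell path and flag are assumptions)."""
    proc = subprocess.run(
        [shell, "--allow-natives-syntax", poc_path],
        capture_output=True, text=True, timeout=30,
    )
    return classify(proc.returncode, proc.stderr)
```

A non-"clean" verdict is only a candidate: the reported DCHECK-failure issues in the reference list show that assertion hits still need triage before they count as distinct bugs.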

What carries the argument

The agentic fuzzing pipeline consisting of root-cause analysis, scenario hypothesis generation, proof-of-concept verification, and scenario coverage tracking to avoid redundant investigations.
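The pipeline above can be sketched as a loop. The stage callables here (`analyze`, `propose`, `make_poc`, `run`) are hypothetical stand-ins for the paper's LLM agents, and scenarios are simplified to hashable values so scenario coverage reduces to set membership.

```python
# A minimal sketch of the four-stage pipeline, with scenario coverage
# shared across seeds to avoid redundant investigations. The injected
# callables stand in for LLM-agent stages; they are not AFuzz's API.
def agentic_fuzz(seed_bugs, analyze, propose, make_poc, run,
                 covered=None, budget=10):
    """Root-cause analysis -> hypotheses -> PoC generation -> verification."""
    covered = set() if covered is None else covered  # scenario coverage
    findings = []
    for bug in seed_bugs:
        cause = analyze(bug)                     # stage 1: why did it happen?
        for scenario in propose(cause)[:budget]: # stage 2: where else could it?
            if scenario in covered:              # dedup: already investigated
                continue
            covered.add(scenario)
            poc = make_poc(scenario)             # stage 3: concrete test input
            if run(poc) != "clean":              # stage 4: execute and check
                findings.append((scenario, poc))
    return findings
```

Seeds with similar root causes tend to propose overlapping scenarios, which is exactly the redundancy the coverage set suppresses.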

If this is right

  • Traditional fuzzers and static analyzers often miss logic bugs that require multi-step reasoning without clear execution signals.
  • Variants of bugs can be found across different code structures and even different software implementations using the same reference seed.
  • Practical challenges in agentic fuzzing, including harness engineering, redundant investigations, and seed scheduling, can be addressed with a multi-stage pipeline, deduplication via scenario coverage, and a DPP-MAP scheduler.
  • Application to the V8 JavaScript engine over one month resulted in 40 bugs found, $35,000 in bounties, and two CVEs, with an additional 19 bugs discovered in SpiderMonkey and JavaScriptCore.
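The DPP-MAP scheduler mentioned above can be illustrated with naive greedy MAP inference for a determinantal point process: at each step, pick the seed that most increases the determinant of the kernel submatrix over the selected set, so near-duplicate seeds are penalized and diverse, high-quality ones float up. This is a sketch only — the paper cites a fast greedy algorithm, and how AFuzz builds its kernel from seeds is not specified here.

```python
# Naive greedy DPP-MAP selection over a similarity kernel L (a sketch;
# a production scheduler would use the fast greedy algorithm and a
# kernel derived from real seed features).
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def dpp_map_greedy(L, k):
    """Greedily pick up to k indices maximizing det(L[S][S]).

    Diagonal entries of L encode seed quality; off-diagonals encode
    similarity, so redundant seeds add little volume and are skipped."""
    selected = []
    for _ in range(k):
        base = det([[L[a][b] for b in selected] for a in selected]) if selected else 1.0
        best, best_gain = None, 0.0
        for i in range(len(L)):
            if i in selected:
                continue
            s = selected + [i]
            gain = det([[L[a][b] for b in s] for a in s]) / base
            if best is None or gain > best_gain:
                best, best_gain = i, gain
        if best is None or best_gain <= 0:
            break
        selected.append(best)
    return selected
```

With two near-identical seeds and one orthogonal seed of slightly lower quality, the greedy pass picks one of the duplicates and then the orthogonal seed, rather than both duplicates.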

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agentic fuzzing could be extended to other types of complex software systems where logic bugs are prevalent, such as database engines or network protocols.
  • Enhancements to LLM capabilities for more accurate hypothesis generation could further reduce invalid tests and improve overall efficiency.
  • Combining agentic fuzzing with conventional fuzzing techniques might provide a hybrid approach that leverages both reasoning and random exploration.
  • Success in transferring seeds from one engine to others indicates potential for automated identification of common vulnerabilities in related codebases.

Load-bearing premise

That LLM agents can consistently perform accurate multi-step root cause analysis and propose valid new bug scenarios that differ from the reference without generating excessive invalid hypotheses or overlooking actual variants.

What would settle it

Applying the AFuzz system to a different mature software project with a known set of undisclosed logic bugs, and observing whether it identifies a significant number of them through novel scenario hypotheses rather than mostly failing to generate testable or relevant cases.

Figures

Figures reproduced from arXiv: 2605.10074 by Insu Yun, Junyoung Park.

Figure 1. Motivating example: an integer truncation vulnera…
Figure 2. Reference vulnerability (CVE-2025-10892) in V8’s …
Figure 3. Overview of AFuzz, an Agentic Fuzzer.
Figure 4. Scenario coverage to avoid redundant investigations.
Figure 5. Minified PoC for bug #12. The loop update …
read the original abstract

Fuzzers and static analyzers find many bugs but struggle with logic bugs in mature codebases. Triggering such a bug often requires multi-step reasoning that produces no distinctive execution feedback, and variants can appear across implementations too different for a single pattern to match. Recent LLM-assisted approaches help, but they use LLMs as auxiliaries rather than as the reasoning engine. We propose agentic fuzzing, a bug-finding approach seeded by historical bugs in which deep agents perform the reasoning directly. Given a reference bug, the agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof-of-concept code. This lets the agent find variants that differ completely in trigger path or code structure from the reference. We identify three practical challenges in implementing agentic fuzzing: harness engineering, redundant investigations across seeds with similar root causes, and scheduling seeds in a large corpus. We address these in AFuzz through a four-stage agent pipeline, scenario coverage that deduplicates previously explored scenarios, and a DPP-MAP scheduler that orders seeds by diversity. We ran AFuzz on the V8 JavaScript engine for about one month, finding 40 bugs (including three duplicates), receiving a total $35,000 bounty, and being assigned two CVEs. AFuzz also found 19 bugs (including one duplicate) in SpiderMonkey and JavaScriptCore using the seeds from V8. However, agentic fuzzing is in its early stages with several remaining open problems we discuss in the paper. Still, we think it points to a promising direction for finding logic bugs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes agentic fuzzing, in which LLM agents analyze root causes of historical bugs, hypothesize structurally different triggering scenarios elsewhere in the codebase, and verify them by generating and executing PoCs. This is implemented in AFuzz via a four-stage agent pipeline, scenario coverage to deduplicate explored scenarios, and a DPP-MAP scheduler to order seeds by diversity. Evaluation on the V8 JavaScript engine over approximately one month yielded 40 bugs (including 3 duplicates), $35,000 in bounties, and 2 CVEs; the same seeds also produced 19 bugs (1 duplicate) in SpiderMonkey and JavaScriptCore. The work is framed as early-stage with open challenges remaining.

Significance. If the empirical results hold, the work is significant because it provides concrete, real-world evidence that LLM agents can surface logic bugs requiring multi-step reasoning in mature, high-value codebases where conventional fuzzers and static analyzers fall short. The validation via actual bounties and assigned CVEs is a notable strength, as is the explicit enumeration of practical challenges (harness engineering, redundancy, scheduling) together with targeted mitigations. The cross-engine transfer results further support the claim that the approach can generalize beyond a single implementation.

major comments (2)
  1. [Evaluation section (results reporting)] The central claim that agents 'perform the reasoning directly' and reliably produce valid new variants rests on the observed bug counts, yet the evaluation supplies only aggregate totals (40 bugs in V8, 19 in other engines) with no data on total hypotheses generated per seed, fraction of invalid or redundant hypotheses, success rate of root-cause analysis, or false-positive rates for PoC generation. This leaves the reliability of the multi-step agentic process unquantified and prevents assessment of whether the findings exceed what the seed corpus alone would yield.
  2. [Approach and Evaluation] No baseline comparisons to standard fuzzers (e.g., AFL, libFuzzer) or non-agentic LLM-assisted methods are provided, nor ablations isolating the contribution of the four-stage pipeline versus simpler prompting. Without these, it is difficult to attribute the discovered bugs specifically to the agentic root-cause and hypothesis steps rather than to the quality of the historical seed corpus.
minor comments (2)
  1. [Abstract] The statement that 'agentic fuzzing is in its early stages with several remaining open problems we discuss' would be more informative if the abstract briefly previewed the main open problems.
  2. [Throughout] Terminology: Terms such as 'deep agents' and 'DPP-MAP scheduler' appear without immediate definition or citation on first use, which reduces readability for readers outside the immediate sub-area.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment below, indicating revisions where we can strengthen the manuscript without misrepresenting our early-stage results.

read point-by-point responses
  1. Referee: Evaluation section (results reporting): The central claim that agents 'perform the reasoning directly' and reliably produce valid new variants rests on the observed bug counts, yet the evaluation supplies only aggregate totals (40 bugs in V8, 19 in other engines) with no data on total hypotheses generated per seed, fraction of invalid or redundant hypotheses, success rate of root-cause analysis, or false-positive rates for PoC generation. This leaves the reliability of the multi-step agentic process unquantified and prevents assessment of whether the findings exceed what the seed corpus alone would yield.

    Authors: We agree that granular metrics on the agent pipeline would better quantify reliability. In the revised manuscript we add a new subsection to the Evaluation section that reports aggregate statistics drawn from our experimental logs: total hypotheses generated, fraction leading to valid PoCs, and the role of scenario coverage in reducing redundant investigations. Root-cause analysis success is evidenced by the fact that only accurate analyses produced the reported bugs and bounties. We cannot, however, provide a direct quantitative comparison showing that the findings exceed what the seed corpus alone would yield, as that would require a separate non-agentic baseline experiment that was outside the scope of this study. revision: partial

  2. Referee: Approach and Evaluation: No baseline comparisons to standard fuzzers (e.g., AFL, libFuzzer) or non-agentic LLM-assisted methods are provided, nor ablations isolating the contribution of the four-stage pipeline versus simpler prompting. Without these, it is difficult to attribute the discovered bugs specifically to the agentic root-cause and hypothesis steps rather than to the quality of the historical seed corpus.

    Authors: We acknowledge that explicit baselines and ablations would aid attribution. Standard coverage-guided fuzzers target crash bugs rather than the multi-step logic bugs that require root-cause reasoning, so direct quantitative comparison is not straightforward; we have expanded the Related Work section with a qualitative contrast to prior LLM-assisted fuzzing that treats LLMs as auxiliaries. We have also added a limited ablation analysis based on our logs that isolates the contribution of the root-cause and hypothesis-generation stages. The cross-engine transfer results and real-world bounty/CVE validation provide supporting evidence that the agentic steps add value beyond the seeds alone, though we agree a fuller set of controlled experiments would be desirable in follow-on work. revision: partial

standing simulated objections not resolved
  • Direct quantitative baseline experiment comparing agentic fuzzing results to the historical seed corpus used without any LLM reasoning or pipeline, as no such non-agentic run was performed.

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with no derivations or self-referential claims

full rationale

The paper proposes agentic fuzzing as an empirical bug-finding method, describes a four-stage pipeline and DPP-MAP scheduler to address practical challenges, and supports its claims solely through reported experimental outcomes (40 bugs found in V8 with $35k bounties and 2 CVEs; 19 bugs in other engines). No equations, fitted parameters, mathematical predictions, or first-principles derivations exist. No self-citations appear as load-bearing premises, and no steps reduce by construction to inputs (e.g., no renaming of known patterns or ansatzes smuggled via prior work). The evaluation is self-contained against external benchmarks of bug discovery and bounties, with no circular reduction possible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical methodology and tool implementation.

pith-pipeline@v0.9.0 · 5592 in / 1174 out tokens · 68901 ms · 2026-05-12T04:05:57.173522+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 1 internal anchor

  1. [1]

    2017. Proceedings of the 38th IEEE Symposium on Security and Privacy (Oakland). San Jose, CA

  2. [2]

    2019. Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA

  3. [3]

    2020. Proceedings of the 29th USENIX Security Symposium (Security). Boston, MA

  4. [4]

    2022. Proceedings of the 29th ACM Conference on Computer and Communications Security (CCS). Los Angeles, CA

  5. [5]

    2023. Proceedings of the 32nd USENIX Security Symposium (Security). Anaheim, CA

  6. [6]

    2024. Proceedings of the 31st ACM Conference on Computer and Communications Security (CCS). Salt Lake City, UT

  7. [7]

    2024. Proceedings of the 33rd USENIX Security Symposium (Security). Philadelphia, PA

  8. [8]

    2024. Proceedings of the 46th International Conference on Software Engineering (ICSE). Lisbon, Portugal

  9. [9]

    2025. Proceedings of the 32nd Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA

  10. [10]

    2025. Proceedings of the 34th USENIX Security Symposium (Security). Seattle, WA

  11. [11]

    24...@project.gserviceaccount.com. 2025. DCHECK failure in (builder_->current_block()) == nullptr in maglev-graph-builder.cc. https://issues.chromium.org/issues/439945236. Accessed: 2026-04-23

  12. [12]

    24...@project.gserviceaccount.com. 2025. DCHECK failure in (builder_->current_block()) == nullptr in maglev-graph-builder.cc. https://issues.chromium.org/issues/440145531. Accessed: 2026-04-23

  13. [13]

    24...@project.gserviceaccount.com. 2025. DCHECK failure in (current_block()) == nullptr in maglev-graph-builder.cc. https://issues.chromium.org/issues/439752700. Accessed: 2026-04-23

  14. [14]

    24...@project.gserviceaccount.com. 2025. DCHECK failure in (current_block()) == nullptr in maglev-graph-builder.cc. https://issues.chromium.org/issues/439970326. Accessed: 2026-04-23

  15. [15]

    24...@project.gserviceaccount.com. 2025. DCHECK failure in new_nodes_at_end_.empty() in maglev-reducer.h. https://issues.chromium.org/issues/439752712. Accessed: 2026-04-23

  16. [16]

    Anthropic. 2026. Agent SDK overview. https://platform.claude.com/docs/en/agent-sdk/overview. Accessed: 2026-04-23

  17. [17]

    Anthropic. 2026. Claude Code by Anthropic | AI Coding Agent, Terminal, IDE. https://claude.com/product/claude-code. Accessed: 2026-04-23

  18. [18]

    Anthropic. 2026. Create custom subagents. https://code.claude.com/docs/en/sub-agents. Accessed: 2026-04-23

  19. [19]

    Anthropic. 2026. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-04-23

  20. [20]

    Anthropic. 2026. Todo Lists. https://code.claude.com/docs/en/agent-sdk/todo-tracking. Accessed: 2026-04-23

  21. [21]

    Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars, See [2]

  22. [22]

    Lukas Bernhard, Tobias Scharnowski, Moritz Schloegel, Tim Blazytko, and Thorsten Holz. 2022. JIT-Picking: Differential Fuzzing of JavaScript Engines, See [4]

  23. [23]

    Kritik Bhattarai. 2026. V8 Sandbox bypass via untagged ExternalIntPtr in AccessBuilder::ForExternalIntPtr (Chrome 145.0.7632.159). https://issues.chromium.org/issues/491749534. Accessed: 2026-04-23

  24. [24]

    Big Sleep. 2026. Big Sleep Tracker - Issue Tracker. https://issuetracker.google.com/savedsearches/7155917. Accessed: 2026-04-23

  25. [25]

    Big Sleep team. 2024. From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html. Accessed: 2026-04-23

  26. [26]

    Konstantin Borimechkov. 2025. Stop the Bleed: The Developer’s Guide to Taming Claude Code. https://theexcitedengineer.substack.com/p/stop-the-bleed-the-developers-guide. Accessed: 2026-04-23

  27. [27]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  28. [28]

    Chuyang Chen, Brendan Dolan-Gavitt, and Zhiqiang Lin. 2025. ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space, See [10]

  29. [29]

    Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS). Montreal, Canada

  30. [30]

    Isaac David and Arthur Gervais. 2025. Multi-Agent Penetration Testing AI for the Web. arXiv preprint arXiv:2508.20816 (2025)

  31. [31]

    Oege de Moor. 2024. Introducing XBOW. https://xbow.com/blog/introducing-xbow. Accessed: 2026-04-23

  32. [32]

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. 2024. PentestGPT: Evaluating and harnessing large language models for automated penetration testing, See [7]

  33. [33]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA). Seattle, WA

  34. [34]

    Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries, See [8]

  35. [35]

    Sung Ta Dinh, Haehyun Cho, Kyle Martin, Adam Oest, Kyle Zeng, Alexandros Kapravelos, Gail-Joon Ahn, Tiffany Bao, Ruoyu Wang, Adam Doupé, and Yan Shoshitaishvili. 2021. Favocado: Fuzzing the Binding Code of JavaScript Engines Using Semantically Correct Test Cases. In Proceedings of the 28th Annual Network and Distributed System Security Symposium (NDSS). Virtual

  36. [36]

    Jueon Eom, Seyeon Jeong, and Taekyoung Kwon. 2024. Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation. In Proceedings of the 33rd International Symposium on Software Testing and Analysis (ISSTA). Vienna, Austria

  37. [37]

    Siyue Feng, Yueming Wu, Wenjie Xue, Sikui Pan, Deqing Zou, Yang Liu, and Hai Jin. 2024. FIRE: Combining Multi-Stage Filtering with Taint Analysis for Scalable Recurring Vulnerability Detection, See [7]

  38. [38]

    Peter Girnus. 2026. Introducing ÆSIR: Finding Zero-Day Vulnerabilities at the Speed of AI. https://www.trendmicro.com/en_us/research/26/a/aesir.html. Accessed: 2026-04-23

  39. [39]

    GitHub. [n. d.]. CodeQL. https://codeql.github.com/. Accessed: 2026-04-23

  40. [40]

    Google. [n. d.]. Build, debug & deploy with AI | Gemini CLI. https://geminicli. com/. Accessed: 2026-04-23

  41. [41]

    Google. [n. d.]. Chromium Issue Tracker. https://issues.chromium.org/. Accessed: 2026-04-23

  42. [42]

    Google. [n. d.]. ClusterFuzz. https://github.com/google/clusterfuzz. Accessed: 2026-04-23

  43. [43]

    Google Big Sleep. 2025. V8: Integer truncation during Maglev compilation leading to memory corruption. https://issues.chromium.org/issues/444048019. Accessed: 2026-04-23

  44. [44]

    Google DeepMind. 2026. Gemma 4 — Google DeepMind. https://deepmind.google/models/gemma/gemma-4. Accessed: 2026-04-23

  45. [45]

    Samuel Groß, Simon Koch, Lukas Bernhard, Thorsten Holz, and Martin Johns. FUZZILLI: Fuzzing for JavaScript JIT Compiler Vulnerabilities, See [46]

  46. [46]

    2023. Proceedings of the 30th Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA

  47. [47]

    Samuel Groß. 2024. The V8 Sandbox. https://v8.dev/blog/sandbox. Accessed: 2026-04-23

  48. [48]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 8081 (2025), 633–638

  49. [49]

    HyungSeok Han, DongHyeon Oh, and Sang Kil Cha. 2019. CodeAlchemist: Semantics-aware Code Generation to Find Vulnerabilities in JavaScript Engines, See [2]

  50. [50]

    Insu Han and Jennifer Gillenwater. 2020. MAP inference for customized determinantal point processes via maximum inner product search. In International Conference on Artificial Intelligence and Statistics. PMLR, 2797–2807

  51. [51]

    Andreas Happe and Jürgen Cito. 2023. Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). San Francisco, CA

  52. [52]

    Xiaoyu He, Xiaofei Xie, Yuekang Li, Jianwen Sun, Feng Li, Wei Zou, Yang Liu, Lei Yu, Jianhua Zhou, Wenchang Shi, and Wei Huo. 2021. SoFi: Reflection-Augmented Fuzzing for JavaScript Engines. In Proceedings of the 28th ACM Conference on Computer and Communications Security (CCS). Virtual

  53. [53]

    Richard D Hipp. 2026. SQLite Home Page. https://sqlite.org. Accessed: 2026-04-23

  54. [54]

    Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with Code Fragments. In Proceedings of the 21st USENIX Security Symposium (Security). Bellevue, WA

  55. [55]

    Kaifeng Huang, Chenhao Lu, Yiheng Cao, Bihuan Chen, and Xin Peng. 2024. VMud: Detecting Recurring Vulnerabilities with Multiple Fixing Functions via Function Selection and Semantic Equivalent Statement Matching, See [6]

  56. [56]

    Jiyong Jang, Abeer Agrawal, and David Brumley. 2012. ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions. In Proceedings of the 33rd IEEE Symposium on Security and Privacy (Oakland). San Francisco, CA

  57. [57]

    Wooseok Kang, Byoungho Son, and Kihong Heo. 2022. TRACER: Signature-based Static Analysis for Detecting Recurring Vulnerabilities, See [4]

  58. [58]

    Seulbae Kim, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. 2017. VUDDY: A Scalable Approach for Vulnerable Code Clone Discovery, See [1]

  59. [59]

    Kunlun Lab. 2021. Security: TianfuCup RCE bug Type confusion in LoadIC::ComputeHandler. https://issues.chromium.org/issues/40057622. Accessed: 2026-04-23

  60. [60]

    LangChain. [n. d.]. deepagents. https://github.com/langchain-ai/deepagents. Accessed: 2026-04-23

  61. [61]

    Stella Laurenzo. 2026. [MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates. https://github.com/anthropics/claude-code/issues/42796. Accessed: 2026-04-23

  62. [62]

    Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer, See [3]

  63. [63]

    Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models, See [10]

  64. [64]

    Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. In Proceedings of the 13th International Conference on Learning Representations (ICLR). Singapore

  65. [65]

    Jiayi Lin, Changhua Luo, Mingxue Zhang, Lanteng Lin, Penghui Li, and Chenxiong Qian. 2026. Fuzzing JavaScript Engines by Fusing JavaScript and WebAssembly. In Proceedings of the 48th International Conference on Software Engineering (ICSE). Rio de Janeiro, Brazil

  66. [66]

    LLVM Project. 2026. libFuzzer - a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html. Accessed: 2026-04-23

  67. [67]

    Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. 2024. Prompt Fuzzing for Fuzz Driver Generation, See [6]

  68. [68]

    Odile Macchi. 1975. The coincidence approach to stochastic point processes. Advances in Applied Probability 7, 1 (1975), 83–122

  69. [69]

    Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large Language Model guided Protocol Fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA

  70. [70]

    OpenAI. [n. d.]. Codex. https://openai.com/codex. Accessed: 2026-04-23

  71. [71]

    Junyoung Park. 2026. ANGLE: [REDACTED]. https://issues.chromium.org/issues/501476576. Accessed: 2026-04-23

  72. [72]

    Jihyeok Park, Seungmin An, Dongjun Youn, Gyeongwon Kim, and Sukyoung Ryu. 2021. JEST: N+1-Version Differential Testing of Both JavaScript Engines and Specification. In Proceedings of the 43rd International Conference on Software Engineering (ICSE). Madrid, Spain

  73. [73]

    SeRya. 2010. Issue 1374005: Percise rounding parsing octal and hexadecimal strings.... (Closed). https://codereview.chromium.org/1374005. Accessed: 2026-04-23

  74. [74]

    Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. 2025. PentestAgent: Incorporating LLM Agents to Automated Penetration Testing. In Proceedings of the 20th ACM Symposium on Information, Computer and Communications Security (ASIACCS). Ha Noi, Vietnam

  75. [75]

    tckwgd. 2026. [BUG] Compaction death spiral - 211 compactions consuming all tokens with zero progress. https://github.com/anthropics/claude-code/issues/24179. Accessed: 2026-04-23

  76. [76]

    The Chromium Authors. 2026. Chromium. https://www.chromium.org/Home. Accessed: 2026-04-23

  77. [77]

    Dat Tran and Douwe Kiela. 2026. Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. arXiv preprint arXiv:2604.02460 (2026)

  78. [78]

    Vivek Trivedy. 2026. The Anatomy of an Agent Harness. https://www.langchain.com/blog/the-anatomy-of-an-agent-harness. Accessed: 2026-04-23

  79. [79]

    V8 Project Authors. [n. d.]. What is V8? https://v8.dev. Accessed: 2026-04-23

  80. [80]

    Toon Verwaest, Leszek Swirski, Victor Gomes, Olivier Flückiger, Darius Mercadier, and Camillo Bruni. 2023. Maglev - V8’s Fastest Optimizing JIT. https://v8.dev/blog/maglev. Accessed: 2026-04-23

Showing first 80 references.