CornerCase: Automated Extremal Testing of Protocol Implementations using LLMs

George Varghese; Kuan Qian; Rathin Singha; Ryan Beckett; Siva Kesava Reddy Kakarla; Soheil Abbasloo; Srinath Saikrishnan; Todd Millstein; Tracy Zhao

arxiv: 2606.29124 · v1 · pith:D4JM64WMnew · submitted 2026-06-28 · 💻 cs.NI

Pith reviewed 2026-06-30 02:53 UTC · model grok-4.3

classification 💻 cs.NI

keywords extremal testing · protocol implementations · LLM constraint extraction · boundary behaviors · differential testing · network protocols · RFC analysis

The pith

CornerCase uses LLMs to extract validity constraints from protocol specs and generates tests at their boundaries to find implementation bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CornerCase is a testing approach that targets bugs near the edges of what network protocol specifications allow. It splits the work into two steps: first, large language models read specification documents section by section and list the explicit rules for valid inputs and outputs; second, test cases are created right at or just beyond the limits those rules set. Running the tests on different implementations of the same protocol and comparing results reveals inconsistencies that indicate bugs. The method was applied to implementations of HTTP, DNS, BGP, SMTP, and QUIC, identifying 42 anomalies, of which 26 were acknowledged as bugs and 18 were fixed.

Core claim

CornerCase decomposes test generation into LLM-based extraction of explicit validity constraints from protocol specifications in a structured section-by-section manner, followed by generation of extremal test cases at or near the boundary of each constraint; these tests are executed across multiple implementations with differential testing to identify inconsistencies that expose bugs missed by fuzzing and model-based testing.

What carries the argument

Two-stage process of LLM-driven structured constraint extraction from specifications combined with extremal test generation at constraint boundaries and differential testing across implementations.
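As a rough illustration of the second stage, the sketch below enumerates values at and just beyond the limits of a single numeric range constraint. The `RangeConstraint` record and `extremal_values` helper are hypothetical names introduced here, not the paper's implementation; the example constraint (a non-root DNS label is 1 to 63 octets long, per RFC 1035 §2.3.4) is chosen only because it is a well-known limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Hypothetical record for one extracted constraint: lo <= value <= hi."""
    section: str
    name: str
    lo: int
    hi: int


def extremal_values(c: RangeConstraint) -> dict[str, list[int]]:
    """Return values just inside and just outside each boundary."""
    candidates = {c.lo, c.lo + 1, c.hi - 1, c.hi}
    # Keep only candidates that actually satisfy the constraint.
    valid = sorted(v for v in candidates if c.lo <= v <= c.hi)
    # One step past each limit: these should be rejected.
    invalid = [c.lo - 1, c.hi + 1]
    return {"valid": valid, "invalid": invalid}


# RFC 1035 limits a DNS label to 63 octets; a non-root label has at least 1.
label_len = RangeConstraint(section="2.3.4", name="label length", lo=1, hi=63)
cases = extremal_values(label_len)
print(cases)
```

Each value would then be turned into a concrete query (for example, a name whose first label has that many octets) and sent to every implementation under test.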

If this is right

Boundary behaviors such as encoded null bytes in URLs or state-dependent message validity can be targeted systematically rather than left to chance in random testing.
Differential testing across implementations of the same protocol reliably surfaces bugs through observable inconsistencies.
The same decomposition can be repeated on additional protocols beyond the five evaluated here.
Many bugs previously unknown can be identified and reported, leading to fixes in production implementations.
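The first point above mentions encoded null bytes in URLs. A minimal sketch of how such inputs could be produced follows; `null_byte_paths` is a hypothetical helper and the choice of three insertion positions is an assumption made for illustration, not the paper's generation strategy.

```python
from urllib.parse import quote, unquote


def null_byte_paths(path: str) -> list[str]:
    """Insert a percent-encoded NUL after the leading slash, mid-path, and at the end."""
    nul = quote("\x00")  # percent-encodes the NUL character as "%00"
    positions = sorted({1, len(path) // 2, len(path)})
    return [path[:i] + nul + path[i:] for i in positions]


paths = null_byte_paths("/index.html")
for p in paths:
    # Show the raw request path and confirm it decodes to a string containing NUL.
    print(p, "\x00" in unquote(p))
```

Sending each variant to several servers and comparing status codes and redirect behavior is where the differential step would come in.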

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If LLM accuracy on constraint extraction improves, the approach could extend to longer or more ambiguous specifications.
Patterns in the extracted constraints might highlight recurring ambiguities in how protocols are specified.
Combining the boundary-focused tests with existing fuzzers could produce broader coverage of protocol edge cases.

Load-bearing premise

Large language models can accurately extract explicit validity constraints from protocol specifications without significant omissions or hallucinations.

What would settle it

A protocol specification where the LLM extraction step misses or misstates a key validity constraint, causing the generated tests to miss a known boundary bug or produce only false inconsistencies.

Figures

Figures reproduced from arXiv: 2606.29124 by George Varghese, Kuan Qian, Rathin Singha, Ryan Beckett, Siva Kesava Reddy Kakarla, Soheil Abbasloo, Srinath Saikrishnan, Todd Millstein, Tracy Zhao.

**Figure 1.** The Pipeline for Extremal Testing.

[…] allowing the user to specify the test format as a JSON file, the framework can be easily extended to new protocols and previously unanticipated features. For example, our HTTP tests improved when we moved from a fixed filesystem – where the model only generates URI queries – to a richer format that allows it to specify which files should and should not exist, along with the…

Original abstract

Many software bugs in network protocol implementations arise near specification boundaries, such as inputs just within or outside allowed ranges, or messages that are valid in isolation but invalid in a given state. From the SSL Heartbleed exploit to TCP Christmas Tree packets, boundary inputs have repeatedly exposed critical weaknesses, yet remain under-tested by existing techniques such as fuzzing and model-based testing. We present CornerCase, an automated extremal testing approach that systematically targets such boundary behaviors. Our key idea is to decompose test generation into two stages: first, large language models (LLMs) extract explicit validity constraints from protocol specifications (e.g., RFCs) in a structured, section-by-section manner; second, extremal test cases are generated at or near the boundary of each constraint. These tests are executed across multiple implementations, and differential testing identifies inconsistencies. We evaluate CornerCase on widely used implementations of HTTP, DNS, BGP, SMTP, and QUIC, uncovering many previously unknown bugs. For example, the HTTP server h2o enters a redirect loop when processing URLs containing encoded null bytes. Overall, we used CornerCase to identify and file 42 anomalies; to date 26 have been acknowledged as bugs and 18 fixed, with others under active investigation.
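The abstract's second kind of boundary, a message that is valid in isolation but invalid in a given state, can be illustrated with SMTP's mail transaction, where RFC 5321 requires MAIL, then RCPT, then DATA. The state table below is a deliberately simplified assumption (it ignores EHLO, RSET, NOOP, QUIT, and extensions), and `out_of_state_commands` is a hypothetical helper, not part of CornerCase.

```python
# Simplified model of an SMTP mail transaction.
# Assumption: only the MAIL, RCPT and DATA commands are modeled.
ACCEPTED = {
    "greeted": {"MAIL"},             # session open, no transaction started
    "mail_given": {"RCPT"},          # sender accepted, no recipient yet
    "rcpt_given": {"RCPT", "DATA"},  # at least one recipient accepted
}
TRANSACTION_COMMANDS = {"MAIL", "RCPT", "DATA"}


def out_of_state_commands(state: str) -> list[str]:
    """Commands that are well-formed on their own but out of order in this state."""
    return sorted(TRANSACTION_COMMANDS - ACCEPTED[state])


for state in ACCEPTED:
    print(state, out_of_state_commands(state))
```

A state-aware test would drive a server into one of these states and then send each listed command, expecting a sequencing error rather than acceptance.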

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CornerCase applies LLMs to extract RFC constraints for extremal protocol testing and reports 26 acknowledged bugs, but the extraction step has no accuracy audit so the bug attributions rest on an unverified link.

The main takeaway is that this work decomposes protocol test generation into LLM-driven constraint extraction from RFCs followed by boundary-case generation and differential execution. It ran on HTTP, DNS, BGP, SMTP, and QUIC implementations and surfaced 42 anomalies, 26 of which were acknowledged as bugs.

What stands out is the concrete outcome: real bugs filed and fixed, including the h2o redirect-loop case with encoded null bytes. That gives the empirical side some weight beyond just a method description.

The soft spot is exactly where the stress-test flagged it. The paper gives no quantitative check on how well the LLMs actually extract the constraints: no precision/recall against expert ground truth on held-out RFC sections, and no audit for hallucinations or omissions. Without that, it is hard to know whether the generated tests are truly hitting specification boundaries or whether some anomalies are just artifacts of bad constraint lists. The abstract and method sketch do not address this.

The rest of the pipeline (extremal generation and differential testing) looks standard, so the novelty sits mainly in the LLM extraction stage applied to protocols. The citation pattern is not visible here, but the approach does not appear to rest on circular claims.

This is for researchers in automated network testing and anyone trying to use LLMs on formal specs. A reader already working on fuzzing or model-based testing will see the most direct value. It is worth sending to peer review because the bug reports provide an external signal that something useful happened, even if the extraction accuracy needs tighter evidence.

Referee Report

2 major / 1 minor

Summary. The paper presents CornerCase, an automated extremal testing framework that uses LLMs to extract explicit validity constraints from protocol specifications (e.g., RFCs) in a structured section-by-section manner, generates test cases at or near those constraint boundaries, and applies differential testing across multiple implementations of HTTP, DNS, BGP, SMTP, and QUIC to identify anomalies, reporting 42 anomalies of which 26 have been acknowledged as bugs.

Significance. If the LLM extraction step proves accurate, the work would provide a useful complement to fuzzing and model-based testing by systematically targeting boundary behaviors that have historically caused serious vulnerabilities; the external validation via 26 acknowledged bugs is a concrete strength that supports the empirical utility of the generated tests.

major comments (2)

§3 (Approach): The description of LLM-based constraint extraction provides no quantitative audit (e.g., precision/recall against expert-annotated ground truth on held-out RFC sections) of extraction accuracy, omissions, or hallucinations. This is load-bearing for the central claim because any systematic error in the extracted constraints directly invalidates the extremal tests and prevents confident attribution of differential anomalies to implementation bugs rather than test-construction artifacts.
§5 (Evaluation): The reported outcomes (42 anomalies, 26 acknowledged) give no breakdown of false-positive rates, no comparison of LLM-extracted constraints against the actual tests that triggered each anomaly, and no discussion of how differential testing distinguishes bugs from benign implementation differences or from LLM-induced invalid inputs.
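The audit requested in the first major comment amounts to a set comparison between LLM-extracted constraints and an expert-annotated list. A minimal sketch follows, assuming constraints can be matched as exact strings (a real audit would likely need looser matching); the `extraction_metrics` helper and the sample data are illustrative and not taken from the paper.

```python
def extraction_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Compare extracted constraints with expert annotations for one RFC section."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "hallucinated": len(extracted - ground_truth),  # extracted but not annotated
        "omitted": len(ground_truth - extracted),       # annotated but not extracted
    }


# Placeholder identifiers standing in for constraint sentences.
llm_output = {"c1", "c2", "c3"}
expert_list = {"c2", "c3", "c4", "c5"}
metrics = extraction_metrics(llm_output, expert_list)
print(metrics)
```

Reporting the hallucinated and omitted counts separately matters here, because a hallucinated constraint yields invalid tests while an omitted one yields missing tests.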

minor comments (1)

[Abstract] The abstract and method overview could more explicitly flag the current lack of extraction validation as a limitation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional validation would strengthen the manuscript. We address each major comment below and outline planned revisions to the approach and evaluation sections.

Referee: §3 (Approach): The description of LLM-based constraint extraction provides no quantitative audit (e.g., precision/recall against expert-annotated ground truth on held-out RFC sections) of extraction accuracy, omissions, or hallucinations. This is load-bearing for the central claim because any systematic error in the extracted constraints directly invalidates the extremal tests and prevents confident attribution of differential anomalies to implementation bugs rather than test-construction artifacts.

Authors: We agree that a quantitative audit of the LLM extraction step is important for rigorously validating the approach and for confident attribution of anomalies. The current manuscript relies on downstream developer acknowledgments (26 bugs) as indirect evidence of utility but does not include precision/recall metrics against expert ground truth. In the revised version we will add such an audit: we will manually annotate a held-out sample of RFC sections for a subset of the evaluated protocols, compute precision and recall for the extracted constraints, and report omission and hallucination rates. This will be presented in an expanded §3. revision: yes
Referee: §5 (Evaluation): The reported outcomes (42 anomalies, 26 acknowledged) give no breakdown of false-positive rates, no comparison of LLM-extracted constraints against the actual tests that triggered each anomaly, and no discussion of how differential testing distinguishes bugs from benign implementation differences or from LLM-induced invalid inputs.

Authors: We acknowledge that the evaluation section would benefit from greater transparency on these points. The manuscript currently emphasizes the aggregate counts and acknowledgments without the requested breakdowns or explicit discussion of differential-testing mechanics. In the revision we will expand §5 to include: (i) an analysis of potential false-positive anomalies and how they were filtered, (ii) concrete examples mapping specific extracted constraints to the test cases that triggered each reported anomaly, and (iii) a discussion of how running the same extremal inputs across multiple independent implementations helps separate implementation bugs from benign differences or from any LLM-induced invalid inputs. These additions will be supported by additional tables and case studies. revision: yes
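The mechanics described in point (iii) can be sketched as grouping implementations by their observable response to one test input. The implementation names and status codes below are placeholders, not results from the paper, and `group_by_output` and `flag_disagreement` are hypothetical helpers. Disagreement only marks a test for triage; it does not by itself say which side, if any, violates the RFC.

```python
from collections import defaultdict
from typing import Optional


def group_by_output(outputs: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct observed output to the implementations that produced it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for impl, observed in sorted(outputs.items()):
        groups[observed].append(impl)
    return dict(groups)


def flag_disagreement(outputs: dict[str, str]) -> Optional[dict[str, list[str]]]:
    """Return the groups when implementations disagree, otherwise None."""
    groups = group_by_output(outputs)
    return groups if len(groups) > 1 else None


# Placeholder observations for a single extremal test input.
observed = {"impl_a": "400", "impl_b": "400", "impl_c": "301"}
disagreement = flag_disagreement(observed)
print(disagreement)
```

Comparing raw outputs is the simplest choice; in practice responses would likely need normalizing (for example, ignoring date headers) before grouping.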

Circularity Check

0 steps flagged

No circularity; empirical method with external validation

The paper presents CornerCase as an empirical testing technique that uses LLMs to extract constraints section-by-section from protocol specs (RFCs) and then generates extremal tests for differential testing across implementations. No mathematical derivation chain, equations, predictions, or first-principles results are claimed. Results rest on reported bug findings (42 anomalies, 26 acknowledged) that are externally validated by third-party acknowledgments rather than internal fits or self-citations. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the described approach. The central assumption about LLM extraction accuracy is an empirical claim open to external audit, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the capability of current LLMs to parse specifications reliably and on the existence of detectable inconsistencies across implementations at boundaries. No free parameters or invented entities are described.

axioms (1)

domain assumption: LLMs can extract explicit validity constraints from protocol specifications (e.g., RFCs) accurately and completely in a structured manner.
Central to the first stage of the method as described in the abstract.

pith-pipeline@v0.9.1-grok · 5785 in / 1262 out tokens · 58783 ms · 2026-06-30T02:53:45.756575+00:00 · methodology

A Test Format

This appendix documents the input and output test formats used by each protocol-specific harness.

A.1 HTTP

Test input format: { "test_id": "<integer>", "constraint": "<exact constraint that is being tested from the given list of constraints>", "description": "<descr...

Use the test case format to infer what inputs can be controlled by the tests.

Scan the RFC chunk and find sentences that define constraints on those inputs. These include:
- syntax rules,
- allowed or disallowed values,
- length or size limits,
- character set restrictions,
- relationships between multiple inputs,
- ordering/state rules that can be represented as test inputs/state

Constraints are generally RFC statements that include MUST/MUST NOT/SHOULD/SHOULD NOT. But also look for sentences that describe a rule or constraint on inputs that can be tested with this framework.

Every constraint is written as a tuple: (<section_number>, <constraint>)

If the chunk has no relevant constraints, return [].

### Important:
- Return each constraint sentence *exactly as written* in the RFC (no edits).
- Only include sentences that can plausibly be tested using the described setup.

### Output format (for each chunk):
Return ONLY a JSON array like: [ ["4.1.1", "sentence1"], ["4.1.1", "sentence2"], ["4.2", "sent...
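Because the prompt asks for a JSON array of section and sentence pairs copied exactly from the RFC, one cheap sanity check is to parse the model's reply and confirm each sentence really occurs in the chunk it was given. The sketch below assumes that reply shape; `parse_constraints` and `not_verbatim` are hypothetical helpers, and the sample chunk is invented text rather than a real RFC passage.

```python
import json


def parse_constraints(llm_reply: str) -> list[tuple[str, str]]:
    """Parse a reply of the form [["4.1.1", "sentence"], ...] into tuples."""
    parsed = []
    for item in json.loads(llm_reply):
        if not (isinstance(item, list) and len(item) == 2
                and all(isinstance(x, str) for x in item)):
            raise ValueError(f"unexpected constraint entry: {item!r}")
        parsed.append((item[0], item[1]))
    return parsed


def not_verbatim(constraints: list[tuple[str, str]], rfc_chunk: str) -> list[tuple[str, str]]:
    """Return constraints whose sentence is absent from the chunk (whitespace-insensitive)."""
    def normalize(text: str) -> str:
        return " ".join(text.split())  # RFC text is hard-wrapped, so collapse whitespace
    chunk = normalize(rfc_chunk)
    return [c for c in constraints if normalize(c[1]) not in chunk]


# Invented example chunk and reply, for illustration only.
chunk = "A sender MUST NOT\n   send X. A recipient SHOULD ignore Y."
reply = '[["4.1", "A sender MUST NOT send X."], ["4.2", "A recipient MUST drop Y."]]'
suspect = not_verbatim(parse_constraints(reply), chunk)
print(suspect)
```

A check like this catches rewritten or invented sentences but not omissions, which still need comparison against an annotated list.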

Reason about the differences between the implementations' outputs, considering:
- Is one or more implementation likely violating the RFC (a real bug)?
- Could the difference be due to acceptable implementation-specific behavior?
- Could the difference plausibly be explained or fixed by configuration (e.g., security settings, extensions enabled/disabled, st...

A Test Format

This appendix documents the input and output test formats used by each protocol-specific harness.

A.1 HTTP

Test input format:

{ "test_id": "<integer>", "constraint": "<exact constraint that is being tested from the given list of constraints>", "description": "<descr...

- Use the test case format to infer what inputs can be controlled by the tests
- Scan the RFC chunk and find sentences that define constraints on those inputs. These include:
  - syntax rules,
  - allowed or disallowed values,
  - length or size limits,
  - character set restrictions,
  - relationships between multiple inputs,
  - ordering/state rules that can be represented as test inputs/state
- Constraints are generally RFC statements that include MUST/MUST NOT/SHOULD/SHOULD NOT. But also look for sentences that describe a rule or constraint on inputs that can be tested with this framework
- Every constraint is written as a tuple: (<section_number>, <constraint>)
- If the chunk has no relevant constraints, return [].

### Important:
- Return each constraint sentence *exactly as written* in the RFC (no edits).
- Only include sentences that can plausibly be tested using the described setup.

### Output format (for each chunk):
Return ONLY a JSON array like: [ ["4.1.1", "sentence1"], ["4.1.1", "sentence2"], ["4.2", "sent...
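The output contract in this prompt (a JSON array of section/sentence pairs, each sentence copied verbatim from the RFC) can be checked mechanically before the constraints are used for test generation. The sketch below is one way to do that, assuming the model's reply is available as a string; `parse_constraints` and the sample RFC text are hypothetical and are not taken from the paper.

```python
import json


def parse_constraints(raw: str, rfc_chunk: str) -> list[tuple[str, str]]:
    """Parse a JSON array of [section, sentence] pairs.

    Entries whose sentence does not appear verbatim in the RFC chunk
    (ignoring whitespace and line wrapping) are dropped, since the
    prompt requires sentences to be returned exactly as written.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")

    # Collapse runs of whitespace so wrapped RFC lines still match.
    normalized_chunk = " ".join(rfc_chunk.split())

    kept: list[tuple[str, str]] = []
    for item in data:
        well_formed = (
            isinstance(item, list)
            and len(item) == 2
            and all(isinstance(part, str) for part in item)
        )
        if not well_formed:
            raise ValueError(f"malformed entry: {item!r}")
        section, sentence = item
        if " ".join(sentence.split()) in normalized_chunk:
            kept.append((section, sentence))
    return kept


if __name__ == "__main__":
    # Illustrative chunk and reply (assumed text, not a real RFC excerpt).
    chunk = "4.1.1 Limits\nThe length field MUST NOT\nexceed 255 octets.\n"
    reply = ('[["4.1.1", "The length field MUST NOT exceed 255 octets."], '
             '["4.2", "This sentence is invented."]]')
    print(parse_constraints(reply, chunk))
```

Dropping non-verbatim sentences rather than raising is a design choice for this sketch: it keeps usable constraints from a reply that also contains a paraphrased or invented one.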
