arxiv: 2606.29124 · v1 · pith:D4JM64WMnew · submitted 2026-06-28 · 💻 cs.NI

CornerCase: Automated Extremal Testing of Protocol Implementations using LLMs

Pith reviewed 2026-06-30 02:53 UTC · model grok-4.3

classification 💻 cs.NI
keywords extremal testing · protocol implementations · LLM constraint extraction · boundary behaviors · differential testing · network protocols · RFC analysis

The pith

CornerCase uses LLMs to extract validity constraints from protocol specs and generates tests at their boundaries to find implementation bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CornerCase is a testing approach that targets bugs near the edges of what network protocol specifications allow. It splits the work into two steps: large language models first read specification documents section by section and list the explicit rules for valid inputs and outputs, and test cases are then generated right at or just beyond those limits. Running the tests on different implementations of the same protocol and comparing the results reveals inconsistencies that may indicate bugs. The method was applied to implementations of HTTP, DNS, BGP, SMTP, and QUIC, identifying 42 anomalies, of which 26 were acknowledged as bugs and 18 were fixed.

Core claim

CornerCase decomposes test generation into LLM-based extraction of explicit validity constraints from protocol specifications in a structured section-by-section manner, followed by generation of extremal test cases at or near the boundary of each constraint; these tests are executed across multiple implementations with differential testing to identify inconsistencies that expose bugs missed by fuzzing and model-based testing.

What carries the argument

Two-stage process of LLM-driven structured constraint extraction from specifications combined with extremal test generation at constraint boundaries and differential testing across implementations.
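
To make "at or near the boundary" concrete, here is a minimal Python sketch for one simple kind of constraint, an inclusive integer range. It is an illustration under stated assumptions, not the paper's implementation: the `RangeConstraint` type, the `extremal_values` helper, and the 1..255 bound in the usage are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Hypothetical constraint: a field must satisfy lo <= value <= hi."""
    field: str
    lo: int
    hi: int


def extremal_values(c: RangeConstraint) -> list[tuple[int, bool]]:
    """Return (value, expected_valid) pairs at and just beyond each bound."""
    candidates = [c.lo - 1, c.lo, c.lo + 1, c.hi - 1, c.hi, c.hi + 1]
    seen: set[int] = set()
    out: list[tuple[int, bool]] = []
    for v in candidates:
        if v in seen:  # narrow ranges produce duplicate candidates
            continue
        seen.add(v)
        out.append((v, c.lo <= v <= c.hi))
    return out


if __name__ == "__main__":
    # Illustrative bound only: a length field allowed to be 1..255.
    for value, valid in extremal_values(RangeConstraint("length", 1, 255)):
        print(value, "valid" if valid else "invalid")
```

Each pair carries the validity the constraint predicts, so a harness could compare that expectation with what an implementation actually does. Constraints on syntax, character sets, or protocol state would need their own generators.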

If this is right

  • Boundary behaviors such as encoded null bytes in URLs or state-dependent message validity can be targeted systematically rather than left to chance in random testing.
  • Differential testing across implementations of the same protocol reliably surfaces bugs through observable inconsistencies.
  • The same decomposition can be repeated on additional protocols beyond the five evaluated here.
  • Many previously unknown bugs can be identified and reported, leading to fixes in production implementations.
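
The differential-testing idea in the list above can be sketched in a few lines of Python: send one boundary input to every implementation, group the implementations by outcome, and flag the input when more than one outcome appears. The two "servers" below are toy stand-ins written for this example, not models of any real server's behaviour, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List
from urllib.parse import quote


def differential(test_input: str,
                 impls: Dict[str, Callable[[str], int]]) -> Dict[int, List[str]]:
    """Run one input on every implementation and group names by outcome."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for name, handler in impls.items():
        groups[handler(test_input)].append(name)
    return dict(groups)


def is_anomaly(groups: Dict[int, List[str]]) -> bool:
    """More than one distinct outcome means the implementations disagree."""
    return len(groups) > 1


# Toy stand-ins (assumptions for illustration, not real server behaviour).
def strict_server(path: str) -> int:
    return 400 if "%00" in path else 200


def lenient_server(path: str) -> int:
    return 200


if __name__ == "__main__":
    path = "/index" + quote("\x00") + ".html"  # percent-encoded null byte
    impls = {"strict": strict_server, "lenient": lenient_server}
    result = differential(path, impls)
    print(result, "anomaly" if is_anomaly(result) else "consistent")
```

A disagreement is only a candidate for a bug report. It still has to be checked against the specification, since implementations can legitimately differ.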

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If LLM accuracy on constraint extraction improves, the approach could extend to longer or more ambiguous specifications.
  • Patterns in the extracted constraints might highlight recurring ambiguities in how protocols are specified.
  • Combining the boundary-focused tests with existing fuzzers could produce broader coverage of protocol edge cases.

Load-bearing premise

Large language models can accurately extract explicit validity constraints from protocol specifications without significant omissions or hallucinations.

What would settle it

A protocol specification where the LLM extraction step misses or misstates a key validity constraint, causing the generated tests to miss a known boundary bug or produce only false inconsistencies.

Figures

Figures reproduced from arXiv: 2606.29124 by George Varghese, Kuan Qian, Rathin Singha, Ryan Beckett, Siva Kesava Reddy Kakarla, Soheil Abbasloo, Srinath Saikrishnan, Todd Millstein, Tracy Zhao.

Figure 1: The Pipeline for Extremal Testing

…allowing the user to specify the test format as a JSON file, the framework can be easily extended to new protocols and previously unanticipated features. For example, our HTTP tests improved when we moved from a fixed filesystem, where the model only generates URI queries, to a richer format that allows it to specify which files should and should not exist, along with the…

Abstract

Many software bugs in network protocol implementations arise near specification boundaries, such as inputs just within or outside allowed ranges, or messages that are valid in isolation but invalid in a given state. From the SSL Heartbleed exploit to TCP Christmas Tree packets, boundary inputs have repeatedly exposed critical weaknesses, yet remain under-tested by existing techniques such as fuzzing and model-based testing. We present CornerCase, an automated extremal testing approach that systematically targets such boundary behaviors. Our key idea is to decompose test generation into two stages: first, large language models (LLMs) extract explicit validity constraints from protocol specifications (e.g., RFCs) in a structured, section-by-section manner; second, extremal test cases are generated at or near the boundary of each constraint. These tests are executed across multiple implementations, and differential testing identifies inconsistencies. We evaluate CornerCase on widely used implementations of HTTP, DNS, BGP, SMTP, and QUIC, uncovering many previously unknown bugs. For example, the HTTP server h2o enters a redirect loop when processing URLs containing encoded null bytes. Overall, we used CornerCase to identify and file 42 anomalies; to date, 26 have been acknowledged as bugs and 18 fixed, with others under active investigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents CornerCase, an automated extremal testing framework that uses LLMs to extract explicit validity constraints from protocol specifications (e.g., RFCs) in a structured section-by-section manner, generates test cases at or near those constraint boundaries, and applies differential testing across multiple implementations of HTTP, DNS, BGP, SMTP, and QUIC to identify anomalies, reporting 42 anomalies of which 26 have been acknowledged as bugs.

Significance. If the LLM extraction step proves accurate, the work would provide a useful complement to fuzzing and model-based testing by systematically targeting boundary behaviors that have historically caused serious vulnerabilities; the external validation via 26 acknowledged bugs is a concrete strength that supports the empirical utility of the generated tests.

major comments (2)
  1. [§3 (Approach)] The description of LLM-based constraint extraction provides no quantitative audit (e.g., precision/recall against expert-annotated ground truth on held-out RFC sections) of extraction accuracy, omissions, or hallucinations. This is load-bearing for the central claim because any systematic error in the extracted constraints directly invalidates the extremal tests and prevents confident attribution of differential anomalies to implementation bugs rather than test-construction artifacts.
  2. [§5 (Evaluation)] The reported outcomes (42 anomalies, 26 acknowledged) give no breakdown of false-positive rates, no comparison of LLM-extracted constraints against the actual tests that triggered each anomaly, and no discussion of how differential testing distinguishes bugs from benign implementation differences or from LLM-induced invalid inputs.
minor comments (1)
  1. [Abstract] The abstract and method overview could more explicitly flag the current lack of extraction validation as a limitation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional validation would strengthen the manuscript. We address each major comment below and outline planned revisions to the approach and evaluation sections.

Point-by-point responses
  1. Referee: §3 (Approach): The description of LLM-based constraint extraction provides no quantitative audit (e.g., precision/recall against expert-annotated ground truth on held-out RFC sections) of extraction accuracy, omissions, or hallucinations. This is load-bearing for the central claim because any systematic error in the extracted constraints directly invalidates the extremal tests and prevents confident attribution of differential anomalies to implementation bugs rather than test-construction artifacts.

    Authors: We agree that a quantitative audit of the LLM extraction step is important for rigorously validating the approach and for confident attribution of anomalies. The current manuscript relies on downstream developer acknowledgments (26 bugs) as indirect evidence of utility but does not include precision/recall metrics against expert ground truth. In the revised version we will add such an audit: we will manually annotate a held-out sample of RFC sections for a subset of the evaluated protocols, compute precision and recall for the extracted constraints, and report omission and hallucination rates. This will be presented in an expanded §3.
    revision: yes

  2. Referee: §5 (Evaluation): The reported outcomes (42 anomalies, 26 acknowledged) give no breakdown of false-positive rates, no comparison of LLM-extracted constraints against the actual tests that triggered each anomaly, and no discussion of how differential testing distinguishes bugs from benign implementation differences or from LLM-induced invalid inputs.

    Authors: We acknowledge that the evaluation section would benefit from greater transparency on these points. The manuscript currently emphasizes the aggregate counts and acknowledgments without the requested breakdowns or explicit discussion of differential-testing mechanics. In the revision we will expand §5 to include: (i) an analysis of potential false-positive anomalies and how they were filtered, (ii) concrete examples mapping specific extracted constraints to the test cases that triggered each reported anomaly, and (iii) a discussion of how running the same extremal inputs across multiple independent implementations helps separate implementation bugs from benign differences or from any LLM-induced invalid inputs. These additions will be supported by additional tables and case studies.
    revision: yes
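
The audit discussed in the first exchange above comes down to a set comparison between extracted constraints and expert annotations. The Python sketch below shows one way to compute it, assuming constraints are matched by exact string equality; the `audit` function and the placeholder labels are hypothetical, and a real audit would likely need fuzzier matching.

```python
def audit(extracted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Precision/recall of extracted constraints against expert annotations."""
    true_pos = len(extracted & ground_truth)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # Share of extracted constraints absent from the annotations.
        "hallucination_rate": 1.0 - precision if extracted else 0.0,
        # Share of annotated constraints the extraction missed.
        "omission_rate": 1.0 - recall if ground_truth else 0.0,
    }


if __name__ == "__main__":
    # Placeholder labels stand in for constraint sentences.
    extracted = {"c1", "c2", "c3", "x"}
    truth = {"c1", "c2", "c3", "c4", "c5"}
    print(audit(extracted, truth))
```

Here the hallucination rate is defined as one minus precision and the omission rate as one minus recall. The authors may define these rates differently.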

Circularity Check

0 steps flagged

No circularity; empirical method with external validation

The paper presents CornerCase as an empirical testing technique that uses LLMs to extract constraints section-by-section from protocol specs (RFCs) and then generates extremal tests for differential testing across implementations. No mathematical derivation chain, equations, predictions, or first-principles results are claimed. Results rest on reported bug findings (42 anomalies, 26 acknowledged) that are externally validated by third-party acknowledgments rather than internal fits or self-citations. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the described approach. The central assumption about LLM extraction accuracy is an empirical claim open to external audit, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the capability of current LLMs to parse specifications reliably and on the existence of detectable inconsistencies across implementations at boundaries. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: LLMs can extract explicit validity constraints from protocol specifications (e.g., RFCs) accurately and completely in a structured manner
    Central to the first stage of the method as described in the abstract.

pith-pipeline@v0.9.1-grok · 5785 in / 1262 out tokens · 58783 ms · 2026-06-30T02:53:45.756575+00:00 · methodology

A Test Format

This appendix documents the input and output test formats used by each protocol-specific harness.

A.1 HTTP

Test input format:

{ "test_id": "<integer>", "constraint": "<exact constraint that is being tested from the given list of constraints>", "description": "<descr...

Constraint extraction prompt (excerpt)

  • Use the test case format to infer what inputs can be controlled by the tests.
  • Scan the RFC chunk and find sentences that define constraints on those inputs. These include: syntax rules; allowed or disallowed values; length or size limits; character set restrictions; relationships between multiple inputs; ordering/state rules that can be represented as test inputs/state.
  • Constraints are generally RFC statements that include MUST/MUST NOT/SHOULD/SHOULD NOT. But also look for sentences that describe a rule or constraint on inputs that can be tested with this framework.
  • Every constraint is written as a tuple: (<section_number>, <constraint>)
  • If the chunk has no relevant constraints, return [].

Important:

  • Return each constraint sentence *exactly as written* in the RFC (no edits).
  • Only include sentences that can plausibly be tested using the described setup.

Output format (for each chunk): Return ONLY a JSON array like: [ ["4.1.1", "sentence1"], ["4.1.1", "sentence2"], ["4.2", "sent...

Differential analysis prompt (excerpt)

Reason about the differences between the implementations' outputs, considering:

  • Is one or more implementation likely violating the RFC (a real bug)?
  • Could the difference be due to acceptable implementation-specific behavior?
  • Could the difference plausibly be explained or fixed by configuration (e.g., security settings, extensions enabled/disabled, st...
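
The extraction prompt asks for a JSON array of [section, sentence] pairs, with each sentence copied exactly from the RFC. A consumer of that output might validate its shape and drop any sentence that does not occur verbatim in the source chunk, as in the Python sketch below. The function names are hypothetical, and the verbatim filter is an assumption of this sketch, not a step the paper describes.

```python
import json


def parse_constraints(raw: str) -> list[tuple[str, str]]:
    """Parse a JSON array of [section, sentence] pairs into tuples.

    Raises ValueError if the payload does not have that shape.
    """
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    out: list[tuple[str, str]] = []
    for item in data:
        if (not isinstance(item, list) or len(item) != 2
                or not all(isinstance(x, str) for x in item)):
            raise ValueError(f"malformed constraint entry: {item!r}")
        section, sentence = item
        out.append((section, sentence.strip()))
    return out


def verbatim_only(constraints: list[tuple[str, str]],
                  rfc_chunk: str) -> list[tuple[str, str]]:
    """Keep only constraints whose sentence occurs verbatim in the chunk."""
    return [(s, c) for s, c in constraints if c in rfc_chunk]


if __name__ == "__main__":
    # Made-up chunk and model output, for illustration only.
    chunk = "Intro text. A sender MUST NOT do X. More text."
    raw = '[["4.1.1", "A sender MUST NOT do X."], ["4.2", "Invented sentence."]]'
    print(verbatim_only(parse_constraints(raw), chunk))
```

The substring check is deliberately strict. RFC text is hard-wrapped, so a practical version would probably normalise whitespace on both sides before comparing.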