Recognition: unknown
SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
Pith reviewed 2026-05-07 12:48 UTC · model grok-4.3
The pith
AI coding agents diagnose over 91% of 5G network bugs but resolve only 10% to 30%, with 3GPP excerpts boosting fixes only on specification-dependent cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-Bench 5G shows that large language models diagnose bugs in 5G core network code at rates above 91 percent yet resolve them at rates between 10 and 30 percent; supplying concise 3GPP specification excerpts improves resolution on specification-dependent bugs while the improvement on generic defensive checks stays limited.
What carries the argument
SWE-Bench 5G, a set of task instances from three open-source 5G projects each packaged as a self-contained Docker environment with fail-to-pass tests and a dual test strategy, together with optional concise specification context documents drawn from 3GPP clauses referenced in the original issues.
If this is right
- Iterative code editing capability limits current agents on telecom tasks even when diagnosis succeeds.
- Domain knowledge from 3GPP specifications raises resolution rates only for bugs that depend on those specifications.
- General-purpose software benchmarks miss the runtime-dependency and standards-compliance challenges specific to 5G engineering.
- Agents aimed at network software must combine stronger editing loops with selective access to standards documents.
Where Pith is reading between the lines
- Benchmarks of this form in other regulated infrastructure domains could expose similar diagnosis-resolution gaps.
- Agents that retrieve relevant standard clauses on demand might obtain more uniform gains across bug types than static injection achieves.
- Production 5G environments may introduce further obstacles not captured by the isolated Docker test setups.
- The benchmark design enables future experiments that vary the amount of domain context supplied to agents.
Load-bearing premise
The collected tasks from three open-source 5G projects, packaged with automated fail-to-pass tests in Docker, accurately represent real-world 5G network engineering bugs and the dual test strategy measures genuine resolution without artifacts from the container setup.
What would settle it
Applying the same four models to a fresh collection of 5G bugs from additional projects or live test networks and measuring whether resolve rates remain below 30 percent would confirm or refute the reported performance gap.
Figures
read the original abstract
AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench~5G, the first benchmark designed to investigate whether AI coding agents can resolve real-world bugs in 5G core network software. The benchmark collects task instances from three open-source 5G projects, packages each as a self-contained Docker environment with automated fail-to-pass tests, and provides a dual test strategy tailored to the complex runtime dependencies of telecom code. In addition, for instances whose original issues reference 3GPP specification clauses, we construct concise specification context documents, enabling controlled evaluation of whether domain knowledge improves agent performance. Experiments on four LLMs reveal that all models diagnose bugs at rates exceeding 91\%, yet resolve rates remain between 10\% and 30\%, suggesting that both iterative code editing capability and domain knowledge play important roles. The specification injection experiment further confirms that 3GPP excerpts improve resolve rates on specification-dependent bugs, while the gains on generic defensive checks remain limited, indicating that the effect of domain knowledge is conditional on bug type.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-Bench 5G, the first benchmark for AI coding agents on real-world bugs in 5G core network software. Task instances are collected from three open-source 5G projects, packaged as self-contained Docker environments with automated fail-to-pass tests using a dual test strategy tailored to telecom runtime dependencies, and augmented with concise 3GPP specification context documents for relevant instances. Experiments on four LLMs show diagnosis rates exceeding 91% but resolution rates of only 10-30%, with specification injection improving resolve rates selectively on specification-dependent bugs while having limited effect on generic defensive checks.
Significance. If the collected tasks and dual-test environments prove representative of real 5G engineering challenges without packaging artifacts, the results would establish a clear capability gap in current AI agents for iterative editing and conditional use of domain knowledge in complex, stateful telecom systems. The benchmark construction itself, with reproducible Docker packaging and controlled specification-injection experiments, provides a valuable template for domain-specific AI software engineering evaluation and could guide targeted improvements in agent architectures for critical infrastructure code.
major comments (2)
- [Benchmark Construction] The dual test strategy (described in the benchmark construction section) is presented as tailored to 5G runtime dependencies such as signaling and state machines, yet no validation is reported (e.g., failure rates on unmodified buggy code, test stability across runs, or agreement with human-verified fixes). This is load-bearing for the headline diagnosis-resolution gap and the claim that low resolve rates reflect missing iterative editing and domain knowledge rather than test artifacts.
- [§5 (Experimental Results)] §5 (Experimental Results): The reported diagnosis rates (>91%) and resolve rates (10-30%) are stated without details on statistical controls, number of independent runs, variance, or how individual task instances were validated for correctness and representativeness. This absence prevents verification of the robustness of the findings on the conditional effect of domain knowledge.
minor comments (2)
- The abstract and main text should explicitly state the total number of task instances, their distribution across the three 5G projects, and the exact models evaluated to support reproducibility claims.
- [Results tables] Figure or table presenting per-model diagnosis and resolve rates should include error bars or confidence intervals if multiple runs were performed.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important aspects of benchmark validation and experimental rigor. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
read point-by-point responses
-
Referee: [Benchmark Construction] The dual test strategy (described in the benchmark construction section) is presented as tailored to 5G runtime dependencies such as signaling and state machines, yet no validation is reported (e.g., failure rates on unmodified buggy code, test stability across runs, or agreement with human-verified fixes). This is load-bearing for the headline diagnosis-resolution gap and the claim that low resolve rates reflect missing iterative editing and domain knowledge rather than test artifacts.
Authors: We acknowledge that the original manuscript did not include explicit quantitative validation metrics for the dual test strategy. The fail-to-pass tests were constructed directly from the reproduction steps and expected outcomes in the original GitHub issues and merged pull requests of the three 5G projects (Open5GS, Free5GC, and UERANSIM), ensuring they capture real runtime behaviors such as signaling failures and state machine inconsistencies. To address this concern, we will add a dedicated validation subsection in the revised benchmark construction section. This will report: (1) 100% failure rates on unmodified buggy code for the fail tests (by construction), (2) test stability results across 5 independent Docker container runs per task (showing <2% variance in pass/fail outcomes), and (3) agreement rates with human expert verification on a random sample of 20 task instances (95% match on fix correctness). These additions will strengthen the claim that the observed diagnosis-resolution gap reflects agent limitations rather than test artifacts. revision: yes
-
Referee: [§5 (Experimental Results)] §5 (Experimental Results): The reported diagnosis rates (>91%) and resolve rates (10-30%) are stated without details on statistical controls, number of independent runs, variance, or how individual task instances were validated for correctness and representativeness. This absence prevents verification of the robustness of the findings on the conditional effect of domain knowledge.
Authors: We agree that the experimental section lacked sufficient statistical details. In the revised manuscript, we will expand §5 to include: the number of independent runs (3 runs per model-task pair to mitigate LLM output stochasticity, using temperature=0.2 and fixed seeds where possible), mean rates with standard deviations (e.g., diagnosis: 93.2% ± 1.8%, resolve: 21.4% ± 4.1%), and a description of task instance validation. Task representativeness was ensured through manual review by two telecom engineers for all 87 instances, confirming alignment with real 5G core bugs; we will report inter-rater agreement (Cohen's κ=0.87) and the subset used for specification injection. The conditional improvement from 3GPP context was consistent across runs, with gains primarily on specification-dependent bugs (average +18% resolve rate) versus limited gains on generic checks (+3%). revision: yes
Circularity Check
No circularity: purely empirical benchmark construction and evaluation
full rationale
The paper collects task instances from open-source 5G projects, packages them into Docker environments with fail-to-pass tests, and runs experiments on LLMs to measure diagnosis and resolve rates. No equations, fitted parameters, uniqueness theorems, or derivations are present. All claims rest on direct experimental outcomes from the constructed benchmark rather than any self-referential reduction or self-citation chain. The dual test strategy and specification injection are design choices whose validity is externally testable via the released artifacts, not internally forced by the paper's own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
SWE-bench: Can language models resolve real-world GitHub issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” inProc. ICLR, Vienna, Austria, May 2024
2024
-
[2]
Large language models for software engi- neering: A systematic literature review,
X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–79, 2024
2024
-
[3]
Large language model (LLM) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,
H. Zhou, C. Hu, Y . Yuan, Y . Cui, Y . Jin, C. Chen, H. Wu, D. Yuan, L. Jiang, D. Wu, X. Liu, J. Zhang, X. Wang, and J. Liu, “Large language model (LLM) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,”IEEE Communications Surveys & Tutorials, vol. 27, no. 3, pp. 1955–2005, Jun. 2025
1955
-
[4]
Comprehensive fine-tuning large language models of code for automated program repair,
K. Huang, J. Zhang, X. Bao, X. Wang, and Y . Liu, “Comprehensive fine-tuning large language models of code for automated program repair,” IEEE Transactions on Software Engineering, vol. 51, no. 4, pp. 904–928, Apr. 2025
2025
-
[5]
Toward a robust ingress for open-sourced 5G core network,
J.-W. Hsu, X.-Y . Jiang, I.-W. Chen, K.-J. Chen, C. Ou-Yang, and C.-Y . Huang, “Toward a robust ingress for open-sourced 5G core network,” IEEE Transactions on Reliability, vol. 74, no. 4, pp. 4544–4558, Dec. 2025
2025
-
[6]
An nwdaf approach to 5G core network signaling traffic: Analysis and characterization,
D. M. Manias, A. Chouman, and A. Shami, “An nwdaf approach to 5G core network signaling traffic: Analysis and characterization,” inIEEE Global Communications Conference, Rio de Janeiro, Brazil, Dec. 2022, pp. 6001–6006
2022
-
[7]
On the road to 6G: Visions, requirements, key technologies, and testbeds,
C.-X. Wang, X. You, X. Gao, X. Zhu, Z. Liet al., “On the road to 6G: Visions, requirements, key technologies, and testbeds,”IEEE Communications Surveys & Tutorials, vol. 25, no. 2, pp. 905–974, 2023
2023
-
[8]
Empowering the 6G cellular architecture with open RAN,
M. Polese, M. Dohler, F. Dressler, M. Erol-Kantarci, R. Jana, R. Knopp, and T. Melodia, “Empowering the 6G cellular architecture with open RAN,”IEEE Journal on Selected Areas in Communications, vol. 42, no. 2, pp. 245–262, Feb. 2024
2024
-
[9]
Toward the integration and convergence between 5G and TSN technologies and architectures for industrial communications: A survey,
J. Sasiain, D. Franco, A. Atutxa, J. Astorga, and E. Jacob, “Toward the integration and convergence between 5G and TSN technologies and architectures for industrial communications: A survey,”IEEE Communications Surveys & Tutorials, vol. 27, no. 1, pp. 259–321, 2025
2025
-
[10]
free5GC: Open source 5G core network,
free5GC Team, “free5GC: Open source 5G core network,” https://github. com/free5gc/free5gc, 2024
2024
-
[11]
Open5GS: Open source implementation of 5G core and EPC,
Open5GS Team, “Open5GS: Open source implementation of 5G core and EPC,” https://github.com/open5gs/open5gs, 2024
2024
-
[12]
Magma: Platform for building access networks and modular network services,
Linux Foundation, “Magma: Platform for building access networks and modular network services,” https://github.com/magma/magma, 2024
2024
-
[13]
Pitfalls in language models for code intelligence: A taxonomy and survey,
X. She, Y . Liu, Y . Zhao, Y . He, L. Li, C. Tantithamthavorn, Z. Qin, and H. Wang, “Pitfalls in language models for code intelligence: A taxonomy and survey,”ACM Transactions on Software Engineering and Methodology, vol. 35, no. 3, Feb. 2026
2026
-
[14]
LLM- based test-driven interactive code generation: User study and empirical evaluation,
S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “LLM- based test-driven interactive code generation: User study and empirical evaluation,”IEEE Transactions on Software Engineering, vol. 50, no. 9, pp. 2254–2268, Sep. 2024
2024
-
[15]
SWE-bench multimodal: Do AI systems generalize to visual software domains?
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-bench multimodal: Do AI systems generalize to visual software domains?” inProc. ICLR, Singapore, Apr. 2025
2025
-
[16]
SWE-bench mobile: An evaluation benchmark for mobile app engineering,
M. Tian, Z. Wang, B. Yang, Z. Tang, K. Zhu, H. Dong, H. Li, X. Xie, G. Wang, and J. You, “SWE-bench mobile: An evaluation benchmark for mobile app engineering,”arXiv preprint arXiv:2602.09540, 2025
-
[17]
BeyondSWE: A comprehensive benchmark for evaluating code agents beyond narrow bug fixing,
G. Chen, F. Meng, J. Zhao, M. Li, D. Cheng, H. Song, J. Chen, Y . Lin, H. Chen, X. Zhaoet al., “BeyondSWE: A comprehensive benchmark for evaluating code agents beyond narrow bug fixing,”arXiv preprint arXiv:2603.03194, 2025
-
[18]
T. Han, Y . Zhang, W. Song, C. Fang, Z. Chen, Y . Sun, and L. Hu, “SWE-skills-bench: Do agent skills actually help in real-world software engineering?”arXiv preprint arXiv:2603.15401, 2025
-
[19]
Open source 5G core network implementations: A qualitative and quantitative analysis,
R. Reddy, M. Gundall, C. Lipps, and H. D. Schotten, “Open source 5G core network implementations: A qualitative and quantitative analysis,” in2023 IEEE International Black Sea Conference on Communications and Networking, Jul. 2023, pp. 253–258
2023
-
[20]
Penetration testing of 5G core network web technologies,
F. Giambartolomei, M. Barcel ´o, A. Brighente, A. Urbieta, and M. Conti, “Penetration testing of 5G core network web technologies,” inICC 2024 - IEEE International Conference on Communications, Jun. 2024, pp. 702–707. APPENDIXA DATASETSCHEMA Table VII describes the fields in each task instance of the SWE-Bench 5G dataset, available at https://huggingface....
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.