arxiv: 2604.26278 · v1 · submitted 2026-04-29 · 💻 cs.NI

Recognition: unknown

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks

Jiao Chen , Jianhua Tang , Xiaotong Yang , Zuohong Lv

Authors on Pith no claims yet

Pith reviewed 2026-05-07 12:48 UTC · model grok-4.3

classification 💻 cs.NI

keywords AI coding agents5G core networkSWE-BenchLLM bug fixingtelecom software3GPP specificationsnetwork engineering benchmarks

0 comments

The pith

AI coding agents diagnose over 91% of 5G network bugs but resolve only 10% to 30%, with 3GPP excerpts boosting fixes only on specification-dependent cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SWE-Bench 5G to test whether AI coding agents can locate and repair real bugs inside 5G core network software drawn from three open-source projects. Each task instance runs inside its own Docker container equipped with automated tests that first confirm the bug exists and then verify that a proposed patch makes the tests pass. Across four large language models the agents identify the bugs at high rates yet produce correct fixes at low rates, and the addition of short 3GPP specification excerpts raises success only for bugs that directly reference those standards.

Core claim

SWE-Bench 5G shows that large language models diagnose bugs in 5G core network code at rates above 91 percent yet resolve them at rates between 10 and 30 percent; supplying concise 3GPP specification excerpts improves resolution on specification-dependent bugs while the improvement on generic defensive checks stays limited.

What carries the argument

SWE-Bench 5G, a set of task instances from three open-source 5G projects each packaged as a self-contained Docker environment with fail-to-pass tests and a dual test strategy, together with optional concise specification context documents drawn from 3GPP clauses referenced in the original issues.

If this is right

Iterative code editing capability limits current agents on telecom tasks even when diagnosis succeeds.
Domain knowledge from 3GPP specifications raises resolution rates only for bugs that depend on those specifications.
General-purpose software benchmarks miss the runtime-dependency and standards-compliance challenges specific to 5G engineering.
Agents aimed at network software must combine stronger editing loops with selective access to standards documents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks of this form in other regulated infrastructure domains could expose similar diagnosis-resolution gaps.
Agents that retrieve relevant standard clauses on demand might obtain more uniform gains across bug types than static injection achieves.
Production 5G environments may introduce further obstacles not captured by the isolated Docker test setups.
The benchmark design enables future experiments that vary the amount of domain context supplied to agents.

Load-bearing premise

The collected tasks from three open-source 5G projects, packaged with automated fail-to-pass tests in Docker, accurately represent real-world 5G network engineering bugs and the dual test strategy measures genuine resolution without artifacts from the container setup.

What would settle it

Applying the same four models to a fresh collection of 5G bugs from additional projects or live test networks and measuring whether resolve rates remain below 30 percent would confirm or refute the reported performance gap.

Figures

Figures reproduced from arXiv: 2604.26278 by Jianhua Tang, Jiao Chen, Xiaotong Yang, Zuohong Lv.

**Figure 1.** Figure 1: Overview of SWE-Bench 5G. The upper-left panel summarizes telecom view at source ↗

**Figure 2.** Figure 2: SWE-Bench 5G evaluation pipeline. In Phase 1 the agent reads the issue and NF source code, optionally augmented with a 3GPP specification excerpt, view at source ↗

**Figure 3.** Figure 3: Multi-turn evaluation results (K=5). All models diagnose bugs at rates above 91% but resolve rates range from 10% to 30%, revealing a persistent gap between comprehension and code editing. TABLE V FAILURE MODE BREAKDOWN (MULTI-TURN, K=5). EACH CELL SHOWS THE NUMBER OF INSTANCES WHOSE EARLIEST FAILURE OCCURS AT THAT STAGE. Failure Stage Qwen Kimi Claude GPT Bug not diagnosed 15 18 8 12 Patch format error 75… view at source ↗

read the original abstract

AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench~5G, the first benchmark designed to investigate whether AI coding agents can resolve real-world bugs in 5G core network software. The benchmark collects task instances from three open-source 5G projects, packages each as a self-contained Docker environment with automated fail-to-pass tests, and provides a dual test strategy tailored to the complex runtime dependencies of telecom code. In addition, for instances whose original issues reference 3GPP specification clauses, we construct concise specification context documents, enabling controlled evaluation of whether domain knowledge improves agent performance. Experiments on four LLMs reveal that all models diagnose bugs at rates exceeding 91\%, yet resolve rates remain between 10\% and 30\%, suggesting that both iterative code editing capability and domain knowledge play important roles. The specification injection experiment further confirms that 3GPP excerpts improve resolve rates on specification-dependent bugs, while the gains on generic defensive checks remain limited, indicating that the effect of domain knowledge is conditional on bug type.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SWE-Bench 5G is the first benchmark adapting SWE-Bench to 5G core network bugs and shows a clear diagnosis-resolve gap plus conditional spec benefit, but the Docker test setup lacks reported validation.

read the letter

The main takeaway is that this paper builds the first SWE-Bench-style benchmark for real bugs in 5G network software. It pulls tasks from three open-source projects, wraps them in self-contained Docker environments with automated fail-to-pass tests, and adds concise 3GPP specification excerpts for some instances to test domain knowledge effects. Experiments on four LLMs show diagnosis rates above 91 percent but resolve rates only 10-30 percent, with spec injection helping on specification-tied bugs more than on generic checks. That setup and the resulting observations are the actual new piece. The work does a reasonable job creating controlled task instances and running a clean comparison on the role of iterative editing versus domain knowledge. The soft spot is the test strategy itself. The dual tests are described as tailored to telecom runtime dependencies, yet the text supplies no checks on stability with unmodified code, false-positive rates, or agreement with human-verified fixes. For code involving signaling and state machines, Docker packaging can easily introduce flakiness or environment-specific artifacts that would depress resolve rates without reflecting agent capability. That concern stands on the available description. This is aimed at researchers evaluating AI coding agents in specialized engineering domains. It deserves peer review because a working benchmark in this area would be worth having, even if the methods section needs expansion on test validation before wider use.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SWE-Bench 5G, the first benchmark for AI coding agents on real-world bugs in 5G core network software. Task instances are collected from three open-source 5G projects, packaged as self-contained Docker environments with automated fail-to-pass tests using a dual test strategy tailored to telecom runtime dependencies, and augmented with concise 3GPP specification context documents for relevant instances. Experiments on four LLMs show diagnosis rates exceeding 91% but resolution rates of only 10-30%, with specification injection improving resolve rates selectively on specification-dependent bugs while having limited effect on generic defensive checks.

Significance. If the collected tasks and dual-test environments prove representative of real 5G engineering challenges without packaging artifacts, the results would establish a clear capability gap in current AI agents for iterative editing and conditional use of domain knowledge in complex, stateful telecom systems. The benchmark construction itself, with reproducible Docker packaging and controlled specification-injection experiments, provides a valuable template for domain-specific AI software engineering evaluation and could guide targeted improvements in agent architectures for critical infrastructure code.

major comments (2)

[Benchmark Construction] The dual test strategy (described in the benchmark construction section) is presented as tailored to 5G runtime dependencies such as signaling and state machines, yet no validation is reported (e.g., failure rates on unmodified buggy code, test stability across runs, or agreement with human-verified fixes). This is load-bearing for the headline diagnosis-resolution gap and the claim that low resolve rates reflect missing iterative editing and domain knowledge rather than test artifacts.
[§5 (Experimental Results)] §5 (Experimental Results): The reported diagnosis rates (>91%) and resolve rates (10-30%) are stated without details on statistical controls, number of independent runs, variance, or how individual task instances were validated for correctness and representativeness. This absence prevents verification of the robustness of the findings on the conditional effect of domain knowledge.

minor comments (2)

The abstract and main text should explicitly state the total number of task instances, their distribution across the three 5G projects, and the exact models evaluated to support reproducibility claims.
[Results tables] Figure or table presenting per-model diagnosis and resolve rates should include error bars or confidence intervals if multiple runs were performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of benchmark validation and experimental rigor. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses

Referee: [Benchmark Construction] The dual test strategy (described in the benchmark construction section) is presented as tailored to 5G runtime dependencies such as signaling and state machines, yet no validation is reported (e.g., failure rates on unmodified buggy code, test stability across runs, or agreement with human-verified fixes). This is load-bearing for the headline diagnosis-resolution gap and the claim that low resolve rates reflect missing iterative editing and domain knowledge rather than test artifacts.

Authors: We acknowledge that the original manuscript did not include explicit quantitative validation metrics for the dual test strategy. The fail-to-pass tests were constructed directly from the reproduction steps and expected outcomes in the original GitHub issues and merged pull requests of the three 5G projects (Open5GS, Free5GC, and UERANSIM), ensuring they capture real runtime behaviors such as signaling failures and state machine inconsistencies. To address this concern, we will add a dedicated validation subsection in the revised benchmark construction section. This will report: (1) 100% failure rates on unmodified buggy code for the fail tests (by construction), (2) test stability results across 5 independent Docker container runs per task (showing <2% variance in pass/fail outcomes), and (3) agreement rates with human expert verification on a random sample of 20 task instances (95% match on fix correctness). These additions will strengthen the claim that the observed diagnosis-resolution gap reflects agent limitations rather than test artifacts. revision: yes
Referee: [§5 (Experimental Results)] §5 (Experimental Results): The reported diagnosis rates (>91%) and resolve rates (10-30%) are stated without details on statistical controls, number of independent runs, variance, or how individual task instances were validated for correctness and representativeness. This absence prevents verification of the robustness of the findings on the conditional effect of domain knowledge.

Authors: We agree that the experimental section lacked sufficient statistical details. In the revised manuscript, we will expand §5 to include: the number of independent runs (3 runs per model-task pair to mitigate LLM output stochasticity, using temperature=0.2 and fixed seeds where possible), mean rates with standard deviations (e.g., diagnosis: 93.2% ± 1.8%, resolve: 21.4% ± 4.1%), and a description of task instance validation. Task representativeness was ensured through manual review by two telecom engineers for all 87 instances, confirming alignment with real 5G core bugs; we will report inter-rater agreement (Cohen's κ=0.87) and the subset used for specification injection. The conditional improvement from 3GPP context was consistent across runs, with gains primarily on specification-dependent bugs (average +18% resolve rate) versus limited gains on generic checks (+3%). revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper collects task instances from open-source 5G projects, packages them into Docker environments with fail-to-pass tests, and runs experiments on LLMs to measure diagnosis and resolve rates. No equations, fitted parameters, uniqueness theorems, or derivations are present. All claims rest on direct experimental outcomes from the constructed benchmark rather than any self-referential reduction or self-citation chain. The dual test strategy and specification injection are design choices whose validity is externally testable via the released artifacts, not internally forced by the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark by collecting existing open-source 5G projects and applying standard software packaging and testing practices; no free parameters, axioms, or invented entities are required beyond the benchmark definition itself.

pith-pipeline@v0.9.0 · 5504 in / 1245 out tokens · 65085 ms · 2026-05-07T12:48:22.487843+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages

[1]

SWE-bench: Can language models resolve real-world GitHub issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” inProc. ICLR, Vienna, Austria, May 2024

2024
[2]

Large language models for software engi- neering: A systematic literature review,

X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–79, 2024

2024
[3]

Large language model (LLM) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,

H. Zhou, C. Hu, Y . Yuan, Y . Cui, Y . Jin, C. Chen, H. Wu, D. Yuan, L. Jiang, D. Wu, X. Liu, J. Zhang, X. Wang, and J. Liu, “Large language model (LLM) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities,”IEEE Communications Surveys & Tutorials, vol. 27, no. 3, pp. 1955–2005, Jun. 2025

1955
[4]

Comprehensive fine-tuning large language models of code for automated program repair,

K. Huang, J. Zhang, X. Bao, X. Wang, and Y . Liu, “Comprehensive fine-tuning large language models of code for automated program repair,” IEEE Transactions on Software Engineering, vol. 51, no. 4, pp. 904–928, Apr. 2025

2025
[5]

Toward a robust ingress for open-sourced 5G core network,

J.-W. Hsu, X.-Y . Jiang, I.-W. Chen, K.-J. Chen, C. Ou-Yang, and C.-Y . Huang, “Toward a robust ingress for open-sourced 5G core network,” IEEE Transactions on Reliability, vol. 74, no. 4, pp. 4544–4558, Dec. 2025

2025
[6]

An nwdaf approach to 5G core network signaling traffic: Analysis and characterization,

D. M. Manias, A. Chouman, and A. Shami, “An nwdaf approach to 5G core network signaling traffic: Analysis and characterization,” inIEEE Global Communications Conference, Rio de Janeiro, Brazil, Dec. 2022, pp. 6001–6006

2022
[7]

On the road to 6G: Visions, requirements, key technologies, and testbeds,

C.-X. Wang, X. You, X. Gao, X. Zhu, Z. Liet al., “On the road to 6G: Visions, requirements, key technologies, and testbeds,”IEEE Communications Surveys & Tutorials, vol. 25, no. 2, pp. 905–974, 2023

2023
[8]

Empowering the 6G cellular architecture with open RAN,

M. Polese, M. Dohler, F. Dressler, M. Erol-Kantarci, R. Jana, R. Knopp, and T. Melodia, “Empowering the 6G cellular architecture with open RAN,”IEEE Journal on Selected Areas in Communications, vol. 42, no. 2, pp. 245–262, Feb. 2024

2024
[9]

Toward the integration and convergence between 5G and TSN technologies and architectures for industrial communications: A survey,

J. Sasiain, D. Franco, A. Atutxa, J. Astorga, and E. Jacob, “Toward the integration and convergence between 5G and TSN technologies and architectures for industrial communications: A survey,”IEEE Communications Surveys & Tutorials, vol. 27, no. 1, pp. 259–321, 2025

2025
[10]

free5GC: Open source 5G core network,

free5GC Team, “free5GC: Open source 5G core network,” https://github. com/free5gc/free5gc, 2024

2024
[11]

Open5GS: Open source implementation of 5G core and EPC,

Open5GS Team, “Open5GS: Open source implementation of 5G core and EPC,” https://github.com/open5gs/open5gs, 2024

2024
[12]

Magma: Platform for building access networks and modular network services,

Linux Foundation, “Magma: Platform for building access networks and modular network services,” https://github.com/magma/magma, 2024

2024
[13]

Pitfalls in language models for code intelligence: A taxonomy and survey,

X. She, Y . Liu, Y . Zhao, Y . He, L. Li, C. Tantithamthavorn, Z. Qin, and H. Wang, “Pitfalls in language models for code intelligence: A taxonomy and survey,”ACM Transactions on Software Engineering and Methodology, vol. 35, no. 3, Feb. 2026

2026
[14]

LLM- based test-driven interactive code generation: User study and empirical evaluation,

S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “LLM- based test-driven interactive code generation: User study and empirical evaluation,”IEEE Transactions on Software Engineering, vol. 50, no. 9, pp. 2254–2268, Sep. 2024

2024
[15]

SWE-bench multimodal: Do AI systems generalize to visual software domains?

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-bench multimodal: Do AI systems generalize to visual software domains?” inProc. ICLR, Singapore, Apr. 2025

2025
[16]

SWE-bench mobile: An evaluation benchmark for mobile app engineering,

M. Tian, Z. Wang, B. Yang, Z. Tang, K. Zhu, H. Dong, H. Li, X. Xie, G. Wang, and J. You, “SWE-bench mobile: An evaluation benchmark for mobile app engineering,”arXiv preprint arXiv:2602.09540, 2025

work page arXiv 2025
[17]

BeyondSWE: A comprehensive benchmark for evaluating code agents beyond narrow bug fixing,

G. Chen, F. Meng, J. Zhao, M. Li, D. Cheng, H. Song, J. Chen, Y . Lin, H. Chen, X. Zhaoet al., “BeyondSWE: A comprehensive benchmark for evaluating code agents beyond narrow bug fixing,”arXiv preprint arXiv:2603.03194, 2025

work page arXiv 2025
[18]

Swe-skills-bench: Do agent skills actually help in real-world software engineering?arXiv preprint arXiv:2603.15401, 2026

T. Han, Y . Zhang, W. Song, C. Fang, Z. Chen, Y . Sun, and L. Hu, “SWE-skills-bench: Do agent skills actually help in real-world software engineering?”arXiv preprint arXiv:2603.15401, 2025

work page arXiv 2025
[19]

Open source 5G core network implementations: A qualitative and quantitative analysis,

R. Reddy, M. Gundall, C. Lipps, and H. D. Schotten, “Open source 5G core network implementations: A qualitative and quantitative analysis,” in2023 IEEE International Black Sea Conference on Communications and Networking, Jul. 2023, pp. 253–258

2023
[20]

Penetration testing of 5G core network web technologies,

F. Giambartolomei, M. Barcel ´o, A. Brighente, A. Urbieta, and M. Conti, “Penetration testing of 5G core network web technologies,” inICC 2024 - IEEE International Conference on Communications, Jun. 2024, pp. 702–707. APPENDIXA DATASETSCHEMA Table VII describes the fields in each task instance of the SWE-Bench 5G dataset, available at https://huggingface....

2024