Recognition: no theorem link
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Pith reviewed 2026-05-16 06:45 UTC · model grok-4.3
The pith
Multi-SWE-bench supplies 1,632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-SWE-bench is a multilingual collection of 1,632 real-world issue instances, each verified by expert annotators, that enables accurate measurement of how well large language models can generate patches to resolve issues in languages other than Python.
What carries the argument
Multi-SWE-bench, the set of 1,632 expert-annotated issue instances drawn from repositories in seven languages that functions as the evaluation standard.
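To make the evaluation standard concrete, here is a minimal sketch of what one issue-resolving instance might look like and how "resolved" could be decided. The field names (repo, base_commit, fail_to_pass, ...) follow SWE-bench conventions and are assumptions rather than the paper's exact schema, and run_test is a hypothetical language-specific test runner.

```python
from dataclasses import dataclass
import subprocess

@dataclass
class Instance:
    """One issue-resolving task; field names follow SWE-bench conventions
    and are an assumption, not the paper's published schema."""
    instance_id: str
    language: str            # java, typescript, javascript, go, rust, c, or cpp
    repo: str                # e.g. "owner/project"
    base_commit: str         # commit the candidate patch is applied on top of
    issue_text: str          # the issue report the patch must address
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after

def run_test(workdir: str, test_id: str) -> bool:
    """Hypothetical language-specific test runner (mvn, cargo test, go test, ...),
    normally executed inside the instance's container image."""
    raise NotImplementedError

def is_resolved(inst: Instance, model_patch: str, workdir: str) -> bool:
    """Apply the model's patch at base_commit and re-run the fail-to-pass tests."""
    subprocess.run(["git", "checkout", inst.base_commit], cwd=workdir, check=True)
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=workdir)
    if applied.returncode != 0:
        return False                      # an unappliable patch counts as unresolved
    return all(run_test(workdir, t) for t in inst.fail_to_pass)
```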
If this is right
- Current agent methods can be compared directly on the same multilingual tasks rather than only on Python.
- Model performance gaps between languages become measurable and can guide targeted improvements.
- The released 4,723-instance RL dataset supplies structured trajectories for training agents on patch generation (a sketch of one such record follows this list).
- The open-sourced annotation pipeline allows repeated expansion of the benchmark without starting from scratch.
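The sketch below is a guess at what one "well-structured" RL record could contain; every field name and value here is illustrative, not the released Multi-SWE-RL schema.

```python
# Hypothetical record structure for an RL training instance with a step-by-step
# trajectory and an outcome-based reward; none of these names come from the paper.
trajectory = {
    "instance_id": "example-org__example-repo-123",   # hypothetical id
    "language": "rust",
    "issue": "Panic when parsing empty config file",
    "steps": [
        {"action": "view", "path": "src/config.rs"},
        {"action": "edit", "path": "src/config.rs", "diff": "--- ...\n+++ ..."},
        {"action": "run_tests", "result": "2 passed, 0 failed"},
    ],
    "final_patch": "--- a/src/config.rs\n+++ b/src/config.rs\n...",
    "reward": 1.0,   # e.g. 1.0 if the fail-to-pass tests pass after the patch
}
```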
Where Pith is reading between the lines
- Wider language coverage may reveal whether techniques that work in Python transfer directly or require language-specific adjustments.
- Community growth of the RL dataset could eventually support training loops that iterate on real repository histories rather than synthetic tasks.
- If the benchmark becomes a standard, progress reports on issue-resolving agents will need to include results from all seven languages to be considered comprehensive.
Load-bearing premise
The filtering and annotation steps performed by the 68 experts produce a set of instances that faithfully represent the difficulty and distribution of real-world issue-resolving work across the covered languages.
What would settle it
A re-annotation round by a fresh group of experts that selects a substantially different subset of instances or assigns markedly different difficulty labels would indicate the original curation did not produce a stable benchmark.
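One way such a stability check could be quantified, assuming each annotation round outputs a selected-instance set plus a difficulty label per selected instance. The identifiers, labels, and thresholds below are invented for illustration.

```python
# Compare two hypothetical annotation rounds: overlap of selected instances
# (Jaccard) and agreement on difficulty labels for the shared instances.
round_a = {"go-42": "hard", "java-7": "easy", "rust-13": "medium"}
round_b = {"go-42": "hard", "rust-13": "hard", "cpp-99": "easy"}

selected_a, selected_b = set(round_a), set(round_b)
jaccard = len(selected_a & selected_b) / len(selected_a | selected_b)

shared = selected_a & selected_b
label_agreement = sum(round_a[i] == round_b[i] for i in shared) / max(len(shared), 1)

print(f"selection overlap (Jaccard): {jaccard:.2f}")
print(f"difficulty-label agreement on shared instances: {label_agreement:.2f}")
# Low overlap or low agreement would suggest the original curation is not stable.
```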
Original abstract
The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multi-SWE-bench, a multilingual benchmark for the issue-resolving task (generating patches to fix reported issues in codebases). It covers seven languages (Java, TypeScript, JavaScript, Go, Rust, C, C++), contains 1,632 instances curated from 2,456 candidates by 68 expert annotators, evaluates SOTA LLMs via Agentless, SWE-agent, and OpenHands, provides empirical analysis, and releases a larger Multi-SWE-RL dataset of 4,723 instances plus the full open-source data-production pipeline and tutorials.
Significance. If the curation process yields representative, high-quality instances with verifiable patches, the benchmark would meaningfully extend SWE-bench-style evaluation beyond Python, enabling cross-language comparisons of LLM issue-resolving capabilities. The release of the RL dataset and the fully open-sourced pipeline with tutorials is a concrete strength that supports reproducibility and community expansion of training data.
Major comments (2)
- [Abstract and data-construction section] The central claim that the 1,632 instances are 'high-quality' and enable 'accurate and reliable evaluation' rests on expert annotation from 2,456 candidates by 68 annotators, yet the manuscript reports no inter-annotator agreement statistics, explicit rejection/exclusion criteria, or post-curation verification that selected issues possess gold patches suitable for model testing. This quantitative gap directly affects the trustworthiness of all downstream model comparisons.
- [Evaluation section (§4)] The reported results for Agentless, SWE-agent, and OpenHands are presented without language-specific breakdowns or comparisons against a Python-only baseline (e.g., SWE-bench), making it difficult to assess whether the multilingual setting introduces new failure modes or merely replicates known Python behaviors.
Minor comments (1)
- [Multi-SWE-RL section] The description of the Multi-SWE-RL release would benefit from an explicit statement of how the 4,723 instances differ from the 1,632 benchmark instances (e.g., presence/absence of gold patches, filtering criteria).
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where revisions will be made to address the concerns.
Point-by-point responses
- Referee: [Abstract and data-construction section] The central claim that the 1,632 instances are 'high-quality' and enable 'accurate and reliable evaluation' rests on expert annotation from 2,456 candidates by 68 annotators, yet the manuscript reports no inter-annotator agreement statistics, explicit rejection/exclusion criteria, or post-curation verification that selected issues possess gold patches suitable for model testing. This quantitative gap directly affects the trustworthiness of all downstream model comparisons.
Authors: We agree that the absence of these quantitative details weakens the presentation of the curation process. In the revised manuscript, we will expand the data-construction section to include inter-annotator agreement statistics (e.g., Fleiss' kappa computed across the 68 annotators on a sampled subset of candidates). We will also add explicit descriptions of the rejection and exclusion criteria applied during filtering from 2,456 to 1,632 instances, including criteria related to patch verifiability and issue clarity. Finally, we will report the post-curation verification procedure, including any manual or automated checks performed to confirm that the retained gold patches are suitable for model evaluation. revision: yes
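For readers unfamiliar with the promised statistic, a minimal sketch of Fleiss' kappa over keep/reject votes is shown below. The vote matrix is invented; the manuscript, per the referee, does not yet report any such numbers.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an N x k matrix of category counts.

    counts[i, j] = number of annotators who assigned item i to category j.
    Every row is assumed to sum to the same number of ratings n."""
    N, k = counts.shape
    n = counts.sum(axis=1)[0]                                   # ratings per item
    p_j = counts.sum(axis=0) / (N * n)                          # category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # per-item agreement
    P_bar = P_i.mean()                                          # observed agreement
    P_e = np.square(p_j).sum()                                  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 5 candidate instances, 3 annotators each,
# categories = [keep, reject]
votes = np.array([
    [3, 0],
    [2, 1],
    [3, 0],
    [0, 3],
    [1, 2],
])
print(round(fleiss_kappa(votes), 3))   # ~0.444 for this invented matrix
```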
- Referee: [Evaluation section (§4)] The reported results for Agentless, SWE-agent, and OpenHands are presented without language-specific breakdowns or comparisons against a Python-only baseline (e.g., SWE-bench), making it difficult to assess whether the multilingual setting introduces new failure modes or merely replicates known Python behaviors.
Authors: We acknowledge that language-specific breakdowns would improve interpretability. In the revised §4, we will add per-language performance tables and figures for all three methods, allowing readers to identify any language-dependent patterns. For comparison against a Python-only baseline, we will include a new discussion paragraph that situates our aggregate results against published SWE-bench numbers from prior work, while noting the inherent differences in repository scale, issue complexity, and benchmark construction that preclude a strictly controlled head-to-head comparison. We believe this addresses the core concern without overstating comparability. revision: partial
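The requested breakdown is a simple per-language aggregation of resolved/unresolved outcomes. The sketch below assumes per-instance results are available as (language, resolved) pairs; the pairs shown are placeholders, not results from the paper.

```python
from collections import defaultdict

# Hypothetical per-instance outcomes; real values would come from running
# Agentless / SWE-agent / OpenHands over Multi-SWE-bench and checking the
# fail-to-pass tests for each instance.
results = [
    ("java", True), ("java", False),
    ("go", True), ("rust", False),
    ("c", False), ("cpp", True),
    ("typescript", True), ("javascript", False),
]

totals, resolved = defaultdict(int), defaultdict(int)
for lang, ok in results:
    totals[lang] += 1
    resolved[lang] += int(ok)

for lang in sorted(totals):
    rate = 100.0 * resolved[lang] / totals[lang]
    print(f"{lang:<12} {resolved[lang]}/{totals[lang]}  {rate:.1f}% resolved")

# A published Python-only SWE-bench number could be printed alongside for
# context, with the caveat that the two benchmarks are not strictly comparable.
```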
Circularity Check
No significant circularity in empirical dataset construction
Full rationale
The paper presents an empirical contribution: the curation of a multilingual issue-resolving benchmark (Multi-SWE-bench) by selecting and annotating 1,632 instances from 2,456 candidates using 68 expert annotators, followed by model evaluations and the release of additional RL data. There is no derivation chain, set of equations, parameter fitting, or prediction that could reduce outputs to inputs by construction. The central claims rest on a procedural description of data collection rather than on self-referential logic, load-bearing self-citations, or renamed known results. This matches the default expectation for dataset papers, which are typically non-circular when they do not invoke mathematical self-definition or fitted predictions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: expert annotators can reliably identify and annotate high-quality issue-resolving instances from candidate pools.
Forward citations
Cited by 19 Pith papers
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning. SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning. SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
- HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks. HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
- AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation. 10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
- PlayCoder: Making LLM-Generated GUI Code Playable. PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
- Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution. IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
- Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs? Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
- SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle. The SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
- Revisiting DAgger in the Era of LLM-Agents. DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution. SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
- WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning. WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation. AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
- Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair. Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
- AlphaEval: Evaluating Agents in Production. AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
- Seed1.8 Model Card: Towards Generalized Real-World Agency. Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning. UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining a 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- Kimi K2: Open Agentic Intelligence. Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review. A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Reference graph
Works this paper leans on
- [1]
- [2] M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR), pages 207–216. IEEE, 2013.
- [3] B. Athiwaratkun, S. K. Gouda, Z. Wang, X. Li, Y. Tian, M. Tan, W. U. Ahmad, S. Wang, Q. Sun, M. Shang, et al. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868.
- [4] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- [5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. ...
- [6]
- [7]
- [8] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1643–1652, 2018.
- [9] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
- [10] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- [11]
- [12] T. Liu, C. Xu, and J. McAuley. RepoBench: Benchmarking repository-level code auto-completion systems, 2024a. URL https://arxiv.org/abs/2306.03091. W. Liu, A. Yu, D. Zan, B. Shen, W. Zhang, H. Zhao, Z. Jin, and Q. Wang. GraphCoder: Enhancing repository-level code completion via code context graph-based retrieval and language model, 2024b. ...
- [13] Y. Ouyang, J. Yang, and L. Zhang. Benchmarking automated program repair: An extensive study on both real-world and artificial bugs. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 440–452.
- [14]
- [15] N. Saavedra, A. Silva, and M. Monperrus. GitBug-Actions: Building reproducible bug-fix benchmarks with GitHub Actions. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 1–5, 2024.
- [16]
- [17] X. Wang, P. Gao, X. Meng, C. Peng, R. Hu, Y. Lin, and C. Gao. Aegis: An agent-based framework for general bug reproduction from issue descriptions. arXiv preprint arXiv:2411.18015, 2024a. X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, ...
- [18] C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489.
- [19] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793.
- [20] D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen, and J. Lou. CERT: Continual pre-training on sketches for library-oriented code generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022, pages 2369–2375.
- [21]
- [22] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023.
- [23] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673–5684...