pith. machine review for the scientific record.

arXiv: 2504.02605 · v1 · submitted 2025-04-03 · cs.SE · cs.AI · cs.CL

Recognition: no theorem link

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Pith reviewed 2026-05-16 06:45 UTC · model grok-4.3

classification cs.SE · cs.AI · cs.CL
keywords issue resolving · multilingual benchmark · large language models · software engineering · code repair · SWE-bench · reinforcement learning

The pith

Multi-SWE-bench supplies 1632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for issue resolving, the task of modifying a codebase to fix a reported issue, have been limited almost entirely to Python. The paper creates Multi-SWE-bench by having 68 expert annotators review 2,456 candidate instances in Java, TypeScript, JavaScript, Go, Rust, C, and C++ and retaining the 1,632 judged high-quality. It then runs three representative methods (Agentless, SWE-agent, and OpenHands) on current models to produce performance data and empirical observations across languages. The work also releases an initial set of 4,723 instances for reinforcement-learning training and open-sources the full data production pipeline for community use.

Core claim

Multi-SWE-bench is a multilingual collection of 1632 real-world issue instances, each verified by expert annotators, that enables accurate measurement of how well large language models can generate patches to resolve issues in languages other than Python.

What carries the argument

Multi-SWE-bench, the set of 1632 expert-annotated issue instances drawn from repositories in seven languages that functions as the evaluation standard.

If this is right

  • Current agent methods can be compared directly on the same multilingual tasks rather than only on Python.
  • Model performance gaps between languages become measurable and can guide targeted improvements.
  • The released 4,723-instance RL dataset supplies well-structured instances for training agents on patch generation.
  • The open-sourced data production pipeline allows repeated expansion of the dataset without starting from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider language coverage may reveal whether techniques that work in Python transfer directly or require language-specific adjustments.
  • Community growth of the RL dataset could eventually support training loops that iterate on real repository histories rather than synthetic tasks.
  • If the benchmark becomes a standard, progress reports on issue-resolving agents will need to include results from all seven languages to be considered comprehensive.

Load-bearing premise

The filtering and annotation steps performed by the 68 experts produce a set of instances that faithfully represent the difficulty and distribution of real-world issue-resolving work across the covered languages.

What would settle it

A re-annotation round by a fresh group of experts that selects a substantially different subset of instances or assigns markedly different difficulty labels would indicate the original curation did not produce a stable benchmark.
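
One way to make "a substantially different subset" concrete is to measure set overlap between the two annotation rounds. The sketch below is an illustration only, not a procedure from the paper: it assumes each round yields a set of retained instance IDs and scores their agreement with the Jaccard index. The function name and the instance IDs are hypothetical.

```python
def jaccard_overlap(original_ids, reannotated_ids):
    """Jaccard index between two sets of retained instance IDs.

    Returns 1.0 when both rounds keep exactly the same instances and
    0.0 when they share none.
    """
    original = set(original_ids)
    reannotated = set(reannotated_ids)
    if not original and not reannotated:
        return 1.0  # two empty selections agree trivially
    return len(original & reannotated) / len(original | reannotated)


# Hypothetical toy selections, not data from the benchmark.
round_one = {"inst-1", "inst-2", "inst-3", "inst-4"}
round_two = {"inst-2", "inst-3", "inst-4", "inst-5"}
overlap = jaccard_overlap(round_one, round_two)  # 3 shared / 5 total = 0.6
```

What counts as "substantially different" is a judgment call; any cutoff on this score would need to be fixed before the re-annotation round rather than after.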

Original abstract

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Multi-SWE-bench, a multilingual benchmark for the issue-resolving task (generating patches to fix reported issues in codebases). It covers seven languages (Java, TypeScript, JavaScript, Go, Rust, C, C++), contains 1,632 instances curated from 2,456 candidates by 68 expert annotators, evaluates SOTA LLMs via Agentless, SWE-agent, and OpenHands, provides empirical analysis, and releases a larger Multi-SWE-RL dataset of 4,723 instances plus the full open-source data-production pipeline and tutorials.

Significance. If the curation process yields representative, high-quality instances with verifiable patches, the benchmark would meaningfully extend SWE-bench-style evaluation beyond Python, enabling cross-language comparisons of LLM issue-resolving capabilities. The release of the RL dataset and the fully open-sourced pipeline with tutorials is a concrete strength that supports reproducibility and community expansion of training data.

major comments (2)
  1. [Abstract and data-construction section] The central claim that the 1,632 instances are 'high-quality' and enable 'accurate and reliable evaluation' rests on expert annotation from 2,456 candidates by 68 annotators, yet the manuscript reports no inter-annotator agreement statistics, explicit rejection/exclusion criteria, or post-curation verification that selected issues possess gold patches suitable for model testing. This quantitative gap directly affects the trustworthiness of all downstream model comparisons.
  2. [Evaluation section (§4)] The reported results for Agentless, SWE-agent, and OpenHands are presented without language-specific breakdowns or comparisons against a Python-only baseline (e.g., SWE-bench), making it difficult to assess whether the multilingual setting introduces new failure modes or merely replicates known Python behaviors.
minor comments (1)
  1. [Multi-SWE-RL section] The description of the Multi-SWE-RL release would benefit from an explicit statement of how the 4,723 instances differ from the 1,632 benchmark instances (e.g., presence/absence of gold patches, filtering criteria).
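
The language-specific breakdown requested in major comment 2 amounts to grouping per-instance outcomes by language. A minimal sketch follows, assuming each evaluated instance is recorded as a (language, resolved) pair; the record layout, the function name, and the toy outcomes are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def resolve_rate_by_language(results):
    """Return {language: fraction of instances resolved}.

    `results` is an iterable of (language, resolved) pairs, one per
    evaluated instance, where `resolved` is a bool.
    """
    totals = defaultdict(int)
    resolved_counts = defaultdict(int)
    for language, resolved in results:
        totals[language] += 1
        if resolved:
            resolved_counts[language] += 1
    return {lang: resolved_counts[lang] / totals[lang] for lang in totals}


# Made-up outcomes for illustration only.
toy_results = [
    ("Java", True), ("Java", False),
    ("Go", True), ("Go", True),
    ("Rust", False), ("Rust", True), ("Rust", False),
]
rates = resolve_rate_by_language(toy_results)
```

Reporting the per-language instance counts next to each rate keeps small-sample languages from being over-read.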

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where revisions will be made to address the concerns.

Point-by-point responses
  1. Referee: [Abstract and data-construction section] The central claim that the 1,632 instances are 'high-quality' and enable 'accurate and reliable evaluation' rests on expert annotation from 2,456 candidates by 68 annotators, yet the manuscript reports no inter-annotator agreement statistics, explicit rejection/exclusion criteria, or post-curation verification that selected issues possess gold patches suitable for model testing. This quantitative gap directly affects the trustworthiness of all downstream model comparisons.

    Authors: We agree that the absence of these quantitative details weakens the presentation of the curation process. In the revised manuscript, we will expand the data-construction section to include inter-annotator agreement statistics (e.g., Fleiss' kappa computed across the 68 annotators on a sampled subset of candidates). We will also add explicit descriptions of the rejection and exclusion criteria applied during filtering from 2,456 to 1,632 instances, including criteria related to patch verifiability and issue clarity. Finally, we will report the post-curation verification procedure, including any manual or automated checks performed to confirm that the retained gold patches are suitable for model evaluation. revision: yes

  2. Referee: [Evaluation section (§4)] The reported results for Agentless, SWE-agent, and OpenHands are presented without language-specific breakdowns or comparisons against a Python-only baseline (e.g., SWE-bench), making it difficult to assess whether the multilingual setting introduces new failure modes or merely replicates known Python behaviors.

    Authors: We acknowledge that language-specific breakdowns would improve interpretability. In the revised §4, we will add per-language performance tables and figures for all three methods, allowing readers to identify any language-dependent patterns. For comparison against a Python-only baseline, we will include a new discussion paragraph that situates our aggregate results against published SWE-bench numbers from prior work, while noting the inherent differences in repository scale, issue complexity, and benchmark construction that preclude a strictly controlled head-to-head comparison. We believe this addresses the core concern without overstating comparability. revision: partial
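
For readers unfamiliar with the agreement statistic named in response 1, the following is a minimal sketch of Fleiss' kappa in plain Python. It assumes every sampled candidate is labelled by the same number of annotators and that labels come from a fixed category set; the "keep"/"drop" labels and the toy ratings are hypothetical, and nothing here reflects how the authors would actually compute it.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    `ratings` is a list of lists; each inner list holds the labels that
    the raters assigned to one item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings]
    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(p * p for p in proportions)
    if p_chance == 1.0:
        return 1.0  # only one category was ever used; kappa is undefined
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy ratings: 4 candidates, 3 annotators each.
toy_ratings = [
    ["keep", "keep", "keep"],
    ["keep", "keep", "drop"],
    ["drop", "drop", "drop"],
    ["keep", "drop", "drop"],
]
kappa = fleiss_kappa(toy_ratings, ["keep", "drop"])  # 1/3 for this toy input
```

In the toy input, observed agreement is 2/3 and chance agreement is 1/2, giving kappa = (2/3 - 1/2) / (1 - 1/2) = 1/3.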

Circularity Check

0 steps flagged

No significant circularity in empirical dataset construction

full rationale

The paper presents an empirical contribution: the curation of a multilingual issue-resolving benchmark (Multi-SWE-bench) by selecting and annotating 1,632 instances from 2,456 candidates using 68 expert annotators, followed by model evaluations and release of additional RL data. No derivation chain, equations, parameter fitting, or predictions exist that could reduce outputs to inputs by construction. The central claims rest on procedural description of data collection rather than any self-referential logic, self-citation load-bearing premises, or renamed known results. This matches the default expectation for dataset papers, which are typically non-circular when they do not invoke mathematical self-definition or fitted predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that expert human annotation reliably produces high-quality, representative issue-resolving tasks without systematic bias in language coverage or difficulty.

axioms (1)
  • domain assumption: Expert annotators can reliably identify and annotate high-quality issue-resolving instances from candidate pools.
    The paper states that 68 experts curated 1,632 instances from 2,456 candidates to ensure accuracy and reliability.

pith-pipeline@v0.9.0 · 5647 in / 1224 out tokens · 26543 ms · 2026-05-16T06:45:01.500998+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

    cs.AI 2026-04 unverdicted novelty 8.0

    HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

  4. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  5. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  6. Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

    cs.SE 2026-02 unverdicted novelty 7.0

    IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

  7. Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?

    cs.SE 2026-02 conditional novelty 7.0

    Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.

  8. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

  9. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  10. SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

    cs.LG 2026-05 unverdicted novelty 6.0

    SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.

  11. WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 6.0

    WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...

  12. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  13. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...

  14. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  15. AlphaEval: Evaluating Agents in Production

    cs.CL 2026-04 unverdicted novelty 5.0

    AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

  16. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  17. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  18. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  19. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
