{"paper":{"title":"SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41% of real GitHub issues.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Daniel Fried, Gabriel Synnaeve, Jade Copet, Lingming Zhang, Olivier Duchenne, Quentin Carbonneaux, Rishabh Singh, Sida I. Wang, Yuxiang Wei","submitted_at":"2025-02-25T18:45:04Z","abstract_excerpt":"The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reaso"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that a lightweight rule-based similarity score between ground-truth and generated solutions serves as an effective reward for learning genuine reasoning processes rather than superficial pattern matching.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41% of real GitHub issues.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"fe6533c6d9e271d0e7521ce818a7e23e44e6ebcdf982e1eb69132308f951c1fa"},"source":{"id":"2502.18449","kind":"arxiv","version":2},"verdict":{"id":"983ad494-552f-47b5-9c12-ce1cc3d6d6fe","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T10:23:06.408182Z","strongest_claim":"our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o.","one_line_summary":"SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that a lightweight rule-based similarity score between ground-truth and generated solutions serves as an effective reward for learning genuine reasoning processes rather than superficial pattern matching.","pith_extraction_headline":"Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41% of real GitHub issues."},"references":{"count":192,"sample":[{"doi":"","year":2024,"title":"Claude 3.5 sonnet model card addendum","work_id":"9821ab87-1805-43e6-8f4f-9a06dc3c9f37","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet","work_id":"1e26d961-6bb1-4e30-b195-245b5a95cfb1","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Codet: Code generation with generated tests","work_id":"5034399f-3a73-4a3e-824b-0fe8fe4d82e7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Mich","work_id":"f06d44fc-f5c4-4ab7-951d-3eba0cbf5e88","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Meta large language model compiler: Foundation models of compiler optimization, 2024","work_id":"e9ffd682-76e4-4ec3-9520-c34bf4936d2c","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":192,"snapshot_sha256":"baf37ceea2c212ab306189dc3981b7be94026341c1a43e50528c8a73afef1435","internal_anchors":15},"formal_canon":{"evidence_count":3,"snapshot_sha256":"5087d323b022ddc79eceb574c084178530a3aab19f5ee8cc213f8524576e8e00"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}