arxiv: 2605.25890 · v1 · pith:GJHWDD7Pnew · submitted 2026-05-25 · 💻 cs.LG

Merge-Bench: Resolve Merge Conflicts with Large Language Models

Pith reviewed 2026-06-29 22:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords merge conflicts · version control · large language models · reinforcement learning · Java · Git · software engineering

The pith

A 14B-parameter model trained via reinforcement learning on real merge conflicts outperforms three commercial LLMs on Java.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Merge-Bench, a dataset of 7938 real merge conflict hunks drawn from GitHub repositories, using committed developer resolutions as automatic ground truth. It trains LLMergeJ on Java code with Group Relative Policy Optimization, an online reinforcement learning technique. Evaluation shows LLMergeJ beats three commercial LLMs on Java while trailing only Gemini 2.5 Pro, and that commercial models achieve largely stable but sub-60-percent success rates across eleven languages. The work demonstrates that large language models can be applied directly to version-control merging without manual labeling. This opens a path toward automated assistance for a common, time-consuming developer task.

Core claim

The paper establishes that LLMergeJ, a 14B-parameter model trained with Group Relative Policy Optimization on the Merge-Bench dataset of 7938 merge conflict hunks, outperforms three commercial large language models on resolving Java merge conflicts while trailing only Gemini 2.5 Pro, and that commercial LLM performance remains largely stable but below 60 percent success across eleven programming languages.

What carries the argument

Merge-Bench dataset of real-world merge conflict hunks with developer-committed resolutions as ground truth, used to train LLMergeJ via GRPO reinforcement learning.
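To make the GRPO step concrete, here is a minimal Python sketch of the group-relative advantage that gives the method its name: several candidate resolutions are sampled for the same conflict, and each reward is normalized against the mean and spread of its own group. This is not the authors' training code. The reward values are hypothetical, and using the population standard deviation with a small `eps` is an assumption of this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled resolutions of one conflict:
# 1.0 if the output matches the committed resolution, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Resolutions that beat their group's average get a positive advantage and the rest a negative one, so no separate value model is needed. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.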

If this is right

  • Automated tools could reduce time spent on Java merge conflicts in large codebases.
  • Performance stability across languages suggests the approach does not require per-language retraining.
  • Accuracy below 60 percent implies models would still need human review for the majority of cases.
  • Scalable dataset construction allows training on far larger collections of conflicts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If developers sometimes commit suboptimal merges, both training and evaluation would inherit that noise.
  • Applying the same training method to other languages could test whether language-specific models improve beyond the observed stability.
  • Embedding such models in Git clients might shift team practices around concurrent changes.

Load-bearing premise

That the merge resolutions developers actually committed to the repository are always the correct ground truth for both training and measuring performance.

What would settle it

Independent resolutions of the same conflicts by multiple experienced developers, compared against both the committed resolutions and model outputs to measure agreement rates.
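A rough sketch of how such agreement rates could be tabulated is given below. It assumes each source (the committed resolution, each independent developer, the model) supplies one resolution string per conflict, in the same order, and that agreement means exact string equality. Both are simplifying assumptions, and the source names and data are hypothetical.

```python
from itertools import combinations

def agreement_rate(a, b):
    """Fraction of conflicts on which two resolution lists agree exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pairwise_agreement(resolutions_by_source):
    """Agreement rate for every pair of sources, keyed by (name, name)."""
    names = sorted(resolutions_by_source)
    return {
        (s, t): agreement_rate(resolutions_by_source[s], resolutions_by_source[t])
        for s, t in combinations(names, 2)
    }

# Hypothetical data: four conflicts, three sources.
data = {
    "committed": ["A", "B", "C", "D"],
    "dev1": ["A", "B", "X", "D"],
    "model": ["A", "Y", "X", "D"],
}
rates = pairwise_agreement(data)
```

If independent developers agree with the committed resolution no more often than the model does, the committed resolution is a weak arbiter for those conflicts.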

Figures

Figures reproduced from arXiv: 2605.25890 by Benedikt Schesch, Michael D. Ernst.

Figure 1
Figure 1: Size of the Merge-Bench dataset, by language.
Adjacent paper text (Section 4, Code Comparison Methods): When a tool produces code, the generated code must be compared against the ground truth code. Here are ways to do the comparison. Textual. The generated code is considered correct if it exactly matches the ground truth. This approach is scalable and language-agnostic. However, it is overly restrictive and underestimates the quality of …
Figure 2
Figure 2: User prompt for merge conflict resolution.
Adjacent paper text (Section 5, LLMergeJ Training): We trained LLMergeJ, a small but capable LLM for resolving Java merge conflicts. Our experiments (section 7) evaluated LLMergeJ against commercial LLMs, revealing the strengths and weaknesses of general-purpose commercial LLMs for resolving merge conflicts. Due to resource constraints, we trained LLMergeJ focused on one programming language. …
Figure 3
Figure 3: Merge conflict reward over time during the training process.
[Plot residue from an adjacent figure: x-axis Step (0 to 2000); y-axis Completion Length (500 to 1100); series Completion Length (Raw) and Running Mean.]
Figure 5
Figure 5: Success in merging Java conflicts. “Equivalent text” is a subset of “code normalized equivalent”. The last 4 columns sum to 100%. Best results are shown in bold (1st place) and underlined (2nd place). Proprietary and public models are separated. The SFT baseline uses hyperparameters chosen to perform best on the exact test data used for evaluation. Despite this bias in its favor, the SFT baseline outputs t…
Figure 6
Figure 6: Model performance summary across all languages. Columns are as in fig. 5.
[Plot residue from an adjacent figure: x-axis languages C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript; y-axis Percentage (%), 0 to 100; models Gemini 2.5 Pro, o3 Pro, Claude Opus 4, Grok 4, Qwen3 235B, R1-0528 671B; segments Equivalent text, Code normalized equivalent, Conflict, Different code, Invalid markdown.]
Figure 7
Figure 7: Performance comparison across models and programming languages.
Adjacent paper text: … following the task specification. Surprisingly, this does not prevent the QwQ 32B [16] model from being competitive.
– Conservative resolvers (such as Grok 4, Llama 4 Maverick, Qwen3 8B) preserve 40.1–86.1% of conflicts unresolved, suggesting a cautious approach when facing ambiguous merge scenarios. Outputting a conflict does no harm, but the…
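The outcome segments named with Figure 6 (Equivalent text, Code normalized equivalent, Conflict, Different code, Invalid markdown) suggest a simple decision procedure for scoring one model response. The Python sketch below is one plausible reading, not the paper's evaluation code. It assumes the model is asked to answer inside a fenced Markdown code block, and it uses whitespace collapsing as a stand-in for the paper's normalization, which is not described in the text reproduced here.

```python
import re

TICKS = "`" * 3  # a Markdown code fence, built this way to keep the sketch embeddable
FENCE = re.compile(TICKS + r"[^\n]*\n(.*?)" + TICKS, re.DOTALL)
CONFLICT_MARKERS = ("<<<<<<<", "=======", ">>>>>>>")

def normalize(code):
    """Stand-in normalization: drop blank lines, collapse whitespace runs."""
    lines = (" ".join(line.split()) for line in code.splitlines())
    return "\n".join(line for line in lines if line)

def classify(response, ground_truth):
    """Assign one response to the most specific outcome segment."""
    match = FENCE.search(response)
    if match is None:
        return "invalid markdown"
    code = match.group(1)
    if any(line.startswith(CONFLICT_MARKERS) for line in code.splitlines()):
        return "conflict"
    if code.strip("\n") == ground_truth.strip("\n"):
        return "equivalent text"
    if normalize(code) == normalize(ground_truth):
        return "code normalized equivalent"
    return "different code"
```

Because the Figure 5 caption treats “Equivalent text” as a subset of “code normalized equivalent”, a tabulation in the paper's style would count every `"equivalent text"` result in both columns.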
Original abstract

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.
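For readers unfamiliar with the term, a conflict hunk is the region Git leaves between its `<<<<<<<`, `=======`, and `>>>>>>>` markers when two branches change the same lines. The Python sketch below shows how hunks could be pulled out of a conflicted file. It illustrates the marker format only, not the authors' dataset pipeline, and the sample file content is hypothetical.

```python
def parse_conflict_hunks(text):
    """Return one dict per hunk with its 'ours', 'base', and 'theirs' lines."""
    hunks = []
    current = None  # the hunk being read, or None outside a hunk
    side = None
    for line in text.splitlines():
        if line.startswith("<<<<<<<") and current is None:
            current = {"ours": [], "base": [], "theirs": []}
            side = "ours"
        elif line.startswith("|||||||") and side == "ours":
            side = "base"  # present only with the diff3 conflict style
        elif line.startswith("=======") and side in ("ours", "base"):
            side = "theirs"
        elif line.startswith(">>>>>>>") and side == "theirs":
            hunks.append(current)
            current, side = None, None
        elif current is not None:
            current[side].append(line)
    return hunks

# Hypothetical conflicted file content.
sample = "\n".join([
    "int x = 0;",
    "<<<<<<< HEAD",
    "int y = 1;",
    "=======",
    "int y = 2;",
    ">>>>>>> feature",
    "return x + y;",
])
hunks = parse_conflict_hunks(sample)
```

Pairing each extracted hunk with the corresponding lines of the merge commit the developers actually made would yield a labeled example without manual annotation, which is the scalability property the abstract emphasizes.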

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Merge-Bench, a dataset of 7938 real-world merge conflict hunks from 1439 GitHub repositories, with developer-committed resolutions serving as ground truth. It describes training LLMergeJ, a 14B-parameter LLM using Group Relative Policy Optimization (GRPO) for Java merge conflicts. Evaluations indicate that LLMergeJ outperforms three commercial LLMs on Java while trailing only Gemini 2.5 Pro; across 11 languages, commercial LLM performance is largely stable, with the best models correctly resolving less than 60% of conflicts.

Significance. If the central results hold, the work is significant for applying LLMs and RL to a practical software engineering task. The scalable, automatically constructed dataset (no manual labeling required) is a clear strength that supports large-scale training and benchmarking. The application of GRPO for policy optimization on merge resolution is a methodological contribution. The reported performance ceiling below 60% and cross-language stability provide falsifiable benchmarks that can guide future model development in this domain.

major comments (2)
  1. [Abstract] The performance claims (LLMergeJ outperforming three commercial LLMs on Java; best models resolving <60% of conflicts) rest entirely on treating developer-committed resolutions as correct ground truth for both training (GRPO reward) and evaluation. No validation of ground-truth quality, discussion of multiple valid resolutions, or analysis of selection biases in the 7938 hunks is reported. This assumption is load-bearing; if a non-negligible fraction of committed merges are suboptimal or later revised, both the 14B model win rates and the commercial ceiling become unreliable.
  2. [Abstract] The abstract reports concrete performance numbers but supplies no information on evaluation methodology, how conflicts were selected from the 1439 repositories, statistical significance of the comparisons, or controls for dataset biases. This absence prevents verification of the central claims from the provided text.
minor comments (1)
  1. [Abstract] The manuscript could clarify whether the 11-language evaluation uses the same conflict selection criteria as the Java subset and whether any language-specific preprocessing was applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of ground truth validation and methodological transparency. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract] The performance claims (LLMergeJ outperforming three commercial LLMs on Java; best models resolving <60% of conflicts) rest entirely on treating developer-committed resolutions as correct ground truth for both training (GRPO reward) and evaluation. No validation of ground-truth quality, discussion of multiple valid resolutions, or analysis of selection biases in the 7938 hunks is reported. This assumption is load-bearing; if a non-negligible fraction of committed merges are suboptimal or later revised, both the 14B model win rates and the commercial ceiling become unreliable.

    Authors: We agree that developer-committed resolutions are an imperfect proxy for optimality and that the manuscript does not include explicit validation of ground-truth quality or analysis of selection biases. This is a standard limitation in large-scale, automatically constructed software engineering benchmarks, where manual inspection of thousands of cases is infeasible. However, the approach follows established practice in prior merge-conflict and code-generation datasets. To address the concern directly, we will add a dedicated limitations subsection (in Section 3) that (a) discusses the possibility of multiple valid resolutions, (b) reports any available post-hoc checks on a random sample of hunks, and (c) quantifies repository-level selection criteria to surface potential biases. These additions will not change the reported numbers but will qualify their interpretation. revision: yes

  2. Referee: [Abstract] The abstract reports concrete performance numbers but supplies no information on evaluation methodology, how conflicts were selected from the 1439 repositories, statistical significance of the comparisons, or controls for dataset biases. This absence prevents verification of the central claims from the provided text.

    Authors: The current abstract prioritizes brevity and contribution highlights; full details on hunk selection, evaluation protocol, statistical tests, and bias controls appear in Section 3 (Dataset Construction) and Section 7 (Experiments). We accept that the abstract should be more self-contained for readers who encounter only that section. We will revise the abstract to include one additional sentence summarizing the evaluation methodology, conflict filtering criteria, and the use of exact-match accuracy with statistical significance testing, while keeping the word count within typical limits. revision: yes
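The exchange above mentions statistical significance without naming a test. Since every model is scored on the same conflicts, one common option for paired pass/fail outcomes is McNemar's exact test, sketched below in Python. The choice of test is an assumption of this sketch, not something the paper or the rebuttal specifies, and the outcome lists are hypothetical.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value for paired boolean outcomes."""
    # Only conflicts where exactly one of the two models succeeds are informative.
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Under the null hypothesis, the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcomes on twelve shared conflicts.
model_a = [True] * 8 + [False] * 2 + [True, False]
model_b = [False] * 8 + [True] * 2 + [True, False]
p_value = mcnemar_exact(model_a, model_b)
```

Here model A wins eight discordant conflicts and model B wins two, giving a p-value of about 0.11, so even an 8-to-2 split would not be significant at the 5% level on so few disagreements.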

Circularity Check

0 steps flagged

No significant circularity: external developer resolutions serve as independent ground truth

Full rationale

The paper's dataset and evaluation rest on merge resolutions that developers independently committed to GitHub repositories. These serve as external labels for both GRPO-based training and accuracy metrics. No step reduces a claimed prediction or result to a quantity fitted from the model's own outputs, nor does any load-bearing premise rely on self-citation chains or ansatzes imported from the authors' prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5703 in / 1131 out tokens · 42528 ms · 2026-06-29T22:41:29.799333+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Apel, S., Liebig, J., Brandl, B., Lengauer, C., Kästner, C.: Semistructured merge: Rethinking merge in revision control systems. In: ESEC/FSE. pp. 190–200 (Sep 2011)

  2. [2]

    Brindescu, C., Ahmed, I., Jensen, C., Sarma, A.: An empirical investigation into merge conflicts and their effect on software quality. Empirical Softw. Engg. 25, 562–590 (Jan 2020). https://doi.org/10.1007/s10664-019-09735-4

  3. [3]

    Dinella, E., Mytkowicz, T., Svyatkovskiy, A., Bird, C., Naik, M., Lahiri, S.: DeepMerge: Learning to merge programs. IEEE TSE 49(4), 1599–1614 (Apr 2023)

  4. [4]

    Dong, J., Zhu, Q., Sun, Z., Lou, Y., Hao, D.: Merge conflict resolution: Classification or generation? In: ASE. pp. 1652–1663 (Sep 2023)

  5. [5]

    Gousios, G.: The GHTorrent dataset and tool suite. In: MSR. pp. 233–236 (May 2013). https://doi.org/10.1109/MSR.2013.6624034

  6. [6]

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (Sep 2025). https://doi.org/10.1038/s41586-025-09422-z

  7. [7]

    Han, D., Han, M., Unsloth Team: Unsloth (2023), http://github.com/unslothai/unsloth

  8. [8]

    Jain, N., Han, K., Gu, A., Li, W.D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., Stoica, I.: LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974 (June 2024)

  9. [9]

    Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.: SWE-bench: Can language models resolve real-world GitHub issues? https://arxiv.org/abs/2310.06770 (Nov 2024)

  10. [10]

    Larsén, S., Falleri, J.R., Baudry, B., Monperrus, M.: Spork: Structured merge for Java with formatting preservation. IEEE TSE 49(01), 64–83 (Jan 2023)

  11. [11]

    Leßenich, O., Apel, S., Lengauer, C.: Balancing precision and performance in structured merge. ASE 22(3), 367–397 (May 2014)

  12. [12]

    Lindenbauer, T., Bogomolov, E., Zharov, Y.: GitGoodBench: A novel benchmark for evaluating agentic performance on Git. https://arxiv.org/abs/2505.22583 (May 2025)

  13. [13]

    Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating GitHub for engineered software projects. Empirical Softw. Engg. 22(6), 3219–3253 (Dec 2017)

  14. [14]

    OpenAI Preparedness, NLP, P.: Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/ (August 2024)

  15. [15]

    Qi, Z., Long, F., Achour, S., Rinard, M.: An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In: ISSTA. pp. 24–36 (July 2015). https://doi.org/10.1145/2771783.2771791

  16. [16]

    Qwen Team: QwQ-32B: Embracing the power of reinforcement learning (March 2025), https://qwenlm.github.io/blog/qwq-32b/

  17. [17]

    Rice, H.G.: Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society 74(2), 358–366 (1953)

  18. [18]

    Sakana AI: The AI CUDA engineer: Agentic CUDA kernel discovery, optimization and composition. https://sakana.ai/ai-cuda-engineer/#limitations-and-bloopers (Feb 2025)

  19. [19]

    Schesch, B., Featherman, R., Yang, K.J., Roberts, B.R., Ernst, M.D.: Evaluation of version control merge tools. In: ASE. pp. 831–843 (Oct 2024)

  20. [20]

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017), http://arxiv.org/abs/1707.06347

  21. [21]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. https://arxiv.org/abs/2402.03300 (Apr 2024)

  22. [22]

    Shen, B., Meng, N.: ConflictBench: A benchmark to evaluate software merge tools. J. Sys. Softw. 214 (2024)

  23. [23]

    Shen, C., Yang, W., Pan, M., Zhou, Y.: Git merge conflict resolution leveraging strategy classification and LLM. In: 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). pp. 228–239 (2023)

  24. [24]

    Svyatkovskiy, A., Fakhoury, S., Ghorbani, N., Mytkowicz, T., Dinella, E., Bird, C., Jang, J., Sundaresan, N., Lahiri, S.K.: Program merge conflict resolution via neural transformers. In: ESEC/FSE. pp. 822–833 (Nov 2022)

  25. [25]

    Trindade Tavares, A., Borba, P., Cavalcanti, G., Soares, S.: Semistructured merge in JavaScript systems. In: ASE. pp. 1014–1025 (Sep 2019)

  26. [26]

    Weimer, W., Nguyen, T., Le Goues, C., Forrest, S.: Automatically finding patches using genetic programming. In: ICSE. pp. 364–374 (May 2009)

  27. [27]

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., et al.: Qwen3 technical report. https://arxiv.org/abs/2505.09388 (May 2025)

  28. [28]

    Zhang, J., Mytkowicz, T., Kaufman, M., Piskac, R., Lahiri, S.K.: Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper). In: ISSTA. pp. 77–88 (July 2022)

  29. [29]

    Zhang, Q., Su, L., Ye, K., Qian, C.: ConGra: Benchmarking automatic conflict resolution. https://arxiv.org/abs/2409.14121 (Sep 2024)