React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3
The pith
SWE-Bench rankings do not predict which open-weights coding model generates the most complete React Native application.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On this React Native generation task, Kimi-K2.5 at UD-Q3_K_XL quantization produces the most complete and specification-compliant project that runs out-of-the-box, outranking models with higher SWE-Bench Pro scores. Default temperature=0 causes sampling hangs in reasoning architectures, thinking traces leak through file-path parsers, and every model shows a training-data gap in adapting native-mobile APIs to web platforms. The models divide into two architectural schools: an efficiency school (10-15 B active parameters) that matches the scale school (32-40 B active parameters) on SWE-Bench at roughly one-seventh the hardware cost.
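Two of those deployment findings lend themselves to a small illustration. The sketch below is not taken from the paper: it assumes a local llama.cpp server exposing its OpenAI-compatible /v1/chat/completions route and a reasoning model that wraps its traces in <think>...</think> tags; the endpoint URL, the tag format, and the temperature value are all assumptions made for the example.

```typescript
// Finding 1 (sketch): pass an explicit non-zero temperature instead of
// relying on a coding tool's default of 0, which the paper reports can
// hang sampling on reasoning-model architectures.
async function generate(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'local',
      temperature: 0.6, // illustrative value, not the paper's setting
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}

// Finding 2 (sketch): strip thinking traces before handing output to a
// file-path parser, so a leaked trace cannot be misread as a path or
// file body. The <think> tag convention is an assumption.
function stripThinking(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}
```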
What carries the argument
A single multi-file React Native application generation task with explicit requirements for authentication, per-user daily counting, and web compatibility, used to measure out-of-the-box execution and feature-level correctness across models.
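To make the counting requirement concrete, here is a minimal sketch (mine, not the paper's) of per-user per-day counting: tallies are keyed by user id and calendar date behind a generic storage interface, so the same logic can sit on native or web persistence. The Storage interface and helper names are stand-ins, not code from the evaluated outputs.

```typescript
// Stand-in persistence layer; a generated app would back this with
// whatever storage the model chose.
interface Storage {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// One key per user per UTC calendar day, e.g. "count:alice:2026-04-18".
function dailyKey(userId: string, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  return `count:${userId}:${day}`;
}

// Increment and return today's count for this user.
export async function incrementDailyCount(
  store: Storage,
  userId: string,
): Promise<number> {
  const key = dailyKey(userId);
  const current = Number((await store.get(key)) ?? '0');
  const next = current + 1;
  await store.set(key, String(next));
  return next;
}
```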
If this is right
- Kimi-K2.5 at 3-bit quantization succeeds on the full task where higher-ranked models do not.
- Reasoning models require temperature settings other than the default zero to avoid sampling hangs during code generation.
- All tested models lack training coverage for web-platform adaptations of native mobile APIs.
- Open-weights coding models divide into an efficiency school and a scale school with different hardware demands.
- The efficiency school delivers comparable SWE-Bench results at about one-seventh the hardware cost of the scale school.
Where Pith is reading between the lines
- Developers may benefit from testing candidate models directly on their target application domain instead of selecting by benchmark rank alone.
- Aggressive quantization can sometimes improve output compliance by limiting extraneous reasoning steps.
- New evaluation suites should incorporate cross-platform mobile-to-web tasks to close the documented training gap.
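As a concrete instance of the mobile-to-web adaptation these bullets refer to, the sketch below shows the kind of platform branch the task's web-compatibility requirement forces. The specific choices are illustrative assumptions, not details from the paper: react-native-web reporting Platform.OS === 'web', and @react-native-async-storage/async-storage as the native backend.

```typescript
import { Platform } from 'react-native';

// Persist a small value with a platform-appropriate backend. On web
// (under react-native-web) localStorage is available; on native, a
// library such as @react-native-async-storage/async-storage would
// typically be used. Both choices are assumptions for this sketch.
export async function persistCount(key: string, value: number): Promise<void> {
  if (Platform.OS === 'web') {
    window.localStorage.setItem(key, String(value));
    return;
  }
  const AsyncStorage = (
    await import('@react-native-async-storage/async-storage')
  ).default;
  await AsyncStorage.setItem(key, String(value));
}
```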
Load-bearing premise
That performance on this single multi-file React Native task, with one run per model, reflects general coding capability across tasks.
What would settle it
Running the identical React Native generation prompt multiple times with each model and checking whether the quality ranking among outputs remains stable.
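A minimal harness for that check might look like the following sketch; generateProject and scoreProject are hypothetical stand-ins for the generation call and the scoring rubric, and the ranking comparison is simply per-run sorting by score.

```typescript
type RunResult = { model: string; run: number; score: number };

// Run the same prompt `runs` times per model, score each output, and
// return one model ranking per run. Identical rankings across runs
// would indicate the reported ordering is stable rather than noise.
async function rankStability(
  models: string[],
  runs: number,
  generateProject: (model: string) => Promise<string>,
  scoreProject: (project: string) => Promise<number>,
): Promise<string[][]> {
  const results: RunResult[] = [];
  for (const model of models) {
    for (let run = 0; run < runs; run++) {
      const project = await generateProject(model);
      results.push({ model, run, score: await scoreProject(project) });
    }
  }
  const rankings: string[][] = [];
  for (let run = 0; run < runs; run++) {
    rankings.push(
      results
        .filter((r) => r.run === run)
        .sort((a, b) => b.score - a.score)
        .map((r) => r.model),
    );
  }
  return rankings;
}
```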
Original abstract
We evaluate five state-of-the-art open-weights coding language models -- Kimi-K2.5 (at Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2 -- on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance: Kimi-K2.5 at aggressive 3-bit quantization (UD-Q3_K_XL, 480 GB) produces the most complete and specification-compliant output, outranking models with substantially higher SWE-Bench Pro scores. We document three novel deployment findings: (1) default temperature=0 in coding tools causes sampling hangs with reasoning-model architectures, (2) reasoning-model thinking traces can leak through integration tools' file-path parsers, and (3) web-platform adaptation of native-mobile APIs is a universal training-data gap across every model tested. We also map the hardware-tier structure of April 2026 open-weights coding models, identifying two architectural schools and showing that the efficiency school (10-15 B active parameters) delivers equivalent SWE-Bench results at roughly 1/7th the hardware cost of the scale school (32-40 B active parameters).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates five open-weights coding models (Kimi-K2.5 at Q3/Q4, GLM-5.1, Qwen3-Coder-480B, DeepSeek-V3.2) on a single multi-file React Native task requiring authentication, per-user per-day counting, and web compatibility, run once each on GH200 hardware. It claims SWE-Bench Pro rankings fail to predict performance, with aggressively quantized Kimi-K2.5 producing the most complete output; it also reports temperature hangs, trace leakage through file parsers, universal web-API gaps, and a two-school hardware taxonomy (efficiency vs. scale) for April 2026 models.
Significance. If the inversion result were shown to be robust, it would usefully caution against over-reliance on SWE-Bench for selecting models on narrow but realistic coding tasks and would supply immediately actionable deployment observations. The hardware-tier mapping is a modest but concrete contribution to understanding current open-weights scaling. The single-task, single-run design, however, prevents the central claim from reaching the evidentiary threshold expected in empirical software-engineering work.
major comments (3)
- [Abstract and §4] Abstract and §4 (Results): the claim that 'SWE-Bench rankings do not predict task performance' rests on one generation per model for a single task. No repeated trials, seed variation, temperature sweeps, or error bars are reported, so the observed ranking inversion cannot be distinguished from sampling noise or prompt-specific behavior.
- [§3] §3 (Methodology): the evaluation criteria for 'complete and specification-compliant output' and 'feature-level correctness' are not operationalized; it is unclear how partial successes, runtime errors, or web-compatibility failures were scored, preventing both replication and assessment of whether the task is representative.
- [§5] §5 (Discussion): the generalization that web-platform adaptation of native-mobile APIs is 'a universal training-data gap across every model tested' is based on the same single-task sample; without additional tasks or a systematic API-coverage analysis, the claim exceeds the evidence.
minor comments (3)
- [Table 1] Table 1 (model specifications) should include exact active-parameter counts and context lengths used in the runs rather than nominal values.
- [Appendix] The prompt template and any system instructions should be reproduced verbatim in an appendix to allow reproduction.
- [Figure 2] Figure 2 (hardware-tier diagram) would benefit from explicit labeling of the two architectural schools and the 1/7th cost ratio calculation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our exploratory evaluation of open-weights coding models for a realistic React Native task. We address each major comment below, clarifying the scope of our single-run study conducted under hardware and time constraints, and have made targeted revisions to improve operationalization and temper generalizations while preserving the value of the observed performance inversion and deployment observations.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Results): the claim that 'SWE-Bench rankings do not predict task performance' rests on one generation per model for a single task. No repeated trials, seed variation, temperature sweeps, or error bars are reported, so the observed ranking inversion cannot be distinguished from sampling noise or prompt-specific behavior.
Authors: We acknowledge that our evaluation consists of a single generation per model on one task, driven by the substantial compute cost of the GH200 and the one-weekend study window. The inversion remains noteworthy because the 3-bit Kimi-K2.5 model produced a fully functional, specification-compliant application while higher-SWE-Bench models failed on core features such as authentication and web compatibility. We have added a Limitations subsection in §5 that explicitly discusses the single-run design, potential prompt sensitivity, and the need for future multi-seed and multi-task studies. We maintain that the result still usefully cautions against sole reliance on SWE-Bench for narrow, real-world coding tasks.
Revision: partial
Referee: [§3] §3 (Methodology): the evaluation criteria for 'complete and specification-compliant output' and 'feature-level correctness' are not operationalized; it is unclear how partial successes, runtime errors, or web-compatibility failures were scored, preventing both replication and assessment of whether the task is representative.
Authors: We have revised §3 to operationalize the criteria. 'Complete and specification-compliant output' requires the project to build and execute without errors on both native simulators and web browsers, with all three required features (authentication, per-user per-day counting, web compatibility) implemented. Feature-level correctness is scored by enumerating each specified requirement and verifying correct behavior via manual execution tests; partial successes are recorded as the number of successfully implemented features. The full task prompt and scoring rubric have been added to the appendix to support replication.
Revision: yes
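Read literally, the revised rubric amounts to a feature checklist. The sketch below is one possible encoding of it, not the authors' tooling; manualTest is a hypothetical stand-in for the manual execution tests the response describes.

```typescript
type FeatureCheck = { name: string; passes: (projectDir: string) => boolean };

// The three feature names come from the task specification; the check
// functions defer to hypothetical manual-test adapters.
const rubric: FeatureCheck[] = [
  { name: 'authentication', passes: (dir) => manualTest(dir, 'auth') },
  { name: 'per-user per-day counting', passes: (dir) => manualTest(dir, 'count') },
  { name: 'web compatibility', passes: (dir) => manualTest(dir, 'web') },
];

// Partial success = number of features verified, per the revised §3.
function scoreProject(projectDir: string): number {
  return rubric.filter((f) => f.passes(projectDir)).length;
}

// Placeholder for the manual execution tests (hypothetical helper).
declare function manualTest(projectDir: string, feature: string): boolean;
```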
Referee: [§5] §5 (Discussion): the generalization that web-platform adaptation of native-mobile APIs is 'a universal training-data gap across every model tested' is based on the same single-task sample; without additional tasks or a systematic API-coverage analysis, the claim exceeds the evidence.
Authors: We agree the original phrasing was too broad. We have revised §5 to state that web-platform adaptation of native-mobile APIs 'represents a consistent training-data gap observed across all models tested on this task' and now qualify the observation as suggestive of a potential broader issue that warrants systematic study with additional tasks and API-coverage analysis. The consistent failure remains a practically useful deployment finding for cross-platform development.
Revision: partial
Circularity Check
No circularity: direct empirical evaluation with no derivations or predictions
Full rationale
The paper conducts a straightforward empirical comparison of five open-weights coding models on one specific multi-file React Native generation task, reporting observed outcomes for out-of-the-box execution and feature correctness. No equations, parameter fitting, model-based predictions, or derivation chains appear in the abstract or described content. The claim that SWE-Bench rankings do not predict task performance is presented as an observation from the single-run results rather than a derived quantity. No self-citations of theorems, ansatzes, or uniqueness results are invoked as load-bearing steps. This is a self-contained empirical report whose findings rest on direct measurement, not on any reduction to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Moonshot AI. Kimi-K2.5 model card. https://huggingface.co/moonshotai/Kimi-K2.5, 2026.
- [2] Z.AI. GLM-5.1: From Vibe Coding to Agentic Engineering. https://huggingface.co/zai-org/GLM-5.1, 2026.
- [3] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. https://huggingface.co/deepseek-ai/DeepSeek-V3.2, 2025.
- [4] Qwen Team. Qwen3-Coder-480B-A35B-Instruct. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct, 2025.
- [5] MiniMax AI. MiniMax-M2.5 model card. https://huggingface.co/MiniMaxAI/MiniMax-M2.5, 2026.
- [6] MiniMax AI. MiniMax-M2.7 model card. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026.
- [7] Xiaomi LLM-Core. MiMo-V2-Flash: Multi-Token Prediction with Hybrid Attention. https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash, 2025.
- [8] Unsloth AI. Dynamic 2.0 GGUF quantization methodology. https://unsloth.ai/blog/dynamic-v2, 2025.
- [9] Paul Gauthier. Aider: AI pair programming in your terminal. https://aider.chat, 2023–2026.
- [10] Georgi Gerganov and contributors. llama.cpp: Inference of Meta's LLaMA model in pure C/C++. https://github.com/ggml-org/llama.cpp, 2023–2026.
- [11] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024. https://arxiv.org/abs/2310.06770.