React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3
The pith
SWE-Bench rankings do not predict which open-weights coding model generates the most complete React Native application.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On this React Native generation task, Kimi-K2.5 at UD-Q3_K_XL quantization produces the most complete and specification-compliant project that runs out-of-the-box, outranking models with higher SWE-Bench Pro scores. Default temperature=0 causes sampling hangs in reasoning architectures, thinking traces leak through file-path parsers, and every model shows a training-data gap in adapting native-mobile APIs to web platforms. The models divide into two architectural schools: an efficiency school (10-15 B active parameters) that matches the scale school (32-40 B active parameters) on SWE-Bench at roughly one-seventh the hardware cost.
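Two of those deployment findings lend themselves to a small illustration. The sketch below is not taken from the paper: it assumes a local llama.cpp server exposing its OpenAI-compatible /v1/chat/completions route and a reasoning model that wraps its traces in <think>...</think> tags; the endpoint URL, the tag format, and the temperature value are all assumptions made for the example.

```typescript
// Finding 1 (sketch): pass an explicit non-zero temperature instead of
// relying on a coding tool's default of 0, which the paper reports can
// hang sampling on reasoning-model architectures.
async function generate(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'local',
      temperature: 0.6, // illustrative value, not the paper's setting
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}

// Finding 2 (sketch): strip thinking traces before handing output to a
// file-path parser, so a leaked trace cannot be misread as a path or
// file body. The <think> tag convention is an assumption.
function stripThinking(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}
```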
What carries the argument
A single multi-file React Native application generation task with explicit requirements for authentication, per-user daily counting, and web compatibility, used to measure out-of-the-box execution and feature-level correctness across models.
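To make the counting requirement concrete, here is a minimal sketch (mine, not the paper's) of per-user per-day counting: tallies are keyed by user id and calendar date behind a generic storage interface, so the same logic can sit on native or web persistence. The Storage interface and helper names are stand-ins, not code from the evaluated outputs.

```typescript
// Stand-in persistence layer; a generated app would back this with
// whatever storage the model chose.
interface Storage {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// One key per user per UTC calendar day, e.g. "count:alice:2026-04-18".
function dailyKey(userId: string, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  return `count:${userId}:${day}`;
}

// Increment and return today's count for this user.
export async function incrementDailyCount(
  store: Storage,
  userId: string,
): Promise<number> {
  const key = dailyKey(userId);
  const current = Number((await store.get(key)) ?? '0');
  const next = current + 1;
  await store.set(key, String(next));
  return next;
}
```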
If this is right
- Kimi-K2.5 at 3-bit quantization succeeds on the full task where higher-ranked models do not.
- Reasoning models require temperature settings other than the default zero to avoid sampling hangs during code generation.
- All tested models lack training coverage for web-platform adaptations of native mobile APIs.
- Open-weights coding models divide into an efficiency school and a scale school with different hardware demands.
- The efficiency school delivers comparable SWE-Bench results at about one-seventh the hardware cost of the scale school.
Where Pith is reading between the lines
- Developers may benefit from testing candidate models directly on their target application domain instead of selecting by benchmark rank alone.
- Aggressive quantization can sometimes improve output compliance by limiting extraneous reasoning steps.
- New evaluation suites should incorporate cross-platform mobile-to-web tasks to close the documented training gap.
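As a concrete instance of the mobile-to-web adaptation these bullets refer to, the sketch below shows the kind of platform branch the task's web-compatibility requirement forces. The specific choices are illustrative assumptions, not details from the paper: react-native-web reporting Platform.OS === 'web', and @react-native-async-storage/async-storage as the native backend.

```typescript
import { Platform } from 'react-native';

// Persist a small value with a platform-appropriate backend. On web
// (under react-native-web) localStorage is available; on native, a
// library such as @react-native-async-storage/async-storage would
// typically be used. Both choices are assumptions for this sketch.
export async function persistCount(key: string, value: number): Promise<void> {
  if (Platform.OS === 'web') {
    window.localStorage.setItem(key, String(value));
    return;
  }
  const AsyncStorage = (
    await import('@react-native-async-storage/async-storage')
  ).default;
  await AsyncStorage.setItem(key, String(value));
}
```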
Load-bearing premise
That performance on this single multi-file React Native task, with one run per model, reflects general coding capability across tasks.
What would settle it
Running the identical React Native generation prompt multiple times with each model and checking whether the quality ranking among outputs remains stable.
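A minimal harness for that check might look like the following sketch; generateProject and scoreProject are hypothetical stand-ins for the generation call and the scoring rubric, and the ranking comparison is simply per-run sorting by score.

```typescript
type RunResult = { model: string; run: number; score: number };

// Run the same prompt `runs` times per model, score each output, and
// return one model ranking per run. Identical rankings across runs
// would indicate the reported ordering is stable rather than noise.
async function rankStability(
  models: string[],
  runs: number,
  generateProject: (model: string) => Promise<string>,
  scoreProject: (project: string) => Promise<number>,
): Promise<string[][]> {
  const results: RunResult[] = [];
  for (const model of models) {
    for (let run = 0; run < runs; run++) {
      const project = await generateProject(model);
      results.push({ model, run, score: await scoreProject(project) });
    }
  }
  const rankings: string[][] = [];
  for (let run = 0; run < runs; run++) {
    rankings.push(
      results
        .filter((r) => r.run === run)
        .sort((a, b) => b.score - a.score)
        .map((r) => r.model),
    );
  }
  return rankings;
}
```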
Original abstract
We evaluate five state-of-the-art open-weights coding language models -- Kimi-K2.5 (at Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2 -- on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance: Kimi-K2.5 at aggressive 3-bit quantization (UD-Q3_K_XL, 480 GB) produces the most complete and specification-compliant output, outranking models with substantially higher SWE-Bench Pro scores. We document three novel deployment findings: (1) default temperature=0 in coding tools causes sampling hangs with reasoning-model architectures, (2) reasoning-model thinking traces can leak through integration tools' file-path parsers, and (3) web-platform adaptation of native-mobile APIs is a universal training-data gap across every model tested. We also map the hardware-tier structure of April 2026 open-weights coding models, identifying two architectural schools and showing that the efficiency school (10-15 B active parameters) delivers equivalent SWE-Bench results at roughly 1/7th the hardware cost of the scale school (32-40 B active parameters).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates five open-weights coding models (Kimi-K2.5 at Q3/Q4, GLM-5.1, Qwen3-Coder-480B, DeepSeek-V3.2) on a single multi-file React Native task requiring authentication, per-user per-day counting, and web compatibility, run once each on GH200 hardware. It claims SWE-Bench Pro rankings fail to predict performance, with aggressively quantized Kimi-K2.5 producing the most complete output; it also reports temperature hangs, trace leakage through file parsers, universal web-API gaps, and a two-school hardware taxonomy (efficiency vs. scale) for April 2026 models.
Significance. If the inversion result were shown to be robust, it would usefully caution against over-reliance on SWE-Bench for selecting models on narrow but realistic coding tasks and would supply immediately actionable deployment observations. The hardware-tier mapping is a modest but concrete contribution to understanding current open-weights scaling. The single-task, single-run design, however, prevents the central claim from reaching the evidentiary threshold expected in empirical software-engineering work.
major comments (3)
- [Abstract and §4] Abstract and §4 (Results): the claim that 'SWE-Bench rankings do not predict task performance' rests on one generation per model for a single task. No repeated trials, seed variation, temperature sweeps, or error bars are reported, so the observed ranking inversion cannot be distinguished from sampling noise or prompt-specific behavior.
- [§3] §3 (Methodology): the evaluation criteria for 'complete and specification-compliant output' and 'feature-level correctness' are not operationalized; it is unclear how partial successes, runtime errors, or web-compatibility failures were scored, preventing both replication and assessment of whether the task is representative.
- [§5] §5 (Discussion): the generalization that web-platform adaptation of native-mobile APIs is 'a universal training-data gap across every model tested' is based on the same single-task sample; without additional tasks or a systematic API-coverage analysis, the claim exceeds the evidence.
minor comments (3)
- [Table 1] Table 1 (model specifications) should include exact active-parameter counts and context lengths used in the runs rather than nominal values.
- [Appendix] The prompt template and any system instructions should be reproduced verbatim in an appendix to allow reproduction.
- [Figure 2] Figure 2 (hardware-tier diagram) would benefit from explicit labeling of the two architectural schools and the 1/7th cost ratio calculation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our exploratory evaluation of open-weights coding models for a realistic React Native task. We address each major comment below, clarifying the scope of our single-run study conducted under hardware and time constraints, and have made targeted revisions to improve operationalization and temper generalizations while preserving the value of the observed performance inversion and deployment observations.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Results): the claim that 'SWE-Bench rankings do not predict task performance' rests on one generation per model for a single task. No repeated trials, seed variation, temperature sweeps, or error bars are reported, so the observed ranking inversion cannot be distinguished from sampling noise or prompt-specific behavior.
Authors: We acknowledge that our evaluation consists of a single generation per model on one task, driven by the substantial compute cost of the GH200 and the one-weekend study window. The inversion remains noteworthy because the 3-bit Kimi-K2.5 model produced a fully functional, specification-compliant application while higher-SWE-Bench models failed on core features such as authentication and web compatibility. We have added a Limitations subsection in §5 that explicitly discusses the single-run design, potential prompt sensitivity, and the need for future multi-seed and multi-task studies. We maintain that the result still usefully cautions against sole reliance on SWE-Bench for narrow, real-world coding tasks.
Revision: partial
Referee: [§3] §3 (Methodology): the evaluation criteria for 'complete and specification-compliant output' and 'feature-level correctness' are not operationalized; it is unclear how partial successes, runtime errors, or web-compatibility failures were scored, preventing both replication and assessment of whether the task is representative.
Authors: We have revised §3 to operationalize the criteria. 'Complete and specification-compliant output' requires the project to build and execute without errors on both native simulators and web browsers, with all three required features (authentication, per-user per-day counting, web compatibility) implemented. Feature-level correctness is scored by enumerating each specified requirement and verifying correct behavior via manual execution tests; partial successes are recorded as the number of successfully implemented features. The full task prompt and scoring rubric have been added to the appendix to support replication.
Revision: yes
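Read literally, the revised rubric amounts to a feature checklist. The sketch below is one possible encoding of it, not the authors' tooling; manualTest is a hypothetical stand-in for the manual execution tests the response describes.

```typescript
type FeatureCheck = { name: string; passes: (projectDir: string) => boolean };

// The three feature names come from the task specification; the check
// functions defer to hypothetical manual-test adapters.
const rubric: FeatureCheck[] = [
  { name: 'authentication', passes: (dir) => manualTest(dir, 'auth') },
  { name: 'per-user per-day counting', passes: (dir) => manualTest(dir, 'count') },
  { name: 'web compatibility', passes: (dir) => manualTest(dir, 'web') },
];

// Partial success = number of features verified, per the revised §3.
function scoreProject(projectDir: string): number {
  return rubric.filter((f) => f.passes(projectDir)).length;
}

// Placeholder for the manual execution tests (hypothetical helper).
declare function manualTest(projectDir: string, feature: string): boolean;
```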
Referee: [§5] §5 (Discussion): the generalization that web-platform adaptation of native-mobile APIs is 'a universal training-data gap across every model tested' is based on the same single-task sample; without additional tasks or a systematic API-coverage analysis, the claim exceeds the evidence.
Authors: We agree the original phrasing was too broad. We have revised §5 to state that web-platform adaptation of native-mobile APIs 'represents a consistent training-data gap observed across all models tested on this task' and now qualify the observation as suggestive of a potential broader issue that warrants systematic study with additional tasks and API-coverage analysis. The consistent failure remains a practically useful deployment finding for cross-platform development.
Revision: partial
Circularity Check
No circularity: direct empirical evaluation with no derivations or predictions
Full rationale
The paper conducts a straightforward empirical comparison of five open-weights coding models on one specific multi-file React Native generation task, reporting observed outcomes for out-of-the-box execution and feature correctness. No equations, parameter fitting, model-based predictions, or derivation chains appear in the abstract or described content. The claim that SWE-Bench rankings do not predict task performance is presented as an observation from the single-run results rather than a derived quantity. No self-citations of theorems, ansatzes, or uniqueness results are invoked as load-bearing steps. This is a self-contained empirical report whose findings rest on direct measurement, not on any reduction to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Moonshot AI. Kimi-K2.5 model card. https://huggingface.co/moonshotai/Kimi-K2.5, 2026.
- [2] Z.AI. GLM-5.1: From Vibe Coding to Agentic Engineering. https://huggingface.co/zai-org/GLM-5.1, 2026.
- [3] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. https://huggingface.co/deepseek-ai/DeepSeek-V3.2, 2025.
- [4] Qwen Team. Qwen3-Coder-480B-A35B-Instruct. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct, 2025.
- [5] MiniMax AI. MiniMax-M2.5 model card. https://huggingface.co/MiniMaxAI/MiniMax-M2.5, 2026.
- [6] MiniMax AI. MiniMax-M2.7 model card. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026.
- [7] Xiaomi LLM-Core. MiMo-V2-Flash: Multi-Token Prediction with Hybrid Attention. https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash, 2025.
- [8] Unsloth AI. Dynamic 2.0 GGUF quantization methodology. https://unsloth.ai/blog/dynamic-v2, 2025.
- [9] Paul Gauthier. Aider: AI pair programming in your terminal. https://aider.chat, 2023–2026.
- [10] Georgi Gerganov and contributors. llama.cpp: Inference of Meta's LLaMA model in pure C/C++. https://github.com/ggml-org/llama.cpp, 2023–2026.
- [11] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024. https://arxiv.org/abs/2310.06770.