pith. machine review for the scientific record.

arxiv: 2604.18566 · v2 · submitted 2026-04-20 · 💻 cs.AI · cs.HC · cs.LG

Recognition: unknown

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.LG
keywords system dynamics · causal loop diagrams · LLM benchmarking · cloud versus local models · GGUF · MLX · quantization effects · interactive model discussion

The pith

Cloud LLMs reach 77-89% on causal loop diagram extraction, the best local model hits 77%, and backend choice outweighs quantization effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates proprietary cloud and locally hosted large language models on two new benchmarks designed for system dynamics assistance. One benchmark measures success at extracting structured causal loop diagrams from text, while the other tests interactive discussion, feedback explanation, and coaching for building system dynamics models. Cloud models lead overall on diagram extraction, yet the strongest local model matches mid-tier cloud performance; discussion tasks expose larger gaps, especially when local models must fix errors in long prompts. The analysis isolates effects from model architecture, backend software, and compression level to guide practical use.

Core claim

Systematic tests on the CLD Leaderboard show cloud models attaining 77-89% overall pass rates on structured causal loop diagram extraction, with the best local model (Kimi K2.5, GGUF Q3, zero-shot) also reaching 77% and matching mid-tier cloud results. On the Discussion Leaderboard, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. Backend choice between GGUF and MLX produces larger performance differences than quantization level, with GGUF providing reliable JSON schema adherence at the cost of occasional indefinite generation on dense long-context prompts.

What carries the argument

The CLD Leaderboard and Discussion Leaderboard, two purpose-built test suites that measure structured causal loop diagram extraction and interactive model discussion, feedback, and coaching capabilities.
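
To make the extraction target concrete, the sketch below shows the kind of structured output a CLD test might expect. The paper does not reproduce its test schema on this page, so the example text, field names, and link format here are illustrative assumptions only.

```python
# Illustrative only: the actual CLD Leaderboard schema and test items are defined by the
# paper; the prompt text, field names, and link format below are assumptions.
example_cld_test = {
    "prompt": ("As word of mouth spreads, more customers adopt the product, "
               "which in turn generates more word of mouth."),
    "expected": {
        "variables": ["word of mouth", "customer adoption"],
        "links": [  # (cause, effect, polarity)
            ("word of mouth", "customer adoption", "+"),
            ("customer adoption", "word of mouth", "+"),
        ],
        "loops": [
            {"variables": ["word of mouth", "customer adoption"],
             "polarity": "reinforcing"},
        ],
    },
}
```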

If this is right

  • Practitioners can select certain local models for causal loop diagram extraction without large accuracy loss compared with mid-tier cloud services.
  • Choosing between GGUF and MLX backends affects JSON reliability and long-context stability more than selecting Q3, Q4, or higher-bit quantization.
  • Local deployments require explicit JSON instructions when using MLX and careful context management when using GGUF to avoid stuck generations (a minimal sketch follows this list).
  • Parameter sweeps and cleaned timing data supply concrete guidance for running 123B-671B parameter models on Apple Silicon hardware.
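
A minimal sketch of how that backend difference shows up in practice, assuming a locally running llama.cpp server and an MLX model loaded through mlx_lm. The model path, server address, grammar file location, and prompt wording are assumptions, not the paper's configuration.

```python
# Sketch only: model path, server URL, and grammar file location are assumptions.
from pathlib import Path

import requests                       # talks to a locally running llama.cpp server
from mlx_lm import load, generate     # MLX backend (mlx_lm)

prompt = "Extract the causal loop diagram from the following text as JSON:\n..."

# MLX (mlx_lm): no schema enforcement, so the JSON format must be spelled out in the prompt.
model, tokenizer = load("path/to/some-mlx-quantized-model")   # hypothetical local model path
mlx_prompt = (prompt + '\nRespond with ONLY a JSON object of the form '
              '{"variables": [...], "links": [[cause, effect, "+" or "-"], ...]}.')
mlx_output = generate(model, tokenizer, prompt=mlx_prompt, max_tokens=1024)

# GGUF (llama.cpp server): grammar-constrained sampling keeps the output valid JSON, but
# dense long-context prompts can generate indefinitely, so cap the request time.
resp = requests.post(
    "http://localhost:8080/completion",                        # assumed local server address
    json={
        "prompt": prompt,
        "n_predict": 1024,
        "grammar": Path("grammars/json.gbnf").read_text(),     # JSON grammar shipped with llama.cpp
    },
    timeout=1800,                                              # treat anything slower as stuck
)
gguf_output = resp.json()["content"]
```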

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved long-context handling in local inference engines could close the remaining gap on error-fixing tasks and make fully offline system dynamics assistants practical.
  • The results point to a near-term path where organizations run system dynamics AI tools locally for data-privacy or cost reasons while retaining acceptable diagram quality.
  • Hardware-specific backends may become the dominant design variable when deploying large models for domain-specific reasoning tasks beyond system dynamics.

Load-bearing premise

The 53 CLD tests and the interactive discussion scenarios are representative of real-world system dynamics tasks and the automated or author-defined pass criteria accurately capture model capability without hidden biases.
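
To illustrate what an automated pass criterion of this kind could look like, here is a minimal sketch of a deterministic check over extracted variables and signed links. The all-or-nothing rule and the field names are assumptions, not the paper's actual scoring rules.

```python
# Illustrative sketch of a deterministic pass/fail check; the all-or-nothing rule and
# the field names are assumptions, not the paper's actual rubric.
def passes_cld_test(predicted: dict, expected: dict) -> bool:
    def norm(s: str) -> str:
        return s.lower().strip()

    pred_vars = {norm(v) for v in predicted.get("variables", [])}
    gold_vars = {norm(v) for v in expected["variables"]}

    pred_links = {(norm(c), norm(e), sign) for c, e, sign in predicted.get("links", [])}
    gold_links = {(norm(c), norm(e), sign) for c, e, sign in expected["links"]}

    # Pass only if every expected variable and every signed causal link is recovered.
    return gold_vars <= pred_vars and gold_links <= pred_links
```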

What would settle it

A new collection of 50+ real practitioner system dynamics projects run through the same models and scoring rules would show whether local-model pass rates remain within 10 points of cloud rates or drop sharply on error-fixing and long-context items.

read the original abstract

We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
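
The sweep over sampling parameters mentioned at the end of the abstract can be pictured as a simple grid; the specific values below are placeholders, since the paper's actual sweep settings are not listed on this page.

```python
# Placeholder grid over sampling parameters (t, p, k); the values actually swept in the
# paper are not shown here, so these numbers are assumptions.
from itertools import product

temperatures = [0.0, 0.3, 0.7]
top_ps = [0.9, 0.95, 1.0]
top_ks = [20, 40, 64]

sweep = [{"temperature": t, "top_p": p, "top_k": k}
         for t, p, k in product(temperatures, top_ps, top_ks)]
print(f"{len(sweep)} sampling configurations per model, backend, and quantization level")
```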

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents two new benchmarks for evaluating LLMs as System Dynamics AI assistants: the CLD Leaderboard (53 structured tests for causal loop diagram extraction) and the Discussion Leaderboard (interactive scenarios covering model discussion, feedback explanation, and model-building coaching). It reports empirical pass rates showing cloud models at 77-89% on CLD extraction versus a best local model (Kimi K2.5 GGUF Q3 zero-shot) at 77%, with further breakdowns on Discussion tasks (50-100% on building steps, 47-75% on feedback, 0-50% on error fixing). The central analysis compares reasoning vs. instruction-tuned models, GGUF vs. MLX backends, and quantization levels (Q3/Q4/MLX variants), concluding that backend choice has larger practical impact than quantization; it also supplies full parameter sweeps (t, p, k), cleaned timing data excluding stuck requests, and a practitioner guide for 123B-671B parameter models on Apple Silicon.

Significance. If the evaluation rubrics hold, the work supplies actionable empirical guidance for deploying local versus cloud LLMs in system-dynamics contexts, particularly the finding that selected local models can match mid-tier cloud performance on extraction while exposing long-context limitations. The full parameter sweeps, exclusion criteria for timing data, and backend-specific observations (JSON handling, grammar constraints) constitute reproducible strengths that practitioners can directly test.

major comments (2)
  1. [CLD Leaderboard] CLD Leaderboard (53 tests): The headline pass rates (cloud 77-89%, best local 77%) rest on author-defined pass/fail rubrics whose construction, exact scoring rules, and test items are not shown to have inter-rater reliability statistics or blind external expert validation. Without these, the reported gaps and the backend-versus-quantization conclusion remain sensitive to rubric phrasing or output-style biases rather than demonstrated system-dynamics capability.
  2. [Discussion Leaderboard] Discussion Leaderboard: The attribution of 0-50% error-fixing performance to memory limits in local deployments requires explicit quantification of prompt lengths, context-window usage, and how stuck or truncated generations were identified and excluded; the current description leaves open whether the performance drop is an artifact of the chosen long-context prompts rather than a general model-family limitation.
minor comments (2)
  1. [Abstract] Abstract and methods: The notation 'Kimi K2.5 GGUF Q3' and similar model identifiers should be standardized or footnoted for readability; likewise, the precise criteria used to flag and exclude 'stuck requests' from timing data should be stated explicitly.
  2. [CLD Leaderboard] The manuscript would benefit from a short table or appendix excerpt showing one or two representative CLD test prompts and the corresponding rubric items to allow readers to assess rubric transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential utility of the CLD and Discussion Leaderboards along with the reproducibility elements in our parameter sweeps and timing data. We will revise the manuscript to increase transparency on rubric construction and to supply the requested quantification for error-fixing performance. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [CLD Leaderboard] CLD Leaderboard (53 tests): The headline pass rates (cloud 77-89%, best local 77%) rest on author-defined pass/fail rubrics whose construction, exact scoring rules, and test items are not shown to have inter-rater reliability statistics or blind external expert validation. Without these, the reported gaps and the backend-versus-quantization conclusion remain sensitive to rubric phrasing or output-style biases rather than demonstrated system-dynamics capability.

    Authors: The rubrics were constructed from standard System Dynamics evaluation criteria for causal loop diagrams (correct variable identification, signed causal links, loop polarity, and structural completeness). The full set of 53 test items and the exact pass/fail decision rules appear in Appendix A of the manuscript. We will expand the main text with a dedicated subsection describing rubric development, including concrete examples of passing versus failing model outputs. We acknowledge that inter-rater reliability statistics and blind external expert validation were not performed; these would require additional resources and expert time outside the present study. In the revision we will explicitly note this as a limitation while arguing that the structured, deterministic nature of the 53 tests still provides a reproducible signal for the reported backend and quantization comparisons. revision: partial

  2. Referee: [Discussion Leaderboard] Discussion Leaderboard: The attribution of 0-50% error-fixing performance to memory limits in local deployments requires explicit quantification of prompt lengths, context-window usage, and how stuck or truncated generations were identified and excluded; the current description leaves open whether the performance drop is an artifact of the chosen long-context prompts rather than a general model-family limitation.

    Authors: We agree that the current description is insufficiently quantified. In the revised manuscript we will insert a new table reporting (i) mean and maximum prompt lengths for the error-fixing scenarios (approximately 4k–12k tokens), (ii) the context-window sizes of each local model, and (iii) the precise exclusion rules applied to timing and stuck-generation data (generations exceeding 30 minutes or failing to emit any tokens within the model’s configured limit were dropped, with counts disclosed). These additions demonstrate that the performance drop scales with prompt length and is systematically larger for local backends than for cloud APIs on identical prompts, supporting the interpretation as a general long-context limitation rather than an artifact of the specific test items. revision: yes
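
A minimal sketch of the exclusion rule the rebuttal describes (drop generations that exceed 30 minutes or emit no tokens), assuming timing records with hypothetical field names.

```python
# Sketch of the stuck-request exclusion rule described above; the record field names are
# hypothetical, and only the 30-minute / zero-token thresholds come from the rebuttal.
STUCK_SECONDS = 30 * 60

def clean_timing_data(runs: list[dict]) -> tuple[list[dict], int]:
    """Return (kept runs, number of runs excluded as stuck)."""
    kept = [r for r in runs
            if r["elapsed_seconds"] <= STUCK_SECONDS and r["tokens_generated"] > 0]
    return kept, len(runs) - len(kept)
```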

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

full rationale

The paper reports pass rates and performance metrics from direct evaluation of LLMs on fixed, purpose-built test sets (53 CLD extraction tests and interactive Discussion scenarios). No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The central claims rest on observed differences across model families, backends, and quantization levels, computed against externally defined rubrics and test items. No self-citations are used to justify core results, and the methodology is self-contained as an empirical study without any self-referential loops or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests on standard assumptions about LLM prompting behavior and benchmark representativeness rather than new free parameters, axioms, or invented entities.

axioms (2)
  • domain assumption: LLMs can be prompted to produce valid structured output (e.g., JSON) for CLD extraction under the tested conditions
    Required for the reported pass rates on the CLD Leaderboard.
  • domain assumption: The 53 CLD tests and interactive Discussion scenarios are representative of practical system dynamics assistance tasks
    Central to generalizing the benchmark scores to real-world utility.

pith-pipeline@v0.9.0 · 5644 in / 1393 out tokens · 60926 ms · 2026-05-10T04:16:06.418859+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1] Sterman, J.D. (2000). Business Dynamics: Systems Thinking and Modeling for a Complex World. McGraw-Hill.

  2. [2] Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.

  3. [3] Frantar, E., et al. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.

  4. [4] OpenAI (2024). OpenAI o1 System Card. https://openai.com/index/openai-o1-system-card/

  5. [5] Apple MLX Team (2024). mlx-lm: LLM inference and fine-tuning with MLX. https://github.com/ml-explore/mlx-lm

  6. [6] Gerganov, G., et al. (2023). llama.cpp: Inference of LLaMA model in pure C/C++. https://github.com/ggerganov/llama.cpp

  7. [7] Moonshot AI (2025). Kimi K2.5 Technical Report. https://huggingface.co/moonshotai/Kimi-K2.5

  8. [8] DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.

  9. [9] Apple Inc. (2025). Mac Studio (M3 Ultra, 512 GB): Technical Specifications. https://www.apple.com/mac-studio/specs/

  10. [10] International Energy Agency (2024). Electricity 2024: Analysis and Forecast to 2026. IEA, Paris. https://www.iea.org/reports/electricity-2024

  11. [11] Luccioni, A.S., Viguier, S., & Ligozat, A.L. (2023). Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253), 1-15.

  12. [12] Patterson, D., et al. (2022). The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7), 18-28.