Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Pith reviewed 2026-05-10 04:16 UTC · model grok-4.3
The pith
Cloud LLMs reach 77-89% on causal loop diagram extraction, the best local model hits 77%, and backend choice outweighs quantization effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic tests on the CLD Leaderboard show cloud models attaining 77-89% overall pass rates on structured causal loop diagram extraction, with the best local model (Kimi K2.5 GGUF Q3 zero-shot) also reaching 77% and matching mid-tier cloud results. On the Discussion Leaderboard, local models achieve 50-100% on model building steps and 47-75% on feedback explanation but only 0-50% on error fixing due to long-context memory limits. Backend choice between GGUF and MLX produces larger performance differences than quantization level, with GGUF providing reliable JSON schema adherence at the cost of occasional indefinite generation on dense long-context prompts.
What carries the argument
The CLD Leaderboard and Discussion Leaderboard, two purpose-built test suites that measure structured causal loop diagram extraction and interactive model discussion, feedback, and coaching capabilities.
If this is right
- Practitioners can select certain local models for causal loop diagram extraction without large accuracy loss compared with mid-tier cloud services.
- Choosing between GGUF and MLX backends affects JSON reliability and long-context stability more than selecting Q3, Q4, or higher-bit quantization.
- Local deployments require explicit JSON instructions when using MLX and careful context management when using GGUF to avoid stuck generations.
- Parameter sweeps and cleaned timing data supply concrete guidance for running 123B-671B parameter models on Apple Silicon hardware.
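The MLX-specific guidance above can be sketched as a small backend-aware wrapper. This is a minimal illustration, not the paper's actual harness: the function names, the retry policy, and the schema string are hypothetical assumptions.

```python
import json

# Hypothetical sketch: prompt-level JSON enforcement for MLX-style backends
# that lack grammar-constrained sampling, plus a retry on invalid output.
JSON_INSTRUCTION = (
    "Respond with a single JSON object only, matching this schema: "
    '{"variables": [...], "links": [{"from": ..., "to": ..., "sign": "+|-"}]}'
)

def extract_cld(generate, prompt, backend="mlx", max_retries=2):
    """Call a text-generation function and parse a CLD as JSON.

    `generate` is any callable prompt -> str; `backend` decides whether
    the JSON instruction must be injected into the prompt itself.
    """
    if backend == "mlx":
        # mlx_lm does not enforce JSON schemas, so instruct in the prompt.
        prompt = f"{prompt}\n\n{JSON_INSTRUCTION}"
    for _ in range(max_retries + 1):
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # a grammar-constrained backend would not need this retry
    return None
```

A GGUF/llama.cpp deployment would instead rely on grammar-constrained sampling and add a generation timeout to guard against the stuck long-context generations the paper reports.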
Where Pith is reading between the lines
- Improved long-context handling in local inference engines could close the remaining gap on error-fixing tasks and make fully offline system dynamics assistants practical.
- The results point to a near-term path where organizations run system dynamics AI tools locally for data-privacy or cost reasons while retaining acceptable diagram quality.
- Hardware-specific backends may become the dominant design variable when deploying large models for domain-specific reasoning tasks beyond system dynamics.
Load-bearing premise
The 53 CLD tests and the interactive discussion scenarios are representative of real-world system dynamics tasks, and the automated or author-defined pass criteria accurately capture model capability without hidden biases.
What would settle it
A new collection of 50+ real practitioner system dynamics projects run through the same models and scoring rules would show whether local-model pass rates remain within 10 points of cloud rates or drop sharply on error-fixing and long-context items.
Original abstract
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 123B-671B parameter models on Apple Silicon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents two new benchmarks for evaluating LLMs as System Dynamics AI assistants: the CLD Leaderboard (53 structured tests for causal loop diagram extraction) and the Discussion Leaderboard (interactive scenarios covering model discussion, feedback explanation, and model-building coaching). It reports empirical pass rates showing cloud models at 77-89% on CLD extraction versus a best local model (Kimi K2.5 GGUF Q3 zero-shot) at 77%, with further breakdowns on Discussion tasks (50-100% on building steps, 47-75% on feedback, 0-50% on error fixing). The central analysis compares reasoning vs. instruction-tuned models, GGUF vs. MLX backends, and quantization levels (Q3/Q4/MLX variants), concluding that backend choice has larger practical impact than quantization; it also supplies full parameter sweeps (t, p, k), cleaned timing data excluding stuck requests, and a practitioner guide for 123B-671B models on Apple Silicon.
Significance. If the evaluation rubrics hold, the work supplies actionable empirical guidance for deploying local versus cloud LLMs in system-dynamics contexts, particularly the finding that selected local models can match mid-tier cloud performance on extraction while exposing long-context limitations. The full parameter sweeps, exclusion criteria for timing data, and backend-specific observations (JSON handling, grammar constraints) constitute reproducible strengths that practitioners can directly test.
major comments (2)
- [CLD Leaderboard] CLD Leaderboard (53 tests): The headline pass rates (cloud 77-89%, best local 77%) rest on author-defined pass/fail rubrics whose construction, exact scoring rules, and test items are not shown to have inter-rater reliability statistics or blind external expert validation. Without these, the reported gaps and the backend-versus-quantization conclusion remain sensitive to rubric phrasing or output-style biases rather than demonstrated system-dynamics capability.
- [Discussion Leaderboard] Discussion Leaderboard: The attribution of 0-50% error-fixing performance to memory limits in local deployments requires explicit quantification of prompt lengths, context-window usage, and how stuck or truncated generations were identified and excluded; the current description leaves open whether the performance drop is an artifact of the chosen long-context prompts rather than a general model-family limitation.
minor comments (2)
- [Abstract] Abstract and methods: The notation 'Kimi K2.5 GGUF Q3' and similar model identifiers should be standardized or footnoted for readability; likewise, the precise criteria used to flag and exclude 'stuck requests' from timing data should be stated explicitly.
- [CLD Leaderboard] The manuscript would benefit from a short table or appendix excerpt showing one or two representative CLD test prompts and the corresponding rubric items to allow readers to assess rubric transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential utility of the CLD and Discussion Leaderboards along with the reproducibility elements in our parameter sweeps and timing data. We will revise the manuscript to increase transparency on rubric construction and to supply the requested quantification for error-fixing performance. Our point-by-point responses follow.
Point-by-point responses
Referee: [CLD Leaderboard] CLD Leaderboard (53 tests): The headline pass rates (cloud 77-89%, best local 77%) rest on author-defined pass/fail rubrics whose construction, exact scoring rules, and test items are not shown to have inter-rater reliability statistics or blind external expert validation. Without these, the reported gaps and the backend-versus-quantization conclusion remain sensitive to rubric phrasing or output-style biases rather than demonstrated system-dynamics capability.
Authors: The rubrics were constructed from standard System Dynamics evaluation criteria for causal loop diagrams (correct variable identification, signed causal links, loop polarity, and structural completeness). The full set of 53 test items and the exact pass/fail decision rules appear in Appendix A of the manuscript. We will expand the main text with a dedicated subsection describing rubric development, including concrete examples of passing versus failing model outputs. We acknowledge that inter-rater reliability statistics and blind external expert validation were not performed; these would require additional resources and expert time outside the present study. In the revision we will explicitly note this as a limitation while arguing that the structured, deterministic nature of the 53 tests still provides a reproducible signal for the reported backend and quantization comparisons. revision: partial
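A deterministic pass/fail rule of the kind the authors describe can be sketched as follows. The field names and the specific rule (pass if all expected variables and all expected signed links are reproduced) are illustrative assumptions, not the paper's actual Appendix A rubric.

```python
# Hypothetical sketch of a deterministic CLD pass/fail check: an output
# passes a test item if it names every expected variable and reproduces
# every expected signed causal link. Schema is illustrative only.
def passes_cld_test(output, expected):
    """output/expected: {"variables": [...], "links": [(src, dst, sign), ...]}"""
    if not set(expected["variables"]) <= set(output["variables"]):
        return False  # a missing variable fails the item outright
    wanted = set(map(tuple, expected["links"]))
    found = {tuple(link) for link in output["links"]}
    return wanted <= found
```

Because a rule like this is deterministic given the model output, it supports the authors' point that the comparisons are reproducible even without inter-rater statistics.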
Referee: [Discussion Leaderboard] Discussion Leaderboard: The attribution of 0-50% error-fixing performance to memory limits in local deployments requires explicit quantification of prompt lengths, context-window usage, and how stuck or truncated generations were identified and excluded; the current description leaves open whether the performance drop is an artifact of the chosen long-context prompts rather than a general model-family limitation.
Authors: We agree that the current description is insufficiently quantified. In the revised manuscript we will insert a new table reporting (i) mean and maximum prompt lengths for the error-fixing scenarios (approximately 4k–12k tokens), (ii) the context-window sizes of each local model, and (iii) the precise exclusion rules applied to timing and stuck-generation data (generations exceeding 30 minutes or failing to emit any tokens within the model’s configured limit were dropped, with counts disclosed). These additions demonstrate that the performance drop scales with prompt length and is systematically larger for local backends than for cloud APIs on identical prompts, supporting the interpretation as a general long-context limitation rather than an artifact of the specific test items. revision: yes
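The stated exclusion rule (drop generations exceeding 30 minutes or emitting no tokens, with counts disclosed) can be expressed as a small filter. The record field names here are assumptions for illustration, not the paper's released data schema.

```python
# Hypothetical sketch of the stated stuck-request exclusion rule for
# timing data: drop any request that ran longer than 30 minutes or
# emitted zero tokens, and report how many were dropped.
STUCK_SECONDS = 30 * 60

def clean_timing(records):
    """records: list of dicts with 'elapsed_s' and 'tokens_emitted'.

    Returns (kept, n_excluded) so exclusion counts can be disclosed.
    """
    kept = [r for r in records
            if r["elapsed_s"] <= STUCK_SECONDS and r["tokens_emitted"] > 0]
    return kept, len(records) - len(kept)
```

Reporting `n_excluded` alongside the cleaned timings is what lets readers check that the exclusions do not mask a systematic backend failure mode.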
Circularity Check
No circularity: purely empirical benchmark comparison
full rationale
The paper reports pass rates and performance metrics from direct evaluation of LLMs on fixed, purpose-built test sets (53 CLD extraction tests and interactive Discussion scenarios). No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The central claims rest on observed differences across model families, backends, and quantization levels, computed against externally defined rubrics and test items. No self-citations are used to justify core results, and the methodology is self-contained as an empirical study without any self-referential loops or ansatzes.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs can be prompted to produce valid structured output (e.g., JSON) for CLD extraction under the tested conditions
- domain assumption The 53 CLD tests and interactive Discussion scenarios are representative of practical system dynamics assistance tasks
Reference graph
Works this paper leans on
- [1] Sterman, J.D. (2000). Business Dynamics: Systems Thinking and Modeling for a Complex World. McGraw-Hill.
- [2] Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.
- [3] Frantar, E., et al. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
- [4] OpenAI (2024). OpenAI o1 System Card. https://openai.com/index/openai-o1-system-card/
- [5] Apple MLX Team (2024). mlx-lm: LLM inference and fine-tuning with MLX. https://github.com/ml-explore/mlx-lm
- [6] Gerganov, G., et al. (2023). llama.cpp: Inference of LLaMA models in pure C/C++. https://github.com/ggerganov/llama.cpp
- [7] Moonshot AI (2025). Kimi K2.5 Technical Report. https://huggingface.co/moonshotai/Kimi-K2.5
- [8] DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
- [9] Apple Inc. (2025). Mac Studio (M3 Ultra, 512 GB): Technical Specifications. https://www.apple.com/mac-studio/specs/
- [10] International Energy Agency (2024). Electricity 2024: Analysis and Forecast to 2026. IEA, Paris. https://www.iea.org/reports/electricity-2024
- [11] Luccioni, A.S., Viguier, S., & Ligozat, A.L. (2023). Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253), 1-15.
- [12] Patterson, D., et al. (2022). The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7), 18-28.