OpenJarvis: Personal AI, On Personal Devices


arXiv: 2605.17172 · v1 · pith:3CGSZH6Knew · submitted 2026-05-16 · cs.LG · cs.AI · cs.CL


Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini


Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3


keywords: personal AI, on-device inference, spec optimization, LLM-guided search, agentic systems, local models, accuracy benchmarks, cost latency tradeoffs


The pith

Decomposing personal AI into five editable primitives lets on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks via cloud-guided search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that swapping local models into existing personal AI stacks causes large accuracy drops because those stacks are built around cloud models. OpenJarvis addresses this by representing the entire system as a typed spec over five independent primitives that can be optimized together rather than just tuning prompts. LLM-guided spec search lets frontier cloud models propose edits to the spec, keeping only those that do not regress accuracy when tested on local models. The final optimized spec then runs entirely on-device. A sympathetic reader would care because this approach promises private personal AI with far lower ongoing cost and latency while recovering most of the performance lost by moving away from the cloud.

Core claim

OpenJarvis represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. LLM-guided spec search uses frontier cloud models to propose edits across the spec at search time, accepting only non-regressing edits when evaluated on local models. The resulting specs run entirely on-device at inference time and match or exceed cloud accuracy on 4 of 8 benchmarks while landing within 3.2 pp of the best cloud baseline on average, with marginal API cost reduced by ~800x and end-to-end latency reduced by 4x.
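To make the "typed spec over five primitives" idea concrete, here is a minimal Python sketch. Only the five primitive names come from the paper; every field name (`model`, `runtime`, `max_steps`, and so on), the default values, and the pairing of Qwen3.5-9B with an Ollama runtime are hypothetical choices made for illustration, not the OpenJarvis schema.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Intelligence:
    model: str                      # which local model is served (hypothetical field)


@dataclass(frozen=True)
class Engine:
    runtime: str                    # inference runtime name (hypothetical field)
    max_tokens: int = 1024


@dataclass(frozen=True)
class Agents:
    system_prompt: str
    max_steps: int = 1              # 1 means a single-turn agent


@dataclass(frozen=True)
class ToolsAndMemory:
    tools: Tuple[str, ...] = ()
    memory_top_k: int = 0


@dataclass(frozen=True)
class Learning:
    enabled: bool = False


@dataclass(frozen=True)
class Spec:
    """One typed object holding all five primitives."""
    intelligence: Intelligence
    engine: Engine
    agents: Agents
    tools_memory: ToolsAndMemory
    learning: Learning


base = Spec(
    intelligence=Intelligence(model="Qwen3.5-9B"),
    engine=Engine(runtime="ollama"),
    agents=Agents(system_prompt="You are a helpful assistant."),
    tools_memory=ToolsAndMemory(),
    learning=Learning(),
)

# An edit touches one primitive and leaves the other four unchanged.
edited = replace(base, agents=replace(base.agents, max_steps=4))
```

Because the dataclasses are frozen, an edit always produces a new spec, so the previous one stays available for comparison when an edit has to be rolled back.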

What carries the argument

LLM-guided spec search: a local-cloud collaboration in which frontier models propose edits to a typed spec of five primitives at search time, only non-regressing edits are kept, and the final spec executes fully locally.
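The propose, evaluate, accept cycle can be sketched as a short greedy loop. This is an illustration under stated assumptions, not the paper's implementation: `propose` stands in for the frontier cloud model, `evaluate` stands in for running the local model on a benchmark, and the spec is a plain nested dict with hypothetical field names. The restriction to four editable primitives follows the figure captions, which say the search edits Intelligence, Engine, Agents, and Tools & Memory.

```python
from typing import Callable, Dict, List, Tuple

Spec = Dict[str, Dict[str, object]]          # primitive name -> its fields
Edit = Tuple[str, str, object]               # (primitive, field, new value)

EDITABLE = ("intelligence", "engine", "agents", "tools_memory")


def apply_edit(spec: Spec, edit: Edit) -> Spec:
    """Return a copy of the spec with one field changed."""
    primitive, name, value = edit
    if primitive not in EDITABLE:
        raise ValueError(f"{primitive} is not editable by the search")
    new_spec = {k: dict(v) for k, v in spec.items()}   # copy, never mutate
    new_spec[primitive][name] = value
    return new_spec


def spec_search(spec: Spec,
                propose: Callable[[Spec, List], List[Edit]],
                evaluate: Callable[[Spec], float],
                rounds: int) -> Tuple[Spec, float]:
    """Greedy search that keeps an edit only if the score does not regress."""
    best_score = evaluate(spec)
    history: List = []
    for _ in range(rounds):
        for edit in propose(spec, history):
            candidate = apply_edit(spec, edit)
            score = evaluate(candidate)
            accepted = score >= best_score          # ties count as non-regressing
            history.append((edit, score, accepted))
            if accepted:
                spec, best_score = candidate, score
    return spec, best_score


# Toy stand-ins (hypothetical): a fixed proposal list and a hand-written score.
def toy_propose(spec, history):
    return [("agents", "max_steps", 4),
            ("engine", "temperature", 1.5),
            ("tools_memory", "top_k", 5)]


def toy_evaluate(spec):
    score = 0.5
    if spec["agents"].get("max_steps", 1) >= 4:
        score += 0.2
    if spec["engine"].get("temperature", 0.0) > 1.0:
        score -= 0.3
    if spec["tools_memory"].get("top_k", 0) > 0:
        score += 0.1
    return round(score, 2)


start = {"intelligence": {"model": "local-model"}, "engine": {},
         "agents": {}, "tools_memory": {}, "learning": {}}
final, final_score = spec_search(start, toy_propose, toy_evaluate, rounds=1)
```

In this toy run the temperature edit lowers the score and is discarded, while the other two edits are kept. Only `evaluate` touches the local model; `propose` is the only place a cloud call would occur, and it is not needed once the search ends.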

If this is right

Personal AI stacks become end-to-end optimizable rather than limited to prompt tuning.
On-device models achieve near-cloud accuracy on personal tasks like PinchBench and GAIA.
Marginal API cost for personal AI drops by roughly 800 times.
End-to-end latency for personal AI tasks improves by a factor of 4.
Each primitive in the stack can be measured and improved independently against accuracy, cost, and latency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition and search approach could be tested on other agentic or tool-using systems where local constraints matter.
One-time cloud assistance during optimization might enable fully private, long-term local AI without repeated cloud calls.
Similar spec-based optimization could help close performance gaps in neighboring settings like mobile or edge AI deployments.

Load-bearing premise

That edits proposed by frontier cloud models can be reliably evaluated for non-regression on local models and that the resulting spec will continue to perform well at inference time without any cloud involvement.

What would settle it

Evaluating the optimized on-device specs across all 8 benchmarks and finding average accuracy more than 3.2 percentage points below the best cloud baseline.
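The check described above is simple arithmetic, sketched here in Python. The reading of "gap" as best cloud accuracy minus local accuracy, averaged over benchmarks, is an assumption about how the paper computes the figure, and the numbers in the example are made-up placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict


def gap_summary(local: Dict[str, float],
                cloud_best: Dict[str, float],
                claimed_gap_pp: float = 3.2) -> Dict[str, object]:
    """Compare local accuracy to the best cloud baseline, in percentage points."""
    if local.keys() != cloud_best.keys():
        raise ValueError("both inputs must cover the same benchmarks")
    gaps = {name: cloud_best[name] - local[name] for name in local}
    mean_gap = mean(gaps.values())
    return {
        "mean_gap_pp": mean_gap,
        "matched_or_exceeded": sum(1 for g in gaps.values() if g <= 0),
        "claim_holds": mean_gap <= claimed_gap_pp,
    }


# Placeholder accuracies for two imaginary benchmarks (not from the paper).
summary = gap_summary(local={"bench_a": 90.0, "bench_b": 70.0},
                      cloud_best={"bench_a": 88.0, "bench_b": 80.0})
```

With these placeholder values the mean gap is 4.0 pp, so `claim_holds` is false even though one benchmark is matched; an independent replication would fill in measured accuracies for all 8 benchmarks.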

Figures

Figures reproduced from arXiv: 2605.17172 by Andrew Park, Avanika Narayan, Azalia Mirhoseini, Caia Costello, Christopher Ré, Chuan Li, Gabriel Bo, Hakki Orhun Akengin, Herumb Shandilya, Jon Saad-Falcon, Matthew Hart, Robby Manihani, Tanvir Bhathal.

**Figure 1.** Overview of OpenJarvis. (Left) Five composable primitives (Intelligence, Engine, Agents, Tools & Memory, Learning) are composed through a declarative spec that can be shared, evaluated, and optimized end-to-end (Section 3.1). (Middle) Joint accuracy-efficiency evaluation across the evaluated local specs (green) and 3 cloud baselines (blue) reveals that on-device configurations approach within 3.2 pp of th…

**Figure 2.** OpenJarvis architecture. Five composable primitives decouple model selection (Intelligence), inference runtime (Engine), agent logic (Agents), data integration (Tools & Memory), and on-device learning (Learning) into independently swappable layers. A spec (Section 3.1) composes all five into a declarative configuration that can be shared, evaluated, and optimized end-to-end.

**Figure 3.** The spec abstraction. (a) A spec is a typed configuration object with five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. (b) Optimizers instantiate the same signature by restricting which fields they edit. LoRA edits Intelligence weights; DSPy and GEPA edit Agent prompts; LLM-guided spec search edits Intelligence, Engine, Agents, and Tools & Memory jointly.

**Figure 4.** Three ways to optimize a spec. OpenJarvis proposes edits to the four editable primitives and keeps an edit only if held-out performance does not regress. Evolutionary spec search maintains and merges a population of candidate specs [1]. Single-component baselines edit one primitive at a time.

**Figure 5.** Accuracy-efficiency frontier. Local configurations approach the best cloud accuracy within 3.2 pp while reducing marginal API cost by roughly 800× and end-to-end latency by roughly 4× under our benchmark protocol. Energy and hardware-specific breakdowns are in Appendix C.3. The best local model lands within 3.2 pp of the best cloud model on average and matches or exceeds cloud on 4 of 8 benchmarks.

**Figure 6.** LLM-guided spec search improves local specs. Every student–teacher pair improves over the unoptimized spec on all three benchmarks. The strongest search-optimized Qwen3.5-9B student reaches 100.0% on PinchBench, 83.0% on LiveCodeBench, and 91.0% on LiveResearchBench.

**Figure 7.** Accuracy vs. optimization cost. LLM-guided spec search reaches the best accuracy on all three main benchmarks. LoRA is the strongest single-primitive baseline, but LLM-guided spec search is 7.1–10.9× cheaper to optimize. DSPy/SIMBA and prompt-only GEPA produce modest gains over the unoptimized local spec.

**Figure 8.** Proposer and move-space ablations for LLM-guided spec search. Student: Qwen3.5-9B; teacher: Claude Opus 4.6. Left: at fixed four-primitive move space, the LLM proposer outperforms a template-random proposer and evolutionary spec search. Middle: at fixed LLM proposer, expanding from one editable primitive to all four improves both accuracy and latency. Right: merging primitive pairs reduces accuracy and spe…

**Figure 9.** Two specs for the same personal AI system on different hardware. The Consumer deployment serves Gemma4-4B via Ollama with a single-turn agent and learning disabled. The Workstation deployment serves Qwen3.5-122B (FP8) via vLLM with a multistep coding agent, expanded tool set, and LLM-guided spec search enabled. [tools.mcp], [security], and connectors (omitted) are identical across both. Security at the To…

**Figure 10.** Edit-type allocation by benchmark. Share of accepted edits by primitive across the 8-benchmark suite. Student: Qwen3.5-9B; teacher: Claude Opus 4.6. The dominant primitive varies by task type: Intelligence dominates code (44% on LCB), Agent dominates agentic and customer-service tasks (41–45% on PB, TauB, and TBTel), and Tool dominates tool-calling and research (39–47% on TC15, GAIA, DRB, and LRB). Engine…

**Figure 11.** Edit-type allocation by failure cluster category. Row-normalized share of accepted edits of each type within each failure cluster category, pooled across the 8-benchmark suite. Student: Qwen3.5-9B; teacher: Claude Opus 4.6. The teacher maps diagnoses to the expected intervention type: retrieval failures receive mostly Tool edits (65%), reasoning failures mostly Intelligence edits (52%), control-flow fail…

Original abstract

Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OpenJarvis decomposes personal AI into five editable primitives and uses cloud models only at search time to find specs that run close to cloud accuracy on local hardware, with big reported gains in cost and latency.


The main takeaway is that this paper gives a concrete way to optimize an entire personal AI stack for on-device use instead of just swapping models or tweaking prompts. They define the system as a typed spec over five primitives (Intelligence, Engine, Agents, Tools & Memory, and Learning) and let frontier models propose edits across them during search, accepting only non-regressing changes before running the result locally at inference time. That setup produces on-device specs that match or beat cloud baselines on 4 of 8 benchmarks and stay within 3.2 pp on average, while cutting marginal API cost by roughly 800x and latency by 4x.

Those numbers address the real gap they document: dropping from something like Claude to a 9B local model loses 25-39 pp, and prompt optimizers alone recover only about 5 pp. Making the primitives independently editable and jointly searchable is the step that lets them close most of the rest without cloud involvement at runtime. The architecture is practical and the benchmark results are specific enough to be checked.

The soft spot is the non-regression filter itself. If the local evaluations that decide whether an edit is kept use the same prompt distributions or benchmark instances later used for final reporting, or if they skip variance estimates across multiple runs, then accepted specs can look stable during search but degrade on fresh queries. The abstract does not spell out held-out splits or statistical controls for that step, so the transfer from search to inference time needs explicit verification in the full paper.

Readers working on on-device agents, local inference stacks, or hybrid optimization will get the most from the spec format and the measured trade-offs. It is worth sending to peer review so the evaluation protocol for the search phase can be examined in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenJarvis, a decomposed personal AI architecture that represents the system as a typed spec over five primitives (Intelligence, Engine, Agents, Tools & Memory, and Learning). It proposes LLM-guided spec search in which frontier cloud models propose edits at optimization time; only non-regressing edits (evaluated on local models) are accepted, and the resulting spec executes entirely on-device at inference time. The central empirical claim is that the optimized on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks, lie within 3.2 pp of the best cloud baseline on average, and deliver ~800x lower marginal API cost together with 4x lower end-to-end latency.

Significance. If the reported performance numbers are shown to be robust, the work would constitute a useful contribution toward practical, privacy-preserving personal AI by demonstrating that a hybrid search procedure can close most of the local-cloud accuracy gap while eliminating cloud involvement at inference. The explicit decomposition into independently editable primitives is a clean conceptual advance over prompt-only tuning. The hybrid local-cloud optimization with fully local deployment is a promising paradigm, though its significance hinges on rigorous verification that the search-time non-regression oracle generalizes.

major comments (2)

[Abstract and §4 (spec search)] Abstract and the LLM-guided spec search procedure: The headline result (on-device specs within 3.2 pp of the best cloud baseline on 8 benchmarks) depends on accepting only non-regressing edits proposed by cloud models. The manuscript does not state whether the local-model evaluations performed during search use held-out data, employ multiple trials to control for stochasticity, or reuse the same benchmark instances later used for final reporting. Without these controls the non-regression filter can accept specifications that overfit to search-time local evaluations and subsequently degrade at inference time.
[§5 and Tables 1-2] Experimental results section and associated tables: The per-benchmark and average accuracy figures are presented without accompanying statistical tests, confidence intervals, or variance estimates across runs. This makes it impossible to assess whether the reported matches/exceeds on 4 of 8 benchmarks and the 3.2 pp average gap are statistically reliable or could be explained by evaluation noise.
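The controls requested in the first major comment can be sketched in a few lines of Python. This is a hypothetical protocol, not the one used in the paper: `split_instances` keeps a reporting split that the search never sees, and `non_regressing` averages several seeded trials before comparing a candidate spec with the baseline. The function names, the split fractions, and the `run_trial` callback signature are all assumptions.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def split_instances(instance_ids: Sequence[int],
                    fractions: Tuple[float, float] = (0.4, 0.3),
                    seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split into (search, held-out acceptance, final reporting) sets."""
    ids = sorted(instance_ids)
    random.Random(seed).shuffle(ids)             # seeded, so the split is repeatable
    cut1 = int(len(ids) * fractions[0])
    cut2 = cut1 + int(len(ids) * fractions[1])
    return ids[:cut1], ids[cut1:cut2], ids[cut2:]


def non_regressing(run_trial: Callable[..., float],
                   baseline_spec: dict,
                   candidate_spec: dict,
                   instances: Sequence[int],
                   trials: int = 3,
                   tolerance: float = 0.0) -> bool:
    """Accept only if the candidate's mean accuracy over several trials holds up."""
    base = mean(run_trial(baseline_spec, instances, seed=t) for t in range(trials))
    cand = mean(run_trial(candidate_spec, instances, seed=t) for t in range(trials))
    return cand >= base - tolerance


# Toy stand-in for running a local model: accuracy plus seed-dependent noise.
def toy_run_trial(spec, instances, seed):
    return spec["quality"] + (seed - 1) * 0.01


search_ids, held_out_ids, report_ids = split_instances(range(10))
keep = non_regressing(toy_run_trial, {"quality": 0.6}, {"quality": 0.7}, held_out_ids)
drop = non_regressing(toy_run_trial, {"quality": 0.6}, {"quality": 0.5}, held_out_ids)
```

Under this sketch the acceptance decision uses only `held_out_ids`, and headline numbers would be computed on `report_ids`, so an edit that overfits the search-time instances has no route into the reported accuracy.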

minor comments (2)

[Abstract] The abstract lists PinchBench and GAIA as illustrative tasks but does not enumerate the full set of 8 benchmarks or provide concise definitions of the evaluation metrics.
[§5] The precise definitions of 'marginal API cost' and 'end-to-end latency' (including what is measured in the 800x and 4x claims) should be stated explicitly in the experimental setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of the significance of our work. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.


Referee: Abstract and §4 (spec search): The headline result (on-device specs within 3.2 pp of the best cloud baseline on 8 benchmarks) depends on accepting only non-regressing edits proposed by cloud models. The manuscript does not state whether the local-model evaluations performed during search use held-out data, employ multiple trials to control for stochasticity, or reuse the same benchmark instances later used for final reporting. Without these controls the non-regression filter can accept specifications that overfit to search-time local evaluations and subsequently degrade at inference time.

Authors: We appreciate the referee's attention to the robustness of our search procedure. The LLM-guided spec search in §4 evaluates proposed edits by running the local models on the task benchmarks to check for non-regression. While the manuscript does not explicitly detail the use of held-out data or multiple trials in the current version, the evaluations are performed on the standard benchmark splits as described in §5. To address the concern of potential overfitting, we will revise the manuscript to include a more precise description of the search-time evaluation protocol, specifying that we use the same instances for consistency with final reporting but mitigate stochasticity by averaging over multiple inference runs where applicable. We believe this hybrid approach still provides a reliable filter because the cloud proposals are diverse and the acceptance criterion is conservative.
Revision: partial
Referee: §5 and Tables 1-2: Experimental results section and associated tables: The per-benchmark and average accuracy figures are presented without accompanying statistical tests, confidence intervals, or variance estimates across runs. This makes it impossible to assess whether the reported matches/exceeds on 4 of 8 benchmarks and the 3.2 pp average gap are statistically reliable or could be explained by evaluation noise.

Authors: We agree with the referee that the presentation of results can be improved by including measures of statistical reliability. In the revised manuscript, we will augment Tables 1 and 2 with confidence intervals computed via bootstrapping over the benchmark instances and report standard deviations from multiple evaluation runs with different random seeds. This will allow readers to better assess the significance of the observed gaps and matches to cloud baselines.
Revision: yes
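For readers unfamiliar with the bootstrapping mentioned in this response, here is a minimal percentile-bootstrap sketch in Python. It assumes per-instance outcomes coded as 1 (correct) or 0 (incorrect); the function name, the number of resamples, and the example data are illustrative choices, not values from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if not outcomes:
        raise ValueError("need at least one benchmark instance")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample instances with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Illustrative data: 70 correct answers out of 100 instances.
accuracy, lower, upper = bootstrap_ci([1] * 70 + [0] * 30)
```

A gap of a few percentage points between two systems is only informative if it is large relative to the width of intervals like this one, which is the referee's point about the 3.2 pp figure.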

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an empirical architecture and reports benchmark results from LLM-guided spec search. No equations, self-definitional primitives, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The search uses cloud models only during optimization; final specs run locally and accuracies are measured on external benchmarks rather than reducing to inputs by construction. The derivation chain is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not enumerate free parameters, background axioms, or new postulated entities; the five primitives are presented as an organizing framework rather than invented physical or mathematical objects.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "OpenJarvis represents a personal AI system as a typed spec over five primitives... LLM-guided spec search... only non-regressing edits are accepted"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks... reduce marginal API cost by ~800x"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · 13 internal anchors

[1] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning, 2026.
[2] Alibaba Cloud. Qwen-Agent. https://github.com/QwenLM/Qwen-Agent, 2025. Agent framework built on Qwen models with tool use and RAG capabilities.
[3] Anthropic. Model context protocol, 2024. Open standard for connecting AI assistants to external tools and data sources.
[4] Anthropic. Claude Opus 4.6: frontier model with extended-thinking and tool-use capabilities. Anthropic API Documentation, 2026. https://docs.anthropic.com
[5] Apple Inc. Apple M4 chip. https://www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/, 2024. Accessed: April 2026.
[6] Apple Machine Learning Research. Apple intelligence foundation language models. arXiv preprint arXiv:2407.21075, 2024.
[7] Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G. Dimakis, and Joseph E. Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models. arXiv preprint arXiv:2510.02453, 2025.
[8] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.
[9] BCD International. Inside the 2025–2027 compute crunch: What supply chain volatility really means for you. BCD Video Blog, 2025. Accessed: April 2026.
[10] Harrison Chase. LangChain. https://github.com/langchain-ai/langchain, 2022. Accessed: 2026.
[11] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
[12] Xiaopei Chen, Liang Li, Fei Ji, and Wen Wu. MobileFineTuner: A unified end-to-end framework for fine-tuning LLMs on mobile phones. arXiv preprint arXiv:2512.08211, 2025.
[13] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024.
[14] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759. ACM, 2009.
[15] CrewAI, Inc. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/crewAIInc/crewAI, 2024. Accessed: 2026.
[16] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[17] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
[18] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025.
[19] EdgeClaw Contributors. EdgeClaw: Local-cloud router plugin for OpenClaw. https://github.com/edgeclaw/edgeclaw, 2025. Adds a local-cloud routing layer on top of OpenClaw as a plugin.
[20] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2023.
[21] Georgi Gerganov. llama.cpp. https://github.com/ggml-org/llama.cpp, 2023. C/C++ LLM inference engine with quantization support. 85K+ GitHub stars as of 2025.
[22] Google. gemma.cpp. https://github.com/google/gemma.cpp, 2024. Lightweight C++ inference engine for Gemma models.
[23] Google. Agent development kit (ADK). https://github.com/google/adk-python,
[24] Python framework for building AI agents with tool use and multi-step reasoning.
[25] Google. Gemma 3n. https://ai.google.dev/gemma/docs/gemma-3n, 2025. On-device Gemma variant with per-layer embeddings and elastic inference for phones and laptops.
[26] Google. Gemma 4 26B: instruction-tuned open-weight 26B-parameter mixture-of-experts model. Model card, Hugging Face, 2025. https://huggingface.co/google/gemma-4-26b-it
[27] Google. Gemini nano. https://developer.android.com/ai/gemini-nano, 2026. Android Developers Documentation, last updated April 2, 2026.
[28] Google DeepMind. Gemini 3.1 Pro: frontier multimodal cloud model with extended-context reasoning. Google AI Developer Documentation, 2026. https://ai.google.dev
[29] Google DeepMind. Gemma 4: Byte for byte, the most capable open models. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/, April 2026. Blog post, April 2, 2026.
[30] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
[31] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[32] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, 2022.
[33] HuggingFace. AI energy score. https://huggingface.github.io/AIEnergyScore/,
[34] Standardized energy scoring system for AI models.
[35]

Granite 4.0 language models

IBM Research. Granite 4.0 language models. https://github.com/ibm-granite/ granite-4.0-language-models, 2025. Accessed: 2025-10-01. 14

work page 2025
[36]

Intel Core Ultra processors (series 1, meteor lake)

Intel Corporation. Intel Core Ultra processors (series 1, meteor lake). https:// www.intel.com/content/www/us/en/products/details/processors/core-ultra. html, 2023. Accessed: April 2026

work page 2023
[37]

IronClaw

IronClaw Contributors. IronClaw. https://github.com/ironclaw/ironclaw, 2025. Enterprise-focused fork of OpenClaw

work page 2025
[38]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

SWE-bench: Can language models resolve real-world Github issues? InThe Twelfth International Conference on Learning Representations, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world Github issues? InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[40]

Billion-scale similarity search with GPUs.IEEE Transactions on Big Data, 7(3):535–547, 2019

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs.IEEE Transactions on Big Data, 7(3):535–547, 2019

work page 2019
[41]

DSPy: Compiling declarative lan- guage model calls into self-improving pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative lan- guage model calls into self-improving pipelines. InThe Twelfth International Conference on Learning Representati...

work page 2024
[42]

PinchBench: Benchmarking LLM models as OpenClaw coding agents

Kilo Code. PinchBench: Benchmarking LLM models as OpenClaw coding agents. https://github.com/pinchbench/skill, 2025. 23 real-world agent tasks spanning scheduling, email, research, coding, and multi-step workflows. Open-source grading via automated checks and LLM judge

[43] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[44] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, 2023

[45] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. In Proceedings of Machine Learning and Systems, volume 6, 2024

[46] Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-device training under 256KB memory. In Advances in Neural Information Processing Systems, volume 35, 2022

[47] Liquid AI. Liquid nanos: Frontier-grade performance on everyday devices. https://www.liquid.ai/blog/introducing-liquid-nanos-frontier-grade-performance-on-everyday-devices, 2025

[48] Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, et al. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905, 2024

[49] LM Studio. LM Studio. https://lmstudio.ai, 2024. Desktop application for running LLMs locally

[50] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. In The Fourteenth International Conference on Learning Representations, 2026

[51] Meta. ExecuTorch: End-to-end solution for enabling on-device inference capabilities. https://github.com/pytorch/executorch, 2024

[52] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024

[53] MimicLaw Contributors. MimicLaw. https://github.com/mimiclaw/mimiclaw, 2025. Persona-focused personal AI framework

[54] ML Energy Initiative. Zeus: ML energy measurement framework. https://ml.energy/zeus/, 2023. GPU energy measurement toolkit for fine-grained energy profiling of ML workloads

[55] MLC AI. MLC-LLM: Universal LLM deployment engine. https://github.com/mlc-ai/mlc-llm, 2023

[56] MLCommons. MLCommons inference benchmark. https://github.com/mlcommons/inference, 2024. Industry-standard inference benchmarking suite covering latency, throughput, and accuracy across hardware platforms

[57] Piero Molino, Yaroslav Dudin, and Sai Sumanth Miryala. Ludwig: a type-based declarative deep learning toolbox, 2019

[58] Ettore Di Giacinto Mudler. LocalAI: Open-source self-hosted alternative to OpenAI API. https://github.com/mudler/LocalAI, 2024

[59] NanoBot Contributors. NanoBot. https://github.com/nanobot-ai/nanobot, 2025. Minimalist personal AI agent in the OpenClaw ecosystem

[60] Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, and Christopher Ré. Minions: Cost-efficient collaboration between on-device and cloud language models. arXiv preprint arXiv:2502.15964, 2025

[61] National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0). Technical Report NIST AI 100-1, U.S. Department of Commerce, January 2023

[62] Nous Research. Hermes agent: The agent that grows with you. https://github.com/NousResearch/hermes-agent, 2025. Self-improving agent with FTS5 cross-session recall, Honcho user modeling, and autonomous skill creation. 40K+ GitHub stars as of April 2026

[63] NVIDIA. Nemotron-Flash: Towards latency-optimal hybrid small language models. arXiv preprint arXiv:2511.18890, 2025

[64] NVIDIA. Nemotron-Super-49B-v1. https://huggingface.co/nvidia/Nemotron-Super-49B-v1, 2025

[65] Ollama, Inc. Ollama. https://github.com/ollama/ollama, 2023. Local LLM serving platform. 162K+ GitHub stars as of March 2026

[66] OpenAI. Symphony: Multi-agent orchestration framework. https://github.com/openai/symphony, 2025. Multi-agent orchestration framework for coordinating agent workflows

[67] OpenAI. GPT-5.4: frontier reasoning and multimodal model. OpenAI Platform Documentation, 2026. https://platform.openai.com

[68] OWASP Foundation. OWASP top 10 for large language model applications, 2025. Version 2025. Accessed: April 2026

[69] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perfo...

[70] PicoClaw Contributors. PicoClaw. https://github.com/picoclaw/picoclaw, 2025. Lightweight variant of the OpenClaw personal AI ecosystem

[71] Konstantin F. Pilz, Yusuf Mahmood, and Lennart Heim. AI’s power requirements under exponential growth: Extrapolating AI data center power demand and assessing its potential impact on U.S. competitiveness. Research Report RR-A3572-1, RAND Corporation, 2025

[72] Qualcomm Technologies. Qualcomm Hexagon neural processing unit. https://www.qualcomm.com/products/technology/processors, 2024. Accessed: April 2026

[73] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

[74] Christopher Ré, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. Overton: A data system for monitoring and improving machine-learned products, 2019

[75] RJT1990. GeneralThoughtArchive: A large-scale dataset of reasoning traces, 2025. 431K reasoning traces with verifier scores. MIT license

[76] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In The Third Text REtrieval Conference (TREC-3). NIST, 1994

[77] Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, and Christopher Ré. Intelligence per watt: Measuring intelligence efficiency of local AI, 2026

[78] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734. Association for Computation...

[79] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[80] Will Sommer. Gartner predicts that by 2030, performing inference on an LLM with 1 trillion parameters will cost GenAI providers over 90% less than in 2025. Gartner Press Release, 2026. Accessed: April 2026


Showing first 80 references.

