pith. sign in

arxiv: 2606.00448 · v1 · pith:5R5BMNXVnew · submitted 2026-05-30 · 💻 cs.SE · cs.AI· cs.CR

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Pith reviewed 2026-06-28 18:44 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CR
keywords compositional riskagent skillsLLM agentssafety measurementskill compositionexploitabilitystatic analysisinstall-time checks
0
0 comments X

The pith

Individually safe skills form genuine compositional risks in agent systems, with 18.2% of flagged pairs validated as real.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether skills that pass individual safety checks can still combine to enable unsafe behaviors in LLM agents. It introduces SkillReact, a framework that first applies a deterministic benchmark to flag structural pair patterns across over 200,000 combinations from a large skill registry. Human adjudication then calibrates the raw flag rate, showing that roughly one in five candidates represents an actual risk. Experiments with an action harness demonstrate that models vary in whether they actually issue the risky tool calls, with some completing full chains and others refusing. The results indicate that per-skill inspection alone leaves thousands of risks undetected by design.

Core claim

On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% as structural candidates. A pattern-stratified audit finds a population-weighted validity of 18.2%, implying about 14K genuine risk memberships missed by per-skill scanning. The action-based harness shows realization is gated by host-model disposition: Haiku-4-5 issues the full download-then-execute chain on 36 of 39 trials, Opus-4-7 stops at download, and Sonnet-4-6 refuses. A control holding the request fixed finds highest compliance with no skills installed.

What carries the argument

SkillReact, a compositional security measurement framework with a deterministic static-composition benchmark to flag pair-pattern hits, a two-rater LLM-assisted human-adjudication pipeline to validate them, and an action-based exploitability harness to test model-issued tool calls.

If this is right

  • Per-skill scanning misses compositional risks by construction because every pair is individually safe.
  • Whether a compositional risk becomes an executed tool call depends on the host model's disposition.
  • Compliance with risky requests is highest when no skills are installed.
  • Install-time compositional checks are needed as complements to per-skill scanning.
  • Capability isolation can limit which combined capabilities remain reachable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Skill registries could add automated pair-wise checks at install time to catch risks the current per-skill process misses.
  • The 18.2% validity rate offers a starting baseline for estimating hidden risks in other agent skill collections.
  • Safety outcomes depend on the interaction between installed skills and the specific underlying model, suggesting model-specific testing of skill sets.
  • Extending the harness to additional models could map which model families are more or less prone to executing compositional exploits.

Load-bearing premise

The two-rater LLM-assisted human-adjudication pipeline accurately identifies which structural candidates are genuine compositional risks rather than false positives.

What would settle it

Independent re-adjudication of the same pattern-stratified sample of flagged pairs by new human raters yielding a validity rate substantially different from 18.2%.

Figures

Figures reproduced from arXiv: 2606.00448 by Haoran Yu, Jingzhou Xu, Junxian You, Lifei Liu, Pin Qian, Su Wang, Xiaochong Jiang, Xiaoyuan Wang, Yihang Chen.

Figure 1
Figure 1. Figure 1: SkillReact in one view. (a) Per-skill safe ̸= set-safe: the same capability policy—flag an object O iff {FR, NO} ⊆ Cap(O) (file read ∧ network out, one of the 10 forbidden patterns)—clears each skill individually (Cap(A) = {FR}, Cap(B) = {NO}: both PASS), yet flags their co-installed union (Cap(S) = {FR, NO}, so {FR, NO} ⊆ Cap(S): FLAG); only the security object changes (skill → installed set), not the rul… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of capabilities across 1,520 analyzed [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Capability co-occurrence heatmap across 1,520 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
read the original abstract

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the SkillReact framework to measure compositional safety risks in LLM agent skill ecosystems, where individually safe skills may form unsafe installed sets. On 1,520 ClawHub skills (651 passing individual checks), it generates 211,575 pairs, flags 22.25% as structural candidates via a deterministic static benchmark, and calibrates via a pattern-stratified two-rater LLM-assisted human audit to a population-weighted validity of 18.2%, implying ~14K genuine risk memberships missed by per-skill scanning. An action-based harness on an anchor-conditioned dropper subset shows model-dependent realization (e.g., Haiku-4-5 issues the full chain on all 39 trials; Opus-4-7 stops at download; Sonnet-4-6 refuses), with a control showing highest compliance with no skills installed.

Significance. If the 18.2% validity holds, the result provides a quantitative, falsifiable estimate of compositional risks that per-skill inspection misses by construction and demonstrates that host-model disposition gates exploitability even when capabilities are installed. The deterministic benchmark and concrete, reproducible harness trials on three specific models are strengths that support follow-up verification.

major comments (1)
  1. [Abstract] Abstract, paragraph on calibration against human judgment: the headline population-weighted validity of 18.2% (and the derived claim of ~14K genuine risk memberships) is produced by the two-rater LLM-assisted human-adjudication pipeline, but the manuscript supplies no inter-rater agreement statistics, explicit adjudication rubric, calibration examples, or breakdown of LLM assistance versus human override. This is load-bearing for the central measurement claim.
minor comments (2)
  1. [Abstract] Abstract: the 'dropper subset' and 'anchor-conditioned' phrasing are introduced without a preceding definition or cross-reference to the methods section.
  2. [Abstract] Abstract: the control experiment ('a control that holds the request fixed and varies only the installed skills') would benefit from an explicit statement of the fixed prompt text used.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for identifying a transparency gap in our presentation of the central calibration result. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph on calibration against human judgment: the headline population-weighted validity of 18.2% (and the derived claim of ~14K genuine risk memberships) is produced by the two-rater LLM-assisted human-adjudication pipeline, but the manuscript supplies no inter-rater agreement statistics, explicit adjudication rubric, calibration examples, or breakdown of LLM assistance versus human override. This is load-bearing for the central measurement claim.

    Authors: We agree the omission weakens the manuscript. The two-rater pipeline is described at a high level in Section 4.2, but the abstract and calibration paragraph indeed contain none of the requested details. In the revised version we will (1) insert a concise methods clause into the abstract, (2) add a dedicated subsection (or expanded paragraph) in Section 4.2 that reports inter-rater agreement (Cohen’s κ), (3) reproduce the full adjudication rubric, (4) provide three representative calibration examples that distinguish LLM-assisted suggestions from human overrides, and (5) tabulate the fraction of cases in which the human raters overruled the LLM. These additions will make the 18.2 % population-weighted validity directly verifiable without altering the numerical result itself. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claim rests on external human adjudication

full rationale

The derivation chain begins with a deterministic static-composition benchmark that flags 22.25% of pairs as structural candidates, then applies an external two-rater LLM-assisted human-adjudication pipeline to a pattern-stratified sample to obtain the population-weighted validity of 18.2%. This validity rate is produced by human judgment rather than any equation, fit, or self-citation internal to the paper. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, ansatz smuggling, or renaming of known results appear in the abstract or described methodology. The measurement is therefore self-contained against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the accuracy of the human adjudication pipeline and the assumption that the ClawHub sample and pattern-stratified audit generalize to the full set of flagged pairs.

axioms (1)
  • domain assumption The deterministic static-composition benchmark functions as a high-recall scanner whose false-positive rate can be calibrated by human review.
    Used to generate the initial 22.25% candidate rate before human calibration.

pith-pipeline@v0.9.1-grok · 5879 in / 1323 out tokens · 25069 ms · 2026-06-28T18:44:04.075434+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

    cs.LG 2026-06 unverdicted novelty 7.0

    High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.

  2. Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

    cs.LG 2026-06 unverdicted novelty 6.0

    GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.

  3. Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

    cs.CL 2026-06 unverdicted novelty 6.0

    First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.

Reference graph

Works this paper leans on

55 extracted references · 29 canonical work pages · cited by 3 Pith papers · 19 internal anchors

  1. [1]

    Extend Claude with skills

    Anthropic, “Extend Claude with skills.” https://code.claude.com/do cs/en/skills, 2026. Claude Code documentation

  2. [2]

    Equipping agents for the real world with Agent Skills

    Anthropic, “Equipping agents for the real world with Agent Skills.” https://www.anthropic.com/engineering/equipping-agents-for-the-rea l-world-with-agent-skills, 2025. Anthropic Engineering

  3. [3]

    "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

    Y . Liu, Z. Chen, Y . Zhang, G. Deng, Y . Li, J. Ning, Y . Zhang, and L. Y . Zhang, “Malicious agent skills in the wild: A large-scale security empirical study,”arXiv preprint arXiv:2602.06547, 2026

  4. [4]

    Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

    Y . Liu, W. Wang, R. Feng, Y . Zhang, G. Xu, G. Deng, Y . Li, and L. Zhang, “Agent skills in the wild: An empirical study of security vulnerabilities at scale,”arXiv preprint arXiv:2601.10338, 2026

  5. [5]

    SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

    Y . Hou, Z. Yang, Z. Pang, and X. Ma, “SkillSieve: A hierarchical triage framework for detecting malicious AI agent skills,”arXiv preprint arXiv:2604.06550, 2026

  6. [6]

    Formal analysis and supply chain security for agentic AI skills.CoRR, abs/2603.00195, 2026

    V . P. Bhardwaj, “Formal analysis and supply chain security for agentic AI skills,”arXiv preprint arXiv:2603.00195, 2026

  7. [7]

    Snyk finds prompt injection in 36%, 1467 malicious payloads in a ToxicSkills study of agent skills supply chain compromise,

    L. Beurer-Kellner, A. Kudrinskii, M. Milanta, K. B. Nielsen, H. Sarkar, and L. Tal, “Snyk finds prompt injection in 36%, 1467 malicious payloads in a ToxicSkills study of agent skills supply chain compromise,”Snyk Blog, Feb. 5, 2026

  8. [8]

    From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

    Y . Huanget al., “From component manipulation to system compro- mise: Understanding and detecting malicious MCP servers,”arXiv preprint arXiv:2604.01905, 2026

  9. [9]

    Safety is non-compositional: A formal framework for capability-based AI systems,

    C. Spera, “Safety is non-compositional: A formal framework for capability-based AI systems,”arXiv preprint arXiv:2603.15973, 2026

  10. [10]

    ClawHub: Skill and plugin registry for OpenClaw

    OpenClaw, “ClawHub: Skill and plugin registry for OpenClaw.” http s://github.com/openclaw/clawhub, 2026. Public agent-skill registry

  11. [11]

    MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning,

    Z. Wang, H. Du, G. Shi, J. Zhang, H. Cheng, Y . Yao, K. Guo, and X.- Y . Li, “MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning,”arXiv preprint arXiv:2508.20412, 2025

  12. [12]

    How Your Credentials Are Leaked by LLM Agent Skills: An Empirical Study

    Z. Chen, Y . Zhang, Y . Liu, G. Deng, Y . Li, Y . Zhang, J. Ning, L. Y . Zhang, L. Ma, and Z. Li, “Credential leakage in LLM agent skills: A large-scale empirical study,”arXiv preprint arXiv:2604.03070, 2026

  13. [13]

    Prompt injection attacks on agentic coding assistants: A systematic analysis of vulnerabilities in skills, tools, and protocol ecosystems,

    N. Maloyan and D. Namiot, “Prompt injection attacks on agentic coding assistants: A systematic analysis of vulnerabilities in skills, tools, and protocol ecosystems,”arXiv preprint arXiv:2601.17548, 2026

  14. [14]

    Prompt injection attack to tool selection in LLM agents,

    J. Shi, Z. Yuan, G. Tie, P. Zhou, N. Z. Gong, and L. Sun, “Prompt injection attack to tool selection in LLM agents,”Proceedings of NDSS, 2026

  15. [15]

    BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

    G. Tie, J. Shi, P. Zhou, and L. Sun, “BadSkill: Backdoor at- tacks on agent skills via model-in-skill poisoning,”arXiv preprint arXiv:2604.09378, 2026

  16. [16]

    Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    Y . Qu, Y . Liu, T. Geng, G. Deng, Y . Li, L. Y . Zhang, Y . Zhang, and L. Ma, “Supply-chain poisoning attacks against LLM coding agent skill ecosystems,”arXiv preprint arXiv:2604.03081, 2026

  17. [17]

    SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

    X. Jia, J. Liao, S. Qin, J. Gu, W. Ren, X. Cao, Y . Liu, and P. Torr, “SkillJect: Effectively automating skill-based prompt injection for skill-enabled agents,”arXiv preprint arXiv:2602.14211, 2026

  18. [18]

    Noninterference and the composability of security properties,

    D. McCullough, “Noninterference and the composability of security properties,” inProceedings of the IEEE Symposium on Security and Privacy, 1988

  19. [19]

    A lattice model of secure information flow,

    D. E. Denning, “A lattice model of secure information flow,”Com- munications of the ACM, vol. 19, no. 5, pp. 236–243, 1976

  20. [20]

    The protection of information in computer systems,

    J. H. Saltzer and M. D. Schroeder, “The protection of information in computer systems,”Proceedings of the IEEE, vol. 63, no. 9, pp. 1278– 1308, 1975

  21. [21]

    COVERT: Com- positional analysis of Android inter-app permission leakage,

    H. Bagheri, A. Sadeghi, J. Garcia, and S. Malek, “COVERT: Com- positional analysis of Android inter-app permission leakage,”IEEE Transactions on Software Engineering, vol. 41, no. 9, 2015

  22. [22]

    MR- Droid: A scalable and prioritized analysis of inter-app communication risks,

    F. Liu, H. Cai, G. Wang, D. Yao, K. O. Elish, and B. G. Ryder, “MR- Droid: A scalable and prioritized analysis of inter-app communication risks,” inProceedings of the IEEE Mobile Security Technologies Workshop (MoST), 2017

  23. [23]

    Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

    Y . Qiao, D. Liu, H. Yang, W. Zhou, and S. Hu, “Agent tools orchestration leaks more: Dataset, benchmark, and mitigation,”arXiv preprint arXiv:2512.16310, 2025

  24. [24]

    AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,

    E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fis- cher, and F. Tram`er, “AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,”Advances in Neural Information Processing Systems, 2024

  25. [25]

    Agent security bench (ASB): Formalizing and bench- marking attacks and defenses in LLM-based agents,

    H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, and Y . Zhang, “Agent security bench (ASB): Formalizing and bench- marking attacks and defenses in LLM-based agents,” inInternational Conference on Learning Representations (ICLR), 2025

  26. [26]

    InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents,

    Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents,”Findings of ACL, 2024

  27. [27]

    Defeating Prompt Injections by Design

    E. Debenedettiet al., “Defeating prompt injections by design (CaMeL),”arXiv preprint arXiv:2503.18813, 2025

  28. [28]

    In2025 IEEE Conference on Se- cure and Trustworthy Machine Learning (SaTML), pages 23–42

    S. Chennabasappa, C. Nikolaidis, D. Song, D. Molnar, S. Ding, S. Wan, S. Whitman, L. Deason, N. Doucette, A. Montilla, A. Gampa, B. de Paola, D. Gabi, J. Crnkovich, J.-C. Testud, K. He, R. Chaturvedi, W. Zhou, and J. Saxe, “LlamaFirewall: An open source guardrail system for building secure AI agents,”arXiv preprint arXiv:2505.03574, 2025

  29. [29]

    AGrail: A lifelong agent guardrail with effective and adaptive safety detection,

    W. Luo, S. Dai, X. Liu, S. Banerjee, H. Sun, M. Chen, and C. Xiao, “AGrail: A lifelong agent guardrail with effective and adaptive safety detection,”Proceedings of ACL, 2025

  30. [30]

    AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents,

    H. Wang, C. M. Poskitt, and J. Sun, “AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents,”Proceedings of ICSE, 2026

  31. [31]

    Progent: Securing AI Agents with Privilege Control

    T. Shi, J. He, Z. Wang, H. Li, L. Wu, W. Guo, and D. Song, “Progent: Securing AI agents with privilege control,”arXiv preprint arXiv:2504.11703, 2026

  32. [32]

    Agent governance toolkit: Open-source runtime security for AI agents

    Microsoft, “Agent governance toolkit: Open-source runtime security for AI agents.” https://opensource.microsoft.com/blog/2026/04/02/int roducing-the-agent-governance-toolkit-open-source-runtime-securit y-for-ai-agents/, 2026

  33. [33]

    Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

    L. Lin, J. You, Y . Li, L. Lin, Y . Wang, Z. Zhang, and M. Zheng, “Reflect-guard: Enhancing LLM safeguards against adversarial prompts via logical self-reflection,”arXiv preprint arXiv:2605.24834, 2026

  34. [34]

    OW ASP agentic skills top 10

    OW ASP Foundation, “OW ASP agentic skills top 10.” https://owasp. org/www-project-agentic-skills-top-10/, 2026

  35. [35]

    A Security Analysis of the OpenClaw AI Agent Framework

    S. Suwansathit, Y . Zhang, and G. Gu, “A security analysis of the OpenClaw AI agent framework,”arXiv preprint arXiv:2603.27517, 2026

  36. [36]

    SOK: A Taxonomy of Attack Vectors and Defense Strategies for Agentic Supply Chain Runtime

    X. Jiang, S. Yang, W. Yang, Y . Liu, and C. Ji, “SOK: A taxonomy of attack vectors and defense strategies for agentic supply chain runtime,”arXiv preprint arXiv:2602.19555, 2026

  37. [37]

    Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning,

    N. Xu, Y . Jiang, S. R. Dipta, and H. Zhang, “Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning,”arXiv preprint arXiv:2509.23292, 2026

  38. [38]

    SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

    Y . Jiang and F. Ferraro, “SCRIBE: Structured mid-level supervision for tool-using language models,”arXiv preprint arXiv:2601.03555, 2026

  39. [39]

    From generation to judgment: Opportunities and challenges of LLM-as-a-judge,

    D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y . Jiang, C. Chen, T. Wu,et al., “From generation to judgment: Opportunities and challenges of LLM-as-a-judge,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 2757–2791, 2025

  40. [40]

    Does tone change the answer? evaluating prompt politeness effects on modern LLMs: GPT, Gemini, Llama,

    H. Cai, B. Shen, L. Jin, L. Hu, and X. Fan, “Does tone change the answer? evaluating prompt politeness effects on modern LLMs: GPT, Gemini, Llama,”arXiv preprint arXiv:2512.12812, 2025

  41. [41]

    Performance-efficiency trade-offs in human preference prediction: A comparative study of traditional machine learning and large language models,

    Y . Zhang, Z. Xiang, and H. Xu, “Performance-efficiency trade-offs in human preference prediction: A comparative study of traditional machine learning and large language models,” inProceedings of the 31st IEEE Symposium on Computers and Communications (ISCC), IEEE, 2026

  42. [42]

    The lethal trifecta for AI agents: Private data, untrusted content, and external communication

    S. Willison, “The lethal trifecta for AI agents: Private data, untrusted content, and external communication.” https://simonwillison.net/20 25/Jun/16/the-lethal-trifecta/, 2025. Blog post, 16 June 2025

  43. [43]

    The measurement of observer agree- ment for categorical data,

    J. R. Landis and G. G. Koch, “The measurement of observer agree- ment for categorical data,”Biometrics, vol. 33, no. 1, pp. 159–174, 1977

  44. [44]

    Task-specific efficiency analysis: When small language models outperform large language models,

    J. Cao, Y . Ma, X. Li, Q. Ren, and X. Chen, “Task-specific efficiency analysis: When small language models outperform large language models,”arXiv preprint arXiv:2603.21389, 2026

  45. [45]

    CoDES: A context-efficient framework for enhancing small language models via domain-specific adaptation and model ensembling,

    L. Hu, Y . Xin, B. Shen, H. Cai, and L. Jin, “CoDES: A context-efficient framework for enhancing small language models via domain-specific adaptation and model ensembling,”Preprints, 10.20944/preprints202603.1152.v1, 2026

  46. [46]

    Toward sustainable on-device intelligence: A survey on energy-efficient RAG systems with small language models,

    Z. Cheng, L. Lai, Y . Liu, and Y . Sun, “Toward sustainable on-device intelligence: A survey on energy-efficient RAG systems with small language models,”Available at SSRN 6698538, 2026

  47. [47]

    DRP: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models,

    Y . Jiang, D. Li, and F. Ferraro, “DRP: Distilled reasoning pruning with skill-aware step decomposition for efficient large reasoning models,” inFindings of the Association for Computational Linguistics (ACL), 2026

  48. [48]

    Syner- gized data efficiency and compression (SEC) optimization for large language models,

    X. Li, Y . Ma, Y . Huang, X. Wang, Y . Lin, and C. Zhang, “Syner- gized data efficiency and compression (SEC) optimization for large language models,” in2024 4th International Conference on Electronic Information Engineering and Computer Science (EIECS), pp. 586– 591, 2024

  49. [49]

    Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

    Y . Jiang, R. Li, S. R. Dipta, D. Li, and Z. Yang, “Cornerstones or stumbling blocks? deciphering the rock tokens in on-policy distilla- tion,”arXiv preprint arXiv:2605.09253, 2026

  50. [50]

    ChainCaps: Composition-safe tool-using agents via monotonic capability attenua- tion,

    X. Jiang, S. Yang, Z. Li, L. Liu, H. Yu, and Y . Liu, “ChainCaps: Composition-safe tool-using agents via monotonic capability attenua- tion,” inICML 2026 Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD), 2026

  51. [51]

    Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

    J. Zang, Y . Wei, R. Bai, S. Jiang, N. Mo, B. Li, Q. Sun, and H. Liu, “Reward auditor: Inference on reward modeling suitability in real- world perturbed scenarios,”arXiv preprint arXiv:2512.00920, 2025

  52. [52]

    Alleviating attention hacking in discriminative re- ward modeling through interaction distillation,

    J. Zang, “Alleviating attention hacking in discriminative re- ward modeling through interaction distillation,”arXiv preprint arXiv:2508.02618, 2025

  53. [53]

    AI for Auto-Research: Roadmap & User Guide

    L. Kong, X. Sun, W. Chow, L. Li, K. Q. Lin, X. B. Zhang, S. Wang, R. Li, Q. Wu, W. Gao, Y . Wang, S. Xie, J. Liu, L. Qu, S. Li, L. X. Ng, B. R. Cottereau, Z. Liu, T.-S. Chua, and W. T. Ooi, “AI for auto- research: Roadmap & user guide,”arXiv preprint arXiv:2605.18661, 2026

  54. [54]

    Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval

    Z. Cheng, L. Lai, and Y . Liu, “Resolving the robustness-precision trade-off in financial RAG through hybrid document-routed retrieval,” arXiv preprint arXiv:2603.26815, 2026

  55. [55]

    Enhancing financial report question-answering: A retrieval-augmented generation system with reranking analysis,

    Z. Cheng, L. Lai, Y . Liu, K. Cheng, and X. Qi, “Enhancing financial report question-answering: A retrieval-augmented generation system with reranking analysis,” in6th International Conference on Electri- cal, Computer and Energy Technologies (ICECET), 2026