arxiv: 2605.09817 · v1 · submitted 2026-05-10 · 💻 cs.SE · cs.CR

Evaluating Tool Cloning in Agentic-AI Ecosystems

David Jiang, Neil Gong, Taein Kim, Yuepeng Hu, Yuqi Jia

Pith reviewed 2026-05-12 01:57 UTC · model grok-4.3

classification 💻 cs.SE cs.CR

keywords tool cloning · agentic AI · MCP repositories · Skills tools · implementation similarity · benchmark contamination · duplication measurement · provenance

The pith

Tool cloning creates widespread hidden duplication across public agent-tool repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures duplication by examining 8,861 repositories containing 100,011 tools from two major platforms. It applies lexical and fuzzy similarity metrics to all repository pairs and calibrates the results with manual review of sampled high-similarity cases. A sympathetic reader would care because inflated tool counts can distort how we evaluate agent capabilities and allow security problems to spread unnoticed. If correct, the finding means current datasets and benchmarks need to filter or label clones to produce reliable diversity and generalization numbers.

Core claim

The study performs the first large-scale audit of tool repositories in agentic AI ecosystems by computing pairwise lexical and fuzzy-structural similarity across all MCP-to-MCP, Skills-to-Skills, and cross-ecosystem pairs. High-similarity regions appear consistently, and manual verification of sampled pairs shows that 60 percent of high-Jaccard candidates and 85 percent of high-ssdeep candidates in the MCP ecosystem are true clones. These results demonstrate that cloning is a pervasive source of duplication that overstates ecosystem diversity and contaminates benchmark construction.

What carries the argument

A repository-level auditing pipeline that computes complementary lexical similarity and fuzzy-structural similarity metrics on all repository pairs, then calibrates true cloning rates through manual verification of 100 sampled pairs per ecosystem in each similarity bucket.
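The lexical half of such a pipeline can be illustrated with a minimal sketch. It assumes each repository has already been flattened to a single source string and that similarity is Jaccard overlap between token sets. The `tokenize` helper and its regular expression are hypothetical; the summary does not describe the paper's actual normalization, and the fuzzy-structural (ssdeep) half is not shown.

```python
import re

def tokenize(source: str) -> set[str]:
    """Hypothetical normalization: lowercase identifiers and integer literals."""
    return set(re.findall(r"[a-z_][a-z0-9_]*|\d+", source.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two toy "repositories" that differ only in one renamed variable.
repo_a = "def fetch(url): return requests.get(url).json()"
repo_b = "def fetch(uri): return requests.get(uri).json()"

# 6 shared tokens out of 8 distinct tokens overall.
score = jaccard(tokenize(repo_a), tokenize(repo_b))
print(f"Jaccard similarity: {score:.2f}")
```

A set-based score like this ignores token order and frequency, which is one reason a complementary fuzzy-hash metric and manual review are useful alongside it.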

If this is right

Raw tool counts in marketplaces substantially overstate actual diversity.
Benchmark splits risk including near-duplicate tools, biasing generalization measurements.
Vulnerable code from source repositories can propagate widely through clones.
Provenance tracking, attribution, and intellectual-property questions become harder to resolve.
Datasets and benchmarks must incorporate repository provenance and similarity checks to remain valid.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent platforms could add automated deduplication steps before listing new tools.
Security audits might focus first on frequently cloned repositories to catch widespread issues.
Benchmark creators could adopt similarity-aware train-test splits as standard practice.
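A similarity-aware split of the kind suggested above might look like the following sketch. It assumes a list of repository identifiers and a list of pairs already judged to be clones. Both inputs, and the function names `clone_groups` and `group_split`, are hypothetical and do not come from the paper. Clone pairs are merged into connected groups with union-find, and each group is assigned wholesale to one side of the split.

```python
import random

def clone_groups(repos, clone_pairs):
    """Merge repositories linked by clone pairs into groups (union-find)."""
    parent = {r: r for r in repos}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in clone_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for r in repos:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

def group_split(repos, clone_pairs, test_fraction=0.2, seed=0):
    """Assign whole clone groups to train or test so no group straddles both."""
    groups = clone_groups(repos, clone_pairs)
    random.Random(seed).shuffle(groups)
    target = test_fraction * len(repos)
    train, test = [], []
    for group in groups:
        (test if len(test) < target else train).extend(group)
    return train, test

repos = ["a", "b", "c", "d", "e"]
pairs = [("a", "b"), ("b", "c")]  # a, b, c form one clone group
train, test = group_split(repos, pairs)
```

Because groups are kept intact, the realized test fraction can overshoot the target when a large clone group lands in the test set; that is the price of avoiding leakage.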

Load-bearing premise

That lexical and fuzzy-structural similarity scores, after calibration on manually reviewed samples, reliably separate true cloning from independent but coincidentally similar code.

What would settle it

A full manual audit of every high-similarity pair that finds most of them are independently written implementations rather than clones.

Figures

Figures reproduced from arXiv: 2605.09817 by David Jiang, Neil Gong, Taein Kim, Yuepeng Hu, Yuqi Jia.

**Figure 2.** Functionality and description-space analysis of MCP tools. (a) MCP functionality dis…

**Figure 3.** Developer contribution distributions in the MCP and Skills ecosystems.

**Figure 4.** Pairwise repository similarity distributions across three comparison groups. The top row…

**Figure 5.** Distribution of MCP and Skills repository sizes measured by normalized source tokens.

**Figure 6.** Distribution of tool counts for the top 40 authors in the MCP tool ecosystem.

**Figure 7.** Distribution of skill counts for the top 40 authors in the Skills tool ecosystem.

**Figure 8.** Log-log distributions of developer contribution frequency. (a) MCP ecosystem. (b) Skills…

**Figure 9.** Distribution of MCP and Skills tools for authors present in both ecosystems (log scale).

**Figure 10.** Functionality and description-space analysis of Skills. (a) Skills functionality distribu…

Original abstract

Agent tools are becoming a core interface through which LLM agents access external data, services, and execution environments. As these tools are distributed through public marketplaces, raw tool counts may substantially overstate ecosystem diversity if many repositories are cloned, lightly modified, or derived from shared templates. Such hidden duplication can contaminate benchmark splits, propagate vulnerable implementations, bias measurements of tool-use generalization, and raise provenance, attribution, and intellectual-property concerns. We present, to our knowledge, the first large-scale measurement study of tool cloning in agentic AI ecosystems. We curate a unified dataset from multiple public platforms, covering 7,508 Model Context Protocol (MCP) repositories with 87,564 extracted tools and 1,353 Skills repositories with 12,447 tools, for a total of 8,861 repositories and 100,011 tool entries. To measure implementation-level duplication, we build a repository-level auditing pipeline using complementary lexical and fuzzy-structural similarity metrics, and compute pairwise similarity across MCP-to-MCP, Skills-to-Skills, and MCP-to-Skills repository pairs. We further manually verify 100 sampled pairs per MCP and Skills ecosystem across similarity-score buckets to calibrate how often high similarity reflects true code cloning. Our analysis shows that cloning is not an isolated artifact: high-similarity regions appear across comparison settings, and 60% of high-Jaccard candidates and 85% of high-ssdeep candidates in the MCP ecosystem are manually verified as clones. These results indicate that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems. They further suggest that agent-tool datasets and benchmarks should account for repository provenance and implementation similarity when measuring tool diversity or constructing evaluation splits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is the first large-scale measurement of tool cloning across agent repositories, with decent data collection but thin calibration on what counts as a clone.

The main takeaway is that they've assembled a big dataset of about 8,800 repositories and 100k tools from MCP and Skills platforms, then used Jaccard and ssdeep similarity to flag high-duplication pairs and manually checked 100 samples per ecosystem across score buckets. The reported 60% and 85% clone rates in the high-similarity tails suggest duplication is common enough to matter for benchmarks and diversity claims. That data collection step is the real contribution here; prior work apparently lacked this scale of cross-platform audit. The complementary metrics plus manual spot-checks give the numbers some grounding instead of relying on one fuzzy score alone.

The soft spot is the manual verification itself. One hundred pairs per ecosystem is a modest sample when you're dealing with millions of possible pairs and trying to bound false positives in the tail. The abstract mentions sampling across buckets, which helps, but without seeing the exact stratification, decision rules for clone versus boilerplate, or inter-rater numbers, it's hard to know how much the pervasiveness claim generalizes. Threshold choice and any curation bias in the initial repo collection could also shift the headline percentages.

This work is aimed at people building or evaluating agent-tool benchmarks who need to worry about hidden duplicates. It is worth a serious referee because the empirical question is timely and the dataset effort is non-trivial, even if the calibration details will need tightening in revision.

Referee Report

1 major / 1 minor

Summary. The paper conducts the first large-scale empirical study of tool cloning in agentic AI ecosystems. It curates a dataset of 7,508 MCP repositories (87,564 tools) and 1,353 Skills repositories (12,447 tools), applies Jaccard and ssdeep similarity metrics to all pairwise repository comparisons, and manually verifies 100 sampled pairs per ecosystem across similarity-score buckets. The analysis finds high-similarity regions in every comparison setting and verifies 60% of high-Jaccard and 85% of high-ssdeep MCP candidates as clones. It concludes that tool cloning is pervasive and recommends that benchmarks account for repository provenance and implementation similarity.

Significance. If the manual verification reliably distinguishes cloning from coincidental similarity, this study would be significant for the field by quantifying hidden duplication in tool ecosystems at scale. The dataset size (over 100k tools) and complementary lexical/fuzzy metrics provide a solid foundation for the measurement. The findings could influence how tool diversity is measured and how evaluation splits are constructed in agent benchmarks, addressing issues like contamination and bias. The purely empirical approach with no fitted parameters or circular derivations is a strength.

major comments (1)

[Manual verification procedure (Results section)] The pervasiveness claim (60% of high-Jaccard and 85% of high-ssdeep candidates verified as clones) depends on manual verification of only 100 pairs per ecosystem. With ~28M possible MCP repository pairs, this sample size is too small to reliably calibrate false-positive rates in the high-similarity tail. The manuscript provides no details on sampling stratification across score buckets, inter-rater agreement, or explicit decision criteria for classifying pairs as clones versus coincidental similarity (e.g., shared boilerplate or common libraries). This under-calibration directly undermines the reliability of interpreting high similarity scores as evidence of pervasive cloning.
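The scale argument in this comment can be made concrete with a short calculation. The repository counts come from the abstract; the per-bucket sample sizes passed to `wilson_interval` are hypothetical, since the summary does not say how many of the 100 verified pairs fell into each high-similarity bucket. The Wilson score interval is used here only as one standard way to express the uncertainty of a proportion estimated from a small sample, not as the paper's method.

```python
import math

def pair_count(n: int) -> int:
    """Number of unordered pairs among n repositories: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

mcp_pairs = pair_count(7508)      # 28,181,278 MCP-to-MCP pairs
skills_pairs = pair_count(1353)   # Skills-to-Skills pairs
cross_pairs = 7508 * 1353         # MCP-to-Skills pairs

# Hypothetical bucket sizes: how wide is the interval around a 60% clone rate?
for n in (20, 50, 100):
    low, high = wilson_interval(round(0.6 * n), n)
    print(f"n={n:3d}: 60% observed, interval {low:.2f} to {high:.2f}")
```

The interval narrows as the bucket sample grows, which is why the per-bucket counts matter for judging how tightly the 60% and 85% figures are pinned down.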

minor comments (1)

[Abstract and §4 (Methodology)] The abstract states verification occurs 'across similarity-score buckets' but the main text should explicitly define the bucket boundaries, the total number of high-similarity candidates, and the precise sampling method to support reproducibility.
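To show what a reproducible description of bucketed sampling could pin down, here is a minimal sketch of stratified sampling over similarity scores. The bucket edges, the per-bucket quota, and the function names are all hypothetical; the summary does not report the boundaries or proportions the authors used.

```python
import random
from collections import defaultdict

def bucket_of(score: float, edges: list[float]) -> int:
    """Index of the bucket [edges[i], edges[i+1]) holding score; the last bucket includes its right edge."""
    if not edges[0] <= score <= edges[-1]:
        raise ValueError(f"score {score} outside [{edges[0]}, {edges[-1]}]")
    for i in range(len(edges) - 2):
        if score < edges[i + 1]:
            return i
    return len(edges) - 2

def stratified_sample(scored_pairs, edges, per_bucket, seed=0):
    """Draw up to per_bucket pairs uniformly at random from each score bucket."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    buckets = defaultdict(list)
    for pair, score in scored_pairs:
        buckets[bucket_of(score, edges)].append(pair)
    return {
        i: rng.sample(pool, min(per_bucket, len(pool)))
        for i, pool in sorted(buckets.items())
    }

# Toy input: 101 repository pairs with scores 0.00, 0.01, ..., 1.00.
scored = [((f"repo{i}", f"other{i}"), i / 100) for i in range(101)]
edges = [0.0, 0.5, 0.8, 1.0]  # hypothetical bucket boundaries
sample = stratified_sample(scored, edges, per_bucket=5)
```

Reporting the equivalents of `edges`, `per_bucket`, and the seed, together with each bucket's population size, would let readers reweight the verified clone rates to the full candidate set.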

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the single major comment below regarding the manual verification procedure, providing clarification on our approach while agreeing to enhance the manuscript with additional methodological details.

Referee: The pervasiveness claim (60% of high-Jaccard and 85% of high-ssdeep candidates verified as clones) depends on manual verification of only 100 pairs per ecosystem. With ~28M possible MCP repository pairs, this sample size is too small to reliably calibrate false-positive rates in the high-similarity tail. The manuscript provides no details on sampling stratification across score buckets, inter-rater agreement, or explicit decision criteria for classifying pairs as clones versus coincidental similarity (e.g., shared boilerplate or common libraries). This under-calibration directly undermines the reliability of interpreting high similarity scores as evidence of pervasive cloning.

Authors: We appreciate the referee's emphasis on methodological transparency for the manual verification. Our sampling of 100 pairs per ecosystem was stratified across similarity-score buckets to concentrate on the high-similarity tail, where the distinction between cloning and coincidental similarity is most critical for our pervasiveness conclusions. This targeted calibration is appropriate for interpreting the metrics in the regions of interest, rather than requiring exhaustive sampling from the full ~28 million pairs. However, we agree that the manuscript would benefit from greater detail on the exact bucket-wise sampling proportions, the explicit decision criteria (including how boilerplate, shared libraries, and common dependencies were handled), and any inter-rater agreement measures. In the revised manuscript, we will expand the relevant sections to include a full description of the verification protocol, the classification rubric, and clarification on the verification process. These additions will strengthen the presentation without changing the reported verification rates or core findings.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

The paper conducts a purely empirical measurement study: it curates a dataset of repositories and tools, applies lexical and fuzzy similarity metrics to compute pairwise scores, and manually verifies a sample of high-similarity pairs. No derivations, equations, fitted parameters presented as predictions, or self-referential steps exist that would reduce the central claims about cloning prevalence to inputs by construction. The findings rest directly on the collected data and verification process, with no load-bearing self-citations or ansatzes that create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that similarity metrics detect cloning and that the curated public-platform dataset is representative of agentic AI tool ecosystems.

axioms (1)

domain assumption: High lexical and fuzzy-structural similarity between repositories indicates code cloning rather than independent development.
Rationale: core premise of the auditing pipeline and manual verification calibration.

pith-pipeline@v0.9.0 · 5609 in / 1104 out tokens · 58426 ms · 2026-05-12T01:57:26.145080+00:00 · methodology
