pith. machine review for the scientific record.

arxiv: 2604.18413 · v2 · submitted 2026-04-20 · 💻 cs.SE

TypeScript Repository Indexing for Code Agent Retrieval

Pith reviewed 2026-05-10 04:05 UTC · model grok-4.3

classification 💻 cs.SE
keywords TypeScript, code indexing, UniAST, parser, code agents, repository indexing, compiler API, LSP

The pith

A parser built directly on the TypeScript Compiler API produces reliable UniAST indexes for large repositories much faster than LSP-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing ABCoder parsers for TypeScript combine lightweight AST analysis with language server calls, and that the per-symbol JSON-RPC lookups create a scaling bottleneck on codebases of hundreds of thousands to millions of lines. The new abcoder-ts-parser instead works directly with the TypeScript compiler's own AST, semantic model, and module resolver, eliminating those round-trips while still producing the function-level UniAST graph index. Evaluation on three open-source projects reaching 1.2 million lines demonstrates both substantially lower indexing times and indexes that remain reliable for downstream code-agent retrieval. If the approach holds, graph-based context retrieval that preserves call chains becomes practical for much larger TypeScript repositories than before.
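
To make the per-symbol cost concrete, here is a minimal sketch of the message a client would send for each lookup, assuming the standard LSP `textDocument/definition` request and the base protocol's `Content-Length` framing. The helper names (`frameDefinitionRequest`, `utf8ByteLength`), the file URI, and the positions are hypothetical; the paper does not show ABCoder's client code.

```typescript
// JSON-RPC 2.0 request shape for LSP's textDocument/definition method.
interface DefinitionRequest {
  jsonrpc: "2.0";
  id: number;
  method: "textDocument/definition";
  params: {
    textDocument: { uri: string };
    position: { line: number; character: number }; // zero-based
  };
}

// Content-Length counts bytes of the UTF-8 body, not UTF-16 code units.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// Hypothetical helper: frame one lookup as a header plus JSON body.
function frameDefinitionRequest(id: number, uri: string, line: number, character: number): string {
  const request: DefinitionRequest = {
    jsonrpc: "2.0",
    id,
    method: "textDocument/definition",
    params: { textDocument: { uri }, position: { line, character } },
  };
  const payload = JSON.stringify(request);
  return "Content-Length: " + utf8ByteLength(payload) + "\r\n\r\n" + payload;
}

// Hypothetical symbol positions in one file; a real index has far more.
const symbolPositions = [
  { line: 0, character: 16 },
  { line: 4, character: 9 },
  { line: 9, character: 2 },
];

// One framed request (and one awaited response) per symbol.
const framedRequests = symbolPositions.map((p, i) =>
  frameDefinitionRequest(i + 1, "file:///repo/src/index.ts", p.line, p.character));
```

Every entry in `framedRequests` stands for one request and one awaited response, so the number of round-trips grows with the number of symbols in the repository. That linear growth is the overhead the paper attributes to the LSP-based design.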

Core claim

The central claim is that abcoder-ts-parser, built directly on the TypeScript Compiler API, produces reliable UniAST indexes for TypeScript repositories up to 1.2 million lines of code significantly more efficiently than the existing architecture that augments AST parsers with language-server protocol calls.

What carries the argument

abcoder-ts-parser, which traverses the TypeScript compiler's native AST together with its semantic information and module-resolution logic to construct the function-level UniAST index without per-symbol RPC calls.
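
As a rough illustration of the in-process alternative, the sketch below resolves a cross-file call with the TypeScript Compiler API's type checker instead of an RPC. It is not the abcoder-ts-parser implementation: the in-memory two-file project, the `CallEdge` shape, and the `file#name` naming are assumptions made for the example, and only plain function declarations are handled.

```typescript
import * as ts from "typescript";

// Hypothetical two-file project held in memory (illustrative names only).
const sources: Record<string, string> = {
  "/a.ts": "export function helper(): number { return 1; }\n",
  "/b.ts": 'import { helper } from "./a";\nexport function main(): number { return helper(); }\n',
};

const options: ts.CompilerOptions = {
  noLib: true, // skip lib.d.ts so the sketch needs no files on disk
  types: [],   // do not scan for @types packages
  module: ts.ModuleKind.CommonJS,
};

// Compiler host that serves the in-memory files instead of reading the disk.
const host = ts.createCompilerHost(options);
host.getCurrentDirectory = () => "/";
host.fileExists = (fileName) => Object.prototype.hasOwnProperty.call(sources, fileName);
host.readFile = (fileName) => sources[fileName];
host.directoryExists = () => true;
host.getDirectories = () => [];
host.realpath = (path) => path;
host.getSourceFile = (fileName, languageVersionOrOptions) => {
  const text = sources[fileName];
  return text === undefined
    ? undefined
    : ts.createSourceFile(fileName, text, languageVersionOrOptions, true);
};

const program = ts.createProgram(["/b.ts"], options, host);
const checker = program.getTypeChecker();

// Assumed edge shape for this sketch; not the UniAST schema.
interface CallEdge { caller: string; callee: string }
const edges: CallEdge[] = [];

function visit(node: ts.Node, enclosing: string | undefined): void {
  if (ts.isFunctionDeclaration(node) && node.name) {
    enclosing = node.getSourceFile().fileName + "#" + node.name.text;
  }
  if (ts.isCallExpression(node) && enclosing !== undefined) {
    let symbol = checker.getSymbolAtLocation(node.expression);
    // Imported names are aliases; follow them to the original declaration.
    if (symbol && (symbol.flags & ts.SymbolFlags.Alias) !== 0) {
      symbol = checker.getAliasedSymbol(symbol);
    }
    const declaration = symbol && symbol.declarations && symbol.declarations[0];
    if (symbol && declaration) {
      edges.push({
        caller: enclosing,
        callee: declaration.getSourceFile().fileName + "#" + symbol.getName(),
      });
    }
  }
  ts.forEachChild(node, (child) => visit(child, enclosing));
}

for (const sourceFile of program.getSourceFiles()) {
  if (!sourceFile.isDeclarationFile) visit(sourceFile, undefined);
}
```

Here `checker.getSymbolAtLocation` and `checker.getAliasedSymbol` are ordinary function calls inside the same process, which is the property the paper relies on to avoid per-symbol round-trips. A real indexer would also need to cover methods, arrow functions, and overloads.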

If this is right

  • Graph-based retrieval that keeps call chains and dependency links becomes feasible for TypeScript codebases that previously timed out during indexing.
  • LLM code agents can obtain richer context from larger repositories without incurring the latency of repeated language-server lookups.
  • The UniAST index can be refreshed more frequently during development because each rebuild finishes in less time.
  • The same direct-compiler pattern removes a scaling obstacle that affects any system trying to maintain semantic code graphs at repository scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar direct-compiler parsers could be written for other languages whose compilers expose comparable semantic APIs, potentially improving indexing speed across more of the code-agent ecosystem.
  • Faster indexing may allow agents to maintain live indexes that update as the developer edits code rather than requiring full rebuilds between sessions.
  • If the indexes prove complete, downstream tasks such as impact analysis or automated refactoring could also benefit from the same lightweight graph structure.

Load-bearing premise

That the TypeScript Compiler API supplies all needed semantic relationships and call chains for the UniAST index without the completeness gaps that originally led to adding language-server calls.

What would settle it

A side-by-side run of the new parser and the prior LSP-based parser on one of the 1.2-million-line projects that records both wall-clock indexing time and a manual or automated check of whether the extracted call chains and dependencies match.
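
One way to run the correctness half of such a check is to treat each index as a set of caller-to-callee edges and diff the two sets. The sketch below assumes both parsers' outputs have already been flattened into a hypothetical `CallEdge` list; the edge data is invented for illustration and is not taken from the paper.

```typescript
// Assumed flattened edge shape; not the UniAST schema.
interface CallEdge { caller: string; callee: string }

interface EdgeDiff {
  shared: number;            // edges found by both parsers
  onlyInBaseline: string[];  // edges the new parser missed
  onlyInCandidate: string[]; // edges only the new parser found
}

function edgeKey(edge: CallEdge): string {
  return edge.caller + " -> " + edge.callee;
}

// Build a dictionary of edge keys, which also removes duplicates.
function toKeySet(edgeList: CallEdge[]): Record<string, boolean> {
  const keys: Record<string, boolean> = {};
  for (const edge of edgeList) keys[edgeKey(edge)] = true;
  return keys;
}

function diffEdges(baseline: CallEdge[], candidate: CallEdge[]): EdgeDiff {
  const baselineKeys = toKeySet(baseline);
  const candidateKeys = toKeySet(candidate);
  const onlyInBaseline = Object.keys(baselineKeys)
    .filter((key) => candidateKeys[key] !== true)
    .sort();
  const onlyInCandidate = Object.keys(candidateKeys)
    .filter((key) => baselineKeys[key] !== true)
    .sort();
  const shared = Object.keys(baselineKeys).length - onlyInBaseline.length;
  return { shared, onlyInBaseline, onlyInCandidate };
}

// Invented example data: an LSP-based baseline and a compiler-API candidate.
const lspEdges: CallEdge[] = [
  { caller: "src/main.ts#main", callee: "src/util.ts#helper" },
  { caller: "src/main.ts#main", callee: "src/log.ts#log" },
  { caller: "src/util.ts#helper", callee: "src/parse.ts#parse" },
];
const compilerEdges: CallEdge[] = [
  { caller: "src/main.ts#main", callee: "src/util.ts#helper" },
  { caller: "src/util.ts#helper", callee: "src/parse.ts#parse" },
  { caller: "src/util.ts#helper", callee: "src/format.ts#format" },
];

const edgeDiff = diffEdges(lspEdges, compilerEdges);
```

An empty `onlyInBaseline` list would indicate that the direct-compiler parser lost no edge the LSP-based parser found; any entries there are the completeness gaps worth inspecting by hand.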

Original abstract

Graph-based code indexing can improve context retrieval for LLM-based code agents by preserving call chains and dependency relationships that keyword search and similarity retrieval often miss. ABCoder is an open-source framework that parses codebases into a function-level code index called UniAST. Its existing parsers combine lightweight AST parsers for syntactic analysis with language servers for semantic resolution, but because LSP-based resolution requires a JSON-RPC call for each symbol lookup, these per-symbol calls become a bottleneck on large TypeScript repositories. We present abcoder-ts-parser, a TypeScript parser built on the TypeScript Compiler API that works directly with the compiler's AST, semantic information, and module resolution logic. We evaluate the parser on three open-source TypeScript projects with up to 1.2 million lines of code and find that it produces reliable indexes significantly more efficiently than the existing architecture. For a live demonstration, watch: https://youtu.be/ryssr7ouvdE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents abcoder-ts-parser, a TypeScript parser built on the TypeScript Compiler API (including its AST, semantic checker, and module resolver) to generate UniAST indexes for the ABCoder framework. It argues that this replaces the prior combination of lightweight AST parsers plus per-symbol LSP calls, which created bottlenecks on large repositories, and claims the new parser produces reliable indexes significantly more efficiently, as shown by evaluation on three open-source TypeScript projects with up to 1.2 million lines of code.

Significance. If the efficiency gains and index reliability hold, the work would provide a practical, scalable improvement to graph-based code indexing for LLM-based code agents, directly addressing a performance limitation in the existing ABCoder architecture for TypeScript codebases.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The central claim that the parser 'produces reliable indexes significantly more efficiently' is unsupported by any quantitative metrics, baselines, runtime numbers, accuracy measures, or error analysis; the abstract supplies only the project sizes and a qualitative assertion.
  2. [Motivation / §3] Motivation and §3 (Parser Design): The paper's own motivation notes that lightweight AST parsers had completeness gaps in semantic relationships and call chains, motivating LSP use; however, no verification (e.g., cross-file reference counts, call-graph edge comparison, or tests on tsconfig paths/declaration merging) is provided to confirm the Compiler API version achieves equivalent resolution.
  3. [Evaluation] Evaluation section: The claim of evaluation 'on three open-source TypeScript projects' lacks any description of the methodology, selected projects' characteristics (beyond LOC), or how 'reliability' was assessed relative to the prior LSP-augmented parsers.
minor comments (2)
  1. [Abstract] The live demo video link is helpful but the manuscript should include at least one self-contained code example or index snippet to illustrate the UniAST output.
  2. [§2 / §3] Notation for UniAST and the index structure could be clarified with a small diagram or table of node/edge types.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested improvements in the revised version.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claim that the parser 'produces reliable indexes significantly more efficiently' is unsupported by any quantitative metrics, baselines, runtime numbers, accuracy measures, or error analysis; the abstract supplies only the project sizes and a qualitative assertion.

    Authors: We agree that the abstract and Evaluation section currently rely on a qualitative assertion without supporting quantitative evidence. In the revised manuscript we will expand the abstract to reference key efficiency metrics and add to the Evaluation section concrete runtime comparisons against the prior LSP-augmented parser, indexing throughput numbers, accuracy measures, and error analysis drawn from the experiments on the three projects.
    Revision: yes

  2. Referee: [Motivation / §3] Motivation and §3 (Parser Design): The paper's own motivation notes that lightweight AST parsers had completeness gaps in semantic relationships and call chains, motivating LSP use; however, no verification (e.g., cross-file reference counts, call-graph edge comparison, or tests on tsconfig paths/declaration merging) is provided to confirm the Compiler API version achieves equivalent resolution.

    Authors: The motivation correctly identifies the semantic gaps that prompted LSP usage. Our design replaces per-symbol LSP calls with the Compiler API's native semantic checker and module resolver. We acknowledge that explicit verification would strengthen the equivalence claim; we will add a dedicated subsection (or expand §3) containing cross-file reference counts, call-graph edge comparisons, and targeted tests for tsconfig path resolution and declaration merging.
    Revision: yes

  3. Referee: [Evaluation] Evaluation section: The claim of evaluation 'on three open-source TypeScript projects' lacks any description of the methodology, selected projects' characteristics (beyond LOC), or how 'reliability' was assessed relative to the prior LSP-augmented parsers.

    Authors: We will substantially revise the Evaluation section to describe the evaluation methodology in detail, provide additional characteristics of the three projects (domain, architectural features, and TypeScript-specific constructs exercised), and explain how reliability was assessed, including direct side-by-side comparisons of reference resolution and call-chain completeness against the prior LSP-augmented parsers.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct implementation and empirical comparison with no derivations or self-referential reductions

The paper presents an engineering implementation of abcoder-ts-parser using the TypeScript Compiler API, replacing LSP-based resolution in the existing ABCoder framework, followed by runtime and scalability evaluation on three TypeScript projects. No equations, parameter fitting, uniqueness theorems, or ansatzes are present. Claims of 'reliable indexes' and efficiency gains rest on direct benchmarking against the prior architecture rather than any self-definitional loop or fitted-input prediction. The work is self-contained as a systems contribution without load-bearing self-citations that reduce the central result to prior unverified assertions by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the TypeScript Compiler API can serve as a complete drop-in replacement for LSP-based semantic resolution in the context of building UniAST indexes.

axioms (1)
  • domain assumption: The TypeScript Compiler API provides equivalent or superior semantic and dependency information to LSP for function-level indexing without requiring per-symbol RPC calls.
    Invoked in the design choice to replace LSP resolution.

pith-pipeline@v0.9.0 · 5456 in / 1157 out tokens · 27098 ms · 2026-05-10T04:05:12.806225+00:00 · methodology
