arxiv: 2605.02162 · v1 · submitted 2026-05-04 · 💻 cs.DC · cs.MA

Recognition: unknown

AAFLOW: Scalable Patterns for Agentic AI Workflows

Arup Kumar Sarker , Mills Staylor , Aymen Alsaadi , Gregor von Laszewski , Shantenu Jha , Geoffrey Fox

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:21 UTC · model grok-4.3

classification 💻 cs.DC cs.MA

keywords agentic workflowsdistributed runtimezero-copy data planeoperator abstractionscalabilitydata flow optimizationLLM systems

0 comments

The pith

AAFLOW models agentic workflows as operators in a distributed runtime to deliver up to 4.64 times pipeline speedup through zero-copy data flows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix scalability and reproducibility problems in agentic AI systems that combine retrieval, reasoning, and memory by replacing fragmented data orchestration with a formal execution model drawn from high-performance computing. It demonstrates that representing workflows as operators, paired with a zero-copy data plane, removes serialization overhead and improves batching and communication without any change to LLM inference speed. This approach would let large-scale agent deployments run more reliably and efficiently on distributed hardware by focusing gains on data movement rather than model computation.

Core claim

AAFLOW is a unified distributed runtime that creates communication-efficient execution plans by modeling agentic workflows as an operator abstraction. Using Apache Arrow and Cylon, it builds a zero-copy data plane for direct interoperability between preprocessing, embedding, and vector retrieval. Resource-deterministic scheduling and asynchronous batching reduce coordination costs, yielding up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases while LLM generation throughput stays comparable; the gains come from better data flow, batching, and communication efficiency rather than faster inference.

What carries the argument

Operator abstraction of agentic workflows, realized through a zero-copy data plane built on Apache Arrow and Cylon plus resource-deterministic scheduling and asynchronous batching.

If this is right

Agentic workflows acquire a scalable and reproducible execution model that follows high-performance computing principles.
Data-intensive phases such as embedding and vector upsert improve by factors of 2.8 without any change to LLM generation speed.
Preprocessing, embedding, and retrieval stages interoperate directly without serialization costs.
Coordination overhead drops through deterministic resource scheduling and asynchronous batching in distributed settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same operator-plus-zero-copy pattern could be applied to other data-heavy AI pipelines such as fine-tuning or evaluation loops.
Standardizing on a common data plane might reduce fragmentation across different agent frameworks over time.
Testing the abstraction on workflows with highly variable or conditional branching would reveal how far the model generalizes.

Load-bearing premise

That agentic workflows can be effectively modeled as an operator abstraction without losing flexibility or that the experimental benchmarks represent typical real-world usage patterns.

What would settle it

A benchmark set of complex agentic workflows that cannot be expressed as operators without major loss of behavior, or real-world runs that show no measurable speedup once data patterns differ from the tested cases.

Figures

Figures reproduced from arXiv: 2605.02162 by Arup Kumar Sarker, Aymen Alsaadi, Geoffrey Fox, Gregor von Laszewski, Mills Staylor, Shantenu Jha.

**Figure 1.** Figure 1: End-to-End code flow of AAFLOW with memory operation. The RAG view at source ↗

**Figure 2.** Figure 2: Cylon Layered Architecture. From the bottom-up view, the Hardware view at source ↗

**Figure 3.** Figure 3: AAFLOW design incorporating a multilayered architecture with agentic view at source ↗

**Figure 4.** Figure 4: Advanced Async with Parallel Pipeline of AAFLOW. With batching view at source ↗

**Figure 5.** Figure 5: Framework benchmarking in each vital stage in the RAG pipeline view at source ↗

**Figure 6.** Figure 6: Strong scaling behavior across configurations for Load, Transform, view at source ↗

**Figure 7.** Figure 7: Weak scaling behavior across configurations for Load, Transform, view at source ↗

**Figure 8.** Figure 8: Strong and Weak scaling behavior across configurations for Total view at source ↗

read the original abstract

Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and non-deterministic execution. Although these frameworks increase flexibility, they don't have a formal execution model that adheres to the principles of high-performance computing. We introduce AAFLOW, a unified distributed runtime that creates communication-efficient execution plans by modeling agentic workflows as an operator abstraction. Using Apache Arrow and Cylon, AAFLOW creates a zero-copy data plane that allows direct interoperability between preprocessing, embedding, and vector retrieval without the need for serialization overhead. To lower coordination costs, it uses resource-deterministic scheduling and asynchronous batching. While retaining comparable LLM generation throughput, experimental results demonstrate up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases. Rather than LLM inference acceleration, these advantages result from enhanced data flow, batching, and communication efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AAFLOW applies operator abstraction and zero-copy data planes to agentic workflows but the speedup numbers lack supporting details.

read the letter

AAFLOW introduces a distributed runtime that treats agentic LLM workflows as operators on top of Apache Arrow and Cylon. The goal is to cut serialization and coordination costs in the data-heavy parts like embedding and vector operations while leaving LLM inference untouched. That framing is the main new piece: a concrete way to map these workflows onto HPC-style execution plans with resource-deterministic scheduling and async batching. The paper does a solid job laying out the fragmentation problem in existing agent frameworks and showing how zero-copy interoperability could help. If the implementation actually delivers on the data plane, it could be practical for teams running retrieval-augmented or multi-step agent pipelines at scale. The experimental claims are the weak part. The abstract states up to 4.64x pipeline speedup and 2.8x gains in embedding and upsert phases, yet supplies no baselines, workload descriptions, or variance numbers. More critically, agentic flows are driven by non-deterministic LLM outputs that create branches, loops, and tool selections. The operator model may handle static pipelines cleanly, but the abstract gives no indication of how dynamic control flow stays inside the zero-copy plane or whether fallbacks reintroduce the overhead the system claims to remove. That gap makes it hard to judge whether the reported gains would appear in typical use. This work is aimed at engineers and researchers who already deal with distributed LLM deployments and want better data movement primitives. It is worth sending to peer review so the implementation and benchmarks can be examined directly, even though the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AAFLOW, a unified distributed runtime for agentic AI workflows in LLM systems. It models these workflows as an operator abstraction to generate communication-efficient execution plans, leveraging Apache Arrow and Cylon for a zero-copy data plane that enables direct interoperability between preprocessing, embedding, and vector retrieval. Resource-deterministic scheduling and asynchronous batching are used to reduce coordination costs. The abstract claims that this yields up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases while preserving LLM generation throughput, with advantages attributed solely to data flow, batching, and communication efficiency rather than inference acceleration.

Significance. If the performance claims and the operator abstraction hold under scrutiny, AAFLOW could represent a meaningful step toward scalable, reproducible agentic workflows by importing HPC-style execution models into LLM orchestration frameworks. The emphasis on zero-copy data planes and deterministic scheduling addresses real pain points in fragmented data handling. However, the absence of any experimental methodology, baselines, or analysis in the manuscript prevents a positive assessment of significance at present.

major comments (2)

[Abstract] Abstract: The abstract asserts specific quantitative speedups (up to 4.64x pipeline and 2.8x in embedding/upsert phases) but supplies no experimental methodology, baselines, datasets, hardware configuration, error bars, or statistical details, rendering it impossible to evaluate whether the data support the central performance claims.
[Abstract] Abstract: The operator abstraction is asserted to model agentic workflows without loss of flexibility, yet no description is provided of how dynamic, non-deterministic control flow (conditional branching, loops, and tool selection driven by LLM outputs) is expressed as operators or kept inside the zero-copy plane; if fallback mechanisms or additional coordination layers are required, the reported speedups would not generalize beyond static pipelines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript introducing AAFLOW. The feedback highlights important areas for improvement in presenting our experimental claims and clarifying the operator abstraction's handling of dynamic workflows. We will revise the manuscript to incorporate detailed methodology and expanded descriptions, as outlined in our point-by-point responses below.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract asserts specific quantitative speedups (up to 4.64x pipeline and 2.8x in embedding/upsert phases) but supplies no experimental methodology, baselines, datasets, hardware configuration, error bars, or statistical details, rendering it impossible to evaluate whether the data support the central performance claims.

Authors: We agree that the manuscript as submitted does not include sufficient detail on the experimental methodology to support the quantitative claims in the abstract. The reported speedups derive from our evaluations of data flow, batching, and communication efficiency using Apache Arrow and Cylon, but to enable proper scrutiny we will add a dedicated Experiments section in the revision. This section will specify the baselines (e.g., comparisons against LangChain, LlamaIndex, and AutoGen), datasets and workloads used, hardware configurations, number of runs with error bars, and statistical analysis. The performance gains are isolated to the data plane and scheduling components rather than inference acceleration. revision: yes
Referee: [Abstract] Abstract: The operator abstraction is asserted to model agentic workflows without loss of flexibility, yet no description is provided of how dynamic, non-deterministic control flow (conditional branching, loops, and tool selection driven by LLM outputs) is expressed as operators or kept inside the zero-copy plane; if fallback mechanisms or additional coordination layers are required, the reported speedups would not generalize beyond static pipelines.

Authors: The operator model in AAFLOW expresses dynamic control flow through composable control operators (e.g., conditional and loop operators) that receive runtime decisions from LLM outputs and schedule subsequent data operators asynchronously. These control signals are passed within the zero-copy Arrow data plane alongside batched data to avoid serialization. We acknowledge that extremely frequent or complex branching may introduce limited coordination overhead outside the pure data plane. In the revised manuscript we will add a dedicated subsection with pseudocode examples illustrating how non-deterministic flows are modeled and will explicitly discuss the scope of the reported speedups (primarily for data-intensive segments) along with any limitations for highly dynamic cases. revision: partial

Circularity Check

0 steps flagged

No circularity: purely experimental claims with no derivations or self-referential reductions

full rationale

The manuscript introduces AAFLOW as an operator-based runtime for agentic workflows and reports empirical speedups (4.64x pipeline, 2.8x embedding/upsert) from zero-copy data planes, batching, and scheduling. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; the central claims rest on benchmark measurements rather than any derivation that reduces to its own inputs by construction. Self-citations are absent from the load-bearing sections, and the operator abstraction is presented as a design choice whose validity is tested externally via experiments, not presupposed. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract; the central claim relies on the assumption that workflows fit the operator model and that experimental gains are due to data flow improvements.

axioms (1)

domain assumption Agentic workflows integrate retrieval, reasoning, and memory and can be modeled as operator abstractions for execution planning.
The paper states this as the foundation for creating communication-efficient execution plans.

invented entities (1)

AAFLOW no independent evidence
purpose: A unified distributed runtime for agentic AI workflows with zero-copy data plane.
Introduced as a new system in the paper; no external validation mentioned.

pith-pipeline@v0.9.0 · 5485 in / 1304 out tokens · 56177 ms · 2026-05-08T17:21:46.743789+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 31 canonical work pages

[1]

Rethinking testing for llm applications: Characteristics, challenges, and a lightweight interaction protocol,

W. Ma, Y . Yang, Q. Hu, S. Ying, Z. Jin, B. Du, Z. Xing, T. Li, J. Shi, Y . Liuet al., “Rethinking testing for llm applications: Characteristics, challenges, and a lightweight interaction protocol,” 2025. [Online]. Available: https://arxiv.org/abs/2508.20737

work page arXiv 2025
[2]

What is llamaindex ?

E. R. Vanna Winland, “What is llamaindex ?” IBM, Tech. Rep., April
[3]

Available: https://www .ibm.com/think/topics/llamaindex

[Online]. Available: https://www .ibm.com/think/topics/llamaindex
[4]

Optimize vector databases, enhance rag-driven generative ai,

M. B. Cathy Zhang, “Optimize vector databases, enhance rag-driven generative ai,” Intel, Tech. Rep., March 2024. [Online]. Available: https://medium .com/intel-tech/optimize-vector- databases-enhance-rag-driven-generative-ai-90c10416cb9c

2024
[5]

Performance comparison of dask and apache spark on hpc systems for neuroimaging,

M. Dugré, V . Hayot-Sasson, and T. Glatard, “Performance comparison of dask and apache spark on hpc systems for neuroimaging,” p. e7635,
[6]

Available: https://doi.org/10.1002/cpe.7635

[Online]. Available: https://doi.org/10.1002/cpe.7635

work page doi:10.1002/cpe.7635
[7]

Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai.arXiv preprint arXiv:2512.16676, 2025

H. Liang, X. Ma, Z. Liu, Z. H. Wong, Z. Zhao, Z. Meng, R. He, C. Shen, Q. Cai, Z. Hanet al., “Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai,”arXiv preprint arXiv:2512.16676, 2025. [Online]. Available: https://arxiv.org/abs/2512.16676

work page arXiv 2025
[8]

High performance dataframes from parallel processing patterns,

V . Abeykoon, P. Wickramasinghe, S. Kamburugamuve, H. Maithree, C. Widanage, N. Perera, T. A. Kanewala, A. Uyar, G. Gunduz, and G. Fox, “High performance dataframes from parallel processing patterns,” inParallel Processing and Applied Mathematics: 14th International Conference, PPAM 2022, Gdansk, Poland, September 11–14, 2022, Revised Selected Papers, Par...

2022
[9]

On the reproducibility limitations of rag systems,

B. Wang, D. Zhao, N. R. Tallent, and L. Guo, “On the reproducibility limitations of rag systems,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.18869

work page doi:10.48550/arxiv.2509.18869 2025
[10]

Simplify your rag application architecture with llamain- dex + postgresml,

LlamaIndex, “Simplify your rag application architecture with llamain- dex + postgresml,” https://www .llamaindex.ai/blog/simplify-your-rag- application-architecture-with-llamaindex-postgresml, accessed: October 7, 2025

2025
[11]

How to achieve 10x perfor- mance with vector database for llm using lancedb and pyarrow,

R. K. Choudhary, “How to achieve 10x perfor- mance with vector database for llm using lancedb and pyarrow,” https://www .rishabhxchoudhary.com/blog/ How_to_Achieve_10x_Performance_with_Vector_Database_for_LLM_using_LanceDB_and_PyArrow, accessed: October 7, 2025

2025
[12]

Radical-pilot and parsl: Executing heterogeneous workflows on hpc platforms,

A. Alsaadi, L. Ward, A. Merzky, K. Chard, I. Foster, S. Jha, and M. Turilli, “Radical-pilot and parsl: Executing heterogeneous workflows on hpc platforms,” in2022 IEEE/ACM Workshop on Workflows in Support of Large-Scale Science (WORKS). IEEE, 2022, pp. 27–34. [Online]. Available: https://doi.org/10.1109/WORKS56498.2022.00009

work page doi:10.1109/works56498.2022.00009 2022
[13]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates...

2020
[14]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations,
[15]

Available: https://openreview .net/pdf?id=WE_vluYUL-X

[Online]. Available: https://openreview .net/pdf?id=WE_vluYUL-X
[16]

Reflexion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” inProceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023. [Online]. Available: https://dl.acm.org/doi/10.5555/3666122.3666499

work page doi:10.5555/3666122.3666499 2023
[17]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” inProceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associates I...

work page doi:10.5555/3495724.3496517 2020
[18]

Accelerating embarrassingly parallel algorithm on intel mic,

Q. Wang, J. Liu, X. Tang, F. Wang, G. Fu, and Z. Xing, “Accelerating embarrassingly parallel algorithm on intel mic,” in 2014 IEEE International Conference on Progress in Informatics and Computing, 2014, pp. 213–218. [Online]. Available: https: //doi.org/10.1109/PIC.2014.6972327

work page doi:10.1109/pic.2014.6972327 2014
[19]

Ucx: An open source framework for hpc network apis and beyond,

P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y . Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, Y . Shahar, S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar, C. Stunkel, G. Bosilca, and A. Bouteiller, “Ucx: An open source framework for hpc network apis and beyond,” in2015 IEEE 23rd Annual Symposium on High-Perfor...

work page doi:10.1109/hoti.2015.13 2015
[20]

Mpi for python,

L. Dalcin, R. Paz, and M. Storti, “Mpi for python,”Journal of Parallel and Distributed Computing, vol. 65, no. 9, pp. 1108–1115, Sep. 2005. [Online]. Available: https://doi.org/10.1016/j.jpdc.2005.03.010

work page doi:10.1016/j.jpdc.2005.03.010 2005
[21]

High performance data engineering everywhere,

C. Widanage, N. Perera, V . Abeykoon, S. Kamburugamuve, T. A. Kanewala, H. Maithree, P. Wickramasinghe, A. Uyar, G. Gunduz, and G. Fox, “High performance data engineering everywhere,” in2020 IEEE International Conference on Smart Data Services (SMDS). IEEE, 2020, pp. 122–132. [Online]. Available: https: //doi.org/10.1109/SMDS49396.2020.00022

work page doi:10.1109/smds49396.2020.00022 2020
[22]

In-depth analysis on parallel processing patterns for high-performance dataframes,

N. Perera, A. K. Sarker, M. Staylor, G. von Laszewski, K. Shan, S. Kamburugamuve, C. Widanage, V . Abeykoon, T. A. Kanewela, and G. Fox, “In-depth analysis on parallel processing patterns for high-performance dataframes,”Future Generation Computer Systems, vol. 149, pp. 250–264, 2023. [Online]. Available: https: //doi.org/10.1016/j.future.2023.07.007

work page doi:10.1016/j.future.2023.07.007 2023
[23]

Deep rc: A scalable data engineering and deep learning pipeline,

A. K. Sarker, A. Alsaadi, A. J. Halpern, P. Tangella, M. Titov, N. Perera, M. Staylor, G. von Laszewski, S. Jha, and G. Fox, “Deep rc: A scalable data engineering and deep learning pipeline,” inJob Scheduling Strategies for Parallel Processing: 28th International Workshop, JSSPP 2025, Milan, Italy, June 3–4, 2025, Revised Selected Papers. Berlin, Heidelbe...

work page doi:10.1007/978-3-032-10507-3_11 2025
[25]

Available: https://arxiv.org/abs/2512.20795

[Online]. Available: https://arxiv.org/abs/2512.20795

work page arXiv
[26]

Architecture,

S. K. Niranda Perera, “Architecture,” Cylon, Tech. Rep., May 2022. [Online]. Available: https://cylondata.org/docs/arch/

2022
[27]

Radical-cylon: A heterogeneous data pipeline for scientific computing,

A. K. Sarker, A. Alsaadi, N. Perera, M. Staylor, G. von Laszewski, M. Turilli, O. O. Kilic, M. Titov, A. Merzky, S. Jha et al., “Radical-cylon: A heterogeneous data pipeline for scientific computing,” inJob Scheduling Strategies for Parallel Processing. Springer Nature Switzerland, 2024, pp. 84–102. [Online]. Available: https://doi.org/10.1007/978-3-031-74430-3_5

work page doi:10.1007/978-3-031-74430-3_5 2024
[28]

Design and performance characterization of radical-pilot on leadership-class platforms,

A. Merzky, M. Turilli, M. Titov, A. Al-Saadi, and S. Jha, “Design and performance characterization of radical-pilot on leadership-class platforms,”IEEE Transactions on Parallel and amp; Distributed Systems, vol. 33, no. 04, pp. 818–829, apr 2022. [Online]. Available: https://doi.org/10.1109/TPDS.2021.3105994

work page doi:10.1109/tpds.2021.3105994 2022
[29]

Gloo: Collective communications library with various primitives for multi-machine training,

Facebookincubator, “Gloo: Collective communications library with various primitives for multi-machine training,” Facebook, Tech. Rep., March 2023. [Online]. Available: https://github .com/facebookincubator/ gloo"

2023
[30]

Combining serverless and high-performance computing paradigms to support ml data-intensive applications,

M. Staylor, A. K. Sarker, G. von Laszewski, G. Fox, Y . Cheng, and J. Fox, “Combining serverless and high-performance computing paradigms to support ml data-intensive applications,”Frontiers in High Performance Computing, 2026

2026
[31]

Towards XAI in the SOC – A User- Centric Study of Explainable Alerts with SHAP and LIME, in: 2022 IEEE International Conference on Big Data (Big Data), IEEE, Osaka, Japan

K. Shan, N. Perera, D. Lenadora, T. Zhong, A. Kumar Sarker, S. Kamburugamuve, T. Amila Kanewela, C. Widanage, and G. Fox, “Hybrid cloud and hpc approach to high-performance dataframes,” in2022 IEEE International Conference on Big Data (Big Data), 2022, pp. 2728–2736. [Online]. Available: https://doi .org/10.1109/ BigData55660.2022.10020958

work page arXiv 2022
[32]

Supercharging distributed computing environments for high-performance data engineering,

N. Perera, A. K. Sarker, K. Shan, A. Fetea, S. Kamburugamuve, T. A. Kanewala, C. Widanage, M. Staylor, T. Zhong, V . Abeykoon, G. von Laszewski, and G. Fox, “Supercharging distributed computing environments for high-performance data engineering,”Frontiers in High Performance Computing, vol. V olume 2 - 2024, 2024. [Online]. Available: https://doi.org/10.3...

work page doi:10.3389/fhpcp.2024.1384619 2024
[33]

Context rot: How increasing input tokens impacts llm performance,

K. Hong, A. Troynikov, and J. Huber, “Context rot: How increasing input tokens impacts llm performance,” Chroma, Tech. Rep., July 2025. [Online]. Available: https://trychroma.com/research/context-rot

2025
[34]

Generative benchmarking,

K. Hong, A. Troynikov, J. Huber, and M. McGuire, “Generative benchmarking,” Chroma, Tech. Rep., April 2025. [Online]. Available: https://trychroma.com/research/generative-benchmarking

2025
[35]

Billion-scale similarity search with GPUs

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with gpus,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2021. [Online]. Available: https://doi.org/10.1109/TBDATA.2019.2921572

work page doi:10.1109/tbdata.2019.2921572 2021
[36]

Llamaindex: Data framework for connecting large language models to data,

J. Liu, “Llamaindex: Data framework for connecting large language models to data,” LlamaIndex, Tech. Rep., 2023. [Online]. Available: https://github.com/jerryjliu/llama_index

2023
[37]

Evaluating chunking strategies for retrieval,

B. Smith and A. Troynikov, “Evaluating chunking strategies for retrieval,” Chroma, Tech. Rep., July 2024. [Online]. Available: https://trychroma.com/research/evaluating-chunking

2024
[38]

Pinecone: Scalable vector database for machine learning applications,

P. Developers, “Pinecone: Scalable vector database for machine learning applications,” Pinecone Systems Inc., Tech. Rep., May 2023. [Online]. Available: https://www.pinecone.io/

2023
[39]

From rag to agents: Building intelligent systems with memory and tools,

A. Karpathy, “From rag to agents: Building intelligent systems with memory and tools,” Eureka Labs, Tech. Rep., January 2022. [Online]. Available: https://karpathy.ai/agents-rag-memory

2022
[40]

Building custom ai workflows using langchain tools,

E. Davis, “Building custom ai workflows using langchain tools,” ThinkTide Global Research Journal, vol. 5, no. 4, pp. 54–62, 2024. [Online]. Available: https://thinktidejournal .com/index.php/TGRJ/article/ view/53/63

2024
[41]

Langgraph: Stateful multi-agent workflows,

L. Developer, “Langgraph: Stateful multi-agent workflows,” LangChain Inc, Tech. Rep., Jan 2024. [Online]. Available: https://blog.langchain.com/ langgraph-multi-agent-workflows

2024
[42]

Exploration of llm multi- agent application implementation based on langgraph+ crewai,

Z. Duan and J. Wang, “Exploration of llm multi- agent application implementation based on langgraph+ crewai,” arXiv preprint arXiv:2411.18241, 2024. [Online]. Available: https://doi.org/10.32388/R27SW4

work page doi:10.32388/r27sw4 2024
[43]

[Online]

CrewAI Team,Kickoff Crew Asynchronously, 2026, accessed: 2026-02-18. [Online]. Available: https://docs.crewai.com/en/learn/kickoff-async

2026
[44]

Autogen: Enabling next-gen llm applications via multi-agent conversation,

Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. E. Zhu, L. Jiang, X. Zhang, S. Zhang, A. Awadallah, R. W. White, D. Burger, and C. Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation,” inCOLM 2024, August 2024. [Online]. Available: https: //www.microsoft.com/en-us/research/publication/autogen-enabling- next-gen-llm-applications-v...

2024
[45]

Ray: A distributed framework for emerging {AI} applications,

P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordanet al., “Ray: A distributed framework for emerging {AI} applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577. [Online]. Available: https://www.usenix.org/system/files/osdi18-moritz.pdf

2018
[46]

[Online]

LlamaIndex Developers,Parallel Execution Ingestion Pipeline, LlamaIndex, 2024, accessed: 2026-02-13. [Online]. Available: https://developers .llamaindex.ai/python/examples/ingestion/ parallel_execution_ingestion_pipeline/

2024
[47]

Dask: Parallel computation with blocked algorithms and task scheduling,

M. Rocklin, “Dask: Parallel computation with blocked algorithms and task scheduling,” inProceedings of the 14th python in science conference, vol. 130. Citeseer, 2015, p. 136. [Online]. Available: https://proceedings.scipy.org/articles/Majora-7b98e3ed-013.pdf

2015
[48]

Higress-rag: A holistic optimization framework for enterprise retrieval-augmented generation via dual hybrid retrieval, adaptive routing, and crag,

W. Lin, “Higress-rag: A holistic optimization framework for enterprise retrieval-augmented generation via dual hybrid retrieval, adaptive routing, and crag,”arXiv preprint arXiv:2602.23374, 2025. [Online]. Available: https://arxiv.org/abs/2602.23374

work page arXiv 2025
[49]

Higress,

Alibaba Cloud and the Higress Authors, “Higress,” https://github .com/ alibaba/higress, 2026, gitHub repository. Accessed: 2026-01-25

2026
[50]

Apache spark: a unified engine for big data processing,

M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, “Apache spark: a unified engine for big data processing,”Commun. ACM, vol. 59, no. 11, p. 56–65, Oct
[51]

Apache Spark: A unified engine for big data processing,

[Online]. Available: https://doi.org/10.1145/2934664

work page doi:10.1145/2934664
[52]

Apache flink: Stream and batch processing in a single engine,

P. Carbone, A. Katsifodimos, S. Ewen, V . Markl, S. Haridi, and K. Tzoumas, “Apache flink: Stream and batch processing in a single engine,”The Bulletin of the Technical Committee on Data Engineering, vol. 38, no. 4, 2015. [Online]. Available: https://asterios.katsifodimos.com/assets/publications/flink-deb.pdf

2015
[53]

Towards scalable dataframe systems,

M. Petersohn, S. Macke, D. Xin, W. Ma, J. K. Wittenauer, S. Hoyer, R. Marcus, M. Zaharia, and B. Recht, “Towards scalable dataframe systems,”Proceedings of the VLDB Endowment (PVLDB), vol. 13, no. 12, pp. 2033–2046, 2020. [Online]. Available: https://doi.org/10.14778/3407790.3407807

work page doi:10.14778/3407790.3407807 2033
[54]

Incremental perception on real time 3d data,

A. K. Sarker and F. X. Lin, “Incremental perception on real time 3d data,” inProceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications, ser. HotMobile ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 68–73. [Online]. Available: https://doi.org/10.1145/3508396.3512875

work page doi:10.1145/3508396.3512875 2022
[55]

Parsl: Pervasive parallel programming in python,

Y . Babuji, A. Woodard, Z. Li, D. S. Katz, B. Clifford, R. Kumar, L. Lacinski, R. Chard, J. M. Wozniak, I. Foster, M. Wilde, and K. Chard, “Parsl: Pervasive parallel programming in python,” in Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’19. New York, NY , USA: Association for Computing...

work page doi:10.1145/3307681.3325400 2019
[56]

High-level messaging patterns,

ZMQ, “High-level messaging patterns,” Zeromq, Tech. Rep., October
[57]

Available: https://zguide .zeromq.org/docs/chapter2/ #High-Level-Messaging-Patterns"

[Online]. Available: https://zguide .zeromq.org/docs/chapter2/ #High-Level-Messaging-Patterns"
[58]

Pathways: Asynchronous distributed dataflow for ml,

P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Royet al., “Pathways: Asynchronous distributed dataflow for ml,”Proceedings of Machine Learning and Systems, vol. 4, pp. 430–449, 2022. [Online]. Available: https://proceedings .mlsys.org/paper_files/paper/ 2022/file/37385144cac01dff38247ab11c119e3c-Paper.pdf

work page arXiv 2022
[59]

45 Geng Zhang, Xuanlei Zhao, Kai Wang, and Yang You

J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, F. Yang, X. Yi, C. Wuet al., “Oneflow: Redesign the distributed deep learning framework from scratch,”arXiv preprint arXiv:2110.15032, 2021. [Online]. Available: https://arxiv.org/abs/2110.15032

work page arXiv 2021
[60]

Dspy: Compiling declarative language model calls into self-improving pipelines,

O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts, “Dspy: Compiling declarative language model calls into self-improving pipelines,”The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/p...

2024
[61]

and Zhang, Hao and Stoica, Ion , booktitle =

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 611–626. [Online]. Availabl...

work page doi:10.1145/3600006.3613165 2023
[62]

Sglang: efficient execution of structured language model programs,

L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y . Sheng, “Sglang: efficient execution of structured language model programs,” inProceedings of the 38th International Conference on Neural Information Processing Systems, ser. NIPS ’24. Red Hook, NY , USA: Curran Associates Inc., 2024. ...

work page doi:10.5555/3737916.3739916 2024
[63]

Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,

H. Qian, Z. Liu, P. Zhang, K. Mao, D. Lian, Z. Dou, and T. Huang, “Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,” inProceedings of the ACM on Web Conference 2025, ser. WWW ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 2366–2377. [Online]. Available: https://doi.org/10.1145/3696410.3714805

work page doi:10.1145/3696410.3714805 2025
[64]

Tuning llms by rag principles: Towards llm-native memory,

J. Wei, S. Wu, R. Liu, X. Ying, J. Shang, and F. Tao, “Tuning llms by rag principles: Towards llm-native memory,” arXiv preprint arXiv:2503.16071, 2025. [Online]. Available: https: //arxiv.org/abs/2503.16071

work page arXiv 2025
[65]

From RAG to memory: Non-parametric continual learning for large language models,

B. J. Gutiérrez, Y . Shu, W. Qi, S. Zhou, and Y . Su, “From RAG to memory: Non-parametric continual learning for large language models,” inForty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=LWH8yn4HS2

2025
[66]

Cue rag: Dynamic multi-output cue memory under h framework for retrieval-augmented generation,

Y . Fu, D. Liu, B. Zhang, Z. Jiang, H. Mei, and J. Guan, “Cue rag: Dynamic multi-output cue memory under h framework for retrieval-augmented generation,”Neurocomputing, vol. 639, p. 130235,
[67]

Available: https://www .sciencedirect.com/science/article/ pii/S0925231225009075

[Online]. Available: https://www .sciencedirect.com/science/article/ pii/S0925231225009075