arxiv: 2605.03411 · v1 · submitted 2026-05-05 · 💻 cs.NI

Recognition: unknown

DACP: A Scientific Data Access and Collaboration Protocol

Changfa Lu, Hao Ren, Xiaojie Zhu, Zhaoji Liang, Zhenjing Cheng, Zhihong Shen

Pith reviewed 2026-05-07 13:13 UTC · model grok-4.3

keywords: DACP, Streaming Data Frame, scientific data access, data collaboration, in-situ computation, cross-domain collaboration, network protocol, data interoperability

The pith

DACP uses unified resource identification and a reverse supply mechanism to let scientists discover data, run in-situ computations, and stream results across data centers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scientific computing now generates vast datasets that stay trapped in separate centers, blocking the kind of cross-domain work needed for AI4Science. The paper introduces the Data Access and Collaboration Protocol (DACP), whose core is the Streaming Data Frame (SDF). This model, combined with unified resource identification, columnar stream framing, and a reverse supply mechanism, is meant to let users locate data, compute on it where it lives, and receive results as a stream. A reference server called faird demonstrates the approach. If the mechanisms work as described, they would remove the need to copy entire datasets and open a practical route to scalable, collaborative scientific data systems.

Core claim

DACP defines the Streaming Data Frame (SDF) as its core data model. Through Unified Resource Identification, columnar stream framing, and a reverse supply mechanism, DACP enables data discovery, in-situ computation, and the streaming return of results across scientific data centers, thereby facilitating efficient cross-domain collaboration. The paper also supplies faird, a reference server implementation that shows how the protocol can be realized in practice.
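
As a rough illustration of what Unified Resource Identification could look like from a client's side, the sketch below splits a resource identifier into the data center that holds the data and the path inside it. The `dacp://` scheme, the host name, the port numbers, and the `ResourceId` type are all assumptions made for this example; the summary above does not give the actual identifier syntax.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ResourceId:
    """Hypothetical parsed form of a DACP-style resource identifier."""
    host: str
    port: int
    path: str


def parse_resource_id(uri: str, default_port: int = 9000) -> ResourceId:
    """Split an identifier into host, port, and dataset path.

    The "dacp" scheme and the default port are illustrative assumptions,
    not values taken from the paper or from faird.
    """
    parts = urlparse(uri)
    if parts.scheme != "dacp" or not parts.hostname:
        raise ValueError(f"not a dacp:// identifier: {uri!r}")
    return ResourceId(parts.hostname, parts.port or default_port, parts.path or "/")


rid = parse_resource_id("dacp://data.example.org:9443/climate/readings.csv")
print(rid.host, rid.port, rid.path)
```

The point of the sketch is only that one identifier names both where the data lives and what is being requested, so a client needs no per-center lookup logic.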

What carries the argument

The Streaming Data Frame (SDF) together with Unified Resource Identification, columnar stream framing, and a reverse supply mechanism. SDF serves as the uniform data model that carries framing, discovery, and streaming so that computation can occur at the data location rather than after bulk transfer.
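
To make the idea of columnar stream framing with computation at the data location more tangible, here is a minimal sketch in plain Python. It is not the SDF wire format or the faird API: `ColumnBatch`, `stream_frames`, `compute_in_situ`, and `keep_warm` are hypothetical names, and the in-memory table stands in for a remote holding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class ColumnBatch:
    """One frame of a stream: a few rows stored column by column."""
    columns: Dict[str, List[float]]

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))


def stream_frames(table: Dict[str, List[float]], rows_per_frame: int) -> Iterator[ColumnBatch]:
    """Cut a columnar table into fixed-size frames and yield them lazily."""
    total = len(next(iter(table.values()), []))
    for start in range(0, total, rows_per_frame):
        yield ColumnBatch({name: col[start:start + rows_per_frame] for name, col in table.items()})


def compute_in_situ(frames: Iterable[ColumnBatch],
                    transform: Callable[[ColumnBatch], ColumnBatch]) -> Iterator[ColumnBatch]:
    """Apply a transform frame by frame on the side that holds the data."""
    for frame in frames:
        yield transform(frame)


def keep_warm(frame: ColumnBatch) -> ColumnBatch:
    """Example transform: keep only rows with temp_c above 20."""
    mask = [t > 20.0 for t in frame.columns["temp_c"]]
    return ColumnBatch({name: [v for v, keep in zip(col, mask) if keep]
                        for name, col in frame.columns.items()})


# Hypothetical data held at the remote center.
table = {"station": [1.0, 2.0, 3.0, 4.0, 5.0],
         "temp_c": [18.5, 21.0, 25.5, 19.0, 22.5]}

results = list(compute_in_situ(stream_frames(table, rows_per_frame=2), keep_warm))
streamed_rows = sum(frame.num_rows() for frame in results)
print(streamed_rows, "of", len(table["temp_c"]), "rows returned to the client")
```

Because both functions are generators, no frame is materialized until the consumer asks for it, which is the property that lets results flow back as a stream instead of after a bulk transfer.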

If this is right

Data no longer needs to be copied in full before analysis, lowering transfer costs and latency.
Cross-domain teams can run computations directly on remote holdings and receive only the derived results.
A single protocol layer replaces ad-hoc integrations between centers.
Reference implementations such as faird provide a concrete starting point for building larger infrastructures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to enforce fine-grained access policies at the frame level rather than at the file level.
Similar framing and reverse-supply ideas might apply to real-time sensor networks or distributed simulation outputs.
Adoption would be helped by open-source client libraries that hide the protocol details from end-user tools.
Performance claims could be tested by measuring end-to-end latency for typical AI4Science workflows before and after DACP deployment.
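
One way to make the latency test suggested above concrete is a small harness that records time to first frame, time to last frame, and bytes received for any streaming call. This is a sketch under stated assumptions: `measure_stream` and `fake_stream` are hypothetical, and `fake_stream` is a local stand-in rather than a real DACP client.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float, int]:
    """Return (seconds to first frame, seconds to last frame, total bytes)."""
    start = time.perf_counter()
    first = None
    total = 0
    for chunk in open_stream():
        if first is None:
            first = time.perf_counter() - start
        total += len(chunk)
    end = time.perf_counter() - start
    # An empty stream has no first frame; report the total time instead.
    return (first if first is not None else end, end, total)


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real client call: yields three 1 KiB frames."""
    for _ in range(3):
        yield b"x" * 1024


ttfb, ttlb, nbytes = measure_stream(fake_stream)
print(f"first frame after {ttfb:.6f}s, done after {ttlb:.6f}s, {nbytes} bytes")
```

Running the same harness against a bulk-download baseline and a streaming, in-situ variant of the same workflow would give the before-and-after comparison described in the list.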

Load-bearing premise

The combination of SDF framing, Unified Resource Identification, and reverse supply will actually eliminate data silos and deliver the claimed interoperability gains once implemented, without major performance, security, or adoption obstacles.

What would settle it

Multiple independent scientific data centers deploy compatible DACP servers and successfully locate each other's datasets, execute in-situ queries, and receive streamed results with no custom per-center adapters and without transferring raw data volumes.

Figures

Figures reproduced from arXiv: 2605.03411 by Changfa Lu, Hao Ren, Xiaojie Zhu, Zhaoji Liang, Zhenjing Cheng, Zhihong Shen.

**Figure 1.** The unified logical abstraction and interaction architecture of the Streaming Data Frame (SDF). The figure illustrates the core protocol methods (GET/PUT/COOK), the streaming retrieval mechanism via GET, and the File List Framing strategy. Users can perform coarse-grained filtering on the file list (e.g., Format='csv') or recursively 'drill down' and retrieve specific file contents using the same set of op…

**Figure 2.** The Client-Server interaction mechanism of DACP.

**Figure 3.** A cross-domain collaborative analysis scenario, where the global…

**Figure 4.** Performance evaluation on structured data.

**Figure 5.** Performance evaluation on unstructured data (mixed).
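
The file-list filtering and drill-down mentioned in the Figure 1 caption can be pictured with a short sketch. It assumes the file list is exposed as rows with `name`, `format`, and `size` fields; those field names, the `DATASET` contents, and the functions `list_files`, `select`, and `drill_down` are hypothetical and do not reproduce the GET/PUT/COOK methods themselves.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical in-memory dataset: file name -> text content.
DATASET: Dict[str, str] = {
    "readings.csv": "station,temp_c\n1,18.5\n2,21.0\n",
    "notes.txt": "calibrated 2025",
    "sites.csv": "station,name\n1,north\n",
}


def list_files() -> Iterator[Dict[str, object]]:
    """Expose the file list itself as rows, one row per file."""
    for name, content in DATASET.items():
        yield {"name": name, "format": name.rsplit(".", 1)[-1], "size": len(content)}


def select(rows: Iterable[Dict[str, object]], **equals: object) -> Iterator[Dict[str, object]]:
    """Coarse filter: keep rows whose fields equal the given values."""
    for row in rows:
        if all(row.get(key) == value for key, value in equals.items()):
            yield row


def drill_down(name: str) -> Iterator[List[str]]:
    """Open one listed file and stream its rows as lists of fields."""
    for line in DATASET[name].splitlines():
        yield line.split(",")


csv_names = [row["name"] for row in select(list_files(), format="csv")]
first_rows = list(drill_down(csv_names[0]))
print(csv_names)
print(first_rows)
```

The design point is that the listing and the file contents are both consumed as row streams, so the same filtering code applies at either level.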

Original abstract

Scientific computing is rapidly entering a data-intensive era. However, existing general-purpose network protocol stacks face limitations in eliminating data silos and improving data accessibility and interoperability, making it difficult to effectively meet the demands of emerging paradigms such as AI4Science. To address these challenges, we propose the Data Access and Collaboration Protocol (DACP). DACP defines the Streaming Data Frame (SDF) as its core data model. Through Unified Resource Identification, columnar stream framing, and a reverse supply mechanism, DACP enables data discovery, in-situ computation, and the streaming return of results across scientific data centers, thereby facilitating efficient cross-domain collaboration. Furthermore, this paper introduces faird, a reference server implementation of DACP. This work provides a viable path for building scalable and collaborative scientific data infrastructures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

This protocol proposal for scientific data access introduces DACP and SDF but provides no empirical validation of its advantages over current tools.

The paper's key point is that it defines a new protocol, DACP, built around the Streaming Data Frame (SDF) data model, to improve data access and collaboration in scientific computing. It uses unified resource identification, columnar stream framing, and a reverse supply mechanism to allow data discovery, in-situ computation, and streaming of results across data centers.

The work clearly articulates the shortcomings of existing general-purpose network protocols for data-intensive science and AI4Science. The authors provide a structured description of the protocol and introduce faird as a reference server implementation, which gives a concrete starting point for understanding how DACP might be deployed.

The main weakness is that everything rests on design assertions without supporting evidence. The paper does not include any performance measurements, scalability tests, security analysis, or comparisons against established systems such as HTTP/2, Globus, iRODS, or columnar formats like Parquet over gRPC. As a result, claims about eliminating data silos and facilitating efficient cross-domain collaboration stay speculative.

This paper is for researchers and engineers working on scientific data infrastructures who are looking for protocol-level innovations. It would not be immediately useful for someone needing a ready-to-use solution with proven performance. I recommend sending it for peer review. The problem it addresses is relevant, and the design is detailed enough to warrant expert feedback, even though substantial revisions with empirical results would be needed to strengthen it.

Referee Report

2 major / 0 minor

Summary. The paper proposes the Data Access and Collaboration Protocol (DACP) to overcome limitations of existing network protocols in handling scientific data silos and interoperability needs for AI4Science. It defines the Streaming Data Frame (SDF) as the core data model and claims that Unified Resource Identification, columnar stream framing, and a reverse supply mechanism enable data discovery, in-situ computation, and streaming return of results across data centers. The manuscript also introduces faird as a reference server implementation to support scalable collaborative scientific data infrastructures.

Significance. If the protocol mechanisms deliver the claimed interoperability and efficiency gains upon implementation and adoption, the work could meaningfully advance scientific data infrastructure by reducing silos and supporting cross-domain collaboration in data-intensive fields. The inclusion of a reference implementation provides a concrete foundation that could facilitate community testing and extension.

major comments (2)

[Abstract and §1 (Introduction)] The central claims that DACP 'enables data discovery, in-situ computation, and the streaming return of results' and thereby 'facilitating efficient cross-domain collaboration' rest entirely on architectural assertions without any supporting measurements, proofs, or comparative analysis.
[faird reference server section] The implementation is described but contains no performance benchmarks, scalability tests, security analysis, throughput measurements under realistic workloads, or head-to-head comparisons against HTTP/2, Globus, iRODS, or columnar formats over gRPC, leaving the efficiency and interoperability advantages unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript proposing the Data Access and Collaboration Protocol (DACP). We address the major comments point by point below, clarifying the paper's focus as a protocol design and reference implementation while acknowledging areas for improvement.

Referee: [Abstract and §1 (Introduction)] The central claims that DACP 'enables data discovery, in-situ computation, and the streaming return of results' and thereby 'facilitating efficient cross-domain collaboration' rest entirely on architectural assertions without any supporting measurements, proofs, or comparative analysis.

Authors: The manuscript presents DACP as a protocol proposal, with the abstract and introduction stating the capabilities enabled by its core Streaming Data Frame model and mechanisms (Unified Resource Identification, columnar stream framing, and reverse supply). These are substantiated through the detailed architectural description in the body of the paper rather than through empirical data. We agree that the framing could be clearer regarding the absence of measurements and will revise the abstract and Section 1 to explicitly position the work as a design contribution, noting that quantitative evaluations and comparisons are planned for future work.
revision: partial

Referee: [faird reference server section] The implementation is described but contains no performance benchmarks, scalability tests, security analysis, throughput measurements under realistic workloads, or head-to-head comparisons against HTTP/2, Globus, iRODS, or columnar formats over gRPC, leaving the efficiency and interoperability advantages unverified.

Authors: The faird reference server is presented as an open implementation to demonstrate protocol feasibility and support community adoption, consistent with the referee's summary. The section emphasizes architectural realization of DACP features over performance metrics. We concur that benchmarks and analyses would strengthen the paper and will add a dedicated subsection outlining high-level performance considerations, basic security properties of the design, and explicit plans for future benchmarking and comparisons. Full head-to-head evaluations under realistic workloads, however, exceed the scope of this design-focused manuscript.
revision: partial

Circularity Check

0 steps flagged

No circularity: protocol specification with no derivations or self-referential elements

The paper is a forward-looking protocol specification that defines DACP, Streaming Data Frame (SDF), Unified Resource Identification, columnar stream framing, and reverse supply mechanism. It contains no mathematical derivations, equations, fitted parameters, predictions, or self-citations that could reduce to inputs by construction. Architectural claims about data discovery and collaboration are presented as design features rather than derived results. The work is self-contained as a specification proposal without load-bearing steps that invoke uniqueness theorems or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The proposal rests on the domain assumption that current network stacks are inadequate for scientific data needs and introduces new protocol-level entities without external validation.

axioms (1)

Domain assumption: Existing general-purpose network protocol stacks face limitations in eliminating data silos and improving data accessibility and interoperability.
Basis: Explicitly stated as the motivation in the abstract.

invented entities (2)

Streaming Data Frame (SDF): no independent evidence
purpose: Core data model enabling columnar stream framing and in-situ computation.
Newly defined in the paper as the foundation of DACP.

DACP protocol: no independent evidence
purpose: The overall access and collaboration protocol.
Proposed as the solution in this work.
