pith. machine review for the scientific record. sign in

arxiv: 2604.22100 · v1 · submitted 2026-04-23 · 💻 cs.DB · cs.IR

Recognition: unknown

Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints

Adriane Chapman, Alexandra Poulovassilis, Faria Ferooz, George Roussos, Helen Oliver, Mohamed Ragab, Mohammad Bahrani, Thanassis Tiropanis

Authors on Pith no claims yet

Pith reviewed 2026-05-08 13:02 UTC · model grok-4.3

classification 💻 cs.DB cs.IR
keywords Solid podsdecentralized searchkeyword searchWebIDprivacy-aware metadatathreat modelvisibility constraintssource selection
0
0 comments X

The pith

ESPRESSO enables scalable keyword search over distributed Solid pods while enforcing user visibility policies through scoped indexes and privacy-aware metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ESPRESSO as a framework that performs keyword searches across data held in many separate Solid pods, each under its owner's control with custom visibility rules. It builds indexes tied to each WebID inside the pod and adds metadata that respects privacy, allowing servers to choose and order results efficiently without pulling all data centrally. A formal threat model examines risks from creating, combining, and using these indexes and metadata, such as leaks or unwanted inferences, and points to design choices that reduce exposure. Readers interested in personal data control would see this as a concrete method to search decentralized stores without handing sovereignty to a single operator.

Core claim

ESPRESSO constructs WebID-scoped indexes within pods and employs privacy-aware metadata to enable efficient source selection and ranking across servers under user-defined visibility policies. It introduces a formal threat model that analyzes security and privacy risks from index generation, aggregation, and use, including unintended metadata leakage and inference of sensitive information about data in personal stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference, providing a foundation for evaluating privacy-preserving decentralized search.

What carries the argument

WebID-scoped indexes paired with privacy-aware metadata inside the ESPRESSO framework, which perform source selection and ranking while respecting granular visibility constraints across pods.

Load-bearing premise

WebID-scoped indexes and privacy-aware metadata can be generated and aggregated without letting adversaries infer sensitive information beyond what the threat model explicitly accounts for.

What would settle it

An adversary extracts specific details about private pod contents using only the generated indexes or metadata in a working ESPRESSO system.

Figures

Figures reproduced from arXiv: 2604.22100 by Adriane Chapman, Alexandra Poulovassilis, Faria Ferooz, George Roussos, Helen Oliver, Mohamed Ragab, Mohammad Bahrani, Thanassis Tiropanis.

Figure 1
Figure 1. Figure 1: Our Motivating Search Scenario respects a search party’s access boundaries: an unauthorized search party’s visibility scope excludes restricted data, so it cannot view it; likewise, an authorized search party’s queries are evaluated only over the data within its visibility scope, and cannot reach beyond it. Motivating Scenario. To make our approach more concrete, view at source ↗
Figure 2
Figure 2. Figure 2: ESPRESSO Framework Architecture medications. Moreover, individuals who have chosen to make their contact details accessible to Alice’s WebID can subsequently be approached directly, enabling Alice to invite them to participate in the clinical trial. 3.1 ESPRESSO Infrastructure and Components The ESPRESSO architecture ( view at source ↗
Figure 3
Figure 3. Figure 3: Metadata maintained for source selection. view at source ↗
Figure 4
Figure 4. Figure 4: Indexing Process in ESPRESSO Indexing App, the owner of the pod, and the target pod. The process begins when the Indexing App initiates an authentication request through the browser interface. The pod owner logs in using their WebID credentials and authorizes the Indexing App to access the pod. After initial authorization, the Indexing App obtains an access token that allows it to perform subsequent indexi… view at source ↗
Figure 5
Figure 5. Figure 5: The process of registering a Pod for Search view at source ↗
Figure 6
Figure 6. Figure 6: Search Process when overlay returns relevant pods view at source ↗
Figure 7
Figure 7. Figure 7: Search Process when overlay returns results view at source ↗
read the original abstract

In decentralized personal data ecosystems grounded in architectures such as Solid, users retain sovereignty over their data via personal online data stores (pods), hosted on Solid-compliant server infrastructures. In such environments, data remains under the control of pod owners, which complicates search due to distribution across numerous pods and user-specific access constraints. ESPRESSO is a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It addresses key challenges of decentralized search by constructing WebID-scoped indexes within pods and employing privacy-aware metadata to enable efficient source selection and ranking across servers. This paper further introduces a formal threat model for ESPRESSO, analysing the security and privacy risks associated with the generation, aggregation, and use of indexes and metadata. These risks include unintended metadata leakage and the potential for adversaries to infer sensitive information about data that resides within personal data stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference. The proposed threat model provides a foundation for evaluating privacy-preserving decentralized search and informs the design of systems with stronger privacy guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents ESPRESSO, a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It describes the construction of WebID-scoped indexes within pods and the use of privacy-aware metadata to support efficient source selection and ranking. The paper also introduces a formal threat model that analyzes security and privacy risks associated with index generation, aggregation, and use, including unintended metadata leakage and inference of sensitive data, while identifying design principles to limit exposure.

Significance. If the proposed index construction, metadata mechanisms, and threat model can be shown to deliver the claimed privacy properties with concrete proofs and evaluations, the work would contribute to privacy-preserving search in decentralized personal data stores such as Solid. It addresses practical challenges of data sovereignty and granular access control in distributed environments, potentially informing future systems with stronger privacy guarantees.

major comments (1)
  1. Abstract: The manuscript states that a formal threat model is introduced which analyzes risks of generation, aggregation, and use of indexes and metadata, and that design principles limit metadata exposure while mitigating unauthorized inference. However, no formal privacy definition (e.g., an indistinguishability game or differential privacy bound), reduction to the index/metadata construction, or proof is referenced that demonstrates the WebID-scoped indexes and privacy-aware metadata satisfy the model. This is load-bearing for the central claim of privacy guarantees.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback highlights an important point regarding the scope and formality of our threat model, which we address below. We will revise the manuscript accordingly to ensure the claims accurately reflect the content.

read point-by-point responses
  1. Referee: [—] Abstract: The manuscript states that a formal threat model is introduced which analyzes risks of generation, aggregation, and use of indexes and metadata, and that design principles limit metadata exposure while mitigating unauthorized inference. However, no formal privacy definition (e.g., an indistinguishability game or differential privacy bound), reduction to the index/metadata construction, or proof is referenced that demonstrates the WebID-scoped indexes and privacy-aware metadata satisfy the model. This is load-bearing for the central claim of privacy guarantees.

    Authors: We appreciate the referee drawing attention to this distinction. The threat model in the paper is a structured analysis that formally defines the adversary (including capabilities for observing indexes and metadata), enumerates specific risks (e.g., unintended leakage during index generation or aggregation, and inference of sensitive data from visibility metadata), and derives design principles (such as WebID-scoping and minimal metadata exposure) to reduce those risks. However, it does not provide a cryptographic-style formal privacy definition (such as an indistinguishability game or differential privacy bound), nor a reduction or proof that the index and metadata constructions satisfy such a definition. We agree that the current abstract wording could be read as implying stronger provable guarantees than are delivered. We will revise the abstract to state that we introduce a threat model that analyzes risks and identifies design principles to limit exposure, without claiming formal proofs or reductions. We will also add a clarifying sentence in the threat model section to explicitly note its analytical scope. This change ensures the central claims are precisely aligned with what the paper provides. revision: yes

Circularity Check

0 steps flagged

No circularity: design framework and threat model are self-contained contributions

full rationale

The paper presents ESPRESSO as a new decentralized search framework that constructs WebID-scoped indexes and privacy-aware metadata, then introduces a formal threat model analyzing risks of index/metadata generation, aggregation, and use. No equations, fitted parameters, predictions, or derivations are described that reduce to prior inputs by construction. The threat model is positioned as an original analysis identifying design principles, with no self-citation chains or ansatzes invoked to justify core claims. The central contributions remain independent of any fitted quantities or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Solid pods allow internal indexing without violating user sovereignty and that a threat model can be constructed to bound metadata leakage; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Users retain sovereignty over data via personal pods hosted on Solid-compliant servers
    Invoked in the opening description of decentralized personal data ecosystems.

pith-pipeline@v0.9.0 · 5516 in / 1274 out tokens · 68907 ms · 2026-05-08T13:02:21.156063+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references

  1. [1]

    Serge Abiteboul, Benjamin André, and Daniel Kaplan. 2015. Managing your digital life.Commun. ACM58, 5 (2015), 32–35

  2. [2]

    Mohammad Bahrani, Mohamed Ragab, Helen Oliver, Thanassis Tiropanis, Adri- ane Chapman, Alexandra Poulovassilis, and George Roussos. 2026. Rethinking information retrieval in a re-decentralised web: Exploring the feasibility and quality of search across personal online datastores.ACM Transactions on the Web20, 1 (2026), 1–24

  3. [3]

    Christian Esposito, Ross Horne, Livio Robaldo, Bart Buelens, and Elfi Goesaert

  4. [4]

    Information14, 7 (2023), 411

    Assessing the solid protocol in relation to security and privacy obligations. Information14, 7 (2023), 411

  5. [5]

    2026.Security Protocols and Threat Models

    Reynaldo Gil-Pons, Ross Horne, Sjouke Mauw, Felix Stutz, and Semen Yurkov. 2026.Security Protocols and Threat Models. Springer

  6. [6]

    Essam Mansour, Andrei Vlad Sambra, Sandro Hawke, Maged Zereba, Sarven Capadisli, Abdurrahman Ghanem, Ashraf Aboulnaga, and Tim Berners-Lee. 2016. A demonstration of the solid platform for social web applications. InProceedings of the 25th international conference companion on world wide web. 223–226

  7. [7]

    Ruben Mayer. [n.d.]. Benefits and Challenges of Decentralization in Data Sys- tems: Opportunities for Data Management Research.Proceedings of the VLDB Endowment. ISSN2150 ([n. d.]), 8097

  8. [8]

    Mohamed R Moawad, Mohamed Mohamed Maher Zenhom Abdelrahman Maher, Ahmed Awad, and Sherif Sakr. 2019. Minaret: a recommendation framework for scientific reviewers. Inthe 22nd International conference on extending database technology (EDBT)

  9. [9]

    Mohamed Ragab, Yury Savateev, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, and George Roussos. 2024. ESPRESSO: a framework to empower search on the decentralized web.Data Science and Engineering9, 4 (2024), 431–448

  10. [10]

    Mohamed Ragab, Yury Savateev, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, and George Roussos. 2024. Unlocking the potential of health data with decentralised search in personal health datastores. InCompanion Proceedings of the ACM Web Conference 2024. 1154–1157

  11. [11]

    Mohamed Ragab, Yury Savateev, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, Ruben Taelman, and George Roussos. 2024. De- centralized search over personal online datastores: architecture and performance evaluation. InInternational conference on web engineering. Springer, 49–64

  12. [12]

    Mohamed Ragab, Yury Savateev, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, Ruben Taelman, and George Roussos. 2024. Decentralized Search over Personal Online Datastores: Architecture and Per- formance Evaluation. InInternational Conference on Web Engineering. Springer, 49–64

  13. [13]

    Andrei Vlad Sambra, Essam Mansour, Sandro Hawke, Maged Zereba, Nicola Greco, Abdurrahman Ghanem, Dmitri Zagidulin, Ashraf Aboulnaga, and Tim Berners-Lee. 2016. Solid: a platform for decentralized social applications based on linked data.MIT CSAIL & Qatar Computing Research Institute, Tech. Rep.2016 (2016)

  14. [14]

    Joachim Van Herwegen and Ruben Verborgh. 2024. The Community Solid Server: Supporting research & development in an evolving ecosystem.Semantic Web15, 6 (2024), 2597–2611

  15. [15]

    Maarten Vandenbrande, Maxime Jakubowski, Pieter Bonte, Bart Buelens, Femke Ongenae, and Jan Van den Bussche. 2023. POD-QUERY: schema mapping and query rewriting for solid pods

  16. [16]

    Ying Zhao and Jinjun Chen. 2022. A survey on differential privacy for unstruc- tured data content.ACM Computing Surveys (CSUR)54, 10s (2022), 1–28. 13