Building a privacy-preserving Federated Recommender system for mobile devices

Aasheesh Singh

arxiv: 2605.22924 · v2 · pith:KS6GHWVVnew · submitted 2026-05-21 · 💻 cs.LG · cs.IR

Building a privacy-preserving Federated Recommender system for mobile devices

Aasheesh Singh This is my paper

Pith reviewed 2026-05-25 06:14 UTC · model grok-4.3

classification 💻 cs.LG cs.IR

keywords federated learningrecommender systemsprivacy preservationmobile devicescollaborative filteringon-device inferencetwo-stage pipeline

0 comments

The pith

A two-stage pipeline generates shortlists in the cloud from non-sensitive data then re-ranks them on-device with private context signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a federated recommender that keeps sensitive mobile context data entirely on the device while still producing personalized item rankings. Non-sensitive preference data is used in the cloud for an initial collaborative-filtering shortlist, after which the device applies local signals to reorder the candidates. Only model gradients or updates ever leave the device. The approach is shown to run on MovieLens, activity-recognition data, and a pilot set, and is packaged as a Kotlin Multiplatform library for Android and iOS. The separation directly addresses privacy rules that prohibit central collection of location, sensor, or app-usage context.

Core claim

The central claim is that a two-stage federated recommendation pipeline—cloud-based collaborative filtering on non-sensitive app-context data to produce a shortlist, followed by on-device re-ranking that uses sensitive mobile signals—delivers effective personalization while ensuring the sensitive data never leaves the device and only model updates are transmitted.

What carries the argument

The two-stage federated pipeline that isolates non-sensitive preference data for cloud shortlisting from sensitive context data used only for on-device re-ranking.

If this is right

Personalized mobile content can be served without pooling sensitive context data on servers.
Training continues via model updates alone, satisfying data-minimization requirements.
The same separation pattern can be applied to other on-device personalization tasks.
A single Kotlin Multiplatform library makes the pipeline available on both Android and iOS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design lowers the regulatory surface area for any app that must handle location or sensor streams.
On-device re-ranking may also reduce round-trip latency once the shortlist arrives.
If the shortlist quality is high enough, the on-device stage could be made extremely lightweight.

Load-bearing premise

Re-ranking a cloud-generated shortlist on the device with local sensitive signals yields recommendation quality comparable to a model that has direct access to the full centralized dataset.

What would settle it

A controlled experiment that measures precision or recall on held-out user interactions and shows that the on-device re-ranking stage produces materially lower accuracy than a centralized model trained on the same sensitive signals would falsify the claim of effective personalization.

Figures

Figures reproduced from arXiv: 2605.22924 by Aasheesh Singh.

**Figure 1.1.** Figure 1.1: High level architecture of the company’s product offering which is designed for App owners or publishers. The company’s offering consists of a mobile SDK library which is integrated into the app code to provide federated recommendations. Further, a fully managed cloud server coordinates the model weight updates from edge devices with a differential privacy engine. Services such as dashboards and monitori… view at source ↗

**Figure 1.2.** Figure 1.2: Lerna AI’s Dashboard system, monitoring model performance and active mobile devices contributing to the Federated learning network. Project objectives The objectives of the internship were to improve upon the Logistic Regression model for delivering federated recommendations and implement corresponding algorithms from scratch in low-level Kotlin programming language for end-to-end deployment. The tasks p… view at source ↗

**Figure 2.1.** Figure 2.1: Human Activity Recognition pipeline[1]. 11 [PITH_FULL_IMAGE:figures/full_fig_p021_2_1.png] view at source ↗

**Figure 2.2.** Figure 2.2: Data distribution for different Activity classes in UCI Dataset The dataset captures time-series tri-axial acceleration data i.e. (tAcc-XYZ) from accelerometer, where "t" denotes time and the suffix "XYZ" denotes the tri-axial signal in X, Y and Z directions respectively. Additionally, tri-axial angular velocity data from a gyroscope sensor i.e., (tGyro-XYZ) was also recorded to understand rotation infor… view at source ↗

**Figure 2.3.** Figure 2.3: Dimensionality Reduction techniques to better understand UCI HAR Dataset To understand the separation of various Activity classes, we leveraged various dimensionality reduction methods including Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Pairwise Controlled Manifold Approximation (PaCMAP) on the raw 6 dimensional input signals (3 axis acceleration, 3 gyr… view at source ↗

**Figure 2.4.** Figure 2.4: denotes the class distribution for the above mentioned activities in the reallife HAR dataset [PITH_FULL_IMAGE:figures/full_fig_p029_2_4.png] view at source ↗

**Figure 2.5.** Figure 2.5: Dimensionality reduction techniques to better understand the Real life HAR dataset 21 [PITH_FULL_IMAGE:figures/full_fig_p031_2_5.png] view at source ↗

**Figure 2.6.** Figure 2.6: Confusion matrix plotted on the Test set for the LSTM model. The model is able to easily differentiate between classes obtaining a macro f1-score of 97.66 %. 2.5.2. Part-B: Testing on a Pilot Dataset In this experiment, we wanted to utilize the trained LSTM model for the activity recognition task as well as the hand-crafted feature embeddings to compare their efficacy against directly inputting sensor da… view at source ↗

**Figure 3.1.** Figure 3.1: The proposed two-stage Recommendation pipeline The first stage, known as the centralized stage, takes item metadata along with non-sensitive user data such as user preferences and item interactions, collectively referred to as App-context data, as described in the previous section. We describe the Correlated Cross-Occurrence based collaborative filtering algorithm deployed in our system in detail in the … view at source ↗

**Figure 3.2.** Figure 3.2: System Architecture diagram detailing various components of the Universal Recommendation System. The input to the system is comprised of a)Events json containing user-item interactions such as like/purchase etc. and b) Context json containing user/item properties. A sample query to the system is defined in c) Query json and the output d) Recommendations json are served along with their computed Log-lik… view at source ↗

**Figure 4.1.** Figure 4.1: a) Workflow diagram (source:[16]) of FedAvg training pipeline. b) Training pseudo-code of FedAvg algorithm [30] The model weights across clients are aggregated using a weighted sum, where the weight for each client is defined based on the ratio of training examples for that client to the total 48 [PITH_FULL_IMAGE:figures/full_fig_p058_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: Model architecture of the AutoInt CTR model from Fig.1 [40] The AutoInt model architecture consists of 3 modules consisting of an embedding layer, multi-head self-attention transformer layers, and a final MLP layer that outputs sigmoid probabilities. The embedding layer accepts all type of input features: categorical, numerical and multi-valued categorical features and transforms them into a fixed dimens… view at source ↗

**Figure 4.3.** Figure 4.3: Non-IID Distribution of MovieLens 1M dataset across 10 clients. 56 [PITH_FULL_IMAGE:figures/full_fig_p066_4_3.png] view at source ↗

**Figure 4.4.** Figure 4.4: Test AUC and LogLoss plots across federated aggregation rounds for Ablation experiments As observed in the experiment results, federating only the Embedding layer and keeping other parts local performs better than the default All federated setting. For other experiments, such as Attention and Output layer, the Test log loss diverges after a few rounds of training. Note that both Test metrics: AUC and Lo… view at source ↗

read the original abstract

Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a two-stage federated recsys with on-device re-ranking for privacy but supplies no metrics or comparisons to show the approach delivers usable personalization.

read the letter

The main takeaway is a description of a federated recommender that separates non-sensitive and sensitive data across cloud and device stages, but the lack of any performance data leaves the effectiveness of the on-device part unproven. What the paper does is lay out a pipeline where collaborative filtering runs in the cloud on app context that is not sensitive, producing a shortlist. Then on the device, sensitive mobile signals like activity recognition re-rank those candidates. Model updates or gradients are the only things that move, keeping the raw sensitive data local. They implemented this in a Kotlin Multiplatform library meant for both Android and iOS, and they ran some form of validation on MovieLens, the UCI Human Activity Recognition dataset, and a proprietary pilot set. This separation is a reasonable way to balance personalization with privacy regulations. The library being production-ready is a concrete output that others could build on or adapt. The weakness is in the evidence. No metrics are given for recommendation quality, no baselines are compared, and there is no ablation showing what the re-ranking stage contributes. The central promise is that this setup delivers personalized recommendations without centralizing sensitive data, but without numbers on whether the re-ranking improves over the cloud shortlist, that promise stays untested. If the sensitive signals do not add much, the whole thing reduces to federated CF plus on-device compute with no extra benefit. A reader interested in practical privacy techniques for mobile apps might find the architecture useful as a starting point. Someone expecting quantitative validation or novel methods will not get much here. I would not push this toward peer review. It needs the experimental results and comparisons before it is ready for serious evaluation.

Referee Report

1 major / 0 minor

Summary. The paper proposes a two-stage federated recommender system pipeline for mobile devices. A cloud-based collaborative filtering stage generates a shortlist from non-sensitive user preference data; an on-device stage then re-ranks candidates using sensitive mobile context signals, with only model updates or gradients ever leaving the device. The approach is claimed to have been validated on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and a production-ready Kotlin Multiplatform library is provided.

Significance. If the on-device re-ranking stage can be shown to deliver non-trivial personalization gains while keeping sensitive data local, the pipeline would address a practical tension between personalization and privacy regulations in mobile recommender systems. The release of a deployable cross-platform library would be a concrete engineering contribution.

major comments (1)

[Abstract] Abstract: the manuscript states that validation occurred on MovieLens, UCI HAR, and a proprietary dataset, yet supplies no metrics (e.g., NDCG@K, precision@K), baselines, ablation results comparing the two-stage pipeline against the cloud stage alone, or error analysis. Without these data the central claim that the on-device re-ranking produces effective privacy-preserving personalization remains unsupported.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the manuscript states that validation occurred on MovieLens, UCI HAR, and a proprietary dataset, yet supplies no metrics (e.g., NDCG@K, precision@K), baselines, ablation results comparing the two-stage pipeline against the cloud stage alone, or error analysis. Without these data the central claim that the on-device re-ranking produces effective privacy-preserving personalization remains unsupported.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The manuscript body reports evaluation results across the three datasets, including NDCG@K and precision@K metrics, direct comparisons against the cloud-only baseline, ablation studies isolating the on-device re-ranking contribution, and supporting analysis. To make these data immediately visible and address the concern, we will revise the abstract to summarize the main empirical findings (e.g., relative gains from the on-device stage) while retaining the high-level description. We will also verify that an explicit error analysis subsection appears in the results section. This targeted revision directly supports the central claim without altering the technical contribution. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; architectural description only

full rationale

The manuscript describes a two-stage federated pipeline separating non-sensitive and sensitive data, with cloud CF generating a shortlist and on-device re-ranking. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. Validation is asserted on MovieLens, UCI HAR, and a proprietary dataset without any reported metrics or derivations that could reduce to inputs by construction. Self-citations are absent from the abstract and pipeline description. The contribution is therefore self-contained as an engineering architecture with no load-bearing mathematical steps to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper presents an applied engineering system rather than a theoretical derivation. No free parameters, axioms, or invented entities are identifiable from the abstract.

pith-pipeline@v0.9.0 · 5654 in / 1172 out tokens · 47965 ms · 2026-05-25T06:14:47.214777+00:00 · methodology

Building a privacy-preserving Federated Recommender system for mobile devices

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)