FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Anjia Cao; Chang Liu; Penghao Zhou; Qinglei Wang; Wentao Guo; Xinhang Yuan; Xudong Lu; Zexi Huang; Zikai Wang

arxiv: 2605.21832 · v2 · pith:DAOAXU2Onew · submitted 2026-05-20 · 💻 cs.AI

Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo, Qinglei Wang

Pith reviewed 2026-05-22 08:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords livestreaming recommendation · multimodal semantic codes · ID-free ranking · cold-start problem · hierarchical codes · cross-domain encoder · industrial recommender systems

The pith

FLUID replaces ephemeral item IDs with hierarchical multimodal codes from short videos and livestreams in large-scale ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recommender systems normally assign each item a unique ID embedding that gathers signals from many user interactions. Livestream rooms, however, last only tens of minutes, so their IDs stay in a permanent cold-start state and ranking models cannot generalize well. FLUID trains a single multimodal encoder on both short videos and livestreams to produce discrete hierarchical LUCID codes. It then feeds slice-level and room-level codes as separate tokens into a late-fusion, ID-free ranker that is warmed up in stages during online training. When run at production scale on a billion-user platform, the system records measurable lifts in watch time, cold-start views, and active hours.

Core claim

FLUID is the first framework to retire the candidate-side item ID completely from a production livestreaming ranker. It couples a cross-domain multimodal encoder, trained jointly on short videos and livestreams to emit discrete hierarchical LUCID codes, with a late-fusion architecture that treats slice-level and room-level codes as independent tokens and stabilizes training through staged warmup under incremental online updates.

What carries the argument

LUCID codes: discrete hierarchical semantic tokens generated by a cross-domain multimodal encoder jointly trained on short videos and livestreams; these tokens substitute for item ID embeddings inside an ID-free late-fusion ranking model.
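
To make the substitution concrete, here is a minimal sketch of how a hierarchical code could stand in for an item ID lookup. It is an illustration under stated assumptions, not the paper's implementation: the per-level embedding tables, the choice to sum one row per level, and the names `make_level_tables`, `lucid_token`, and `candidate_tokens` are all hypothetical, as are the codebook size and token width. Only the 4-level code depth comes from the Figure 1 caption.

```python
import random

NUM_LEVELS = 4      # the Figure 1 caption describes a 4-level codeword tuple
CODEBOOK_SIZE = 8   # hypothetical; the codebook size is not stated here
EMBED_DIM = 4       # hypothetical token width


def make_level_tables(seed):
    """Build one small random table per code level (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
         for _ in range(CODEBOOK_SIZE)]
        for _ in range(NUM_LEVELS)
    ]


def lucid_token(code, tables):
    """Sum one embedding row per level into a single token vector (assumed pooling)."""
    if len(code) != len(tables):
        raise ValueError("code length must match the number of levels")
    token = [0.0] * EMBED_DIM
    for level, index in enumerate(code):
        row = tables[level][index]
        token = [t + r for t, r in zip(token, row)]
    return token


def candidate_tokens(slice_code, room_code, slice_tables, room_tables):
    """Return two independent candidate-side tokens; no item ID is looked up."""
    return [lucid_token(slice_code, slice_tables),
            lucid_token(room_code, room_tables)]


slice_tables = make_level_tables(seed=0)
room_tables = make_level_tables(seed=1)

# Two different rooms that share the same codes map to identical tokens,
# even if one of them has never accumulated any interactions.
old_room = candidate_tokens((1, 2, 3, 4), (1, 2, 3, 0), slice_tables, room_tables)
new_room = candidate_tokens((1, 2, 3, 4), (1, 2, 3, 0), slice_tables, room_tables)
```

Because the tables are indexed by code rather than by room, parameters are shared across every room that emits the same code, which is the property an ID-free ranker would rely on for brand-new rooms.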

If this is right

The ID-free ranker generalizes to newly created live rooms that have never accumulated interaction data.
Joint training on short videos and livestreams produces codes that transfer semantic information across the two domains.
Staged warmup during online incremental training keeps the model stable after the removal of item ID embeddings.
Production deployment on platforms serving over one billion users yields gains of +0.55% Quality Watch Duration and +2.05% Cold-Start Room Views.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same code-generation approach could be tested on other short-lived content such as stories or temporary events where persistent IDs are unavailable.
Removing item IDs may reduce memory footprint and embedding table size in very large catalogs.
Cross-domain training might improve consistency when users move between short-form video and live content within the same app.

Load-bearing premise

The LUCID codes can capture and replace the collaborative signals that would normally come from user interactions with persistent item IDs.

What would settle it

An online A/B test on new live rooms in which the ID-free LUCID ranker shows equal or lower cold-start room views than the ID-based baseline would falsify the central claim.
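
The falsification condition above comes down to the sign of a relative lift. The sketch below shows that arithmetic only; the function names and the traffic counts are hypothetical, chosen so that the lift matches the reported +2.05% figure, and a real decision would also need a significance test, which is omitted here.

```python
def relative_lift(control, treatment):
    """Relative change of the treatment metric over the control metric."""
    if control <= 0:
        raise ValueError("control metric must be positive")
    return (treatment - control) / control


def falsifies_claim(control_views, treatment_views):
    """True when the ID-free arm shows equal or lower cold-start room views."""
    return relative_lift(control_views, treatment_views) <= 0.0


# Hypothetical counts: ID-based baseline vs. ID-free LUCID ranker.
lift = relative_lift(control=100_000, treatment=102_050)
print(f"{lift:+.2%}")
```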

Figures

Figures reproduced from arXiv: 2605.21832 by Anjia Cao, Chang Liu, Penghao Zhou, Qinglei Wang, Wentao Guo, Xinhang Yuan, Xudong Lu, Zexi Huang, Zikai Wang.

**Figure 1.** Overview of FLUID. Top: a cross-domain multimodal encoder (SigLip2 ViT + Qwen3-Embedding) jointly trained on livestreams and short videos produces a 128-d slice embedding 𝑧, which RQ-KMeans discretizes into a 4-level codeword tuple (LUCID), with room-level LUCID obtained by per-level majority voting over slices in a session. Bottom: slice- and room-level LUCID enter the production ranker as independent candi…

**Figure 2.** Item ID embedding ℓ2 norm vs. live-room age, aggregated over one day of production traffic. The norm fails to converge within the typical ∼45-minute room lifetime.

**Figure 3.** Three-stage warmup procedure for FLUID. Each …

**Figure 4.** Live slices and short videos grouped by LUCID: …

**Figure 5.** Converged learned gate weight on the item ID under …
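
The Figure 1 caption names two discretization steps: residual quantization of a slice embedding into a multi-level code, and per-level majority voting to obtain a room-level code. The following is a rough sketch of both, assuming the codebooks are already trained. It covers only the nearest-centroid assignment, not the k-means fitting; the toy codebooks, the function names, and the tie-breaking rule (first code seen wins, via `Counter.most_common`) are assumptions rather than details from the paper.

```python
from collections import Counter


def nearest(vec, codebook):
    """Index of the centroid closest to vec in squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for i, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def residual_quantize(z, codebooks):
    """Assign one code per level, quantizing the residual left by earlier levels."""
    residual = list(z)
    code = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        code.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(code)


def room_level_code(slice_codes):
    """Per-level majority vote over the slice codes of one session."""
    return tuple(Counter(level).most_common(1)[0][0]
                 for level in zip(*slice_codes))


# Toy 2-level, 2-d codebooks (hypothetical; the caption describes 4 levels, 128-d).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
slice_code = residual_quantize([10.9, 10.1], codebooks)
room_code = room_level_code([(1, 2, 3, 4), (1, 2, 0, 4), (1, 5, 3, 4)])
```

Voting per level rather than over whole tuples means the room-level code need not equal any single slice code; whether that matters in practice is not something the caption addresses.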

Original abstract

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID introduces a cross-domain multimodal encoder, jointly trained on short videos and livestreams, to produce discrete hierarchical semantic codes, called LUCID, for content-based item characterization. To adapt the ranker to LUCID, FLUID further employs a staged warmup scheme: it first incorporates cold, slice-level LUCID as an independent token alongside the ID embedding, and then replaces the ID embedding with warm, room-level LUCID before online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.
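
The staged warmup described in the abstract can be read as a schedule over which candidate-side tokens the ranker sees. The sketch below encodes that reading only: the stage names, the token names, and the assumption that the schedule starts from an ID-only ranker are hypothetical, and how these steps map onto the three stages of Figure 3 is not stated in the text available here.

```python
from enum import Enum


class WarmupStage(Enum):
    ID_ONLY = "id_only"              # assumed starting point: the ID-based ranker
    ID_PLUS_SLICE = "id_plus_slice"  # cold slice-level LUCID added beside the ID
    ID_FREE = "id_free"              # ID replaced by warm room-level LUCID


def candidate_token_names(stage):
    """Candidate-side tokens fed to the ranker at a given warmup stage."""
    if stage is WarmupStage.ID_ONLY:
        return ["item_id"]
    if stage is WarmupStage.ID_PLUS_SLICE:
        return ["item_id", "slice_lucid"]
    if stage is WarmupStage.ID_FREE:
        return ["slice_lucid", "room_lucid"]
    raise ValueError(f"unknown stage: {stage!r}")


schedule = [candidate_token_names(stage) for stage in WarmupStage]
```

In this reading, online incremental training begins only once the final, ID-free token set is in place.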

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FLUID replaces candidate item IDs with cross-domain multimodal discrete codes in a production livestreaming ranker and reports online lifts on cold-start metrics, but the abstract supplies almost no experimental details.

The key point is that this paper presents a production system for livestreaming recommendation that eliminates candidate item IDs by using discrete multimodal semantic codes instead. They jointly train an encoder on short videos and livestreams to generate hierarchical LUCID codes. These get injected as separate tokens for slices and rooms in a late-fusion ranker. A staged warmup supports the online training. The system runs on platforms serving over a billion users and shows online improvements in quality watch duration, cold-start room views, and active hours.

This tackles a real issue: live rooms are too short for ID embeddings to learn much from interactions. The cross-domain training and ID-free design are practical responses to that.

The main limitation is the thin evidence in the abstract. It mentions A/B gains but provides no baselines, ablations, or details on how the codes perform compared to ID-based models. We cannot yet see if the multimodal features truly replace the collaborative signals or if other factors contribute.

This work is for teams running large-scale recommenders with transient content like live streams. Readers dealing with cold-start in similar settings could adapt the architecture. It deserves peer review because of the scale and the concrete problem it solves. I would recommend sending it to referees, asking them to check the experimental rigor and the literature on multimodal codes in recsys.

Referee Report

2 major / 1 minor

Summary. The manuscript presents FLUID, a framework that retires candidate-side item ID embeddings from a production livestreaming ranker. It couples a cross-domain multimodal encoder (jointly trained on short videos and livestreams) that outputs discrete hierarchical codes called LUCID with a late-fusion ID-free architecture that injects slice-level and room-level LUCID tokens as independent features, stabilized by staged warmup under online incremental training. The paper reports online A/B gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours after deployment on a platform with a combined user base exceeding one billion.

Significance. If the substitution of LUCID codes for ID embeddings holds under rigorous validation, the work is significant for industrial recommender systems. It directly addresses the cold-start failure mode that arises when live rooms broadcast for only tens of minutes, offering a scalable, cross-domain semantic alternative to persistent ID-based collaborative filtering. The reported large-scale deployment and cross-platform gains constitute a practical contribution that could influence design choices for other ephemeral-content ranking problems.

major comments (2)

[Abstract] The reported online A/B gains (+0.55% QWD, +2.05% CSRV, +0.05% AH) are stated without any accompanying experimental details: test population size, experiment duration, baseline models, statistical tests, or ablation results that isolate the contribution of the LUCID tokens versus an ID-based counterpart. This information is load-bearing for the central claim that the discrete hierarchical codes fully substitute for collaborative signals previously carried by item IDs.
[Method (LUCID encoder and late-fusion design)] The weakest assumption—that LUCID codes generated from the cross-domain multimodal encoder can capture and replace interaction-derived collaborative signals—is asserted but not supported by any ablation that compares ranking performance with and without candidate-side ID embeddings or that measures how much of the observed lift is attributable to the semantic codes versus other architectural changes.

minor comments (1)

[Abstract] The expansion of the acronym LUCID is not given in the abstract even though the term is introduced as a key contribution.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight important aspects of experimental transparency and validation that we address point by point below. We have prepared revisions to strengthen the manuscript accordingly.

Referee: [Abstract] The reported online A/B gains (+0.55% QWD, +2.05% CSRV, +0.05% AH) are stated without any accompanying experimental details: test population size, experiment duration, baseline models, statistical tests, or ablation results that isolate the contribution of the LUCID tokens versus an ID-based counterpart. This information is load-bearing for the central claim that the discrete hierarchical codes fully substitute for collaborative signals previously carried by item IDs.

Authors: We agree that the abstract would be strengthened by additional context on the online A/B evaluation. In the revised version we will expand the abstract to note the experiment duration (multiple weeks of incremental deployment), the baseline as the prior production ID-based ranker, and that gains were assessed for statistical significance via standard hypothesis testing. Full population sizes and exact p-values remain subject to confidentiality constraints typical of industrial deployments, but we will clarify that the reported lifts reflect live traffic on a platform serving over one billion users. revision: yes
Referee: [Method (LUCID encoder and late-fusion design)] The weakest assumption—that LUCID codes generated from the cross-domain multimodal encoder can capture and replace interaction-derived collaborative signals—is asserted but not supported by any ablation that compares ranking performance with and without candidate-side ID embeddings or that measures how much of the observed lift is attributable to the semantic codes versus other architectural changes.

Authors: The referee correctly notes the absence of explicit offline ablations isolating LUCID from ID embeddings. Our primary evidence is the production deployment itself, where the system operates without candidate-side IDs and still delivers the reported gains, particularly in cold-start scenarios. To address this directly, the revised manuscript will include a new offline ablation subsection using pre-transition logged data, comparing an ID-augmented variant against the ID-free LUCID design on ranking metrics for both warm and cold items. This will quantify the semantic codes' contribution relative to residual architectural factors. revision: yes

standing simulated objections not resolved

Exact test population sizes and proprietary baseline configurations, which cannot be disclosed due to industrial confidentiality policies.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes FLUID as a framework that retires candidate-side item IDs by coupling a cross-domain multimodal encoder producing discrete hierarchical LUCID codes with a late-fusion ID-free architecture using slice- and room-level tokens plus staged warmup. No equations, derivations, or load-bearing steps are presented in the abstract or high-level claims that reduce the substitution of collaborative signals to a self-definition, fitted input renamed as prediction, or self-citation chain. The central premise is framed as an empirical engineering result validated by online A/B lifts on a billion-user platform, remaining self-contained against external benchmarks without internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Reviewed from abstract only; no explicit free parameters, axioms, or invented entities beyond the high-level introduction of LUCID codes are stated in the provided text.

invented entities (1)

LUCID codes · no independent evidence
purpose: Discrete hierarchical multimodal semantic representations intended to replace item ID embeddings
Introduced in the abstract as the core output of the cross-domain encoder; no independent evidence or falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5758 in / 1325 out tokens · 56553 ms · 2026-05-22T08:17:32.336250+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

FLUID couples a cross-domain multimodal encoder... to produce discrete hierarchical codes (LUCID) with a late-fusion, ID-free design that injects slice-level and room-level LUCID as independent tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.