From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction

Bencheng Yan; Bo Zheng; Chuan Yu; Di Wang; Jian Xu; Kaiyi Lin; Pengjie Wang; Yuejie Lei; Zheye Deng; Zhiyuan Zeng

arxiv: 2511.12081 · v2 · pith:XEKBGWGTnew · submitted 2025-11-15 · 💻 cs.IR · cs.LG

Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Zheye Deng, Di Wang, Kaiyi Lin, Pengjie Wang, Chuan Yu, Jian Xu, Bo Zheng

classification 💻 cs.IR cs.LG

keywords scaling · complexity · expressivity · structured · data · fields

Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, a stark contrast to the predictable scaling laws seen in large language models (LLMs). We identify the root cause as a fundamental structural misalignment: standard Transformers assume sequential compositionality, whereas CTR data demand combinatorial reasoning over heterogeneous fields. To restore alignment, we introduce the Field-Aware Transformer (FAT). By reconstructing the standard Transformer block with field-centric parameters, FAT achieves structured expressivity, fundamentally shifting the model complexity dependence from the total vocabulary size $n$ to the number of fields $F$ ($n \gg F$). Crucially, to decouple model capacity from field cardinality, FAT employs a Basis-Composed Hypernetwork to synthesize field-specific parameters from shared bases, further reducing parameter complexity. Theoretically, we ground this scaling behavior through a formal scaling law based on Rademacher complexity. Empirically, FAT outperforms existing state-of-the-art methods with up to +4.38% AUC improvement, and delivers +2.33% CTR and +0.66% RPM in live production. Our work establishes that scalable recommendation arises not from size alone, but from structured expressivity: architectural coherence with data semantics.
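The abstract does not spell out how the Basis-Composed Hypernetwork combines its shared bases, so the following is only an illustrative sketch under a loud assumption: each field's weight matrix is a linear combination of K shared basis matrices, W_f = sum_k a[f][k] * B_k. The function names (`make_bases`, `compose_field_weight`, `param_counts`) and all sizes are hypothetical and are not taken from the paper. The sketch is meant to show why such a scheme could reduce parameter count relative to giving every field its own independent matrix.

```python
import random


def make_bases(num_bases, d_in, d_out, seed=0):
    """Create K shared basis matrices of shape (d_in, d_out) as nested lists.

    Hypothetical helper: the initialization scale is arbitrary.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
        for _ in range(num_bases)
    ]


def compose_field_weight(bases, coeffs):
    """Assumed composition rule: W_f = sum_k coeffs[k] * bases[k].

    `coeffs` holds one mixing coefficient per basis for a single field.
    """
    d_in, d_out = len(bases[0]), len(bases[0][0])
    return [
        [sum(c * b[i][j] for c, b in zip(coeffs, bases)) for j in range(d_out)]
        for i in range(d_in)
    ]


def param_counts(num_fields, num_bases, d_in, d_out):
    """Compare parameter counts: independent per-field matrices vs. shared bases.

    independent: one (d_in x d_out) matrix per field.
    composed:    K shared bases plus K mixing coefficients per field.
    """
    independent = num_fields * d_in * d_out
    composed = num_bases * d_in * d_out + num_fields * num_bases
    return independent, composed


if __name__ == "__main__":
    bases = make_bases(num_bases=4, d_in=16, d_out=16)
    w_field = compose_field_weight(bases, [0.5, 0.25, 0.25, 0.0])
    print(len(w_field), len(w_field[0]))
    print(param_counts(num_fields=100, num_bases=4, d_in=16, d_out=16))
```

Under this assumed rule, the per-field cost grows with the number of bases K rather than with the full matrix size, which is one plausible reading of "decoupling model capacity from field cardinality"; the paper's actual hypernetwork may differ.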

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeRes: Decoupling Residual Stability and Adaptivity for Scalable CTR Prediction
cs.IR 2026-06 unverdicted novelty 6.0

DeRes decouples residual stability and adaptivity via identity and block-attention paths with SiLU pointwise attention, delivering up to 0.32% AUC gains and steeper scaling laws on industrial and public CTR datasets.