Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank
Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3
The pith
Multi-objective learning-to-rank with vision-language labels and locale boosting reduces US-centric exposure bias in global templates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A multi-objective framework that jointly optimizes behavioral supervision, VLM-derived relevance grades, and locale-aware boosting improves semantic alignment and restores stable local content visibility in non-US locales, whereas either clicks alone or clicks plus VLM labels leave the exposure imbalance intact.
What carries the argument
A locale-aware boosting term counteracts cross-locale exposure imbalance inside the ranking loss, while auxiliary VLM relevance labels supply semantic supervision.
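The abstract does not spell out the loss, so any concrete form is a guess. Below is a minimal sketch of one plausible instantiation, assuming a weighted linear combination of a RankNet-style click term, a regression term onto VLM grades, and a boost that rewards score mass on locale-matched templates; all function names, weights, and tensor shapes are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(scores_pos, scores_neg, pred_grades, vlm_grades,
                         is_local, locale_boost, alpha=1.0, beta=0.5, gamma=0.1):
    """Hypothetical combination of the three objectives named in the abstract.

    scores_pos, scores_neg: ranker scores for clicked vs. unclicked templates
    pred_grades, vlm_grades: model-predicted vs. VLM-assigned relevance grades
    is_local: 1.0 where a template matches the serving locale, else 0.0
    locale_boost: per-example boost weight for the serving locale
    alpha, beta, gamma: illustrative mixing weights
    """
    # Behavioral supervision: pairwise logistic loss on click preferences,
    # i.e. log(1 + exp(-(s_pos - s_neg))).
    behavioral = F.softplus(scores_neg - scores_pos).mean()

    # Semantic supervision: regress predicted grades onto VLM relevance grades.
    semantic = F.mse_loss(pred_grades, vlm_grades)

    # Locale-aware boosting: reward score mass placed on locale-matched templates.
    boost = -(locale_boost * is_local * scores_pos).mean()

    return alpha * behavioral + beta * semantic + gamma * boost
```

Read this as a sketch of the shape of the objective, not its published form: the pith's point is only that the third term is the one doing the localization work.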
If this is right
- Relevance metrics rise in the five evaluated growth locales without sacrificing US performance.
- Local templates receive measurably higher and more stable exposure once exposure is disentangled from semantic signals.
- The same separation of exposure bias from semantic supervision applies to any ranking system whose training data is geographically skewed.
- Pure auxiliary supervision (VLM labels) is insufficient by itself to correct visibility suppression.
Where Pith is reading between the lines
- Similar disentangling layers may be needed in other recommendation domains where one region dominates interaction data.
- Dynamic versions of the boosting term could be driven by ongoing per-locale performance monitoring rather than fixed weights; one possible controller is sketched after this list.
- The approach implies that future LTR pipelines should treat exposure correction as a first-class modeling objective rather than an afterthought.
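On the dynamic-boosting speculation above: one way such a controller could look, assuming each locale has a target exposure share for local templates and the boost weight is nudged multiplicatively toward closing the observed gap. The update rule, clamp range, and rate are all hypothetical.

```python
def update_locale_boosts(boosts, observed_share, target_share, rate=0.05):
    """Hypothetical feedback controller for per-locale boost weights.

    boosts, observed_share, target_share: dicts keyed by locale code.
    A positive gap (local content under-exposed) raises the boost;
    clamping keeps any single update from destabilizing the ranker.
    """
    updated = {}
    for locale, boost in boosts.items():
        gap = target_share[locale] - observed_share[locale]
        step = 1.0 + rate * gap / max(target_share[locale], 1e-6)
        updated[locale] = min(max(boost * step, 0.5), 4.0)
    return updated

# Example: German local templates at 12% exposure against a 20% target
# would see their boost rise by 2% this cycle.
print(update_locale_boosts({"de-DE": 1.0}, {"de-DE": 0.12}, {"de-DE": 0.20}))
```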
Load-bearing premise
Vision-language model relevance labels are accurate and unbiased across locales, and the added boosting term will not degrade overall ranking quality or create new biases.
What would settle it
A controlled ablation that removes only the locale-aware boosting component and measures whether local content visibility falls back to the click-only baseline despite the presence of VLM labels.
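Such an ablation needs a concrete visibility metric to compare against the click-only baseline. A minimal sketch, assuming position-discounted exposure in the style of the fairness-of-exposure literature; the 1/log2(rank+1) discount and top-10 cutoff are conventional choices, not the paper's stated protocol.

```python
import math

def local_exposure_share(rankings, k=10):
    """Position-discounted exposure share of locale-matched templates.

    rankings: iterable of ranked result lists, where each entry is a
    (template_id, is_local) pair and is_local is 1.0 or 0.0.
    Returns local exposure as a fraction of total exposure over the top k.
    """
    local, total = 0.0, 0.0
    for ranking in rankings:
        for rank, (_, is_local) in enumerate(ranking[:k], start=1):
            weight = 1.0 / math.log2(rank + 1)
            total += weight
            local += weight * is_local
    return local / total if total else 0.0
```

If this share under the no-boost ablation collapses to the click-only baseline's value despite the VLM labels, the boosting term, not the semantic supervision, carries localization.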
Original abstract
Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in the US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses cross-locale exposure bias in learning-to-rank models for Adobe Express, where US-dominant behavioral data leads to over-serving US templates in growth locales. It proposes combining click supervision with VLM-derived graded relevance labels as auxiliary signals and a locale-aware boosting component in a multi-objective framework. The central claim is that this disentangles exposure bias from semantic supervision, yielding improved relevance and restored stable localization across five locales.
Significance. If the results hold, the work offers a concrete approach to mitigating locale imbalance in production LTR systems without sacrificing semantic quality. The explicit separation of behavioral, semantic (VLM), and locale-boosting objectives is a useful framing for growth-market ranking problems. No machine-checked proofs or parameter-free derivations are present, but the multi-objective formulation itself is a clear methodological contribution if supported by rigorous experiments.
major comments (2)
- [Abstract] The manuscript asserts that the multi-objective model 'improves relevance while restoring stable localization' across five locales, yet supplies no quantitative metrics, baselines, offline/online evaluation protocols, statistical significance tests, or ablation results. Without these, the central claim that the framework successfully disentangles exposure from semantic supervision cannot be assessed.
- [Abstract] Proposed multi-objective framework (as described in the abstract): The approach treats VLM graded relevance labels as clean auxiliary supervision that can be safely combined with clicks and locale boosting. No inter-locale human correlation, calibration curves, or error analysis for the VLM outputs is provided. If the VLM exhibits systematic locale-specific biases (cultural, linguistic, or training-data skew), the reported restoration of localization cannot be attributed to the proposed disentangling mechanism.
minor comments (1)
- [Abstract] The abstract would be clearer if it named the specific VLM, the five locales, and the precise form of the locale-aware boosting term.
Simulated Author's Rebuttal
Thank you for reviewing our manuscript. We value your comments on strengthening the abstract and validating the VLM supervision. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: The manuscript asserts that the multi-objective model 'improves relevance while restoring stable localization' across five locales, yet supplies no quantitative metrics, baselines, offline/online evaluation protocols, statistical significance tests, or ablation results. Without these, the central claim that the framework successfully disentangles exposure from semantic supervision cannot be assessed.
  Authors: We agree that the abstract, as currently written, is high-level and does not include specific quantitative evidence. The body of the manuscript details the experimental setup with offline and online evaluations, baselines, ablations, and statistical tests across the five locales. To address this, we will revise the abstract to concisely report key quantitative outcomes, such as relative improvements in relevance metrics and localization stability, while directing readers to the full evaluation protocols in the paper. Revision: yes.
- Referee: The approach treats VLM graded relevance labels as clean auxiliary supervision that can be safely combined with clicks and locale boosting. No inter-locale human correlation, calibration curves, or error analysis for the VLM outputs is provided. If the VLM exhibits systematic locale-specific biases (cultural, linguistic, or training-data skew), the reported restoration of localization cannot be attributed to the proposed disentangling mechanism.
  Authors: This is a fair concern. The current version relies on the VLM as a general-purpose semantic signal without dedicated validation for locale biases. Our experiments show that adding the VLM signal improves semantic alignment but requires the locale boosting to restore visibility, supporting the disentangling claim. However, to rigorously rule out VLM biases as a confounding factor, we will include in the revision an analysis of VLM label agreement with human judgments across locales, along with calibration and error analysis. Revision: yes.
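The validation the authors promise could take a simple first form: per-locale agreement between VLM grades and human grades on the same ordinal scale. A sketch using quadratically weighted Cohen's kappa from scikit-learn; a full analysis would also need the promised calibration curves, which this does not cover.

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

def per_locale_agreement(records):
    """records: iterable of (locale, human_grade, vlm_grade) triples,
    with both grades on the same ordinal scale.

    Returns quadratically weighted Cohen's kappa per locale. A locale
    whose kappa falls well below the rest is a candidate for the kind
    of systematic VLM bias the referee is worried about.
    """
    by_locale = defaultdict(lambda: ([], []))
    for locale, human, vlm in records:
        by_locale[locale][0].append(human)
        by_locale[locale][1].append(vlm)
    return {locale: cohen_kappa_score(h, v, weights="quadratic")
            for locale, (h, v) in by_locale.items()}
```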
Circularity Check
No circularity detected; empirical multi-objective framework is self-contained
full rationale
The paper proposes combining click-based behavioral supervision with VLM-graded relevance labels and locale-aware boosting in a multi-objective LTR setup. No equations, derivations, or self-citations are presented that reduce any claimed prediction or result to the inputs by construction. The reported gains across five locales are framed as experimental outcomes from disentangling exposure bias, not tautological redefinitions or fitted parameters renamed as predictions. The central claim rests on external VLM signals and boosting rather than internal self-reference, making the derivation chain independent.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched to: the multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting (abstract; §4.2, §4.4)
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and orbit embedding · relevance unclear · matched to: the RankNet-style pairwise loss (Eq. 2) and the ListNet top-1 loss (Eq. 6)
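For reference, since the matched claims name them: the standard RankNet pairwise loss and ListNet top-1 loss, written from their original formulations (Burges et al. 2005; Cao et al. 2007). The paper's Eq. 2 and Eq. 6 presumably match these up to notation, but that is an assumption.

```latex
% RankNet pairwise logistic loss for a preferred pair i \succ j,
% with ranker scores s_i, s_j and scale parameter \sigma:
\mathcal{L}_{\text{RankNet}}(s_i, s_j) = \log\!\bigl(1 + e^{-\sigma (s_i - s_j)}\bigr)

% ListNet top-1 loss: cross-entropy between the softmax distributions
% induced by relevance labels y and scores s over an n-item list:
\mathcal{L}_{\text{ListNet}}(y, s) = -\sum_{i=1}^{n}
  \frac{e^{y_i}}{\sum_{k=1}^{n} e^{y_k}}
  \,\log \frac{e^{s_i}}{\sum_{k=1}^{n} e^{s_k}}
```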