arxiv: 2605.08637 · v1 · submitted 2026-05-09 · 📊 stat.ME

Spherical Mixture Integration for Latent Embedding Alignment across Multi-Source Feature Spaces

Yuming Zhang, Congyuan Duan, Dong Xia, Doudou Zhou, Tianxi Cai

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

keywords multi-EHR · embedding alignment · von Mises-Fisher mixture · quasi-likelihood · synonym clustering · latent representations · non-asymptotic bounds · clinical concept harmonization

The pith

A spherical mixture model integrates embeddings from multiple EHR sources to align feature spaces and recover synonym clusters with proven error bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SMILE to address the challenges of harmonizing clinical codes across different institutions' EHR systems where raw codes are institution-specific and granular. By treating embeddings as privacy-preserving summaries and using auxiliary relationship pairs for weak supervision, it models synonyms as a mixture of von Mises-Fisher distributions on the sphere. A composite quasi-likelihood estimation procedure is proposed with non-asymptotic error bounds for the latent representations and mixture mean directions, plus consistency for recovering synonym clusters. This approach quantifies the statistical benefits of combining multiple data sources and knowledge graph information. Simulations and a real multi-institutional application show improved performance in alignment and clustering.

Core claim

SMILE models synonymy in multi-source clinical embeddings via a mixture of von Mises-Fisher distributions to produce unified latent representations. A composite quasi-likelihood estimator is developed for the latent embeddings and mixture parameters, for which non-asymptotic error bounds are established, along with consistent recovery of the synonym clusters. The theoretical results demonstrate the gains in statistical efficiency from integrating multiple sources and auxiliary information.

What carries the argument

Mixture of von Mises-Fisher distributions on the sphere for synonym modeling, with composite quasi-likelihood estimation for alignment.
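For readers new to directional statistics, the standard von Mises-Fisher density on the unit sphere in R^r is recalled here as a reminder. The symbols f_r, C_r, μ and κ follow the passage quoted from the paper elsewhere on this page; the mixture form with weights π_k and K components is the generic textbook one and is an assumption about notation, not a quotation from the paper.

```latex
% von Mises-Fisher density on the unit sphere S^{r-1} \subset \mathbb{R}^r
f_r(x;\mu,\kappa) = C_r(\kappa)\,\exp\!\big(\kappa\,\mu^{\top}x\big),
\qquad
C_r(\kappa) = \frac{\kappa^{r/2-1}}{(2\pi)^{r/2}\, I_{r/2-1}(\kappa)},
\qquad \|x\|_2 = \|\mu\|_2 = 1,\ \kappa > 0,

% generic K-component mixture (notation assumed for illustration)
p(x) = \sum_{k=1}^{K} \pi_k\, f_r(x;\mu_k,\kappa),
\qquad \pi_k \ge 0,\ \sum_{k=1}^{K}\pi_k = 1 .
```

Here I_ν denotes the modified Bessel function of the first kind. Larger κ concentrates mass around the mean direction μ, and κ → 0 recovers the uniform distribution on the sphere.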

Load-bearing premise

The embeddings from different sources lie in a shared latent space that can be aligned using the spherical geometry and the sparse auxiliary pairs provide sufficient supervision for the mixture components.

What would settle it

One direct test: run the method on simulated data with known ground-truth latent embeddings, mixture means, and synonym labels, then check whether the observed estimation errors exceed the derived non-asymptotic bounds, or whether cluster recovery accuracy fails to approach the perfect recovery that the consistency result predicts.
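A toy version of such a ground-truth check can be sketched as follows. This is not the paper's simulation design: it assumes dimension r = 3 (where an exact inverse-CDF sampler for the von Mises-Fisher distribution is available in closed form), a single mixture component, and the normalized sample mean as the estimator. The function names `sample_vmf_s2` and `mean_direction_error` are hypothetical. It only illustrates the mechanics of comparing an estimate against a known mean direction as the sample size grows; the paper's actual bounds are not reproduced here.

```python
import numpy as np

def sample_vmf_s2(mu, kappa, n, rng):
    """Draw n exact vMF samples on the unit sphere in R^3."""
    mu = mu / np.linalg.norm(mu)
    u = rng.uniform(size=n)
    # Cosine of the angle to mu via the inverse CDF (closed form only for r = 3).
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    # Uniform unit direction in the plane orthogonal to mu.
    g = rng.normal(size=(n, 3))
    g -= np.outer(g @ mu, mu)
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    tangent_scale = np.sqrt(np.clip(1.0 - w**2, 0.0, None))
    return w[:, None] * mu + tangent_scale[:, None] * g

def mean_direction_error(n, kappa, rng):
    """Euclidean error of the normalized sample mean against the true direction."""
    mu = np.array([0.0, 0.0, 1.0])  # known ground truth
    x = sample_vmf_s2(mu, kappa, n, rng)
    mu_hat = x.sum(axis=0)
    mu_hat /= np.linalg.norm(mu_hat)
    return np.linalg.norm(mu_hat - mu)

rng = np.random.default_rng(0)
# Average error over 50 replications for increasing sample sizes.
errors = {
    n: float(np.mean([mean_direction_error(n, 5.0, rng) for _ in range(50)]))
    for n in (50, 500, 5000)
}
for n, e in errors.items():
    print(n, round(e, 4))
```

In a real check, the printed errors would be compared against the theoretical rate from the paper's theorem rather than only against each other.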

Figures

Figures reproduced from arXiv: 2605.08637 by Congyuan Duan, Dong Xia, Doudou Zhou, Tianxi Cai, Yuming Zhang.

Original abstract

Multi-institutional electronic health record (Multi-EHR) data have emerged as a powerful resource for developing predictive models to support clinical decisions and for generating reliable real-world evidence. By aggregating information from diverse patient populations and institutions, they enhance the robustness and generalizability of models and findings. However, analyzing multi-EHR remains challenging because disparate institutions rarely map all data elements to common ontology, and raw EHR codes are often overly granular and institution-specific, fragmenting representations of the same clinical concept. Hence, integrative analysis must overcome two key hurdles: harmonizing codes with the same clinical meaning (synonymy), and aligning institutional feature spaces. To address these challenges, we propose SMILE, a Spherical Mixture Integration for Latent Embedding alignment across multi-source feature spaces, where embeddings from heterogeneous sources serve as privacy-preserving summaries of clinical concepts and sparse auxiliary relationship pairs provide weak supervision on the latent geometry. Synonymy is modeled via a mixture of von Mises-Fisher distributions, yielding unified representations that consolidate semantically equivalent raw codes. We develop a composite quasi-likelihood estimation procedure and establish non-asymptotic error bounds for latent representations and mixture mean directions, together with consistent recovery of synonym clusters. The theory quantifies statistical gains from integrating multiple sources and auxiliary knowledge graph information. Simulations and a multi-institutional EHR application demonstrate improved alignment and synonym clustering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SMILE offers a spherical mixture model for multi-EHR embedding alignment with theoretical bounds, though the bounds hinge on assumptions about auxiliary pairs that need explicit verification.

The main point is that SMILE uses a mixture of von Mises-Fisher distributions on the sphere to handle synonymy when aligning embeddings from different EHR sources, and it supplies non-asymptotic error bounds for the latent embeddings and the mixture components along with cluster recovery guarantees.

What the paper does well is frame a practical problem in multi-institutional data analysis and propose a method that incorporates auxiliary knowledge graph pairs for supervision. The composite quasi-likelihood estimation seems reasonable for this setup, and the claim of statistical gains from multiple sources is worth exploring. The simulations and application section likely show concrete benefits in alignment quality.

The soft spot lies in the theoretical claims. The non-asymptotic bounds rest on the idea that the sparse auxiliary pairs provide enough weak supervision to resolve alignment ambiguities on the sphere. The paper does not appear to include an explicit lower bound on the number or connectivity of those pairs needed for the rates to hold. If that assumption is violated in real data, the error bounds and consistency results would not apply uniformly. Derivation details are also missing from the high-level description, making it difficult to assess the proof strategy without the full text.

This paper is for people working on statistical methods for healthcare data, particularly those dealing with embedding alignment and mixture models in high-dimensional settings. A reader focused on theoretical guarantees for practical integration tasks would find it relevant. It shows honest engagement with the challenges of EHR data and deserves a serious referee to check the math and assumptions. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes SMILE, a spherical mixture model using von Mises-Fisher distributions to align latent embeddings across heterogeneous multi-source feature spaces (e.g., multi-institutional EHR data). Embeddings serve as privacy-preserving summaries, synonymy is captured via mixture components, and sparse auxiliary knowledge-graph pairs provide weak supervision for geometry. A composite quasi-likelihood estimator is developed, with non-asymptotic error bounds derived for latent representations and mixture mean directions, plus consistency results for synonym-cluster recovery. The theory claims to quantify statistical gains from multi-source integration and auxiliary information; simulations and a real EHR application illustrate improved alignment and clustering.

Significance. If the non-asymptotic bounds and identifiability results hold under the stated conditions, the work offers a principled, privacy-aware framework for harmonizing fragmented clinical codes across institutions, which is a pressing need in real-world evidence generation. The vMF mixture is a natural choice for directional embeddings, and providing explicit non-asymptotic rates plus quantification of multi-source gains strengthens the contribution beyond purely empirical alignment methods. Credit is due for combining theoretical guarantees with empirical validation on both simulated and real multi-EHR data.

major comments (2)

[§4] §4 (Theoretical Analysis), Theorem on non-asymptotic bounds: The error bounds for latent representations and mean directions rest on the auxiliary relationship pairs supplying sufficient weak supervision to resolve spherical rotational invariance and ensure component separation in the vMF mixture. No explicit lower bound on the number, density, or connectivity of these pairs is stated to guarantee the claimed rates uniformly; if the pairs are too sparse, the identifiability step fails and the bounds do not hold, which is load-bearing for the central consistency and gain-quantification claims.
[§3] §3 (Estimation Procedure), composite quasi-likelihood: The procedure integrates multiple sources and auxiliary pairs, but the derivation does not explicitly address how heterogeneous source-specific concentration parameters or mixture weights are jointly optimized without introducing additional bias terms that could offset the claimed statistical gains from integration.
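The first major comment asks for an explicit condition on the number, density, or connectivity of the auxiliary pairs. A diagnostic of that kind could be sketched as below. Everything here is an assumption made for illustration: codes are indexed 0..n-1, each auxiliary pair is an undirected edge, and the pair-count threshold c·K·log K with c = 1 is a placeholder, not a condition stated in the paper. The names `is_connected` and `pair_diagnostics` are hypothetical.

```python
import math

def is_connected(n_nodes, pairs):
    """Union-find check that the undirected graph with edges `pairs` is connected."""
    parent = list(range(n_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    n_parts = n_nodes  # number of connected parts found so far
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            n_parts -= 1
    return n_parts == 1

def pair_diagnostics(n_nodes, pairs, n_clusters, c=1.0):
    """Report connectivity and whether the pair count clears c * K * log(K)."""
    threshold = c * n_clusters * math.log(n_clusters)  # illustrative threshold only
    return {
        "connected": is_connected(n_nodes, pairs),
        "n_pairs": len(pairs),
        "threshold": threshold,
        "enough_pairs": len(pairs) >= threshold,
    }

# Four codes linked in a chain, assumed to form K = 2 synonym clusters.
report = pair_diagnostics(4, [(0, 1), (1, 2), (2, 3)], n_clusters=2)
print(report)
```

A check like this only verifies a necessary-looking structural property of the pair graph; whether it suffices for the stated rates is exactly what the referee asks the authors to prove.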

minor comments (2)

Notation for the von Mises-Fisher concentration parameter κ and mean direction μ should be introduced with a brief reminder of the density formula at first use to aid readers unfamiliar with directional statistics.
In the simulation section, the metrics for 'improved alignment' (e.g., Procrustes distance or cluster purity) need explicit definitions and baseline comparisons to make the reported gains interpretable.
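The two metrics named as examples in the second minor comment have standard definitions, sketched below. Whether the paper uses these exact metrics is not confirmed by this page; the function names `procrustes_distance` and `cluster_purity` are hypothetical, and cluster labels are assumed to be non-negative integers.

```python
import numpy as np

def procrustes_distance(A, B):
    """Frobenius distance between A and B after the best orthogonal alignment of B.

    Solves min over orthogonal O of ||A - B O||_F via the SVD of B^T A.
    """
    U, _, Vt = np.linalg.svd(B.T @ A)
    O = U @ Vt
    return np.linalg.norm(A - B @ O), O

def cluster_purity(true_labels, pred_labels):
    """Share of items that carry the majority true label of their predicted cluster."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    correct = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        correct += np.bincount(members).max()
    return correct / len(true_labels)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm embeddings
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # random orthogonal matrix
B = A @ Q.T                                     # same embeddings, rotated coordinates
dist, O_hat = procrustes_distance(A, B)
purity = cluster_purity([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])
print(dist, purity)
```

The rotated copy B is recovered exactly up to numerical error, which is the sense in which embeddings are comparable only after an orthogonal alignment.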

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review, as well as the positive assessment of the significance of SMILE for multi-EHR alignment. We appreciate the recognition of the vMF mixture approach and the non-asymptotic theory. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [§4] §4 (Theoretical Analysis), Theorem on non-asymptotic bounds: The error bounds for latent representations and mean directions rest on the auxiliary relationship pairs supplying sufficient weak supervision to resolve spherical rotational invariance and ensure component separation in the vMF mixture. No explicit lower bound on the number, density, or connectivity of these pairs is stated to guarantee the claimed rates uniformly; if the pairs are too sparse, the identifiability step fails and the bounds do not hold, which is load-bearing for the central consistency and gain-quantification claims.

Authors: We agree that the current presentation would benefit from an explicit minimal condition on the auxiliary pairs. The manuscript assumes the pairs resolve rotational invariance and ensure separation but does not state a quantitative lower bound (e.g., on the number of pairs per component or graph connectivity). We will revise the theorem in §4 to include such a condition, for instance requiring that the auxiliary knowledge graph is connected and contains at least Ω(K log K) pairs for K components, under which the stated rates hold uniformly. This makes the assumptions transparent while preserving the core results on multi-source gains.
revision: yes

Referee: [§3] §3 (Estimation Procedure), composite quasi-likelihood: The procedure integrates multiple sources and auxiliary pairs, but the derivation does not explicitly address how heterogeneous source-specific concentration parameters or mixture weights are jointly optimized without introducing additional bias terms that could offset the claimed statistical gains from integration.

Authors: The composite quasi-likelihood is the sum of source-specific vMF quasi-log-likelihoods plus the auxiliary-pair term; source-specific concentrations κ_s and weights π_s are treated as separate parameters and updated jointly with the shared embeddings and common mean directions via a block-coordinate EM procedure. Because the heterogeneity is explicitly parameterized and the quasi-likelihood remains consistent for the shared latent geometry, no offsetting bias is introduced beyond the standard quasi-likelihood approximation already accounted for in the §4 bounds. To improve clarity we will add a short remark in §3 describing the alternation steps and confirming that the multi-source efficiency gains are retained. We would welcome further specification if a particular bias mechanism is intended.
revision: partial
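The alternation described in this response can be illustrated with a heavily simplified EM loop. This is a sketch under stated assumptions, not the paper's estimator: the sources are assumed to live in one common coordinate system already (no orthogonal alignment step), the auxiliary-pair term is omitted, the source-specific concentrations κ_s are treated as known and fixed, the mixture weights are shared across sources, and the toy data are normalized Gaussian perturbations rather than exact vMF draws. The names `farthest_point_init`, `fit_shared_directions`, and `noisy_source` are hypothetical.

```python
import numpy as np

def farthest_point_init(X, K):
    """Pick K well-spread rows of X as initial mean directions."""
    idx = [0]
    for _ in range(K - 1):
        sim = (X @ X[idx].T).max(axis=1)  # cosine to the nearest chosen centre
        idx.append(int(sim.argmin()))
    return X[idx].copy()

def fit_shared_directions(sources, K, kappas, n_iter=50):
    """EM updates for K shared mean directions pooled over several sources.

    sources: list of (n_s, r) arrays with unit-norm rows.
    kappas:  one fixed concentration per source (assumed known here).
    """
    X = np.vstack(sources)
    # Per-row concentration, so sharper sources weigh more in the M-step.
    kap = np.concatenate([np.full(len(S), k) for S, k in zip(sources, kappas)])
    mu = farthest_point_init(X, K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: the normaliser C_r(kappa_s) is constant within a row and cancels.
        logits = np.log(pi + 1e-12) + kap[:, None] * (X @ mu.T)
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: kappa-weighted resultant vectors, renormalised to the sphere.
        R = (resp * kap[:, None]).T @ X
        mu = R / np.linalg.norm(R, axis=1, keepdims=True)
        pi = resp.mean(axis=0)
    return mu, pi, resp.argmax(axis=1)

rng = np.random.default_rng(1)
true_mu = np.eye(3)  # three orthogonal ground-truth directions

def noisy_source(sigma, n_per):
    """Toy source: Gaussian perturbation of the true directions, renormalised."""
    Z = np.repeat(true_mu, n_per, axis=0) + sigma * rng.normal(size=(3 * n_per, 3))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

sources = [noisy_source(0.05, 100), noisy_source(0.15, 100)]
mu_hat, pi_hat, labels = fit_shared_directions(sources, K=3, kappas=[400.0, 44.0])
print(np.round(mu_hat, 3))
```

The concentrations passed in roughly match 1/σ² for each toy source, so the noisier source contributes less to each mean-direction update; this is one simple way heterogeneity can be parameterized without being estimated.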

Circularity Check

0 steps flagged

No circularity: derivation rests on external statistical assumptions rather than self-reference.

The abstract and described procedure introduce a new spherical mixture model (von Mises-Fisher) with composite quasi-likelihood estimation and derive non-asymptotic bounds under explicit modeling assumptions on embeddings as privacy-preserving summaries and sparse auxiliary KG pairs as weak supervision. No quoted equations or steps reduce predictions to fitted inputs by construction, invoke self-citations as load-bearing uniqueness theorems, or smuggle ansatzes via prior work. The central claims (error bounds, cluster recovery, multi-source gains) are presented as consequences of standard concentration arguments for mixtures once identifiability is granted by the auxiliary pairs; the sufficiency of those pairs is an assumption, not a tautology. This matches the default expectation for a methods paper whose theory is externally falsifiable via simulation and real-data application.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on modeling clinical concept embeddings as draws from von Mises-Fisher mixtures on the sphere and treating auxiliary relationship pairs as providing geometric supervision; these are domain assumptions rather than derived quantities.

free parameters (1)

mixture weights and concentration parameters of the von Mises-Fisher components
These parameters are estimated via the composite quasi-likelihood and directly determine the recovered mean directions and synonym clusters.

axioms (2)

domain assumption: Embeddings from heterogeneous sources serve as privacy-preserving summaries of clinical concepts.
Invoked in the abstract as the starting point for alignment.
domain assumption: Sparse auxiliary relationship pairs supply weak supervision on the latent geometry.
Stated as the mechanism enabling alignment across sources.

pith-pipeline@v0.9.0 · 5543 in / 1384 out tokens · 57997 ms · 2026-05-12T00:50:06.248938+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: V_i ∼ f_r(x; μ_{z_i}, κ), f_r(x; μ, κ) = C_r(κ) exp(κ μ^T x) ... composite quasi-likelihood ... non-asymptotic error bounds for latent representations and mixture mean directions

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: identifiable only up to ... orthogonal matrix O ∈ O_{r×r}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages

[1]

Statistical guarantees for the

Balakrishnan, Sivaraman and Wainwright, Martin J and Yu, Bin , journal =. Statistical guarantees for the

work page
[2]

Regularized

Loh, Po-Ling and Wainwright, Martin J , journal =. Regularized

work page
[3]

High dimensional

Wang, Zhaoran and Gu, Quanquan and Ning, Yang and Liu, Han , booktitle =. High dimensional

work page
[4]

Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution , volume =

Ma, Cong and Wang, Kaizheng and Chi, Yuejie and Chen, Yuxin , journal =. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution , volume =

work page
[5]

, journal =

Tropp, Joel A. , journal =. User-friendly tail bounds for sums of random matrices , volume =

work page
[6]

Hanson--Wright inequality and sub-Gaussian concentration , volume =

Rudelson, Mark and Vershynin, Roman , journal =. Hanson--Wright inequality and sub-Gaussian concentration , volume =

work page
[7]

High-Dimensional Probability: An Introduction with Applications in Data Science , year =

Vershynin, Roman , publisher =. High-Dimensional Probability: An Introduction with Applications in Data Science , year =

work page
[8]

NCCN guidelines insights: prostate cancer early detection, version 2.2016 , volume =

Carroll, Peter R and Parsons, J Kellogg and Andriole, Gerald and Bahnson, Robert R and Castle, Erik P and Catalona, William J and Dahl, Douglas M and Davis, John W and Epstein, Jonathan I and Etzioni, Ruth B and others , date-added =. NCCN guidelines insights: prostate cancer early detection, version 2.2016 , volume =. Journal of the National Comprehensiv...

work page 2016
[9]

Health care spending in the United States and other high-income countries , volume =

Papanicolas, Irene and Woskie, Liana R and Jha, Ashish K , date-added =. Health care spending in the United States and other high-income countries , volume =. Jama , number =

work page
[10]

Spectral Clustering and the High-Dimensional Stochastic Blockmodel , volume =

Rohe, Karl and Chatterjee, Sourav and Yu, Bin , journal =. Spectral Clustering and the High-Dimensional Stochastic Blockmodel , volume =

work page
[11]

Stochastic blockmodels with a growing number of classes , volume =

Choi, David S and Wolfe, Patrick J and Airoldi, Edoardo M , journal =. Stochastic blockmodels with a growing number of classes , volume =

work page
[12]

, journal =

Day, Oscar and Khoshgoftaar, Taghi M. , journal =. A survey on heterogeneous transfer learning , volume =

work page
[13]

and Cook, Diane J

Feuz, Kyle D. and Cook, Diane J. , journal =. Transfer Learning across Feature-Rich Heterogeneous Feature Spaces via Feature-Space Remapping (FSR) , year =

work page
[14]

arXiv preprint arXiv:2310.08459 , year=

A recent survey of heterogeneous transfer learning , author=. arXiv preprint arXiv:2310.08459 , year=

work page arXiv
[15]

Multisource representation learning for pediatric knowledge extraction from electronic health records , volume =

Li, Mengyan and Li, Xiaoou and Pan, Kevin and Geva, Alon and Yang, Doris and Sweet, Sara Morini and Bonzel, Clara-Lea and Ayakulangara Panickan, Vidul and Xiong, Xin and Mandl, Kenneth and others , date-added =. Multisource representation learning for pediatric knowledge extraction from electronic health records , volume =. NPJ Digital Medicine , number =

work page
[16]

International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium , volume =

Brat, Gabriel A and Weber, Griffin M and Gehlenborg, Nils and Avillach, Paul and Palmer, Nathan P and Chiovato, Luca and Cimino, James and Waitman, Lemuel R and Omenn, Gilbert S and Malovini, Alberto and others , date-added =. International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium , volume =. NPJ Digital Medic...

work page
[17]

Federated and distributed learning applications for electronic health records and structured medical data: a scoping review , volume =

Li, Siqi and Liu, Pinyan and Nascimento, Gustavo G and Wang, Xinru and Leite, Fabio Renato Manzolli and Chakraborty, Bibhas and Hong, Chuan and Ning, Yilin and Xie, Feng and Teo, Zhen Ling and others , date-added =. Federated and distributed learning applications for electronic health records and structured medical data: a scoping review , volume =. Journ...

work page
[18]

Deep representation learning of patient data from Electronic Health Records (EHR): A systematic review , volume =

Si, Yuqi and Du, Jingcheng and Li, Zhao and Jiang, Xiaoqian and Miller, Timothy and Wang, Fei and Zheng, W Jim and Roberts, Kirk , date-added =. Deep representation learning of patient data from Electronic Health Records (EHR): A systematic review , volume =. Journal of Biomedical Informatics , pages =

work page
[19]

Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings , year =

Kades, Klaus , date-added =. Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings , year =

work page
[20]

Multi-site research using electronic health record data: Lessons learned from a case study , volume =

Garcia, Brittany and Hogarth, Michael and Wang, Yu and Zhu, Xi and Tu, Shin-Ping , date-added =. Multi-site research using electronic health record data: Lessons learned from a case study , volume =. Learning Health Systems , number =

work page
[21]

A survey of informatics platforms that enable distributed comparative effectiveness research using multi-institutional heterogenous clinical data , volume =

Sittig, Dean F and Hazlehurst, Brian L and Brown, Jeffrey and Murphy, Shawn and Rosenman, Marc and Tarczy-Hornoch, Peter and Wilcox, Adam B , date-added =. A survey of informatics platforms that enable distributed comparative effectiveness research using multi-institutional heterogenous clinical data , volume =. Medical Care , pages =

work page
[22]

The All of Us Research Program is an opportunity to enhance the diversity of US biomedical research , volume =

Bianchi, Diana W and Brennan, Patricia Flatley and Chiang, Michael F and Criswell, Lindsey A and D'Souza, Rena N and Gibbons, Gary H and Gilman, James K and Gordon, Joshua A and Green, Eric D and Gregurick, Susan and others , date-added =. The All of Us Research Program is an opportunity to enhance the diversity of US biomedical research , volume =. Natur...

work page
[23]

Mobilizing data during a crisis: Building rapid evidence pipelines using multi-institutional real world data , volume =

Marwaha, Jayson S and Downing, Maren and Halamka, John and Abernethy, Amy and Franklin, Joseph B and Anderson, Brian and Kohane, Isaac and Wagholikar, Kavishwar and Brownstein, John and Haendel, Melissa and others , booktitle =. Mobilizing data during a crisis: Building rapid evidence pipelines using multi-institutional real world data , volume =

work page
[24]

Arch: Large-scale knowledge graph via aggregated narrative codified health records analysis , volume =

Gan, Ziming and Zhou, Doudou and Rush, Everett and Panickan, Vidul A and Ho, Yuk-Lam and Ostrouchovm, George and Xu, Zhiwei and Shen, Shuting and Xiong, Xin and Greco, Kimberly F and others , journal =. Arch: Large-scale knowledge graph via aggregated narrative codified health records analysis , volume =

work page
[25]

Code2vec: Embedding and clustering medical diagnosis data , year =

Kartchner, David and Christensen, Tanner and Humpherys, Jeffrey and Wade, Sean , booktitle =. Code2vec: Embedding and clustering medical diagnosis data , year =

work page
[26]

Multi-layer representation learning for medical concepts , year =

Choi, Edward and Bahadori, Mohammad Taha and Searles, Elizabeth and Coffey, Catherine and Thompson, Michael and Bost, James and Tejedor-Sojo, Javier and Sun, Jimeng , booktitle =. Multi-layer representation learning for medical concepts , year =

work page
[27]

McInnes, Bridget T and Pedersen, Ted and Carlis, John , booktitle =

work page
[28]

A latent variable model approach to PMI-based word embeddings , year =

Arora, Sanjeev and Li, Yuanzhi and Liang, Yingyu and Ma, Tengyu and Risteski, Andrej , journal =. A latent variable model approach to PMI-based word embeddings , year =

work page
[29]

Graph alignment with noisy supervision , year =

Pei, Shichao and Yu, Lu and Yu, Guoxian and Zhang, Xiangliang , booktitle =. Graph alignment with noisy supervision , year =

work page
[30]

Exact Recovery of Two-Latent Variable Stochastic Block Model with Side Information , year =

Shahiri, Mohammad and Eskandari, Mahdi , booktitle =. Exact Recovery of Two-Latent Variable Stochastic Block Model with Side Information , year =

work page
[31]

Von mises-fisher clustering models , year =

Gopal, Siddharth and Yang, Yiming , booktitle =. Von mises-fisher clustering models , year =

work page
[32]

Sparse mixture of von Mises-Fisher distribution

Barbaro, Florian and Rossi, Fabrice , booktitle =. Sparse mixture of von Mises-Fisher distribution. , year =

work page
[33]

Multiview incomplete knowledge graph integration with application to cross-institutional ehr data harmonization , volume =

Zhou, Doudou and Gan, Ziming and Shi, Xu and Patwari, Alina and Rush, Everett and Bonzel, Clara-Lea and Panickan, Vidul A and Hong, Chuan and Ho, Yuk-Lam and Cai, Tianrun and others , journal =. Multiview incomplete knowledge graph integration with application to cross-institutional ehr data harmonization , volume =

work page
[34]

Spherical regression under mismatch corruption with application to automated knowledge translation , volume =

Shi, Xu and Li, Xiaoou and Cai, Tianxi , journal =. Spherical regression under mismatch corruption with application to automated knowledge translation , volume =

work page
[35]

Multi-source learning via completion of block-wise overlapping noisy matrices , volume =

Zhou, Doudou and Cai, Tianxi and Lu, Junwei , journal =. Multi-source learning via completion of block-wise overlapping noisy matrices , volume =

work page
[36]

Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data , volume =

Hong, Chuan and Rush, Everett and Liu, Molei and Zhou, Doudou and Sun, Jiehuan and Sonabend, Aaron and Castro, Victor M and Schubert, Petra and Panickan, Vidul A and Cai, Tianrun and others , journal =. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data , volume =

work page
[37]

Maximum likelihood from incomplete data via the EM algorithm , volume =

Dempster, Arthur P and Laird, Nan M and Rubin, Donald B , journal =. Maximum likelihood from incomplete data via the EM algorithm , volume =

work page
[38]

The EM algorithm and extensions , year =

McLachlan, Geoffrey J and Krishnan, Thriyambakam , publisher =. The EM algorithm and extensions , year =

work page
[39]

Self-alignment pretraining for biomedical entity representations , year =

Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel , journal =. Self-alignment pretraining for biomedical entity representations , year =

work page
[40]

CODER: Knowledge-infused cross-lingual medical term embedding for term normalization , volume =

Yuan, Zheng and Zhao, Zhengyun and Sun, Haixia and Li, Jiao and Wang, Fei and Yu, Sheng , journal =. CODER: Knowledge-infused cross-lingual medical term embedding for term normalization , volume =

work page
[41]

Unsupervised hyperalignment for multilingual word embeddings , year =

Alaux, Jean and Grave, Edouard and Cuturi, Marco and Joulin, Armand , journal =. Unsupervised hyperalignment for multilingual word embeddings , year =

work page
[42]

Minimax rates in permutation estimation for feature matching , volume =

Collier, Olivier and Dalalyan, Arnak S , journal =. Minimax rates in permutation estimation for feature matching , volume =

work page
[43]

Correlation alignment for unsupervised domain adaptation , year =

Sun, Baochen and Feng, Jiashi and Saenko, Kate , journal =. Correlation alignment for unsupervised domain adaptation , year =

work page
[44]

Unsupervised alignment of embeddings with wasserstein procrustes , year =

Grave, Edouard and Joulin, Armand and Berthet, Quentin , booktitle =. Unsupervised alignment of embeddings with wasserstein procrustes , year =

work page
[45]

Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein , year =

Han, Yanjun and Rigollet, Philippe and Stepaniants, George , journal =. Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein , year =

work page
[46]

Correlated topic models , volume =

Blei, David and Lafferty, John , journal =. Correlated topic models , volume =

work page
[47]

Strong recovery of geometric planted matchings , year =

Kunisky, Dmitriy and Niles-Weed, Jonathan , booktitle =. Strong recovery of geometric planted matchings , year =

work page
[48]

Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses , volume =

Loh, Po-Ling and Wainwright, Martin J , journal =. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses , volume =

work page
[49]

The multivariate Poisson-log normal distribution , volume =

Aitchison, John and Ho, CH , journal =. The multivariate Poisson-log normal distribution , volume =

work page
[50]

Variational inference for probabilistic Poisson PCA , volume =

Chiquet, Julien and Mariadassou, Mahendra and Robin, St. Variational inference for probabilistic Poisson PCA , volume =. The Annals of Applied Statistics , number =

work page
[51]

Variational inference for sparse network reconstruction from count data , year =

Chiquet, Julien and Robin, Stephane and Mariadassou, Mahendra , booktitle =. Variational inference for sparse network reconstruction from count data , year =

work page
[52]

An iterative clustering algorithm for the Contextual Stochastic Block Model with optimality guarantees , url =

Guillaume Braun and Hemant Tyagi and Christophe Biernacki , booktitle =. An iterative clustering algorithm for the Contextual Stochastic Block Model with optimality guarantees , url =. 2022 , Bdsk-Url-1 =

work page 2022
[53]

Joint and individual variation explained (JIVE) for integrated analysis of multiple data types , volume =

Lock, Eric F and Hoadley, Katherine A and Marron, James Stephen and Nobel, Andrew B , journal =. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types , volume =

work page
[54]

Angle-based joint and individual variation explained , volume =

Feng, Qing and Jiang, Meilei and Hannig, Jan and Marron, JS , journal =. Angle-based joint and individual variation explained , volume =

work page
[55]

Group component analysis for multiblock data: Common and individual feature extraction , volume =

Zhou, Guoxu and Cichocki, Andrzej and Zhang, Yu and Mandic, Danilo P , journal =. Group component analysis for multiblock data: Common and individual feature extraction , volume =

work page
[56]

A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data , volume =

Yang, Zi and Michailidis, George , journal =. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data , volume =

work page
[57]

Gaynanova, Irina and Li, Gen. Structural learning and integrative decomposition of multi-view data.
[58]

Park, Jun Young and Lock, Eric F. Integrative factorization of bidimensionally linked matrices.
[59]

Lock, Eric F and Park, Jun Young and Hoadley, Katherine A. Bidimensional linked matrix factorization for pan-omics pan-cancer analysis.
[60]

Yi, Sangyoon and Wong, Raymond Ka Wai and Gaynanova, Irina. Hierarchical nuclear norm penalization for multi-view data integration.
[61]

Hu, Yaofang and Wang, Wanjie. Network-adjusted covariates for community detection.
[62]

Br. International statistical classification of diseases and related health problems. World Health Statistics Quarterly. Rapport Trimestriel de Statistiques Sanitaires Mondiales.
[63]

McDonald, Clement J and Huff, Stanley M and Suico, Jeffrey G and Hill, Gilbert and Leavelle, Dennis and Aller, Raymond and Forrey, Arden and Mercer, Kathy and DeMoor, Georges and Hook, John and others. LOINC, a universal standard for identifying laboratory observations: a 5-year update.
[64]

Liu, Simon and Ma, Wei and Moore, Robin and Ganesan, Vikraman and Nelson, Stuart. RxNorm: prescription for electronic drug information exchange.
[65]

Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.
[66]

Lyu, Zhongyuan and Gu, Yuqi. Spectral Clustering with Likelihood Refinement is Optimal for Latent Class Recovery.
[67]

Argiento, Raffaele and Filippi-Mazzola, Edoardo and Paci, Lucia. Model-based clustering of categorical data based on the Hamming distance.
[68]

Goodman, Leo A. Exploratory latent structure analysis using both identifiable and unidentifiable models.
[69]

Celeux, Gilles and Govaert, G. Latent class models for categorical data. The Handbook of Cluster Analysis.
[70]

Stephenson, Briana JK and Herring, Amy H and Olshan, Andrew. Robust clustering with subpopulation-specific deviations.
[71]

Tsybakov, Alexander B. Optimal aggregation of classifiers in statistical learning.
[72]

Fromont, Magalie and Tuleau, Christine. Functional classification with margin conditions.
[73]

Fellegi, Ivan P and Sunter, Alan B. A theory for record linkage.
[74]

Sadinle, Mauricio. Bayesian estimation of bipartite matchings for record linkage.
[75]

Wagstaff, Kiri and Cardie, Claire and Rogers, Seth and Schr. Constrained k-means clustering with background knowledge. ICML.
[76]

Basu, Sugato and Davidson, Ian and Wagstaff, Kiri. Constrained clustering: Advances in algorithms, theory, and applications.
[77]

Nahler, Gerhard. Anatomical therapeutic chemical classification system (ATC).
[78]

Zhou, Doudou and Tong, Han and Wang, Linshanshan and others. Representation learning to advance multi-institutional studies with electronic health record data from. 2026.
[79]

Holland, Paul W and Laskey, Kathryn Blackmond and Leinhardt, Samuel. Stochastic blockmodels: First steps.
[80]

Lei, Jing and Rinaldo, Alessandro. Consistency of spectral clustering in stochastic block models.