Demographic-Adjusted Unsupervised Learning on Electronic Health Records: A Poisson Dirichlet Model for Latent Disease Clustering and Patient Risk Stratification
Keywords:
Poisson Dirichlet Model, Latent Dirichlet Allocation, Excess Risk Modeling, Patient Stratification, Epidemiological Analysis, Precision Public Health

Abstract
Electronic Health Records (EHRs) have emerged as a transformative resource for advancing healthcare analytics by enabling large-scale, data-driven discovery of patient patterns and comorbidity structures. However, unsupervised machine learning approaches such as Latent Dirichlet Allocation (LDA), though widely used to uncover latent disease clusters, suffer from key limitations: they are sensitive to demographic confounding and model only raw co-occurrence frequencies, which limits epidemiological interpretability. This research addresses these gaps by proposing the Poisson Dirichlet Model (PDM), a novel probabilistic framework that integrates demographic-adjusted expected counts and models diagnosis frequencies with a Poisson likelihood. The goal is to identify clinically meaningful latent disease clusters and to stratify patients into risk-based subgroups while overcoming the demographic biases inherent in prior models. We evaluated PDM against LDA on three real-world cohorts (osteoporosis, delirium/dementia, and COPD/bronchiectasis) drawn from the Rochester Epidemiology Project, using survival analysis, comorbidity profiling, and qualitative cluster visualizations. Results demonstrate that although LDA achieves stronger statistical separation, PDM reveals more epidemiologically relevant excess-risk patterns, providing nuanced insights into latent disease mechanisms beyond age and sex effects. Notably, PDM offers the interpretability and transparency often lacking in deep learning and network-based approaches, positioning it as a valuable tool for precision public health and data-driven patient stratification. We conclude that integrating demographic-adjusted expected counts within probabilistic topic models yields substantial methodological and clinical advantages, and we recommend future research to extend this framework to scalable, multimodal, and longitudinal healthcare data analysis.
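For concreteness, one way to write down the modeling idea sketched above is as a Poisson likelihood with a demographic-adjusted offset. The notation that follows (patient p, diagnosis code d, latent cluster k, observed count n_{pd}, expected count E_{pd}) is illustrative and assumed for exposition rather than taken from the paper's formal model specification:

\[
\theta_p \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
n_{pd} \sim \mathrm{Poisson}\!\Big( E_{pd} \sum_{k=1}^{K} \theta_{pk}\,\phi_{kd} \Big),
\]

where E_{pd} denotes the expected count of diagnosis d for patient p given age and sex (for example, derived from cohort-wide reference rates), \theta_p is the patient's mixture over latent clusters, and \phi_k is each cluster's diagnosis profile. Under this assumed form, the latent clusters capture excess risk relative to demographic expectation rather than raw co-occurrence, which is the property contrasted with LDA in the abstract.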