Leveraging Fine-Tuned ClinicalBERT and Active Learning to Detect Cognitive Impairment from Unstructured EHR Notes

Authors

  • Bhanu Lintang Wibowo, Department of Informatics, Universitas Muhammadiyah Semarang, Indonesia
  • Fredy Usman Tri Nugroho, Diploma in Medical Laboratory Technology, Politeknik Yakpermas Banyumas, Indonesia

Keywords:

ClinicalBERT, active learning framework

Abstract

Dementia is a progressive neurodegenerative condition that impairs cognitive function and affects over 50 million people worldwide, yet it remains substantially underdiagnosed in clinical practice. Underdiagnosis is compounded by the frequent absence of structured documentation, such as International Classification of Diseases (ICD) codes or medication records, in electronic health records (EHRs). To address this gap, this study proposes a transformer-based natural language processing (NLP) framework for detecting cognitive impairment (CI) directly from unstructured clinician notes. Specifically, we fine-tune ClinicalBERT, a pretrained language model adapted to clinical text, on an annotated EHR dataset of over 279,000 dementia-related sequences, of which 8,656 are expert-labeled. We compare the fine-tuned model against a logistic regression baseline that uses term frequency–inverse document frequency (TF-IDF) features. ClinicalBERT outperforms the baseline, achieving an AUC of 0.98 and an accuracy of 0.93, compared with 0.95 and 0.84, respectively, for the baseline. The model also identifies patients with cognitive impairment who have no dementia-specific ICD codes or medications on record, which directly addresses the problem of underdocumentation. In addition, we introduce an active learning framework that guides further annotation by prioritizing uncertain or novel cases, with the aim of improving model performance using fewer additional labels. Overall, this work offers a scalable, automated approach to detecting cognitive impairment from unstructured clinical narratives, with implications for clinical decision support, patient management, and the construction of dementia research cohorts.
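As an illustration of how annotation might be prioritized toward uncertain cases, the following is a minimal sketch of entropy-based uncertainty sampling. It is not the authors' implementation: the function names, the sequence identifiers, and the probability values are hypothetical, and the selection of novel cases mentioned in the abstract is not covered.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_for_annotation(probabilities: dict[str, float], budget: int) -> list[str]:
    """Return the ids of the `budget` unlabeled sequences with the highest entropy."""
    ranked = sorted(
        probabilities,
        key=lambda seq_id: binary_entropy(probabilities[seq_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical classifier outputs: predicted probability of cognitive impairment
# for four unlabeled note sequences (made-up values for illustration only).
unlabeled_scores = {"seq_a": 0.97, "seq_b": 0.52, "seq_c": 0.08, "seq_d": 0.61}

# With a budget of two labels, the sequences closest to 0.5 are chosen first.
picked = select_for_annotation(unlabeled_scores, budget=2)
print(picked)
```

In a loop of this kind, the selected sequences would be labeled by an expert, added to the training set, and the classifier retrained before the next selection round.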


Published

2025-04-30