Evaluating Reinforcement Learning Policies in Observational Healthcare Using Robust Off-Policy Estimation and Diagnostic Methods

Authors

  • Ovien Yoga Caesarizky, Department of Informatics, Universitas Muhammadiyah Semarang
  • Thahta Ardhika Prabu Nagara, Department of Informatics, Universitas Muhammadiyah Semarang
  • Laelatul Khikmah, Department of Statistics, Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang
  • Sherly Nur Ekawati, Postgraduate Program of Medical Laboratory Science, Universitas Muhammadiyah Semarang

Keywords

healthcare decision support, sepsis management, importance sampling, doubly robust estimators, observational data analysis, machine learning in healthcare

Abstract

The increasing integration of machine learning (ML), particularly reinforcement learning (RL), into healthcare has generated significant interest in developing data-driven treatment strategies. However, reliable evaluation of RL policies using retrospective clinical data remains a fundamental challenge, given issues such as data sparsity, high variance in off-policy estimates, and potential biases arising from confounding variables. This study proposes a robust methodological framework for evaluating RL algorithms in observational health settings, with a specific focus on sepsis management using the MIMIC-III database. The framework integrates advanced statistical estimators, including weighted doubly robust (WDR) methods, and incorporates empirical diagnostics such as importance weight distribution analyses and effective sample size calculations. We systematically compare the RL-derived optimal policy against clinician, random, and no-action baselines over 50 randomized train-test splits. Quantitative results demonstrate that while the RL policy achieves higher average cumulative reward estimates, these gains come with substantial variance and limited data support, raising important questions about the interpretability and generalizability of such models. By explicitly addressing methodological gaps in prior work, this research offers a transparent, reproducible, and clinically grounded approach to RL policy evaluation. The findings highlight the necessity of combining algorithmic innovation with rigorous evaluation practices and domain expertise to ensure safe and effective translation of RL systems into real-world clinical workflows. This study contributes both methodological advancements and practical recommendations that can inform the future development and validation of machine learning applications in healthcare.
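
A minimal sketch of the off-policy diagnostics described above, assuming per-step action probabilities are available for both the evaluation policy and the behavior (clinician) policy. The function names and toy numbers are hypothetical and this is not the study's implementation; the WDR estimator used in the paper additionally combines such importance weights with a learned value model to reduce variance.

```python
import numpy as np

def trajectory_weight(pi_e_probs, pi_b_probs):
    """Cumulative importance ratio for one trajectory:
    the product over timesteps of pi_e(a_t | s_t) / pi_b(a_t | s_t)."""
    return float(np.prod(np.asarray(pi_e_probs) / np.asarray(pi_b_probs)))

def effective_sample_size(weights):
    """Kish effective sample size, (sum w)^2 / sum(w^2); values far
    below the number of trajectories signal limited data support."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def weighted_is_estimate(weights, returns):
    """Self-normalized (weighted) importance sampling estimate of policy value."""
    w = np.asarray(weights, dtype=float)
    g = np.asarray(returns, dtype=float)
    return np.sum(w * g) / np.sum(w)

# Toy example: four hypothetical ICU trajectories with made-up action
# probabilities and terminal rewards (e.g., +15 survival, -15 mortality).
pi_e = [[0.9, 0.8], [0.1, 0.5], [0.7, 0.6], [0.2, 0.9]]  # evaluation-policy probs per step
pi_b = [[0.5, 0.5], [0.4, 0.5], [0.5, 0.5], [0.6, 0.5]]  # behavior-policy probs per step
returns = [15.0, -15.0, 15.0, 15.0]

weights = [trajectory_weight(e, b) for e, b in zip(pi_e, pi_b)]
print("importance weights:", np.round(weights, 3))
print("effective sample size: %.2f of %d" % (effective_sample_size(weights), len(weights)))
print("weighted IS value estimate: %.2f" % weighted_is_estimate(weights, returns))
```

An effective sample size that is a small fraction of the number of trajectories means a few heavily weighted trajectories dominate the value estimate, which is exactly the limited-data-support concern raised in the abstract.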

Published

2025-04-30