Cost-Sensitive Fraud Detection with Reliability Calibration: A Practical Pipeline with XGBoost and Focal-Proxy Reweighting
Abstract
Fraud detection on payment transactions is an extremely imbalanced, high-stakes classification task in which deployment decisions depend not only on ranking quality but also on reliable probability estimates. We study credit card fraud detection on a standard real-transaction benchmark (284,807 transactions; 492 frauds) and target two deployment requirements: cost-sensitive thresholding under asymmetric error costs, and reliability calibration so that model outputs can be interpreted as stable risk scores. We benchmark Logistic Regression and XGBoost, and propose a focal-proxy reweighting scheme for boosted trees based on iterative weight updates inspired by focal loss. Probabilities are calibrated on the validation set using Platt scaling, temperature scaling, and isotonic-style monotone calibration; the best calibrator is selected by minimum validation Brier score. For decision making, we choose the operating threshold that minimizes the expected cost Cost(t) = 10 · FN(t) + 1 · FP(t) on validation, then evaluate on a held-out test set. On the benchmark split (train 199,364; validation 42,721; test 42,722), the calibrated XGBoost baseline achieves AUROC 0.973, AUPRC 0.812, fraud-class F1 0.767, and expected cost 154 with very low calibration error (ECE = 1.1 × 10⁻⁴). Overall, calibration reduces ECE and improves or maintains Brier score, while cost-aware thresholding makes the FN/FP trade-off explicit via decision curves. The focal-proxy variant does not improve expected cost in this


