Clinical stability and propensity score matching in Cardiac Surgery: Is the clinical evaluation of treatment efficacy algorithm-dependent in small sample size settings?

Bottigliengo, D.; Acar, A. S.; Sciannameo, V.; Lorenzoni, G.; Bejko, J.; Bottio, T.; Cozzi, E.; Vadori, M.; Soulillou, J.-P.; Roussel, J. C.; Le Torneau, T.; Senage, T.; Manez, R.; Costa, C.; Padler-Karavani, V.; Scali, S.; Carrozzini, M.; Fiorello, E.; Fusca, Samuel; Gerosa, G.; Baldi, I.; Berchialla, P.; Gregori, D.

doi:10.2427/13001

Background: Propensity score matching is one of the most popular techniques for dealing with treatment allocation bias in observational studies. However, when the number of enrolled patients is very small, the composition of the matched sets of subjects may depend heavily on the model used to estimate individual propensity scores, undermining the stability of the resulting clinical findings. In this study, we investigate potential issues with the stability of matched sets created by different propensity score models, and we propose some diagnostic tools to evaluate them. Methods: Matched groups of patients were created using five different methods to estimate the propensity score: Logistic Regression, Classification and Regression Trees, Bagging, Random Forest and Generalized Boosted Model. Differences between subjects in the matched sets were evaluated by comparing both pre-treatment covariates and propensity score distributions. We applied our proposal to an observational study in cardiac surgery that aims to compare two different procedures for cardiac valve replacement. Results: Both baseline characteristics and propensity score distributions differed systematically across the matched samples of patients created with the different propensity score models. The most relevant differences were observed for the matched set created by estimating individual propensity scores with the Classification and Regression Trees algorithm. Conclusion: The clinical stability of matched samples created with different statistical methods should always be evaluated to ensure the reliability of final estimates. This work opens the door to future investigations that fully assess the implications of this finding.
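The comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the matching algorithm or the diagnostics, so greedy 1:1 nearest-neighbour matching, the Jaccard overlap of matched controls, and the standardized mean difference are assumptions chosen for illustration. The six-patient data set, the variable names (`ps_logit`, `ps_cart`, `age`) and the propensity scores are all hypothetical; in practice the scores would come from fitted models.

```python
from statistics import mean, pstdev


def greedy_match(scores, treated):
    """Greedy 1:1 nearest-neighbour matching on the propensity score,
    without replacement. Returns a list of (treated_idx, control_idx)."""
    controls = [i for i, t in enumerate(treated) if not t]
    pairs = []
    for i in (k for k, t in enumerate(treated) if t):
        if not controls:
            break
        # Closest remaining control; ties go to the lowest index.
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        pairs.append((i, j))
        controls.remove(j)
    return pairs


def control_overlap(pairs_a, pairs_b):
    """Jaccard overlap of the control subjects selected by two matchings."""
    a = {j for _, j in pairs_a}
    b = {j for _, j in pairs_b}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def smd(x, idx_a, idx_b):
    """Standardized mean difference of covariate x between two subject sets."""
    xa = [x[i] for i in idx_a]
    xb = [x[i] for i in idx_b]
    pooled = ((pstdev(xa) ** 2 + pstdev(xb) ** 2) / 2) ** 0.5
    return (mean(xa) - mean(xb)) / pooled if pooled > 0 else 0.0


# Hypothetical toy data: two treated patients and four controls.
treated = [True, True, False, False, False, False]
age = [70, 55, 72, 54, 60, 65]

# Hypothetical propensity scores from two models. The tree-like model
# yields only a few distinct values, as a single tree typically does.
ps_logit = [0.80, 0.30, 0.78, 0.32, 0.50, 0.60]
ps_cart = [0.75, 0.25, 0.25, 0.25, 0.75, 0.25]

pairs_logit = greedy_match(ps_logit, treated)
pairs_cart = greedy_match(ps_cart, treated)

overlap = control_overlap(pairs_logit, pairs_cart)
age_smd = smd(age,
              [j for _, j in pairs_logit],
              [j for _, j in pairs_cart])

print("logit pairs:", pairs_logit)
print("cart pairs: ", pairs_cart)
print("control overlap:", round(overlap, 3))
print("SMD of age between matched control sets:", round(age_smd, 3))
```

An overlap well below 1, or a large standardized mean difference on a baseline covariate, would indicate that the two propensity score models select clinically different matched samples, which is the kind of instability the study examines.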