Discrimination of Patients with Prostate Cancer from Healthy Persons Using a Set of Single Nucleotide Polymorphisms
Purpose: Prostate cancer is the second most commonly diagnosed cancer in males. It accounts for about 4% of cancer-related mortality in men. Several polymorphisms in different genes have been identified that alter the risk of this malignancy.
Materials and methods: We used the random forest (RF) algorithm to predict prostate cancer risk in an Iranian population from 13 single nucleotide polymorphisms (SNPs) in four genes (ANRIL, HOTAIR, IL-6 and IL-8). The samples were divided into a training set (n=320) and a test set (n=80) to evaluate the generalization power of the trained model. For hyper-parameter tuning, we used randomized search with 5-fold cross-validation over the following hyper-parameters: (1) the number of trees (estimators) in the forest (from 3 to 500); (2) the maximum number of leaf nodes (from 2 to 32); (3) the maximum number of features considered for the best split (from 5 to 13); and (4) whether bootstrap samples were used when building the trees (True or False). Accuracy, sensitivity, specificity, and F1-score were reported for both the training and test sets.
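The tuning procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the genotype matrix is synthetic random data (13 columns coded 0/1/2), the labels are random, and the number of sampled configurations (`n_iter=5`), the scoring metric, and all random seeds are placeholders that the abstract does not specify. Only the search ranges, the 5-fold cross-validation, and the 320/80 split come from the text.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# ASSUMPTION: synthetic stand-in data; the study's genotypes are not available here.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 13))  # 400 samples x 13 SNPs, coded 0/1/2
y = rng.integers(0, 2, size=400)        # 1 = case, 0 = control (random labels)

# 320 training samples and 80 test samples, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=80, stratify=y, random_state=0
)

# Search ranges from the text; scipy's randint excludes the upper bound.
param_distributions = {
    "n_estimators": randint(3, 501),    # number of trees: 3 to 500
    "max_leaf_nodes": randint(2, 33),   # maximum leaf nodes: 2 to 32
    "max_features": randint(5, 14),     # features per split: 5 to 13
    "bootstrap": [True, False],         # use bootstrap samples or not
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # ASSUMPTION: number of sampled configurations
    cv=5,                # 5-fold cross-validation, as in the text
    scoring="accuracy",  # ASSUMPTION: selection metric is not stated
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Impurity-based (Gini) importances of the refitted best model, one per feature.
importances = search.best_estimator_.feature_importances_
test_accuracy = search.score(X_test, y_test)
```

With random labels the test accuracy is expected to hover around chance; the point of the sketch is the search setup, not the score.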
Results: The most important SNP was ANRIL-rs1333048: A/A (Gini index = 0.096), followed by ANRIL-rs10757278: G/G (Gini index = 0.059). Training set outcomes were as follows: accuracy 0.896, sensitivity 0.85, specificity 0.944 and F1-score 0.891. Test set outcomes were as follows: accuracy 0.787, sensitivity 0.775, specificity 0.800 and F1-score 0.784. The AUC scores were 0.966 and 0.841 for the training and test sets, respectively.
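The four reported metrics all derive from a single confusion matrix, which the short sketch below makes explicit. The counts used are hypothetical: the abstract does not state how many cases and controls the test set contains, so the example assumes an even 40/40 split and picks counts that happen to reproduce the reported test-set sensitivity and specificity. The function name `classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),           # harmonic mean of precision and recall
    }

# ASSUMPTION: 40 cases and 40 controls in the 80-sample test set (not stated in the text).
test_metrics = classification_metrics(tp=31, fn=9, tn=32, fp=8)

for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this assumed split the counts give sensitivity 0.775, specificity 0.800, accuracy 0.7875 and an F1-score of about 0.785, in line with the reported test-set values.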
Conclusion: The proposed panel of SNPs can predict the risk of prostate cancer in the Iranian population with reasonable accuracy (test-set accuracy 0.787, AUC 0.841).