02 — Clinical ML · Python · Scikit-learn
Osteoporosis Risk Prediction with Ensemble Methods
Explored an osteoporosis case-control dataset through demographic-driven EDA and built classification models (Logistic Regression, Random Forest, SVM, Gradient Boosting). Top models achieved strong ROC performance, but consistently favored the negative class — highlighting the real-world challenges of identifying positive cases in imbalanced clinical data.
Notebook
Full Analysis
Osteoporosis classification on a matched case-control study
This notebook has three parts:
- Part 1: data visualization and univariate feature selection using the Chi-square test
- Part 2: data cleaning (converting variables into one-hot encodings)
- Part 3: training ML models (LR, RF, SVM, XGB)
Brief data description:
- This is a matched case-control dataset, matching on confounding variables (age, gender, race)
- Total sample size: 61,022 rows, 46 columns
- Numeric variables:
- BMI_Avg (BMI already imputed using multiple imputation in SAS)
- Age
- Calcium and Sodium variables ('Calcium_Closest_Osteo', 'Calcium_Avg_Prior', 'Calcium_Avg_Ever', 'Sodium_Closest_Osteo', 'Sodium_Avg_Prior', 'Sodium_Avg_Ever', 'Sodium_Worst_Prior', 'Sodium_Worst_Ever')
- Categorical variables: Gender, Ethnicity, Alcohol_Prior, Tobacco_Prior, etc. (anything with 'prior' in variable names)
Data preprocessing for numeric variables
For the numeric variables, I converted them into categorical variables. The main purpose of this conversion is to make the features easier for Logistic Regression to learn in the next step, and to let Random Forest split faster on binary variables when building decision trees.
To do this conversion, I created new variables in two steps: discretization, then presence/absence flags.
- Step 1: discretize, converting the continuous variables into categorical ones. Here I applied pandas' qcut function with 10 bins. Unlike cut, which makes bins of equal width, qcut makes bins with an equal number of observations (equal N) in each bin; the trade-off is that the bin intervals are unequal. All variables with 'decile' in their names are products of this step.
- Step 2: create presence/absence variables for these numeric variables (1 if a measurement exists, 0 if it is missing). All variables with 'cat' in their names are products of this step.
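The two steps above can be sketched on a toy column. The column name mirrors the notebook's convention, but the data here is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a lab column; ~20% of measurements are missing.
rng = np.random.default_rng(0)
vals = rng.normal(140, 5, 1000)
vals[rng.choice(1000, size=200, replace=False)] = np.nan
df = pd.DataFrame({'Sodium_Avg_Prior': vals})

# Step 1: qcut into 10 equal-N bins. Unlike pd.cut (equal-width bins),
# qcut puts an equal number of observations in each bin; NaNs stay NaN.
df['Sodium_Avg_Prior_decile'] = pd.qcut(df['Sodium_Avg_Prior'], q=10, labels=False)

# Step 2: presence/absence flag: 1 if a measurement exists, 0 if missing.
df['Sodium_Avg_Prior_cat'] = df['Sodium_Avg_Prior'].notna().astype(int)

print(df['Sodium_Avg_Prior_decile'].value_counts().tolist())  # ten bins of 80 each
print(df['Sodium_Avg_Prior_cat'].sum())                       # 800 rows present
```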
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
cd /Users/thanhng/osteo_project
/Users/thanhng/osteo_project
data = pd.read_csv('osteo_clean.csv')
print(data.shape)
data.head()
(61022, 46)
| Strata | sex | race_combine | Age_combine | osteo_predict | BMI_Avg_imputed | Alcohol_Prior | Tobacco_Prior | Drug_antipsych_prior | Drug_Estrogens_prior | ... | Sodium_Worst_Prior_decile | Sodium_Worst_Ever_decile | Calcium_Closest_Osteo_cat | Calcium_Avg_Prior_cat | Calcium_Avg_Ever_cat | Sodium_Closest_Osteo_cat | Sodium_Avg_Prior_cat | Sodium_Avg_Ever_cat | Sodium_Worst_Prior_cat | Sodium_Worst_Ever_cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9576 | F | black | from_50_to_70 | 0.0 | 37.834028 | 0 | 0 | 0 | 0 | ... | NaN | 5.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 13833 | F | white | from_70_to_80 | 0.0 | 32.948625 | 0 | 0 | 0 | 0 | ... | NaN | 9.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 28858 | F | black | from_30_to_50 | 0.0 | 33.388021 | 0 | 0 | 0 | 0 | ... | NaN | 4.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 16496 | F | others | from_50_to_70 | 0.0 | 28.308928 | 1 | 0 | 0 | 0 | ... | 9.0 | 9.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | F | white | more_than_80 | 1.0 | 22.666667 | 0 | 0 | 1 | 0 | ... | 4.0 | 2.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
5 rows × 46 columns
data.isna().sum()
Calcium_Closest_Osteo_decile    25454
Calcium_Avg_Prior_decile        25385
Sodium_Avg_Prior_decile         19891
Sodium_Worst_Prior_decile       19891
Sodium_Worst_Ever_decile          249
(all other columns                  0)
dtype: int64
data.dtypes
Strata, the binary 0/1 indicators, the *_cat columns,
and the *_decile columns without NaNs                 int64
sex, race_combine, Age_combine                        object
osteo_predict, BMI_Avg_imputed, the *_edit columns,
and the *_decile columns that contain NaNs            float64
dtype: object
Part 1: data visualization
This dataset is from a matched case-control study, matching on the demographic variables age, sex, and race. This means the case samples (having osteo) and control samples (not having osteo) have similar distributions across these confounding variables. In particular, 'osteo' and 'no_osteo' will have near-equal sample sizes N across the three demographic variables.
#create a copy of Y for the sake of visualization
data['osteo_label'] = data['osteo_predict'].replace({1.0: 'osteo', 0.0: 'no_osteo'})
data['osteo_label'].value_counts()
no_osteo 30511 osteo 30511 Name: osteo_label, dtype: int64
The data is split equally between the two classes of the Y label (no_osteo and osteo). This is expected because the data is matched: each case is paired with a control.
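What matching buys us can be seen on a tiny hypothetical matched dataset (column names mirror the real data, values are made up): within every level of a matched demographic variable, case and control counts come out equal.

```python
import pandas as pd

# Toy matched pairs: each stratum contributes one case and one control
# with identical demographics.
toy = pd.DataFrame({
    'Strata':        [1, 1, 2, 2, 3, 3],
    'sex':           ['F', 'F', 'M', 'M', 'F', 'F'],
    'osteo_predict': [1, 0, 1, 0, 1, 0],
})

# Each sex level has equal case/control counts, as matching guarantees.
print(pd.crosstab(toy['sex'], toy['osteo_predict']))
```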
col_order = ['less_than_30', 'from_30_to_50','from_50_to_70', 'from_70_to_80', 'more_than_80']
ax = sns.catplot(y="BMI_Avg_imputed", x="Age_combine", hue='osteo_label', data=data, kind='box', order=col_order,
palette="pastel")
ax.set_xticklabels(['<30', '31-50', '51-70', '71-80', '>80'])
plt.subplots_adjust(top=0.9)
ax.fig.suptitle('BMI across Ages groups')
Text(0.5, 0.98, 'BMI across Ages groups')
In general, mean BMI is around 30 across all the age groups, suggesting that these patients are, on average, obese. The group means are also close to one another, so there is little difference between age groups. Looking closely at individual groups, BMI_Avg_imputed in the 71-80 group has more extreme outliers than the other groups, with some BMI values above 80.
sns.catplot(y="BMI_Avg_imputed", x="race_combine", hue='osteo_label', data=data, kind='box', palette="pastel").set(title='BMI across Race', xlabel='Race')
<seaborn.axisgrid.FacetGrid at 0x126e44350>
sns.catplot(y="BMI_Avg_imputed", x="sex", hue='osteo_label', data=data, kind='box',palette="pastel" ).set(title='BMI across Gender', xlabel='Gender').set_xticklabels(['Female', 'Male'])
<seaborn.axisgrid.FacetGrid at 0x12efab690>
sns.catplot(y="BMI_Avg_imputed", x="sex", hue='osteo_label', data=data, kind='violin').set(title='BMI across Gender')
<seaborn.axisgrid.FacetGrid at 0x125b4fe50>
col_order = ['less_than_30', 'from_30_to_50','from_50_to_70', 'from_70_to_80', 'more_than_80']
ax = sns.catplot(x="Age_combine", kind="count", palette="pastel", data=data, order=col_order)
ax.set_xticklabels(['<30', '31-50', '51-70', '71-80', '>80'], ha="center")
plt.subplots_adjust(top=0.9)
ax.fig.suptitle('Age distribution across subgroups')
for patch in ax.ax.patches:
    ax.ax.text(patch.get_x(), patch.get_height() + 400, patch.get_height())
ax = sns.catplot(x="race_combine", kind="count", data=data, hue='osteo_label',
palette="pastel").set(xlabel='Race')
ax.set_xticklabels(ha="center")
plt.subplots_adjust(top=0.9)
ax.fig.suptitle('Race distribution by osteoporosis')
for patch in ax.ax.patches:
    ax.ax.text(patch.get_x(), patch.get_height() + 500, patch.get_height())
col_order = ['less_than_30', 'from_30_to_50','from_50_to_70', 'from_70_to_80', 'more_than_80']
ax = sns.catplot(x="Age_combine", kind="count", hue="osteo_label",
palette="pastel", edgecolor=".6",data=data, order=col_order, legend_out=False, height=6).set(xlabel='Age')
ax.set_xticklabels(['<30', '31-50', '51-70', '71-80', '>80'], ha="center")
plt.tight_layout()
plt.subplots_adjust(top=0.9)
ax.fig.suptitle('Age distribution by osteoporosis')
for patch in ax.ax.patches:
    ax.ax.text(patch.get_x(), patch.get_height() + 400, patch.get_height(), fontsize=8.5)
ax = sns.catplot(x="sex", kind="count", data=data, hue='osteo_label',
palette="pastel").set(xlabel='Gender')
ax.set_xticklabels(['Female', 'Male'], ha="center")
plt.subplots_adjust(top=0.9)
ax.fig.suptitle('Gender distribution by osteoporosis')
for patch in ax.ax.patches:
    ax.ax.text(patch.get_x(), patch.get_height() + 400, patch.get_height())
Perform univariate feature selection using the Chi-square test, since all of these variables are categorical.
To do this, we first need to map the string-valued categorical variables to numeric codes.
data['sex_category'] = data['sex'].copy()
data['sex_category'] = data['sex_category'].replace({'F':1, 'M':0})
data['sex_category'].value_counts()
1 53852 0 7170 Name: sex_category, dtype: int64
#convert race
data['race_combine'].value_counts()
white 36438 black 17256 others 7328 Name: race_combine, dtype: int64
data['race_category'] = data['race_combine'].copy()
data['race_category'] = data['race_category'].replace({'white': '0', 'black':'1', 'others':'2'})
data['race_category'].value_counts()
0 36438 1 17256 2 7328 Name: race_category, dtype: int64
#convert age
data['Age_combine'].value_counts().sort_index()
from_30_to_50 7264 from_50_to_70 25908 from_70_to_80 14712 less_than_30 984 more_than_80 12154 Name: Age_combine, dtype: int64
data['age_category'] = data['Age_combine'].copy()
data['age_category']= data['age_category'].replace({
'less_than_30': '0',
'from_30_to_50':'1',
'from_50_to_70': '2',
'from_70_to_80': '3',
'more_than_80': '4'
})
data['age_category'].value_counts().sort_index()
0 984 1 7264 2 25908 3 14712 4 12154 Name: age_category, dtype: int64
#fill missing with 0 in multiple cols
lst_col = ['Calcium_Closest_Osteo_decile','Calcium_Avg_Prior_decile',
'Sodium_Avg_Prior_decile', 'Sodium_Worst_Prior_decile', 'Sodium_Worst_Ever_decile']
data[lst_col] = data[lst_col].fillna(0).astype(str)
data.isna().sum()
(all 50 columns report 0 missing values)
dtype: int64
df_features_cat = data.drop(['Strata', 'sex', 'race_combine',
'Age_combine', 'osteo_predict','BMI_Avg_imputed', 'osteo_label'], axis=1)
df_features_cat
| Alcohol_Prior | Tobacco_Prior | Drug_antipsych_prior | Drug_Estrogens_prior | Drug_Glucocorticoids_prior | Drug_Nsaids_prior | Drug_Opiates_prior | Drug_Thiazide_prior | Drug_Loop_Diuretic_Prior | Drug_Pp_inhibitors_prior | ... | Calcium_Avg_Prior_cat | Calcium_Avg_Ever_cat | Sodium_Closest_Osteo_cat | Sodium_Avg_Prior_cat | Sodium_Avg_Ever_cat | Sodium_Worst_Prior_cat | Sodium_Worst_Ever_cat | sex_category | race_category | age_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 2 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 3 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61017 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 61018 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 2 |
| 61019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 3 |
| 61020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
| 61021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
61022 rows × 43 columns
#perform Chi-square test
from sklearn.feature_selection import SelectKBest, chi2
feature_selection = SelectKBest(chi2, k=5)
feature_selection.fit(df_features_cat, data['osteo_predict'])
selected_features = df_features_cat.columns[feature_selection.get_support()]
print("The five selected features are: ", list(selected_features))
The five selected features are: ['Chronic_Hyponatremia', 'Recent_Hyponatremia', 'Median_Recent_Hypo_Cat_edit', 'Lowest_Recent_Hypo_Cat_edit', 'Calcium_Avg_Ever_decile']
correlation_score_df = pd.DataFrame(zip(feature_selection.scores_, feature_selection.pvalues_), index=df_features_cat.columns).reset_index()
correlation_score_df.columns=['variables', 'chi2_score', 'chi2_pvalue']
correlation_score_df
| variables | chi2_score | chi2_pvalue | |
|---|---|---|---|
| 0 | Alcohol_Prior | 35.041667 | 3.227251e-09 |
| 1 | Tobacco_Prior | 236.966851 | 1.803369e-53 |
| 2 | Drug_antipsych_prior | 76.379679 | 2.340473e-18 |
| 3 | Drug_Estrogens_prior | 19.043103 | 1.277988e-05 |
| 4 | Drug_Glucocorticoids_prior | 233.095174 | 1.259957e-52 |
| 5 | Drug_Nsaids_prior | 91.570916 | 1.076638e-21 |
| 6 | Drug_Opiates_prior | 395.455300 | 5.373691e-88 |
| 7 | Drug_Thiazide_prior | 88.356643 | 5.465574e-21 |
| 8 | Drug_Loop_Diuretic_Prior | 60.984050 | 5.753911e-15 |
| 9 | Drug_Pp_inhibitors_prior | 269.300643 | 1.611877e-60 |
| 10 | Drug_Progesterone_prior | 17.442907 | 2.960670e-05 |
| 11 | Drug_Seizure_prior | 28.970803 | 7.347747e-08 |
| 12 | Drug_Ssris_prior | 75.768827 | 3.188961e-18 |
| 13 | Drug_Tc_antidepress_prior | 49.459870 | 2.024713e-12 |
| 14 | HeartDisease_Prior | 7.779533 | 5.284146e-03 |
| 15 | Liver_Prior | 44.690873 | 2.307304e-11 |
| 16 | PulmDisease_Prior | 127.781058 | 1.253337e-29 |
| 17 | CNS_Disease_Prior | 97.196721 | 6.277051e-23 |
| 18 | Malignancy_Prior | 10.870082 | 9.773018e-04 |
| 19 | Hyponatremia_Prior | 58.500278 | 2.032622e-14 |
| 20 | Chronic_Hyponatremia | 971.299045 | 3.111071e-213 |
| 21 | Recent_Hyponatremia | 810.925007 | 2.273987e-178 |
| 22 | Median_Recent_Hypo_Cat_edit | 1277.562005 | 8.498559e-280 |
| 23 | Lowest_Recent_Hypo_Cat_edit | 1895.152960 | 0.000000e+00 |
| 24 | Calcium_Closest_Osteo_decile | 134.771813 | 3.703808e-31 |
| 25 | Calcium_Avg_Prior_decile | 22.388102 | 2.227497e-06 |
| 26 | Calcium_Avg_Ever_decile | 3724.571243 | 0.000000e+00 |
| 27 | Sodium_Closest_Osteo_decile | 169.127796 | 1.147275e-38 |
| 28 | Sodium_Avg_Prior_decile | 16.676051 | 4.433720e-05 |
| 29 | Sodium_Avg_Ever_decile | 16.281055 | 5.460720e-05 |
| 30 | Sodium_Worst_Prior_decile | 33.830992 | 6.011355e-09 |
| 31 | Sodium_Worst_Ever_decile | 471.444236 | 1.554152e-104 |
| 32 | Calcium_Closest_Osteo_cat | 595.434211 | 1.647786e-131 |
| 33 | Calcium_Avg_Prior_cat | 468.714230 | 6.103270e-104 |
| 34 | Calcium_Avg_Ever_cat | 0.000000 | 1.000000e+00 |
| 35 | Sodium_Closest_Osteo_cat | 58.482878 | 2.050678e-14 |
| 36 | Sodium_Avg_Prior_cat | 10.558484 | 1.156560e-03 |
| 37 | Sodium_Avg_Ever_cat | 0.000000 | 1.000000e+00 |
| 38 | Sodium_Worst_Prior_cat | 10.558484 | 1.156560e-03 |
| 39 | Sodium_Worst_Ever_cat | 1.020206 | 3.124701e-01 |
| 40 | sex_category | 0.000000 | 1.000000e+00 |
| 41 | race_category | 0.000000 | 1.000000e+00 |
| 42 | age_category | 0.000000 | 1.000000e+00 |
Because this sample size is large, the Chi-square test is likely to
return a low p-value even for a table with small differences from the expected proportions.
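This sensitivity to sample size can be illustrated with scipy's chi2_contingency: the same proportions at 100x the sample size yield a dramatically smaller p-value.

```python
from scipy.stats import chi2_contingency

# Two 2x2 tables with identical proportions (55% vs 50%), differing only in N.
small = [[55, 45], [50, 50]]          # N = 200
large = [[5500, 4500], [5000, 5000]]  # N = 20,000

p_small = chi2_contingency(small)[1]
p_large = chi2_contingency(large)[1]

# The small table is far from significant; the large one is overwhelmingly so,
# even though the effect size is identical.
print(p_small, p_large)
```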
X = pd.DataFrame(feature_selection.transform(df_features_cat),
columns=selected_features)
X.head()
| Chronic_Hyponatremia | Recent_Hyponatremia | Median_Recent_Hypo_Cat_edit | Lowest_Recent_Hypo_Cat_edit | Calcium_Avg_Ever_decile | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | 0.0 | 9 |
| 1 | 0 | 0 | 0.0 | 0.0 | 5 |
| 2 | 0 | 0 | 0.0 | 0.0 | 0 |
| 3 | 0 | 0 | 0.0 | 0.0 | 9 |
| 4 | 0 | 0 | 0.0 | 0.0 | 5 |
Check correlations to see which variables are correlated with each other
# Create correlation matrix
#remove 'strata' and Y
data_corr = data.drop(['Strata', 'osteo_predict'], axis=1)
corr_mat = data_corr.corr()
# Create mask
mask = np.zeros_like(corr_mat, dtype=bool)  # np.bool is deprecated; use the builtin
mask[np.triu_indices_from(mask, k=1)] = True
# Plot heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(corr_mat, annot=True, fmt='.1f',
cmap='RdBu_r', vmin=-1, vmax=1,
mask=mask)
<AxesSubplot:>
#check for correlation scores larger than 0.2
plt.figure(figsize=(15, 10))
sns.heatmap(corr_mat[corr_mat > 0.2], annot=True,
fmt='.1f', cmap=sns.cubehelix_palette(200), mask=mask)
<AxesSubplot:>
#check for negative correlation scores
plt.figure(figsize=(15, 10))
sns.heatmap(corr_mat[corr_mat < 0.0], annot=True,
fmt='.1f', cmap=sns.cubehelix_palette(200), mask=mask)
<AxesSubplot:>
Part 2: One-hot encoding (OHE)
- pandas' get_dummies returns uint8 columns and keeps no fitted encoder, so once the encoding is fed into an ML model it cannot be decoded back to the original categories
Thus, we use sklearn's OneHotEncoder instead
from sklearn.preprocessing import OneHotEncoder
def CONVERT_TO_OHE(original_data):
non_ohe = original_data.drop(['sex','race_combine', 'Age_combine',
'Strata','BMI_Avg_imputed', 'osteo_predict'], axis=1)
#convert non_ohe to string
non_ohe = non_ohe.astype(str)
enc = OneHotEncoder(drop='first')
enc_fit = enc.fit_transform(non_ohe)
col_names = enc.get_feature_names(non_ohe.columns)
ohe_df = pd.DataFrame(enc_fit.toarray(), columns=col_names)
#combine with original data
combine = original_data[['Strata','BMI_Avg_imputed', 'osteo_predict']]
data_ohe = pd.concat([ohe_df, combine], axis=1)
return data_ohe
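The decoding advantage of the fitted sklearn encoder can be sketched on a hypothetical race column: inverse_transform maps the encoded rows back to the original category labels.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'race': ['white', 'black', 'others', 'white']})

# No drop='first' here, so the inverse mapping is exact.
enc = OneHotEncoder()
encoded = enc.fit_transform(toy)

# Round-trip back to the original categories.
decoded = enc.inverse_transform(encoded)
print(decoded.ravel().tolist())  # ['white', 'black', 'others', 'white']
```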
Part 3: training models using LR, RF, SVM, Adaboost, XGB
We choose these models because the majority of our variables are categorical. And since some of them are slightly correlated with each other, we do not use Naive Bayes here, because its conditional-independence assumption may be violated.
The confusion matrix follows this order (rows = actual class, columns = predicted class):

| | predicted negative | predicted positive |
| ------------- | ------------- | ------------- |
| actual negative | TN | FP |
| actual positive | FN | TP |
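A tiny check of this layout with sklearn's confusion_matrix, using labels chosen so that every cell holds a distinct count:

```python
from sklearn.metrics import confusion_matrix

# Constructed so the cells come out as 1 TN, 2 FP, 3 FN, 4 TP.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))
# [[1 2]
#  [3 4]]   rows = actual class, columns = predicted class
```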
from sklearn.model_selection import KFold, cross_val_predict, cross_validate
import sklearn.linear_model as linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss, roc_curve
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
#########
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
def TRAIN_MODEL_ML(data_ohe, classifier, tune):
X = data_ohe.drop(['osteo_predict'], axis=1)
Y = data_ohe.filter(['Strata', 'osteo_predict'])
fold = KFold(10, shuffle = True, random_state = 12345)
strata = data_ohe['Strata'].unique()
all_preds = np.full(data_ohe.shape[0], 100)  # 100 = sentinel for "not yet predicted"
probability = np.ones(data_ohe.shape[0])
training_acc = []
testing_acc = []
for train_index, test_index in fold.split(strata):
train_index_strata = strata[train_index]
test_index_strata = strata[test_index]
X_train = X.loc[X['Strata'].isin(train_index_strata)]
X_train = X_train.drop(['Strata'], axis = 1)
X_test = X.loc[X['Strata'].isin(test_index_strata)]
X_test = X_test.drop(['Strata'], axis = 1)
y_train = Y.loc[Y['Strata'].isin(train_index_strata)]['osteo_predict']
y_test = Y.loc[Y['Strata'].isin(test_index_strata)]['osteo_predict']
if classifier == 'LR':
lr = LogisticRegression(solver='liblinear')
if classifier == 'RF':
# Create a based model
lr = RandomForestClassifier(random_state = 12345)
if tune == 'true':
param_grid = {
'bootstrap': [True],
'max_depth': [80, 90, 100, 110],
'max_features': [2, 3],
'min_samples_leaf': [3, 4, 5],
'min_samples_split': [8, 10, 12],
'n_estimators': [100, 200, 300, 1000]
}
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = lr, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 1)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_grid = grid_search.best_estimator_
grid_accuracy = best_grid.score(X_test, y_test)
print('Tuned RF test accuracy: {:0.4f}'.format(grid_accuracy))
if classifier == 'SVM_linear':
svm_linear = LinearSVC(max_iter = 1000, random_state = 12345, C = 0.001, loss='hinge')  # max_iter=1000 triggers the ConvergenceWarnings below; raise it for full convergence
lr = CalibratedClassifierCV(svm_linear)
if classifier == 'SVM_rbf':
lr = svm.SVC(kernel='rbf', probability = True, max_iter = 1000, C= 100, gamma = 30)
if classifier == 'AdaBoost':
logistic_regression = linear_model.LogisticRegression(random_state = 12345, solver = 'lbfgs')
lr = AdaBoostClassifier(n_estimators=100,
base_estimator = logistic_regression,
learning_rate=1)
if classifier == 'XGB':
# note: sklearn's GradientBoostingClassifier, labeled 'XGB' here for brevity
lr = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
max_depth=3, random_state=42)
fit = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
all_preds[y_test.index] = y_pred
probs = lr.predict_proba(X_test)
probability[y_test.index] = probs[:, 1]
training_acc.append(fit.score(X_train, y_train))
testing_acc.append(fit.score(X_test, y_test))
data['all_pred'] = all_preds
data['probability'] = probability
print('\nfinal result of %s model: ' % classifier)
print(classification_report(data['osteo_predict'], data['all_pred']), '\n')
print(confusion_matrix(data['osteo_predict'], data['all_pred']), '\n')
print('ROC_AUC of', classifier, '', roc_auc_score(data['osteo_predict'], data['probability']))
return data, training_acc, testing_acc, lr
def RUN_MULTIPLE_MODELS(data, classifiers, tune):
result = []
for classifier in classifiers:
data_ohe = CONVERT_TO_OHE(data)
data_ohe['osteo_predict'] = data_ohe['osteo_predict'].astype(str).replace({'0.0': '0', '1.0': '1'})
model_data, train_acc, test_acc, clf = TRAIN_MODEL_ML(data_ohe, classifier, tune)
fpr,tpr,threshold=roc_curve(model_data['osteo_predict'], model_data['probability'])
auc = roc_auc_score(model_data['osteo_predict'], model_data['probability'])
result.append({'classifier': classifier,
'FPR': fpr,
'TPR': tpr,
'AUC': auc,
'training_acc': train_acc,
'testing_acc': test_acc,
'clf': clf})
result_df = pd.DataFrame(result, columns=['classifier','FPR','TPR','AUC',
'training_acc', 'testing_acc', 'clf'])
result_df.set_index('classifier', inplace=True)
return result_df
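As an aside, the manual KFold-over-unique-strata loop above exists to keep each matched pair in a single fold; sklearn's GroupKFold gives the same guarantee directly. A minimal sketch on synthetic paired data (not the notebook's actual pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# 10 strata, 2 rows each (one case, one control per stratum).
df = pd.DataFrame({'Strata': np.repeat(np.arange(10), 2),
                   'x': np.arange(20),
                   'y': [0, 1] * 10})

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(df[['x']], df['y'], groups=df['Strata']):
    # No stratum ever appears in both the train and test sets.
    test_strata = set(df.loc[test_idx, 'Strata'])
    train_strata = set(df.loc[train_idx, 'Strata'])
    assert test_strata.isdisjoint(train_strata)
print('all folds keep matched pairs intact')
```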
def PLOT_MULTIPLE_ROC(result_df):
#plot multiple AUC in 1 plot
fig = plt.figure(figsize=(8,6))
for i in result_df.index:
plt.plot(result_df.loc[i]['FPR'],
result_df.loc[i]['TPR'],
label="{}, AUC={:.3f}".format(i, result_df.loc[i]['AUC']))
plt.plot([0,1], [0,1], color='orange', linestyle='--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Comparison', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()
data_train_model = data.drop(['sex_category', 'race_category', 'age_category', 'osteo_label'],
axis=1)
classifiers = ['LR', 'RF']
lr_res = RUN_MULTIPLE_MODELS(data_train_model, classifiers, 'false')
final result of LR model:
precision recall f1-score support
0.0 0.71 0.77 0.74 30511
1.0 0.75 0.68 0.71 30511
accuracy 0.72 61022
macro avg 0.73 0.72 0.72 61022
weighted avg 0.73 0.72 0.72 61022
[[23474 7037]
[ 9799 20712]]
ROC_AUC of LR 0.8088236758353666
final result of RF model:
precision recall f1-score support
0.0 0.70 0.75 0.72 30511
1.0 0.73 0.68 0.70 30511
accuracy 0.71 61022
macro avg 0.72 0.71 0.71 61022
weighted avg 0.72 0.71 0.71 61022
[[22843 7668]
[ 9744 20767]]
ROC_AUC of RF 0.7939774953285219
PLOT_MULTIPLE_ROC(lr_res)
LR performs better than RF, with a higher AUC. This also suggests that our data favors the linear model LR rather than the non-linear model RF. According to the confusion matrices of these two models, predictions are biased toward the negative class, since the TN and FN counts are higher than the TP and FP counts.
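One common mitigation for this negative-class bias, sketched here on synthetic probabilities (a stand-in for the out-of-fold data['probability'] column), is to lower the 0.5 decision threshold: predicting the positive class more readily can only reduce false negatives, at the cost of more false positives.

```python
import numpy as np

# Synthetic probabilities and noisy labels standing in for model output.
rng = np.random.default_rng(0)
prob = rng.uniform(0, 1, 1000)
y = (prob + rng.normal(0, 0.3, 1000) > 0.5).astype(int)

# Lowering the threshold only flips predictions 0 -> 1, so FN never grows.
for thresh in (0.5, 0.4, 0.3):
    pred = (prob >= thresh).astype(int)
    fn = int(((y == 1) & (pred == 0)).sum())
    print(f'threshold={thresh}: false negatives={fn}')
```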
lr_res
| FPR | TPR | AUC | training_acc | testing_acc | clf | |
|---|---|---|---|---|---|---|
| classifier | ||||||
| LR | [0.0, 0.0, 0.0, 3.2775064730752846e-05, 3.2775... | [0.0, 3.2775064730752846e-05, 0.00147487791288... | 0.808824 | [0.7262099857970065, 0.7268936635105608, 0.726... | [0.7187090432503277, 0.724188790560472, 0.7273... | LogisticRegression(solver='liblinear') |
| RF | [0.0, 0.004391858673920881, 0.0069155386581888... | [0.0, 0.0905575038510701, 0.1370653207040084, ... | 0.793977 | [0.999981791033905, 0.9999635833940277, 0.9999... | [0.7123197903014417, 0.7055063913470994, 0.724... | (DecisionTreeClassifier(max_features='auto', r... |
classifiers = ['SVM_linear', 'SVM_rbf']
svm_res = RUN_MULTIPLE_MODELS(data_train_model, classifiers, 'false')  # named svm_res to avoid shadowing the imported sklearn svm module
/opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. (repeated once per fold)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning) /opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. 
"the number of iterations.", ConvergenceWarning)
final result of SVM_linear model:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.65 | 0.76 | 0.70 | 30511 |
| 1.0 | 0.72 | 0.59 | 0.65 | 30511 |
| accuracy | | | 0.68 | 61022 |
| macro avg | 0.68 | 0.68 | 0.68 | 61022 |
| weighted avg | 0.68 | 0.68 | 0.68 | 61022 |

Confusion matrix (rows = actual, columns = predicted):

| | pred 0 | pred 1 |
|---|---|---|
| actual 0 | 23339 | 7172 |
| actual 1 | 12399 | 18112 |

ROC_AUC of SVM_linear: 0.7416086534360626
/opt/anaconda3/lib/python3.7/site-packages/sklearn/svm/_base.py:249: ConvergenceWarning: Solver terminated early (max_iter=1000). Consider pre-processing your data with StandardScaler or MinMaxScaler.
final result of SVM_rbf model:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.50 | 0.99 | 0.67 | 30511 |
| 1.0 | 0.57 | 0.01 | 0.02 | 30511 |
| accuracy | | | 0.50 | 61022 |
| macro avg | 0.54 | 0.50 | 0.34 | 61022 |
| weighted avg | 0.54 | 0.50 | 0.34 | 61022 |

Confusion matrix (rows = actual, columns = predicted):

| | pred 0 | pred 1 |
|---|---|---|
| actual 0 | 30249 | 262 |
| actual 1 | 30157 | 354 |

ROC_AUC of SVM_rbf: 0.5081350872046656
PLOT_MULTIPLE_ROC(svm)
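The near-chance RBF result (AUC ≈ 0.51, almost everything predicted as class 0) is consistent with the early-termination warning above and the very large kernel width in the fitted model (the results table shows `SVC(C=100, gamma=30, max_iter=1000, ...)`). A hedged sketch, on synthetic data, of a more typical RBF setup — scaled inputs and the default `gamma='scale'`:

```python
# Sketch (synthetic data, not this notebook's dataset): an RBF SVM
# usually needs scaled inputs and a moderate gamma; a fixed large
# value such as gamma=30 tends to collapse predictions to one class.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma='scale' (sklearn's default) adapts the kernel width to the
# feature variance instead of hard-coding it.
svm_rbf = make_pipeline(StandardScaler(), SVC(gamma='scale', probability=True))
svm_rbf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, svm_rbf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Whether this would lift the RBF model above the linear SVM here is an open question — but it should at least move it well clear of chance.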
classifiers = ['AdaBoost', 'XGB']
boosted_models = RUN_MULTIPLE_MODELS(data_train_model, classifiers, 'false')
final result of AdaBoost model:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.66 | 0.78 | 0.71 | 30511 |
| 1.0 | 0.73 | 0.59 | 0.65 | 30511 |
| accuracy | | | 0.68 | 61022 |
| macro avg | 0.69 | 0.68 | 0.68 | 61022 |
| weighted avg | 0.69 | 0.68 | 0.68 | 61022 |

Confusion matrix (rows = actual, columns = predicted):

| | pred 0 | pred 1 |
|---|---|---|
| actual 0 | 23700 | 6811 |
| actual 1 | 12443 | 18068 |

ROC_AUC of AdaBoost: 0.7515335781064527
final result of XGB model:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.73 | 0.77 | 0.75 | 30511 |
| 1.0 | 0.76 | 0.71 | 0.74 | 30511 |
| accuracy | | | 0.74 | 61022 |
| macro avg | 0.74 | 0.74 | 0.74 | 61022 |
| weighted avg | 0.74 | 0.74 | 0.74 | 61022 |

Confusion matrix (rows = actual, columns = predicted):

| | pred 0 | pred 1 |
|---|---|---|
| actual 0 | 23628 | 6883 |
| actual 1 | 8782 | 21729 |

ROC_AUC of XGB: 0.8326386710050808
PLOT_MULTIPLE_ROC(boosted_models)
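The body of `PLOT_MULTIPLE_ROC` is defined earlier in the notebook and not shown in this excerpt; a minimal sketch in the same spirit — one ROC curve per fitted model on a shared axis, with AUC in the legend — might look like this (model names and data are illustrative, not the notebook's):

```python
# Sketch of a multi-model ROC overlay in the spirit of PLOT_MULTIPLE_ROC.
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {'LR': LogisticRegression(max_iter=1000),
          'AdaBoost': AdaBoostClassifier()}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # roc_curve needs scores for the positive class, not hard labels
    fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
    plt.plot(fpr, tpr, label=f'{name} (AUC={auc(fpr, tpr):.3f})')

plt.plot([0, 1], [0, 1], 'k--')  # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.savefig('roc_curves.png')
```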
```python
def PLOT_ACCURACY(training_acc, testing_acc):
    # Plot per-fold training vs. testing accuracy from 10-fold CV
    plt.plot(training_acc)
    plt.plot(testing_acc)
    plt.xlabel('10-fold CV')
    plt.ylabel('accuracy')
    plt.legend(['training_acc', 'testing_acc'])
    plt.title('Model accuracy')
    plt.show()
```
res_df = pd.concat([lr_res, svm, boosted_models]).reset_index()
res_df
| | classifier | FPR | TPR | AUC | training_acc | testing_acc | clf |
|---|---|---|---|---|---|---|---|
| 0 | LR | [0.0, 0.0, 0.0, 3.2775064730752846e-05, 3.2775... | [0.0, 3.2775064730752846e-05, 0.00147487791288... | 0.808824 | [0.7262099857970065, 0.7268936635105608, 0.726... | [0.7187090432503277, 0.724188790560472, 0.7273... | LogisticRegression(solver='liblinear') |
| 1 | RF | [0.0, 0.004391858673920881, 0.0069155386581888... | [0.0, 0.0905575038510701, 0.1370653207040084, ... | 0.793977 | [0.999981791033905, 0.9999635833940277, 0.9999... | [0.7123197903014417, 0.7055063913470994, 0.724... | (DecisionTreeClassifier(max_features='auto', r... |
| 2 | SVM_linear | [0.0, 0.0, 0.0, 3.2775064730752846e-05, 3.2775... | [0.0, 3.2775064730752846e-05, 0.01025859526072... | 0.741609 | [0.6813977202374449, 0.6805899490167516, 0.678... | [0.67021625163827, 0.678302196001311, 0.686004... | CalibratedClassifierCV(base_estimator=LinearSV... |
| 3 | SVM_rbf | [0.0, 0.0, 3.2775064730752846e-05, 3.277506473... | [0.0, 3.2775064730752846e-05, 3.27750647307528... | 0.508135 | [0.5147856804690629, 0.5142388929351784, 0.514... | [0.5018020969855832, 0.4995083579154376, 0.504... | SVC(C=100, gamma=30, max_iter=1000, probabilit... |
| 4 | AdaBoost | [0.0, 0.0, 0.0, 3.2775064730752846e-05, 3.2775... | [0.0, 3.2775064730752846e-05, 0.01265117498607... | 0.751534 | [0.6866965293710623, 0.6859796067006555, 0.685... | [0.6805373525557011, 0.6856768272697477, 0.695... | (LogisticRegression(random_state=1176181795), ... |
| 5 | XGB | [0.0, 0.0, 0.0, 3.2775064730752846e-05, 3.2775... | [0.0, 3.2775064730752846e-05, 0.00016387532365... | 0.832639 | [0.7685640409337557, 0.7699563000728332, 0.768... | [0.7385321100917431, 0.7340216322517208, 0.747... | ([DecisionTreeRegressor(criterion='friedman_ms... |
```python
for i in range(res_df.shape[0]):
    classifier = res_df['classifier'][i]
    training = res_df['training_acc'][i]
    testing = res_df['testing_acc'][i]
    print('classifier: ', classifier)
    PLOT_ACCURACY(training, testing)
    print('\n')
```
*(per-classifier training/testing accuracy plots for LR, RF, SVM_linear, SVM_rbf, AdaBoost, and XGB are rendered here)*
PLOT_MULTIPLE_ROC(pd.concat([lr_res, svm[:-1], boosted_models]))
Conclusion
- The ROC plot above shows that XGB performs best on this data, with the highest AUC score. SVM_linear scoring higher than SVM_rbf, and LR scoring higher than RF, suggests that this dataset favors linear models over non-linear models.
- The confusion matrices of all 6 models show a consistent trend: every model predicts the negative class more often than the positive class, even though the data come from a balanced (1:1 matched) case-control study.
- Next steps: feature selection and parameter tuning for these models.
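A hedged sketch of those next steps on synthetic data: pipelining chi-square univariate selection (as used in Part 1) with a random forest and tuning both in one `GridSearchCV`. The grid values are illustrative, not recommendations for this dataset:

```python
# Sketch of the proposed next step: feature selection + parameter
# tuning in one cross-validated search. Synthetic data; grid values
# are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X = X - X.min()  # chi2 requires non-negative features

pipe = Pipeline([
    ('select', SelectKBest(chi2)),              # univariate selection
    ('rf', RandomForestClassifier(random_state=0)),
])
grid = {'select__k': [5, 10],
        'rf__n_estimators': [100, 200],
        'rf__max_depth': [3, None]}
search = GridSearchCV(pipe, grid, cv=3, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Running selection inside the pipeline (rather than on the full data beforehand) keeps the CV estimate honest, since features are re-selected within each training fold.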