Data Description: The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be ('yes') or would not be ('no') subscribed.
Domain: Banking
Context: Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.
Attribute Information
age: age at the time of call
job: type of job
marital: marital status
education: education background at the time of call
default: has credit in default?
balance: average yearly balance, in euros (numeric)
housing: has housing loan?
loan: has personal loan?
contact: contact communication type
day: last contact day of the month (1-31)
month: last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration = 0 then Target = 'no'). Yet, the duration is not known before a call is performed. Also, after the end of the call Target is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client (includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
target: has the client subscribed a term deposit? ('yes', 'no')
Learning Outcomes
# Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline
# Impute and Encode
from sklearn.preprocessing import LabelEncoder
from impyute.imputation.cs import mice
# Modelling - LR, KNN, NB, Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer
# Oversampling
from imblearn.over_sampling import SMOTE
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
# Visualize Tree
from sklearn.tree import export_graphviz
from IPython.display import Image
from os import system
# Display settings
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000
random_state = 42
np.random.seed(random_state)
# Reading the data as dataframe and print the first five rows
bank = pd.read_csv('bank-full.csv')
bank.head()
# Get info of the dataframe columns
bank.info()
Performing exploratory data analysis on the bank dataset. Below are some of the steps performed:
object columns (job, marital, education, default, housing, loan, contact, day, month, poutcome)
(age, balance, duration, campaign, pdays, previous)
(job, marital, education, default, housing, loan, contact, day, month, poutcome, Target) to float for MICE training. Creating multiple imputations, as opposed to a single imputation, to complete the dataset accounts for the statistical uncertainty in the imputations. The MICE algorithm works by running multiple regression models, with each missing value modeled conditionally on the observed (non-missing) values.
Target. Drop columns based on these.
bank.describe(include = 'all').T
columns = bank.loc[:, bank.dtypes == 'object'].columns.tolist()
for cols in columns:
    print(f'Unique values for {cols} is \n{bank[cols].unique()}\n')
Categorical
job: Nominal. Includes type of job. 'blue-collar' is the most frequently occurring value in the data.
marital: Nominal. Most of the clients in the dataset are married.
education: Ordinal. Most of the clients have secondary level education.
default: Binary. Most of the clients don't have credit in default.
housing: Binary. Most of the clients have a housing loan.
loan: Binary. Most of the clients don't have a personal loan.
Numerical
age: Continuous, ratio (has true zero, technically). Whether it's discrete or continuous depends on whether ages are measured to the nearest year or not. At present, it seems to be discrete. Min age in the dataset is 18 and max is 95.
balance: Continuous, ratio. The range of average yearly balance is very wide, from -8019 euros to 102127 euros.
Categorical
contact: Nominal. Includes communication type with the client; the most frequently used communication mode is cellular.
day: Ordinal. Includes last contact day of the month.
month: Ordinal. Includes last contact month of the year.
Numerical
duration: Continuous, interval. Includes last contact duration in seconds. Min value is 0 and max value is 4918. It would be important to check whether a longer call duration leads to more subscriptions.
campaign: Discrete, interval. Min number of contacts performed during this campaign is 1, which is also the first-quartile (25%) value, and max is 63.
Categorical
poutcome: Nominal. Includes outcome of the previous marketing campaign. The most frequently occurring value is 'unknown'.
Numerical
pdays: Continuous, interval. Min number of days that passed by after the client was last contacted from a previous campaign is -1, which may be a dummy value for the cases where the client wasn't contacted, and max is 871.
previous: Discrete, ratio. Min number of contacts performed before this campaign is 0 and max is 275.
Target: Binary. The most frequently occurring value is 'no', i.e. cases where the client didn't subscribe to the term deposit.
Descriptive statistics for the numerical variables (age, balance, duration, campaign, pdays, previous)
age: Range of Q1 to Q3 is 33 to 48. Since the mean is slightly greater than the median, we can say that age is right (positively) skewed.
balance: Range of Q1 to Q3 is 72 to 1428. Since the mean is greater than the median, we can say that balance is right (positively) skewed.
duration: Range of Q1 to Q3 is 103 to 319. Since the mean is greater than the median, we can say that duration is right (positively) skewed.
campaign: Range of Q1 to Q3 is 1 to 3. Since the mean is greater than the median, we can say that campaign is right (positively) skewed.
pdays: At least 75% of data values are -1, which is a dummy value. It needs a further check without considering the -1 value.
previous: At least 75% of data values are 0, which may be a dummy value for cases where the client wasn't contacted. It needs further checks.
display(bank['Target'].value_counts(), bank['Target'].value_counts(normalize = True)*100)
Out of 45211 cases, only 5289 (11.70%) are cases where the client has subscribed to the term deposit.
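With this much imbalance, plain accuracy can look good for a model that never predicts a subscription. The following is a small arithmetic sketch using only the two counts quoted above; the variable names are illustrative and nothing here is fitted to the data.

```python
# Counts quoted in the text above
n_total = 45211
n_yes = 5289
n_no = n_total - n_yes

# A "model" that always predicts 'no'
true_negatives = n_no      # every 'no' client is predicted correctly
false_negatives = n_yes    # every 'yes' client is missed
true_positives = 0

accuracy = true_negatives / n_total
recall_yes = true_positives / (true_positives + false_negatives)

print(f"share of 'yes': {n_yes / n_total:.4f}")
print(f"accuracy of always-'no': {accuracy:.4f}")
print(f"recall on 'yes': {recall_yes:.4f}")
```

This is why the later cells report precision, recall and f1 for the subscribe class alongside accuracy.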
# Replace values in some of the categorical columns
replace_values = {'education': {'unknown': -1, 'primary': 1, 'secondary': 2, 'tertiary': 3}, 'Target': {'no': 0, 'yes': 1},
'default': {'no': 0, 'yes': 1}, 'housing': {'no': 0, 'yes': 1}, 'loan': {'no': 0, 'yes': 1},
'month': {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}}
bank = bank.replace(replace_values)
# Convert columns to categorical types
columns.extend(['day'])
for cols in columns:
    bank[cols] = bank[cols].astype('category')
# Functions that will help us with EDA plot
def odp_plots(df, col):
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'darkslategrey')
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm).set_title(f'Distribution of {col} with outliers')
    # Clipping outliers to the 1st and 99th percentiles, but in a new dataframe
    lowerbound, upperbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], lowerbound, upperbound))
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm).set_title(f'Distribution of {col} without outliers')
    kwargs = {'fontsize':14, 'color':'black'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)
    return plt.show()
def target_plot(df, col, target = 'Target'):
    fig = plt.figure(figsize = (15, 7.2))
    # Distribution for 'Target' -- didn't subscribe, considering outliers
    ax = fig.add_subplot(121)
    sns.distplot(df[(df[target] == 0)][col], color = 'c',
                 ax = ax).set_title(f'{col.capitalize()} for Term Deposit - Didn\'t subscribe')
    # Distribution for 'Target' -- subscribed, considering outliers
    ax = fig.add_subplot(122)
    sns.distplot(df[(df[target] == 1)][col], color = 'b',
                 ax = ax).set_title(f'{col.capitalize()} for Term Deposit - Subscribed')
    return plt.show()
def target_count(df, col1, col2):
    fig = plt.figure(figsize = (15, 7.2))
    ax = fig.add_subplot(121)
    sns.countplot(x = col1, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v',
                  hue = 'Target').set_title(col1.capitalize() + ' count plot by Target',
                                            fontsize = 13)
    plt.legend(labels = ['Didn\'t Subscribe', 'Subscribed'])
    plt.xticks(rotation = 90)
    ax = fig.add_subplot(122)
    sns.countplot(x = col2, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v',
                  hue = 'Target').set_title(col2.capitalize() + ' count plot by Target',
                                            fontsize = 13)
    plt.legend(labels = ['Didn\'t Subscribe', 'Subscribed'])
    plt.xticks(rotation = 90)
    return plt.show()
Looking at one feature at a time to understand how the values are distributed, checking outliers, and checking the relation of each column with the Target column (bivariate).
# Subscribe and didn't subscribe for categorical columns
target_count(bank, 'job', 'marital')
target_count(bank, 'education', 'default')
target_count(bank, 'housing', 'loan')
target_count(bank, 'contact', 'day')
target_count(bank, 'month', 'poutcome')
# Outlier, distribution for 'age' column
Q3 = bank['age'].quantile(0.75)
Q1 = bank['age'].quantile(0.25)
IQR = Q3 - Q1
print('Age column', '--'*55)
display(bank.loc[(bank['age'] < (Q1 - 1.5 * IQR)) | (bank['age'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'age')
# Distribution of 'age' by 'Target'
target_plot(bank, 'age')
# Outlier, distribution for 'balance' column
Q3 = bank['balance'].quantile(0.75)
Q1 = bank['balance'].quantile(0.25)
IQR = Q3 - Q1
print('Balance column', '--'*55)
display(bank.loc[(bank['balance'] < (Q1 - 1.5 * IQR)) | (bank['balance'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'balance')
# Distribution of 'balance' by 'Target'
target_plot(bank, 'balance')
# Outlier, distribution for 'duration' column
Q3 = bank['duration'].quantile(0.75)
Q1 = bank['duration'].quantile(0.25)
IQR = Q3 - Q1
print('Duration column', '--'*54)
display(bank.loc[(bank['duration'] < (Q1 - 1.5 * IQR)) | (bank['duration'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'duration')
# Distribution of 'duration' by 'Target'
target_plot(bank, 'duration')
# Outlier, distribution for 'campaign' column
Q3 = bank['campaign'].quantile(0.75)
Q1 = bank['campaign'].quantile(0.25)
IQR = Q3 - Q1
print('Campaign column', '--'*54)
display(bank.loc[(bank['campaign'] < (Q1 - 1.5 * IQR)) | (bank['campaign'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'campaign')
# Distribution of 'campaign' by 'Target'
target_plot(bank, 'campaign')
# Outlier, distribution for 'pdays' column
Q3 = bank['pdays'].quantile(0.75)
Q1 = bank['pdays'].quantile(0.25)
IQR = Q3 - Q1
print('Pdays column', '--'*55)
display(bank.loc[(bank['pdays'] < (Q1 - 1.5 * IQR)) | (bank['pdays'] > (Q3 + 1.5 * IQR))].head())
# Check outlier in 'pdays', not considering -1
pdays = bank.loc[bank['pdays'] > 0, ['pdays', 'Target']]
pdays = pd.DataFrame(pdays, columns = ['pdays', 'Target'])
odp_plots(pdays, 'pdays')
# Distribution of 'pdays' by 'Target', not considering -1
target_plot(pdays, 'pdays')
# Outlier, distribution and probability plot for 'previous' column
Q3 = bank['previous'].quantile(0.75)
Q1 = bank['previous'].quantile(0.25)
IQR = Q3 - Q1
print('Previous column', '--'*54)
display(bank.loc[(bank['previous'] < (Q1 - 1.5 * IQR)) | (bank['previous'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'previous')
# Distribution of 'previous' by 'Target'
target_plot(bank, 'previous')
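The cells above repeat the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule for every numeric column. As a minimal sketch, the rule can be wrapped in one helper; this version uses only the standard library, the function names are hypothetical, and the sample values are made up rather than taken from bank-full.csv.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method='inclusive' interpolates linearly between order statistics,
    # the same default convention as pandas' Series.quantile
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # made-up numbers
print(iqr_fences(sample))
print(iqr_outliers(sample))
```

The same idea would apply to a dataframe column by passing its values, which would remove the copy-pasted Q1/Q3 blocks.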
print('Categorical Columns: \n{}'.format(list(bank.select_dtypes('category').columns)))
print('\nNumerical Columns: \n{}'.format(list(bank.select_dtypes(exclude = 'category').columns)))
# Replacing outliers with NaN, using the 99th and 1st percentiles as upper and lower limits
bank_nulls = bank.copy(deep = True)
columns = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
for col in columns:
    upper_lim = np.percentile(bank_nulls[col].values, 99)
    lower_lim = np.percentile(bank_nulls[col].values, 1)
    bank_nulls.loc[(bank_nulls[col] > upper_lim), col] = np.nan
    bank_nulls.loc[(bank_nulls[col] < lower_lim), col] = np.nan
print('Columns for which outliers were removed with upper and lower percentile values: \n', columns)
# # Frequency encoding of 'job' column, this would create too many columns with sparse distribution
# columns = ['job']#, 'marital', 'contact', 'poutcome']
# for col in columns:
# counts = bank_nulls[col].value_counts().index.tolist()
# encoding = bank_nulls.groupby(col).size()
# encoding = encoding/len(bank_nulls)
# bank_nulls[col] = bank_nulls[col].map(encoding)
# print([counts, bank_nulls[col].value_counts().index.tolist()], '\n')
# pd.get_dummies
cols_to_transform = ['job', 'marital', 'contact', 'poutcome']
bank_nulls = pd.get_dummies(bank_nulls, columns = cols_to_transform) #, drop_first = True)
print('Got dummies for \n', cols_to_transform)
bank_nulls.info()
# Convert categorical columns to float to get them ready for MICE
columns = ['education', 'default', 'housing', 'loan', 'day', 'month', 'Target']
for col in columns:
    bank_nulls[col] = bank_nulls[col].astype('float')
MICE imputation of the np.nan values introduced in the earlier step
# start the MICE training
# cast to float so that dummy columns stored as bool do not produce an object array
bank_imputed = mice(bank_nulls.values.astype(float))
bank_imputed = pd.DataFrame(bank_imputed, columns = bank_nulls.columns)
display(bank.describe(include = 'all').T, bank_imputed.describe(include = 'all').T)
Column | Before MICE | After MICE |
---|---|---|
age | Range of Q1 to Q3 is 33-48. Mean > Median, right (positively) skewed | Range of Q1 to Q3 is unchanged; because of the change in min and max values there's a slight reduction in mean, right skewed |
balance | Range of Q1 to Q3 is 72-1428. Mean > Median, right (positively) skewed | Range of Q1 to Q3 is 81-1402, reduction in mean, right skewed |
duration | Range of Q1 to Q3 is 103-319. Mean > Median, right (positively) skewed | Range of Q1 to Q3 is 106-316, right skewed |
campaign | Range of Q1 to Q3 is 1-3. Mean > Median, right (positively) skewed | Unchanged range and skewness |
pdays | 75% of data values are around -1 | Unchanged |
previous | 75% of data values are around 0 | Unchanged |
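To make the chained-equations idea behind MICE more concrete, here is a deliberately small sketch with two columns and plain least squares. It is not impyute's implementation: it is deterministic, produces a single imputation rather than multiple ones, and the function names and toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def chained_impute(rows, n_iter=5):
    """rows: list of [x, y] pairs with None marking a missing value."""
    n_cols = 2
    missing = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]
    # Step 1: start from a simple mean fill
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        mean_c = sum(observed) / len(observed)
        for r in range(len(rows)):
            if missing[r][c]:
                filled[r][c] = mean_c
    # Step 2: cycle through the columns, re-predicting each column's missing
    # entries from the current values of the other column
    for _ in range(n_iter):
        for c in range(n_cols):
            other = 1 - c
            train = [r for r in range(len(rows)) if not missing[r][c]]
            a, b = fit_line([filled[r][other] for r in train],
                            [filled[r][c] for r in train])
            for r in range(len(rows)):
                if missing[r][c]:
                    filled[r][c] = a + b * filled[r][other]
    return filled

# Toy data generated from y = 2x, with one y and one x removed
data = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0], [None, 10.0]]
result = chained_impute(data)
for row in result:
    print([round(v, 2) for v in row])
```

Each missing value is modeled conditionally on the observed values of the other column, which is the behaviour described for MICE earlier in this notebook.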
# Checking whether count of 0 in previous is equal to count of -1 in pdays
display(bank_imputed.loc[bank_imputed['previous'] == 0, 'previous'].value_counts().sum(),
bank_imputed.loc[bank_imputed['pdays'] == -1, 'pdays'].value_counts().sum())
The count of 0 in previous is equal to the count of -1 in pdays, so we might replace -1 in pdays with 0 to account for cases where the client wasn't contacted previously. Checking correlation between variables and target next...
Checking relationship between two or more variables. Includes correlation and scatterplot matrix, checking relation between two variables and Target.
sns.pairplot(bank_imputed[['age', 'education', 'default', 'balance', 'housing', 'loan', 'day', 'month',
'duration', 'campaign', 'pdays', 'previous', 'Target']], hue = 'Target')
# Correlation matrix for all variables
corr = bank_imputed.corr()
mask = np.zeros_like(corr, dtype = bool)  # the np.bool alias was removed from NumPy
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, cbar_kws = {"shrink": .5})#, annot = True)
ax.set_title('Correlation Matrix of Data')
# Filter for correlation value greater than 0.8
sort = corr.abs().unstack()
sort = sort.sort_values(kind = "quicksort", ascending = False)
sort[(sort > 0.8) & (sort < 1)]
# Absolute correlation of independent variables with 'Target' i.e. the target variable
absCorrwithDep = []
allVars = bank_imputed.drop('Target', axis = 1).columns
for var in allVars:
    absCorrwithDep.append(abs(bank_imputed['Target'].corr(bank_imputed[var])))
display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
sort_values('Correlation', ascending = False))
poutcome_unknown and pdays; contact_unknown and contact_cellular; poutcome_unknown and previous; marital_married and marital_single; poutcome_unknown and poutcome_failure; pdays and poutcome_failure; previous and pdays; poutcome_failure and previous columns are correlated with each other.
duration, poutcome_success, poutcome_unknown and previous are a few of the columns which have a relatively strong correlation with the Target column.
#bank_imputed.drop(['pdays', 'contact_cellular'], axis = 1, inplace = True) #, 'previous', 'marital_married', 'poutcome_failure'
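The filtering step behind this list (keep pairs whose absolute correlation exceeds 0.8) can be sketched without pandas. The helper names and the toy columns below are hypothetical; the toy data only illustrates that two dummy columns built from one two-level category are perfectly anti-correlated.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def highly_correlated(columns, threshold=0.8):
    """columns: dict of name -> list of numbers. Returns pairs with |r| > threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Made-up toy columns, not taken from the bank data
toy = {
    'flag_a': [1, 0, 1, 1, 0, 0],
    'flag_b': [0, 1, 0, 0, 1, 1],
    'noise':  [3, 1, 4, 1, 5, 9],
}
print(highly_correlated(toy))
```

Pairs flagged this way are candidates for dropping one member, as the commented-out drop line suggests.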
# Creating age groups
bank_imputed.loc[(bank_imputed['age'] < 30), 'age_group'] = 20
bank_imputed.loc[(bank_imputed['age'] >= 30) & (bank_imputed['age'] < 40), 'age_group'] = 30
bank_imputed.loc[(bank_imputed['age'] >= 40) & (bank_imputed['age'] < 50), 'age_group'] = 40
bank_imputed.loc[(bank_imputed['age'] >= 50) & (bank_imputed['age'] < 60), 'age_group'] = 50
bank_imputed.loc[(bank_imputed['age'] >= 60), 'age_group'] = 60
# Check relationship between balance and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'balance', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between balance and age group by Target')
# Check relationship between campaign and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'campaign', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between campaign and age group by Target')
# bank_imputed.drop(['age_group'], axis = 1, inplace = True)
Created age_group and checked its relation with balance and Target. It appears that the higher the balance range, the greater the chance that the client subscribes to the term deposit, irrespective of age group. It also appears that clients in age group 50 have the widest range of balance.
Then checked the relation between campaign, age group and Target; it appears that clients in age groups 20 and 60 receive fewer campaign contacts.
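The five boundary conditions used to build age_group amount to flooring the age to its decade and clamping the result to the range 20 to 60. A compact sketch of that mapping, with a hypothetical function name:

```python
def age_group(age):
    """Map an age to the decade labels used above: 20, 30, 40, 50 or 60."""
    decade = int(age // 10) * 10
    return min(max(decade, 20), 60)   # below 30 -> 20, 60 and above -> 60

for a in (18, 29.5, 30, 47, 59.9, 60, 95):
    print(a, '->', age_group(a))
```

Non-integer ages are included in the loop because the imputed age column is stored as float.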
# Separating dependent and independent variables
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']
# Splitting the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)
dummy = DummyClassifier(strategy = 'most_frequent', random_state = random_state)
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)
accuracy_ = accuracy_score(y_test, y_pred)
pre_s = precision_score(y_test, y_pred, average = 'binary', pos_label = 1)
re_s = recall_score(y_test, y_pred, average = 'binary', pos_label = 1)
f1_s = f1_score(y_test, y_pred, average = 'binary', pos_label = 1)
pre_m = precision_score(y_test, y_pred, average = 'macro')
re_m = recall_score(y_test, y_pred, average = 'macro')
f1_m = f1_score(y_test, y_pred, average = 'macro')
print('Training Score: ', dummy.score(X_train, y_train).round(3))
print('Test Score: ', dummy.score(X_test, y_test).round(3))
print('Accuracy: ', accuracy_.round(3))
print('Precision Score - Subscribe: ', pre_s.round(3))
print('Recall Score - Subscribe: ', re_s.round(3))
print('f1 Score - Subscribe: ', f1_s.round(3))
print('Precision Score - Macro: ', pre_m.round(3))
print('Recall Score - Macro: ', re_m.round(3))
print('f1 Score - Macro: ', f1_m.round(3))
df = pd.DataFrame([accuracy_.round(3), pre_s.round(3), pre_m.round(3), re_s.round(3),
re_m.round(3), f1_s.round(3), f1_m.round(3)], columns = ['Baseline Model']).T
df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
df
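The baseline reports both 'binary' scores for the subscribe class and 'macro' averages. As a rough sketch of the difference, the functions below compute both from raw labels; the names are hypothetical, the labels are made up, and undefined ratios are set to 0 here by assumption.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_prf(y_true, y_pred):
    """Unweighted mean of the per-class scores."""
    per_class = [prf(y_true, y_pred, c) for c in sorted(set(y_true))]
    return tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))

# Made-up labels: 8 'no' (0) and 2 'yes' (1); the baseline always predicts 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
print('binary (yes):', prf(y_true, y_pred, positive=1))
print('macro       :', macro_prf(y_true, y_pred))
```

A most-frequent baseline scores zero on every subscribe metric, while the macro averages are pulled up by the majority class.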
# Helper function for making prediction and evaluating scores
def train_and_predict(n_splits, base_model, X, y, name, subscribe = 1, oversampling = False):
    features = X.columns
    X = np.array(X)
    y = np.array(y)
    folds = list(StratifiedKFold(n_splits = n_splits, shuffle = True, random_state = random_state).split(X, y))
    train_pred = np.zeros((X.shape[0], len(base_model)))
    accuracy = []
    precision_subscribe = []
    recall_subscribe = []
    f1_subscribe = []
    precision_macro = []
    recall_macro = []
    f1_macro = []
    for i, clf in enumerate(base_model):
        for j, (train, test) in enumerate(folds):
            # Creating train and test sets
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            if oversampling:
                # Oversample only the training fold; the test fold is left untouched
                sm = SMOTE(random_state = random_state, sampling_strategy = 'minority')
                # fit_resample replaces fit_sample, which was removed from imbalanced-learn
                X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
                # fit the model
                clf.fit(X_train_res, y_train_res)
                # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)
                # Evaluate train and test scores
                train_ = clf.score(X_train_res, y_train_res)
                test_ = clf.score(X_test, y_test)
            else:
                # fit the model
                clf.fit(X_train, y_train)
                # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)
                # Evaluate train and test scores
                train_ = clf.score(X_train, y_train)
                test_ = clf.score(X_test, y_test)
            # Other scores
            accuracy_ = accuracy_score(y_true, y_pred).round(3)
            precision_b = precision_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            recall_b = recall_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            f1_b = f1_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            precision_m = precision_score(y_true, y_pred, average = 'macro').round(3)
            recall_m = recall_score(y_true, y_pred, average = 'macro').round(3)
            f1_m = f1_score(y_true, y_pred, average = 'macro').round(3)
            print(f'Model- {name.capitalize()} and CV- {j}')
            print('-'*20)
            print('Training Score: {0:.3f}'.format(train_))
            print('Test Score: {0:.3f}'.format(test_))
            print('Accuracy Score: {0:.3f}'.format(accuracy_))
            print('Precision Score - Subscribe: {0:.3f}'.format(precision_b))
            print('Recall Score - Subscribe: {0:.3f}'.format(recall_b))
            print('f1 Score - Subscribe: {0:.3f}'.format(f1_b))
            print('Precision Score - Macro: {0:.3f}'.format(precision_m))
            print('Recall Score - Macro: {0:.3f}'.format(recall_m))
            print('f1 Score - Macro: {0:.3f}'.format(f1_m))
            print('\n')
            ## Appending scores
            accuracy.append(accuracy_)
            precision_subscribe.append(precision_b)
            recall_subscribe.append(recall_b)
            f1_subscribe.append(f1_b)
            precision_macro.append(precision_m)
            recall_macro.append(recall_m)
            f1_macro.append(f1_m)
    # Creating a dataframe of scores averaged over the folds
    df = pd.DataFrame([np.mean(accuracy).round(3), np.mean(precision_subscribe).round(3),
                       np.mean(precision_macro).round(3), np.mean(recall_subscribe).round(3),
                       np.mean(recall_macro).round(3), np.mean(f1_subscribe).round(3),
                       np.mean(f1_macro).round(3)], columns = [name]).T
    df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
                  'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
    return df
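The helper relies on StratifiedKFold so that every fold keeps roughly the same share of subscribers. The sketch below shows the idea in plain Python by dealing each class's indices out to the folds in turn; it is not scikit-learn's algorithm, and the function name and the 90/10 labels are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=42):
    """Return n_splits lists of test indices with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices out to the folds in turn
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [0] * 90 + [1] * 10     # made-up 90/10 imbalance
folds = stratified_folds(labels, n_splits=5)
for k, test_idx in enumerate(folds):
    positives = sum(labels[i] for i in test_idx)
    print(f'fold {k}: size={len(test_idx)}, positives={positives}')
```

Without stratification, a fold could end up with very few positives, which would make the per-fold recall and f1 scores unstable.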
# Separating dependent and independent variables
from sklearn.preprocessing import RobustScaler
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']
# Applying RobustScaler to make it less prone to outliers
features = X.columns
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns = features)
# Scaling the independent variables
Xs = X.apply(zscore)
display(X.shape, Xs.shape, y.shape)
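For reference, the two scalings used here differ in what they centre and divide by. The following standard-library sketch mirrors the default behaviour of RobustScaler (median and interquartile range) and of scipy's zscore (mean and population standard deviation); the function names and sample balances are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles

def robust_scale(values):
    """(x - median) / IQR, the idea behind RobustScaler's default settings."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

def z_scale(values):
    """(x - mean) / population standard deviation, as zscore does by default."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

balances = [0, 100, 200, 300, 400, 10000]   # made-up values with one extreme
r = robust_scale(balances)
z = z_scale(balances)
print([round(v, 2) for v in r])
print([round(v, 2) for v in z])
```

Because a z-score is unchanged by any earlier shift and positive rescaling, z-scoring the robust-scaled X gives the same values as z-scoring the raw features, so Xs is effectively plain standardization.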
Oversampling the one with better accuracy and recall score for subscribe
# LR model without hyperparameter tuning
LR = LogisticRegression()
base_model = [LR]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression Without Hyperparameter Tuning')
df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
df
# LR with hyperparameter tuning
# liblinear supports both the 'l1' and 'l2' penalties searched below
LR = LogisticRegression(solver = 'liblinear', n_jobs = -1, random_state = random_state)
params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [100, 110, 120, 130, 140]}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = random_state)
LR_hyper = GridSearchCV(LR, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
LR_hyper.fit(X_train, y_train)
print(LR_hyper.best_estimator_)
print(LR_hyper.best_params_)
# LR model with hyperparameter tuning
# The 'warn' placeholders from the printed estimator are replaced by liblinear,
# the solver they resolved to in scikit-learn 0.21
LR_Hyper = LogisticRegression(C = 100, class_weight = None, dual = False, fit_intercept = True,
                              intercept_scaling = 1, max_iter = 100,
                              n_jobs = -1, penalty = 'l2', random_state = 42,
                              solver = 'liblinear', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_Hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
# KNN Model after scaling the features without hyperparameter tuning
kNN = KNeighborsClassifier()
base_model = [kNN]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled Without Hyperparameter Tuning')
df = pd.concat([df, df1])
df
# Choosing a K Value
error_rate = {}
weights = ['uniform', 'distance']
for w in weights:
    print(w)
    rate = []
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors = i, weights = w)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        rate.append(np.mean(y_pred != y_test))
    plt.figure(figsize = (15, 7.2))
    plt.plot(range(1, 40), rate, color = 'blue', linestyle = 'dashed', marker = 'o',
             markerfacecolor = 'red', markersize = 10)
    plt.title('Error Rate vs K Value')
    plt.xlabel('K')
    plt.ylabel('Error Rate')
    plt.show()
# KNN with hyperparameter tuning
kNN = KNeighborsClassifier(n_jobs = -1)
params = {'n_neighbors': list(range(3, 40, 2)), 'weights': ['uniform', 'distance']}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)
kNN_hyper = GridSearchCV(kNN, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
kNN_hyper.fit(X_train, y_train)
print(kNN_hyper.best_estimator_)
print(kNN_hyper.best_params_)
# KNN with hyperparameter tuning
kNN_hyper = KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski', metric_params = None,
n_jobs = -1, n_neighbors = 3, p = 2, weights = 'distance')
base_model = [kNN_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
# Naive Bayes Model
NB = GaussianNB()
base_model = [NB]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes Classifier')
df = pd.concat([df, df1])
df
# Naive Bayes with oversampling
NB_over = GaussianNB()
base_model = [NB_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes, Oversampled',
oversampling = True)
df = pd.concat([df, df1])
df
# LR model with oversampling
# The 'warn' placeholders are replaced by liblinear, which supports the 'l1' penalty
LR_over = LogisticRegression(C = 1, class_weight = None, dual = False, fit_intercept = True,
                             intercept_scaling = 1, max_iter = 100,
                             n_jobs = -1, penalty = 'l1', random_state = 42,
                             solver = 'liblinear', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression, Oversampled With Hyperparameter Tuning',
oversampling = True)
df = pd.concat([df, df1])
df
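The oversampled runs use SMOTE, which creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. Below is a minimal sketch of that interpolation step only; it is not imbalanced-learn's implementation, and the function name, the default k and the points are assumptions for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]   # made-up points
points = smote_like(minority, n_new=5)
for point in points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a segment between two existing minority points, so the new rows stay inside the region the minority class already occupies.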
Decision Tree Classifier, Bagging Classifier, AdaBoost Classifier, Gradient Boosting Classifier and Random Forest Classifier. Oversampling the ones with higher accuracy and better recall for subscribe.
# Decision Tree Classifier
DT = DecisionTreeClassifier(random_state = random_state)
base_model = [DT]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier')
df = pd.concat([df, df1])
df
# Decision Tree Classifier with hyperparameter tuning
dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
base_model = [dt_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier - Reducing Max Depth')
df = pd.concat([df, df1])
df
dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
dt_hyper.fit(X, y)
decisiontree = open('decisiontree.dot','w')
dot_data = export_graphviz(dt_hyper, out_file = 'decisiontree.dot', feature_names = X.columns,
class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
decisiontree.close()
retCode = system('dot -Tpng decisiontree.dot -o decisiontree.png')
if retCode > 0:
    print('system command returning error: ' + str(retCode))
else:
    display(Image('decisiontree.png'))
print('Feature Importance for Decision Tree Classifier ', '--'*38)
feature_importances = pd.DataFrame(dt_hyper.feature_importances_, index = X.columns,
columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2))
# Bagging Classifier
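# Base estimator is passed positionally: its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
bgcl = BaggingClassifier(DecisionTreeClassifier(max_depth = 3, random_state = random_state),
                         n_estimators = 50, random_state = random_state)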
base_model = [bgcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Bagging Classifier')
df = pd.concat([df, df1])
df
# AdaBoost Classifier
abcl = AdaBoostClassifier(n_estimators = 10, random_state = random_state)
base_model = [abcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier')
df = pd.concat([df, df1])
df
# Gradient Boosting Classifier
gbcl = GradientBoostingClassifier(n_estimators = 50, random_state = random_state)
base_model = [gbcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Gradient Boosting Classifier')
df = pd.concat([df, df1])
df
abcl_over = AdaBoostClassifier(n_estimators = 15, random_state = random_state, learning_rate = 0.3)
base_model = [abcl_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier, Oversampled', oversampling = True)
df = pd.concat([df, df1])
df
# Random Forest Classifier
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
base_model = [rfc]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier')
df = pd.concat([df, df1])
df
# Random Forest Classifier with hyperparameter tuning
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
params = {'n_estimators' : [10, 20, 30, 50, 75, 100], 'max_depth': [1, 2, 3, 5, 7, 10]}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)
rfc_grid = GridSearchCV(rfc, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
rfc_grid.fit(X, y)
print(rfc_grid.best_estimator_)
print(rfc_grid.best_params_)
# Random Forest Classifier with hyperparameter tuning
rfc_hyper = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 10,
max_features = 'sqrt', max_leaf_nodes = None, min_impurity_decrease = 0.0,
min_samples_leaf = 1, min_samples_split = 2,
min_weight_fraction_leaf = 0.0, n_estimators = 20, n_jobs = -1,
oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
# Random Forest Classifier with hyperparameter tuning, Oversampled
rfc_over = RandomForestClassifier(max_depth = 10, n_estimators = 20, n_jobs = -1, random_state = 42)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y,
'Random Forest Classifier, Oversampled With Hyperparameter Tuning',
oversampling = True)
df = pd.concat([df, df1])
df
# Random Forest Classifier with hyperparameter tuning, Oversampled -- Reducing Max Depth
rfc_over = RandomForestClassifier(max_depth = 3, n_estimators = 50, n_jobs = -1, random_state = 42)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y,
'Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth',
oversampling = True)
df = pd.concat([df, df1])
df
# Refit the shallow random forest on the full data and export its first tree with Graphviz
rfc_over = RandomForestClassifier(max_depth = 3, n_estimators = 50, n_jobs = -1, random_state = 42)
rfc_over.fit(X, y)
# The forest was fit on X, so the feature names are taken from X's columns
with open('random_forest.dot', 'w') as random_forest_tree:
    export_graphviz(rfc_over.estimators_[0], out_file = random_forest_tree, feature_names = list(X.columns),
                    class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
# Render the .dot file to PNG; this needs the Graphviz `dot` binary on the PATH
retCode = system("dot -Tpng random_forest.dot -o random_forest.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("random_forest.png"))
print('Feature Importance for Random Forest Classifier ', '--'*38)
# Sort once in ascending order so the most important feature is drawn at the top of the bar chart
feature_importances = pd.DataFrame(rfc_over.feature_importances_, index = X.columns,
                                   columns = ['Importance']).sort_values('Importance', ascending = True)
feature_importances.plot(kind = 'barh', figsize = (15, 7.2))
print('Conditional Formatting on the scores dataframe ', '--'*39)
display(df.style.background_gradient(cmap = sns.light_palette('green', as_cmap = True)))
# One horizontal bar chart per score column, with the models sorted by that score
for i, types in enumerate(df.columns):
    temp = df[types]
    plt.figure(i, figsize = (15, 7.2))
    temp.sort_values(ascending = True).plot(kind = 'barh')
    plt.title(f'{types.capitalize()} Scores')
    plt.show()
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit.
Most ML models work best when the classes are in roughly equal proportion, because they are designed to maximize accuracy and reduce error and therefore do not take the class distribution into account. In our dataset, only 11.7% of the clients subscribed to the term deposit (class 'yes', i.e. 1), whereas about 88.3% did not (class 'no', i.e. 0).
Building a DummyClassifier as the baseline model gave an accuracy of 88.2%, with zero recall and precision for the minority class, i.e. the clients who subscribed to the term deposit. In such cases, performance measures such as precision, recall and f1-score are more informative than accuracy, and we can calculate these metrics for the minority (positive) class.
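To see why accuracy alone is misleading here, the following minimal sketch mimics a baseline that always predicts the majority class. The labels are synthetic (an assumption chosen to match the 11.7% / 88.3% split), not the bank dataset, and the sketch assumes the baseline uses a most-frequent strategy; the 88.2% reported above comes from the notebook's own split.

```python
# Synthetic labels (assumption): 117 subscribers out of 1000 clients, i.e. 11.7%
y_true = [1] * 117 + [0] * 883
# A majority-class baseline always predicts 'no' (0)
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)   # share of all predictions that are right
recall = true_pos / actual_pos     # share of actual subscribers that were found
print(f'accuracy = {accuracy:.3f}, recall = {recall:.3f}')
# prints: accuracy = 0.883, recall = 0.000
```

The baseline scores high on accuracy without identifying a single subscriber, which is why recall for class 1 is tracked alongside accuracy.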
The confusion matrix for class 1 (Subscribed) would look like:
| | Predicted: 0 (Not Subscribed) | Predicted: 1 (Subscribed) |
|---|---|---|
| Actual: 0 (Not Subscribed) | True Negatives | False Positives |
| Actual: 1 (Subscribed) | False Negatives | True Positives |
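As a small worked example of how the four cells translate into precision, recall and f1-score for class 1, the sketch below counts them by hand. The labels are toy values made up for illustration, not results from any model in this notebook.

```python
# Toy labels (assumption, not from the bank dataset): 1 = Subscribed, 0 = Not Subscribed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives

precision = tp / (tp + fp)  # of the predicted subscribers, how many really subscribed
recall = tp / (tp + fn)     # of the actual subscribers, how many were predicted
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp)
print(f'precision = {precision:.2f}, recall = {recall:.2f}, f1 = {f1:.2f}')
# prints: 4 2 1 3
# prints: precision = 0.60, recall = 0.75, f1 = 0.67
```

Each false negative is a subscriber the model missed, and recall is the metric that is penalized for those misses.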
In our case, recall holds more importance than precision, so we chose recall for class 1, together with accuracy, as the evaluation metrics. It is also important to check how the model behaves on the training and test scores across the cross-validation sets.
Modeling was divided into two phases. In the first phase, we applied standard models (with and without hyperparameter tuning, wherever applicable) such as Logistic Regression, k-Nearest Neighbor and Naive Bayes classifiers. In the second phase, we applied tree-based and ensemble techniques such as Decision Tree, Bagging, AdaBoost, Gradient Boosting and Random Forest classifiers. The models with higher accuracy and better recall for the subscribed class were then retrained with oversampling.
Oversampling is one of the common ways to tackle the issue of imbalanced data. It refers to various methods that aim to increase the number of instances of the underrepresented class in the data set. Out of these methods, we chose the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE’s main advantage over naive random over-sampling is that, by creating synthetic observations instead of reusing existing ones, the classifier is less likely to overfit.
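The idea behind SMOTE can be illustrated with a simplified sketch: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch`, its parameters and the toy points are hypothetical names introduced here for illustration; this is not the implementation used by `train_and_predict`, and in practice a library version such as `SMOTE` from imbalanced-learn would normally be used.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class points with two features (assumption, not the bank data)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every new point lies between two existing minority samples, the minority region is filled in rather than duplicated, which is the property the paragraph above credits for the reduced risk of overfitting.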
In the first phase (Standard machine learning models vs baseline model),
In the second phase (Ensemble models vs baseline model),