Author: Pratik Sharma

Statlog (Vehicle Silhouettes) Data Set

The purpose of the case study is to classify a given silhouette as one of four different types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information

  • compactness: (average perim)^2/area
  • circularity: (average radius)^2/area
  • distance_circularity: area/(av.distance from border)^2
  • radius_ratio: (max.rad-min.rad)/av.radius
  • pr.axis_aspect_ratio: (minor axis)/(major axis)
  • max.length_aspect_ratio: (length perp. max length)/(max length)
  • scatter_ratio: (inertia about minor axis)/(inertia about major axis)
  • elongatedness: area/(shrink width)^2
  • pr.axis_rectangularity: area/(pr.axis length*pr.axis width)
  • max.length_rectangularity: area/(max.length*length perp. to this)
  • scaled_variance: (2nd order moment about minor axis)/area along major axis
  • scaled_variance.1: (2nd order moment about major axis)/area along minor axis
  • scaled_radius_of_gyration: (mavar+mivar)/area
  • scaled_radius_of_gyration.1
  • skewness_about: (3rd order moment about major axis)/sigma_min^3 major axis
  • skewness_about.1: (3rd order moment about minor axis)/sigma_maj^3 minor axis
  • skewness_about.2
  • hollows_ratio: (area of hollows)/(area of bounding polygon)
  • class: van, car, bus

Where sigma_maj^2 is the variance along the major axis and sigma_min^2 is the variance along the minor axis, and area of hollows = area of bounding poly-area of object

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.

Learning Outcomes

  1. Data pre-processing - Understand the data and treat missing values (Use box plot), outliers
  2. Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why
  3. Use PCA from scikit learn and elbow plot to find out reduced number of dimension (which covers more than 95% of the variance)
  4. Use Naive Bayes and Support Vector Classifier. Use grid search for SVC (try C values - 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) and find out the best hyper parameters and do cross validation to find the accuracy.

Import Packages

In [1]:
# Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, gc
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
%matplotlib inline

# Models
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score, learning_curve
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Display settings
pd.options.display.max_rows = 400
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format

random_state = 42
np.random.seed(random_state)

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')

Reading the data as a dataframe and print the first five rows

In [2]:
# Reading the data as dataframe and print the first five rows
vehicle = pd.read_csv('vehicle.csv')
vehicle.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.00 83.00 178.00 72.00 10 162.00 42.00 20.00 159 176.00 379.00 184.00 70.00 6.00 16.00 187.00 197 van
1 91 41.00 84.00 141.00 57.00 9 149.00 45.00 19.00 143 170.00 330.00 158.00 72.00 9.00 14.00 189.00 199 van
2 104 50.00 106.00 209.00 66.00 10 207.00 32.00 23.00 158 223.00 635.00 220.00 73.00 14.00 9.00 188.00 196 car
3 93 41.00 82.00 159.00 63.00 9 144.00 46.00 19.00 143 160.00 309.00 127.00 63.00 6.00 10.00 199.00 207 van
4 85 44.00 70.00 205.00 103.00 52 149.00 45.00 19.00 144 241.00 325.00 188.00 127.00 9.00 11.00 180.00 183 bus
In [3]:
vehicle.columns
Out[3]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')

Get info of the dataframe columns and check missing values

In [4]:
# Get info of the dataframe columns
vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

Observation 1 - Dataset shape

Dataset has 846 rows and 19 columns with missing values in several columns. Checking those..

In [5]:
# Checking missing values in dataframe
vehicle.isnull().sum()
Out[5]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Exploratory Data Analysis

Performing exploratory data analysis on the bank dataset. Below are some of the steps performed:

  • Get descriptive statistics including five point summary
  • Comment on the types of variables in dataset and descriptive statistics observation
  • Check unique values in class columns
  • Check distribution of class column
  • Check missing values in the dataframe and impute those missing values
  • Univariate and Bivariate visualization: Looking at one feature at a time to understand how are the values are distributed, checking outliers and relation of the columns with class
  • Handle outliers using SimpleImputer
  • Multivariate visualization including correlation and scatterplot matrix. In the process identify the features to be taken further for the study

Five point summary of numerical attributes and check unique values in 'object' columns

In [6]:
# Five point summary
vehicle.describe(include = 'all').T
Out[6]:
count unique top freq mean std min 25% 50% 75% max
compactness 846.00 NaN NaN NaN 93.68 8.23 73.00 87.00 93.00 100.00 119.00
circularity 841.00 NaN NaN NaN 44.83 6.15 33.00 40.00 44.00 49.00 59.00
distance_circularity 842.00 NaN NaN NaN 82.11 15.78 40.00 70.00 80.00 98.00 112.00
radius_ratio 840.00 NaN NaN NaN 168.89 33.52 104.00 141.00 167.00 195.00 333.00
pr.axis_aspect_ratio 844.00 NaN NaN NaN 61.68 7.89 47.00 57.00 61.00 65.00 138.00
max.length_aspect_ratio 846.00 NaN NaN NaN 8.57 4.60 2.00 7.00 8.00 10.00 55.00
scatter_ratio 845.00 NaN NaN NaN 168.90 33.21 112.00 147.00 157.00 198.00 265.00
elongatedness 845.00 NaN NaN NaN 40.93 7.82 26.00 33.00 43.00 46.00 61.00
pr.axis_rectangularity 843.00 NaN NaN NaN 20.58 2.59 17.00 19.00 20.00 23.00 29.00
max.length_rectangularity 846.00 NaN NaN NaN 148.00 14.52 118.00 137.00 146.00 159.00 188.00
scaled_variance 843.00 NaN NaN NaN 188.63 31.41 130.00 167.00 179.00 217.00 320.00
scaled_variance.1 844.00 NaN NaN NaN 439.49 176.67 184.00 318.00 363.50 587.00 1018.00
scaled_radius_of_gyration 844.00 NaN NaN NaN 174.71 32.58 109.00 149.00 173.50 198.00 268.00
scaled_radius_of_gyration.1 842.00 NaN NaN NaN 72.45 7.49 59.00 67.00 71.50 75.00 135.00
skewness_about 840.00 NaN NaN NaN 6.36 4.92 0.00 2.00 6.00 9.00 22.00
skewness_about.1 845.00 NaN NaN NaN 12.60 8.94 0.00 5.00 11.00 19.00 41.00
skewness_about.2 845.00 NaN NaN NaN 188.92 6.16 176.00 184.00 188.00 193.00 206.00
hollows_ratio 846.00 NaN NaN NaN 195.63 7.44 181.00 190.25 197.00 201.00 211.00
class 846 3 car 429 NaN NaN NaN NaN NaN NaN NaN

Observation 2 - information on the type of variable

compactness, max.length_aspect_ratio, max.length_rectangularity, hollows_ratio, class has no missing values, rest all features don't have any missing values. All features are of numerical types. class is a target variable and has three unique values.

Observation 3 - Descriptive statistics for the numerical variables

Descriptive statistics for the numerical variables

  • compactness: Range of Q1 to Q3 is between 87 to 100. It appears that the column is almost normally distributed.

  • circularity: Range of Q1 to Q3 is 40 to 49. This column too appears to be almost normally distributed.

  • distance_circularity: Range of Q1 to Q3 is 70 to 98. Mean is slightly greater than median, we can say that the column is slightly skewed towards right.

  • radius_ratio: Range of Q1 to Q3 is 141 to 195. Mean is slightly greater than median, we can say that the column is slightly skewed towards right.

  • pr.axis_aspect_ratio: Range of Q1 to Q3 is 57 to 65. Mean is slightly greater than median, we can say that the column is slightly skewed towards right.

  • max.length_aspect_ratio: Range of Q1 to Q3 is 7 to 10. Mean is slightly greater than median, we can say that the column is slightly skewed towards right.

  • scatter_ratio: Range of Q1 to Q3 is 147 to 198. Mean is greater than median, we can say that the column is skewed towards right.

  • elongatedness: Range of Q1 to Q3 is 33 to 46. Mean is less than median, we can say that the column is skewed towards left.

  • pr.axis_rectangularity: Range of Q1 to Q3 is 19 to 23. Mean is greater than median, we can say that the column is skewed towards right.

  • max.length_rectangularity: Range of Q1 to Q3 is 137 to 159. Mean is greater than median, we can say that the column is skewed towards right.

  • scaled_variance: Range of Q1 to Q3 is 167 to 217. Mean is grater than median, we can say that the column is skewed towards right.

  • scaled_variance.1: Range of Q1 to Q3 is 318 to 587. Mean is greater than median, we can say that the column is skewed towards right.

  • scaled_radius_of_gyration: Range of Q1 to Q3 is 149 to 198. Mean is greater than median, we can say that the column is skewed towards right.

  • scaled_radius_of_gyration.1: Range of Q1 to Q3 is 67 to 75. Mean is greater than median, we can say that the column is skewed towards right.

  • skewness_about: Range of Q1 to Q3 is 2 to 6. Mean is greater than median, skewed towards right.

  • skewness_about.1: Range of Q1 to Q3 is 5 to 19. Mean is greater than median, skewed towards right.

  • skewness_about.2: Range of Q1 to Q3 is 184 to 193. Mean is slightly greater than median, almost normally distributed.

  • hollows_ratio: Range of Q1 to Q3 is 197 to 211. Mean is less than median, skewed towards left.

In [7]:
columns = vehicle.loc[:, vehicle.dtypes == 'object'].columns.tolist()
for cols in columns:
    print(f'Unique values for {cols} is \n{vehicle[cols].unique()}\n')
del cols, columns
Unique values for class is 
['van' 'car' 'bus']

Checking the distribution of class variable

In [8]:
display(vehicle['class'].value_counts(), vehicle['class'].value_counts(normalize = True)*100)
car    429
bus    218
van    199
Name: class, dtype: int64
car   50.71
bus   25.77
van   23.52
Name: class, dtype: float64
In [9]:
replace_struc = {'car': 3, 'bus': 2, 'van': 1}
vehicle['class'] = vehicle['class'].map(replace_struc)
del replace_struc

Observation 4 - Distribution of class variable

car represents about 50.7% of the total values in class variable, bus about 25.8% and van about 23.5%.

In [10]:
# Check missing values in the dataframe
vehicle.isnull().sum()
Out[10]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [11]:
null_columns = vehicle.columns[vehicle.isnull().any()]; columns = list(vehicle.columns)

print('Descriptive Stats before imputation for columns with missing values: \n', '--'*30)
display(vehicle[null_columns].describe().T)

# Using SimpleImputer to fill missing values by median
impute = SimpleImputer(missing_values = np.nan, strategy = 'median', verbose = 1)
vehicle = pd.DataFrame(impute.fit_transform(vehicle), columns = columns)

print('Descriptive Stats after imputation: \n', '--'*30)
display(vehicle[null_columns].describe().T)

del null_columns
Descriptive Stats before imputation for columns with missing values: 
 ------------------------------------------------------------
count mean std min 25% 50% 75% max
circularity 841.00 44.83 6.15 33.00 40.00 44.00 49.00 59.00
distance_circularity 842.00 82.11 15.78 40.00 70.00 80.00 98.00 112.00
radius_ratio 840.00 168.89 33.52 104.00 141.00 167.00 195.00 333.00
pr.axis_aspect_ratio 844.00 61.68 7.89 47.00 57.00 61.00 65.00 138.00
scatter_ratio 845.00 168.90 33.21 112.00 147.00 157.00 198.00 265.00
elongatedness 845.00 40.93 7.82 26.00 33.00 43.00 46.00 61.00
pr.axis_rectangularity 843.00 20.58 2.59 17.00 19.00 20.00 23.00 29.00
scaled_variance 843.00 188.63 31.41 130.00 167.00 179.00 217.00 320.00
scaled_variance.1 844.00 439.49 176.67 184.00 318.00 363.50 587.00 1018.00
scaled_radius_of_gyration 844.00 174.71 32.58 109.00 149.00 173.50 198.00 268.00
scaled_radius_of_gyration.1 842.00 72.45 7.49 59.00 67.00 71.50 75.00 135.00
skewness_about 840.00 6.36 4.92 0.00 2.00 6.00 9.00 22.00
skewness_about.1 845.00 12.60 8.94 0.00 5.00 11.00 19.00 41.00
skewness_about.2 845.00 188.92 6.16 176.00 184.00 188.00 193.00 206.00
Descriptive Stats after imputation: 
 ------------------------------------------------------------
count mean std min 25% 50% 75% max
circularity 846.00 44.82 6.13 33.00 40.00 44.00 49.00 59.00
distance_circularity 846.00 82.10 15.74 40.00 70.00 80.00 98.00 112.00
radius_ratio 846.00 168.87 33.40 104.00 141.00 167.00 195.00 333.00
pr.axis_aspect_ratio 846.00 61.68 7.88 47.00 57.00 61.00 65.00 138.00
scatter_ratio 846.00 168.89 33.20 112.00 147.00 157.00 198.00 265.00
elongatedness 846.00 40.94 7.81 26.00 33.00 43.00 46.00 61.00
pr.axis_rectangularity 846.00 20.58 2.59 17.00 19.00 20.00 23.00 29.00
scaled_variance 846.00 188.60 31.36 130.00 167.00 179.00 217.00 320.00
scaled_variance.1 846.00 439.31 176.50 184.00 318.25 363.50 586.75 1018.00
scaled_radius_of_gyration 846.00 174.71 32.55 109.00 149.00 173.50 198.00 268.00
scaled_radius_of_gyration.1 846.00 72.44 7.47 59.00 67.00 71.50 75.00 135.00
skewness_about 846.00 6.36 4.90 0.00 2.00 6.00 9.00 22.00
skewness_about.1 846.00 12.60 8.93 0.00 5.00 11.00 19.00 41.00
skewness_about.2 846.00 188.92 6.15 176.00 184.00 188.00 193.00 206.00

Observation 5 - After imputation

A quick observation after imputating the missing values: medians remain unchanged while mean changes slightly not significantly. Type of skewness remain unchanged.

In [12]:
# Functions that will help us with EDA plot
def odp_plots(df, col):
    f,(ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'darkslategrey')
    
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm).set_title(f'{col} with outliers')
    
    # Removing outliers, but in a new dataframe
    upperbound, lowerbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], upperbound, lowerbound))
    
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm).set_title(f'{col} without outliers')
    
    kwargs = {'fontsize':14, 'color':'black'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)

    return plt.show()

# function for ploting distribution of variables with target
def target_plot(df, col1, col2, col3, target = 'class'):
    f,(ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    f.suptitle(f'Distribution for Car, Bus, Van for {col1.capitalize()}, {col2.capitalize()}, {col3.capitalize()}', 
               fontsize = 14)
    # Distribution for col1 considering outliers
    sns.distplot(df[(df[target] == 3)][col1], color = 'c', ax = ax1, hist = False, 
                 label = 'Car').set_title(f'{col1.capitalize()}')
    sns.distplot(df[(df[target] == 2)][col1], color = 'b', ax = ax1, hist = False, 
                 label = 'Bus').set_title(f'{col1.capitalize()}')
    sns.distplot(df[(df[target] == 1)][col1], color = 'm', ax = ax1, hist = False, 
                 label = 'Van').set_title(f'{col1.capitalize()}')

    # Distribution for col2 considering outliers
    sns.distplot(df[(df[target] == 3)][col2], color = 'c', ax = ax2, hist = False, 
                 label = 'Car').set_title(f'{col2.capitalize()}')
    sns.distplot(df[(df[target] == 2)][col2], color = 'b', ax = ax2, hist = False, 
                 label = 'Bus').set_title(f'{col2.capitalize()}')
    sns.distplot(df[(df[target] == 1)][col2], color = 'm', ax = ax2, hist = False, 
                 label = 'Van').set_title(f'{col2.capitalize()}')

    # Distribution for col3 considering outliers
    sns.distplot(df[(df[target] == 3)][col3], color = 'c', ax = ax3, hist = False, 
                 label = 'Car').set_title(f'{col3.capitalize()}')
    sns.distplot(df[(df[target] == 2)][col3], color = 'b', ax = ax3, hist = False, 
                 label = 'Bus').set_title(f'{col3.capitalize()}')
    sns.distplot(df[(df[target] == 1)][col3], color = 'm', ax = ax3, hist = False, 
                 label = 'Van').set_title(f'{col3.capitalize()}')
    return plt.show()

# Correlation matrix for all variables
def correlation_matrix(df, threshold = 0.8):
    corr = df.corr()
    mask = np.zeros_like(corr, dtype = np.bool)
    mask[np.triu_indices_from(mask)] = True
    f, ax = plt.subplots(figsize = (15, 7.2))
    cmap = sns.diverging_palette(220, 10, as_cmap = True)
    sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, cbar_kws = {"shrink": .5})#, annot = True)
    ax.set_title('Correlation Matrix of Data')

    # Filter for correlation value greater than threshold
    sort = corr.abs().unstack()
    sort = sort.sort_values(kind = "quicksort", ascending = False)
    display(sort[(sort > threshold) & (sort < 1)])

# Helper function for PCA plots
def pca_plots(df, col1, col2, xlabel, ylabel, ax):
    ax.set_xlabel(xlabel); ax.set_ylabel(ylabel)
    ax.set_title(f'{xlabel} vs {ylabel}', fontsize = 14)
    targets = [3, 2, 1]; colors = ['r', 'g', 'b']
    for target, color in zip(targets, colors):
        indicesToKeep = df['class'] == target
        ax.scatter(df.loc[indicesToKeep, col1], df.loc[indicesToKeep, col2], c = color, s = 50)
    ax.legend(targets)

# Helper function to plot learning curve
def plot_learning_curve(estimator, X, y,  ax, ylim = None, cv = None, n_jobs = 1,
                        train_sizes = np.linspace(.1, 1.0, 5), name = 'Naive Bayes \n Principal Compoents Learning Curve'):    
    if ylim is not None:
        plt.ylim(*ylim)
        
    # First Estimator
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv = cv, n_jobs = n_jobs, 
                                                            train_sizes = train_sizes)
    train_scores_mean = np.mean(train_scores, axis = 1)
    train_scores_std = np.std(train_scores, axis = 1)
    test_scores_mean = np.mean(test_scores, axis = 1)
    test_scores_std = np.std(test_scores, axis = 1)

    ax.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                     alpha = 0.1, color = '#ff9124')
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                     alpha = 0.1, color = '#2492ff')

    ax.plot(train_sizes, train_scores_mean, 'o-', color = '#ff9124', label = 'Training score')
    ax.plot(train_sizes, test_scores_mean, 'o-', color ='#2492ff', label = 'Cross-validation score')
    ax.set_title(name, fontsize = 14)
    ax.set_xlabel('Training size')
    ax.set_ylabel('Score')
    ax.grid(True)
    ax.legend(loc = 'best')

Univariate and Bivariate Visualization

Looking at one feature at a time to understand how are the values distributed, checking outliers, checking relation of the column with class column (bi).

In [13]:
# A quick check to find columns that contain outliers
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = vehicle.iloc[:, 0:18], orient = 'h')
In [14]:
# Outlier, distribution for columns with outliers
boxplotcolumns = ['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1',
                  'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']
for cols in boxplotcolumns:
    Q3 = vehicle[cols].quantile(0.75)
    Q1 = vehicle[cols].quantile(0.25)
    IQR = Q3 - Q1

    print(f'{cols.capitalize()} column', '--'*40)
    display(len(vehicle.loc[(vehicle[cols] < (Q1 - 1.5 * IQR)) | (vehicle[cols] > (Q3 + 1.5 * IQR))]), 
            vehicle.loc[(vehicle[cols] < (Q1 - 1.5 * IQR)) | (vehicle[cols] > (Q3 + 1.5 * IQR))].head())
    odp_plots(vehicle, cols)

del cols, IQR, boxplotcolumns
Radius_ratio column --------------------------------------------------------------------------------
3
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
37 90.00 48.00 86.00 306.00 126.00 49.00 153.00 44.00 19.00 156.00 272.00 346.00 200.00 118.00 0.00 15.00 185.00 194.00 1.00
135 89.00 47.00 83.00 322.00 133.00 48.00 158.00 43.00 20.00 163.00 229.00 364.00 176.00 97.00 0.00 14.00 184.00 194.00 1.00
388 94.00 47.00 85.00 333.00 138.00 49.00 155.00 43.00 19.00 155.00 320.00 354.00 187.00 135.00 12.00 9.00 188.00 196.00 1.00
8
Pr.axis_aspect_ratio column --------------------------------------------------------------------------------
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
4 85.00 44.00 70.00 205.00 103.00 52.00 149.00 45.00 19.00 144.00 241.00 325.00 188.00 127.00 9.00 11.00 180.00 183.00 2.00
37 90.00 48.00 86.00 306.00 126.00 49.00 153.00 44.00 19.00 156.00 272.00 346.00 200.00 118.00 0.00 15.00 185.00 194.00 1.00
100 82.00 45.00 66.00 252.00 126.00 52.00 148.00 45.00 19.00 144.00 237.00 326.00 185.00 119.00 1.00 1.00 181.00 185.00 2.00
135 89.00 47.00 83.00 322.00 133.00 48.00 158.00 43.00 20.00 163.00 229.00 364.00 176.00 97.00 0.00 14.00 184.00 194.00 1.00
291 89.00 45.00 81.00 246.00 102.00 43.00 155.00 44.00 20.00 160.00 200.00 347.00 177.00 90.00 9.00 17.00 183.00 192.00 1.00
Max.length_aspect_ratio column --------------------------------------------------------------------------------
13
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
4 85.00 44.00 70.00 205.00 103.00 52.00 149.00 45.00 19.00 144.00 241.00 325.00 188.00 127.00 9.00 11.00 180.00 183.00 2.00
37 90.00 48.00 86.00 306.00 126.00 49.00 153.00 44.00 19.00 156.00 272.00 346.00 200.00 118.00 0.00 15.00 185.00 194.00 1.00
100 82.00 45.00 66.00 252.00 126.00 52.00 148.00 45.00 19.00 144.00 237.00 326.00 185.00 119.00 1.00 1.00 181.00 185.00 2.00
127 85.00 41.00 66.00 155.00 65.00 22.00 149.00 45.00 19.00 139.00 173.00 330.00 155.00 75.00 6.00 16.00 184.00 191.00 2.00
135 89.00 47.00 83.00 322.00 133.00 48.00 158.00 43.00 20.00 163.00 229.00 364.00 176.00 97.00 0.00 14.00 184.00 194.00 1.00
Scaled_variance column --------------------------------------------------------------------------------
1
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
388 94.00 47.00 85.00 333.00 138.00 49.00 155.00 43.00 19.00 155.00 320.00 354.00 187.00 135.00 12.00 9.00 188.00 196.00 1.00
Scaled_variance.1 column --------------------------------------------------------------------------------
2
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
85 110.00 58.00 106.00 180.00 51.00 6.00 261.00 26.00 28.00 171.00 278.00 998.00 257.00 83.00 9.00 13.00 181.00 182.00 2.00
835 111.00 58.00 105.00 183.00 51.00 6.00 265.00 26.00 29.00 174.00 285.00 1018.00 255.00 85.00 4.00 8.00 181.00 183.00 2.00
Scaled_radius_of_gyration.1 column --------------------------------------------------------------------------------
15
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
4 85.00 44.00 70.00 205.00 103.00 52.00 149.00 45.00 19.00 144.00 241.00 325.00 188.00 127.00 9.00 11.00 180.00 183.00 2.00
37 90.00 48.00 86.00 306.00 126.00 49.00 153.00 44.00 19.00 156.00 272.00 346.00 200.00 118.00 0.00 15.00 185.00 194.00 1.00
47 85.00 42.00 66.00 122.00 54.00 6.00 148.00 46.00 19.00 141.00 172.00 317.00 174.00 88.00 6.00 14.00 180.00 182.00 2.00
79 89.00 44.00 68.00 113.00 50.00 7.00 150.00 45.00 19.00 147.00 171.00 328.00 189.00 88.00 6.00 5.00 179.00 182.00 2.00
100 82.00 45.00 66.00 252.00 126.00 52.00 148.00 45.00 19.00 144.00 237.00 326.00 185.00 119.00 1.00 1.00 181.00 185.00 2.00
Skewness_about column --------------------------------------------------------------------------------
12
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
44 119.00 54.00 106.00 220.00 65.00 12.00 213.00 31.00 24.00 167.00 223.00 675.00 232.00 66.00 20.00 1.00 192.00 202.00 3.00
113 88.00 35.00 50.00 121.00 58.00 5.00 114.00 59.00 17.00 122.00 132.00 192.00 138.00 74.00 21.00 4.00 182.00 187.00 3.00
123 90.00 36.00 57.00 130.00 57.00 6.00 121.00 56.00 17.00 127.00 137.00 216.00 132.00 68.00 22.00 23.00 190.00 195.00 3.00
190 97.00 48.00 94.00 198.00 63.00 9.00 181.00 36.00 21.00 155.00 200.00 494.00 189.00 64.00 20.00 11.00 199.00 203.00 3.00
346 117.00 52.00 110.00 228.00 65.00 12.00 212.00 31.00 24.00 163.00 228.00 668.00 220.00 66.00 21.00 25.00 194.00 205.00 3.00
Skewness_about.1 column --------------------------------------------------------------------------------
1
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
132 97.00 42.00 101.00 186.00 59.00 9.00 186.00 36.00 22.00 138.00 208.00 511.00 168.00 67.00 7.00 41.00 194.00 206.00 3.00
In [15]:
# Distribution of col by target variable
target_plot(vehicle, 'compactness', 'circularity', 'distance_circularity')
target_plot(vehicle, 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio')
target_plot(vehicle, 'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity')
target_plot(vehicle, 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1')
target_plot(vehicle, 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1', 'skewness_about')
target_plot(vehicle, 'skewness_about.1', 'skewness_about.2', 'hollows_ratio')

Observation 6 - Comments after checking outliers and distribution

  • Used box plot for the features of the vehicle dataframe. Found that radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about, skewness_about.1 columns has outliers.
  • Used quantile method to check outliers in these column. It appears that scaled_radius_of_gyration.1 has maximum outliers around 15 of them, which represents about 1.77% of number of rows.
  • It appears that removing outliers below 25% percentile and above 75% percentile will bring most of the columns to almost normal distribution. That would mean losing around 1.77% of the data.

Let's not to that, rather replace these outliers with null values and then replace those nulls with median values to avoid losing data.

Handling outliers using SimpleImputer

In [16]:
# Removing outliers with upper and lower percentile values being 75 and 25, respectively
vehicle_im = vehicle.copy(deep = True)
outliers_cols = ['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1',
           'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']

for col in outliers_cols:
    Q3 = vehicle_im[col].quantile(0.75)
    Q1 = vehicle_im[col].quantile(0.25)
    IQR = Q3 - Q1
    upper_lim = Q3 + (1.5 * IQR)
    lower_lim = Q1 - (1.5 * IQR)
    vehicle_im.loc[(vehicle_im[col] < lower_lim) | (vehicle_im[col] > upper_lim), col] = np.nan

print('Column for which outliers where removed with upper and lower percentile values: \n', outliers_cols)
Column for which outliers where removed with upper and lower percentile values: 
 ['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']
In [17]:
vehicle_im.isnull().sum()
Out[17]:
compactness                     0
circularity                     0
distance_circularity            0
radius_ratio                    3
pr.axis_aspect_ratio            8
max.length_aspect_ratio        13
scatter_ratio                   0
elongatedness                   0
pr.axis_rectangularity          0
max.length_rectangularity       0
scaled_variance                 1
scaled_variance.1               2
scaled_radius_of_gyration       0
scaled_radius_of_gyration.1    15
skewness_about                 12
skewness_about.1                1
skewness_about.2                0
hollows_ratio                   0
class                           0
dtype: int64
In [18]:
# Using SimpleImputer to fill missing values by median
print('Descriptive Stats before handling outliers: \n', '--'*30)
display(vehicle[outliers_cols].describe().T)

columns = list(vehicle_im.columns)
impute = SimpleImputer(missing_values = np.nan, strategy = 'median', verbose = 1)
vehicle_im = pd.DataFrame(impute.fit_transform(vehicle_im), columns = columns)

print('Descriptive Stats after handling outliers: \n', '--'*30)
display(vehicle_im[outliers_cols].describe().T)
del outliers_cols, vehicle
Descriptive Stats before handling outliers: 
 ------------------------------------------------------------
count mean std min 25% 50% 75% max
radius_ratio 846.00 168.87 33.40 104.00 141.00 167.00 195.00 333.00
pr.axis_aspect_ratio 846.00 61.68 7.88 47.00 57.00 61.00 65.00 138.00
max.length_aspect_ratio 846.00 8.57 4.60 2.00 7.00 8.00 10.00 55.00
scaled_variance 846.00 188.60 31.36 130.00 167.00 179.00 217.00 320.00
scaled_variance.1 846.00 439.31 176.50 184.00 318.25 363.50 586.75 1018.00
scaled_radius_of_gyration.1 846.00 72.44 7.47 59.00 67.00 71.50 75.00 135.00
skewness_about 846.00 6.36 4.90 0.00 2.00 6.00 9.00 22.00
skewness_about.1 846.00 12.60 8.93 0.00 5.00 11.00 19.00 41.00
Descriptive Stats after handling outliers: 
 ------------------------------------------------------------
count mean std min 25% 50% 75% max
radius_ratio 846.00 168.33 32.15 104.00 141.00 167.00 194.75 252.00
pr.axis_aspect_ratio 846.00 61.15 5.61 47.00 57.00 61.00 65.00 76.00
max.length_aspect_ratio 846.00 8.12 2.06 3.00 7.00 8.00 10.00 13.00
scaled_variance 846.00 188.43 31.03 130.00 167.00 179.00 216.75 288.00
scaled_variance.1 846.00 437.79 174.35 184.00 318.25 363.25 586.00 987.00
scaled_radius_of_gyration.1 846.00 71.93 6.16 59.00 67.00 71.00 75.00 87.00
skewness_about 846.00 6.13 4.57 0.00 2.00 5.00 9.00 19.00
skewness_about.1 846.00 12.57 8.88 0.00 5.00 11.00 19.00 40.00
In [19]:
# A quick check to find columns that contain outliers
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = vehicle_im.iloc[:, 0:18], orient = 'h')

Observation 7 - Comments after checking outliers and distribution

  • Instead of removing the outliers, which might have resulted in loss of data, we first replaced the outliers (using IQR method) with nulls and then used SimpleImputer to replace those nulls with median values.
  • Again, there's least effect on means and no effect on (ofcourse) median.

Multivariate Visualization

Checking relationship between two or more variables. Includes correlation and scatterplot matrix, checking relation between two variables and target variable.

In [20]:
# Correlation matrix for all variables
correlation_matrix(vehicle_im, threshold = 0.8)
scatter_ratio                pr.axis_rectangularity        0.99
pr.axis_rectangularity       scatter_ratio                 0.99
scaled_variance.1            scatter_ratio                 0.98
scatter_ratio                scaled_variance.1             0.98
pr.axis_rectangularity       scaled_variance.1             0.97
scaled_variance.1            pr.axis_rectangularity        0.97
scatter_ratio                elongatedness                 0.97
elongatedness                scatter_ratio                 0.97
max.length_rectangularity    circularity                   0.96
circularity                  max.length_rectangularity     0.96
scaled_variance              scatter_ratio                 0.96
scatter_ratio                scaled_variance               0.96
elongatedness                pr.axis_rectangularity        0.95
pr.axis_rectangularity       elongatedness                 0.95
elongatedness                scaled_variance.1             0.95
scaled_variance.1            elongatedness                 0.95
scaled_variance              elongatedness                 0.95
elongatedness                scaled_variance               0.95
pr.axis_rectangularity       scaled_variance               0.95
scaled_variance              pr.axis_rectangularity        0.95
                             scaled_variance.1             0.94
scaled_variance.1            scaled_variance               0.94
circularity                  scaled_radius_of_gyration     0.93
scaled_radius_of_gyration    circularity                   0.93
distance_circularity         elongatedness                 0.91
elongatedness                distance_circularity          0.91
scatter_ratio                distance_circularity          0.91
distance_circularity         scatter_ratio                 0.91
scaled_radius_of_gyration.1  hollows_ratio                 0.90
hollows_ratio                scaled_radius_of_gyration.1   0.90
pr.axis_rectangularity       distance_circularity          0.89
distance_circularity         pr.axis_rectangularity        0.89
skewness_about.2             hollows_ratio                 0.89
hollows_ratio                skewness_about.2              0.89
scaled_variance.1            distance_circularity          0.88
distance_circularity         scaled_variance.1             0.88
                             scaled_variance               0.87
scaled_variance              distance_circularity          0.87
max.length_rectangularity    scaled_radius_of_gyration     0.87
scaled_radius_of_gyration    max.length_rectangularity     0.87
scatter_ratio                circularity                   0.85
circularity                  scatter_ratio                 0.85
pr.axis_rectangularity       circularity                   0.84
circularity                  pr.axis_rectangularity        0.84
skewness_about.2             scaled_radius_of_gyration.1   0.83
scaled_radius_of_gyration.1  skewness_about.2              0.83
circularity                  scaled_variance.1             0.83
scaled_variance.1            circularity                   0.83
elongatedness                radius_ratio                  0.83
radius_ratio                 elongatedness                 0.83
elongatedness                circularity                   0.82
circularity                  elongatedness                 0.82
pr.axis_rectangularity       compactness                   0.81
compactness                  pr.axis_rectangularity        0.81
scatter_ratio                compactness                   0.81
compactness                  scatter_ratio                 0.81
pr.axis_rectangularity       max.length_rectangularity     0.81
max.length_rectangularity    pr.axis_rectangularity        0.81
                             scatter_ratio                 0.81
scatter_ratio                max.length_rectangularity     0.81
scaled_variance.1            compactness                   0.81
compactness                  scaled_variance.1             0.81
circularity                  scaled_variance               0.80
scaled_variance              circularity                   0.80
dtype: float64
In [21]:
# Absolute correlation of independent variables with the target variable
absCorrwithDep = []
allVars = vehicle_im.drop('class', axis = 1).columns

for var in allVars:
    absCorrwithDep.append(abs(vehicle_im['class'].corr(vehicle_im[var])))

display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
        sort_values('Correlation', ascending = False))
Variable Correlation
7 elongatedness 0.48
6 scatter_ratio 0.46
11 scaled_variance.1 0.45
8 pr.axis_rectangularity 0.44
3 radius_ratio 0.44
2 distance_circularity 0.43
10 scaled_variance 0.42
0 compactness 0.30
15 skewness_about.1 0.27
1 circularity 0.25
12 scaled_radius_of_gyration 0.25
13 scaled_radius_of_gyration.1 0.18
5 max.length_aspect_ratio 0.17
17 hollows_ratio 0.14
9 max.length_rectangularity 0.14
4 pr.axis_aspect_ratio 0.06
14 skewness_about 0.06
16 skewness_about.2 0.05

Observation 8 - Correlation matrix

  • scatter_ratio and pr.axis_rectangularity; scaled_variance.1 and scatter_ratio; pr.axis_rectangularity and scaled_variance.1; pr.axis_rectangularity and scaled_variance.1; elongatedness and scatter_ratio; circularity and max.length_rectangularity; scaled_variance and scatter_ratio; elongatedness and pr.axis_rectangularity; elongatedness and scaled_variance.1; elongatedness and scaled_variance; pr.axis_rectangularity, scaled_variance and scaled_variance.1; distance_circularity and elongatedness; circularity and scaled_radius_of_gyration; distance_circularity and elongatedness; scatter_ratio and distance_circularity are correlated with each other with a correlation coeff greater than 0.9.
  • elongatedness, scatter_ratio, scaled_variance.1, pr.axis_rectangularity and radius_ratio are some columns which have relatively strong correlation with the class variable.
  • Though multicollinearity exists between columns, some of those have a strong influence on the target variable as well. For now lets remove max.length_rectangularity, scaled_variance, scaled_radius_of_gyration, distance_circularity, hollows_ratio and skewness_about.2
In [22]:
vehicle_im.drop(['max.length_rectangularity', 'scaled_variance', 'scaled_radius_of_gyration', 
                 'distance_circularity', 'hollows_ratio', 'skewness_about.2'], axis = 1, inplace = True)

Let's plot pairplot for columns that have a relatively strong correlation with class variable...

In [23]:
print('Indeed highly correlated variables', '--'*30)
sns.pairplot(vehicle_im[['elongatedness', 'scatter_ratio', 'scaled_variance.1', 
             'pr.axis_rectangularity', 'radius_ratio', 'compactness', 'class']], hue = 'class')
Indeed highly correlated variables ------------------------------------------------------------
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1be9c108148>
In [24]:
vehicle_im.shape
Out[24]:
(846, 13)
In [25]:
# Creating separate variables for features and target
features = vehicle_im.iloc[:, 0:12]; features_list = list(features.columns)
target = vehicle_im['class']

features.shape, target.shape
Out[25]:
((846, 12), (846,))
In [26]:
features.head()
Out[26]:
compactness circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity scaled_variance.1 scaled_radius_of_gyration.1 skewness_about skewness_about.1
0 95.00 48.00 178.00 72.00 10.00 162.00 42.00 20.00 379.00 70.00 6.00 16.00
1 91.00 41.00 141.00 57.00 9.00 149.00 45.00 19.00 330.00 72.00 9.00 14.00
2 104.00 50.00 209.00 66.00 10.00 207.00 32.00 23.00 635.00 73.00 14.00 9.00
3 93.00 41.00 159.00 63.00 9.00 144.00 46.00 19.00 309.00 63.00 6.00 10.00
4 85.00 44.00 205.00 61.00 8.00 149.00 45.00 19.00 325.00 71.00 9.00 11.00

PCA

Steps performed:

  1. As mentioned here and steps taken here, to avoid leakage of data, let's first split the data into train and test set before scaling and performing rest of the PCA steps.
  2. Creating a covariance matrix for identifying Principal components
  3. Identify eigen values and eigen vector
  4. Finding variance and cumulative variance by each eigen vector
  5. Use PCA from sklearn and find Principal Components. Transform data to components formed.
In [27]:
# Splitting the data into training and test set in the ratio of 85:25 respectively
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.25, 
                                                    random_state = random_state, shuffle = False)
In [28]:
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)
y_train = y_train.values
y_test = y_test.values

display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(634, 12)
(634,)
(212, 12)
(212,)
In [29]:
# Covariance matrix
cov_matrix = np.cov(X_train.T)
print('Covariance Matrix \n%s', cov_matrix)
Covariance Matrix 
%s [[ 1.00157978  0.6927466   0.72862246  0.19898718  0.49505746  0.81440645
  -0.78865303  0.81671979  0.81013144 -0.23968157  0.19300214  0.15222827]
 [ 0.6927466   1.00157978  0.64967972  0.21559846  0.56170726  0.8488418
  -0.82596821  0.84312815  0.83012816  0.06020995  0.11617679 -0.01374727]
 [ 0.72862246  0.64967972  1.00157978  0.6458051   0.46182824  0.77614113
  -0.82845233  0.75248537  0.76394977 -0.38576407  0.03171845  0.19228994]
 [ 0.19898718  0.21559846  0.6458051   1.00157978  0.13974833  0.19695357
  -0.29410931  0.16953522  0.1946448  -0.3107984  -0.05839069 -0.02595801]
 [ 0.49505746  0.56170726  0.46182824  0.13974833  1.00157978  0.49596454
  -0.50931992  0.49250771  0.46287865 -0.32635155  0.08945289  0.15872201]
 [ 0.81440645  0.8488418   0.77614113  0.19695357  0.49596454  1.00157978
  -0.97322484  0.99049017  0.98555786  0.00793151  0.04716969  0.2125568 ]
 [-0.78865303 -0.82596821 -0.82845233 -0.29410931 -0.50931992 -0.97322484
   1.00157978 -0.95075667 -0.951896    0.078029   -0.02942302 -0.18614481]
 [ 0.81671979  0.84312815  0.75248537  0.16953522  0.49250771  0.99049017
  -0.95075667  1.00157978  0.97995967  0.02439504  0.05782451  0.21258193]
 [ 0.81013144  0.83012816  0.76394977  0.1946448   0.46287865  0.98555786
  -0.951896    0.97995967  1.00157978  0.01380612  0.04629087  0.19962112]
 [-0.23968157  0.06020995 -0.38576407 -0.3107984  -0.32635155  0.00793151
   0.078029    0.02439504  0.01380612  1.00157978 -0.08266412 -0.1630561 ]
 [ 0.19300214  0.11617679  0.03171845 -0.05839069  0.08945289  0.04716969
  -0.02942302  0.05782451  0.04629087 -0.08266412  1.00157978 -0.03768531]
 [ 0.15222827 -0.01374727  0.19228994 -0.02595801  0.15872201  0.2125568
  -0.18614481  0.21258193  0.19962112 -0.1630561  -0.03768531  1.00157978]]
In [30]:
# Eigen values and vector
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s', eig_vecs)
print('\n Eigen Values \n%s', eig_vals)
Eigen Vectors 
%s [[-3.39450715e-01  2.36357770e-02 -1.63175268e-01 -6.33209676e-02
  -1.45937709e-02  3.85496746e-01  8.02007929e-01  2.06839187e-01
  -9.52114113e-03  2.79550123e-02  1.33738935e-01 -8.51456120e-04]
 [-3.39576130e-01 -1.76359038e-01  3.31498572e-02 -1.88619678e-01
   1.58754055e-01 -2.27806785e-01 -2.50134455e-01  8.17630172e-01
  -2.30219364e-02  3.24498995e-02  2.09070144e-02 -8.18780263e-02]
 [-3.39786015e-01  3.05768414e-01  1.68414061e-01 -4.04368461e-02
  -1.67741829e-01  3.73521106e-02 -7.12117053e-02 -1.47234052e-01
  -1.40364519e-02  2.04231985e-02 -2.11786378e-01 -8.11975143e-01]
 [-1.32048433e-01  5.17895579e-01  4.65277587e-01 -2.26573804e-01
  -3.15478527e-01 -4.09971025e-01  1.87407805e-01  2.88899782e-02
   1.92555573e-02 -3.03940544e-02  2.26065837e-02  3.75717404e-01]
 [-2.36731239e-01  1.72203564e-01 -2.89579241e-01  1.66499147e-02
   6.73394380e-01 -5.20849315e-01  1.74163583e-01 -2.70023726e-01
   4.90218338e-04  2.67872955e-02 -3.44032638e-02 -1.67256251e-02]
 [-3.79544630e-01 -1.58917126e-01  3.01331235e-02  5.88996430e-02
  -4.79304615e-02  6.54990971e-02 -1.25360547e-01 -1.47011385e-01
   8.60949779e-01 -1.53988809e-01  8.81496494e-03  1.31811209e-01]
 [ 3.78338479e-01  6.69207582e-02 -7.69632239e-02 -2.91668784e-02
   3.87528801e-02 -5.56037802e-02  2.43249899e-01  2.48362869e-01
   2.64947746e-01 -8.21334412e-02 -8.01702921e-01 -2.21039377e-02]
 [-3.75735266e-01 -1.78253455e-01  1.05735058e-02  5.99371936e-02
  -4.49167978e-02  7.12421058e-02 -7.37033904e-02 -1.29405301e-01
  -3.96630014e-01 -6.96120346e-01 -3.30434257e-01  2.07335161e-01]
 [-3.74393248e-01 -1.65354225e-01  3.99406608e-02  5.22829152e-02
  -7.13145088e-02  1.05626352e-01 -1.06451482e-01 -1.62436434e-01
  -1.71924114e-01  6.92605118e-01 -4.25380948e-01  2.89810311e-01]
 [ 6.12701550e-02 -6.85789303e-01  2.47813201e-01 -3.30063062e-02
  -2.15965053e-01 -4.72780069e-01  3.49013955e-01 -1.47471791e-01
  -1.77458470e-02  3.09999159e-02  2.86184522e-02 -2.18226278e-01]
 [-3.60056676e-02  5.78245751e-03 -6.93815124e-01 -5.12978716e-01
  -4.51599502e-01 -1.79321998e-01 -9.50385923e-02 -9.44386738e-02
   2.14355561e-03  2.41279651e-03  1.48702838e-03  9.02504349e-03]
 [-8.32850162e-02  1.33359885e-01 -3.03715402e-01  7.95161997e-01
  -3.63283293e-01 -2.81867433e-01  3.46024304e-02  1.90207154e-01
  -1.36635099e-02  1.99097494e-02  3.69652255e-02 -1.86876876e-03]]

 Eigen Values 
%s [6.53481256e+00 1.55027798e+00 1.11438386e+00 1.07646142e+00
 7.55326425e-01 5.00210798e-01 2.05141488e-01 1.52593595e-01
 5.87515113e-03 2.04033235e-02 4.18767553e-02 6.15939938e-02]
In [31]:
# Cumulative variance explained
tot = sum(eig_vals)
var_exp = [(i /tot) * 100 for i in sorted(eig_vals, reverse = True)]
cum_var_exp = np.cumsum(var_exp)

print('Cumulative Variance Explained', cum_var_exp)
Cumulative Variance Explained [ 54.37087735  67.26948354  76.54136817  85.49773096  91.78218978
  95.9440383   97.65085431  98.9204619   99.43293559  99.78135812
  99.95111763 100.        ]
In [32]:
plt.figure(figsize = (15 , 7.2))
plt.plot(var_exp)
Out[32]:
[<matplotlib.lines.Line2D at 0x1be9cb03d88>]
In [33]:
# Ploting 
plt.figure(figsize = (15 , 7.2))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where = 'mid', label = 'Cumulative explained variance')
plt.axhline(y = 95, color = 'r', linestyle = '--')
plt.axvline(x = 6, color = 'r', linestyle = '--')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

Observation 8 - PCA

  • Visually we can observe that their is steep drop in variance explained with increase in number of PC's.
  • We will proceed with 6 components here which covers more than 95% of variance.
In [34]:
# Reducing the dimensions from 12 to 6
pca = PCA(n_components = 6, random_state = random_state)
pca.fit(X_train)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.fit_transform(X_test)
display(X_train_reduced.shape, X_test_reduced.shape)
(634, 6)
(212, 6)
In [35]:
pca.components_
Out[35]:
array([[ 0.34009463,  0.33794945,  0.33503446,  0.13041569,  0.23652898,
         0.38048316, -0.38018014,  0.37579804,  0.3752845 , -0.05863032,
         0.05802983,  0.08216775],
       [-0.01708559,  0.18133851, -0.32574657, -0.54502345, -0.19833762,
         0.15551679, -0.04657311,  0.18064402,  0.13444857,  0.64541435,
         0.165809  ,  0.05665432],
       [ 0.06575121,  0.17648847, -0.03327224, -0.01308773,  0.12490053,
        -0.05935091,  0.04297934, -0.06143859, -0.06308507, -0.09867145,
         0.66276782, -0.6967989 ],
       [ 0.2112849 , -0.13502957, -0.1346522 , -0.37483765,  0.31520721,
        -0.0682641 ,  0.10682439, -0.04298389, -0.0485113 , -0.36467931,
         0.45724618,  0.56284827],
       [-0.05721052, -0.13484345,  0.19649251,  0.4550835 , -0.54315451,
         0.00282427, -0.04349429, -0.02006966,  0.02745604,  0.1809912 ,
         0.54106651,  0.33052979],
       [ 0.37871695, -0.29415461,  0.06660237, -0.36394537, -0.61434687,
         0.09278887, -0.08720213,  0.09561024,  0.10074785, -0.39900997,
        -0.10790752, -0.22220244]])
In [36]:
# Pairplot after dimension reduction
sns.pairplot(pd.DataFrame(X_train_reduced), diag_kind = 'kde')
Out[36]:
<seaborn.axisgrid.PairGrid at 0x1be9c633688>
In [37]:
# Creating a dimension reduced with features and target
df_train = pd.DataFrame(X_train_reduced).join(pd.DataFrame(y_train, columns = ['class']), how = 'left', sort = False)
df_test = pd.DataFrame(X_test_reduced).join(pd.DataFrame(y_test, columns = ['class']), how = 'left', sort = False)
df_train.shape, df_test.shape
Out[37]:
((634, 7), (212, 7))

Observation 9 - Dimensionality Reduction

  • After dimensionality reduction using PCA our attributes have become independent with no correlation among themselves
In [38]:
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize = (20, 10))
f.suptitle('Clusters using Dimensionality Reduction', fontsize = 14)

pca_plots(df_train, 0, 1, 'First Component', 'Second Component', ax1)
pca_plots(df_train, 1, 2, 'Second Component', 'Third Component', ax2)
pca_plots(df_train, 2, 3, 'Third Component', 'Fourth Component', ax3)
pca_plots(df_train, 3, 4, 'Fourth Component', 'Fifth Component', ax4)
pca_plots(df_train, 4, 5, 'Fifth Component', 'Sixth Component', ax5)
pca_plots(df_train, 5, 0, 'Sixth Component', 'First Component', ax6)
In [39]:
features = [f for f in df_train.columns if f not in ['class']]

i = 0
t1 = df_train[df_train['class'] == 1]
t2 = df_train[df_train['class'] == 2]
t3 = df_train[df_train['class'] == 3]

fig, ax = plt.subplots(2, 3, figsize = (20, 10))
fig.suptitle('Distribution for Car, Bus, Van for Principal Components', fontsize = 14)

for feature in features:
    i += 1
    plt.subplot(2, 3, i)
    sns.kdeplot(t1[feature], bw = 0.5, label = 'Van')
    sns.kdeplot(t2[feature], bw = 0.5, label = 'Bus')
    sns.kdeplot(t3[feature], bw = 0.5, label = 'Car')
    plt.xlabel(feature, fontsize = 12)
    locs, labels = plt.xticks()
    plt.tick_params(axis = 'both', which = 'major', labelsize = 12)
plt.show()
In [40]:
features = [f for f in df_train.columns if f not in ['class']]

i = 0
ttr = df_train.drop(['class'], axis = 1)
tte = df_test.drop(['class'], axis = 1)

fig, ax = plt.subplots(2, 3, figsize = (20, 10))
fig.suptitle('Most of the principal components are normally distributed in both train and test set', fontsize = 20)

for feature in features:
    i += 1
    plt.subplot(2, 3, i)
    sns.kdeplot(ttr[feature], bw = 0.5, label = 'Train')
    sns.kdeplot(tte[feature], bw = 0.5, label = 'Test')
    plt.xlabel(feature, fontsize = 12)
    locs, labels = plt.xticks()
    plt.tick_params(axis = 'both', which = 'major', labelsize = 12)
plt.show();

Modelling

As mentioned in the list of tasks, use Naive Bayes and support vector machines. Use grid search for SVC (try C values - 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) and find out the best hyper parameters and do cross validation to find the accuracy.

In [41]:
# Compare different models on the principal components
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVC', SVC()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))

# evaluate each model in turn
results = []
names = []
scoring = 'f1_macro'
for name, model in models:
    skf = StratifiedKFold(n_splits = 10, random_state = random_state)
    cv_results = cross_val_score(model, X_train_reduced, y_train, cv = skf, scoring = scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
fig = plt.figure(figsize = (15, 7.2))
fig.suptitle(f'SVC does have the highest cross validation score. Let\'s try SVC and NB for this problem.', fontsize = 14)
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
LR: 0.779001 (0.065916)
KNN: 0.845923 (0.036627)
NB: 0.766583 (0.072973)
SVC: 0.892764 (0.020523)
CART: 0.822720 (0.049417)
RF: 0.829701 (0.040820)
In [42]:
# Naive Bayes Model -- Principal Components
NB = GaussianNB()
NB.fit(X_train_reduced, y_train)

print('Naive Bayes Classifier Scores\n\n')
print('NB accuracy for train set: {0:.3f}'.format(NB.score(X_train_reduced, y_train)))
print('NB accuracy for test set: {0:.3f}'.format(NB.score(X_test_reduced, y_test)))

y_true, y_pred = y_test, NB.predict(X_test_reduced)

# Cross Validation Score
skf = StratifiedKFold(n_splits = 10, random_state = random_state)
nb_score = cross_val_score(NB, X_train_reduced, y_train, cv = skf, scoring = scoring)
print('NB cross validation training score: ', round(nb_score.mean(), 3).astype(str))

# Accuracy Score
auc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', auc.round(3))

# Classification Report
print('\n{}'.format(classification_report(y_true, y_pred)))

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)
Naive Bayes Classifier Scores


NB accuracy for train set: 0.798
NB accuracy for test set: 0.675
NB cross validation training score:  0.767

Accuracy Score:
 0.675

              precision    recall  f1-score   support

         1.0       0.58      0.67      0.62        49
         2.0       0.60      0.53      0.56        53
         3.0       0.76      0.75      0.75       110

    accuracy                           0.67       212
   macro avg       0.64      0.65      0.64       212
weighted avg       0.68      0.67      0.67       212


Confusion Matrix:
 [[33  3 13]
 [12 28 13]
 [12 16 82]]
In [43]:
# SVC with hyperparameter tuning -- Principal Components
svc = SVC(random_state = random_state)

params = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf']}

skf = StratifiedKFold(n_splits = 10, random_state = random_state)
grid_svc = GridSearchCV(svc, param_grid = params, n_jobs = -1, cv = skf)

grid_svc.fit(X_train_reduced, y_train)

print('SVC Scores with Hyperparameter Tuning\n\n')
print('Best Hyper Parameters are: ', grid_svc.best_params_)
print('Best Score is: ', grid_svc.best_score_.round(3))

print('SVC accuracy for train set: {0:.3f}'.format(grid_svc.score(X_train_reduced, y_train)))
print('SVC accuracy for test set: {0:.3f}'.format(grid_svc.score(X_test_reduced, y_test)))

y_true, y_pred = y_test, grid_svc.predict(X_test_reduced)

# Cross Validation Score
grid_svc_score = cross_val_score(grid_svc, X_train_reduced, y_train, cv = skf, scoring = scoring)
print('SVC cross validation training score: ', round(grid_svc_score.mean(), 3).astype(str))

# Accuracy Score
auc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', auc.round(3))

# Classification Report
print('\n{}'.format(classification_report(y_true, y_pred)))

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)
SVC Scores with Hyperparameter Tuning


Best Hyper Parameters are:  {'C': 1, 'kernel': 'rbf'}
Best Score is:  0.899
SVC accuracy for train set: 0.923
SVC accuracy for test set: 0.783
SVC cross validation training score:  0.893

Accuracy Score:
 0.783

              precision    recall  f1-score   support

         1.0       0.65      0.71      0.68        49
         2.0       0.79      0.79      0.79        53
         3.0       0.85      0.81      0.83       110

    accuracy                           0.78       212
   macro avg       0.76      0.77      0.77       212
weighted avg       0.79      0.78      0.78       212


Confusion Matrix:
 [[35  2 12]
 [ 7 42  4]
 [12  9 89]]
In [44]:
# SVC with hyperparameter tuning -- Original Features
svc = SVC(random_state = random_state)

params = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf']}

skf = StratifiedKFold(n_splits = 10)
grid_svc_f = GridSearchCV(svc, param_grid = params, n_jobs = -1, cv = skf)

grid_svc_f.fit(X_train, y_train)

print('SVC Scores with Hyperparameter Tuning\n\n')
print('Best Hyper Parameters are: ', grid_svc_f.best_params_)
print('Best Score is: ', grid_svc_f.best_score_.round(3))

print('SVC accuracy for train set: {0:.3f}'.format(grid_svc_f.score(X_train, y_train)))
print('SVC accuracy for test set: {0:.3f}'.format(grid_svc_f.score(X_test, y_test)))

y_true, y_pred = y_test, grid_svc_f.predict(X_test)

# Cross Validation Score
grid_svc_f_score = cross_val_score(grid_svc_f, X_train, y_train, cv = skf, scoring = scoring)
print('SVC cross validation training score: ', round(grid_svc_f_score.mean(), 3).astype(str))

# Accuracy Score
auc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', auc.round(3))

# Classification Report
print('\n{}'.format(classification_report(y_true, y_pred)))

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)
SVC Scores with Hyperparameter Tuning


Best Hyper Parameters are:  {'C': 1, 'kernel': 'rbf'}
Best Score is:  0.924
SVC accuracy for train set: 0.950
SVC accuracy for test set: 0.929
SVC cross validation training score:  0.921

Accuracy Score:
 0.929

              precision    recall  f1-score   support

         1.0       0.92      0.90      0.91        49
         2.0       0.91      0.96      0.94        53
         3.0       0.94      0.93      0.94       110

    accuracy                           0.93       212
   macro avg       0.92      0.93      0.93       212
weighted avg       0.93      0.93      0.93       212


Confusion Matrix:
 [[ 44   1   4]
 [  0  51   2]
 [  4   4 102]]
In [45]:
# Plot training vs cross validation scores
cv = StratifiedKFold(n_splits = 30, random_state = random_state)

f, ((ax1, ax2, ax3)) = plt.subplots(1, 3, figsize = (15, 7.2))
f.suptitle('Training vs Cross Validation Scores', fontsize = 14)

plot_learning_curve(NB, X_train_reduced, y_train, cv = cv, n_jobs = 1, ax = ax1,
                    name = 'Naive Bayes \n Principal Components Learning Curve')

plot_learning_curve(grid_svc, X_train_reduced, y_train, cv = cv, n_jobs = 1, ax = ax2, 
                    name = 'Support Vector Classifier \n Principal Components Learning Curve')

plot_learning_curve(grid_svc_f, X_train, y_train, cv = cv, n_jobs = 1, ax = ax3,
                    name = 'Support Vector Classifier \n Original Features Learning Curve')

Conclusion

We used correlation matrix and checked the relation of each feature with the class column to reduce the number of features in the dataset to 12 from 18.

PCA being a statistical technique to reduce the dimensionality of the data by the selecting the most important features that captures maximum information about the dataset, does the task here. Here we've reduced the dimension from 12 to 6 and selected those which explained 95% variance. Doing that it removes the correlated features as well, which we saw in the scatterplot before and after PCA.

However, some of the limitations which are clearly seen in this use case are: after implementing PCA on the dataset, we saw features getting converted into principal components. Principal components are the linear combination of original features. This makes the features less interpretable. Additionally, we know that one of limitation of PCA is it assumes linearity i.e. principal components are a linear combinations of the original features, which if not true will not give a sensible results..

We then applied Naive Bayes and Support Vector Classifier on the reduced features (dimensions) and got an accuracy of 67.5% and 78.3% respectively and precision (macro) score of 64% and 76% respectively. Recall (macro) score for the same was 65% and 77% respectively. We then also applied SVC on the 12 actual features (with interpretability) and saw an accuracy score of 92.9%, precision (macro) score of 92% and recall (macro) score of 93%, which is a way better score then SVC when applied on principal components.

Shape of dataset we were dealing with was 846 rows and 12 features + 1 class column. Effect of PCA can be more useful in large datasets with more features.

Based on learning curve, we can conclude that for Naive Bayes with principal components, both training and validation scores are volatile, however the validation score almost flattens beyond a training size of ~330. For SVC with principal components and original features, both training and validation scores increases with the increase in size of the dataset, which would mean the scores can be further increases with more training samples. Howevver, the gap between training and validation score for SVC with principal component is higher than then the others.