Part 02 - Compare Predictors/Classifiers, Final Sanity Check

In this notebook, we compare the performance of three predictors (or classifiers, we use the term interchangeably). For details on the predictors, please refer to the A1_IllustrateML notebook. Here, we simply compare their performance on the training set. In particular, we use

  1. Support Vector Machine
  2. Decision Tree
  3. Random Forest (Ensemble of Decision Trees)
In [1]:
%matplotlib inline
In [2]:
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
In [3]:
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'x-small'; mpl.rcParams['ytick.labelsize'] = 'x-small'
mpl.rcParams['axes.labelsize'] = 'x-small'; mpl.rcParams['axes.titlesize'] = 'x-small'

Load and Clean Data, Extract Training Set

In [4]:
df = pd.read_csv('case_data.csv')
df = nh.cleanup(df)
df, str2int, int2str = nh.construct_string_to_int_maps(df)
In [5]:
%%time
df = nh.generate_feature_vectors(df, int2str)
CPU times: user 21 s, sys: 36 ms, total: 21 s
Wall time: 21 s
In [6]:
df_classified   = df[~pd.isnull(df.eq_1)].copy()
df_unclassified = df[pd.isnull(df.eq_1)].copy()

Classify and Test

We now run our different classifiers on a varying subset of the training data. We use a fraction of the training data to train the classfier, then assess the performance by attempting to predict the (eq_1) category for the remaining training data (for which we know the truth). We vary (i) classifier, (ii) training data fraction, and (iii) number of levels (fi_1, fi_2, fi_3) used.

In [7]:
fraction_svm_03 = np.zeros(3); fraction_svm_05 = np.zeros(3); fraction_svm_08 = np.zeros(3)
fraction_tree_03 = np.zeros(3); fraction_tree_05 = np.zeros(3); fraction_tree_08 = np.zeros(3)
fraction_forest_03 = np.zeros(3); fraction_forest_05 = np.zeros(3); fraction_forest_08 = np.zeros(3)

for ii, ilevel in enumerate([1,2,3]):
    
    
    _, _, fraction_svm_03[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                    training_fraction=0.3, predictor='svm', ilevel=ilevel)
    _, _, fraction_svm_05[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                    training_fraction=0.5, predictor='svm', ilevel=ilevel)
    _, _, fraction_svm_08[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                    training_fraction=0.8, predictor='svm', ilevel=ilevel)
    
    _, _, fraction_tree_03[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                     training_fraction=0.3, predictor='tree', ilevel=ilevel)
    _, _, fraction_tree_05[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                     training_fraction=0.5, predictor='tree', ilevel=ilevel)
    _, _, fraction_tree_08[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                     training_fraction=0.8, predictor='tree', ilevel=ilevel)
    
    _, _, fraction_forest_03[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                       training_fraction=0.3, predictor='forest', ilevel=ilevel)
    _, _, fraction_forest_05[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                       training_fraction=0.5, predictor='forest', ilevel=ilevel)
    _, _, fraction_forest_08[ii] = nh.fit_and_evaluate(df_classified, int2str, \
                                                       training_fraction=0.3, predictor='forest', ilevel=ilevel)

So... what gives? Let's have a look at the raw data as well as plot it.

In [8]:
print fraction_svm_03
print fraction_svm_05
print fraction_svm_08
print ''
print fraction_tree_03
print fraction_tree_05
print fraction_tree_08
print ''
print fraction_forest_03
print fraction_forest_05
print fraction_forest_08
[ 0.41268293  0.75609756  0.8497561 ]
[ 0.42213115  0.75136612  0.86885246]
[ 0.40273038  0.7440273   0.8668942 ]

[ 0.41268293  0.75609756  0.8497561 ]
[ 0.42213115  0.75136612  0.86885246]
[ 0.40273038  0.7440273   0.8668942 ]

[ 0.41268293  0.75609756  0.84390244]
[ 0.42213115  0.75409836  0.8647541 ]
[ 0.41268293  0.76292683  0.8497561 ]
In [9]:
fig, axarr = plt.subplots(1,3)
fig.set_size_inches(8,2)

axarr[0].plot(fraction_svm_08, \
              'o-', c=plt.rcParams['axes.color_cycle'][1])
axarr[1].plot([ fraction_svm_03[2], fraction_svm_05[2], fraction_svm_08[2] ], \
              'o-', c=plt.rcParams['axes.color_cycle'][1])
axarr[2].plot([ fraction_svm_08[2], fraction_tree_08[2], fraction_forest_08[2] ], \
              'o-', c=plt.rcParams['axes.color_cycle'][1])

for ax in axarr:
    ax.set_ylim([0,1])
    ax.set_xlim([-0.5,2.5])
    ax.set_xticks([0,1,2])

axarr[0].set_xticklabels([ 'fi_1', 'fi_2', 'fi_3' ])
axarr[1].set_xticklabels([ '0.3', '0.5', '0.8' ])
axarr[2].set_xticklabels([ 'SVM', 'Tree', 'Forest' ])
    
plt.setp(axarr[1].get_yticklabels(), visible=False)
plt.setp(axarr[2].get_yticklabels(), visible=False)

axarr[0].set_ylabel('Fraction Correctly Classified')
axarr[0].set_xlabel('Levels Used, SVM')
axarr[1].set_xlabel('Fraction Used to Train, SVM')
axarr[2].set_xlabel('Classifier')

fig.savefig('compare_predictors.pdf', bbox_inches='tight')

pass
/home/ics/volker/Anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Inspecting the numerical values (two boxes up) and the figure, we find the following:

  • For all classifiers, increasing the feature level (fi_1, fi_2, fi_3) improves classification performance substantially
  • For a given classifier, increasing the amount of data to train improves performance at the percent level. Sometimes (depending the subsample chosen from the training set), performance appears to decrease for a larger fraction. This is likely ficticious because the baseline sample to compare to shrinks too much.
  • Decision Trees and Support Vector Machines (SVM) perform exactly the same. This is a bit odd. It seems that for a binary choice, the SVM bisects space such that it operates as a tree would.
  • Random forests perform a bit worse than Decisions Trees. However, recall that we train a number of trees (eight, in our case) in the ensemble, meaning that each tree has a smaller training set available, decreasing performance.

Final Sanity Check

Let us now do a final sanity check. If our classifier works, we should expect the distribution of (eq_1) to be similar across the original training set and the set have classified. This is simply by virtue of the input distribution (fi_2) being similar.

So, let's pick the decision tree and classifiy on (fi_3).

In [10]:
df_trained = nh.classify_data(df_classified, df_unclassified, int2str, predictor='tree', ilevel=3)
In [11]:
ilevel = 2
df_counts_01 = nh.count_eq_1(df_classified, str2int)
df_counts_02 = nh.count_eq_1(df_trained, str2int)

fig, axarr = plt.subplots(1,2)
axarr[0].barh(df_counts_01.eq_1-0.3, \
              df_counts_01.counts, \
              height=0.6)
axarr[1].barh(df_counts_02.eq_1-0.3, \
              df_counts_02.counts, \
              height=0.6)

for ax in axarr:
    ax.set_yticks(int2str[3].keys())
    ax.set_yticklabels(int2str[3].values())
plt.setp(axarr[1].get_yticklabels(), visible=False)

axarr[0].set_title('Pre-Classified (Training) Data')
axarr[1].set_title('Newly Classified')

axarr[0].set_xlabel('Occurences')
axarr[1].set_xlabel('Occurences')

fig.savefig('eq_1_classified_unclassified.pdf', bbox_inches='tight')

pass

Ok, that looks roughly similar...