Illustrate Classifiers

This notebook contains illustrative examples for Support Vector Machines and Decision Trees.

In [1]:
%matplotlib inline
In [2]:
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
In [3]:
# Plotting Styles
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'small'; mpl.rcParams['ytick.labelsize'] = 'small'
mpl.rcParams['axes.labelsize'] = 'small'; mpl.rcParams['axes.titlesize'] = 'small'

Support Vector Machine

Below, we plot random points in two different colours. They represent training data.

The line is the separating boundary (drawn here by manual eyeballing) that maximizes the distance between itself and the nearest points. This is our classifier.

Given some new point (Feature01, Feature02), its location relative to the separating line then classifies the point as either red or blue.

In [4]:
# Generate two sets of random points. Fix the seed so we get the same ones every time.
np.random.seed(12)
x1 = np.random.randn(32) - 3.0
y1 = np.random.randn(32) + 3.0
x2 = np.random.randn(32) + 3.0
y2 = np.random.randn(32) - 3.0
In [5]:
fig, ax = plt.subplots(1,1)

# axes.color_cycle is deprecated; pull the colours from the style's prop_cycle instead.
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

ax.scatter(x1, y1, s=8**2, c=colors[0], label='Red Label')
ax.scatter(x2, y2, s=8**2, c=colors[1], label='Blue Label')

xr = np.linspace(-6, 6, 128)
yr = 1.9 * xr - 0.5

ax.plot(xr, yr, c=colors[3])

ax.set_xlim([-6,6])
ax.set_ylim([-6,6])

ax.set_xlabel('Feature 01')
ax.set_ylabel('Feature 02')

# ax.set_title('Support Vector Machine Example')
ax.legend(loc='upper right', scatterpoints=1)

fig.savefig('svm_example_01.pdf', bbox_inches='tight')

pass
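
Rather than eyeballing the separator, scikit-learn can find the maximum-margin line for us. Below is a minimal sketch, assuming the random points generated above are still in scope; svm.SVC with a linear kernel is the standard estimator for this.

# Sketch: fit a maximum-margin linear SVM to the random points above.
from sklearn import svm

X = np.column_stack([np.concatenate([x1, x2]),
                     np.concatenate([y1, y2])])
y = np.concatenate([np.zeros_like(x1), np.ones_like(x2)])  # 0 = red, 1 = blue

svc = svm.SVC(kernel='linear')
svc.fit(X, y)

# A new point is classified by the side of the separator it falls on.
print(svc.predict([[-2.0, 2.0]]))  # should come out as class 0 (red)
print(svc.predict([[2.0, -2.0]]))  # should come out as class 1 (blue)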

Now imagine that, instead of "Feature 01" and "Feature 02", we ask "Is this issue in the UTILITIES group?" and "Is this issue in the ENERGY group?", each answered with a binary yes/no. The colours correspond to "Oil & Gas" and "Utilities". After fitting the separator, we can classify any point depending on which side of the line it falls on.

In [6]:
fig, ax = plt.subplots(1,1)

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

ax.scatter(0, 1, s=14**2, c=colors[0], label='Oil & Gas')
ax.scatter(1, 0, s=14**2, c=colors[1], label='Utilities')

xr = np.linspace(-1, 2, 128)
yr = xr

ax.plot(xr, yr, c=colors[3])

ax.set_xticks([0,1])
ax.set_xticklabels(['No','Yes'])
ax.set_xlabel('UTILITIES')

ax.set_yticks([0,1])
ax.set_yticklabels(['No','Yes'])
ax.set_ylabel('ENERGY')

ax.set_xlim([-1,2])
ax.set_ylim([-1,2])

# ax.set_title('Support Vector Machine Example')
ax.legend(loc='upper right', scatterpoints=1)

fig.savefig('svm_example_02.pdf', bbox_inches='tight')

pass

Clearly, we have more than two features ("ENERGY" and "UTILITIES" here) in our actual dataset. But fear not: the method extends directly to higher dimensions (as many as we have features). Instead of separating lines, we then have separating hyperplanes.
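
As a sketch of what this looks like in practice, one could turn a group-membership column into boolean indicator features with pandas.get_dummies and fit a linear SVM in that higher-dimensional space. The toy DataFrame below is made up purely for illustration; the real feature construction happens in NBIM_Helpers further down.

# Sketch: one 0/1 column per group, then a linear SVM over all of them.
from sklearn import svm

toy = pd.DataFrame({'fi_1': ['ENERGY', 'UTILITIES', 'ENERGY', 'INDUSTRIAL'],
                    'eq_1': ['Oil & Gas', 'Utilities', 'Oil & Gas', 'Industrials']})

X_toy = pd.get_dummies(toy['fi_1'])   # one boolean feature per group
y_toy = toy['eq_1']

svc_hd = svm.SVC(kernel='linear')
svc_hd.fit(X_toy.values, y_toy.values)
print(svc_hd.predict(X_toy.values[:1]))  # should recover 'Oil & Gas'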

Decision Tree

Decision trees are a bit more complicated. They are binary trees where, at each node, a given input sample is sent down either the left or the right branch, until a leaf node (carrying the final classification) is reached.

When training a decision tree, we choose at each node the feature (and split value, if the feature is non-binary) that maximizes the information gain at that level of the tree. The gain can be computed either as an entropy reduction or via the Gini impurity.

A much better description is given at this URL:

http://stackoverflow.com/questions/1859554/what-is-entropy-and-information-gain/1859910#1859910
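
To make the information-gain criterion concrete, here is a small self-contained sketch that computes the entropy of a label set and the gain of a candidate boolean split (scikit-learn does the equivalent bookkeeping internally when it grows the tree):

# Sketch: entropy of a label array and information gain of a boolean split.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    left, right = labels[split_mask], labels[~split_mask]
    w_left = len(left) / float(len(labels))
    return entropy(labels) - (w_left * entropy(left) + (1.0 - w_left) * entropy(right))

toy_labels = np.array(['Industrials', 'Industrials', 'Technology', 'Technology'])
toy_split = np.array([True, True, False, False])   # a candidate "is feature X set?" split
print(information_gain(toy_labels, toy_split))      # perfect split: 1.0 bit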

We now load our financial classification data, and build a shallow tree to illustrate the concept.

In [7]:
# Load Data, Keep only Pre-Classified Around
df = pd.read_csv('case_data.csv')
df = nh.cleanup(df)
df = df[~pd.isnull(df.eq_1)].copy()
df, str2int, int2str = nh.construct_string_to_int_maps(df)
# df_unclassified = df[pd.isnull(df.eq_1)].copy()
In [8]:
# Generate Boolean Feature Vectors, ~10 Seconds
df = nh.generate_feature_vectors(df, int2str)
In [9]:
# Extract Level 1 Features (fi_1)
# Extract Labels (eq_1)
features = np.asarray(df[int2str[0].keys()])
labels = np.asarray(df.eq_1_int)
In [10]:
# Fit Tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=None)
clf = clf.fit(features, labels)
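
As a quick sanity check, the fitted tree can score itself on the training set and predict integer eq_1 labels for individual issues. This is only a sketch and assumes the cells above have run:

# Sketch: training accuracy and a few example predictions (integer eq_1_int labels).
print(clf.score(features, labels))
print(clf.predict(features[:5]))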

We now show the resulting decision tree which we've trained on the top level of the feature hierarchy (fi_1) only.

What happens to each sample as it is classified should be self-explanatory.

Naturally, fitting only 3 features onto 11 categories will fail spectacularly.

Note that decision trees can be prone to overfitting if we use overly specific features. For example, if we tried to fit the tree to the issuer and weight columns in our dataset, the tree would perform spectacularly on the training data, but would choke on actual data (because it has been overfitted to the weight and issuer strings, which are too specific).
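
One standard way to detect this kind of overfitting is to hold data out and score the tree on samples it was not trained on. A sketch using scikit-learn's cross-validation helper (the import path below is the old sklearn.cross_validation module; on newer versions use sklearn.model_selection instead):

# Sketch: mean accuracy over 5 cross-validation folds on held-out data.
from sklearn.cross_validation import cross_val_score

cv_scores = cross_val_score(tree.DecisionTreeClassifier(max_depth=None),
                            features, labels, cv=5)  # assumes >= 5 samples per class
print(cv_scores.mean(), cv_scores.std())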

In [11]:
# Plot Tree
from sklearn.externals.six import StringIO
import pydot
from IPython.display import Image
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, \
                     filled=True, proportion=True, rounded=True, \
                     feature_names=int2str[0].values(), \
                     class_names=int2str[3].values(), \
                     leaves_parallel=False)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
# graph.write_pdf("tree.pdf")
Out[11]: (decision tree graph rendered inline)
In [12]:
# Export Tree
with open('/home/ics/volker/Source/Notebooks/_NBIM/tree.dot', 'w') as f:
    f = tree.export_graphviz(clf, out_file=f, \
                             filled=True, proportion=True, rounded=True, \
                             feature_names=int2str[0].values(), \
                             class_names=int2str[3].values(), \
                             leaves_parallel=False)
In [13]:
# Modified the tree:
#
# cat tree3.dot
# digraph Tree {
# node [shape=box, style="filled, rounded", color="black", fontname=helvetica] ;
# edge [fontname=helvetica] ;
# 0 [label="FINANCIAL INSTITUTIONS?\nsamples = 100.0%", fillcolor="#39e5c518"] ;
# 1 [label="INDUSTRIAL?\nsamples = 77.3%", fillcolor="#e581390c"] ;
# 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="False"] ;
# 2 [label="samples = 4.4%\nclass = Consumer Goods", fillcolor="#eeaeee"] ;
# 1 -> 2 ;
# 3 [label="samples = 73.0%\nclass = Industrials", fillcolor="#e581390c"] ;
# 1 -> 3 ;
# 4 [label="samples = 22.7%\nclass = Technology", fillcolor="#39e5c5f2"] ;
# 0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="True"] ;
# }

# Generate w/ 
# dot -Tpdf tree3.dot -o tree3.pdf

Full Decision Tree

For completeness, we also generate the full Decision Tree (when we operate on all three levels of the feature hierarchy). It's kind of large.

In [14]:
# Extract Level 3 Features (fi_3)
# Extract Labels (eq_1)
features3 = np.asarray(df[int2str[2].keys()])
labels = np.asarray(df.eq_1_int)
In [15]:
clf3 = tree.DecisionTreeClassifier(max_depth=None)
clf3 = clf3.fit(features3, labels)
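
For a rough comparison (training-set accuracy only; a proper comparison would use held-out data as sketched above), we can score both trees:

# Sketch: training accuracy of the level-1 tree vs. the level-3 tree.
print(clf.score(features, labels))      # 3 level-1 features
print(clf3.score(features3, labels))    # level-3 features
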
In [16]:
# Plot Tree
from sklearn.externals.six import StringIO
import pydot
from IPython.display import Image
dot_data = StringIO()
tree.export_graphviz(clf3, out_file=dot_data, \
                     filled=True, proportion=True, rounded=True, \
                     feature_names=int2str[2].values(), \
                     class_names=int2str[3].values(), \
                     leaves_parallel=False)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[16]: (full decision tree graph rendered inline)