Finally, we can derive the total weight distribution over the classified and as yet unclassified sets of data. We will use a Decision Tree because it (a) has the same accuracy as an SVM, but (b) is significantly faster: fitting hyperplanes comes at a cost of $\mathcal{O}(N^2)$, whereas tree codes come at a cost of $\mathcal{O}(N \log N)$.
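To get a feel for this difference (not part of the original pipeline), here is a minimal timing sketch on synthetic data, assuming scikit-learn is available; the exact numbers will of course depend on the data and model parameters.
# Illustrative timing comparison on synthetic data (NOT the case data);
# assumes scikit-learn is installed.
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.random.randn(10000, 10)
y = (X[:, 0] + 0.1 * np.random.randn(10000) > 0).astype(int)
for name, clf in [('tree', DecisionTreeClassifier()), ('svm', SVC())]:
    t0 = time.time()
    clf.fit(X, y)
    print('%s fit: %.2f s' % (name, time.time() - t0))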
%matplotlib inline
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
import time
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'small'; mpl.rcParams['ytick.labelsize'] = 'small'
mpl.rcParams['axes.labelsize'] = 'small'; mpl.rcParams['axes.titlesize'] = 'small'
mpl.rcParams['legend.fontsize'] = 'small'
df = pd.read_csv('case_data.csv')
df = nh.cleanup(df)
df, str2int, int2str = nh.construct_string_to_int_maps(df)
%%time
df = nh.generate_feature_vectors(df, int2str)
df_classified_raw = df[~pd.isnull(df.eq_1)].copy()
df_unclassified_raw = df[pd.isnull(df.eq_1)].copy()
Now, we use the decision tree to classify (eq_1) in the thus far unclassified set.
df_trained_raw = nh.classify_data(df_classified_raw, df_unclassified_raw, int2str, predictor='tree', ilevel=3)
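NBIM_Helpers is not reproduced here; as a rough sketch, assuming classify_data wraps a scikit-learn decision tree, its core step might look something like this (feature_cols is a hypothetical list of the integer-encoded feature columns).
# Rough sketch only -- NOT the actual NBIM_Helpers implementation.
# 'feature_cols' is a hypothetical list of integer-encoded feature columns.
from sklearn.tree import DecisionTreeClassifier

def classify_with_tree(df_known, df_unknown, feature_cols, label_col='eq_1_int'):
    tree = DecisionTreeClassifier()
    tree.fit(df_known[feature_cols], df_known[label_col])
    df_out = df_unknown.copy()
    df_out[label_col] = tree.predict(df_out[feature_cols])
    return df_out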
Recall that about half of the rows did not have an actual weight in the data. So, now that we've trained, let's throw these rows away.
df_classified = df_classified_raw[~pd.isnull(df_classified_raw.weight)]
df_trained = df_trained_raw[~pd.isnull(df_trained_raw.weight)]
Do some more data munging to get the data into plottable form.
weights_classified = np.zeros_like(int2str[3].keys(), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_classified[ii] = np.sum(df_classified[df_classified.eq_1_int == eq_1_int].weight)
# print weights_classified
weights_trained = np.zeros_like(int2str[3].keys(), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_trained[ii] = np.sum(df_trained[df_trained.eq_1_int == eq_1_int].weight)
# print weights_trained
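As a design note, the same aggregation can be written more compactly with a pandas groupby; the explicit loops above just make the mapping from eq_1_int to summed weight a bit more obvious.
# Equivalent aggregation via groupby (same key ordering as the loops above);
# reindex gives one entry per eq_1_int key, with missing keys filled as 0.
keys = list(int2str[3].keys())
weights_classified_alt = df_classified.groupby('eq_1_int').weight.sum().reindex(keys, fill_value=0.0).values
weights_trained_alt = df_trained.groupby('eq_1_int').weight.sum().reindex(keys, fill_value=0.0).values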
Let's check if the weights add up to unity.
print np.sum(weights_classified)
print np.sum(weights_trained)
print np.sum(weights_classified+weights_trained)
Apparently, they do not. Interestingly enough, they do when we don't remove the duplicates (see Part 01). It's almost as if someone prepared the set by hand...
fig, ax = plt.subplots(1,1)
ax.barh(np.asarray(int2str[3].keys(), dtype=np.float64)-0.2+0.2, \
weights_classified, \
height=0.4, \
color=plt.rcParams['axes.color_cycle'][1], label='Pre-Classified (Training Data)')
ax.barh(np.asarray(int2str[3].keys(), dtype=np.float64)-0.2-0.2, \
weights_trained, \
height=0.4, \
color=plt.rcParams['axes.color_cycle'][0], label='Newly Classified')
ax.set_yticks(np.asarray(int2str[3].keys(), dtype=np.float64))
ax.set_yticklabels(int2str[3].values())
ax.legend(loc='upper right')
fig.savefig('weights_final_compare.pdf', bbox_inches='tight')
pass
From eyeballing, the weight distributions between the training and newly classified sets look approximately similar. Although we may intuitively expect them to be, there is actually no reason for this! The only thing we expect to be similar is the distribution of features and labels (which we have shown in the other two parts of this series).
Now, let's return to the weight distribution and make it pretty. Stack both sets, rescale such that the weights sum up to unity, order the figure by increasing weight, and get rid of that unsightly NaN.
weights_total = weights_trained + weights_classified
weights_total /= np.sum(weights_total)
argidx = np.argsort(weights_total)
weights_total = weights_total[argidx][1:]
eq_1_label_loc = np.array(range(len(weights_total)))+1
eq_1_label_val = np.asarray(int2str[3].values())[argidx][1:]
fig, ax = plt.subplots(1,1)
ax.barh(eq_1_label_loc-0.30, weights_total, height=0.6)
ax.set_yticks(eq_1_label_loc)
ax.set_yticklabels(eq_1_label_val)
ax.set_ylim([0,11])
ax.set_xlim([0,0.5])
ax.set_xlabel('Sector Weight')
fig.savefig('weights_final_total.pdf', bbox_inches='tight')
pass
Overall, it appears that we're disproportionately weighted into the financial sector. On the surface, this seems inadvisable, but it's tricky to quantify without more information.
In particular, the weight distribution per sector may just result from any number of other weight distribution decisions we have applied. For example, we may have distributed evenly over risk profiles (or geography, or market cap, or anything-you-can-imagine), and the per-sector distribution is just a consequence of that.
To go on from here, we should investigate other weight distributions and base our decision on the aggregate over all of them.
Interesting distributions could be, for example, those over risk profile, geography, or market cap (as mentioned above).
Given such distributions, we could even look at historical data and try to find the weightings (for the different distributions) that would have given the most consistent historical returns. This is essentially a parameter space sweep, and may require some computing power (and patience); a rough sketch follows below. Of course, this comes with the caveat of "Past performance is no guarantee of future returns".
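As an illustration of such a sweep (not part of this case study), the sketch below scores random candidate sector weightings against a hypothetical array of historical per-sector returns; both historical_returns and the consistency score are assumptions for illustration, not data or metrics from this series.
# Minimal sketch of a parameter-space sweep over candidate sector weightings.
# 'historical_returns' is a hypothetical (n_periods, n_sectors) array of
# per-period sector returns -- NOT part of the case data used above.
import numpy as np

def sweep_weightings(historical_returns, n_candidates=1000, seed=42):
    rng = np.random.RandomState(seed)
    n_sectors = historical_returns.shape[1]
    best_weights, best_score = None, -np.inf
    for _ in range(n_candidates):
        weights = rng.dirichlet(np.ones(n_sectors))  # random weights summing to unity
        portfolio = historical_returns.dot(weights)  # per-period portfolio return
        score = portfolio.mean() - portfolio.std()   # reward return, penalise volatility
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score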