Part 03 - Final Weight Distribution & Outlook

Finally, we can derive the total weight distribution over the classified and as yet unclassified sets of data. We will use a decision tree because it (a) matches the accuracy of the SVM, but (b) is significantly faster: fitting hyperplanes comes at a cost of $\mathcal{O}(N^2)$, whereas tree codes come at a cost of $\mathcal{O}(N \log N)$.
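As a rough illustration of that scaling argument, here is a minimal sketch timing scikit-learn's `DecisionTreeClassifier` against `SVC` on synthetic data (the class names are standard scikit-learn; the dataset and its size are invented here and unrelated to the case data):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic two-class problem; size chosen only to make the trend visible.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for name, clf in [('tree', DecisionTreeClassifier(random_state=0)),
                  ('svm', SVC())]:
    t0 = time.time()
    clf.fit(X, y)
    print('%s: fit in %.2f s, training accuracy %.2f'
          % (name, time.time() - t0, clf.score(X, y)))
```

Increasing `n_samples` should make the gap between the two fit times grow roughly as the ratio of the two complexity classes.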

In [1]:
%matplotlib inline
In [2]:
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
import time
In [3]:'ggplot')
mpl.rcParams['xtick.labelsize'] = 'small'; mpl.rcParams['ytick.labelsize'] = 'small'
mpl.rcParams['axes.labelsize'] = 'small'; mpl.rcParams['axes.titlesize'] = 'small'
mpl.rcParams['legend.fontsize'] = 'small'

Load and Clean Data, Extract Training Set, Train Data

In [4]:
df = pd.read_csv('case_data.csv')
df = nh.cleanup(df)
df, str2int, int2str = nh.construct_string_to_int_maps(df)
In [5]:
%%time
df = nh.generate_feature_vectors(df, int2str)
CPU times: user 21 s, sys: 16 ms, total: 21 s
Wall time: 21 s
In [6]:
df_classified_raw   = df[~pd.isnull(df.eq_1)].copy()
df_unclassified_raw = df[pd.isnull(df.eq_1)].copy()
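The split above relies on the usual pandas null-mask idiom; a minimal self-contained illustration (the toy frame and its values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame where some labels are missing (NaN), mirroring eq_1 above.
df_toy = pd.DataFrame({'eq_1':   ['Financials', np.nan, 'Energy', np.nan],
                       'weight': [0.3, 0.2, 0.4, 0.1]})

# Boolean null mask and its negation partition the frame into two disjoint sets.
classified   = df_toy[~pd.isnull(df_toy.eq_1)].copy()
unclassified = df_toy[pd.isnull(df_toy.eq_1)].copy()

print(len(classified), len(unclassified))  # → 2 2
```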

Now, we use the decision tree to classify (eq_1) in the thus far unclassified set.

In [7]:
df_trained_raw = nh.classify_data(df_classified_raw, df_unclassified_raw, int2str, predictor='tree', ilevel=3)

Recall that about half of the rows did not have an actual weight in the data. Now that we have trained, let's drop these rows.

In [8]:
df_classified = df_classified_raw[~pd.isnull(df_classified_raw.weight)]
df_trained    = df_trained_raw[~pd.isnull(df_trained_raw.weight)]

Prepare Final Weight Distribution

Do some more data mangling to get data into plottable form.

In [9]:
weights_classified = np.zeros(len(int2str[3]), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_classified[ii] = np.sum(df_classified[df_classified.eq_1_int == eq_1_int].weight)
# print weights_classified
In [10]:
weights_trained = np.zeros(len(int2str[3]), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_trained[ii] = np.sum(df_trained[df_trained.eq_1_int == eq_1_int].weight)
# print weights_trained
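The two per-class loops above could also be written as a single vectorized groupby; a minimal sketch on toy data (the column names mirror the ones used here, but the frame itself is invented):

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as above (values are invented).
df_toy = pd.DataFrame({'eq_1_int': [0, 1, 0, 2, 1],
                       'weight':   [0.1, 0.2, 0.3, 0.1, 0.3]})

# Loop version, as in the cells above.
classes = sorted(df_toy.eq_1_int.unique())
weights_loop = np.zeros(len(classes))
for ii, eq_1_int in enumerate(classes):
    weights_loop[ii] = np.sum(df_toy[df_toy.eq_1_int == eq_1_int].weight)

# Equivalent vectorized version; groupby sorts by key by default.
weights_groupby = df_toy.groupby('eq_1_int').weight.sum().values

print(np.allclose(weights_loop, weights_groupby))  # → True
```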

Let's check if the weights add up to unity.

In [11]:
print np.sum(weights_classified)
print np.sum(weights_trained)
print np.sum(weights_classified+weights_trained)

Apparently, they do not. Interestingly enough, they do when we don't remove the duplicates (see Part 01). It's almost as if someone prepared the set by hand...
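To see how deduplication can break the normalization, consider a toy set of weights that only sums to unity while a row appears twice (the numbers here are invented, not taken from the case data):

```python
import pandas as pd

# Invented example: the weights sum to 1.0 only because row 'B' is duplicated.
df_toy = pd.DataFrame({'name':   ['A', 'B', 'B', 'C'],
                       'weight': [0.4, 0.2, 0.2, 0.2]})

print(df_toy.weight.sum())                    # sums to unity
print(df_toy.drop_duplicates().weight.sum())  # falls short after deduplication
```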

Plot Final Weight Distribution

In [12]:
fig, ax = plt.subplots(1,1)

ypos = np.asarray(int2str[3].keys(), dtype=np.float64)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Two bars per class, stacked vertically around each integer class label.
ax.barh(ypos, weights_classified, height=0.4,
        color=colors[1], label='Pre-Classified (Training Data)')
ax.barh(ypos-0.4, weights_trained, height=0.4,
        color=colors[0], label='Newly Classified')

ax.set_yticks(ypos)

ax.legend(loc='upper right')

fig.savefig('weights_final_compare.pdf', bbox_inches='tight')


From eyeballing, the weight distributions of the training and newly classified sets look broadly similar. Although we may intuitively expect them to be, there is actually no reason for this! The only thing we expect to be similar is the distribution of features and labels (which we showed in the other two parts of this series).

Now, let's return to the weight distribution and make it pretty. Stack both sets, rescale so that the weights sum to unity, order the figure by increasing weight, and get rid of that unsightly NaN.

In [13]:
weights_total = weights_trained + weights_classified
weights_total /= np.sum(weights_total)
argidx = np.argsort(weights_total)
weights_total = weights_total[argidx][1:]
eq_1_label_loc = np.arange(1, len(weights_total) + 1)
eq_1_label_val = np.asarray(int2str[3].values())[argidx][1:]
In [14]:
fig, ax = plt.subplots(1,1)

ax.barh(eq_1_label_loc-0.30, weights_total, height=0.6)

ax.set_yticks(eq_1_label_loc)
ax.set_yticklabels(eq_1_label_val)

ax.set_xlabel('Sector Weight')

fig.savefig('weights_final_total.pdf', bbox_inches='tight')


Overall, it appears that we're disproportionately weighted into the financial sector. On the surface, this seems inadvisable, but it's tricky to quantify without more information.

In particular, the weight distribution per sector may just result from any number of other weight distribution decisions we have applied. For example, we may have distributed evenly over risk profiles (or geography, or market cap, or anything-you-can-imagine), and the per-sector distribution is just a consequence of that.

To go on from here, we should investigate other weight distributions and base our decision on the aggregate over all of them.

Interesting distributions could be:

  • Risk Class
  • Geography
  • Market Capitalization
  • Perceived Value (maybe we should invest a large fraction into companies we consider undervalued? Especially since we have a long horizon; probably a lot of manual work if we go beyond simple ratios like Assets/Share vs. Price/Share)
  • Volatility (mean dispersion about market indices)
  • Dividend History
  • External Ratings
  • Crazy Things (Tweet Frequency, Good/Bad News Frequency, Google Searches, FB Likes, Lawsuit History, ...)

Given such distributions, we could even look at historical data and try to find the weightings (for different distributions) that would have given the most consistent historical returns. This is essentially a parameter space sweep, and may require some computing power (and patience). Of course, this comes with the caveat of "Past performance is no guarantee of future returns".
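As a sketch of what such a sweep could look like (purely illustrative: the return series are random, and "most consistent" is taken here to mean lowest variance of portfolio returns):

```python
import numpy as np

rng = np.random.RandomState(42)

# Invented historical returns: 250 periods for 3 candidate weight distributions.
returns = rng.normal(loc=0.0005, scale=[0.01, 0.02, 0.015], size=(250, 3))

best_w, best_var = None, np.inf
# Brute-force sweep over mixing weights on a coarse grid (weights sum to 1).
for w0 in np.linspace(0, 1, 21):
    for w1 in np.linspace(0, 1 - w0, 21):
        w = np.array([w0, w1, 1.0 - w0 - w1])
        var = np.var(returns.dot(w))
        if var < best_var:
            best_w, best_var = w, var

print(best_w, best_var)
```

A real sweep would replace the variance criterion with whatever risk measure we settle on, and the random series with actual historical returns per candidate distribution.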