In [1]:

```
%matplotlib inline
```

In [2]:

```
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
import time
```

In [3]:

```
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'small'; mpl.rcParams['ytick.labelsize'] = 'small'
mpl.rcParams['axes.labelsize'] = 'small'; mpl.rcParams['axes.titlesize'] = 'small'
mpl.rcParams['legend.fontsize'] = 'small'
```

In [4]:

```
df = pd.read_csv('case_data.csv')
df = nh.cleanup(df)
df, str2int, int2str = nh.construct_string_to_int_maps(df)
```

In [5]:

```
%%time
df = nh.generate_feature_vectors(df, int2str)
```

In [6]:

```
df_classified_raw = df[~pd.isnull(df.eq_1)].copy()
df_unclassified_raw = df[pd.isnull(df.eq_1)].copy()
```

Finally, we use the decision tree to classify `eq_1` in the thus-far unclassified set.

In [7]:

```
df_trained_raw = nh.classify_data(df_classified_raw, df_unclassified_raw, int2str, predictor='tree', ilevel=3)
```

Recall that about half the rows had no actual weight in the data.

Now that we've trained, let's discard those rows.

In [8]:

```
df_classified = df_classified_raw[~pd.isnull(df_classified_raw.weight)]
df_trained = df_trained_raw[~pd.isnull(df_trained_raw.weight)]
```

Do some more data mangling to get data into plottable form.

In [9]:

```
```
weights_classified = np.zeros(len(int2str[3]), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_classified[ii] = np.sum(df_classified[df_classified.eq_1_int == eq_1_int].weight)
# print weights_classified
```

In [10]:

```
```
weights_trained = np.zeros(len(int2str[3]), dtype=np.float64)
for ii, eq_1_int in enumerate(int2str[3].keys()):
    weights_trained[ii] = np.sum(df_trained[df_trained.eq_1_int == eq_1_int].weight)
# print weights_trained
```

Let's check whether the weights add up to unity.

In [11]:

```
print(np.sum(weights_classified))
print(np.sum(weights_trained))
print(np.sum(weights_classified + weights_trained))
```

In [12]:

```
fig, ax = plt.subplots(1, 1)
locs = np.asarray(int2str[3].keys(), dtype=np.float64)
ax.barh(locs, weights_classified, height=0.4,
        color=plt.rcParams['axes.color_cycle'][1],
        label='Pre-Classified (Training Data)')
ax.barh(locs - 0.4, weights_trained, height=0.4,
        color=plt.rcParams['axes.color_cycle'][0],
        label='Newly Classified')
ax.set_yticks(locs)
ax.set_yticklabels(int2str[3].values())
ax.legend(loc='upper right')
fig.savefig('weights_final_compare.pdf', bbox_inches='tight')
pass
```

From eyeballing, the weight distributions between the training and newly classified sets look approximately similar. Although we may intuitively expect them to be, there is actually no reason for this! The only thing we expect to be similar is the distribution of features and labels (which we have shown in the other two parts of this series).
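To go beyond eyeballing, we could quantify how similar the two weight distributions are, e.g. with the cosine similarity of the two weight vectors. A minimal sketch (the made-up `w1`/`w2` stand in for `weights_classified` and `weights_trained` from above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two non-negative weight vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two made-up sector weight vectors for illustration:
w1 = np.array([0.4, 0.3, 0.2, 0.1])
w2 = np.array([0.35, 0.35, 0.2, 0.1])
print(cosine_similarity(w1, w2))  # close to 1.0, i.e. very similar
```

A value near 1 would confirm the visual impression; but again, similarity here is an empirical observation, not something the training procedure guarantees.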

Now, let's return to the weight distribution and make it pretty: stack both sets, rescale so the weights sum to unity, order the figure by increasing weight, and get rid of that unsightly NaN.

In [13]:

```
weights_total = weights_trained + weights_classified
weights_total /= np.sum(weights_total)
argidx = np.argsort(weights_total)
weights_total = weights_total[argidx][1:]
eq_1_label_loc = np.array(range(len(weights_total)))+1
eq_1_label_val = np.asarray(int2str[3].values())[argidx][1:]
```

In [14]:

```
fig, ax = plt.subplots(1,1)
ax.barh(eq_1_label_loc-0.30, weights_total, height=0.6)
ax.set_yticks(eq_1_label_loc)
ax.set_yticklabels(eq_1_label_val)
ax.set_ylim([0,11])
ax.set_xlim([0,0.5])
ax.set_xlabel('Sector Weight')
fig.savefig('weights_final_total.pdf', bbox_inches='tight')
pass
```

Overall, it appears that we're disproportionately weighted into the financial sector. On the surface, this seems inadvisable, but it's tricky to quantify without more information.

In particular, the weight distribution per sector may just result from any number of other weight distribution decisions we have applied. For example, we may have distributed evenly over risk profiles (or geography, or market cap, or anything-you-can-imagine), and the per-sector distribution is just a consequence of that.
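To illustrate that point: suppose we had allocated evenly over risk profiles, each with its own sector mix. The per-sector weights then fall out as the marginal of that joint allocation. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical: three risk profiles, allocated evenly.
risk_profile_weights = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])

# Hypothetical sector mix within each risk profile (rows sum to 1).
sector_mix_per_profile = np.array([
    [0.6, 0.3, 0.1],   # e.g. low risk: heavy in one sector
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
])

# Per-sector weights are the marginal over risk profiles.
sector_weights = np.dot(risk_profile_weights, sector_mix_per_profile)
print(sector_weights)  # uneven sector weights despite an even risk allocation
```

So an uneven sector distribution on its own tells us little about whether the allocation is sensible.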

To go from here, we should investigate other weight distributions and base our decision on the aggregate over all of them.

Interesting distributions could be:

- Risk Class
- Geography
- Market Capitalization
- Perceived Value (maybe we should invest a large fraction into companies we consider undervalued? Especially since we have a long horizon. Probably a lot of manual work if we want to go beyond simple metrics like Assets/Share vs. Price/Share.)
- Volatility (mean dispersion about market indices)
- Dividend History
- External Ratings
- Crazy Things (Tweet Frequency, Good/Bad News Frequency, Google Searches, FB Likes, Lawsuit History, ...)

Given such distributions, we could even look at historical data and try to find the weightings (for the different distributions) that would have given the most consistent historical returns. This is essentially a parameter space sweep, and may require some computing power (and patience). Of course, this comes with the caveat of "past performance is no guarantee of future returns".
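A crude sketch of such a sweep: score random candidate weightings on (here, synthetic) historical per-sector returns, taking "consistency" as mean return minus return dispersion. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)

n_sectors, n_periods, n_candidates = 5, 120, 1000

# Synthetic monthly per-sector returns, purely for illustration.
sector_returns = rng.normal(0.005, 0.04, size=(n_periods, n_sectors))

best_score, best_weights = -np.inf, None
for _ in range(n_candidates):
    # Random candidate weighting on the simplex (non-negative, sums to 1).
    w = rng.dirichlet(np.ones(n_sectors))
    portfolio_returns = sector_returns.dot(w)
    # "Consistency" score: reward mean return, penalize dispersion.
    score = portfolio_returns.mean() - portfolio_returns.std()
    if score > best_score:
        best_score, best_weights = score, w

print(best_weights)
```

A real sweep would use actual historical returns, sweep over the other distributions (geography, risk class, ...) as well, and probably a smarter optimizer than random search; the structure, though, would be the same.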