In this notebook, we explore the provided data. We'll also do some rudimentary checks to see whether any classification later on is valid.
%matplotlib inline
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'x-small'; mpl.rcParams['ytick.labelsize'] = 'x-small'
mpl.rcParams['axes.labelsize'] = 'x-small'; mpl.rcParams['axes.titlesize'] = 'small'
Let's load some data and have a look.
df_raw = pd.read_csv('case_data.csv')
df_raw.head(6)[['issuer','fi_1','fi_2','fi_3','eq_1','weight']]
Let's check if there are some duplicates.
duplicate_issuers = df_raw[df_raw.issuer.duplicated()].issuer.unique()
print(duplicate_issuers)
Argh... that's a lot. What do they look like?
df_raw[df_raw.issuer==duplicate_issuers[5]][['issuer', 'fi_1', 'fi_2', 'fi_3', 'eq_1', 'weight']]
Ok. In this example, we want to keep the classified row around. There are other kinds of duplicates as well, e.g. where the classified row carries no weight but an unclassified row for the same issuer does. For example:
df_raw[df_raw.issuer==duplicate_issuers[9]][['issuer', 'fi_1', 'fi_2', 'fi_3', 'eq_1', 'weight']]
We need to clean this up a bit...
df = nh.cleanup(df_raw.copy())
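As a rough sketch of what a cleanup along these lines might do (hypothetical; `nh.cleanup` itself is not shown here), one can propagate the weight within each issuer group and then keep the classified duplicate:

```python
import numpy as np
import pandas as pd

# Toy frame: issuer 'B' appears twice, once classified (eq_1 set) without
# a weight, once unclassified but carrying the weight.
df_raw = pd.DataFrame({
    'issuer': ['A', 'B', 'B'],
    'eq_1':   ['Energy', 'Utilities', np.nan],
    'weight': [10.0, np.nan, 5.0],
})

df = df_raw.copy()
# Broadcast the first non-null weight to all rows of the same issuer.
df['weight'] = df.groupby('issuer')['weight'].transform('first')
# Sort so classified rows (eq_1 set) come first, keep one row per issuer.
df = (df.sort_values('eq_1', na_position='last')
        .drop_duplicates('issuer')
        .sort_index())
```

After this, issuer 'B' survives as a single classified row that has inherited the weight from its unclassified duplicate.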
Let's segment the data. If the field eq_1 is NULL (pandas represents this as NaN), the row goes into the unclassified set (i.e., the set that is still to be classified). If eq_1 carries a classification, the row goes into the training set (i.e., the set that is already classified). We also check how many rows end up in each set.
df, str2int, int2str = nh.construct_string_to_int_maps(df)
df_classified = df[~pd.isnull(df.eq_1)].copy()
df_unclassified = df[pd.isnull(df.eq_1)].copy()
print("Total Rows        : %i" % len(df))
print("Classified Rows   : %i" % len(df_classified))
print("Unclassified Rows : %i" % len(df_unclassified))
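As a sketch of what `nh.construct_string_to_int_maps` presumably does (an assumption; the helper is not shown here), each classification column's distinct strings can be mapped to integer codes, with the inverse map kept for labeling plots later:

```python
import pandas as pd

df = pd.DataFrame({'fi_1': ['Gov', 'Corp', 'Gov'],
                   'eq_1': ['Energy', None, 'Energy']})

def make_maps(series):
    """Map the distinct (non-null) strings of a column to integer codes."""
    labels = sorted(series.dropna().unique())
    str2int = {s: i for i, s in enumerate(labels)}
    int2str = {i: s for s, i in str2int.items()}
    return str2int, int2str

s2i, i2s = make_maps(df['fi_1'])
df['fi_1'] = df['fi_1'].map(s2i)  # column is now integer-coded
```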
Let's check if there is some useless data (empty weights)...
print("Rows without Weights (Classified/Training Set): %i" % np.sum(pd.isnull(df_classified.weight)))
print("Rows without Weights (Unclassified Set)       : %i" % np.sum(pd.isnull(df_unclassified.weight)))
Ouch. It turns out that ~43% of the classified and ~60% of the unclassified rows have missing weights. For calculating the final weight distribution these rows are useless, so we will remove them for that step.
However, they are still worth keeping in the classified set to train our classifiers.
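The split described above can be sketched as follows (toy numbers; the real frame comes from case_data.csv): drop weightless rows only when aggregating weights, but keep every classified row for training.

```python
import numpy as np
import pandas as pd

df_classified = pd.DataFrame({'eq_1': ['Energy', 'Energy', 'Utilities'],
                              'weight': [2.0, np.nan, 1.0]})

# Weight distribution: weightless rows contribute nothing, so drop them.
dist = (df_classified.dropna(subset=['weight'])
                     .groupby('eq_1')['weight'].sum())

# Training set: all classified rows remain usable, weight or not.
n_train = len(df_classified)
```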
Now, recall that our job is to find classifiers that map (fi_1,fi_2,fi_3) to (eq_1). To get a handle on what this thing will look like, let's have a look at the mapping that is performed in the training set.
Let's have a look at mapping (fi_1) to (eq_1) first. Here, (fi_1) is on the vertical axis and is mapped onto (eq_1), which is shown on the horizontal axis. The size of each circle indicates how many data points follow that particular mapping.
ilevel = 1
df_counts = nh.count_fi_x_eq_x(df_classified, str2int, ilevel=ilevel)
ms, ns = nh.mkline(1.0, 4.0, 150, 32)
s = ms * df_counts.counts + ns
s[df_counts.counts == 0] = 0.0
fig, ax = plt.subplots(1,1)
ax.scatter(np.asarray(df_counts.eq_1), np.asarray(df_counts.fi_x),
           s=s**2, c=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
ax.set_xticks(list(int2str[3].keys()))
ax.set_xticklabels(list(int2str[3].values()), rotation=45)
ax.set_yticks(list(int2str[ilevel-1].keys()))
ax.set_yticklabels(list(int2str[ilevel-1].values()))
# ax.set_xlabel('EQ Classes')
# ax.set_ylabel('FI Classes')
fig.savefig('fi_1_eq_1.pdf', bbox_inches='tight')
pass
Unsurprisingly, this doesn't look very good. We only have three input classes that need to be mapped to eleven output classes.
Let's try to work with the (fi_2) level.
ilevel = 2
df_counts = nh.count_fi_x_eq_x(df_classified, str2int, ilevel=ilevel)
ms, ns = nh.mkline(1.0, 4.0, 150, 32)
s = ms * df_counts.counts + ns
s[df_counts.counts == 0] = 0.0
fig, ax = plt.subplots(1,1)
ax.scatter(np.asarray(df_counts.eq_1), np.asarray(df_counts.fi_x),
           s=s**2, c=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
ax.set_xticks(list(int2str[3].keys()))
ax.set_xticklabels(list(int2str[3].values()), rotation=45)
ax.set_yticks(list(int2str[ilevel-1].keys()))
ax.set_yticklabels(list(int2str[ilevel-1].values()))
# ax.set_xlabel('EQ Classes')
# ax.set_ylabel('FI Classes')
fig.savefig('fi_2_eq_1.pdf', bbox_inches='tight')
pass
Ok, that looks a little better. Using (fi_3), we can probably still improve things, but visualizing it this way becomes tricky because there are too many classes.
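When plotting gets unwieldy, the same mapping can be inspected as a table instead. A sketch (with hypothetical toy classes, since the real fi_3/eq_1 labels come from the data): cross-tabulate the counts and check how "pure" each fi_3 class is, i.e. what fraction of rows agree with their fi_3 class's majority eq_1 vote.

```python
import pandas as pd

df_classified = pd.DataFrame({
    'fi_3': ['Banks', 'Banks', 'Oil Svcs', 'Oil Svcs', 'Oil Svcs'],
    'eq_1': ['Financials', 'Financials', 'Energy', 'Energy', 'Financials'],
})

# Count how often each fi_3 class maps to each eq_1 class.
counts = pd.crosstab(df_classified.fi_3, df_classified.eq_1)

# Fraction of rows whose eq_1 matches the majority eq_1 of their fi_3 class;
# values near 1 suggest fi_3 is a strong predictor of eq_1.
purity = counts.max(axis=1).sum() / counts.values.sum()
```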
Finally, let's check whether the classified (training) and unclassified sets have approximately the same distribution over their sectors (we show fi_2 here). If they do not, any classifier we train on the training set is unlikely to perform well on the unclassified set.
ilevel = 2
df_counts_01 = nh.count_fi_x(df_classified, str2int, ilevel=ilevel)
df_counts_02 = nh.count_fi_x(df_unclassified, str2int, ilevel=ilevel)
fig, axarr = plt.subplots(1,2)
axarr[0].barh(df_counts_01.fi_x - 30,
              df_counts_01.counts,
              height=60)
axarr[1].barh(df_counts_02.fi_x - 30,
              df_counts_02.counts,
              height=60)
for ax in axarr:
    ax.set_yticks(list(int2str[ilevel-1].keys()))
    ax.set_yticklabels(list(int2str[ilevel-1].values()))
plt.setp(axarr[1].get_yticklabels(), visible=False)
axarr[0].set_title('Classified (Training) Set')
axarr[1].set_title('Unclassified Set')
axarr[0].set_xlabel('Occurrences')
axarr[1].set_xlabel('Occurrences')
fig.savefig('fi_2_classified_unclassified.pdf', bbox_inches='tight')
pass
Ok, they look reasonably similar. That is likely something we can work with.
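"Sort of similar" can be made a bit more quantitative. One simple option (a sketch with made-up counts, not the actual data) is the total variation distance between the two normalized sector distributions:

```python
import numpy as np

# Counts per fi_2 sector in each set (toy numbers for illustration).
classified   = np.array([120, 80, 40])
unclassified = np.array([150, 90, 60])

p = classified / classified.sum()
q = unclassified / unclassified.sum()

# Total variation distance: 0 means identical distributions, 1 means disjoint.
tvd = 0.5 * np.abs(p - q).sum()
```

Small values of `tvd` support the visual impression that training on the classified set should transfer to the unclassified set.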