Part 01 - Inspect Data

In this notebook, we explore the provided data. We'll also run some rudimentary sanity checks to gauge whether the classification we attempt later on can be expected to work.

In [1]:
%matplotlib inline
In [2]:
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import NBIM_Helpers as nh
In [3]:
plt.style.use('ggplot')
mpl.rcParams['xtick.labelsize'] = 'x-small'; mpl.rcParams['ytick.labelsize'] = 'x-small'
mpl.rcParams['axes.labelsize'] = 'x-small'; mpl.rcParams['axes.titlesize'] = 'small'

Data Quality, Duplicates, Unclassified vs. Classified

Let's load some data and have a look.

In [4]:
df_raw = pd.read_csv('case_data.csv')
df_raw.head(6)[['issuer','fi_1','fi_2','fi_3','eq_1','weight']]
Out[4]:
    issuer  fi_1                    fi_2                   fi_3             eq_1               weight
0   A       INDUSTRIAL              CONSUMER NON-CYCLICAL  HEALTHCARE       Industrials        0.00021
1   AA      INDUSTRIAL              BASIC INDUSTRY         METALS & MINING  Basic Materials    NaN
2   AACE    FINANCIAL INSTITUTIONS  FINANCIAL OTHER        FINANCIAL OTHER  NaN                NaN
3   AAFFP   INDUSTRIAL              CONSUMER CYCLICAL      RETAILERS        NaN                NaN
4   AAL     INDUSTRIAL              TRANSPORTATION         AIRLINES         Consumer Services  NaN
5   AAL     INDUSTRIAL              TRANSPORTATION         AIRLINES         NaN                0.00071

Let's check whether there are duplicate issuers.

In [5]:
duplicate_issuers = df_raw[df_raw.issuer.duplicated()].issuer.unique()
print duplicate_issuers
['AAL' 'ABBNVX' 'ABIBB' 'ABXCN' 'ACACN' 'ACAFP' 'ACGL' 'ACHMEA' 'ACM'
 'ADENVX' 'AEGON' 'AEP' 'AES' 'AESGEN' 'AET' 'AGN' 'AIFP' 'AIG' 'AKZANA'
 'ALACN' 'ALB' 'ALBLLC' 'ALVGR' 'AM' 'AMSSM' 'AMXLMM' 'ANZ' 'AON' 'APA'
 'APC' 'ASSGEN' 'ATI' 'AVLN' 'AXASA' 'AXLL' 'AXP' 'AZN' 'BA' 'BAC' 'BACR'
 'BALN' 'BASGR' 'BAYNGR' 'BBT' 'BCOCPE' 'BCP' 'BHARTI' 'BK' 'BMC' 'BMO'
 'BNP' 'BNS' 'BOCOM' 'BRK' 'BRKHEC' 'BRP' 'BYD' 'C' 'CAT' 'CBAAU' 'CBGLN'
 'CCI' 'CEMEX' 'CFG' 'CHINAM' 'CINDBK' 'CITCON' 'CLNVX' 'CM' 'CMA' 'CMCSA'
 'CMS' 'CMZB' 'CNP' 'COF' 'COMM' 'CONGR' 'COP' 'CTIH' 'CTL' 'CVC' 'D'
 'DAIGR' 'DAR' 'DB' 'DBSSP' 'DE' 'DELL' 'DFS' 'DGFP' 'DLNA' 'DLPH' 'DOW'
 'DPWGR' 'DT' 'DTE' 'DUK' 'DVN' 'DYN' 'ECACN' 'EEBCB' 'EIX' 'ELDOR'
 'EMBRBZ' 'ENELIM' 'ENIIM' 'ERSTBK' 'ES' 'ESL' 'ESRX' 'ESV' 'ETP' 'ETR'
 'EUROCA' 'EXC' 'EXXI' 'F' 'FCAIM' 'FCGNZ' 'FCX' 'FE' 'FFHCN' 'FGBUH'
 'FITB' 'FNCIM' 'FNF' 'FTR' 'GE' 'GEF' 'GFSLN' 'GHC' 'GLENLN' 'GM' 'GRNCH'
 'GS' 'GT' 'GWOCN' 'GXP' 'GZRFPR' 'HANRUE' 'HANSEN' 'HAR' 'HBAN' 'HCA'
 'HOLNVX' 'HPQ' 'HSBC' 'ICE' 'ICICI' 'INTNED' 'IRM' 'ISPIM' 'JPM' 'KEY'
 'KMI' 'KO' 'KRFT' 'LATAIR' 'LDOS' 'LEN' 'LGEN' 'LIFP' 'LINE' 'LINGR'
 'LLOYDS' 'LVLT' 'LYB' 'MEOGR' 'MET' 'MFCCN' 'MIICF' 'MITCO' 'MIZUHO' 'MKL'
 'MOLHB' 'MQGAU' 'MRK' 'MRKGR' 'MRWLN' 'MUFG' 'MUNRE' 'NAB' 'NACN' 'NATMUT'
 'NBDCN' 'NGGLN' 'NOC' 'NOMURA' 'NOVNVX' 'NSANY' 'NTRS' 'NTT' 'OI' 'OIBRBZ'
 'ORIG' 'PACD' 'PACLIF' 'PAH' 'PBCT' 'PCG' 'PEP' 'PEUGOT' 'PFE' 'PFG' 'PG'
 'PNC' 'PRGO' 'PRU' 'PSON' 'PTTEPT' 'QBEAU' 'RAI' 'RBS' 'REIUCN' 'RENAUL'
 'RENEPL' 'RF' 'RILIN' 'RWE' 'RY' 'S' 'SABLN' 'SANFP' 'SBAC' 'SCG' 'SE'
 'SEP' 'SESGFP' 'SGMS' 'SKYLN' 'SO' 'SOLBBB' 'SRE' 'SSE' 'SSELN' 'STANLN'
 'STI' 'STSP' 'STT' 'SUCN' 'SUMIBK' 'SUNTOR' 'T' 'TALANX' 'TCGLN' 'TD'
 'TITIM' 'TKAAV' 'TOTAL' 'TOYOTA' 'TRV' 'TSCOLN' 'TTMTIN' 'TWC' 'TWX' 'TXN'
 'TXT' 'UAL' 'UBS' 'UCGIM' 'UNANA' 'UNM' 'UPS' 'USB' 'VAHAU' 'VALEBZ'
 'VOTORA' 'VRXCN' 'VZ' 'WBA' 'WEC' 'WFC' 'WPPLN' 'WPZ' 'WR' 'WSH' 'WSTP'
 'WYNN' 'XEL' 'XOM' 'Y' 'ZURNVX']

Argh... that's a lot. What do they look like?

In [6]:
df_raw[df_raw.issuer==duplicate_issuers[5]][['issuer', 'fi_1', 'fi_2', 'fi_3', 'eq_1', 'weight']]
Out[6]:
    issuer  fi_1                    fi_2       fi_3            eq_1        weight
31  ACAFP   FINANCIAL INSTITUTIONS  INSURANCE  LIFE INSURANCE  NaN         0.00030
32  ACAFP   FINANCIAL INSTITUTIONS  BANKING    BANKING         Financials  0.00161
33  ACAFP   FINANCIAL INSTITUTIONS  BANKING    BANKING         NaN         0.00275

Ok. In this example, we want to keep the classified row. There are other kinds of duplicates as well, e.g. where the classified row carries no weight while an unclassified row for the same issuer does. For example:

In [7]:
df_raw[df_raw.issuer==duplicate_issuers[9]][['issuer', 'fi_1', 'fi_2', 'fi_3', 'eq_1', 'weight']]
Out[7]:
    issuer  fi_1        fi_2               fi_3                        eq_1         weight
54  ADENVX  INDUSTRIAL  CONSUMER CYCLICAL  CONSUMER CYCLICAL SERVICES  NaN          0.00026
55  ADENVX  INDUSTRIAL  CONSUMER CYCLICAL  CONSUMER CYCLICAL SERVICES  Industrials  0.00006

We need to clean this up a bit...

In [8]:
df = nh.cleanup(df_raw.copy())
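
As an aside, here is a minimal, hypothetical sketch of the kind of deduplication nh.cleanup might perform. The preference order is an assumption; the actual logic lives in NBIM_Helpers:

# Hypothetical sketch, not the actual nh.cleanup implementation.
# Assumption: for each duplicated issuer we prefer rows that carry an eq_1
# classification and, among those, rows that carry a weight. Sorting puts
# NaNs last, so the preferred row ends up first within each issuer group.
def cleanup_sketch(df):
    df = df.sort_values(by=['issuer', 'eq_1', 'weight'])
    return df.drop_duplicates(subset='issuer', keep='first')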

Let's segment the data. If the field eq_1 is NULL (pandas represents this as NaN), the row goes into the unclassified set (i.e., the set that is still to be classified). If eq_1 carries some classification, the row goes into the training set (i.e., the set that is already classified). We also check how many rows end up in each set.

In [9]:
df, str2int, int2str = nh.construct_string_to_int_maps(df)
df_classified   = df[~pd.isnull(df.eq_1)].copy()
df_unclassified = df[pd.isnull(df.eq_1)].copy()

print "Total Rows        : %i" % len(df)
print "Classified Rows   : %i" % len(df_classified)
print "Unclassified Rows : %i "% len(df_unclassified)
Total Rows        : 2969
Classified Rows   : 1465
Unclassified Rows : 1504 
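
The helper construct_string_to_int_maps builds lookup tables between string labels and integer codes for each classification level (levels 0-2 for fi_1 through fi_3, level 3 for eq_1); we use the codes as plot coordinates below. A minimal sketch, assuming consecutive integer codes (judging from the bar-chart geometry further down, the actual helper seems to space the codes further apart):

# Hypothetical sketch of the label <-> integer maps; the real helper in
# NBIM_Helpers also rewrites the dataframe columns to the integer codes.
def construct_maps_sketch(df):
    str2int, int2str = {}, {}
    for ilevel, col in enumerate(['fi_1', 'fi_2', 'fi_3', 'eq_1']):
        labels = sorted(df[col].dropna().unique())
        str2int[ilevel] = dict((lab, i) for i, lab in enumerate(labels))
        int2str[ilevel] = dict((i, lab) for i, lab in enumerate(labels))
    return str2int, int2str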

Let's check whether there is useless data (rows with empty weights)...

In [10]:
print "Rows without Weights (Classified/Training Set): %i" % np.sum(pd.isnull(df_classified.weight))
print "Rows without Weights (Unclassified Set)       : %i" % np.sum(pd.isnull(df_unclassified.weight))
Rows without Weights (Classified/Training Set): 638
Rows without Weights (Unclassified Set)       : 892

Whoa, it turns out that ~44% of the classified and ~59% of the unclassified rows have missing weights. These rows are useless for computing the final weight distribution, so we will drop them when we get to that.

However, they are still useful to keep around in the classified set to train our classifiers.
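
These fractions follow directly from the counts above (638/1465 ≈ 43.6%, 892/1504 ≈ 59.3%), but we can also compute them in place:

print "Missing weights (classified)  : %.1f%%" % \
    (100.0 * np.mean(pd.isnull(df_classified.weight)))
print "Missing weights (unclassified): %.1f%%" % \
    (100.0 * np.mean(pd.isnull(df_unclassified.weight)))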

Mapping

Now, recall that our job is to find classifiers that map (fi_1,fi_2,fi_3) to (eq_1). To get a handle on what this mapping looks like, let's inspect how it plays out in the training set.

Let's have a look at mapping (fi_1) to (eq_1) first. Here, (fi_1) is on the vertical axis and is mapped onto (eq_1), which is shown on the horizontal axis. The size of each circle indicates how many data points exhibit that particular mapping.
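
As an aside, here is a rough, hypothetical sketch of what the counting helper might do (the actual implementation is in NBIM_Helpers; nh.mkline presumably returns the slope and intercept of a line that maps counts onto marker sizes). The real helper appears to also enumerate (fi_x, eq_1) pairs with zero counts, which the plotting code below blanks out:

# Hypothetical sketch of nh.count_fi_x_eq_x, not the actual implementation.
# Assumption: the columns already hold the integer codes built above, and
# the result has one row per observed (fi_x, eq_1) pair with its count.
def count_fi_x_eq_x_sketch(df, ilevel=1):
    fi_col = 'fi_%i' % ilevel
    counts = df.groupby([fi_col, 'eq_1']).size().reset_index(name='counts')
    return counts.rename(columns={fi_col: 'fi_x'})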

In [11]:
ilevel = 1

df_counts = nh.count_fi_x_eq_x(df_classified, str2int, ilevel=ilevel)
ms, ns = nh.mkline(1.0, 4.0, 150, 32)
s = ms * df_counts.counts + ns
s[df_counts.counts == 0] = 0.0

fig, ax = plt.subplots(1,1)
ax.scatter(np.asarray(df_counts.eq_1), np.asarray(df_counts.fi_x),
           s=s**2, c=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])

ax.set_xticks(int2str[3].keys())
ax.set_xticklabels(int2str[3].values(), rotation=45)

ax.set_yticks(int2str[ilevel-1].keys())
ax.set_yticklabels(int2str[ilevel-1].values())

# ax.set_xlabel('EQ Classes')
# ax.set_ylabel('FI Classes')

fig.savefig('fi_1_eq_1.pdf', bbox_inches='tight')

pass

Unsurprisingly, this doesn't look very good. We only have three input classes that need to be mapped to eleven output classes.

Let's try to work with the (fi_2) level.

In [12]:
ilevel = 2

df_counts = nh.count_fi_x_eq_x(df_classified, str2int, ilevel=ilevel)
ms, ns = nh.mkline(1.0, 4.0, 150, 32)
s = ms * df_counts.counts + ns
s[df_counts.counts == 0] = 0.0

fig, ax = plt.subplots(1,1)
ax.scatter(np.asarray(df_counts.eq_1), np.asarray(df_counts.fi_x),
           s=s**2, c=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])

ax.set_xticks(int2str[3].keys())
ax.set_xticklabels(int2str[3].values(), rotation=45)

ax.set_yticks(int2str[ilevel-1].keys())
ax.set_yticklabels(int2str[ilevel-1].values())

# ax.set_xlabel('EQ Classes')
# ax.set_ylabel('FI Classes')

fig.savefig('fi_2_eq_1.pdf', bbox_inches='tight')

pass

Ok, that looks a little better. Using (fi_3), we can probably improve things further, but visualizing that level is tricky because there are too many classes.
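
One way to quantify this impression without a plot (an extra check, not part of the pipeline): for each fi level, measure how often a class maps to a single dominant eq_1 class.

# Count-weighted share of rows falling into the most frequent eq_1 class of
# their fi class ("purity"); 1.0 would mean the fi level determines eq_1.
for col in ['fi_1', 'fi_2', 'fi_3']:
    grouped = df_classified.groupby(col).eq_1
    purity = grouped.agg(lambda s: s.value_counts().iloc[0] / float(len(s)))
    counts = grouped.size()
    print "%s: weighted mean purity %.2f" % \
        (col, np.average(purity, weights=counts))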

Class Distribution

Finally, let's check whether the classified (training) set and the unclassified set have approximately the same distribution across sectors (we show fi_2 here). If they do not, a classifier trained on the training set is unlikely to perform well on the unclassified set.

In [13]:
ilevel = 2
df_counts_01 = nh.count_fi_x(df_classified, str2int, ilevel=ilevel)
df_counts_02 = nh.count_fi_x(df_unclassified, str2int, ilevel=ilevel)

fig, axarr = plt.subplots(1,2)
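# The fi_x integer codes appear to be spaced 60 apart, hence the -30 offset
# and the bar height of 60 to centre each bar on its class code.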
axarr[0].barh(df_counts_01.fi_x-30, \
              df_counts_01.counts, \
              height=60)
axarr[1].barh(df_counts_02.fi_x-30, \
              df_counts_02.counts, \
              height=60)

for ax in axarr:
    ax.set_yticks(int2str[ilevel-1].keys())
    ax.set_yticklabels(int2str[ilevel-1].values())
plt.setp(axarr[1].get_yticklabels(), visible=False)

axarr[0].set_title('Classified (Training) Set')
axarr[1].set_title('Unclassified Set')

axarr[0].set_xlabel('Occurrences')
axarr[1].set_xlabel('Occurrences')

fig.savefig('fi_2_classified_unclassified.pdf', bbox_inches='tight')

pass

Ok, the two distributions look broadly similar. That's likely something we can work with.
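
For a slightly more quantitative check (again an extra sketch, assuming df_counts_01 and df_counts_02 enumerate the same fi_x codes in the same order), we can compare the normalized class frequencies directly:

# Normalized fi_2 frequencies in each set; a small maximum difference
# supports the "broadly similar" impression from the plot.
p = df_counts_01.counts / float(df_counts_01.counts.sum())
q = df_counts_02.counts / float(df_counts_02.counts.sum())
print "Max abs difference in class frequency: %.3f" % np.max(np.abs(p - q))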