
Wednesday, January 28, 2015

Loan Survivor

This is a republication of a blog post from my old Wordpress site. Some formatting may be broken. Originally published March 30, 2014.

Introduction

Pretend you’re a bank, and you’re about to loan some money to a new client. Of course you’ll try to predict whether your new client will default, but wouldn’t it be great if you could predict the severity of your losses as well? Will this new customer merely default—or default hard?

The Loan Default Prediction Challenge poses just this question. To help, we’re given reams of anonymized data on hundreds of thousands of debtors, and a single value to predict: lender loss.

Foremost among many challenges was simply cleaning up the data: mixed types, missing values, and highly correlated columns abound. In particular, many columns shift wildly between the training data and the test data, rendering some of the insight gleaned from training useless! To top it all off, all the metadata is stripped from the columns. We have no idea what each column even represents.

Nevertheless, though I lagged behind the front-runners, I was able to beat the benchmark solution. I also used this project to acquaint myself with some contemporary libraries in the Python data-mining stack: pandas, scikit-learn, IPython, and matplotlib.

Cleaning up the data, broad trends

That this section is the longest is no mistake — the data truly was a mess!

The training set weighs in at over one hundred thousand rows (debtors) and over seven hundred columns. The first thing to notice about the loss data is that most debtors don’t default: about 90% of loss values are zero.

Logarithmic histogram of losses. Loss values range from zero to 100. Note that two orders of magnitude more debtors cause no loss compared to the next most common loss values.

Next, we have to reckon with the mixed data types. While most columns appear to be numerical, some may be categorical — that is, they might be codes representing some property, so the particular numbers representing them are meaningless in relation to one another.

To see what I mean, consider the histogram below. It shows all the data columns which contain only integers, binned by the number of unique values. Recall there are over a hundred thousand rows, meaning potentially over a hundred thousand unique values. You can see below, however, that most of the integer columns contain just a few thousand unique values or fewer.

Histogram of number of unique values in integer columns. Note most integer columns have relatively few unique values.

The loss column — what we are trying to predict — is an integer column ranging from 0 to 100, and most certainly numeric. Therefore, I only consider columns with fewer than 100 unique values as candidate categorical columns (the cutoff is somewhat arbitrary, of course, but I needed to pare it down somehow).

Histogram of unique integer values for the low-count columns — specifically, those containing fewer unique values than the ‘loss’ column.
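
Counting unique values per column is a one-liner in pandas. A minimal sketch of that step (df here stands for the raw training DataFrame, before any other preprocessing):

# df: the raw training DataFrame, assumed already loaded with pd.read_csv.
# Count distinct values in each integer column and keep only the columns
# with fewer distinct values than 'loss' itself.
int_cols = df.select_dtypes(include=['int64'])
unique_counts = int_cols.nunique()
candidate_cols = unique_counts[unique_counts < unique_counts['loss']].index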

To choose which of these columns were categorical and which were numerical, I applied a simple test. I looked through them one by one, and if the histogram of values had a plausible-looking distribution for a random variable, I assumed it was numeric. If the distribution was noisy and erratic-looking, I assumed it was categorical. The figure below illustrates:

Categorical and Continuous variables. Note that f293 has the look of a Pareto distribution, while f403 has no obvious distribution.

Categorical variables need to be encoded in such a way that they are distinct, but not ordinal (their numerical values must not bias any downstream calculation). One way to do this is to provide a column for each possible value of the categorical variable, so that each new column is simply binary. This is called one-hot encoding, and the scikit-learn library provides a function to take care of this for us.
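
For instance, a minimal sketch with scikit-learn’s OneHotEncoder (the column name f403 is borrowed from the figure above, and the values are toy data):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat = pd.DataFrame({'f403': [2, 7, 2, 5]})   # toy categorical column
enc = OneHotEncoder()                        # one binary column per observed value
onehot = enc.fit_transform(cat.values)       # sparse matrix; rows like [1, 0, 0]
print(onehot.toarray())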

Far more insidious than the mixed data types, however, is the huge shift in behavior of some columns between the training data (rows where we are provided with resulting loss) and competition data (rows where we have to guess the loss). Take for example column "f554":

Training vs. Competition Data for column f554

This problem is systematic in the dataset, as the histogram below shows. Taking the ratio of the means of the training data versus the competition data, we can see that the mean of many columns shifts by up to 100%! Obviously, this makes prediction much more difficult. Unfortunately, I didn’t discover this quirk of the dataset until very late in the competition. It explains, however, why my models that did very well on the training data (on a sample held back for testing, of course) did rather poorly on the competition set.

Distribution of shift ratios between training and competition data. Note that many columns are far from unity, implying large underlying shifts in the data.
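
The check itself is simple. A sketch (assuming, as in the good_cols function in the source code below, that the training and competition rows are stacked in a single DataFrame called both, with the training rows first):

def shift_ratios(both, train_size):
    """Ratio of competition-set mean to training-set mean, per column."""
    train_mean = both.iloc[:train_size].mean()
    comp_mean = both.iloc[train_size:].mean()
    return comp_mean / train_mean   # values far from 1.0 flag shifted columns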

Finally, many of the columns are highly correlated, as you can see in the covariance matrix below. For all subsequent machine learning, I used a transformed, whitened copy of the data (specifically, scikit-learn’s RandomizedPCA function).

Covariance Matrix. Note many data are highly correlated.
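
A sketch of that step (RandomizedPCA has since been deprecated; in newer scikit-learn releases the equivalent is PCA with svd_solver='randomized'. Here, preprocessed stands for the cleaned DataFrame returned by the preprocess function in the source code below):

from sklearn.decomposition import PCA

# Whitening decorrelates the columns and gives them unit variance,
# which tames the highly correlated features seen in the covariance matrix.
pca = PCA(whiten=True, svd_solver='randomized', random_state=0)
X_white = pca.fit_transform(preprocessed.values)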

Machine Learning

Since this contest was scored by mean absolute error, and most of the losses are zero, I thought it important to use both a classifier and a regressor to predict loss. This is because even predicting all losses to be zero gives you an MAE of less than one — in fact, the benchmark solution is just that. Hence, if you are regularly predicting zero-loss rows to be greater than zero, your MAE will rise very quickly.
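
A toy illustration of the point (invented numbers, just to show the arithmetic):

import numpy as np
from sklearn.metrics import mean_absolute_error

# Nine debtors who cause no loss and one who loses 8 units:
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 8])
print(mean_absolute_error(y_true, np.zeros_like(y_true)))   # 0.8 -- hard to beat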

The best combination I could find was a boosted classifier (AdaBoostClassifier) with a decision tree as the base classifier to tell me which rows would default, and then a support vector regressor (SVR) to predict how much. This did great on the training set, achieving MAEs in the low 0.60s (on the sample held back for testing). However, due to the data shifts noted above, it regularly flopped on the competition data. In fact, I was only able to beat the benchmark using a different strategy (see conclusion).
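
A sketch of that two-stage setup (the depth, C, epsilon, and gamma values mirror the settings discussed around the score grid below; X_train, X_test, and y_train are assumed to come from a train/test split of the whitened data):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR

# Stage 1: boosted decision trees decide default vs. no default.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5))
clf.fit(X_train, y_train > 0)

# Stage 2: an SVR predicts loss severity, trained only on defaulted rows.
reg = SVR(C=10, epsilon=0.01, gamma=0.001)
reg.fit(X_train[y_train > 0], y_train[y_train > 0])

# Predict zero loss unless the classifier says "default".
y_pred = clf.predict(X_test) * reg.predict(X_test)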

The dominant parameters are the maximum depth of the trees in the classifier and the C parameter of the SVR, which controls how heavily the SVR penalizes training points that fall outside its error tolerance. Below is a grid of the MAE for various values of C and max depth. I used scikit-learn’s train_test_split function to create training and test sets.

Score Grid. All other parameters used default values, except epsilon = 0.01 and gamma = 0.001 for the SVR. The classifier was fit on binarized loss data, and the regressor was fit only on rows where loss > 0.
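
The grid itself can be reproduced with a couple of nested loops. A sketch (the specific depth and C values here are illustrative; the source code below uses the older cross_validation module, while newer scikit-learn exposes the same split function under model_selection):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hold back a test split, then sweep tree depth and SVR C.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for max_depth in (3, 5, 10, 20):
    for C in (1, 10, 100):
        clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=max_depth))
        clf.fit(X_train, y_train > 0)
        reg = SVR(C=C, epsilon=0.01, gamma=0.001)
        reg.fit(X_train[y_train > 0], y_train[y_train > 0])
        y_pred = clf.predict(X_test) * reg.predict(X_test)
        print(max_depth, C, mean_absolute_error(y_test, y_pred))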

We can see that the sweet spot seems to be around max depth = 5 and C = 10. The tree classifier is very likely overfitting at depths of 10 and 20. One of the features of the decision tree classifier is that it gives us a good indication of which columns are important: data is first split on columns near the “trunk.” If we look at the most important features from our classifier, decent separation of the data is already evident. Note also that we are not looking at the first or second principal component here; we’re pretty far down the line, at the 655th and 645th components!

Scatterplots of ground truth and predicted default on most important features of the transformed dataset as determined by the classifier. Data drawn from test set.
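
Reading that ranking off the fitted classifier is straightforward (a sketch; clf is the fitted AdaBoost classifier from the sketch above, whose feature_importances_ attribute aggregates the importances of the boosted trees):

import numpy as np

# Indices of the PCA components, most important first.
ranking = np.argsort(clf.feature_importances_)[::-1]
print(ranking[:2])   # the two components the trees split on most heavily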

And how about predicting the actual loss values? How does our SVR do? It does alright, but could definitely use some work. We can see good correlation below, for data that isn’t mis-classified. There is definitely room for improvement here, however, as we can see in the distribution. The SVR misses out on high loss values, and even predicts some losses to be negative (we wish!). Not to mention that we miss out on some segments of the distribution even at low values.

Scatter plot of predicted vs. actual values. Some values are predicted to be negative, so a floor should be enforced. The nearly solid lines of points along the axes are mis-classified rows.
Distributions of predicted loss values super-imposed on actual loss distribution
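
Enforcing that floor is a one-liner (a sketch; y_pred is the combined prediction from above):

import numpy as np

# Real losses are never negative, so clamp the regressor's output at zero.
y_pred = np.maximum(y_pred, 0)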

Nevertheless, we have solid parameters, the data look well-partitioned by the classifier, and the regressor is workable. An MAE of 0.63 is a sizeable improvement on the benchmark solution (which scores around 0.83), so I must be in good shape to improve from here, right?

Conclusions

Wrong. Due to the big shifts between training and competition data noted above, models which performed well on the training data often did worse than the benchmark on competition data. I was still able to beat the benchmark using a RandomForestClassifier, again with a max depth of 5.
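
A sketch of that final model (n_estimators and random_state are my placeholders; everything else is left at its defaults, with the same zero/non-zero target as before):

from sklearn.ensemble import RandomForestClassifier

# A shallow forest proved more robust to the train/competition shift here.
clf = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=0)
clf.fit(X_train, y_train > 0)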

Had I realized these discrepancies earlier on, I might have been able to eliminate the irrelevant or spurious features. Nonetheless, the knowledge gains were substantial: working through the cleaning, transforming, fitting, and re-fitting process exposed me to common pitfalls of this type of analysis, to the ins and outs of the scikit-learn library, and, more broadly, to the machine learning techniques themselves. For instance, I found a support vector classifier to be almost useless in this project because (I suspect) the class boundaries were so noisy — the tree-type classifiers really shone here.

The full source code follows. Note that much of the analysis was done interactively with IPython.

"""
Builds a model for the loans competition
"""
import sys

import numpy as np
import pandas as pd
from sklearn import preprocessing, decomposition, cross_validation, svm

"""
Preprocesses and returns the training data. Drops the 'loss' column
"""
def preprocess(idf, ns=0): 
 #sample the training set if necessary
 if ns > 0:
  numSamples = ns
  df = idf.ix[np.random.choice(idf.index, numSamples, replace=False)]
 else:
  df = idf
 #takes a dataframe, then separates out float and int values
 indexName = 'id'
 resultName = 'loss'
 floatColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'float64' or df[column].dtype == 'object']
 intColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'int64']
 floatdf, intdf =  df[floatColumnNames].astype('float64'), df[intColumnNames]
 
 #separate the integer columns into continuous and categorical variables. this was done by visual inspection, so here I drop any columns with only one unique value and set any integer column with fewer unique values than 'f293' as categorical
 uniqueInts=pd.Series([len(np.unique(intdf[name])) for name in intdf.columns],index=intdf.columns) 
 catColumnNames = [col for col in uniqueInts.index if uniqueInts[col] > 1 and uniqueInts[col] < uniqueInts['f293']]
 dropColumnNames = [col for col in uniqueInts.index if uniqueInts[col] <= 1]
 catdf = df[catColumnNames]
 intdf=intdf.drop(dropColumnNames,axis=1)
 intdf=intdf.drop(catColumnNames,axis=1)

 #impute dataframes, drop loss column (should be in intdf)
 floatdf=floatdf.fillna(floatdf.median())
 intdf=intdf.fillna(intdf.median())
 catdf=catdf.fillna(catdf.mode().iloc[0]) #use the first row of mode() so every row gets filled

 #transform categorical columns to one-of-k encoding
 enc = preprocessing.OneHotEncoder()  
 oneHot = enc.fit_transform(catdf.values) 
 onehotlabels = pd.Series([len(np.unique(catdf[name])) for name in catdf.columns],index=catdf.columns)
 oneHotColNames = []
 for name in onehotlabels.index:
  for i in np.arange(onehotlabels[name]):
   oneHotColNames.append(name + '-' + str(i))
 catonehotdf = pd.DataFrame(oneHot.todense(), columns=oneHotColNames)
 
 #reindex data
 floatdf.index, intdf.index, catonehotdf.index = intdf['id'], intdf['id'], intdf['id']
 intdf = intdf.drop('id', axis=1)

 #return preprocessed, scaled dataframe with output column dropped
 unscaled = floatdf.join([intdf.astype('float64'), catonehotdf.astype('float64')])
 scaled_raw = preprocessing.scale(unscaled.values)
 return pd.DataFrame(preprocessing.normalize(scaled_raw), index=unscaled.index, columns=unscaled.columns).drop('loss', axis=1)

"""
Builds a model with preprocessed data and tests it.  preproc and losses are a pandas dataframe and series. transform, clf_predictor, and reg_predictor are sklearn objects. Losses below lossFloor are dropped for purposes of training the classifier. numberZeroLoss are the number of zeroLoss rows to include in training the regressor (randomly selected)
"""
def estimate(preproc, losses, transform, clf_predictor, reg_predictor, lossFloor=0, numberZeroLoss=0, updateClf=True, updateReg=True):
 X = transform.fit_transform(preproc.values)
 y = losses.values
 X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)
 y_binary_train = np.where(y_train > lossFloor, 1, 0)
 y_binary_test = np.where(y_test > lossFloor, 1, 0)

 if updateClf:
  clf_predictor.fit(X_train, y_binary_train)
 y_pos_indices = np.where(y_train > 0)[0]
 if numberZeroLoss > 0:
  y_zeroes_indices = np.where(y_train ==0)[0]
  y_zeroes_indices = np.random.choice(y_zeroes_indices, size=numberZeroLoss, replace=False)
  y_pos_indices = np.union1d(y_pos_indices, y_zeroes_indices)
 if updateReg:
  reg_predictor.fit(X_train[y_pos_indices], y_train[y_pos_indices])

 y_clf_predict = clf_predictor.predict(X_test)
 y_reg_predict = reg_predictor.predict(X_test)
 y_predict = np.multiply(y_clf_predict, y_reg_predict)

 return y_test, y_predict

"""
Like estimate, but does not train the model. 
"""
def predict(preproc, clf_predictor, reg_predictor, lossFloor=0):
 X = preproc
 y_clf_predict = clf_predictor.predict(X)
 y_reg_predict = reg_predictor.predict(X)
 y_predict = np.multiply(y_clf_predict, y_reg_predict)
 return y_predict

"""
Returns only columns which have similar means in the train AND test sets.
Also drops columns with zero mean in the test set -- in practice, columns which are all zero
Ignores 'id' and 'loss' columns
"""
def good_cols(both, train_size=105471, tol=0.05):
 shift = pd.Series(index = both.columns)
 for name in both.columns:
  shift[name]=both[name].astype('float64')[train_size:].mean()/both[name].astype('float64')[:train_size].mean()
 shift_centered = (shift-1).abs()
 good_col_names = shift_centered[shift_centered < tol].index.values
 return np.union1d(good_col_names, ['id','loss'])


def main(idf,ns=0):
 return preprocess(idf, ns)

if __name__ == '__main__':
 #load the training data from the path given on the command line, then preprocess it
 df = pd.read_csv(sys.argv[1])
 preprocessed = main(df)
 sys.exit(0)
