Introduction
Pretend you’re a bank, and you’re about to loan some money to a new client. Of course you’ll try to predict whether your new client will default, but wouldn’t it be great if you could predict the severity of your losses as well? Will this new customer merely default—or default hard?
The Loan Default Prediction Challenge poses just this question. To help, we’re given reams of anonymized data on hundreds of thousands of debtors, and a single value to predict: lender loss.
Foremost among the many challenges was simply cleaning up the data: mixed types, missing values, and highly correlated columns abound. In particular, many columns shift wildly between the training data and the test data, rendering some of the learning gleaned from training useless! To top it all off, all the metadata is stripped from the columns. We have no idea what each column even represents.
Nevertheless, though I lagged behind the front-runners, I was able to beat the benchmark solution. I also used this project to acquaint myself with some contemporary libraries in the Python data-mining stack: pandas, scikit-learn, IPython, and matplotlib.
Cleaning up the data, broad trends
That this section is the longest is no mistake — the data truly was a mess!
The training set weighs in at over one hundred thousand rows (debtors) and over seven hundred columns. The first thing to notice about the loss data is that most debtors don’t default: about 90% of loss values are zero.
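That fraction is a one-line check once the training file is loaded (a sketch; the file name here is a placeholder for the competition's training CSV):

import pandas as pd

# placeholder path for the raw training file
df = pd.read_csv('train.csv')

# fraction of debtors with zero loss -- roughly 0.9
print((df['loss'] == 0).mean())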
Next, we have to reckon with the mixed data types. While most columns appear to be numerical, some may be categorical — that is, they might be codes representing some property. The particular numbers representing them are therefore meaningless in relation to one another.
To see what I mean, consider the histogram below. It shows all the data columns that contain only integers, binned by the number of unique values they hold. Recall that there are over a hundred thousand rows, meaning potentially over a hundred thousand unique values. You can see below, however, that most of the integer columns contain just a few thousand unique values or fewer.
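Here is roughly how that tally can be made, continuing with the df loaded above (a sketch, not the exact analysis code):

import matplotlib.pyplot as plt

# integer-typed columns only
int_cols = [c for c in df.columns if df[c].dtype == 'int64']

# number of unique values in each integer column
unique_counts = pd.Series({c: df[c].nunique() for c in int_cols})

# most integer columns turn out to have only a few thousand distinct values
unique_counts.hist(bins=50)
plt.xlabel('unique values per integer column')
plt.ylabel('number of columns')
plt.show()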
The loss column — what we are trying to predict — is an integer column ranging from 0 to 100, and most certainly numeric. Therefore, I only consider columns with fewer than 100 unique values as candidate categoricals (a cutoff which is somewhat arbitrary, of course, but I needed to pare the list down somehow).
To choose which of these columns were categorical and which were numerical, I applied a simple test. I looked through them one by one, and if the histogram of values had a plausible-looking distribution for a random variable, I assumed it was numeric. If the distribution was noisy and erratic-looking, I assumed it was categorical. The figure below illustrates:
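In code, that eyeballing step is just a loop over the candidate columns, plotting each one's histogram (a sketch building on the snippets above; the cutoff of 100 comes from the previous paragraph, and only the first dozen candidates are shown):

# integer columns with fewer than 100 unique values are the categorical candidates
candidates = unique_counts[unique_counts < 100].index

for name in candidates[:12]:
    plt.figure()
    df[name].hist(bins=df[name].nunique())
    plt.title(name)   # smooth, plausible distribution -> numeric; noisy, erratic -> categorical
plt.show()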
Categorical variables need to be encoded in such a way that they are distinct but not ordinal (their numerical values must not bias any downstream calculation). One way to do this is to provide a column for each possible value of the categorical variable; each new column is then simply binary. This is called one-hot encoding, and scikit-learn’s OneHotEncoder takes care of it for us.
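A minimal sketch of that encoding (the tiny DataFrame here is a made-up stand-in for the real categorical columns):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy stand-in: two categorical columns holding integer codes
catdf = pd.DataFrame({'f1': [0, 1, 2, 1], 'f2': [5, 5, 7, 7]})

enc = OneHotEncoder()
onehot = enc.fit_transform(catdf.values)   # sparse matrix, one binary column per category

print(pd.DataFrame(onehot.toarray()))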
Far more insidious than the mixed data types, however, is the huge shift in behavior of some columns between the training data (rows where we are provided with the resulting loss) and the competition data (rows where we have to guess the loss). Take, for example, column “f554”:
This problem is systematic in the dataset, as the histogram below shows. Taking the ratio of the means of the training data versus the competition data, we can see that the mean of many columns shifts by up to 100%! Obviously, this makes prediction much more difficult. Unfortunately, I didn’t discover this feature of the dataset until very late in the competition. It explains, however, why models of mine that did very well on the training data (held back for testing, of course) did rather poorly on the competition set.
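A check like the following exposes the drift (a sketch mirroring the good_cols routine in the source code at the end of the post; train_df and test_df stand for the training and competition DataFrames restricted to their shared numeric feature columns):

import pandas as pd

both = pd.concat([train_df, test_df], ignore_index=True)
n_train = len(train_df)

# ratio of the competition-set mean to the training-set mean, column by column
shift = both.iloc[n_train:].mean() / both.iloc[:n_train].mean()

# columns whose mean drifts by more than, say, 5% between the two sets
drifting = shift[(shift - 1).abs() > 0.05]
print(len(drifting), 'columns drift by more than 5%')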
Finally, many of the columns are highly correlated, as you can see in the covariance matrix below. For all subsequent machine learning, I used a transformed, whitened copy of the data (specifically, via scikit-learn’s RandomizedPCA).
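The whitening step itself is a couple of lines. RandomizedPCA has since been folded into scikit-learn's PCA class, so the sketch below uses the modern spelling; X stands for the cleaned feature matrix, and the number of components is an illustrative choice (the importance plot later suggests 600-plus components were retained):

from sklearn.decomposition import PCA

# randomized SVD plus whitening -- the modern equivalent of RandomizedPCA(whiten=True)
pca = PCA(n_components=700, svd_solver='randomized', whiten=True)
X_white = pca.fit_transform(X)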
Machine Learning
Since this contest was scored by mean absolute error, and most of the losses are zero, I thought it important to use both a classifier and a regressor to predict loss. This is because even predicting all losses to be zero gives you an MAE of less than one — in fact, the benchmark solution is just that. Hence, if you are regularly predicting zero-loss rows to be greater than zero, your MAE will rise very quickly.
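As a quick worked check of that baseline (y stands for the vector of training losses):

import numpy as np
from sklearn.metrics import mean_absolute_error

# MAE of the "never defaults" prediction; with ~90% of losses exactly zero,
# this already comes in below 1, so spurious non-zero predictions are costly
baseline = mean_absolute_error(y, np.zeros_like(y))
print(baseline)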
The best combination I could find was a boosted classifier (AdaBoostClassifier) with a decision tree as the base classifier to tell me which rows would default, and then a support vector regressor (SVR) to predict how much. This did great on the training set, achieving MAEs in the low 0.60s (on the sample held back for testing). However, due to the data shifts noted above, it regularly flopped on the competition data. In fact, I was only able to beat the benchmark using a different strategy (see conclusion).
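In outline, the two-stage model looks like this (a sketch rather than the exact competition code; X_train, X_test, and y_train stand for the whitened features and training losses, and the hyperparameters echo the sweet spot discussed below):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR

# stage 1: boosted decision trees decide whether a row defaults at all
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5))
clf.fit(X_train, (y_train > 0).astype(int))

# stage 2: an SVR, trained only on rows that actually defaulted, predicts how much
reg = SVR(C=10)
reg.fit(X_train[y_train > 0], y_train[y_train > 0])

# combined prediction: the regressor's output, gated by the classifier
y_pred = clf.predict(X_test) * reg.predict(X_test)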
The dominant parameters are the maximum depth of the trees in the classifier and the C parameter of the SVR, which controls how heavily the regressor penalizes training errors. Below is a grid of the MAE for various values of C and max depth. I used scikit-learn’s train_test_split function to create the training and test sets.
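The grid itself can be a plain double loop rather than GridSearchCV, since the classifier and regressor are fit together (a sketch reusing the two-stage fit above; the parameter values are illustrative, and note that the appendix uses the older sklearn.cross_validation module while newer releases expose train_test_split from sklearn.model_selection):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(X, y)

mae_grid = {}
for depth in [3, 5, 10, 20]:
    for C in [1, 10, 100]:
        clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
        clf.fit(X_train, (y_train > 0).astype(int))
        reg = SVR(C=C)
        reg.fit(X_train[y_train > 0], y_train[y_train > 0])
        pred = clf.predict(X_test) * reg.predict(X_test)
        mae_grid[(depth, C)] = mean_absolute_error(y_test, pred)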
We can see that the sweet spot seems to be around max depth = 5 and C = 10. The tree classifier is very likely overfitting at depths of 10 and 20. One of the features of the decision tree classifier is that it gives us a good indication of which columns are important: data is first split on columns near the “trunk.” If we look at the most important features from our classifier, decent separation of the data is already evident. Note also that we are not looking at the first or second principal component here; we’re pretty far down the line, at the 655th and 645th components!
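Pulling out those importances is a single attribute access on the fitted booster (clf being the AdaBoostClassifier from the sketch above):

import numpy as np

importances = clf.feature_importances_

# indices of the PCA components the boosted trees lean on most
top = np.argsort(importances)[::-1][:10]
print(list(zip(top, importances[top])))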
And how about predicting the actual loss values? How does our SVR do? It does all right, but it could definitely use some work. We can see good correlation below for data that isn’t misclassified. There is definitely room for improvement, however, as we can see in the distribution: the SVR misses out on high loss values, and even predicts some losses to be negative (we wish!). Not to mention that we miss out on some segments of the distribution even at low values.
Nevertheless, we have solid parameters, the data look well-partitioned by the classifier, and the regressor is workable. An MAE of 0.63 is a sizeable improvement on the benchmark solution (which scores around 0.83), so I must be in good shape to improve from here, right?
Conclusions
Wrong. Due to the big shifts between training and competition data noted above, models which performed well on the training data often did worse than the benchmark on competition data. I was still able to beat the benchmark using a RandomForestClassifier, again with a max depth of 5.
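Swapping in the forest amounts to replacing the gating stage (a sketch; n_estimators is an illustrative round number, and the regressor stage is simply reused from the earlier snippet — the exact final submission pipeline may have differed):

from sklearn.ensemble import RandomForestClassifier

# default/no-default gate with limited depth, which generalized better to the competition set
rf = RandomForestClassifier(max_depth=5, n_estimators=100)
rf.fit(X_train, (y_train > 0).astype(int))

y_pred = rf.predict(X_test) * reg.predict(X_test)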
Had I realized these discrepancies earlier on, I might have been able to eliminate the irrelevant or spurious features. Nonetheless, the knowledge gains were substantial: working through the cleaning, transforming, fitting and re-fitting process exposed me to the common pitfalls of this type of analysis, to the ins and outs of the scikit-learn library, and, more broadly, to the machine learning techniques themselves. For instance, I found a support vector classifier to be almost useless in this project because (I suspect) the class boundaries were so noisy — the tree-type classifiers really shone here.
The full source code is below. Note that much of the analysis was done interactively with IPython.
""" Builds a model for the loans competition """ import numpy as np import pandas as pd from sklearn import preprocessing, decomposition, cross_validation, svm """ Preprocesses and returns the training data. Drops the 'loss' column """ def preprocess(idf, ns=0): #sample the training set if necessary if ns > 0: numSamples = ns df = idf.ix[np.random.choice(idf.index, numSamples, replace=False)] else: df = idf #takes a dataframe, then seperate out float, int values indexName = 'id' resultName = 'loss' floatColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'float64' or df[column].dtype == 'object'] intColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'int64'] floatdf, intdf = df[floatColumnNames].astype('float64'), df[intColumnNames] #seperate the integer columns into continuous and categorical variables. this was done by visual inspection, so here I drop any columns with only one unique value and set any integer column with fewer unique values than 'f293' as categorical uniqueInts=pd.Series([len(np.unique(intdf[name])) for name in intdf.columns],index=intdf.columns) catColumnNames = [col for col in uniqueInts.index if uniqueInts[col] > 1 and uniqueInts[col] < uniqueInts['f293']] dropColumnNames = [col for col in uniqueInts.index if uniqueInts[col] <= 1] catdf = df[catColumnNames] intdf=intdf.drop(dropColumnNames,axis=1) intdf=intdf.drop(catColumnNames,axis=1) #impute dataframes, drop loss column (should be in intdf) floatdf=floatdf.fillna(floatdf.median()) intdf=intdf.fillna(intdf.median()) catdf=catdf.fillna(catdf.mode()) #transform categorical columns to one-of-k encoding enc = preprocessing.OneHotEncoder() oneHot = enc.fit_transform(catdf.values) onehotlabels = pd.Series([len(np.unique(catdf[name])) for name in catdf.columns],index=catdf.columns) oneHotColNames = [] for name in onehotlabels.index: for i in np.arange(onehotlabels[name]): oneHotColNames.append(name + '-' + str(i)) catonehotdf = pd.DataFrame(oneHot.todense(), columns=oneHotColNames) #reindex data floatdf.index, intdf.index, catonehotdf.index = intdf['id'], intdf['id'], intdf['id'] intdf = intdf.drop('id', axis=1) #return preprocessed, scaled dataframe with output column dropped unscaled = floatdf.join([intdf.astype('float64'), catonehotdf.astype('float64')]) scaled_raw = preprocessing.scale(unscaled.values) return pd.DataFrame(preprocessing.normalize(scaled_raw), index=unscaled.index, columns=unscaled.columns).drop('loss', axis=1) """ Builds a model with preprocessed data and tests it. preproc and losses are a pandas dataframe and series. transform, clf_predictor, and reg_predictor are sklearn objects. Losses below lossFloor are dropped for purposes of training the classifier. 
numberZeroLoss are the number of zeroLoss rows to include in training the regressor (randomly selected) """ def estimate(preproc, losses, transform, clf_predictor, reg_predictor, lossFloor=0, numberZeroLoss=0, updateClf=True, updateReg=True): X = transform.fit_transform(preproc.values) y = losses.values X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y) y_binary_train = np.where(y_train > lossFloor, 1, 0) y_binary_test = np.where(y_test > lossFloor, 1, 0) if updateClf: clf_predictor.fit(X_train, y_binary_train) y_pos_indices = np.where(y_train > 0)[0] if numberZeroLoss > 0: y_zeroes_indices = np.where(y_train ==0)[0] y_zeroes_indices = np.random.choice(y_zeroes_indices, size=numberZeroLoss, replace=False) y_pos_indices = np.union1d(y_pos_indices, y_zeroes_indices) if updateReg: reg_predictor.fit(X_train[y_pos_indices], y_train[y_pos_indices]) y_clf_predict = clf_predictor.predict(X_test) y_reg_predict = reg_predictor.predict(X_test) y_predict = np.multiply(y_clf_predict, y_reg_predict) return y_test, y_predict """ Like estimate, but does not train the model. """ def predict(preproc, clf_predictor, reg_predictor, lossFloor=0): X = preproc y_clf_predict = clf_predictor.predict(X) y_reg_predict = reg_predictor.predict(X) y_predict = np.multiply(y_clf_predict, y_reg_predict) return y_predict """ Returns only columns which have similar means in the train AND test sets. Also drops columns with zero mean in the test set -- in practice, columns which are all zero Ignores 'id' and 'loss' columns """ def good_cols(both, train_size=105471, tol=0.05): shift = pd.Series(index = both.columns) for name in both.columns: shift[name]=both[name].astype('float64')[train_size:].mean()/both[name].astype('float64')[:train_size].mean() shift_centered = (shift-1).abs() good_col_names = shift_centered[shift_centered < tol].index.values return np.union1d(good_col_names, ['id','loss']) def main(idf,ns=0): return preprocess(idf, ns) if __name__ == '__main__': status = main() sys.exit(status)