This is a republication of a blog post from my old WordPress site. Some formatting may be broken. Originally published October 20, 2014.
Introduction
Crystal structure prediction is an area of active research in chemistry and physics. A crystal structure is just the arrangement of atoms within a crystalline solid. For instance, table salt:
In general, there is no deterministic way to predict the atomic structure for a given elemental composition at particular thermodynamic conditions. Programs that can reliably predict crystal structures with given properties have the potential to save researchers millions of lab hours (not to mention dollars!).
Of course, a competent structural chemist has a number of heuristic rules for guessing crystal structures. Most of these are inspired by trends in the periodic table. Could a (relatively) simple computer program with examples of real crystal structures and knowledge of the periodic table come up with some similar heuristics on its own?
The Materials Project is an open database of real and computed solid-state structures. I built a program to download the structural details from their database (tens of thousands of entries) and predict structural characteristics of unseen structures using random forests. The approach is very simple, but works surprisingly well for some properties. The code is available on GitHub along with a more in-depth explanation (as a PDF).
I didn’t try to predict the exact crystal structures. Instead, I chose a few parameters that could be combined to generate a reasonable guess: point group, coordination (number of neighbors for each atom), irregularity of axes (ratio of c axis to a axis), and volume per site. I used random forest classification and regression for all tasks.
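For the curious, here is roughly what the modeling step looks like in scikit-learn. This is a minimal sketch on stand-in random data; the real feature matrix was built from periodic-table properties of the elements in each composition.

import numpy as np
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# stand-ins for the real data: composition descriptors, point-group labels,
# and volumes per site
X = np.random.rand(1000, 20)
point_groups = np.random.randint(0, 32, 1000)
volumes = 30 * np.random.rand(1000)

# classification task (e.g. point group)
X_train, X_test, y_train, y_test = train_test_split(X, point_groups)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                       # classification accuracy

# regression task (e.g. volume per site)
X_train, X_test, y_train, y_test = train_test_split(X, volumes)
reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)
print(np.mean(np.abs(reg.predict(X_test) - y_test)))   # mean absolute error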
Point group prediction had an accuracy of around 45%. If the crystal system was already guessed correctly, the point group prediction was much better (over 80%). This is a bit technical, but if you’re interested, you can read more about it (and look at the confusion matrices) in the pdf.
The program was very good at predicting volume per site. It should be noted, however, that many of the structures were metal alloys: transition metals tend to retain their atomic properties when mixed with other metals, so the volume per site may already be easy to guess in many cases. All the following plots are on the test set.
The c/a ratio did not fare as well, though there may be some signal here:
Finally, the coordination numbers. This is the number of neighboring atoms each atom is bound to. I don’t know how these numbers were decided (coordination number is a somewhat subjective quantity), but the program can make some decent predictions at any rate. Coordination numbers commonly range from 1 to 12, though they can go higher. Notice the mean absolute error for many elements is on the order of 0.5.
Conclusion
Not bad results for a few evenings’ worth of work. The next step, if I have time, is to tune the parameters of the random forest model. It would also be nice to have a program to generate a few reasonable example structures from the predicted structural characteristics…
This is a republication of a blog post from my old WordPress site. Some formatting may be broken. Originally published October 8, 2014.
Introduction
Tetrahedron packing is the open mathematical/computational problem of how best to fill space with tiny pyramids, or tetrahedra.
Tetrahedra do not completely fill space, though the best packings come pretty close (around 83%, compared to ~74% for spheres).
I built a tetrahedral packing program inspired by chemical packing of tetrahedra. I couldn’t get this approach to work very well — in fact it’s trivial to beat my program by just placing tetrahedra on a grid. Nevertheless, it was fun to construct and I might revisit it sometime later.
Some Figures
The code and some explanation are available on my GitHub. The idea was to start with a solid-state structure known to be approximately tetrahedrally close-packed (TCP), insert regular tetrahedra into the spaces, and compress.
I used some techniques I was familiar with from the chemical literature (annealing, Ewald summation) to try to push the tetrahedra toward higher packing density. While this was able to break out of early jamming, the approach seemed to stall around 14% packing density. This is unfortunate, because even a Bravais packing (i.e. placing tetrahedra on a grid) gives 33% packing density! The initial configuration doesn’t seem to make a big difference:
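To give a flavor of the annealing step (this is a sketch, not the actual project code, which is on GitHub): propose a random move of one tetrahedron, then accept or reject it with the Metropolis criterion, where the "energy" is anything that penalizes overlaps and rewards compression.

import numpy as np

def metropolis_step(positions, energy_fn, temperature, step_size=0.05):
    """Propose a random displacement of one tetrahedron; accept or reject it."""
    i = np.random.randint(len(positions))
    trial = positions.copy()
    trial[i] += np.random.uniform(-step_size, step_size, size=3)
    dE = energy_fn(trial) - energy_fn(positions)
    # always accept downhill moves; accept uphill moves with Boltzmann probability
    if dE < 0 or np.random.rand() < np.exp(-dE / temperature):
        return trial
    return positions

Lowering the temperature over time (the cooling profile mentioned below) makes uphill moves progressively rarer as the packing tightens.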
Adding a cooling profile helps a little bit:
Random initial configurations do about as well as (if not better than) the TCP-inspired configurations (I didn’t look into this extensively, but in retrospect this is not surprising given how distorted some tetrahedra are in the chemical structure).
I ran the simulations on my laptop, so CPU time was the limiting factor. I ran one simulation for a very long time, but progress seems to hit a wall around 14%:
This is a republication of a blog post from my old WordPress site. Some formatting may be broken. Originally published March 30, 2014.
Introduction
Pretend you’re a bank, and you’re about to loan some money to a new client. Of course you’ll try to predict whether your new client will default, but wouldn’t it be great if you could predict the severity of your losses as well? Will this new customer merely default—or default hard?
The Loan Default Prediction Challenge poses just this question. To help, we’re given reams of anonymized data on hundreds of thousands of debtors, and a single value to predict: lender loss.
Foremost among many challenges was simply cleaning up the data: mixed types, missing values, and highly correlated columns abound. In particular, many columns shift wildly between the training data and the test data, rendering some learning gleaned from training useless! To top it all off, all the metadata is stripped from the columns. We have no idea what each column even represents.
Nevertheless, though I lagged behind the front-runners I was able to beat the benchmark solution. I also used this project to acquaint myself with some contemporary libraries used in the Python data-mining stack: pandas, scikit-learn, ipython, and matplotlib.
Cleaning up the data, broad trends
That this section is the longest is no mistake — the data truly was a mess!
The training set weighs in at over one hundred thousand rows (debtors) and over seven hundred columns. The first thing to notice about the loss data is that most debtors don’t default: about 90% of loss values are zero.
Next, we have to reckon with the mixed data types. While most columns appear to be numerical, some may be categorical — that is, they might be codes representing some property, so the particular numbers representing them are meaningless in relation to one another.
To see what I mean, consider the histogram below. It shows all the data columns which contain only integers, binned by the number of unique values. Recall there are over a hundred thousand rows, meaning potentially over a hundred thousand unique values. You can see below, however, that most of the integer columns contain just a few thousand unique values or fewer.
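Here is roughly how that histogram was built (a sketch; 'train.csv' is a placeholder for the competition file):

import pandas as pd

df = pd.read_csv('train.csv')  # placeholder path for the training data
int_cols = [c for c in df.columns if df[c].dtype == 'int64']
unique_counts = pd.Series({c: df[c].nunique() for c in int_cols})
print(unique_counts.describe())  # most integer columns have only a few thousand unique values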
The loss column — what we are trying to predict — is an integer column ranging from 0 to 100, and most certainly numeric. A categorical column should have far fewer distinct values than that, so I only considered columns with fewer than 100 unique values as candidates (a somewhat arbitrary cutoff, of course, but I needed to pare it down somehow).
To choose which of these columns were categorical and which were numerical, I applied a simple test. I looked through them one by one, and if the histogram of values had a plausible-looking distribution for a random variable, I assumed it was numeric. If the distribution was noisy and erratic-looking, I assumed it was categorical. The figure below illustrates:
Categorical variables need to be encoded in such a way that they are distinct, but not ordinal (their numerical values must not bias any downstream calculation). One way to do this is to provide a column for each possible value of the categorical variable; each of those columns is then simply binary. This is called one-hot encoding, and the scikit-learn library provides a function to take care of it for us.
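Here is a toy example of what that looks like: integer-coded categories go in, binary indicator columns come out.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

cats = np.array([[0, 3],
                 [1, 0],
                 [0, 2]])                 # two toy categorical columns
enc = OneHotEncoder()
print(enc.fit_transform(cats).todense())  # one binary column per observed category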
Far more insidious than the mixed data types, however, is the huge shift in behavior of some columns between the training data (rows where we are provided with the resulting loss) and the competition data (rows where we have to guess the loss). Take, for example, column "f554":
This problem is systematic in the dataset, as the histogram below shows. Taking the ratio of the means of the training data versus the competition data, we can see that the mean of many columns shifts by up to 100%! Obviously, this makes prediction much more difficult. Unfortunately, I didn’t discover this feature of the dataset until very late in the competition. It explains, however, why my models that did very well on the training data (on a sample held back for testing, of course) did rather poorly on the competition set.
Finally, many of the columns are highly correlated as you can see in the covariance matrix below. For all subsequent machine learning, I used a transformed, whitened copy of the data (specifically, scikit-learn’s RandomizedPCA function).
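In code, the whitening step is short (a sketch; RandomizedPCA lived in sklearn.decomposition at the time, and has since been folded into PCA(svd_solver='randomized') in newer scikit-learn versions):

import numpy as np
from sklearn.decomposition import RandomizedPCA

X = np.random.rand(1000, 50)    # stand-in for the cleaned feature matrix
pca = RandomizedPCA(whiten=True)
X_white = pca.fit_transform(X)  # decorrelated, unit-variance features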
Machine Learning
Since this contest was scored by mean absolute error, and most of the losses are zero, I thought it important to use both a classifier and a regressor to predict loss. This is because even predicting all losses to be zero gives you an MAE of less than one — in fact, the benchmark solution is just that. Hence, if you are regularly predicting zero-loss rows to be greater than zero, your MAE will rise very quickly.
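To see why, here is a toy draw shaped to mimic the data: about 10% defaults, with the nonzero losses skewing small (the geometric tail is just an illustrative stand-in for the real loss distribution).

import numpy as np
from sklearn.metrics import mean_absolute_error

defaults = np.random.rand(100000) < 0.1
losses = np.where(defaults, np.minimum(np.random.geometric(0.12, 100000), 100), 0)
print(mean_absolute_error(losses, np.zeros_like(losses)))  # ~0.8, near the benchmark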
The best combination I could find was a boosted classifier (AdaBoostClassifier) with a decision tree as the base classifier to tell me which rows would default, and then a support vector regressor (SVR) to predict how much. This did great on the training set, achieving MAEs in the low 0.60s (on the sample held back for testing). However, due to the data shifts noted above, it regularly flopped on the competition data. In fact, I was only able to beat the benchmark using a different strategy (see conclusion).
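In outline, the two-stage model looks something like this (a sketch on stand-in data; the max_depth and C values come from the grid search discussed next):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVR

X = np.random.rand(500, 30)   # stand-in for the whitened feature matrix
y = np.where(np.random.rand(500) < 0.1, np.random.randint(1, 101, 500), 0)

# stage 1: will this row default at all?
clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5))
clf.fit(X, np.where(y > 0, 1, 0))

# stage 2: if it defaults, how bad? (trained on defaulting rows only)
reg = SVR(C=10)
reg.fit(X[y > 0], y[y > 0])

# combined prediction: zero unless the classifier flags a default
y_pred = clf.predict(X) * reg.predict(X)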
The dominant parameters are the maximum depth of the trees in the classifier and the C parameter of the SVR, which controls how heavily the SVR penalizes points it fits poorly. Below is a grid of the MAE for various values of C and max depth. I used scikit-learn’s train_test_split function to create the training and test sets.
We can see that the sweet spot seems to be around max depth = 5 and C = 10. The tree classifier is very likely overfitting at depths of 10 and 20. One of the features of the decision tree classifier is that it gives us a good indication of which columns are important: data is first split on columns near the “trunk.” If we look at the most important features from our classifier, decent separation of the data is already evident. Note also that we are not looking at the first or second principal component here; we’re pretty far down the line, at the 655th and 645th components!
And how about predicting the actual loss values? How does our SVR do? It does all right, but could definitely use some work. We can see good correlation below for data that isn’t misclassified. There is definitely room for improvement, however, as we can see in the distribution: the SVR misses the high loss values, and even predicts some losses to be negative (we wish!). Not to mention that we miss some segments of the distribution even at low values.
Nevertheless, we have solid parameters, the data look well-partitioned by the classifier, and the regressor is workable. An MAE of 0.63 is a sizeable improvement on the benchmark solution (which scores around 0.83), so I must be in good shape to improve from here, right?
Conclusions
Wrong. Due to the big shifts between training and competition data noted above, models which performed well on the training data often did worse than the benchmark on competition data. I was still able to beat the benchmark using a RandomForestClassifier, again with a max depth of 5.
Had I realized these discrepancies earlier on, I might have been able to eliminate the irrelevant or spurious features. Nonetheless, the knowledge gains were substantial: working through the cleaning, transforming, fitting, and re-fitting process exposed me to the common pitfalls of this type of analysis, the ins and outs of the scikit-learn library, and, more broadly, the machine learning techniques themselves. For instance, I found a support vector classifier to be almost useless in this project, I suspect because the class boundaries were so noisy; the tree-type classifiers really shined here.
Click below to expand source code. Note that much of the analysis was done interactively with ipython.
"""
Builds a model for the loans competition
"""
import numpy as np
import pandas as pd
from sklearn import preprocessing, decomposition, cross_validation, svm
"""
Preprocesses and returns the training data. Drops the 'loss' column
"""
def preprocess(idf, ns=0):
#sample the training set if necessary
if ns > 0:
numSamples = ns
df = idf.ix[np.random.choice(idf.index, numSamples, replace=False)]
else:
df = idf
#takes a dataframe, then seperate out float, int values
indexName = 'id'
resultName = 'loss'
floatColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'float64' or df[column].dtype == 'object']
intColumnNames=[df[column].name for column in df.columns if df[column].dtype == 'int64']
floatdf, intdf = df[floatColumnNames].astype('float64'), df[intColumnNames]
#seperate the integer columns into continuous and categorical variables. this was done by visual inspection, so here I drop any columns with only one unique value and set any integer column with fewer unique values than 'f293' as categorical
uniqueInts=pd.Series([len(np.unique(intdf[name])) for name in intdf.columns],index=intdf.columns)
catColumnNames = [col for col in uniqueInts.index if uniqueInts[col] > 1 and uniqueInts[col] < uniqueInts['f293']]
dropColumnNames = [col for col in uniqueInts.index if uniqueInts[col] <= 1]
catdf = df[catColumnNames]
intdf=intdf.drop(dropColumnNames,axis=1)
intdf=intdf.drop(catColumnNames,axis=1)
#impute dataframes, drop loss column (should be in intdf)
floatdf=floatdf.fillna(floatdf.median())
intdf=intdf.fillna(intdf.median())
catdf=catdf.fillna(catdf.mode())
#transform categorical columns to one-of-k encoding
enc = preprocessing.OneHotEncoder()
oneHot = enc.fit_transform(catdf.values)
onehotlabels = pd.Series([len(np.unique(catdf[name])) for name in catdf.columns],index=catdf.columns)
oneHotColNames = []
for name in onehotlabels.index:
for i in np.arange(onehotlabels[name]):
oneHotColNames.append(name + '-' + str(i))
catonehotdf = pd.DataFrame(oneHot.todense(), columns=oneHotColNames)
#reindex data
floatdf.index, intdf.index, catonehotdf.index = intdf['id'], intdf['id'], intdf['id']
intdf = intdf.drop('id', axis=1)
#return preprocessed, scaled dataframe with output column dropped
unscaled = floatdf.join([intdf.astype('float64'), catonehotdf.astype('float64')])
scaled_raw = preprocessing.scale(unscaled.values)
return pd.DataFrame(preprocessing.normalize(scaled_raw), index=unscaled.index, columns=unscaled.columns).drop('loss', axis=1)
"""
Builds a model with preprocessed data and tests it. preproc and losses are a pandas dataframe and series. transform, clf_predictor, and reg_predictor are sklearn objects. Losses below lossFloor are dropped for purposes of training the classifier. numberZeroLoss are the number of zeroLoss rows to include in training the regressor (randomly selected)
"""
def estimate(preproc, losses, transform, clf_predictor, reg_predictor, lossFloor=0, numberZeroLoss=0, updateClf=True, updateReg=True):
X = transform.fit_transform(preproc.values)
y = losses.values
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)
y_binary_train = np.where(y_train > lossFloor, 1, 0)
y_binary_test = np.where(y_test > lossFloor, 1, 0)
if updateClf:
clf_predictor.fit(X_train, y_binary_train)
y_pos_indices = np.where(y_train > 0)[0]
if numberZeroLoss > 0:
y_zeroes_indices = np.where(y_train ==0)[0]
y_zeroes_indices = np.random.choice(y_zeroes_indices, size=numberZeroLoss, replace=False)
y_pos_indices = np.union1d(y_pos_indices, y_zeroes_indices)
if updateReg:
reg_predictor.fit(X_train[y_pos_indices], y_train[y_pos_indices])
y_clf_predict = clf_predictor.predict(X_test)
y_reg_predict = reg_predictor.predict(X_test)
y_predict = np.multiply(y_clf_predict, y_reg_predict)
return y_test, y_predict
"""
Like estimate, but does not train the model.
"""
def predict(preproc, clf_predictor, reg_predictor, lossFloor=0):
X = preproc
y_clf_predict = clf_predictor.predict(X)
y_reg_predict = reg_predictor.predict(X)
y_predict = np.multiply(y_clf_predict, y_reg_predict)
return y_predict
"""
Returns only columns which have similar means in the train AND test sets.
Also drops columns with zero mean in the test set -- in practice, columns which are all zero
Ignores 'id' and 'loss' columns
"""
def good_cols(both, train_size=105471, tol=0.05):
shift = pd.Series(index = both.columns)
for name in both.columns:
shift[name]=both[name].astype('float64')[train_size:].mean()/both[name].astype('float64')[:train_size].mean()
shift_centered = (shift-1).abs()
good_col_names = shift_centered[shift_centered < tol].index.values
return np.union1d(good_col_names, ['id','loss'])
def main(idf,ns=0):
return preprocess(idf, ns)
if __name__ == '__main__':
status = main()
sys.exit(status)
This is a republication of a blog post from my old WordPress site. Some formatting may be broken. Originally published January 30, 2014.
Introduction
What follows is my partial attempt at solving the Santa’s Sleigh problem posted on the Kaggle data mining competition website. More importantly, it was an opportunity to reacquaint myself with the Python ecosystem and particularly Numpy.
The Problem
The problem we are solving is one of packing efficiency: what’s the best way to pack a million variously-shaped boxes into a “sleigh” of fixed footprint and infinite height? The answer will depend on how packing efficiency is defined: do we wish merely to minimize empty space in the sleigh, or do we also want to keep the height of the present stack low? Does the order in which the presents are packed matter?
In this particular definition, only height and present order matter. The exact function used to evaluate a solution is a little opaque, so I’ll try to convey the gist of it in the figure below:
In the BAD case, the presents are out of order: if you were to reach into the sleigh, you would hit Present 2 before Present 1 (only the tops of the presents are considered when deciding the order). Moreover, the present pile in the BAD solution is taller than the BETTER and BEST cases, which will penalize it. In the BETTER case, the presents are in the correct order (again, order only depends on the tops of the presents) and the maximum height is lower. In the BEST case, we improve on the BETTER case and rotate Present 1 to compress the height further without sacrificing order. Note that the contest rules allow the presents to float within the sleigh. A Christmas miracle!
It is trivially easy to create a solution with perfect ordering. If we imagine taking the BETTER or BEST solutions and adding more presents — say Presents 4, 5, and 6 — we need only ensure that in this next layer of presents the tops are flush as in the first layer. This is called the Top Down solution and was provided as a benchmark for the competition.
It’s a bit more difficult to optimize the maximum height. One could, for instance, pack the biggest presents first and then try to fill in the gaps with progressively smaller presents: essentially trying to maximize the density. Such a solution will, however, probably have a terrible ordering.
The trick, of course, is minimizing maximum height at the same time as maximizing order. In contrast to the simple examples above–and adding greatly to the difficulty–the actual problem is 3D, not 2D, and tasks us with optimizing not three but one million presents!
Genetic Algorithms
Genetic algorithms are a class of optimization techniques best employed when:
You don’t know how to make a good solution from scratch,
But you know a good solution when you see it.
In the sleigh problem, we really don’t know how to build a good solution from scratch but we do know how to rank any set of candidate solutions based on the height plus ordering metric described in the previous section.
Genetic algorithms are biologically inspired, and seek to mimic evolution via selection. Members of the population of candidate solutions are selected according to fitness, and reproduce: their genes are mixed — a process known as crossover — and the next generation is born. The next generation may also undergo mutation, wherein members of the new generation are randomly altered, usually only slightly. Crossover is typically the main driver of generation-to-generation improvement, but mutation is important in preventing the program from getting stuck in a local minimum.
Broadly, then, the genetic algorithm works like this (a code sketch follows the list):
Select some members from the population for crossover according to fitness (members of the first generation are randomly generated)
Crossover parent individuals to create children for the next generation
Mutate some fraction of the population
Repeat
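Wired together, the loop above might look like this. This is a sketch, not the project code: fitness, tournament_select, crossover, and mutate are passed in, and sketches of the latter two appear in the next sections.

import numpy as np

def evolve(population, fitness, tournament_select, crossover, mutate,
           n_generations=15000, mutation_frac=0.05):
    """One way to wire the four steps above together."""
    for generation in range(n_generations):
        scores = np.array([fitness(individual) for individual in population])
        # breed the next generation from fitness-selected parents
        children = []
        while len(children) < len(population):
            mom = tournament_select(population, scores)
            dad = tournament_select(population, scores)
            children.append(crossover(mom, dad))
        # mutate some fraction of the new generation
        # (the per-present rates live inside mutate; see the Mutation section)
        for child in children:
            if np.random.rand() < mutation_frac:
                mutate(child)
        population = children
    return population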
In the context of the sleigh problem, fitness is straightforward: the ordering plus height metric. But how do we encode the solution into genes that we can crossover and mutate? There is no unique answer, but I chose to model the solution like this: imagine taking the presents in order, one at a time, and dropping the presents into the sleigh. On each present is written the xy-coordinate to drop from, the orientation of the present, and how much the present should float above whatever is beneath it. Here is a simple cartoon illustrating this system:
Modeling the genes this way makes setting up crossover and mutation a snap, but has at least one serious limitation: since the presents are always dropped in order there’s no chance for, say, Present 2 to get under Present 1. This could be fixed by allowing the presents to be dropped in different orders, but that would make the crossover implementation much more difficult.
Mutation
Now that we have a way to describe the genes, let’s describe the possible mutations to presents within the gene (sketched in code after the list):
Change position: change the location at the top of the sleigh from which we drop the present.
Change orientation: rotate the present about one of its axes.
Change floating height: alter how high the present floats above anything that’s beneath it.
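In code, the gene and its three mutations might look like this. This is a sketch: the array layout, the sleigh and height bounds, and the default fractions are illustrative (the fractions are tuned in a later section).

import numpy as np

def mutate(gene, newPosFrac=0.05, rotateFrac=0.05, htBufFrac=0.10,
           sleigh_side=1000, n_orientations=6, max_float=10):
    """Apply the three mutations to random subsets of presents in one gene.

    Each row of `gene` is one present: [x, y, orientation, floating height].
    The layout and bounds here are illustrative, not the actual code."""
    n = len(gene)
    for i in np.random.choice(n, int(newPosFrac * n), replace=False):
        gene[i, 0:2] = np.random.randint(0, sleigh_side, 2)  # change position
    for i in np.random.choice(n, int(rotateFrac * n), replace=False):
        gene[i, 2] = np.random.randint(n_orientations)       # change orientation
    for i in np.random.choice(n, int(htBufFrac * n), replace=False):
        gene[i, 3] = np.random.randint(0, max_float)         # change floating height
    return gene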
In each successive generation, some of the presents in each new gene will mutate. The fraction of presents that mutate is up to us, and we will explore that briefly in an upcoming section.
Crossover
There are many different ways to choose which individuals get selected for crossover, but any selection method should somehow favor fitter solutions. In this implementation, I used tournament selection, primarily because it is straightforward to code.
Thanks to the way we set up the genes, the mechanics of crossover are also straightforward. We randomly select some point along the series of presents and “cross” the line of presents up to that point with the line of presents after that point in the mating individual. This process is illustrated below. Note the crossed lines in the middle figure — this is the actual crossover step:
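Both pieces are only a few lines each (a sketch under the same illustrative gene layout as above; the tournament size k is a free parameter, and lower fitness is better):

import numpy as np

def tournament_select(population, scores, k=3):
    """Pick k random individuals and keep the fittest (lowest score)."""
    contestants = np.random.choice(len(population), k, replace=False)
    return population[contestants[np.argmin(scores[contestants])]]

def crossover(mom, dad):
    """One-point crossover: mom's presents up to a random cut, dad's after."""
    cut = np.random.randint(1, len(mom))
    return np.vstack([mom[:cut], dad[cut:]])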
Selecting the Parameters
There are a lot of “knobs” we can adjust in our genetic algorithm implementation. To give you an idea, I’ll focus on mutation rates. To test which mutation rates were ideal, I ran my program for 50 generations on the first 648 presents while varying my mutation rates. I chose 648 presents because the Top Down benchmark solution has 648 presents in the first layer. Here are the results (a lower Mean Fitness number means a better result):
“rotateFrac” is the fraction of presents that get the “Change Orientation” mutation described above. A value of 0.05 means 5% of the presents in a child get a new orientation. Similarly, “htBufFrac” is the fraction of presents which get the “Change Floating Height” mutation, and “newPosFrac” is the fraction of presents which get the “Change Position” mutation. You can see that the best results are achieved when newPosFrac=5%, htBufFrac=10%, and rotateFrac=5%. This gives you, at best, a Mean Fitness just over 95000 after 50 generations. To give you an idea of how far we have to go, the Top Down benchmark solution has a fitness score of just 172 for the first 648 presents!
Armed with good parameters, we can now run the genetic algorithm for as many generations as we like. Let’s try 15000 generations and see how we do. Again, I’m only optimizing 648 presents.
Since random initial solutions are so bad, it’s easy to improve them quickly: note the precipitous drop in the first 500 generations — a 30% improvement. After that, each fixed percentage of improvement requires more time. To get from 80000 to 70000 takes the remaining 14000 generations!
Is the optimization asymptotic? That is, has our genetic algorithm reached its limit after 15000 generations, or is it still generating better solutions? Looking at the plot above, you might guess our program has “flat-lined” and is just incestuously combining very similar solutions. To see if this really is the case, we can zoom in on the tail end of the above plot. Let’s focus on the last 3000 generations:
Each unbroken horizontal line represents a persistent individual (the colors only reflect the in-memory location of the individual). We can see that the fittest individual in the population — with fitness around 71500 — was only created at around the 14700th generation. Empirically, then, we can see the program is still optimizing the population. Whether it is too expensive computationally to proceed is up to us.
Digression: Height-only optimization
What happens if you optimize only on height? This is not part of the competition, but I was curious, since the height component of fitness is simpler than the ordering component. We might be able to push our algorithm closer to the breaking point — that is, where diminishing returns make further improvements to the population infeasible.
As above, I ran tests over the parameter space to determine which mutation parameters to use: 5% for both new positions and new orientations. I set the floating height to zero for all presents. Below are the results after 15000 generations:
The most obvious feature is the rapid optimization of the early solutions. There is a roughly 100% improvement in the first 50 generations. Next we see the familiar linear to asymptotic progress as the generations march on. Let’s look at the tail end again:
Remember that there are 30 individuals in this population. Yet by the 15000th generation, there are only three distinct fitnesses. I didn’t investigate the individual solutions, but this probably means they are “stuck” and largely identical. However, mutation is still doing its job here in preventing the program from getting completely stuck: even around generation 14000, at least one fitter solution emerged.
Conclusions
I never scaled up my program to attempt to tackle the full one-million presents problem. In fact, the only solution I submitted was from a random solution generator I wrote just to make sure I understood the problem! It didn’t fare too well…
My main mistake was not writing my program to take advantage of multiple cores from the beginning. That’s really too bad, because it should have been pretty trivial to parallelize. Given the ubiquity and low cost of computing clusters these days, that choice severely hamstrung me.
I also had to spend a lot of time re-learning Numpy and “Pythonic thinking,” which limited the time I could spend implementing alternate solutions. For instance, I had set up the program to mix benchmark solutions into the population, but ran out of time to really put it through its paces.
As a learning exercise, though, I consider this project a modest success. Now, on to pandas, scikit-learn, and more Kaggle competitions!