Wednesday, January 28, 2015

Elementree: crystal structures and random forests

This is a republication of a blog post from my old WordPress site. Some formatting may be broken. Originally published October 20, 2014.

Introduction

Crystal structure prediction is an area of active research in chemistry and physics. A crystal structure is just the arrangement of atoms within a crystalline solid. For instance, table salt:


The crystal structure of sodium chloride, NaCl (table salt). The filled polyhedra are there to emphasize the number of neighbors: each sodium (Na, small grey spheres) has 6 chlorine (Cl, large green spheres) neighbors, and vice versa. Image from Wikipedia.

In general, there is no deterministic way to predict the atomic structure for a given elemental composition at particular thermodynamic conditions. Programs that can reliably predict crystal structures with given properties have the potential to save researchers millions of lab hours (not to mention dollars!).
Of course, a competent structural chemist has a number of heuristic rules for guessing crystal structures. Most of these are inspired by trends in the periodic table. Could a (relatively) simple computer program with examples of real crystal structures and knowledge of the periodic table come up with some similar heuristics on its own?
The Materials Project is an open database of real and computed solid-state structures. I built a program to download the structural details from their database (tens of thousands of entries) and predict structural characteristics of unseen structures using random forests. The approach is very simple, but works surprisingly well for some properties. The code is available on GitHub, along with a more in-depth explanation (as a PDF).
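
To make the idea of "knowledge of the periodic table" concrete, here is a minimal sketch of how a composition could be turned into numeric features for a random forest. This is an illustration only: the choice of descriptors, the tiny hand-entered table, and the names PERIODIC_TABLE and composition_features are assumptions of mine for this example, not necessarily what the program on GitHub does.

```python
# Hypothetical featurization: the descriptors below are an assumption made
# for illustration, not a description of the actual program.
# Tiny hand-entered periodic-table excerpt:
# symbol -> (atomic number, group, period, Pauling electronegativity)
PERIODIC_TABLE = {
    "O": (8, 16, 2, 3.44),
    "Na": (11, 1, 3, 0.93),
    "Cl": (17, 17, 3, 3.16),
    "Fe": (26, 8, 4, 1.83),
}
PROPERTY_NAMES = ("atomic_number", "group", "period", "electronegativity")


def composition_features(composition):
    """Turn a composition such as {'Na': 1, 'Cl': 1} into numeric features.

    For each elemental property, record the composition-weighted mean and
    the spread (max - min) across the elements present.
    """
    total = sum(composition.values())
    features = {}
    for index, name in enumerate(PROPERTY_NAMES):
        values = [PERIODIC_TABLE[el][index] for el in composition]
        weights = [composition[el] / total for el in composition]
        features["mean_" + name] = sum(v * w for v, w in zip(values, weights))
        features["range_" + name] = max(values) - min(values)
    return features


nacl = composition_features({"Na": 1, "Cl": 1})
fe2o3 = composition_features({"Fe": 2, "O": 3})
```

Each composition becomes one fixed-length row of numbers, which is the shape of input a random forest expects regardless of how many elements the compound contains.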

The Data

The compounds in the dataset are largely cubic, though all 32 crystallographic point groups are represented.

Results

I didn’t try to predict the exact crystal structures. Instead, I chose a few parameters that could be combined to generate a reasonable guess: point group, coordination (number of neighbors for each atom), irregularity of axes (ratio of c axis to a axis), and volume per site. I used random forest classification and regression for all tasks.
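
Two of these targets are simple functions of the unit cell, so a short sketch may help. The cell volume formula is the standard one for a general (triclinic) cell; the function names and the NaCl lattice parameter (roughly 5.64 angstroms, 8 atoms in the conventional cubic cell) are my own choices for the example and are not taken from the project code.

```python
import math


def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit cell volume from edge lengths (angstroms) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)


def structural_targets(a, b, c, alpha, beta, gamma, n_sites):
    """Return the c/a ratio and the volume per atomic site for one cell."""
    volume = cell_volume(a, b, c, alpha, beta, gamma)
    return {"c_over_a": c / a, "volume_per_site": volume / n_sites}


# NaCl conventional cubic cell (approximate lattice parameter, 8 atoms per cell).
nacl = structural_targets(5.64, 5.64, 5.64, 90, 90, 90, n_sites=8)

# A made-up hexagonal cell (gamma = 120 degrees) to show a non-trivial c/a.
hexagonal = structural_targets(3.0, 3.0, 4.8, 90, 90, 120, n_sites=2)
```

For a cubic cell the c/a ratio is exactly 1, so this target only carries information for the lower-symmetry systems.
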
Point group prediction had an accuracy of around 45%. If the crystal system was already guessed correctly, the point group prediction was much better (over 80%). This is a bit technical, but if you’re interested, you can read more about it (and look at the confusion matrices) in the PDF.
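
The difference between the two accuracy figures can be illustrated with a small sketch. It assumes one possible reading of "crystal system guessed correctly": the point-group accuracy is measured only on samples whose predicted point group belongs to the true crystal system. The grouping of the 32 point groups into 7 crystal systems is standard crystallography; the function names and the labels are made up for the example and are not results from the project.

```python
# The 32 crystallographic point groups (Hermann-Mauguin symbols, with "-"
# marking a rotoinversion axis), grouped by crystal system.
CRYSTAL_SYSTEMS = {
    "triclinic": ["1", "-1"],
    "monoclinic": ["2", "m", "2/m"],
    "orthorhombic": ["222", "mm2", "mmm"],
    "tetragonal": ["4", "-4", "4/m", "422", "4mm", "-42m", "4/mmm"],
    "trigonal": ["3", "-3", "32", "3m", "-3m"],
    "hexagonal": ["6", "-6", "6/m", "622", "6mm", "-6m2", "6/mmm"],
    "cubic": ["23", "m-3", "432", "-43m", "m-3m"],
}
SYSTEM_OF = {pg: system for system, groups in CRYSTAL_SYSTEMS.items() for pg in groups}


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)


def accuracy_given_system(true_groups, predicted_groups):
    """Point-group accuracy restricted to samples with the right crystal system."""
    kept = [(t, p) for t, p in zip(true_groups, predicted_groups)
            if SYSTEM_OF[t] == SYSTEM_OF[p]]
    if not kept:
        return float("nan")
    return sum(t == p for t, p in kept) / len(kept)


# Made-up labels, purely to exercise the functions (not results from the post).
true_groups = ["m-3m", "m-3m", "6/mmm", "4/mmm", "mmm", "-43m"]
pred_groups = ["m-3m", "-43m", "m-3m", "4/mmm", "mmm", "-43m"]
overall = accuracy(true_groups, pred_groups)
conditional = accuracy_given_system(true_groups, pred_groups)
```

The conditional figure can never be lower than the overall one, since every sample dropped by the filter is by definition a wrong point-group prediction.
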
The program was very good at predicting volume per site. It should be noted, however, that many of the structures were metal alloys: transition metals tend to retain their atomic properties when mixed with other metals, so volume per site may already be easy to guess in many cases. All of the following plots show results on the test set.

The c/a ratio did not fare as well, though there may be some signal here:

Finally, the coordination numbers. A coordination number is the number of neighboring atoms that an atom is bonded to. I don’t know how these numbers were decided (coordination number is a somewhat subjective quantity), but the program can make some decent predictions at any rate. Coordination numbers commonly range from 1 to 12, and can be higher. Notice that the mean absolute error for many elements is on the order of 0.5.


Mean absolute error per element. Blue is better, red is worse. Some elements are missing due to lack of data (e.g. all the noble gases).
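
A per-element error of this kind can be computed as the mean absolute error, grouped by element. The sketch below shows the bookkeeping; the function name and the site-level numbers are invented for illustration and do not come from the project.

```python
from collections import defaultdict


def mae_per_element(records):
    """Mean absolute error per element.

    records: iterable of (element, true_cn, predicted_cn) tuples, one per site.
    """
    errors = defaultdict(list)
    for element, true_cn, predicted_cn in records:
        errors[element].append(abs(true_cn - predicted_cn))
    return {element: sum(errs) / len(errs) for element, errs in errors.items()}


# Made-up site-level predictions (not data from the post).
records = [
    ("Na", 6, 6.0),
    ("Na", 6, 7.0),
    ("Cl", 6, 5.5),
    ("Ti", 6, 6.5),
    ("Ti", 12, 11.0),
]
per_element = mae_per_element(records)
```

Elements with no sites in the test set simply do not appear in the result, which corresponds to the blank entries in a periodic-table plot.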

Conclusion

Not bad results for a few evenings’ worth of work. The next step, if I have time, is to tune the parameters of the random forest model. It would also be nice to have a program to generate a few reasonable example structures from the predicted structural characteristics…
