“Every time you miss a protein crystal, because they are so rare, you risk missing on an important biomedical discovery.”
– Patrick Charbonneau, Duke University Dept. of Chemistry and Lead Researcher, MARCO initiative.
Protein crystallization is a key step to biomedical research concerned with discovering the structure of complex biomolecules. Because that structure determines the molecule’s function, it helps scientists design new drugs that are specifically targeted to that function. However, protein crystals are rare and difficult to find. Hundreds of experiments are typically run for each protein, and while the setup and imaging are mostly automated, finding individual protein crystals remains largely performed through visual inspection and thus prone to human error. Critically, missing these structures can result in lost opportunity for important biomedical discoveries for advancing the state of medicine.
In collaboration with researchers from the MAchine Recognition of Crystallization Outcomes (MARCO) initiative, we have published “Classification of Crystallization Outcomes using Deep Convolutional Neural Networks” in PLOS One (ArXiv preprint), in which we discuss how we used some of the most recent architectures of deep convolutional networks and customized them to achieve an accuracy of more than 94% on the visual recognition task of identifying protein crystals. In order to spur further research in this area, we have made the data freely accessible, and open-sourced our model as part of the TensorFlow research model repository, and available to researchers as a Cloud ML Engine endpoint.
|Image of protein crystal, courtesy of the MARCO repository (CC-BY-4.0 license)|
The MARCO initiative is a joint project between several pharmaceutical companies and academic research centers to pool and host a large repository of curated crystallography images, and make them available to the community to help develop better image analysis tools. When a member of the initiative reached out to Google with a well-defined problem, and half a million labelled images, we embraced the challenge of trying to apply the recent advances in deep learning to the problem.
Due to the large variability between imaging technologies and data acquisition approaches, coming up with a single approach to the visual recognition problem may appear daunting. Crystals can be very small, which makes them rare structures in a large image containing otherwise undifferentiated visual clutter.
|Samples from the MARCO repository, illustrating the degree of variability between data sources.|
Fortunately, given sufficient training data, modern deep convolutional networks are well suited to handle extreme variability in visual appearance. We modified the basic Inception V3 model to handle larger images while still being able to be trained quickly. The model achieves a level of precision and recall that makes its use practical in automated assessment pipelines.
This work is a great example of the effectiveness of multi-institutional collaborations aimed at solving problems that require data in amounts and level of diversity that no single collaborator has access to. We invite researchers to take advantage of these resources that are the result of this work and share what they learn. This research was conducted as a personal 20% project by the author. To learn more about this work, please see our paper here and read the recent Duke Research Blog post.