Post 14: 200 Million new protein structures! 😱

1 minute read

Published:

 

The day has come! About a year ago, AlphaFold2 was released, and its creators promised to create a database with predicted proteins counted in the millions. Currently, humanity has managed to have just under 200,000 structures through experimental methods. Today (28.07.2022), approximately 214,000,000 protein structure predictions were released. This makes the plot from this post look very different.

Previously, other work had already made protein predictions in the millions using AlphaFold2; for example, Julia’s team predicted about 200,000, then Zeming’s team predicted 1,000,000 structures, and finally, Chloe’s team predicted 12,000,000. In each case, the vast majority of the predictions are poor, but being so massive, there are thousands of proteins with very good confidence in their predictions, and even new foldings (counted in the hundreds) which remain to be confirmed by experimental methods.

If we consider that only 3% of the predictions will have high confidence (an estimate based on the data from the three examples mentioned), it means that at least ~6,000,000 of predicted structures will provide truly useful information to scientific work and possibly point to new interesting foldings. And I say this as an estimate since the 214M proteins weigh ~23TB of storage (according to the repo) 😬. We will have to wait for the data analysis.

It’s very exciting!

Refs:

  1. (~200M protein structures) AlphaFold reveals the structure of the protein universe

  2. (Julia paper) Sequence-structure-function relationships in the microbial protein universe

  3. (Zeming paper) Language models of protein sequences at the scale of evolution enable accurate structure prediction

  4. (Chloe paper) Learning inverse folding from millions of predicted structures