Post 31: How many sequences do we have so far? 🤔

less than 1 minute read

Published:

 

Well, no one knows for sure.

There are many genomic repositories, each with proteins from different environments and species. One of the most famous repositories is NCBI-nr, which is usually used when we do a BLAST; this repository has ~465 million sequences.

But, is it the largest? Not at all. As far as I know, the largest repository so far is called DeepClust and contains ~19 billion sequences, which can be grouped into ~335 million representative sequences (clusters).

And what about protein structures? The Protein Data Bank has ~200,000 structures resolved with different methods (X-ray, Cryo-EM, etc.), but thanks to artificial intelligence that predicts protein structures from the sequence, such as AlphaFold2 or ESMFold, the largest repository with information on structures is now ESM-Atlas, with about 617 million structural predictions, weighing about 15TB of storage.

So now there’s plenty of data… 🤯

img

Refs:

  1. Interesting pages:
  2. The DeepClust paper