Post 9: LOGAN: The Biggest Bioinformatics Work of the Year

Published:

 

When people talk about big things, they usually reach for astronomical scales. In the world of data, however, what is truly big is genomics, even more so than Twitter or YouTube. Around the world, about 1,000 laboratories are capable of sequencing DNA, and each one produces a vast amount of information every year.

Sequenced DNA is stored in many databases. One of them is the Sequence Read Archive (SRA), which stores raw sequencing data from humans, animals, and microbes. When analyzing the SRA, we usually filter samples by their origin to delimit the set we will analyze, for example, all mouse samples. Ideally, though, we would process all the information together. This is what the LOGAN project accomplished.

In LOGAN, the authors processed 96% of the SRA using Amazon’s cloud computing. The computing power required is so massive that it is equivalent to about 2 million ordinary computers; if a single computer had to do everything, it would take around 3,400 years! To achieve this, they designed a cloud setup capable of scaling with the data. Because there were so many DNA sequences, they had to organize (assemble) them into larger pieces called unitigs, and then assemble the unitigs into even larger DNA fragments called contigs.
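To make the idea of a unitig more concrete, here is a minimal sketch in Python. It is a toy illustration and not the software LOGAN used: the function name `unitigs`, the tiny example reads, and the choice of k = 3 are all assumptions made for this post, and the sketch ignores reverse complements and sequencing errors, which real assemblers must handle. The reads are cut into short words of length k (k-mers), words that overlap by k-1 letters are linked, and every stretch without a fork is merged into a single sequence.

```python
def unitigs(reads, k):
    """Return the unitigs (maximal non-branching paths) of a toy de Bruijn graph."""
    # Nodes are the distinct k-mers seen in the reads.
    nodes = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    # An edge links two k-mers that overlap by k-1 letters.
    succ = {n: [n[1:] + c for c in "ACGT" if n[1:] + c in nodes] for n in nodes}
    pred = {n: [c + n[:-1] for c in "ACGT" if c + n[:-1] in nodes] for n in nodes}

    def extend(start, visited):
        # Walk forward while the path neither forks nor merges.
        path, cur = start, start
        visited.add(start)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path += nxt[-1]  # each step adds one new letter
            visited.add(nxt)
            cur = nxt
        return path

    visited, result = set(), []
    # First pass: start from every k-mer that cannot be merged into its predecessor.
    for n in sorted(nodes):
        inside = len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1
        if not inside:
            result.append(extend(n, visited))
    # Second pass: whatever is left belongs to isolated cycles.
    for n in sorted(nodes):
        if n not in visited:
            result.append(extend(n, visited))
    return sorted(result)


# Two hypothetical reads that share a prefix and then diverge.
print(unitigs(["ATGGCGT", "ATGGCAC"], 3))
# ['ATGGC', 'GCAC', 'GCGT']
```

The two reads agree on ATGGC and then split, so the shared part becomes one unitig and each branch becomes its own. Resolving such forks to build longer contigs is a separate and harder step that this sketch does not attempt.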

It is preferable to work with structured information like contigs, because from them we can trace the taxonomic origin of the DNA or map the proteins it encodes. When assessing how much new information LOGAN represents compared to the best-known resource, NCBI (which includes databases such as nt and WGS), the increase reached tens of times in some cases. For instance, data from the human microbiome increased by about 20 times. And all this information is available to everyone!

With LOGAN, we will surely discover new biology, and it may also be used to train future AI models as large as the one behind ChatGPT (i.e., GPT-4).

