A roadmap for AI-driven protein design

Course, YouTube, 2026

I created this free course consisting of 10 lectures to introduce you to AI-driven protein design.

General description
Course organization
Access to the slides
How to support this project
About me

Note: there is a Spanish version of this page.

General description

I want more people to learn how to design proteins using Artificial Intelligence (AI). However, I have encountered three main problems:

There is a large amount of information, and it is not clear where to start or which topics are necessary.
There are no comprehensive online courses on this topic in Spanish.
Courses related to this field are usually expensive for most students in Latin America.

To address these issues, I created this free 37-hour course, distributed across 10 lectures, to introduce you to AI-driven protein design. The course includes two main resources:

The 10 lectures on YouTube
A GitHub repository with the following resources:
1. Tools: libraries organized into 25 categories
2. Learning resources: courses, tutorials and useful publications organized into nine categories
3. Databases: resources to download genomic and protein data organized into 12 categories
4. Lectures: links to each lecture and to download the slides
5. YouTube: recommended channels and videos to learn about proteins, mathematics, and data science

Course organization

The lectures are organized from 01 to 10 to facilitate conceptual understanding. For example, reviewing AlphaFold requires knowledge of structural biology and deep learning, which are covered in detail in their respective lectures. Below is a brief description of each lecture and its topics:

Basic computing concepts: how CPUs and GPUs work, as well as the essential software for data analysis.
1. Where does your journey begin?
2. Hardware
  - CPU
  - GPU
3. Software
  - Linux/Bash and GitHub
  - Python
Machine learning: what AI is and its subfields, the current capabilities of algorithms, and how a model is trained.
1. Current state of AI
2. How AI learns
  - Patterns
  - Machine learning operations (MLOps)
  - Learning paradigms
3. How to train a model
  - Data processing
  - How to choose a model
  - Training process
Deep learning: how neural networks work, the different types of neural networks, and the software used to work with them.
1. Neural networks
  - Neurons
  - Deep learning
  - Loss functions
  - Backpropagation
  - Optimizers
  - Architectures
  - Explainability (why) and Interpretability (how)
2. Deep learning frameworks
Transformers and language models: how Transformers and modern language models work.
1. Language models
2. Transformers
  - Original architecture
  - BERT and GPT architectures
  - Scaling laws
  - Pre-training and post-training
  - Reinforcement learning
3. Performance and generalization
  - Benchmark saturation
  - Hype
4. How to work with LLMs
  - Optimization techniques (for the GPU-poor, like us)
  - Hugging Face and Software 2.0
Protein structure: principles of structural biology and organization.
1. Structural organization
  - Amino acids
  - Secondary and tertiary structure
  - Experimental workflow for structure determination
  - Structure viewers
2. Classifications
  - Folds and domains
  - First classification schemes
  - Similarity metrics
  - Sequence and structural divergence
  - Current classification schemes
3. The shape of the protein universe
  - Uneven distribution
  - Complex homologous relationships
  - Switch folds
Protein function: how proteins adopt their structure and how function is regulated.
1. Protein folding
  - Cellular environment
  - Thermodynamics and conformational entropy
2. Protein function
  - Diffusion
  - Molecular dynamics and energy functions
  - Enzymes
  - Functional annotation
3. Functional regulation
  - Allosterism
  - Transcriptional regulation
  - Post-translational modifications
  - Proteostasis and host physiology
Protein evolution: origin and diversification from simpler peptides.
1. Levels of biological organization
  - Evolution across spatio-temporal scales
  - Chemical evolution
2. Biological evolution
  - RNA world hypothesis and ribosome evolution
  - Ancestral proteins
  - Protein diversification
3. The sequence space
  - Mutations
  - Robustness, evolvability and promiscuity
  - Evolution of protein function
4. Epistasis: how interactions shape evolution
  - Residue-residue and protein-protein interactions
  - Randomness of mutations
AlphaFold: overview of AF2 and AF3 architectures and impact.
1. The impact of AlphaFold
  - AlphaFoldmania
  - Protein structure prediction before AlphaFold
2. AlphaFold
3. AlphaFold2
  - Protein language models
  - Architecture
  - Post-AlphaFold2 era
4. AlphaFold3
  - Diffusion models for macromolecular modeling
  - Architecture
  - Post-AlphaFold3 era
AI-driven protein design: motivations and modern AI methods.
1. Protein design
  - AI in the biotech market
  - Advances from classical methods to AI-driven methods
  - Basic considerations to increase the success of a design
2. Rational design
  - Classic experimental and bioinformatic approaches
  - Macromolecular modeling and recombineering
3. Evolutionary design
  - Directed evolution, ancestral sequence reconstruction and consensus design
4. Representation learning
  - (Macro)Molecular representations and Foldseek
  - Protein language models and ESMFold
  - Explainability and interpretability of protein language models
  - Scaling laws and multimodality in protein language models
5. Generative AI
  - Integration of multimodal data
  - Sequence generation
  - Generalization and fitness prediction with protein language models
  - Inverse folding and ProteinMPNN
  - Structure generation with diffusion models
  - Model selection and computational scoring of candidates
  - Model generalization and synthetic data
6. Summary
Data and biases: relevant databases and data processing.
1. Big data is Omics
  - Properties of a good dataset
2. Main datasets
  - PDB
  - UniProt
  - NCBI datasets
  - Other interesting datasets
3. Data processing
  - Data cleaning in biology
  - Basic tools for biological data manipulation
  - Data splitting
4. Generalization in (protein) biology
  - Data leakage and other inherent issues
5. Biases in the data
6. A roadmap for AI-driven protein design

Access to the slides

This course includes more than 800 slides. The notes section of each slide lists image sources, citations, and recommended resources for deeper study. I recommend reviewing the slides using PowerPoint. You can download the slides from Zenodo (direct download links) and Google Drive:

Topic	Slides	YouTube
Basic computing concepts	Drive, Zenodo	Video
Machine learning	Drive, Zenodo	Video
Deep learning	Drive, Zenodo	Video
Transformers and language models	Drive, Zenodo	Video
Protein structure	Drive, Zenodo	Video
Protein function	Drive, Zenodo	Video
Protein evolution	Drive, Zenodo	Video
AlphaFold	Drive, Zenodo	Video
AI-driven protein design	Drive, Zenodo	Video
Data and biases	Drive, Zenodo	Video

I am releasing these slides to give people access to material for deeper learning. If you are a teacher and have adopted this material for your lectures, please let me know. I would love to learn how you improved the course and to know that more people have learned about protein science.

However, if you find that someone has plagiarized this course in whole or in part and is charging money for access, please let me know. Developing this material took a lot of time and effort, and plagiarism is a serious breach of professionalism and ethics.

How to support this project

If you found this course useful and would like to support it financially, you can donate via PayPal. You can donate any amount; the suggested amounts are USD 12, 30, or 45 (based on the economic reality of students in Latin America). Click the image below to donate.

If you cannot donate but would like to express your gratitude, you can send me your comments by email: gamamiguelangel@gmail.com

Finally, I would appreciate it if you shared this course with interested colleagues or reposted the official course announcement on social media:

About me

I’m Miguel Angel González Arias. I’m a Mexican biologist interested in proteins, microbes, and computation. For more details about me, my social networks, and other contact information, please visit the following page:

About me
