Project CodonCraft: Transformer-powered codon optimization

CodonCraft is a project that uses a transformer, a deep learning model, to optimize codon usage for a specific target organism.

A codon is a sequence of three DNA bases (A, T, C, or G) that codes for a specific amino acid, the building block of proteins, or for a stop signal. Most amino acids can be encoded by more than one codon, and different organisms prefer different codons for the same amino acid. This can cause problems, such as poor protein expression, when a gene is introduced from one organism into another. This is where codon optimization comes in: it is the process of recoding a gene to match the target organism's codon preference without changing the protein it encodes. Traditionally, this was done with simple probability models, but more advanced models such as the transformer may make the process more accurate and efficient.
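To make the traditional approach concrete, here is a minimal sketch of a frequency-based baseline: count how often each codon appears in a set of reference genes from the target organism, then replace every codon in the input gene with the most frequent synonymous one. The function names (`split_codons`, `preferred_codons`, `optimize`) and the toy reference sequence are hypothetical and only serve as illustration. This is not the project's own code, and real baselines usually work from a genome-wide codon usage table.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding DNA sequence into a list of three-base codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def preferred_codons(reference_genes):
    """For each amino acid, pick the codon seen most often in the reference genes."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    best = {}
    for codon, aa in CODON_TABLE.items():
        # Keep the first codon on ties, so unseen amino acids still get a codon.
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def optimize(gene, preferred):
    """Recode a gene with the preferred codons; the encoded protein is unchanged."""
    return "".join(preferred[CODON_TABLE[codon]] for codon in split_codons(gene))


# Toy reference set (hypothetical, NOT real S. cerevisiae usage data).
reference = ["ATGGCTGCTGCCAAATAA"]
preferred = preferred_codons(reference)
optimized = optimize("ATGGCCGCAAAGTGA", preferred)
```

A baseline like this treats every codon independently of its neighbours. A sequence model can, in principle, take the surrounding codons into account when choosing a codon.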

The transformer is a deep learning architecture introduced in 2017 by Google researchers. It is widely used in natural language processing tasks such as language translation, text summarization, and question answering. The transformer combines a self-attention mechanism, which lets it weigh the importance of different parts of the input against each other, with positional encoding, which tells it where each element sits in the sequence. Together these let it handle sequential input of varying lengths, which makes it a promising fit for biological sequences as well. In this project, a transformer is trained to learn the codon "language" of the S. cerevisiae lab strain CEN.PK113-7D. The goal is to optimize codon usage for this specific organism, making it more efficient and cost-effective for researchers to introduce new genes into it.
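The two ingredients named above can be sketched in a few lines of plain Python. The sketch assumes that a gene is tokenized into codons from a 64-entry vocabulary and that positions are encoded with the sinusoidal scheme from the original transformer paper. The project's actual tokenization and encoding are not stated here, so both are assumptions. The attention function is also simplified: it computes single-head scaled dot-product weights directly on the input vectors and leaves out the learned query, key, and value projections of a real transformer. All names (`CODON_VOCAB`, `tokenize`, `positional_encoding`, `self_attention_weights`) are hypothetical.

```python
import math
from itertools import product

# Hypothetical vocabulary: each of the 64 codons gets an integer id.
CODON_VOCAB = {"".join(c): i for i, c in enumerate(product("TCAG", repeat=3))}


def tokenize(gene):
    """Turn a coding DNA sequence into a list of codon token ids."""
    gene = gene.upper()
    if len(gene) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [CODON_VOCAB[gene[i:i + 3]] for i in range(0, len(gene), 3)]


def positional_encoding(length, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(vectors):
    """Scaled dot-product attention weights, without learned projections.

    Row i holds how strongly position i attends to every position.
    """
    d = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


tokens = tokenize("ATGGCTAAATAA")              # four codons
positions = positional_encoding(len(tokens), 4)
attention = self_attention_weights(positions)  # 4 x 4 matrix, rows sum to 1
```

In a full model, each token id would first be mapped to a learned embedding vector, the positional encoding would be added to it, and the attention weights would be used to mix learned value vectors across positions.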

This project page consists of a code page and an experiment page. The code page shows the most recent code used to build the model. The experiment page shows the results of the experiments run with that code, described in subsections in chronological order. At this point I am still in the experimental phase, trying to get the model to meet the requirements. Once the requirements are met, the model should be validated to see whether its theoretical improvement over the probabilistic models also has an effect on protein expression.