EXP. Vol1

Base Model

The hyperparameters of the base model are listed in Table 1. The base model is used as a benchmark to see how the model's performance is influenced when the hyperparameters are changed, with the ultimate goal of finding a set of hyperparameters that makes the model work well. The hyperparameters of the base model are based on those of the paper 'Attention Is All You Need' (Vaswani et al., 2017).

The train dataset consists of coding codon sequences (target) and the corresponding amino acid sequences (input) from multiple species across all walks of life, with a maximum length of 100 codons (tokens); it contains around 8000 sequences. The validation dataset consists of 500 coding sequences from S. cerevisiae, a species not included in the train dataset, with a maximum length of 1000 tokens.

Table 1: The hyperparameters of the base model

Figure 1 shows the train and validation loss of the base model over 1000 epochs. The training loss flattens out at 1.2 after approximately 200 epochs. The validation loss decreases at a lower rate than the train loss but keeps decreasing and drops below the train loss at epoch 650. After 1000 epochs both loss values are still very high, which indicates that the model still has much to learn and that some hyperparameters need to be changed before the model can be successful.

Figure 1: The train loss and validation loss of the base model over 1000 epochs.


To check how far the model has come after 1000 epochs, it is given the following input; the predicted codons are translated back to amino acids, shown as the output. The result is gibberish, meaning we are still far away from a good result.

Input:

'MYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERY*

Output:

AHIFTSG*RTGN*LVGRVWHTCAPFTLSMTISSFHAAL*SIIASPHHPIYAYRRA*HSAAIGIAAAVSAAIHAAAFDH

Attention head experiment

The first hyperparameter to be changed is the number of attention heads. The higher this value, the more parallel attention heads each self-attention layer in the encoders and decoders has. So what effect could increasing the number of attention heads have? Firstly, it increases the capacity of the model to focus on different aspects of the input by allowing it to attend to multiple parts of the input at the same time. Each attention head is responsible for attending to a specific subset of the input, and having more heads allows the model to attend to more subsets. This can help the model better understand the input and make more accurate predictions. Secondly, increasing the number of attention heads can also increase the overall computation time of the model, as each attention head requires additional computation. Lastly, having more attention heads can help the model be more robust to overfitting, as each head can learn a different representation of the input. Table 2 shows the hyperparameters of the model. It turned out that doubling the number of attention heads increased the time per epoch from ~34 s to ~41 s.
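As a minimal sketch of what this change amounts to in PyTorch (the parameter names and values here are illustrative, not necessarily those of our own training script), the number of heads is the nhead argument of nn.Transformer:

import torch.nn as nn

# Minimal sketch (PyTorch); values are illustrative, see Table 2 for the
# actual hyperparameters. d_model must be divisible by nhead, so doubling
# the number of heads halves the dimensionality each head attends over
# rather than adding extra layers.
base = nn.Transformer(d_model=512, nhead=8)      # base configuration
doubled = nn.Transformer(d_model=512, nhead=16)  # doubled attention heads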

Table 2: The hyperparameters of the model with a doubled number of attention heads

Figure 2 shows the train and validation loss of the model with the increased number of attention heads. The training loss fluctuates a lot more than that of the base model, especially within the first 200 epochs. The validation loss does not decrease between approximately epoch 60 and 140 and then decreases slightly until epoch 800. Overall the results are relatively similar to those of the base model.

Figure 2: The train loss and validation loss of the model with a doubled number of attention heads over 1000 epochs.


The output of the model is still gibberish.

Input:

'MYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERY*

Output:

*LCSRPEGR*R*R

Increased learning rate

For this experiment the learning rate is increased ten-fold. A higher learning rate can cause the model to converge on a good set of weights faster, because the weights are updated in bigger steps, allowing quicker progress towards a good solution. However, a high learning rate can also cause the model to overshoot the optimal set of weights, which can lead to a sub-optimal solution.
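As a sketch, with a PyTorch optimizer this change is a single argument (the base value below is illustrative; the actual learning rate is listed in Table 1):

import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the transformer

base_lr = 1e-4  # illustrative; see Table 1 for the actual base value
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr * 10)  # ten-fold increase
# The decreased-learning-rate experiment further below uses base_lr / 10 instead.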

Table 3: The hyperparameters of the model with a ten-fold increased learning rate

The results show that the loss is not decreasing, most likely because the learning rate is too large and the model keeps overshooting. The experiment was therefore stopped prematurely.

Figure 3: The train loss and validation loss of the model over 40 epochs.

Decreased learning rate

The last experiment showed that the model was overshooting and not learning anything. This made me think that maybe the initial learning rate is also too high. Therefore, the learning rate is decreased ten-fold. A lower learning rate causes the model's weights to be updated in smaller steps, which can help the model converge on a good set of weights more slowly but more robustly. This means that the model takes longer to train, but is less likely to overshoot the optimal solution and more likely to find a better-generalizing solution. Table 4 shows the hyperparameters used in this experiment.

Table 4: The hyperparameters of the model with a ten-fold decreased learning rate

Figure 4 presents the loss graph, which shows the training and validation loss throughout the model's training process. The training loss reaches a low value of 0.08, indicating that the model performs well on the training data. However, the validation loss starts to increase after epoch 140, suggesting overfitting. There are several possible explanations for this. One possibility is that the validation dataset includes sequences longer than 100 tokens (codons), while the model may have learned to limit its predictions to sequences no longer than 100 tokens, resulting in a higher validation loss. Another explanation is the choice of loss function: the cross-entropy loss used in the model penalizes a wrong codon that results in the same amino acid just as heavily as a codon that translates to a completely different amino acid. Both explanations are worth considering.

Figure 4: The train loss and validation loss of the model over 1000 epochs.


For the first time, the output of the model on the test sentence shows similarities with the test sentence. We also checked whether the model output was better at epoch 160 (where the validation loss was lowest); it showed no improvement, which makes it very likely that we have a problem with the loss function (see the snippet after the outputs below).

Input:

'MYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERYTRICKINGTHEENEMYMYCRAFTYPLANWASCLEVERY*

Output epoch 1000:

TRAFT*PLANWASCLEVERYTRICKINGPCGMIE

Output epoch 160:

YSERTNPYAMACNEVRTYRITCKIGNGYMENYNLC*AWIPEH*YLQFCEAIPSVSLGNDRTRNEYRCN 
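To make the loss-function concern concrete, here is a tiny, self-contained illustration (the codon indices are hypothetical): plain cross-entropy only looks at the probability assigned to the target codon, so it assigns exactly the same loss whether the misplaced probability mass goes to a synonymous codon or to one coding for a completely different amino acid.

import torch
import torch.nn as nn

target = torch.tensor([0])                               # true codon index
synonymous_wrong = torch.tensor([[0.1, 3.0, 0.1, 0.1]])  # mass on codon 1
different_aa = torch.tensor([[0.1, 0.1, 0.1, 3.0]])      # mass on codon 3

ce = nn.CrossEntropyLoss()
print(ce(synonymous_wrong, target))  # prints the same loss value...
print(ce(different_aa, target))      # ...as this line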


Increased dropout rate

Dropout is a regularization technique that aims to prevent overfitting by randomly dropping out (i.e. setting to zero) a certain percentage of the neurons in the model during training. When the dropout rate is increased, more neurons are dropped out during each training step, which reduces the model's capacity to memorize the training data and forces it to learn more robust features. This can help the model generalize better to new data. One negative effect of increasing the dropout rate is that the model may become effectively under-parameterized: when the rate is set too high, so many neurons are dropped out during each training step that the model's capacity to learn the underlying patterns in the data is reduced. This can lead to underfitting, where the model is not complex enough to accurately capture the relationships in the data. Table 5 shows the hyperparameters of the model.
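As a minimal sketch (the rate here is illustrative; the actual values are listed in Table 5), in PyTorch the dropout rate of nn.Transformer is a single constructor argument, applied inside the attention and feed-forward sublayers and active only in training mode:

import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8, dropout=0.3)  # illustrative rate
model.train()  # dropout active: random activations are zeroed each step
model.eval()   # dropout disabled for validation/inference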


Table 5: The hyperparameters of the model with an increased dropout rate

*changed parameter compared to the base model


Loss function and validation set update


The updated loss function is shown below. The goal is to penalize the model less when it predicts the wrong codon but that codon codes for the same amino acid; this shifts the penalty towards predicting the wrong amino acid, which is detrimental for codon optimization. The function takes a weight, the factor by which the loss is multiplied to reduce it when the predicted and target codons are from the same group. The function also takes a dictionary indicating which token (codon) is part of which group. Internally it uses nn.CrossEntropyLoss to compare the output and the target.

Figure 7: The code of the CustomCrossEntropyLoss
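Since the code in Figure 7 is not reproduced here, the following is a minimal sketch of what such a loss could look like based on the description above; the class and argument names, the argmax-based group check, and the default weight are assumptions, not necessarily the actual implementation.

import torch
import torch.nn as nn

class CustomCrossEntropyLoss(nn.Module):
    # Sketch of the described loss: errors within the same codon group
    # (synonymous codons) are scaled down by a weight factor; all names
    # and the argmax-based group check are assumptions.
    def __init__(self, token_to_group, weight=0.5):
        super().__init__()
        self.token_to_group = token_to_group  # dict: codon token id -> group id
        self.weight = weight                  # factor < 1 for same-group errors
        self.ce = nn.CrossEntropyLoss(reduction="none")

    def forward(self, output, target):
        # output: (N, vocab) logits, target: (N,) codon token ids
        loss = self.ce(output, target)  # per-token loss
        pred = output.argmax(dim=-1)    # most likely codon per position
        pred_groups = torch.tensor([self.token_to_group[int(t)] for t in pred])
        target_groups = torch.tensor([self.token_to_group[int(t)] for t in target])
        # Wrong codon but same amino-acid group -> reduce the penalty.
        same_group = (pred_groups == target_groups) & (pred.cpu() != target.cpu())
        loss = torch.where(same_group.to(loss.device), loss * self.weight, loss)
        return loss.mean()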

The validation set was updated to use the coding sequences of S. cerevisiae (CEN.PK113-7D) with a length of no more than 100 amino acids/codons. This way the validation set is more similar to the train set, and more space is available on the GPU, making it possible to increase the batch size, which can shorten epoch times and have a regularizing effect. The validation set consists of 478 sequences.

EXP. Vol2

Updating evaluation method

Table 6: The hyperparameters of the model used in this experiment