Summary
Picture2Protein is a playful project with no serious purpose. It was created to have some fun with the AlphaFold algorithm developed by DeepMind. Picture2Protein converts any picture into an amino acid sequence and shows you the predicted protein structure. Figure 1 shows a schematic of the steps used by Picture2Protein.
Figure 1: The schematic of Picture2Protein.
The first step is cropping the image. The number of pixels needs to be reduced so that the total is around 500. Next, grayscaling converts the picture's RGB values to grayscale values, which range from 0 (pitch black) to 255 (pure white). This leaves us with a matrix of around 500 elements containing numbers between 0 and 255. These values are then converted to one-letter abbreviations of amino acids. This conversion is done by an algorithm; for more details on how it works, see the section "The algorithm" below.
Introduction
The 3D structures of proteins are beautiful. A protein consists of an amino acid sequence that folds into a 3D structure consisting of alpha helices, beta strands, and random coils. Right now, anyone can predict the 3D structure of any amino acid sequence thanks to DeepMind's AlphaFold algorithm. This technique will revolutionize the field of molecular biology. It makes it possible to design de novo enzymes, meaning you do not need to rely on nature to make medicines or enzymes. The possibilities are endless! So, of course, you can also use it to do stupid things like turning pictures into proteins. This project consists of a Python-based script that can convert any image into an amino acid sequence. The resulting amino acid sequence is then used to predict the 3D structure of the protein.
The algorithm
The algorithm needs to convert information from pictures into an amino acid sequence. The first step is to decide which information from the picture to use. A picture contains pixels, and pixels contain values. Therefore, the straightforward choice is to convert the pixel values into an amino acid sequence. Most pictures have a very large number of pixels. For example, a picture of 1000 by 1000 pixels has one million pixels in total. If you convert every pixel to an amino acid, you end up with a protein of one million amino acids. The biggest protein found in the human body has 38,130 amino acids, so a protein of one million amino acids is unrealistic. Another problem is that the longer a protein is, the more computing power (and time) you need to predict its structure. Experimenting with the AlphaFold Colab showed that predicting a protein longer than 750 amino acids takes too long. In conclusion, the image needs to be downscaled to a maximum of 750 pixels. Luckily, this is relatively easy to do using the Pillow library’s ‘resize’ function. Figure 1 shows an example of this downscaling.
Figure 1: An example of downscaling the number of pixels in an image.
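The downscaling step could look roughly like the sketch below. It is not the project's actual code: the helper names `target_size` and `downscale` are hypothetical, and keeping the aspect ratio while rounding down is an assumption. Only the 750-pixel limit and the use of Pillow's `resize` come from the text.

```python
import math

from PIL import Image

MAX_RESIDUES = 750  # limit from the text: one pixel becomes one amino acid


def target_size(width, height, max_pixels=MAX_RESIDUES):
    """Return (width, height) with roughly the same aspect ratio and at most max_pixels pixels."""
    if width * height <= max_pixels:
        return width, height  # already small enough
    # Scale both sides by the same factor so the area shrinks to max_pixels.
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(1, min(max_pixels, math.floor(width * scale)))
    # Cap the height so the product can never exceed max_pixels.
    new_h = max(1, min(max_pixels // new_w, math.floor(height * scale)))
    return new_w, new_h


def downscale(image, max_pixels=MAX_RESIDUES):
    """Resize a Pillow image so that it has at most max_pixels pixels."""
    return image.resize(target_size(image.width, image.height, max_pixels))


if __name__ == "__main__":
    picture = Image.new("RGB", (1000, 1000), (200, 30, 30))  # stand-in for a real photo
    small = downscale(picture)
    print(small.size, small.width * small.height)
```

For a 1000 by 1000 picture this yields a 27 by 27 image, or 729 pixels, which stays under the 750 limit.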
Each of the 600 pixels in the downscaled image holds three values, which determine the color of the pixel in the RGB system. This information could be used to select certain types of amino acids. However, grayscaling reduces each pixel to a single value: the three values are turned into one integer between 0 and 255, where 0 is pitch black and 255 is white. Figure 2 shows an example of this grayscaling.
Figure 2: An example of grayscaling an image.
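As a minimal sketch, the grayscale step can be done with Pillow's `convert("L")`, which combines the three channels into one 8-bit luma value per pixel. Whether the project uses exactly this call is an assumption, and the helper name `to_gray_values` is hypothetical.

```python
from PIL import Image


def to_gray_values(image):
    """Return the image's pixels as a flat, row-by-row list of integers from 0 to 255."""
    gray = image.convert("L")  # mode "L": one 8-bit grayscale value per pixel
    return list(gray.tobytes())  # one byte per pixel, so each item is one value


if __name__ == "__main__":
    picture = Image.new("RGB", (2, 1))
    picture.putpixel((0, 0), (0, 0, 0))        # pitch black
    picture.putpixel((1, 0), (255, 255, 255))  # pure white
    print(to_gray_values(picture))
```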
The pixel values must be converted to one of the 20 amino acid types found in proteins. To develop an algorithm for this, we must first examine a series of images to determine the distribution of these pixel values. Figure 3 depicts three images and their pixel value distributions.
Figure 3: Three examples of images and their distribution of pixel values.
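A distribution like the ones in Figure 3 can be tallied with a few lines of plain Python. This is only an illustrative sketch: the function names are hypothetical, and the input is assumed to be a flat list of grayscale values between 0 and 255.

```python
def value_histogram(values, levels=256):
    """Count how often each grayscale value (0 to levels - 1) occurs."""
    counts = [0] * levels
    for value in values:
        counts[value] += 1
    return counts


def fraction_in_range(counts, low, high):
    """Return the fraction of pixels whose value lies in [low, high]."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[low:high + 1]) / total


if __name__ == "__main__":
    values = [0, 120, 128, 131, 140, 255]  # made-up example values
    counts = value_histogram(values)
    # Share of pixels in the middle half of the value range.
    print(fraction_in_range(counts, 64, 191))
```

For a grayscale Pillow image, `Image.histogram()` returns the same kind of 256-entry list directly.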
The picture on the left has a pixel value distribution that resembles a normal distribution skewed to the darker side (left), which reflects its relatively dark colors. The second picture shows the Dutch flag. Only three different values are found. Interestingly enough, the white band of the flag is exactly one row of pixels larger than the red and the blue bands. This may be caused by the downscaling of the image. The final picture is very colorful and shows a flat-topped distribution around the middle. This visualization shows that for most pictures, the chance of getting a value around the center (128) is greater than the chance of getting a value at the edges (0 or 255). This is important information for converting the pixel values to amino acids.
There are 20 types of amino acids present in proteins, and not every type is equally abundant. An important characteristic of amino acids is their hydrophobicity. Hydrophobic amino acids do not like water and tend to shield themselves from it. Polar amino acids are hydrophilic and like water. Table 1 shows the abundance and hydrophobicity of the 20 amino acids.
Table 1: The amino acid abbreviations, their abundance in proteins, and their hydrophobicity properties.
With the information provided in Figure 3 and Table 1, it is possible to design an algorithm based on the abundance and type of each amino acid (polar, indifferent, or hydrophobic). Let's say the dark side of the spectrum is hydrophobic and the light side is polar. We begin at the edges and work our way to the center, filling in the hydrophobic and polar amino acids. The indifferent amino acids are used to fill up whichever of the two types is lagging behind. This leaves us with the following distribution of amino acids over the value space (see the left panel of Figure 4).
Figure 4: An illustration of the correlation between the algorithm's amino acid space and the pixel value space (left). The abundance of the amino acids and the found amino acids for the colorful image (right).
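The idea of dividing the value space among the amino acids can be sketched as follows. This is a simplified illustration, not the project's mapping: the order in `AMINO_ACID_ORDER` is a hypothetical placeholder (it only follows the text in putting W and Y at the dark end and Q and N at the light end), and the equal-width bins are an assumption. The real assignment follows Table 1 and the left panel of Figure 4.

```python
# Hypothetical order from dark (value 0) to light (value 255).
# Placeholder only: the project's real layout is the one shown in Figure 4.
AMINO_ACID_ORDER = "WYMFCIVLAGPTSHDEKRQN"


def value_to_amino_acid(value, order=AMINO_ACID_ORDER):
    """Map one grayscale value (0 to 255) to a one-letter amino acid code."""
    if not 0 <= value <= 255:
        raise ValueError(f"grayscale value out of range: {value}")
    # Assumption: equal-width bins, so each letter covers 256 / len(order) values.
    return order[value * len(order) // 256]


def values_to_sequence(values, order=AMINO_ACID_ORDER):
    """Convert a flat list of grayscale values into an amino acid sequence."""
    return "".join(value_to_amino_acid(value, order) for value in values)


if __name__ == "__main__":
    print(values_to_sequence([0, 128, 255]))
```

With equal-width bins, letters near the center of the order are chosen more often for typical pictures, because values around 128 occur more often than values near 0 or 255.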
If you apply this algorithm to the right picture in Figure 3, you get the distribution of amino acids shown by the blue bars in the right graph of Figure 4. The orange bars show the abundance of every amino acid in an average protein. The results show that the percentage of each amino acid in our created protein sequence is very similar to what one expects from a protein found in nature. Only N, Q, W, and Y differ significantly from the expected abundance. These are all amino acids on the outskirts of the value space.
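A comparison like the one in the right graph of Figure 4 can be computed along these lines. The sketch uses hypothetical function names, and the reference abundances are passed in by the caller, because the values from Table 1 are not reproduced here.

```python
from collections import Counter


def composition_percent(sequence):
    """Return the percentage of each amino acid in a sequence."""
    if not sequence:
        return {}
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in sorted(counts.items())}


def deviation_from_expected(observed, expected):
    """Return observed minus expected percentage for every amino acid in expected."""
    return {aa: observed.get(aa, 0.0) - pct for aa, pct in expected.items()}


if __name__ == "__main__":
    observed = composition_percent("AAGL")  # toy sequence, not a real result
    print(observed)
    # The expected values below are made up for the demo; use Table 1 in practice.
    print(deviation_from_expected(observed, {"A": 40.0, "G": 30.0, "W": 5.0}))
```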
A method was developed to turn a picture into an amino acid sequence. This sequence can now be turned into a predicted protein structure using the AlphaFold algorithm. Figure 5 shows Composition II in Red, Blue, and Yellow (1929) by Piet Mondriaan and its derived protein. The gallery contains some of the other results.
Figure 5: A picture and its resulting protein structure.
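Before the sequence is handed to a structure prediction tool, it can be convenient to store it as a FASTA record, a common plain-text format for sequences. The sketch below is an assumption about a reasonable workflow, not a step described by the project; the function name and the default record name are hypothetical.

```python
def to_fasta(sequence, name="picture2protein", width=60):
    """Format a sequence as a FASTA record with lines of at most width letters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequence for the demo; a real one comes from the picture conversion.
    print(to_fasta("WPNAGLKR" * 10, name="demo"), end="")
```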
Some of the resulting protein structures are discussed further in the following section. How confident is AlphaFold in this structure? What do the various AlphaFold outputs mean? Is it possible to improve upon the current Picture2Protein algorithm?