SynthDNA: Generating Unique DNA Polymerases With a Protein Language Model

<aside> 📄

TL;DR: I used a protein language model (ProGen-2) to generate sequences for novel DNA polymerases. The model was fine-tuned on a dataset of Family B DNA Polymerases, and sequences were filtered using log likelihood, structural confidence, principal component analysis, sequence similarity, and other metrics. All 5 DNA polymerases produced through bacterial protein expression showed modest activity in PCR.

</aside>

How I Started This Project

I’ve been interested in AI for protein design for a while, and I wanted to take on a big project that was both interesting and useful. That’s when I came across DNA polymerase. This enzyme is found in all domains of life because it is responsible for a crucial task: replicating an organism’s DNA. Beyond its essential role in cells, it is also a workhorse of biology research, where it is used to amplify, sequence, and mutate DNA samples. The other reason I chose DNA polymerase is its complex chemistry. When it adds a new nucleotide to the growing DNA strand, four types of molecules are involved: protein, DNA, a small molecule (the dNTP), and metal ions. Combined with its multiple domains, complex chemical kinetics, and vast diversity, this makes DNA polymerase one of the hardest proteins to engineer with conventional methods.

The structure of Pfu polymerase, the protein I based my designs on. It is one of the most accurate natural DNA polymerases known.

I had three goals for this project. The first was to design a more accurate DNA polymerase. If this were achieved, sequencing could become more accurate, DNA-based data storage could become more reliable, and organisms that do not mutate and evolve over the course of an experiment could become more common. The second goal was to see whether protein language models can capture the features of complicated, rare proteins. The polymerase I chose to base my designs on, Pfu polymerase, is very heat-stable, retaining 90% of its activity even after an hour at 95°C. Combined with its large size (775 amino acids), this makes it highly unusual. Finally, I wanted to see whether a set of filters could be designed to maximize the number of successful designs from the model. With these goals in mind, I set off on my project.

Methods

Click here to learn more about the machine learning

Click here to learn more about the experiments
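
To give a feel for how a filter cascade like the one mentioned in the TL;DR might work, here is a minimal Python sketch. It is not the pipeline used in this project: the `Candidate` fields, the threshold values, and the `approx_identity` helper (a crude `difflib` stand-in for a proper alignment-based sequence identity) are all assumptions made for illustration, and the PCA step is left out entirely.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A generated sequence with precomputed scores (hypothetical fields)."""
    name: str
    sequence: str
    mean_log_likelihood: float  # per-residue log likelihood from the language model
    mean_plddt: float           # structural confidence on a 0-100 scale


def approx_identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a real pipeline would use an alignment tool."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def passes_filters(c: Candidate, reference: str,
                   min_ll: float = -2.0, min_plddt: float = 80.0,
                   min_id: float = 0.4, max_id: float = 0.9) -> bool:
    """Apply the filters in order; all thresholds are illustrative guesses."""
    if c.mean_log_likelihood < min_ll:   # the model finds the sequence implausible
        return False
    if c.mean_plddt < min_plddt:         # the predicted structure is low confidence
        return False
    identity = approx_identity(c.sequence, reference)
    # Keep designs that resemble the reference but are not near copies of it.
    return min_id <= identity <= max_id


def filter_candidates(candidates, reference):
    """Return the names of the candidates that survive every filter."""
    return [c.name for c in candidates if passes_filters(c, reference)]


if __name__ == "__main__":
    reference = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, not a real polymerase
    pool = [
        Candidate("novel", "ACDEFGHIKLMNPQRSAAAA", -1.5, 88.0),
        Candidate("copy", reference, -1.0, 92.0),
        Candidate("unlikely", "ACDEFGHIKLMNPQRSAAAA", -3.5, 88.0),
    ]
    print(filter_candidates(pool, reference))
```

The ordering reflects a common design choice: cheap scores such as log likelihood are checked first so that more expensive steps only run on sequences that already look plausible.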

Results

At the end of my lab term, I’m happy to say that I got some pretty good results. I was able to produce all 5 proteins I set out to make, and, barring contamination, impure protein preparations, or other confounding factors, all of them were able to amplify DNA in PCR (see the note at the bottom for more details).

Protein #1

Protein #2

Protein #3

Protein #4

Protein #5

Above are the predicted structures of my proteins (generated by BioEmu during trajectory sampling). Unfortunately, I was unable to get the assay for measuring accuracy to work, but I was still able to construct the plasmid needed for testing accuracy. Furthermore, since all five polymerases were able to amplify DNA, I would also consider the goal of designing filters a success. In summary:

Can a protein language model design polymerases with rare traits like those of Pfu? ✅
Can a set of filters be designed to maximize the number of successful designs? ✅
Can more accurate DNA polymerases be designed? ➖ (maybe)