Protocol for Getting the Sequences

Training the Model

Normally, training language models requires substantial GPU power. With a base model and parameter-efficient fine-tuning libraries, however, the cost can be reduced substantially. Here is the training process I used, which ran on a single A40 GPU hosted through RunPod:

Data Collection

In order to generate high-quality variants of Pfu polymerase, only Family B DNA polymerase sequences were selected for fine-tuning the model. These sequences were obtained from the Pfam entry PF00136 through the InterPro database. Because the ProGen2-base model can only process sequences shorter than 2022 amino acids, longer sequences were excluded from the final dataset. Furthermore, since each sequence is flanked by start and end tokens, it can be represented either from N to C terminus or in reverse, effectively doubling the training data. This left 55,758 sequences to train the model on.

The dataset I used to get sequences for finetuning ProGen2
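The filtering and doubling steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the input is assumed to be a plain list of amino-acid strings already parsed from InterPro, and the `"1"`/`"2"` start/end markers follow ProGen2's token convention (treat the exact tokens as an assumption).

```python
MAX_LEN = 2022  # ProGen2-base context limit in amino acids

def prepare_training_sequences(raw_sequences):
    """Drop over-length sequences, then emit each remaining sequence in
    both N->C and C->N orientation, doubling the training data."""
    examples = []
    for seq in raw_sequences:
        if len(seq) >= MAX_LEN:
            continue  # too long for the model's context window
        examples.append("1" + seq + "2")        # N-to-C orientation
        examples.append("2" + seq[::-1] + "1")  # reversed, C-to-N
    return examples
```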

Training the Model

Several steps were taken to train ProGen2-base on minimal GPU hardware:

| Strategy | Method | Result |
| --- | --- | --- |
| Parameter-Efficient Fine-Tuning (PEFT) | LoRA | Trainable parameters decreased from 768,784,928 to 3,981,312 (roughly 193x fewer) |
| Quantization | 4-bit precision | Reduced memory usage |
| Float Precision | bfloat16 | Half the memory of float32, with better numerical stability than float16 |
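The large drop in trainable parameters falls directly out of LoRA's low-rank factorization: instead of updating a full d x k weight matrix (d*k parameters), LoRA trains two small matrices A (d x r) and B (r x k), costing only r*(d + k) parameters. A quick sketch of the arithmetic, using illustrative dimensions rather than ProGen2-base's actual shapes:

```python
def full_params(d, k):
    """Parameters in a dense d x k weight matrix."""
    return d * k

def lora_trainable_params(d, k, r):
    """Parameters trained by a rank-r LoRA adapter on a d x k weight."""
    return r * (d + k)

# Illustrative dimensions only (hypothetical hidden size, typical LoRA rank):
d = k = 1536
r = 8
reduction = full_params(d, k) / lora_trainable_params(d, k, r)
print(f"{reduction:.0f}x fewer trainable parameters per adapted matrix")
```

Applied across only a subset of the model's weight matrices (the rest stay frozen), this is how the overall count falls from ~769M to ~4M.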

The model was fine-tuned on a single A40 GPU for 3 days, over 10 epochs with a learning rate of 1e-5, a weight decay of 0.01, 500 warmup steps, and mixed-precision (fp16) training. After training, the training loss was 1.2533 and the validation loss was 1.2094. Validation loss decreased steadily across epochs, indicating successful training.
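The 500 warmup steps mean the learning rate ramps linearly from 0 up to 1e-5 before decaying over the rest of training. A minimal sketch of that schedule (linear warmup followed by linear decay, as in Hugging Face's default scheduler; the total-step count here is hypothetical):

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given step: linear warmup to base_lr,
    then linear decay to zero by total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

# Halfway through warmup (step 250), the LR is half of base_lr.
```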

Getting the Sequences to Test

This part of the process involved running inference with the model to generate a large number of sequences, then filtering them down to the ones I wanted to test.

Generating Sequences from the Model

To generate sequences for computational and experimental classification, inference was run on the fine-tuned ProGen2 in the same configuration used for training: 4-bit quantization, LoRA, and bfloat16. Two prompting strategies provided context to the model. The first used the C terminus of wild-type Pfu polymerase (positions 351-775, inclusive) as the prompt, with both temperature and top-p set to 0.5. The second used the N terminus (positions 1-350, inclusive) as the prompt, with a temperature of 1.0 and a top-p of 0.9. Inference was done on a single A40 GPU utilizing DeepSpeed. In total, 13,930 sequences were generated with the first strategy and 5,000 with the second.
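For readers unfamiliar with the two sampling knobs: temperature rescales the logits before the softmax (lower values sharpen the distribution), and top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability reaches p. A self-contained sketch of one sampling step, not the actual inference code:

```python
import math
import random

def sample_token(logits, temperature=0.5, top_p=0.5):
    """Temperature-scaled nucleus (top-p) sampling over a list of logits.
    Returns the index of the sampled token."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of tokens whose cumulative prob >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the kept tokens, renormalized.
    mass = sum(probs[i] for i in kept)
    r = random.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature and top-p both at 0.5 (the first strategy), generation stays close to the most likely residues; at 1.0/0.9 (the second strategy), it explores far more of the sequence space.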