Protocol for Getting the Sequences
Training language models normally requires a lot of GPU power. Starting from a base model and using efficient fine-tuning libraries, however, the cost can be reduced substantially. Here is the training process I ran on a single A40 GPU hosted through RunPod:
Data Collection
To generate high-quality variants of Pfu polymerase, only Family B DNA polymerase sequences were selected for fine-tuning the model. These sequences were obtained from Pfam entry PF00136 through the InterPro database. ProGen2-base can only process sequences shorter than 2022 amino acids, so longer sequences were excluded from the final dataset. Furthermore, because start and end tokens mark the reading direction, each sequence can be represented either from N to C terminus or in reverse, effectively doubling the training data. This left 55,758 sequences to train the model on.
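The filtering and doubling steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the helper name is hypothetical, and the use of "1"/"2" as directional start/end tokens is an assumption based on ProGen2's tokenization convention.

```python
# Sketch of the dataset-preparation step (hypothetical helper name).
# Sequences at or beyond the length limit are dropped, then each kept
# sequence is also added in reversed (C-to-N) orientation.

MAX_LEN = 2022  # ProGen2-base sequence-length limit noted above

def prepare_dataset(sequences):
    """Filter overlong sequences and add the reversed orientation of each."""
    kept = [s for s in sequences if len(s) < MAX_LEN]
    doubled = []
    for s in kept:
        # Assumed convention: "1" and "2" are terminal tokens that mark
        # the reading direction, so the reversed string is a distinct,
        # valid training example.
        doubled.append("1" + s + "2")        # N-to-C orientation
        doubled.append("2" + s[::-1] + "1")  # C-to-N orientation
    return doubled

example = ["MKIV", "A" * 3000]  # second sequence exceeds the limit
print(prepare_dataset(example))
```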

The dataset I used to obtain sequences for fine-tuning ProGen2
Training the Model
Several steps were taken to train ProGen2-base on minimal GPU hardware:
| Strategy | Method | Result |
|---|---|---|
| Parameter Efficient Fine-Tuning (PEFT) | LoRA | Trainable parameters were decreased from 768,784,928 to 3,981,312 (roughly 193x fewer) |
| Quantization | 4-bit precision | Weight memory was reduced by about 4x relative to 16-bit storage |
| Float Precision | bfloat16 | Memory was halved compared to float32, with a wider dynamic range (and better numerical stability) than float16 |
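The three strategies in the table can be combined with the Hugging Face `transformers` and `peft` libraries. The sketch below is a hedged illustration, not the exact configuration used: the checkpoint name, LoRA rank, and target module names are assumptions not stated above.

```python
# Hedged sketch of the memory-saving setup from the table.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization with bfloat16 compute (table rows 2 and 3)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "hugohrban/progen2-base",      # assumed checkpoint name
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# LoRA adapters (table row 1); rank and targets are illustrative values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],   # assumed attention projection name
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```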
The model was fine-tuned on a single A40 GPU for 3 days over 10 epochs, with a learning rate of 1e-5, a weight decay of 0.01, 500 warmup steps, and fp16 precision. After training, the model had a training loss of 1.2533 and a validation loss of 1.209375. Validation loss decreased steadily throughout the epochs, indicating that training converged successfully.
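The reported hyperparameters map onto a Hugging Face `TrainingArguments` configuration roughly as below. The output path and batch size are assumptions; only the epochs, learning rate, weight decay, warmup steps, and fp16 flag come from the run described above.

```python
# The hyperparameters reported above, as a TrainingArguments sketch.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="progen2-pfu-lora",  # assumed output path
    num_train_epochs=10,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=500,
    fp16=True,                      # fp16 flag as reported above
    per_device_train_batch_size=1,  # assumed; sized for a single A40
)
```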
Generating Sequences from the Model
This stage involved running inference on the model to generate a large pool of sequences, then filtering them down to the candidates I wanted to test.
To generate sequences for computational and experimental classification, the fine-tuned ProGen2 was run for inference in the aforementioned configuration: 4-bit quantization, LoRA, and bfloat16. Two strategies were used to provide context to the model. The first used the C terminus of wild-type Pfu polymerase as the context (positions 351-775, inclusive), with both temperature and top-p set to 0.5. The second used the N terminus as the prompt (positions 1-350, inclusive), with a temperature of 1.0 and a top-p of 0.9. Inference ran on a single A40 GPU using DeepSpeed. In total, 13,930 sequences were generated with the first strategy and 5,000 with the second.
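The two prompting strategies can be summarized in code. This is an illustrative sketch only: the helper name is hypothetical, the placeholder string stands in for the actual 775-residue wild-type Pfu sequence, and the sampling calls themselves (model generation via DeepSpeed) are omitted.

```python
# Sketch of the two prompting strategies described above.
def build_prompts(wt_seq):
    """Return the two sampling configurations (positions are 1-based, inclusive)."""
    c_term = wt_seq[350:775]  # residues 351-775: context for strategy 1
    n_term = wt_seq[0:350]    # residues 1-350: prompt for strategy 2
    return [
        {"prompt": c_term, "temperature": 0.5, "top_p": 0.5, "n": 13930},
        {"prompt": n_term, "temperature": 1.0, "top_p": 0.9, "n": 5000},
    ]

wt = "M" * 775  # placeholder for the 775-residue wild-type Pfu polymerase
strategies = build_prompts(wt)
print(len(strategies[0]["prompt"]), len(strategies[1]["prompt"]))  # 425 350
```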