ProteinMPNN API
The ProteinMPNN API provides an interface to the ProteinMPNN protein design tool. This tool takes as input a PDB file and rapidly generates new sequences predicted to fold to the backbone of the input PDB.
Examples
Predict a single new sequence for an input PDB
lev engine submit protein-mpnn 1ubq.pdb
Predict 1000 new sequences for an input PDB
lev engine submit protein-mpnn 1ubq.pdb --n-mpnn-designs 1000
Predict 1000 new sequences for an input PDB using an elevated temperature (default temperature is 0.1)
lev engine submit protein-mpnn 1ubq.pdb --n-mpnn-designs 1000 --sampling-temperature 0.2
Create De Novo sequences for part of a protein structure
lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --fixed-residue-positions="A1-10 A22 A24 A26 A220-250 B30-60"
Inputs
--pdb-file
- The path to a PDB file containing the protein backbone you want to design sequences for
--n-mpnn-designs
- The number of sequences to design, default is 1
--sampling-temperature
- The sampling temperature of the model, default is 0.1. Higher values will result in more variation in the output sequences
--gpu-type
- The GPU to run the model on, default is t4. Set this to a100 if you are generating a very large number of sequences
--fixed-residue-positions
- The residues which will not be designed, space-separated list of positions in the format {chain}{startres}-{endres}
Outputs
designed_sequences.fasta
- A FASTA file containing all designed sequences. The first record in the file is the native sequence of the protein in the PDB file. The headers of the FASTA file contain score and sequence recovery values for each designed sequence
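To rank designs by score, the FASTA headers can be parsed programmatically. The following is a minimal Python sketch, assuming each header is a comma-separated list of key=value pairs such as `>T=0.1, sample=1, score=0.9000, global_score=0.9000, seq_recovery=0.5000` (the layout used by the open-source ProteinMPNN code; the layout of your file may differ). The sequences and values in the example are invented, and the function names are illustrative rather than part of the lev tooling.

```python
import re

# Numeric header fields we expect; anything else (chains, model name) is ignored.
NUMERIC_FIELD = re.compile(
    r"\b(T|sample|score|global_score|seq_recovery|seed)=(-?\d+(?:\.\d+)?)"
)

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def header_fields(header):
    """Extract the numeric key=value fields from one header line."""
    return {key: float(value) for key, value in NUMERIC_FIELD.findall(header)}

def rank_designs(text):
    """Return (score, sample, sequence) tuples sorted by score, lowest first."""
    designs = []
    for header, seq in parse_fasta(text):
        fields = header_fields(header)
        if "sample" not in fields:  # the native record has no sample number
            continue
        designs.append((fields["score"], int(fields["sample"]), seq))
    return sorted(designs)

# Invented example data in the assumed header layout.
example = """>input, score=1.6000, global_score=1.6000, fixed_chains=[], designed_chains=['A'], model_name=v_48_020, seed=37
MQIFVKTLTG
>T=0.1, sample=1, score=0.9000, global_score=0.9000, seq_recovery=0.5000
MKIYVKTLDG
>T=0.1, sample=2, score=0.8000, global_score=0.8000, seq_recovery=0.4000
MEIHVRTLSG
"""

for score, sample, seq in rank_designs(example):
    print(sample, score, seq)
```

Sorting on the score puts the designs the model considers most compatible with the backbone first; the native record is skipped because it carries no sample number.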
In-depth
Introduction
ProteinMPNN is a deep learning-based sequence design protocol. Given a protein or protein complex file as input, it can generate sequences that replace the entire structure or specific sections. When re-engineering a whole protein structure, it achieves a sequence recovery of 52.4%, compared with 32.9% for classic Rosetta (fixed backbone version), which relies on physics, statistics, and Monte Carlo-based approaches. Designs have been validated in the wet lab across various protein types, including monomers, oligomers, nanoparticles, active enzymes, and target-binding proteins.
Protein Message Passing Neural Network (ProteinMPNN) is trained to take a structure and predict which sequence is most likely to occur at every position given the structural features. It is a significant improvement over Rosetta when you intend to redesign a large section or the entire sequence of a protein. This has long been a challenging goal because full redesign commonly results in proteins that express in a highly insoluble form.
In the initial study, researchers employed ProteinMPNN to re-engineer 150 proteins. Out of these, 73 exhibited soluble expression in E. coli, boasting a median soluble yield of 247 mg per liter. The figure below illustrates the yield, offering a comparative analysis with designs generated by AlphaFold (referred to as Original Hallucination in this context). While it’s important to note that AlphaFold wasn’t specifically designed for protein engineering, making it an imperfect benchmark, the remarkable solubility and yield of the designs remain noteworthy.
Simple example
In its simplest form, ProteinMPNN takes an input PDB that contains a protein and outputs a FASTA file with a completely new sequence for the whole structure, one that it predicts would fold into that backbone conformation if expressed. It is standard practice to generate many designs, because the model can produce considerable diversity even for a small peptide. It is common to have ProteinMPNN output anywhere from 50 to 50,000 sequences.
lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50
input.pdb
- Any file name that comes after ‘protein-mpnn’ is expected to contain the protein structure that will be used to generate new sequences. This can be any name followed by .pdb.
--n-mpnn-designs 50
- Flag that sets the number of output sequences to 50
There will be 3 outputs from this run:
inputs
- A directory containing the PDB file used as input, which will be named pdb.pdb
JOB_INFO.txt
- Details of how the job was run, including the name of the input PDB, number of repeats, sampling temperature, GPU type that ran the job, and positions in the PDB that were fixed (not allowed to change sequence)
designed_sequences.fasta
- The output sequences, plus information about the run that helps you judge how much confidence to place in the designs. The FASTA headers contain the following fields:
score
- A measure of how good the sequence is expected to be for the protein conformation. Lower is better.
- It is averaged over the residues that were designed.
- The input sequence and the designed sequences are both scored to help indicate how much better the designs are predicted to be.
- In machine learning terms, it is a negative log probability of the sampled amino acids and can be thought of as a loss function.
global_score
- The same as the score, but averaged over all residues in all chains, including fixed ones
fixed_chains
- Chains that were not designed (fixed)
designed_chains
- Chains that were redesigned
model_name
- ProteinMPNN model version that was used to generate the results
git_hash
- GitHub version of the code that was used to generate the outputs
seed
- A random number that is generated every run to create randomness for design
T=0.1
- T is the temperature. Increasing it will create more diversity in the sequence output but is associated with lower sequence recovery (likelihood of generating native sequences).
sample
- Output designs are numbered (1, 2, 3, etc.)
seq_recovery
- Percent identity to the input sequence
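As a rough illustration of the two headline metrics, the Python sketch below computes a score as the mean negative log probability of the sampled amino acids, and sequence recovery as the fraction of positions identical to the input. The per-residue probabilities and sequences are invented for the example (in practice the probabilities come from the model), and the function names are hypothetical.

```python
import math

def mean_negative_log_prob(probabilities):
    """Average of -log(p) over residues; lower means the model is more confident."""
    return -sum(math.log(p) for p in probabilities) / len(probabilities)

def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("Sequences must have the same length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Invented probabilities the model might assign to each sampled residue.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.3, 0.2, 0.4, 0.25]

print(mean_negative_log_prob(confident))  # smaller value: better score
print(mean_negative_log_prob(uncertain))  # larger value: worse score
print(sequence_recovery("MQIFVKTLTG", "MKIYVKTLDG"))
```

A design whose residues each receive a high probability ends up with a score close to zero, which is why lower scores are preferred.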
Create De Novo sequences with more diversity
By default, ProteinMPNN runs at a low temperature, which means that it will predict sequences that are closer to what it thinks the native sequence should be. You can allow it to diverge more by increasing the temperature from its default of 0.1. Common values range from 0.2 to 0.5, but values up to 1 can be used. In this example, we repeat the last command, but change the temperature to 0.25:
lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --sampling-temperature 0.25
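To build intuition for what the temperature does, the sketch below shows the general technique of temperature-scaled softmax sampling: raw scores are divided by T before being normalized into probabilities, so a low T concentrates probability on the top-scoring amino acid while a higher T flattens the distribution. The raw scores for the three amino acids are invented, and this is an illustration of the general idea, not the ProteinMPNN implementation itself.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three candidate amino acids at one position.
logits = {"A": 2.0, "G": 1.5, "S": 1.0}

for t in (0.1, 0.25, 1.0):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, {aa: round(p, 3) for aa, p in zip(logits, probs)})
```

At T=0.1 nearly all of the probability falls on the top-scoring residue, while at T=1.0 the alternatives are sampled far more often, which is where the extra sequence diversity comes from.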
Create De Novo sequences for part of a protein structure
ProteinMPNN can be run for one or more sections of an input structure. This is done by listing the positions whose sequence you wish to keep as found in the input PDB. The syntax for specifying those residues is {chain}{startres}-{endres} for each block to keep fixed, with a space between noncontinuous positions. Here is an example for a redesign of a PDB with chains A and B where everything is designed except positions 1-10, 22, 24, 26, and 220-250 in chain A and positions 30-60 in chain B.
lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --fixed-residue-positions="A1-10 A22 A24 A26 A220-250 B30-60"
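Before submitting, it can help to check which residues a specification actually covers. The following Python sketch expands a string in the {chain}{startres}-{endres} format into explicit (chain, residue) pairs. It assumes single-letter chain IDs and positive integer residue numbers without insertion codes, and the function name is illustrative rather than part of the lev tooling.

```python
import re

# One token: a chain letter, a start residue, and an optional "-end" residue.
TOKEN = re.compile(r"([A-Za-z])(\d+)(?:-(\d+))?")

def expand_fixed_positions(spec):
    """Expand e.g. "A1-3 B7" into [("A", 1), ("A", 2), ("A", 3), ("B", 7)]."""
    positions = []
    for token in spec.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"Unrecognized position token: {token!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) is not None else start
        if end < start:
            raise ValueError(f"Range end precedes start in token: {token!r}")
        positions.extend((chain, number) for number in range(start, end + 1))
    return positions

spec = "A1-10 A22 A24 A26 A220-250 B30-60"
fixed = expand_fixed_positions(spec)
print(len(fixed), "residues will be kept fixed")
print(fixed[:3], "...", fixed[-1])
```

Counting the expanded positions is a quick sanity check that a range such as A220-250 covers the residues you intended before you spend compute on a large design run.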