ProteinMPNN API

The ProteinMPNN API provides an interface to the ProteinMPNN protein design tool. This tool takes as input a PDB file and rapidly generates new sequences predicted to fold to the backbone of the input PDB.

Examples

Predict a single new sequence for an input PDB

lev engine submit protein-mpnn 1ubq.pdb

Predict 1000 new sequences for an input PDB

lev engine submit protein-mpnn 1ubq.pdb --n-mpnn-designs 1000

Predict 1000 new sequences for an input PDB using an elevated temperature (default temperature is 0.1)

lev engine submit protein-mpnn 1ubq.pdb --n-mpnn-designs 1000 --sampling-temperature 0.2

Create De Novo sequences for part of a protein structure

lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --fixed-residue-positions="A1-10 A22 A24 A26 A220-250 B30-60"

Inputs

  • --pdb-file
    • The path to a PDB file containing the protein backbone you want to design sequences for
  • --n-mpnn-designs
    • The number of sequences to design, default is 1
  • --sampling-temperature
    • The sampling temperature of the model. Default is 0.1; higher values produce more variation in the output sequences
  • --gpu-type
    • The GPU to run the model on, default is t4. Set this to a100 if you are generating a very large number of sequences
  • --fixed-residue-positions
    • The residues that will not be designed, given as a space-separated list of positions in the format {chain}{startres}-{endres}, or {chain}{res} for a single residue
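
When many jobs are scripted, it can help to assemble the command line from the flags listed above instead of editing strings by hand. The following is a minimal sketch, not part of the tool: build_submit_command is a hypothetical helper, it passes the PDB path positionally as the examples on this page do, and it assumes --gpu-type takes its value as a separate argument like the other flags. It only builds the argument list and does not run anything.

```python
import shlex


def build_submit_command(pdb_path, n_designs=1, temperature=0.1,
                         gpu_type="t4", fixed_positions=None):
    """Build the argument list for a protein-mpnn submission (hypothetical helper).

    Defaults mirror the ones documented above: 1 design, temperature 0.1, t4 GPU.
    """
    argv = [
        "lev", "engine", "submit", "protein-mpnn", pdb_path,
        "--n-mpnn-designs", str(n_designs),
        "--sampling-temperature", str(temperature),
        "--gpu-type", gpu_type,  # assumption: value passed as a separate argument
    ]
    if fixed_positions:
        # Kept as a single argument so the spaces inside the list survive.
        argv.append(f"--fixed-residue-positions={fixed_positions}")
    return argv


if __name__ == "__main__":
    command = build_submit_command("1ubq.pdb", n_designs=1000, temperature=0.2)
    # shlex.join quotes any argument that contains spaces.
    print(shlex.join(command))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems with the space-separated position list.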

Outputs

  • designed_sequences.fasta
    • A FASTA file containing all designed sequences. The first record in the file is the native sequence of the protein in the PDB file. The headers of the FASTA file contain score and sequence recovery values for each designed sequence

In-depth

Introduction

ProteinMPNN is a deep-learning-based sequence design method. Given a protein or protein complex structure file, it generates new sequences for the entire structure or for selected sections. When redesigning a whole protein structure, it achieves a sequence recovery of 52.4%, compared with 32.9% for classic fixed-backbone Rosetta design, which relies on physics-based and statistical terms with Monte Carlo sampling. Wet-lab validation has shown successful expression of designs for various protein types, including monomers, oligomers, nanoparticles, active enzymes, and target-binding proteins.

Protein Message Passing Neural Network (ProteinMPNN) is trained to take a structure and predict which amino acid is most likely to occur at each position given the structural features. It is a significant improvement over Rosetta when you intend to redesign a large section or the entire sequence of a protein. This has long been a challenging goal because full redesign commonly results in insoluble protein.

In the initial study, researchers used ProteinMPNN to redesign 150 proteins. Of these, 73 expressed solubly in E. coli, with a median soluble yield of 247 mg per liter. The figure below shows the yield compared with designs generated by AlphaFold (referred to as Original Hallucination in this context). AlphaFold was not specifically designed for protein engineering, which makes it an imperfect benchmark, but the solubility and yield of the designs remain noteworthy.

ProteinMPNN benchmark

Simple example

In its simplest form, ProteinMPNN takes an input PDB containing a protein and outputs a FASTA file with a completely new sequence for the whole structure, one that it predicts would, if expressed, fold into a protein with that backbone conformation. It is standard practice to generate many designs, because the model can produce considerable diversity even for a small peptide. Users commonly have ProteinMPNN output anywhere from 50 to 50,000 sequences.

lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50

  • input.pdb

    • The file name that follows ‘protein-mpnn’ is expected to contain the protein structure used to generate new sequences. It can be any name ending in .pdb.
  • --n-mpnn-designs 50

    • Sets the number of output sequences to 50

There will be 3 outputs from this run:

  1. inputs is a directory containing the PDB file used as input, which will be named pdb.pdb.
  2. JOB_INFO.txt contains the details of how the job was run, including the name of the input PDB, number of repeats, sampling temperature, GPU type that ran the job, and positions in the PDB that were fixed (not allowed to change sequence).
  3. designed_sequences.fasta contains the output sequences and per-sequence information about the run that will help you gauge confidence in the designs. Here is an example:

Example containing contents of designed_sequences.fasta

  • score - This is a measure of how good the sequence is expected to be for the protein conformation. Lower is better.
    • It is averaged over the residues that were designed.
    • Both the input sequence and the designed sequences are scored, which indicates how much better the designs are predicted to be.
    • In machine learning terms, it is the negative log probability of the sampled amino acids and can be thought of as a loss function.
  • global_score - This is the same quantity as the score, but averaged over all residues in all chains, including fixed ones.
  • fixed_chains - chains that were not designed (fixed)
  • designed_chains - chains that were redesigned
  • model_name - ProteinMPNN model version that was used to generate results
  • git_hash - Git commit hash of the ProteinMPNN code that was used to generate outputs
  • seed - The random seed, generated on every run, that drives the randomness of the design
  • T=0.1 - T is the temperature. Increasing it will create more diversity in the sequence output but is associated with lower sequence recovery (likelihood of generating native sequences).
  • sample - Output designs are numbered (1, 2, 3, etc.).
  • seq_recovery - percent identity to input
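
The header fields listed above can be read programmatically to rank designs by score. The sketch below is illustrative only: it assumes each header is a comma-separated list of key=value pairs, that every sequence sits on a single line, and that only designed records carry a sample field. The names parse_header and rank_designs are hypothetical, and the demo records are made-up values, not real output.

```python
import re

# A value is either a bracketed list such as ['A', 'C'] or text up to the next comma.
_FIELD = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\s]+)")


def parse_header(header):
    """Return the key=value fields of one FASTA header line as a dict of strings."""
    return dict(_FIELD.findall(header.lstrip(">")))


def rank_designs(fasta_text):
    """Return (score, sequence) pairs for designed records, lowest score first."""
    records = []
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = parse_header(line)
        elif line and header is not None:
            # Assumption: the native record has no 'sample' field, so it is skipped.
            if "sample" in header:
                records.append((float(header["score"]), line))
            header = None
    return sorted(records)


if __name__ == "__main__":
    # Made-up records for illustration; real headers may differ in detail.
    demo = (
        ">input, score=1.50, global_score=1.50, fixed_chains=[], "
        "designed_chains=['A'], seed=1\nMQIF\n"
        ">T=0.1, sample=1, score=0.90, global_score=0.95, seq_recovery=0.50\nMKIF\n"
        ">T=0.1, sample=2, score=0.80, global_score=0.85, seq_recovery=0.25\nMRLF\n"
    )
    for score, sequence in rank_designs(demo):
        print(score, sequence)
```

Because lower scores are better, the first entry of the returned list is the design the model considers most compatible with the backbone.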

Create De Novo sequences with more diversity

By default, ProteinMPNN runs at a low temperature, which means that it predicts sequences close to what it considers the native sequence should be. You can let it diverge more by raising the temperature from its default of 0.1. Common values range from 0.2 to 0.5, but values up to 1 can be used. In this example, we repeat the last command but change the temperature to 0.25:

lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --sampling-temperature 0.25
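
To see why a higher temperature gives more diverse sequences, it helps to look at temperature-scaled sampling in general. The sketch below is a generic illustration, not ProteinMPNN's code: it assumes the model's per-position scores (logits) are divided by the temperature before a softmax, and the logits are made-up numbers for three candidate amino acids. It also shows the mean negative log probability, the quantity the score field is described as.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_negative_log_prob(sampled_probs):
    """Average of -log(p) over the probabilities of the amino acids that were sampled."""
    return -sum(math.log(p) for p in sampled_probs) / len(sampled_probs)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical scores for three amino acids at one position
    for temperature in (0.1, 0.25, 1.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability sits on the top-scoring amino acid, so repeated samples look alike; as the temperature rises the distribution flattens and less likely residues are sampled more often.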

Create De Novo sequences for part of a protein structure

ProteinMPNN can be run on one or more sections of an input structure. This is done by listing the positions whose sequence you wish to retain from the input PDB. The syntax for specifying those residues is {chain}{startres}-{endres} for each block to keep fixed, with a space between noncontiguous blocks. Here is an example of a redesign of a PDB with chains A and B in which everything is designed except positions 1-10, 22, 24, 26, and 220-250 in chain A and positions 30-60 in chain B.

lev engine submit protein-mpnn input.pdb --n-mpnn-designs 50 --fixed-residue-positions="A1-10 A22 A24 A26 A220-250 B30-60"
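
Before submitting a job, it can be useful to check that a position list means what you intend. The sketch below expands the syntax described above into explicit residue numbers per chain. It is a hypothetical helper, not part of the tool, and it assumes single-letter chain IDs and non-negative residue numbers without insertion codes.

```python
import re

# One block: a chain letter, a start residue, and an optional "-end" residue.
_BLOCK = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")


def expand_fixed_positions(spec):
    """Expand e.g. 'A1-10 A22 B30-60' into {'A': [1, ..., 10, 22], 'B': [30, ..., 60]}."""
    fixed = {}
    for token in spec.split():
        match = _BLOCK.match(token)
        if match is None:
            raise ValueError(f"unrecognised block: {token!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) is not None else start
        if end < start:
            raise ValueError(f"end before start in block: {token!r}")
        fixed.setdefault(chain, set()).update(range(start, end + 1))
    return {chain: sorted(numbers) for chain, numbers in fixed.items()}


if __name__ == "__main__":
    fixed = expand_fixed_positions("A1-10 A22 A24 A26 A220-250 B30-60")
    for chain, numbers in fixed.items():
        print(chain, len(numbers), "fixed residues")
```

Comparing the expanded positions with the residue numbering in the PDB file is a quick way to catch an off-by-one range or a wrong chain letter.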
