AlphaFold Initial Guess

Enhancing De Novo Binder Design with Initial Guess

The field of protein binder design has been revolutionized by machine learning methods like RFDiffusion for de novo protein backbone generation and ProteinMPNN for sequence assignment. However, despite these advancements, success rates remain limited. Most binder designs fail for one of these reasons:

  1. Type I Error: The structure fails to fold into the proper monomeric state.
  2. Type II Error: The structure folds into the proper monomer but fails to bind the target properly.

To filter out poor designs, we have a broad range of filtering tools. Our newest addition, Initial Guess, improves filtering of Type II errors and is discussed in Step 5 of this walkthrough of our protocols.

Our Pipeline: From Generation to Precise Filtering

Our de novo binder design protocols go far beyond a simple automated implementation of RFDiffusion and ProteinMPNN. They offer a comprehensive suite of filtering and optimization tools designed to maximize the chances of identifying effective binders. These protocols are available by subscription, giving users access to a fully integrated workflow that includes our latest addition, Initial Guess (Step 5), for enhanced precision. Here’s an overview of how our complete workflow is typically executed:

Step 1: RFDiffusion for Backbone Generation

  • Start with RFDiffusion, which generates initial protein backbones designed to bind a target protein. This method serves as the foundation for the binder design process.
  • An example command to run RFDiffusion looks like this:
    lev engine submit rf-diffusion input-template.pdb --n-rfdiffusion-designs 23 --rfdiffusion-contigs '[A1-250/0 25-25]'
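
The contig string in the command above appears to follow RFDiffusion's usual binder-design convention: a target segment (chain A, residues 1-250), a chain break (/0), then a length range for the new binder (25-25). Assuming the flag accepts that syntax unchanged, a small helper such as the hypothetical binder_contig below can build the string for other targets. The function name and its parameters are illustrative and are not part of the lev CLI.

```python
def binder_contig(target_chain, first_res, last_res, min_len, max_len=None):
    """Build an RFDiffusion-style contig: target segment, chain break, binder length range.

    Hypothetical helper for illustration; assumes the standard
    '[<chain><start>-<end>/0 <min>-<max>]' binder-design layout.
    """
    if max_len is None:
        max_len = min_len  # fixed-length binder
    if first_res > last_res:
        raise ValueError("first_res must not exceed last_res")
    if min_len < 1 or min_len > max_len:
        raise ValueError("binder length range is invalid")
    return f"[{target_chain}{first_res}-{last_res}/0 {min_len}-{max_len}]"


# Reproduces the contig used in the example command.
print(binder_contig("A", 1, 250, 25))
```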
    

Step 2: Compactness Check with Radius of Gyration

  • Apply a Radius of Gyration (Rg) filter to ensure that the generated binders are globular and compact, removing designs unlikely to be stable or functional.
  • Starting from a directory containing all the PDB files from Step 1, you can run the radius of gyration calculation with the following command:
    lev engine submit radius-of-gyration *.pdb --chain A
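
For reference, the radius of gyration is the root-mean-square distance of a set of atoms from their centroid. The sketch below computes an unweighted Rg over the Cα atoms of one chain, read from PDB-format text. The function names, the Cα-only selection, and the lack of mass weighting are assumptions made for illustration; the radius-of-gyration tool may compute the value differently.

```python
import math


def ca_coords(pdb_text, chain="A"):
    """Collect C-alpha coordinates for one chain from PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB fields: atom name 13-16, chain ID 22, x/y/z 31-54.
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[21] == chain):
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def radius_of_gyration(coords):
    """Unweighted Rg: RMS distance of the points from their centroid."""
    if not coords:
        raise ValueError("no coordinates given")
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    sq_dist = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                  for x, y, z in coords)
    return math.sqrt(sq_dist / n)
```

Calling radius_of_gyration(ca_coords(open("design.pdb").read(), "A")) would give the Rg in angstroms for chain A of a hypothetical design.pdb; a lower value at a given chain length indicates a more compact fold.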
    

Step 3: Sequence Assignment with ProteinMPNN

  • Run ProteinMPNN to assign sequences to the RFDiffusion-generated backbones after filtering backbones with poor radius of gyration scores.
  • For each backbone the command might look like:
    lev engine submit protein-mpnn input.pdb --n-mpnn-designs 5 --fixed-residue-positions "B1-250" 
    
    • The fixed residue positions should cover every residue of the target structure, so that only the binder is redesigned.

Step 4: Structure Validation with AlphaFold2 (Single Sequence Mode)

  • AlphaFold2 in Single Sequence mode generates folded structures based on the sequences from ProteinMPNN.
  • Using the AF2 confidence score (pLDDT, on a 0 to 1 scale), only keep models with a pLDDT of 0.8 or higher.
  • Align these structures to the original RFDiffusion model and keep models within 2 angstroms RMSD.
  • This will address Type I errors, ensuring designs are likely to fold correctly.
  • To run AF2 in Single Sequence mode on all the sequences generated by ProteinMPNN, run the following command for each backbone:
    lev engine submit ai-folding *.fasta --mode singleseq --ai-tool alphafold --af-n-recycles 2 --reference-structure reference.pdb
    
    • The reference structure is the backbone (de novo binder only) generated by RFDiffusion. If it is provided, all results are aligned to this structure and the RMSD is calculated for you.
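
The two Step 4 cutoffs can be applied together once the confidence score and RMSD are collected for each prediction. The sketch below is a minimal illustration; the function name, the dictionary keys, and the scores are made up, and real values would come from the AF2 run.

```python
def passes_fold_filter(confidence, rmsd, min_confidence=0.8, max_rmsd=2.0):
    """Return True when a prediction is confident and close to the design."""
    return confidence >= min_confidence and rmsd <= max_rmsd


# Made-up scores for illustration only.
predictions = [
    {"name": "design_1", "confidence": 0.91, "rmsd": 1.2},
    {"name": "design_2", "confidence": 0.74, "rmsd": 0.9},  # low confidence
    {"name": "design_3", "confidence": 0.88, "rmsd": 3.4},  # drifted from backbone
]

kept = [p["name"] for p in predictions
        if passes_fold_filter(p["confidence"], p["rmsd"])]
print(kept)  # ['design_1']
```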

Step 5: Introducing Initial Guess for Improved Binding Prediction

AlphaFold2 Initial Guess is a modification of AlphaFold2 that uses the ProteinMPNN sequence and the RFDiffusion backbone to predict a bound complex model.

It works by supplying the designed structure to AlphaFold2 at the prediction step, which biases the prediction towards the designed binding mode. This significantly improves the success rate of the multimer prediction, and these predictions are in turn highly valuable for filtering out designs with Type II errors.

Ultimately, a balance must be struck in how strongly the prediction is biased towards the input structure. With too little bias, the multimer prediction produces false negatives; with too much, it produces too many false positives.

In their paper describing the method, researchers from the Institute for Protein Design found that initializing the AF2 pair representation with an encoding of the target-binder structure struck this balance and significantly increased the number of complexes that could be predicted accurately. In particular, they found that “…the average pAE of interchain residue pairs (pAE_interaction) was extremely effective in identifying the experimentally confirmed binders.” They also found that confident predictions, in particular those with a pAE_interaction below 10, had very high success frequencies.

  • Run AF2 Initial Guess to get a more accurate bound model.
  • Use the pAE of interchain residue pairs (pAE_interaction) and keep models with a pAE_interaction below 10, which indicates that the expected binding mode is predicted confidently.
  • This step is crucial in addressing potential Type II errors, where the binder may fold correctly but fail to interact with the target protein.
  • AlphaFold2 Initial Guess can be run with the following command:
    lev engine submit ai-folding input.fasta --mode initial-guess --ai-tool alphafold --reference-structure reference.pdb
    
    • The reference structure in this context should be the model containing both the target and the de novo binder.
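
To make the pAE_interaction filter concrete, the sketch below averages the predicted aligned error over all interchain residue pairs in both directions (binder to target and target to binder). It assumes the pAE matrix is a square list of lists with the binder residues first, followed by the target residues; the function name and the toy matrix are illustrative only.

```python
def pae_interaction(pae, binder_len):
    """Mean pAE over interchain residue pairs, counted in both directions.

    Assumes residues 0..binder_len-1 are the binder and the rest are the target.
    """
    n = len(pae)
    if not 0 < binder_len < n:
        raise ValueError("binder_len must split the matrix into two chains")
    total = 0.0
    count = 0
    for i in range(binder_len):          # binder residues
        for j in range(binder_len, n):   # target residues
            total += pae[i][j] + pae[j][i]
            count += 2
    return total / count


# Toy 3x3 matrix: one binder residue followed by two target residues.
toy_pae = [
    [0.0, 4.0, 6.0],
    [8.0, 0.0, 1.0],
    [10.0, 1.0, 0.0],
]
score = pae_interaction(toy_pae, binder_len=1)
print(score, score < 10)  # 7.0 True
```

Intrachain entries (here the 1.0 values between the two target residues) do not contribute, so the score reflects only how confidently the two chains are placed relative to each other.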

Step 6: Model Optimization with Relax and Loop Modeling

  • Apply Relax and Loop Model protocols to refine the models from AlphaFold2 and Initial Guess structures, optimizing them for accuracy and stability.
  • The output structures from these can be ranked by Rosetta Energy.
  • You can compare Rosetta Energy between models that share the same RFDiffusion backbone but received different sequences from ProteinMPNN. The sequences with the lowest Rosetta Energy can be expected to be the most stable. Rosetta Energy is not comparable between designs generated from RFDiffusion backbones of different lengths.
  • Relax can be run with the command:
    lev engine submit relax input.pdb
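
Because Rosetta Energy is only comparable within a backbone, ranking is best done per backbone rather than across the whole set. The sketch below groups relaxed models by backbone and sorts each group from lowest to highest energy. The function name, the record layout, and the energy values are made up for illustration.

```python
from collections import defaultdict


def rank_within_backbone(records):
    """Group (backbone_id, sequence_id, energy) records by backbone and
    return each group's sequence IDs ordered from lowest to highest energy."""
    groups = defaultdict(list)
    for backbone_id, sequence_id, energy in records:
        groups[backbone_id].append((energy, sequence_id))
    return {backbone_id: [seq for _, seq in sorted(items)]
            for backbone_id, items in groups.items()}


# Made-up energies for illustration only.
records = [
    ("backbone_1", "seq_a", -250.0),
    ("backbone_1", "seq_b", -310.5),
    ("backbone_2", "seq_a", -120.0),
]
print(rank_within_backbone(records))
# {'backbone_1': ['seq_b', 'seq_a'], 'backbone_2': ['seq_a']}
```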
    

Step 7: Interface Analysis with Rosetta Interface Analyzer

  • Using Rosetta Interface Analyzer, we evaluate key metrics such as binding energy and interface surface area to confirm the potential efficacy of the binder-target interaction.
  • The interface analyzer can be run with the command:
    lev engine submit interface-analyzer input.pdb --mode analyzer --interface1 A --interface2 B
    

Step 8: Confidence-Based Filtering for Final Selection

  • Throughout the process, we filter using confidence scores from both AlphaFold2 and Initial Guess, ensuring that only the most promising binder candidates move forward.

Conclusion: Initial Guess - The Missing Piece in Effective Binder Design

By adding Initial Guess to our pipeline, we’ve significantly enhanced our ability to filter out ineffective designs and identify the most promising binders. This latest addition ensures that our de novo binder design pipeline is not only comprehensive but also more accurate and efficient, making it easier to pinpoint designs worth the time and expense of experimental testing. This improvement marks an important step forward in the ongoing evolution of machine learning-driven protein design.
