Fixing the Flaws in AlphaFold’s Interface Scoring: Meet Dunbrack’s ipSAE
Predicting protein-protein interactions
Since AlphaFold2 was published, one of its major applications has been predicting the structures of protein-protein complexes. Taking the sequence of each chain in a complex as input, AlphaFold2 (AF2) and, more recently, AlphaFold3 (AF3) draw on their training on protein-protein interfaces to output complex structures. These structures can be used to analyze a protein-protein interaction (PPI), generate structure-based hypotheses, and improve properties like binding affinity through rational or AI-driven redesign. You can even use AF2 and AF3 to predict completely de novo multimer structures, where none of the chains are highly homologous to proteins we’ve seen in nature.
A structure by itself, however, is only half the picture when it comes to prediction. Protein modeling scientists also need the model to report whether it has produced a reliable structure with a high likelihood of matching one characterized through X-ray crystallography, NMR, or cryo-EM, or whether the structure is disordered, in which case no single snapshot can be relied upon for structural insights. When a protein-protein structure is predicted with high confidence, it is also more likely that the interaction is observable.
The most commonly used reporter of the model’s interaction confidence is the interface predicted template-modeling score, or ipTM. This metric is generated by the prediction model at the same time as the structure is being refined. Along with predicted aligned error (PAE), a residue pair-based metric, and the residue-specific predicted local distance difference test (pLDDT), it indicates high or low confidence in a predicted structure.
The problem with ipTM
In a February 2025 preprint, Roland Dunbrack Jr. explains ipTM’s failings. The problem with ipTM is that it can give misleading scores when full-length protein sequences are used, especially when those sequences include disordered regions or domains that do not participate in the interaction. Although the predicted interface contacts may remain unchanged, ipTM scores can vary significantly depending on the input sequence constructs. This makes ipTM less reliable for assessing the quality of predicted interfaces in real-world scenarios where the exact interacting domains are not known in advance.
The problem originates from the fact that benchmark datasets used during training and evaluation were typically constructed from PDB structures, which are truncated, ordered constructs that exclude disordered or non-interacting regions. Because ipTM was developed and validated using these cleaner, more ordered datasets, it tends to perform well only under those conditions. Therefore, when applied to full-length sequences from UniProt, which are more reflective of real biological contexts, ipTM becomes less predictive.
For a protein-protein complex, the ipTM score is built from TM-score-style alignment scores between residue pairs across the two chains. Each residue in one chain serves in turn as the alignment reference, its scores are averaged over all residues in the partner chain, and the best-scoring reference residue sets the final value. That average still includes pairs between ordered and disordered regions, whose scores are usually very low. As a result, disordered or non-interacting parts, like mobile domains, can drag down the overall ipTM score, even if the interacting regions are predicted well.
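The dilution effect can be illustrated with a toy calculation. The sketch below is not AlphaFold's implementation: it substitutes a single PAE value per residue pair for the binned error probabilities the model actually uses, and the function names (`tm_d0`, `iptm_like`) and the PAE values of 4 Å and 30 Å are assumptions chosen for illustration. The distance scale follows the standard TM-score form d0 = 1.24 (L - 15)^(1/3) - 1.8.

```python
import numpy as np

def tm_d0(n_residues):
    """TM-score distance scale in angstroms for a system of n_residues.
    The length is clamped from below (assumed here at 19) to keep d0 positive."""
    length = max(float(n_residues), 19.0)
    return 1.24 * (length - 15.0) ** (1.0 / 3.0) - 1.8

def iptm_like(pae_ab, n_total):
    """Toy ipTM-style score for chain A scored against chain B.

    pae_ab[i, j] is the predicted error (angstroms) of residue j in chain B
    when the structure is aligned on residue i in chain A.
    n_total is the total residue count of the complex.
    """
    scale = tm_d0(n_total)
    per_pair = 1.0 / (1.0 + (pae_ab / scale) ** 2)  # TM-score term per pair
    per_reference = per_pair.mean(axis=1)           # average over ALL of chain B
    return float(per_reference.max())               # best reference residue

# Chain A: 80 residues. Chain B: 100 well-predicted interface-domain residues.
pae_truncated = np.full((80, 100), 4.0)
# Same interface, but chain B now carries a 200-residue disordered tail.
pae_full = np.hstack([pae_truncated, np.full((80, 200), 30.0)])

score_truncated = iptm_like(pae_truncated, n_total=180)
score_full = iptm_like(pae_full, n_total=380)
print(f"truncated construct:   {score_truncated:.2f}")
print(f"full-length construct: {score_full:.2f}")
```

The interface block of the PAE matrix is identical in both cases, yet the full-length score comes out lower because the 200 high-error pairs are averaged in alongside the 100 good ones.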
How ipSAE solves the problem
Interaction prediction Score from Aligned Errors (ipSAE) solves the problem with ipTM by narrowing the focus to high-confidence interface regions and avoiding contributions from disordered or non-interacting parts of the protein. It does this in three key ways:
1. Only residue pairs with low predicted aligned error (PAE) are included in the calculation, filtering out poor-quality alignments.
2. The length normalization in the TM-score formula is adjusted to consider only these high-confidence residues, rather than the full protein length.
3. PAE values themselves are used directly to compute residue-residue alignment scores, rather than relying on probability distributions.
These changes make ipSAE more accurate for evaluating interactions in complex or full-length protein inputs, especially when disordered regions are present. It also works with standard AlphaFold outputs, requiring no code modifications. In benchmarks, ipSAE better distinguishes true from false complexes than the original ipTM score.
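The three changes listed above can be sketched in a few lines of NumPy. This is a simplified reading of the method, not Dunbrack's reference script: the 10 Å PAE cutoff, the lower clamps in `d0`, the choice to report the larger of the two chain directions, and all function names are assumptions made for this example.

```python
import numpy as np

def d0(n_residues, floor=1.0):
    """TM-score distance scale; the clamp values here are assumptions."""
    length = max(float(n_residues), 27.0)
    return max(floor, 1.24 * (length - 15.0) ** (1.0 / 3.0) - 1.8)

def ipsae_directional(pae_ab, pae_cutoff=10.0):
    """ipSAE-style score for chain A scored against chain B.

    pae_ab[i, j] is the PAE (angstroms) of residue j in chain B when
    aligned on residue i in chain A.
    """
    best = 0.0
    for row in pae_ab:
        kept = row[row < pae_cutoff]       # 1. keep only low-PAE pairs
        if kept.size == 0:
            continue                       # this residue sees no confident partner
        scale = d0(kept.size)              # 2. normalize by kept residues only
        score = np.mean(1.0 / (1.0 + (kept / scale) ** 2))  # 3. PAE used directly
        best = max(best, float(score))
    return best

def ipsae(pae_ab, pae_ba, pae_cutoff=10.0):
    """Report the larger of the two directional scores."""
    return max(ipsae_directional(pae_ab, pae_cutoff),
               ipsae_directional(pae_ba, pae_cutoff))

# Interface domain only: 80 x 100 residue pairs, all at 4 angstroms PAE.
pae_truncated = np.full((80, 100), 4.0)
# Same interface plus a 200-residue disordered tail at 30 angstroms PAE.
pae_full = np.hstack([pae_truncated, np.full((80, 200), 30.0)])

same = np.isclose(ipsae_directional(pae_truncated),
                  ipsae_directional(pae_full))
print("score unchanged by disordered tail:", bool(same))
```

Because the high-PAE pairs are dropped before both the averaging and the length normalization, the disordered tail has no effect on the score in this toy case.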
Using ipSAE in Levitate software
The ipSAE metric is now reported in Bench (our web-based GUI) and Engine (our command-line APIs) results for multimer structure prediction. Higher ipSAE values suggest a predicted interaction, with a threshold of > 0.6 commonly used to distinguish likely binders from non-binders. Many true interactions show maximum ipSAE values closer to 0.8. Dunbrack and others have shown that ipSAE can effectively differentiate true from false interacting pairs, even when proteins contain significant disorder or non-interacting accessory domains. Further benchmarking is still needed to confirm that ipSAE can also reliably rank the structural accuracy of predicted complex models.
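As a minimal sketch of applying that threshold to a batch of results, assuming the scores have already been collected into a plain dictionary (the pair names and values below are made-up placeholders, not real predictions or Levitate output):

```python
# Placeholder ipSAE values for illustration only; not real predictions.
ipsae_by_pair = {"pair_1": 0.82, "pair_2": 0.15, "pair_3": 0.61}

IPSAE_THRESHOLD = 0.6  # cutoff quoted in the text above

# Keep pairs scoring above the threshold, highest score first.
likely_binders = sorted(
    (name for name, score in ipsae_by_pair.items() if score > IPSAE_THRESHOLD),
    key=ipsae_by_pair.get,
    reverse=True,
)
print(likely_binders)
```

A pair that passes the cutoff is a candidate interaction, not a validated structure; as noted above, ranking structural accuracy by ipSAE still needs further benchmarking.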
References
Roland L. Dunbrack Jr., “Rēs ipSAE loquunt: What’s wrong with AlphaFold’s ipTM score and how to fix it” bioRxiv (2025). doi: https://doi.org/10.1101/2025.02.10.637595