Subsample MSA API

The Subsample MSA API subsamples a multiple sequence alignment (MSA) by clustering its sequences by similarity with DBSCAN. The method scans for sequence clusters across distance values and then selects the optimum distance value to use for clustering.

The API returns an MSA for each cluster generated, along with tables of cluster metadata and sequences. It is based on the tool AF-Cluster.
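The scan-then-select idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the Hamming distance on equal-length aligned strings, the min_samples value, and the rule "pick the distance that yields the most clusters" are all assumptions made for the example, since this page does not say how the optimum is chosen. The names hamming, dbscan, and scan_eps are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatched positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(seqs, eps, min_samples=3):
    """Plain DBSCAN over aligned sequences; returns one label per sequence (-1 = noise)."""
    n = len(seqs)
    # Neighbourhoods include the point itself, so min_samples counts it too.
    neighbors = [
        [j for j in range(n) if hamming(seqs[i], seqs[j]) <= eps] for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
        cluster += 1
    return labels


def scan_eps(seqs, eps_values, min_samples=3):
    """Try each distance value; keep the first one giving the most clusters (assumed criterion)."""
    best_eps, best_count = None, -1
    for eps in eps_values:
        labels = dbscan(seqs, eps, min_samples)
        count = len(set(labels) - {-1})
        if count > best_count:
            best_eps, best_count = eps, count
    return best_eps, best_count


if __name__ == "__main__":
    toy = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "CCCG", "GTGT"]
    print(scan_eps(toy, range(0, 5)))
```

For real alignments a library implementation such as scikit-learn's DBSCAN would be the practical choice; the quadratic neighbour search here is only meant to keep the example self-contained.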

Subsampling an MSA via clustering has been shown to produce more diverse structure predictions than using the full MSA or a random sample of sequences.

The API also has a cutoff option, which sets the threshold for filtering sequences by the percentage of their aligned positions that are gaps. The gap percentage is calculated for each sequence, and any sequence whose gap percentage is above the cutoff is ignored in clustering. The default cutoff is 25%; raise it to include more sequences during subsampling, or lower it to include fewer.
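The gap filter can be illustrated with a short Python sketch. It assumes each record is a (name, aligned_sequence) pair, that "-" is the only gap character, and that a sequence exactly at the cutoff is kept (the text above only says sequences above the cutoff are ignored). The function names are hypothetical and are not part of the API.

```python
def gap_fraction(aligned_seq, gap_chars="-"):
    """Fraction of alignment positions in one sequence that are gaps."""
    if not aligned_seq:
        return 0.0
    return sum(c in gap_chars for c in aligned_seq) / len(aligned_seq)


def filter_by_gaps(records, cutoff=0.25):
    """Keep (name, sequence) records whose gap fraction does not exceed the cutoff."""
    return [(name, seq) for name, seq in records if gap_fraction(seq) <= cutoff]


if __name__ == "__main__":
    records = [
        ("query", "ACDEFGHK"),  # 0 of 8 positions are gaps
        ("hit_a", "AC--FGHK"),  # 2 of 8 positions are gaps
        ("hit_b", "A---FGHK"),  # 3 of 8 positions are gaps
    ]
    for cutoff in (0.25, 0.4):
        kept = [name for name, _ in filter_by_gaps(records, cutoff)]
        print(cutoff, kept)
```

With these toy records, hit_b (37.5% gaps) is dropped at the default cutoff and only passes once the cutoff is raised above 0.375.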

Command Line Interface

Examples

Subsample input.a3m using a filtering cutoff of 30%

lev engine submit subsample-msa \
    --msa input.a3m \
    --cutoff 0.3

Flags

--cutoff (float) (Optional)
- Gap-fraction cutoff for filtering sequences before clustering (default 0.25)
--msa (str) (Required)
- Path to input MSA file (.a3m or .fasta)
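To inspect an input file before submitting it, a small reader along these lines may help. It is a sketch under stated assumptions: it treats both formats as ">"-headed records, skips lines starting with "#" (some A3M files begin with such a line), and by default drops lowercase letters, which in A3M mark insertions relative to the query. The function name read_msa and the strip_insertions parameter are hypothetical, and this page does not describe how the service itself parses the file.

```python
import string

# Translation table that deletes every lowercase ASCII letter.
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def read_msa(path, strip_insertions=True):
    """Read a .a3m or .fasta file into a list of (header, sequence) pairs."""
    records = []
    name, chunks = None, []

    def flush():
        # Store the record collected so far, if there is one.
        if name is not None:
            seq = "".join(chunks)
            if strip_insertions:
                seq = seq.translate(_DROP_LOWERCASE)
            records.append((name, seq))

    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # blank line or comment/header line
            if line.startswith(">"):
                flush()
                name, chunks = line[1:], []
            else:
                chunks.append(line)  # sequences may span several lines
    flush()
    return records
```

Set strip_insertions=False for an aligned FASTA file in which lowercase letters are ordinary residues.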

Outputs

All results are returned in a single folder: one MSA per cluster, a log file, and two TSVs. One TSV holds the cluster metadata; the other holds the cluster assignment for each sequence ID.
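Once the results are downloaded, the assignments TSV can be grouped by cluster with the Python standard library. The sketch below is generic on purpose: this page does not give the output file names or column headers, so the caller passes them in, and the names used in the example call (assignments.tsv, cluster_id, sequence_id) are placeholders to replace with whatever the real files contain.

```python
import csv
from collections import defaultdict


def group_by_column(tsv_path, key_column, value_column):
    """Map each distinct value of key_column to the list of value_column entries."""
    groups = defaultdict(list)
    with open(tsv_path, newline="") as handle:
        # DictReader takes column names from the first row of the file.
        for row in csv.DictReader(handle, delimiter="\t"):
            groups[row[key_column]].append(row[value_column])
    return dict(groups)


if __name__ == "__main__":
    # Placeholder file and column names; check the actual output before use.
    clusters = group_by_column("assignments.tsv", "cluster_id", "sequence_id")
    for cluster, members in clusters.items():
        print(cluster, len(members))
```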

References

Predicting multiple conformations via sequence clustering and AlphaFold2