Tolerance Identification API

For each N-mer in a given protein sequence, the Tolerance Identification API finds the top N closest N-mers (by BLOSUM62 score) in the human proteome.
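As a rough illustration of the idea, the sketch below slides a window over the query to enumerate its N-mers and ranks candidate N-mers by a per-residue score. It is not the service's implementation: `nmers`, `identity_score`, and `top_matches` are hypothetical names, and `identity_score` is only a simple stand-in for a BLOSUM62 lookup, which could be passed in as `pair_score` instead.

```python
import heapq


def nmers(sequence, n):
    """Yield (resnum, nmer) pairs for every window of length n; resnum is 1-indexed."""
    for start in range(len(sequence) - n + 1):
        yield start + 1, sequence[start:start + n]


def identity_score(a, b):
    """Stand-in for a BLOSUM62 lookup (assumption): 1 for identical residues, else 0."""
    return 1 if a == b else 0


def top_matches(query_nmer, candidates, top_n, pair_score=identity_score):
    """Return up to top_n candidates as (rank, score, candidate), with rank 0 = best."""
    scored = (
        (sum(pair_score(q, c) for q, c in zip(query_nmer, cand)), cand)
        for cand in candidates
        if len(cand) == len(query_nmer)  # only compare N-mers of equal length
    )
    best = heapq.nlargest(top_n, scored)
    return [(rank, score, cand) for rank, (score, cand) in enumerate(best)]
```

A 20-residue query such as the one used in the examples contains 12 windows of length 9 and 6 windows of length 15, so the number of result rows grows with both the query length and the number of N-mer sizes requested.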

Command Line Interface

Examples

Get the top 10 closest 9-mers:

lev engine submit tolerance-identification NLYIQWLKDGGPSSGRPPPS \
    --top-n 10

Get the top 5 closest 9- and 15-mers:

lev engine submit tolerance-identification NLYIQWLKDGGPSSGRPPPS \
    --top-n 5 \
    --nmer-sizes 9,15

Flags

  • --sequence (str) (Required)
    • Input protein sequence to compare against
  • --top-n (int) (Default: 20)
    • Collect the top N matches
  • --nmer-sizes (str) (Default: 9)
    • N-mer size(s) to run on, as a comma-separated string (e.g. 9,10,11,12)
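When submissions are scripted, it can help to assemble the command line programmatically. The sketch below is one way to do that, using only the command and flags shown in this section; `build_command` is a hypothetical helper, and it passes the sequence positionally as the examples do (whether the `--sequence` form is also accepted is not confirmed here). It only builds and prints the command and does not run it.

```python
import shlex


def build_command(sequence, top_n=20, nmer_sizes=(9,)):
    """Return the argv list for a tolerance-identification submission."""
    return [
        "lev", "engine", "submit", "tolerance-identification",
        sequence,  # passed positionally, mirroring the documented examples
        "--top-n", str(top_n),
        "--nmer-sizes", ",".join(str(size) for size in nmer_sizes),
    ]


# Print a copy-pasteable command line; nothing is executed here.
print(shlex.join(build_command("NLYIQWLKDGGPSSGRPPPS", top_n=5, nmer_sizes=[9, 15])))
```

The resulting list could be handed to `subprocess.run` if the `lev` executable is available on the machine.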

Python Interface

Examples

Get the top 10 closest 9-mers:

from engine import EngineClient

# Create a client and authenticate before submitting jobs
client = EngineClient()
client.authorize()

client.submit_tolerance_identification(
    sequence="NLYIQWLKDGGPSSGRPPPS",
    top_n=10
)

Get the top 5 closest 9- and 15-mers:

# Reuses the authorized client from the previous example
client.submit_tolerance_identification(
    sequence="NLYIQWLKDGGPSSGRPPPS",
    top_n=5,
    nmer_sizes="9,15"
)

Parameters

  • sequence (str) (Required)
    • Input protein sequence to compare against
  • top_n (int) (Default: 20)
    • Collect the top N matches
  • nmer_sizes (str) (Default: 9)
    • N-mer size(s) to run on, as a comma-separated string (e.g. 9,10,11,12)
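Because `nmer_sizes` is a comma-separated string rather than a list, a small wrapper can be convenient when the sizes are held as integers. The sketch below is a hypothetical helper (`submit_for_sizes` is not part of the client); it relies only on the `submit_tolerance_identification` call shown above, and the validation is a client-side assumption rather than documented service behavior.

```python
def submit_for_sizes(client, sequence, sizes, top_n=20):
    """Submit one job covering several N-mer sizes given as integers."""
    # De-duplicate and sort so the request string is predictable.
    sizes = sorted(set(int(size) for size in sizes))
    if not sizes or sizes[0] < 1:
        raise ValueError("sizes must contain at least one positive integer")
    if sizes[-1] > len(sequence):
        raise ValueError("an N-mer cannot be longer than the query sequence")
    return client.submit_tolerance_identification(
        sequence=sequence,
        top_n=top_n,
        nmer_sizes=",".join(str(size) for size in sizes),  # e.g. "9,15"
    )
```

For example, `submit_for_sizes(client, "NLYIQWLKDGGPSSGRPPPS", [9, 15], top_n=5)` would send `nmer_sizes="9,15"`.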

Outputs

  • out.csv
    • CSV file containing the following columns
      • nmer_size - length of the N-mer
      • resnum - residue number (1-indexed) of the N-mer's position in the query sequence
      • query_seq - query sequence
      • matchrank - rank (0 = best, N = worst) among the top N closest (by BLOSUM62) N-mers to the query
      • matchscore - BLOSUM62 score of the match against the query sequence
      • matchseq - the matching sequence found in the human proteome
      • matchscore/max_score - matchscore divided by the score of a 100% identical match (normalized BLOSUM62 score)
  • out.json
    • JSON format of the out.csv
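A minimal sketch of post-processing the CSV output with the standard library follows. It assumes the header row uses exactly the column names listed above and that the numeric columns parse as numbers; `load_matches` and its `min_normalized` parameter are hypothetical names introduced for illustration.

```python
import csv


def load_matches(csv_lines, min_normalized=0.0):
    """Parse out.csv-style lines, keeping rows whose normalized score is high enough.

    csv_lines can be an open file or any iterable of text lines.
    """
    rows = []
    for row in csv.DictReader(csv_lines):
        normalized = float(row["matchscore/max_score"])
        if normalized < min_normalized:
            continue  # skip matches below the requested normalized score
        rows.append({
            "nmer_size": int(row["nmer_size"]),
            "resnum": int(row["resnum"]),
            "query_seq": row["query_seq"],
            "matchrank": int(row["matchrank"]),
            "matchscore": float(row["matchscore"]),
            "matchseq": row["matchseq"],
            "normalized": normalized,
        })
    return rows
```

With a downloaded result, this could be called as `load_matches(open("out.csv", newline=""), min_normalized=0.8)` to keep only the matches closest to the query.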

Notes

Running this protocol requires between 4 and 5 GB of memory per CPU.
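For capacity planning, the per-CPU figure can be turned into a rough total. The helper below is a hypothetical sketch that assumes memory use scales linearly with the number of CPUs, which this document does not state explicitly.

```python
def memory_range_gb(n_cpus, low_per_cpu=4, high_per_cpu=5):
    """Return the (low, high) estimated memory in GB for n_cpus CPUs."""
    # Assumes linear scaling from the 4-5 GB per CPU figure quoted above.
    return n_cpus * low_per_cpu, n_cpus * high_per_cpu
```

Under that assumption, a run on 8 CPUs would need roughly 32 to 40 GB.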

Input proteome

The input proteome file was taken from https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.abinitio.fa.gz
