Wednesday, October 3, 2018

Python Sub Process Local Psi Blast PSSM Generation from FASTA in Directory using Uniref50 Database in Pycharm


The UniRef50 database can be downloaded from the NCBI FTP site, as well as from http://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref50/.

The Swiss-Prot database can be used for quick testing. It can be downloaded from http://ftp.ebi.ac.uk/pub/databases/swissprot/release/ or ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Download the tar.gz or fasta.gz files.

Download the latest version of PSI-BLAST (part of the BLAST+ package) from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/.

Generating each PSSM takes a long time and a lot of memory. With 8 threads and 8 GB of memory, it produces only 4 to 5 PSSMs in 10 minutes. With UniRef50, CPU and RAM usage reach 100% because of the size of the database. Since the whole database cannot be read into memory at once, the disk is read constantly: almost 100% utilization on an HDD and almost 50% on an SSD. There may be some overhead from Python, so it might be a little faster to use something other than Python, such as C++.
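
As a rough back-of-the-envelope check, the observed rate can be turned into a total runtime estimate. This is only a sketch: the function name and the 1000-protein data set size are illustrative assumptions, not measurements.

```python
def estimate_hours(num_proteins, pssms_per_10_min):
    """Estimate wall-clock hours from an observed PSSM generation rate."""
    minutes = num_proteins / pssms_per_10_min * 10
    return minutes / 60


# Hypothetical data set of 1000 proteins at 4 to 5 PSSMs per 10 minutes
slow = estimate_hours(1000, 4)
fast = estimate_hours(1000, 5)
print('Roughly %.0f to %.0f hours' % (fast, slow))
```

At that rate, even a modest data set keeps the machine busy for more than a day, so it is worth testing the whole pipeline on a small database first.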

I do not know whether the makeblastdb command works directly with the archives above, so I first extracted the FASTA file from each archive and then built local BLAST databases from UniRef50 and Swiss-Prot. To create the local database, run makeblastdb -in uniref50.fasta -dbtype prot -out uniref50 on the command line from the bin folder of the BLAST+ installation. The database can be given a different name or placed in a separate folder by changing the -out parameter.
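
The extraction and database creation steps can also be scripted from Python. The following is a minimal sketch, assuming a plain .gz archive (a tar.gz would need the tarfile module instead) and a makeblastdb executable that is on the PATH. The helper names and file names are hypothetical.

```python
import gzip
import shutil
import subprocess


def extract_gz(gz_path, fasta_path):
    """Decompress a .gz archive to a plain FASTA file without loading it all into memory."""
    with gzip.open(gz_path, 'rb') as src, open(fasta_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return fasta_path


def makeblastdb_command(fasta_path, db_name, makeblastdb='makeblastdb'):
    """Build the makeblastdb call as an argument list for subprocess."""
    return [makeblastdb, '-in', fasta_path, '-dbtype', 'prot', '-out', db_name]


def build_database(gz_path, fasta_path, db_name):
    """Extract the archive, then create the local protein database."""
    extract_gz(gz_path, fasta_path)
    return subprocess.call(makeblastdb_command(fasta_path, db_name))


# Example usage (file names are assumptions):
# build_database('uniref50.fasta.gz', 'uniref50.fasta', 'uniref50')
```

Passing the arguments as a list avoids quoting problems when the installation path contains spaces, as it does under C:\Program Files.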

I mistakenly named my UniRef50 database uniref50.fasta, which is why that name appears in the code. The database name is the part of the file names before the .00.*, .01.* and similar extensions. Create a folder relative to the Python code to hold the database. The Swiss-Prot database is faster to download and use, which makes it easier to learn with.
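
If it is unclear which name to pass to -db, it can be worked out from the files in the database folder. This is a small sketch, assuming the usual three-letter protein database extensions that start with "p" (such as .phr, .pin, .psq and .pal). The function names are hypothetical.

```python
import os
import re

# An optional volume number (.00, .01, ...) followed by a three-letter
# protein database extension such as .phr, .pin, .psq or .pal
_VOLUME_SUFFIX = re.compile(r'(\.\d+)?\.p[a-z]{2}$')


def database_names(filenames):
    """Return the distinct database names found in a list of file names."""
    names = set()
    for name in filenames:
        if _VOLUME_SUFFIX.search(name):
            names.add(_VOLUME_SUFFIX.sub('', name))
    return sorted(names)


def database_names_in(folder):
    """Return the database names for the files inside a folder."""
    return database_names(os.listdir(folder))
```

For a folder holding uniref50.fasta.00.phr, uniref50.fasta.00.pin and so on, this yields uniref50.fasta, which is the name used after the folder in the -db argument.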

Set -num_threads according to the number of processor cores available. To save the PSSM from each iteration, use -save_each_pssm. Run psiblast -help from the installation bin folder for details on all options.
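
One way to avoid hard-coding the thread count is to ask Python how many logical cores the machine has. This is a sketch with a hypothetical helper name. Whether using every core is the best setting depends on the machine, so treat the default as an assumption.

```python
import os


def psiblast_command(query, db, out_pssm, psiblast='psiblast',
                     threads=None, save_each=False):
    """Build a psiblast argument list for subprocess."""
    if threads is None:
        # os.cpu_count() can return None, so fall back to a single thread
        threads = os.cpu_count() or 1
    cmd = [psiblast, '-query', query, '-db', db,
           '-num_iterations', '3', '-evalue', '0.001',
           '-num_threads', str(threads),
           '-out_ascii_pssm', out_pssm]
    if save_each:
        # Save a PSSM after every iteration, not only the last one
        cmd.append('-save_each_pssm')
    return cmd


print(psiblast_command('A0JNU3.fasta', 'swissprot/swissprot', 'A0JNU3.pssm'))
```

The returned list can be passed straight to subprocess.call.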



Code:

import os
import subprocess

# Local PSI-BLAST installation path
path_to_psiblast = 'C:\\Program Files\\NCBI\\blast-2.7.1+\\bin\\psiblast.exe'

# Folder containing the query proteins in FASTA format
fasta_path = 'processed_fastas/mouse_train/'

# Folder where the PSSMs are written; create it if it does not exist yet
pssm_path = 'pssm/mouse_train/'
os.makedirs(pssm_path, exist_ok=True)

# Collect regular files only, skipping any subfolders
onlyfiles = [f for f in os.listdir(fasta_path)
             if os.path.isfile(os.path.join(fasta_path, f))]

# Example command line for a single query:
# psiblast -query A0JNU3.fasta -db swissprot/swissprot -num_iterations 3 -evalue 0.001 -num_threads 8 -save_each_pssm -out_ascii_pssm A0JNU3.pssm
for i in onlyfiles:
    query_fasta = fasta_path + i

    # Output filename for each PSSM (input file name plus '.pssm')
    output_pssm = pssm_path + i + '.pssm'

    # Call the subprocess with one list element per argument
    subprocess.call([path_to_psiblast, '-query', query_fasta,
                     '-db', 'uniref50/uniref50.fasta',
                     '-num_iterations', '3', '-evalue', '0.001',
                     '-num_threads', '8',
                     '-out_ascii_pssm', output_pssm])
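
Because each PSSM takes minutes to generate, it can help to skip queries that already have an output file when the script is restarted. The sketch below assumes the output naming used in the code above (input file name plus '.pssm'). The function name is hypothetical.

```python
import os


def pending_queries(fasta_dir, pssm_dir):
    """Return the FASTA file names in fasta_dir that have no PSSM in pssm_dir yet."""
    pending = []
    for name in sorted(os.listdir(fasta_dir)):
        if not os.path.isfile(os.path.join(fasta_dir, name)):
            continue  # skip subfolders
        if not os.path.exists(os.path.join(pssm_dir, name + '.pssm')):
            pending.append(name)
    return pending
```

Looping over pending_queries('processed_fastas/mouse_train/', 'pssm/mouse_train/') in place of the full file list lets an interrupted run resume where it stopped.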
