Wednesday, October 3, 2018
Python Sub Process Local Psi Blast PSSM Generation from FASTA in Directory using Uniref50 Database in Pycharm
Labels: Ascii PSSM from Fasta, bioinformatics, Local makeblastdb from Uniref50, Process Files in Directory, Pycharm Psi Blast, Python Sub Process External Program
The UniRef50 database can be downloaded from the NCBI FTP site as well as from http://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref50/.
The Swiss-Prot database can be used for quick testing. It can be downloaded from http://ftp.ebi.ac.uk/pub/databases/swissprot/release/ or ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Download the tar.gz or fasta.gz files.
Download the latest version of PSI-BLAST (part of the BLAST+ package) from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/.
Generating each PSSM takes a long time and a lot of memory. With 8 threads and 8 GB of memory, only 4 to 5 PSSMs are produced in 10 minutes. With UniRef50, CPU and RAM usage stays at 100% because of the size of the database. Since the whole database cannot be held in memory at once, disk reads run constantly at almost 100% on an HDD and almost 50% on an SSD. There may be some overhead from Python, so calling PSI-BLAST from another language such as C++ might be a little faster, although the bulk of the time is spent inside psiblast itself.
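Since each PSSM takes minutes, a long run will probably get interrupted at some point, and starting over from the first file is wasteful. Here is a minimal sketch of one way to resume. It assumes the output naming used in the code of this post (the FASTA filename plus .pssm); the helper name pending_queries is my own and not part of any library.

```python
import os


def pending_queries(fasta_dir, pssm_dir):
    """Return FASTA filenames in fasta_dir that have no non-empty PSSM yet.

    Assumption: each PSSM is saved as <fasta filename> + '.pssm' in pssm_dir.
    """
    pending = []
    for name in sorted(os.listdir(fasta_dir)):
        if not os.path.isfile(os.path.join(fasta_dir, name)):
            continue  # skip sub-folders
        pssm_file = os.path.join(pssm_dir, name + '.pssm')
        # A missing or zero-byte output is treated as "not done yet"
        if not os.path.isfile(pssm_file) or os.path.getsize(pssm_file) == 0:
            pending.append(name)
    return pending
```

Looping over pending_queries('processed_fastas/mouse_train/', 'pssm/mouse_train/') instead of every file in the folder would then only run the queries that are still missing.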
I do not know whether the makeblastdb command works directly on the archives above, so I first extracted the FASTA file from the archive and then built the local BLAST databases from UniRef50 and Swiss-Prot. To create the local database, run this from the command line in the bin folder of the BLAST+ installation: makeblastdb -in uniref50.fasta -dbtype prot -out uniref50. The database can be given a different name or placed in a separate folder by changing the parameters.
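The extract-then-build steps can also be scripted from Python. This is only a sketch under a few assumptions: the download is a plain .gz file (not a tar.gz), makeblastdb is on the PATH, and the function names gunzip, makeblastdb_command and build_database are hypothetical names of mine.

```python
import gzip
import shutil
import subprocess


def gunzip(src_gz, dest):
    """Decompress a .gz file to dest in chunks, without loading it all into memory."""
    with gzip.open(src_gz, 'rb') as f_in, open(dest, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)


def makeblastdb_command(fasta_file, db_name, makeblastdb='makeblastdb'):
    """Build the argument list for the makeblastdb call described in this post."""
    return [makeblastdb, '-in', fasta_file, '-dbtype', 'prot', '-out', db_name]


def build_database(src_gz, fasta_file, db_name):
    """Extract the FASTA file, then create the protein database from it."""
    gunzip(src_gz, fasta_file)
    # check=True raises CalledProcessError if makeblastdb exits with an error
    subprocess.run(makeblastdb_command(fasta_file, db_name), check=True)
```

For example, build_database('uniref50.fasta.gz', 'uniref50.fasta', 'uniref50') would mirror the command line above; the archive filename here is a guess, so use whatever the downloaded file is actually called.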
I mistakenly named my UniRef50 database uniref50.fasta, which is why the code uses that name. The database name is the part of the file names before the .00.*, .01.* extensions. Create a folder relative to the Python code to keep the database in. The Swiss-Prot database is faster to download and use, which makes it easier to learn with.
Set -num_threads according to the processor. To save each PSSM, use -save_each_pssm. Run psiblast -help from the bin folder of the installation to get details on all the options.
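As a rough illustration of how these options could be put together in Python, the sketch below builds the psiblast argument list and falls back to os.cpu_count() when no thread count is given. The function name psiblast_args and its defaults are my own assumptions; note that os.cpu_count() reports logical CPUs, which may be more than the number of physical cores.

```python
import os


def psiblast_args(psiblast, query, db, out_pssm,
                  iterations=3, evalue=0.001, threads=None, save_each=False):
    """Build the psiblast argument list for one query FASTA file."""
    if threads is None:
        # Logical CPU count; falls back to 1 if it cannot be determined
        threads = os.cpu_count() or 1
    args = [psiblast,
            '-query', query,
            '-db', db,
            '-num_iterations', str(iterations),
            '-evalue', str(evalue),
            '-num_threads', str(threads),
            '-out_ascii_pssm', out_pssm]
    if save_each:
        # Optional flag mentioned in this post
        args.append('-save_each_pssm')
    return args
```

The returned list can be passed straight to subprocess.call, for example psiblast_args('psiblast', 'A0JNU3.fasta', 'swissprot/swissprot', 'A0JNU3.pssm', threads=8).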
Code:
import os
import subprocess

# Local PSI-BLAST installation path
path_to_psiblast = 'C:\\Program Files\\NCBI\\blast-2.7.1+\\bin\\psiblast.exe'

# Folder containing the query proteins in FASTA format
fasta_path = 'processed_fastas/mouse_train/'

# Folder for the generated PSSM files; created if it does not exist yet
pssm_path = 'pssm/mouse_train/'
os.makedirs(pssm_path, exist_ok=True)

# Keep regular files only (skip sub-folders)
onlyfiles = [f for f in os.listdir(fasta_path)
             if os.path.isfile(os.path.join(fasta_path, f))]

# Equivalent command line for a single query:
# psiblast -query A0JNU3.fasta -db swissprot/swissprot -num_iterations 3 -evalue 0.001 -num_threads 8 -save_each_pssm -out_ascii_pssm A0JNU3.pssm
for i in onlyfiles:
    query_fasta = os.path.join(fasta_path, i)

    # Output filename for each PSSM
    output_pssm = os.path.join(pssm_path, i + '.pssm')

    # Call the sub process with proper arguments
    subprocess.call([path_to_psiblast, '-query', query_fasta,
                     '-db', 'uniref50/uniref50.fasta',
                     '-num_iterations', '3', '-evalue', '0.001',
                     '-num_threads', '8', '-out_ascii_pssm', output_pssm])
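subprocess.call returns the exit code of psiblast, but the loop above ignores it, so a query that fails goes unnoticed. Below is a small sketch of how the exit code and the error output could be checked. The function name run_and_report is hypothetical, and I have not tested it against every psiblast failure mode.

```python
import subprocess
import sys


def run_and_report(args, label):
    """Run one external command; return True on exit code 0, otherwise report it."""
    result = subprocess.run(args,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        # Show which query failed together with the program's own error text
        print('FAILED ({}): exit code {}'.format(label, result.returncode),
              file=sys.stderr)
        print(result.stderr.strip(), file=sys.stderr)
        return False
    return True
```

Inside the loop, the subprocess.call line could be swapped for run_and_report with the same argument list and the filename as the label, collecting the filenames for which it returns False so they can be rerun later.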