Tuesday, September 25, 2018

Bioinformatics Python Uniprot Protein Sequence Fasta Downloader with Obsolete Check and Custom Range


Explanation:

This Python script downloads protein FASTA sequences from the UniProt protein database. It can be used in two ways: by pasting protein identifiers directly into the code, or by reading them from a file. If a download stalls or the connection drops partway through, the loop range can be changed to a custom range, so the files already downloaded are kept and the download resumes from the chosen position.

Some proteins can be obsolete, for example because of a tagging problem or because a researcher removed the entry. The code handles these obsolete proteins by collecting their identifiers in a list and printing it at the end. By default, the data is saved in the fastas/mouse folder.
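
As a minimal sketch of the obsolete check described above, the helper below decides whether a downloaded response looks like a usable FASTA record. The name `looks_like_fasta` is hypothetical and not part of the script. It assumes that a valid record starts with a `>` header line followed by at least one sequence line, and that anything else (including an empty response) belongs on the obsolete list. The sample record is made up for illustration.

```python
def looks_like_fasta(text):
    """Return True if text has a '>' header line and at least one sequence line."""
    # Drop blank lines so trailing newlines do not affect the check
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    return lines[0].startswith(">") and not lines[1].startswith(">")


# Example with a made-up record (not real UniProt data)
sample = ">sp|X00000|EXAMPLE_MOUSE Example protein\nMKTAYIAKQR\n"
print(looks_like_fasta(sample))  # True
print(looks_like_fasta(""))      # False
```

A check like this is stricter than testing for an empty string, because it also rejects responses that contain text but no sequence.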

Code:

all_proteins = []


"""
# Use only if your data is in (ID Position Sequence) Format
text_file = open("peptide_data/allfasta.txt", "r")

for i in text_file:
    temp = i.split(' ')
    all_proteins.append(temp[0])
text_file.close()

all_proteins = list(sorted(set(all_proteins)))
print(all_proteins)
"""

# If you want to use this list, keep the block above commented out, and vice versa
all_proteins = [
'P62821',
'Q9R1P0',
'P63101',
'Q8CAQ8',
'Q9ET01'
]


obsolete_list = []

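# os creates the output folder; urllib.request fetches each FASTA page
import os
import urllib.request

# Without this folder every write fails and every protein ends up in obsolete_list
os.makedirs("fastas/mouse", exist_ok=True)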

# To resume after an interruption, change the start of the range, for example range(3, len(all_proteins))
for i in range(0, len(all_proteins)):
    query_page = 'https://www.uniprot.org/uniprot/' + all_proteins[i] + '.fasta'
    print(query_page)

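    try:
        with urllib.request.urlopen(query_page) as url:
            page = url.read()

        fasta_string = page.decode("utf8")
        print(fasta_string)
        if len(fasta_string) == 0:
            # An empty response is treated as an obsolete entry; no file is written
            obsolete_list.append(all_proteins[i])
        else:
            # "w" overwrites, so re-running a range does not duplicate sequences
            with open("fastas/mouse/" + all_proteins[i] + ".fasta", "w") as p:
                p.write(fasta_string)
    except Exception as error:
        # Failed requests (for example HTTP errors) are also recorded as obsolete
        print("Failed:", all_proteins[i], error)
        obsolete_list.append(all_proteins[i])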

    print("Now at: ", i)

print(list(set(obsolete_list)))
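
The custom range can also be worked out automatically rather than edited by hand. The following is a rough sketch, not part of the original script: the hypothetical helper `pending_ids` returns the identifiers whose `.fasta` file is missing or empty in the output folder, so a re-run only requests those. It assumes one file per identifier named `<ID>.fasta`, as in the loop above.

```python
import os


def pending_ids(protein_ids, out_dir):
    """Return the IDs that still need downloading, in their original order."""
    pending = []
    for protein_id in protein_ids:
        path = os.path.join(out_dir, protein_id + ".fasta")
        # A missing or zero-byte file counts as not downloaded yet
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            pending.append(protein_id)
    return pending


# Usage: loop over the pending IDs instead of the full list
# for protein_id in pending_ids(all_proteins, "fastas/mouse"):
#     ...download as before...
```

Note that identifiers already reported as obsolete have no usable file either, so they would be requested again on each re-run unless they are filtered out separately.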
