Saturday, May 13, 2017

Converting MNIST Handwritten Digits Dataset into CSV with Sorting and Extracting Labels and Features into Different CSV using Python


This tutorial assumes you have Python 3 installed and added to your PATH. Most of the work is done on the command line. Download the MNIST Handwritten Digits database from here.

The first task is to download and extract the data. For example, the training set features are in a file named train-images.idx3-ubyte and the labels are in a file named train-labels.idx1-ubyte. The conversion code is taken from, and explained at, https://pjreddie.com/projects/mnist-in-csv/.
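Before converting, it can help to confirm that the extracted files are what you expect. The sketch below reads the header of an IDX file: a 4-byte magic number (two zero bytes, a data-type byte, and a dimension-count byte) followed by one big-endian 32-bit size per dimension. The function name read_idx_header is my own and is not part of the linked code.

```python
import struct

def read_idx_header(path):
    """Return (magic, dims) read from the header of an IDX file."""
    with open(path, "rb") as f:
        # Bytes 0-1 are zero, byte 2 is the data type, byte 3 is the number of dimensions.
        zero, dtype, ndim = struct.unpack(">HBB", f.read(4))
        # One big-endian unsigned 32-bit size follows for each dimension.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    magic = (zero << 16) | (dtype << 8) | ndim
    return magic, dims

# Example (assumes the extracted files are in the current directory):
# print(read_idx_header("train-images.idx3-ubyte"))
# print(read_idx_header("train-labels.idx1-ubyte"))
```

For the training files this should report a magic number of 2051 with sizes (60000, 28, 28) for the images and 2049 with (60000,) for the labels. The header is 4 bytes plus 4 bytes per dimension, which is why the conversion code skips 16 bytes of the image file and 8 bytes of the label file.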

I will only show the example for the training set, and my file names are a little different.

MNIST Training Dataset to CSV:

def convert(imgf, labelf, outf, n):
    """Write n MNIST samples to outf, one per row: the label, then 784 pixel values."""
    with open(imgf, "rb") as f, open(labelf, "rb") as l, open(outf, "w") as o:
        f.read(16)  # skip the image file header (magic number, count, rows, columns)
        l.read(8)   # skip the label file header (magic number, count)

        for i in range(n):
            # The first column is the label, followed by the 28*28 pixel bytes.
            image = [ord(l.read(1))]
            image.extend(f.read(28 * 28))
            o.write(",".join(str(pix) for pix in image) + "\n")

convert("train-images.idx3-ubyte", "train-labels.idx1-ubyte",
        "mnist_train.csv", 60000)

Now save the file as mnist_to_csv.py, open cmd and run python mnist_to_csv.py. The program outputs a CSV file named mnist_train.csv, which is about 104 MB. LibreOffice fails to open a file this large, and other such programs may fail too.
The easiest workaround is to split the CSV into multiple parts. This can be done in Python, but a very simple tool called CSV Splitter gets the job done easily.

CSV Splitter:



The number of rows per package is the number of rows in each output CSV. I chose 10000 rows per CSV. Choosing 5000 will be faster, but it will generate twice as many CSV files. For me it generated 6 CSV files, with a file number appended to the original name, e.g. mnist_train-000.csv, ..., mnist_train-005.csv.
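If you would rather stay in Python, the same split can be sketched in a few lines. This is only an approximation of what CSV Splitter does: the function name split_csv and its rows_per_file parameter are my own, and it splits on plain lines, which is safe here only because the MNIST CSV has no quoted fields containing line breaks.

```python
import os

def split_csv(path, rows_per_file=10000):
    """Split path into numbered parts and return the list of part file names."""
    base, ext = os.path.splitext(path)
    parts = []
    out = None
    with open(path, "r", newline="") as src:
        for i, line in enumerate(src):
            # Start a new part every rows_per_file lines.
            if i % rows_per_file == 0:
                if out:
                    out.close()
                name = "{}-{:03d}{}".format(base, i // rows_per_file, ext)
                out = open(name, "w", newline="")
                parts.append(name)
            out.write(line)
    if out:
        out.close()
    return parts

# Example: split_csv("mnist_train.csv", 10000)
```

The parts are numbered from 000 in the same style as above, so the later scripts can be pointed at any one of them.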
The data in these CSVs isn't sorted, and iterating over unsorted rows takes a lot of time. A simple Python script can sort the rows by label. With the code above, the label is the first column.

Python CSV Sort Rows Based on Column:

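import csv

# Sort the rows of one split file by label (column 0) and write them to randomized.csv.
with open('mnist_train-003.csv', newline='') as sample, open('randomized.csv', 'w', newline='') as out:
    csv_reader = csv.reader(sample)
    csv_writer = csv.writer(out)
    # Uncomment if the input has a header row that should stay on top:
    # header = next(csv_reader, None)
    # if header:
    #     csv_writer.writerow(header)
    # Compare labels as integers rather than as strings.
    csv_writer.writerows(sorted(csv_reader, key=lambda row: int(row[0])))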

The commented-out part handles a header row, but our current dataset has no column headers. The script generates a file named randomized.csv for the given input CSV, with the rows sorted by label despite the file name. To get the labels and features into separate files, follow the next steps.
The code below removes the label from randomized.csv and writes the remaining data to new_randomized.csv.

Python CSV Delete column:

import csv
fname_in = 'randomized.csv'
fname_out = 'new_randomized.csv'
with open(fname_in, 'r') as fin, open(fname_out, 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        writer.writerow(row[1:])

The remaining task is to get the labels into a separate CSV file.

CSV Extract Column to New CSV:

import csv

# Collect the first column (the label) of every row.
with open("randomized.csv", "r", newline='') as csvfile:
    reader = csv.reader(csvfile)
    collected = []
    for row in reader:
        collected.append(row[0])

with open("extracted_randomized_column.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for label in collected:
        # writerow expects a sequence of fields, so wrap the single label in a list.
        writer.writerow([label])

Change the delimiter from a comma to something else if needed. Now extracted_randomized_column.csv contains the labels and new_randomized.csv contains the features from mnist_train-003.csv.
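The two scripts above each read randomized.csv once. As a possible shortcut, both outputs can be written in a single pass. The sketch below is one way to do that. The function name split_labels_features is hypothetical, and it assumes, as above, that the label is the first column.

```python
import csv

def split_labels_features(src, labels_out, features_out):
    """Write column 0 of src to labels_out and the rest to features_out; return the row count."""
    count = 0
    with open(src, newline="") as fin, \
         open(labels_out, "w", newline="") as flab, \
         open(features_out, "w", newline="") as ffeat:
        label_writer = csv.writer(flab)
        feature_writer = csv.writer(ffeat)
        for row in csv.reader(fin):
            label_writer.writerow(row[:1])    # the label, kept as a one-field row
            feature_writer.writerow(row[1:])  # the 784 pixel values
            count += 1
    return count

# Example:
# split_labels_features("randomized.csv",
#                       "extracted_randomized_column.csv",
#                       "new_randomized.csv")
```

Because rows are streamed rather than collected in a list, memory use stays small even for the full 60000-row file.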

Greyscale CSV to Black and White CSV Conversion:

import csv

# Read every row of grayscale pixel values (0-255).
with open('new_randomized.csv', newline='') as sample:
    rows = list(csv.reader(sample))

# Threshold each pixel: values above 128 become 1, everything else becomes 0.
collected = []
for row in rows:
    collected.append(['1' if int(pix) > 128 else '0' for pix in row])

with open("bw_output.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # Each row is a list with one '0' or '1' per pixel, so every pixel gets its own column.
    writer.writerows(collected)

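Now copy and paste a row from bw_output.csv to see the black and white 28x28 image, or a row from new_randomized.csv to see the 28x28 grayscale image. Rows in bw_output.csv hold only 0 and 1, so multiply the values by 255 before displaying them, otherwise the image looks almost entirely black. Instead of copying and pasting, you can also read a row from the CSV in code and see what that row represents.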

CSV Row to Image using PIL:

import numpy as np
from PIL import Image

A = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
B = np.reshape(A, (-1, 28)).astype(np.uint8)  # 28 pixels per row; PIL expects 8-bit values for a grayscale image

img = Image.fromarray(B)
img.show()
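To avoid pasting a row by hand, a row can be read straight from the CSV. The sketch below is one possible way to do it. The names load_row and show_row and the scale parameter are my own, and it assumes a features-only file such as new_randomized.csv or bw_output.csv with 784 values per row.

```python
import csv

def load_row(path, row_index, width=28):
    """Return row row_index of a features CSV as a list of pixel rows (lists of ints)."""
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i == row_index:
                pixels = [int(v) for v in row]
                # Cut the flat list into chunks of `width` pixels, one chunk per image row.
                return [pixels[k:k + width] for k in range(0, len(pixels), width)]
    raise IndexError("row {} not found in {}".format(row_index, path))

def show_row(path, row_index, scale=1):
    """Display one CSV row as an image; use scale=255 for 0/1 data."""
    # Imported here so load_row works without NumPy and Pillow installed.
    import numpy as np
    from PIL import Image
    grid = np.array(load_row(path, row_index), dtype=np.uint8) * scale
    Image.fromarray(grid).show()

# Example:
# show_row("new_randomized.csv", 0)          # grayscale values 0-255
# show_row("bw_output.csv", 0, scale=255)    # 0/1 values stretched to 0/255
```

Reading stops as soon as the requested row is found, so looking at an early row does not require loading the whole file.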
