In this section, we will write a program that classifies handwritten digits. We are going to use the MNIST handwritten digits dataset (https://www.kaggle.com/c/digit-recognizer/data#), which contains 42,000 labelled samples for training the model; I will be using this one file for both training and testing. I am assuming that you are familiar with classification and have some idea of the scikit-learn library; if not, check out the following articles before you move on with this one:
Introduction to Machine Learning
Linear and Logistic Regression
Introduction to Matplotlib Plotting
In this blog, I am going to write a simple program that can classify handwritten digits; it can be improved considerably with techniques such as grid search and other forms of hyperparameter tuning (a short sketch appears near the end of this post). The IDE I have used is Google Colaboratory, but you can use any IDE of your choice. The first step is to import the required libraries, so we'll begin by importing everything we are going to use throughout the program.
# handling imports
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
If you haven't installed these libraries yet, you can install them with pip (for example, pip install numpy pandas scikit-learn matplotlib). Then we will load the dataset from the local system into pandas so that we can manipulate it easily: pandas reads the CSV into a DataFrame, a table-like structure that makes these operations very convenient.
# for google colab user only
from google.colab import files
# first you need to upload the file from
# your local system to google drive
uploaded = files.upload()
# Then we will use pandas
import io
df = pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))
print(df.head())
# for other ide's
# load the dataset
df = pd.read_csv("train.csv")
# printing first 5 rows
df.head()
Then the next step is preparing the data. As I mentioned earlier, I am going to use the train.csv file for both training and testing. So first we will load the features into a vector X and the respective labels into another vector Y (shown in the sketch below), and then use those two vectors to split the data into a training portion (for fitting the model) and a testing portion (for prediction). For the split I have used the train_test_split method from scikit-learn, which has a parameter named test_size that decides how much of your data is used for testing; the rest is used for training.
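First we separate the features and labels. This is a minimal sketch that assumes the Kaggle train.csv layout, where the first column, label, holds the digit and the remaining 784 columns hold the pixel values:
# separate features and labels
# (assumes the Kaggle train.csv layout: a 'label' column
# followed by 784 pixel columns, pixel0 ... pixel783)
Y = df['label']
X = df.drop(columns=['label'])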
from sklearn.model_selection import train_test_split as tts
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size = 0.2) # 0.2 indicates 20% of data used for testing
# call the model [DecisionTreeClassifier]
clf = DecisionTreeClassifier()
# train the model with training data
clf.fit(X_train, Y_train)
# predict on the unseen test data
predicted = clf.predict(X_test)
print(predicted)
# print(len(predicted))
# print(len(Y_test))
# len(X_test)
# uncomment the 3 lines below if you want to use RandomForest as the classifier
# rf = RandomForestClassifier()
# rf.fit(X_train, Y_train)
# rf.predict(X_test)
# uncomment the 3 lines below if you want to use a Support Vector Machine as the classifier
# md = svm.SVC(kernel = 'linear')
# md.fit(X_train, Y_train)
# md.predict(X_test)
Output:
[8 0 5 ... 7 7 5]
Here I have printed the predicted data, but you can print anything you want to know about; the more you explore, the more you get to know your dataset. The method clf.fit(X_train, Y_train) trains the model: it takes two arguments, the features first and the respective labels second, so that the model can learn the pattern.
Then it's time to predict, which means it's time to see how our trained model performs on data it has not seen before. Here we use the clf.predict(X_test) method, which takes only one argument (the features) and returns the predicted labels; we then compare those predictions with the actual labels Y_test to calculate the accuracy of our model. One thing to note here is that I have used a DecisionTreeClassifier() for classification and commented out the other two models, RandomForestClassifier() and svm.SVC() (a support vector machine).
So I would recommend running those models as well, so that you can compare which one performs better on this data. There are tons of models available; if you want to know more about them, visit the official scikit-learn documentation.
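If you want a quick way to compare them, here is a minimal sketch that fits each of the three models on the same split and prints its test accuracy; note that a linear SVM on the raw 784-pixel features can take a while to train:
# fit each classifier on the same split and print its test accuracy
from sklearn.metrics import accuracy_score

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Linear SVM": svm.SVC(kernel = 'linear'),  # slow on raw pixel features
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    print(name, ":", accuracy_score(Y_test, model.predict(X_test)))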
So far our model has been trained and has started predicting. Now we will calculate our model's accuracy, that is, how many data points have been correctly classified. I have used two methods: the first is scikit-learn's built-in function, and the second is custom-made so that you can understand it better.
# built-in function for accuracy calculation
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, predicted))
# custom accuracy calculation
count = 0
for i in range(len(predicted)):
    # .iloc gives positional access, since Y_test keeps its shuffled pandas index
    if predicted[i] == Y_test.iloc[i]:
        count += 1
print("Accuracy :", (count / len(predicted)) * 100)
Here we have compared the predicted values with the actual labels and calculated the accuracy. There are tons of things left to discuss, but this is a simple implementation of a machine learning model using scikit-learn.
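I mentioned grid search at the beginning; here is a minimal sketch using scikit-learn's GridSearchCV on the decision tree. The parameter values below are illustrative examples, not tuned choices:
# minimal grid-search sketch for the decision tree
# (the parameter values here are just examples)
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [10, 20, None], 'min_samples_split': [2, 10]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 3)
search.fit(X_train, Y_train)
print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated accuracy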
Additionally, if you want to visualize one row of the test data and its prediction, write this program:
# visualize one test row and its prediction
import matplotlib.pyplot as plt

d = X_test.iloc[8].to_numpy()  # can use any index below len(X_test)
d.shape = (28, 28)
plt.imshow(255 - d, cmap = "gray")  # 255-d gives a white background with a black digit
plt.show()
print(clf.predict(X_test.iloc[[8]]))  # double brackets keep it a one-row DataFrame
Output: (the rendered digit image, followed by the predicted label)
So if you have any doubts, you can ask me in the comment section below, and if you want to read more of the articles I have written, you can visit my profile.