To understand model evaluation and hyperparameter tuning for building and testing a machine learning model, we will pick a dataset, split it into separate training and test sets, and implement an ML algorithm on it.
We will be working with the Breast Cancer Wisconsin dataset, which contains 569 samples of malignant and benign tumor cells. The first column in the dataset stores the unique ID numbers of the samples and the second column holds the corresponding diagnosis (M = malignant, B = benign). Columns 3 to 32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
The Breast Cancer Wisconsin dataset has been deposited in the UCI Machine Learning Repository, and more detailed information about this dataset can be found on the UCI website.
Reading the Data and Splitting it:
In this section we will read the data from the dataset, and split it into training and test datasets in just three simple steps:
import pandas as pd
from urllib.error import URLError

try:
    # load the dataset directly from the UCI machine learning repository
    df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
except URLError:
    # fall back to a mirror of the dataset if the UCI server is unreachable
    df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data', header=None)
import sklearn
from sklearn.preprocessing import LabelEncoder

# assign the 30 features to X and the diagnosis column to y; using LabelEncoder
# we transform the class labels from their original string representation
# ('M', 'B') into integers
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

# check the mapping for the two class labels: 'M' -> 1, 'B' -> 0
le.transform(['M', 'B'])
# train_test_split moved to sklearn.model_selection in scikit-learn 0.18
if sklearn.__version__ < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# uncomment to inspect the resulting splits
# print(X_train)
# print(X_test)
The above code splits the dataset into two sets: a training set containing 80 percent of the samples and a test set containing the remaining 20 percent.
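As a quick, optional sanity check (a minimal sketch, not part of the original three steps), we can confirm the label encoding and the sizes of the two splits:
import numpy as np

# LabelEncoder assigns integers alphabetically, so 'B' (benign) -> 0 and 'M' (malignant) -> 1
print(le.classes_)

# an 80/20 split of 569 samples gives 455 training and 114 test samples
print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))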
Combining Transformers and Estimators in a pipeline:
You have learned that many learning algorithms require input features on the same scale for optimal performance. Thus, we need to standardize the columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression. Furthermore, let's assume that we want to compress our data from the initial 30 dimensions onto a lower two-dimensional subspace via principal component analysis (PCA), a feature extraction technique for dimensionality reduction.
Instead of going through the fitting and transformation steps for the training and test dataset separately, we can chain the StandardScaler, PCA, and LogisticRegression objects in a pipeline:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# chain standardization, PCA, and logistic regression into a single estimator
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
y_pred = pipe_lr.predict(X_test)
The Pipeline object takes a list of tuples as input, where the first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline, and the second element in every tuple is a scikit-learn transformer or estimator.
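For example (an optional illustration, not part of the original code), those identifier strings let us pull a fitted step back out of the pipeline by name through the named_steps attribute:
# access the fitted PCA step by the name we gave it ('pca')
print(pipe_lr.named_steps['pca'].explained_variance_ratio_)

# the fitted scaler can be reached the same way; mean_ holds the per-feature means learned on X_train
print(pipe_lr.named_steps['scl'].mean_[:5])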
Fine-tuning Machine Learning models via Grid Search:
In machine learning, we have two types of parameters:
- those that are learned from the training data, for example, the weights in logistic regression, and
- the parameters of a learning algorithm that are optimized separately.
The latter are the tuning parameters, also called hyperparameters, of a model, for example, the regularization parameter in logistic regression or the depth parameter of a decision tree. We will now look at a powerful hyperparameter optimization technique called grid search that can further improve the performance of a model by finding the optimal combination of hyperparameter values.
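To make the distinction concrete, here is a minimal sketch (assuming scikit-learn >= 0.18 for cross_val_score, and reusing the Pipeline imports from above) that evaluates our logistic regression pipeline by hand for a few values of the regularization parameter C; grid search simply automates this kind of loop over every combination of values:
from sklearn.model_selection import cross_val_score

# the weights inside LogisticRegression are learned from the training data,
# while C is a hyperparameter that we choose and vary by hand here
for C in [0.01, 0.1, 1.0, 10.0]:
    pipe = Pipeline([('scl', StandardScaler()),
                     ('clf', LogisticRegression(C=C, random_state=1))])
    scores = cross_val_score(pipe, X_train, y_train, cv=10, scoring='accuracy')
    print('C=%.2f  mean CV accuracy: %.3f' % (C, scores.mean()))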
Tuning hyperparameters via Grid search:
The approach of grid search is quite simple, it's a brute-force exhaustive search paradigm where we specify a list of values for different hyperparameters, and the computer evaluates the model performance for each combination of those to obtain the optimal set:
from sklearn.svm import SVC

# GridSearchCV moved to sklearn.model_selection in scikit-learn 0.18
if sklearn.__version__ < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]

# exhaustively evaluate every parameter combination with 10-fold cross-validation
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
Using the preceding code, we initialized a GridSearchCV object from the sklearn.model_selection module (sklearn.grid_search in older scikit-learn versions) to train and tune a support vector machine (SVM) pipeline. We set the param_grid parameter of GridSearchCV to a list of dictionaries to specify the parameters that we want to tune. For the linear SVM, we only evaluated the inverse regularization parameter C; for the RBF kernel SVM, we tuned both the C and gamma parameters. Note that the gamma parameter is specific to kernel SVMs. After we used the training data to perform the grid search, we obtained the score of the best-performing model via the best_score_ attribute and looked at its parameters, which can be accessed via the best_params_ attribute. In this particular case, the linear SVM model with clf__C = 0.1 yielded the best k-fold cross-validation accuracy: 97.8 percent.
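As a short follow-up sketch (an extra step beyond the code shown above), the best-performing model can be retrieved via the best_estimator_ attribute, refitted on the full training set, and then evaluated on the held-out test set:
# retrieve the best pipeline found by the grid search and estimate its generalization performance
clf = gs.best_estimator_
clf.fit(X_train, y_train)
print('Test accuracy: %.3f' % clf.score(X_test, y_test))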
Conclusion
By now you should have a good understanding of how to evaluate a model and tune its hyperparameters on a dataset. If you have any doubts, ask in the comments section below, and do check out the other articles in the curious section.