Generally, we can visualize data in 1D, 2D, and 3D, but real-life datasets often contain hundreds or thousands of features, which makes them impossible to visualize directly. To address this problem, we use dimensionality reduction.
This article covers two fundamental techniques that summarize the information content of a dataset by transforming it onto a new feature subspace of lower dimensionality than the original one. All of the code is written in Python using Scikit-learn.
We will cover the following topics:
- Principal component analysis (PCA) for unsupervised data compression
- Linear Discriminant Analysis (LDA) as a supervised dimensionality reduction technique for maximizing class separability
Unsupervised Dimensionality Reduction via Principal Component Analysis
Feature extraction constructs new features from the original ones so that we can work with less data while keeping most of the important information, which makes downstream processing faster and simpler. PCA is one way to do feature extraction. It finds patterns in the data by looking at the correlations between features and projects the data onto a new subspace with fewer dimensions, whose axes are orthogonal (at right angles) to each other and point in the directions of maximum variance, as shown in the picture below. Here, x1 and x2 are the original feature axes, and PC1 and PC2 are the principal components:
Before looking at the PCA algorithm for dimensionality reduction in more detail, let's summarize the approach in a few simple steps:
- Standardize the d-dimensional dataset.
- Construct the covariance matrix.
- Decompose the covariance matrix into its eigenvectors and eigenvalues.
- Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
- Construct a projection matrix W from the "top" k eigenvectors.
- Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace (a minimal NumPy sketch of these steps follows this list).
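As a complement to the Scikit-learn example that follows, here is a minimal NumPy sketch of the six steps above, assuming a toy dataset and k = 2 (the data and variable names are only for illustration):
import numpy as np

# Toy data: 100 samples with d = 4 features (stand-in for any d-dimensional dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# 1. Standardize the d-dimensional dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Construct the covariance matrix
cov_mat = np.cov(X_std.T)

# 3. Decompose the covariance matrix into its eigenvectors and eigenvalues
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# 4. Select the k eigenvectors with the k largest eigenvalues (here k = 2)
k = 2
order = np.argsort(eigen_vals)[::-1]

# 5. Construct the projection matrix W from the "top" k eigenvectors
W = eigen_vecs[:, order[:k]]

# 6. Transform X onto the new k-dimensional feature subspace
X_pca = X_std @ W
print(X_pca.shape)  # (100, 2)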
Principal component analysis in Scikit-learn:
PCA is another one of Scikit-learn's transformer classes: we first fit the model on the training data and then transform both the training data and the test data using the same model parameters. A typical workflow is to apply PCA, classify the transformed samples with a classifier such as logistic regression, and visualize the decision regions. The example below applies PCA from Scikit-learn to the Iris dataset and visualizes the samples in the new two-component space; a sketch of the classification step follows the code.
Code for PCA:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Load dataset into Pandas DataFrame
df = pd.read_csv(url, names=["sepal length", "sepal width", "petal length", "petal width", "target"])

from sklearn.preprocessing import StandardScaler
features = ["sepal length", "sepal width", "petal length", "petal width"]
# Separate out the features
x = df.loc[:, features].values
# Separate out the target
y = df.loc[:, ["target"]].values
# Standardize the features
x = StandardScaler().fit_transform(x)

from sklearn.decomposition import PCA
# Create a PCA object with two components
pca = PCA(n_components=2)
# Fit and transform the features to get the principal components
principal_components = pca.fit_transform(x)
# Create a DataFrame with the principal components and the target
principal_df = pd.DataFrame(data=principal_components, columns=["principal component 1", "principal component 2"])
final_df = pd.concat([principal_df, df[["target"]]], axis=1)

# Plot the data using the principal components and the target
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel("Principal Component 1", fontsize=15)
ax.set_ylabel("Principal Component 2", fontsize=15)
ax.set_title("2 component PCA", fontsize=20)
targets = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
colors = ["r", "g", "b"]
for target, color in zip(targets, colors):
    indices_to_keep = final_df["target"] == target
    ax.scatter(final_df.loc[indices_to_keep, "principal component 1"],
               final_df.loc[indices_to_keep, "principal component 2"],
               c=color, s=50)
ax.legend(targets)
ax.grid()

# Print the explained variance ratio of the PCA
print(f"The explained variance ratio is: {pca.explained_variance_ratio_}")
Supervised data compression via linear discriminant analysis:
LDA is a technique for constructing new features that separate the classes as well as possible. It is similar to PCA, which constructs new features that capture the maximum variance, but LDA is supervised while PCA is not. LDA might therefore seem like the better choice for classification tasks, yet PCA sometimes works better, for example when there are only a few samples per class.
Code in sklearn:
import numpy as np
import pandas as pd
# loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
# data preprocessing
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# performing LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
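To see what the single LDA component gives us for classification, we can train any classifier on the transformed features. The following is just an illustrative sketch (logistic regression is an arbitrary choice, not part of the original code), reusing X_train, X_test, y_train, and y_test from above:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a classifier on the single LDA component and evaluate it on the test split
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(f"Accuracy on the 1-component LDA subspace: {accuracy_score(y_test, y_pred):.2f}")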
Final Thoughts:
In the case of uniformly distributed data, LDA almost always performs better than PCA. However, if the data is highly skewed (irregularly distributed), it is advisable to use PCA, since LDA can be biased towards the majority class.
Finally, PCA has the benefit that it can be applied to labelled as well as unlabelled data, since it does not rely on the output labels. LDA, on the other hand, requires the output classes to find the linear discriminants and hence needs labelled data.