MAY 31, 2023

Dimensionality Reduction in Machine Learning

    Generally, we can visualize things in 1D, 2D, and 3D, but real-life datasets often contain hundreds or thousands of features, giving them a very high dimensionality. Data like this is hard to visualize and work with. To solve this problem, we use a family of methods called Dimensionality Reduction.

    There are two fundamental techniques that will help us summarize the information content of a dataset by transforming it onto a new feature subspace of lower dimensionality than the original one. All the code here is written using Scikit-learn in Python.

    We will cover the following topics:

    1. Principal component analysis (PCA) for unsupervised data compression

    2. Linear Discriminant Analysis (LDA) as a supervised dimensionality reduction technique for maximizing class separability

    Unsupervised Dimensionality Reduction via Principal Component Analysis

    Feature extraction creates new features from the original ones so that we can work with less data while keeping most of the important information, which makes later processing faster and simpler. PCA is one way to do feature extraction. It finds patterns in the data by looking at how the features are correlated, and it projects the data onto a new subspace with fewer dimensions that retains as much of the variance as possible. The axes of this new subspace are orthogonal (at right angles to each other) and point in the directions of maximum variance; if x1 and x2 are the original feature axes, then PC1 and PC2 are the principal components.
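    In equation form, this is a brief general sketch of PCA (standard textbook notation, not something given in this article): for a standardized data matrix X with n samples and d features, PCA builds the covariance matrix, finds its eigenvectors, and measures how much variance each component explains by its eigenvalue:

    \Sigma = \frac{1}{n-1} X^{\top} X,
    \qquad \Sigma\,\mathbf{w}_j = \lambda_j\,\mathbf{w}_j,
    \qquad \text{explained variance ratio}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}

    The eigenvectors w_j become the principal components, and the ratio in the last expression is what Scikit-learn reports as explained_variance_ratio_.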

    Before looking at the PCA algorithm for dimensionality reduction in more detail, let's summarize the approach in a few simple steps; a small NumPy sketch of these steps follows the list:

    1. Standardize the d-dimensional dataset.
    2. Construct the covariance matrix.
    3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
    4. Select k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
    5. Construct a projection matrix W from the "top" k eigenvectors.
    6. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
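    The following is a minimal NumPy sketch of these six steps, shown only for illustration; the data X here is a random placeholder and the variable names are assumptions, not part of this article's own listings:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Placeholder data for the sketch: 100 samples, d = 4 features
    X = np.random.rand(100, 4)
    k = 2  # dimensionality of the new feature subspace (k <= d)

    # 1. Standardize the d-dimensional dataset
    X_std = StandardScaler().fit_transform(X)

    # 2. Construct the covariance matrix
    cov_mat = np.cov(X_std.T)

    # 3. Decompose the covariance matrix into eigenvectors and eigenvalues
    eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

    # 4. Select the k eigenvectors belonging to the k largest eigenvalues
    order = np.argsort(eigen_vals)[::-1]

    # 5. Construct the projection matrix W from the "top" k eigenvectors
    W = eigen_vecs[:, order[:k]]

    # 6. Transform X onto the new k-dimensional feature subspace
    X_pca = X_std.dot(W)
    print(X_pca.shape)  # (100, 2)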

    Principal component analysis in Scikit-learn:

    PCA is another one of Scikit-learn's transformer classes: we first fit the model using the training data and then transform the data using the same fitted parameters. Now, let's use the PCA class from Scikit-learn on the Iris dataset, project the samples onto the first two principal components, and visualize the result in a scatter plot. (The transformed samples could then be classified, for example with logistic regression; a short sketch of that follows the listing.)

    Code for PCA:

    import pandas as pd

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    # Load dataset into Pandas DataFrame
    df = pd.read_csv(url, names=["sepal length", "sepal width", "petal length", "petal width", "target"])

    from sklearn.preprocessing import StandardScaler

    features = ["sepal length", "sepal width", "petal length", "petal width"]

    # Separate out the features
    x = df.loc[:, features].values

    # Separate out the target
    y = df.loc[:, ["target"]].values

    # Standardize the features
    x = StandardScaler().fit_transform(x)

    from sklearn.decomposition import PCA

    # Create a PCA object with two components
    pca = PCA(n_components=2)

    # Fit and transform the features to get the principal components
    principal_components = pca.fit_transform(x)

    # Create a DataFrame with the principal components and the target
    principal_df = pd.DataFrame(data=principal_components, columns=["principal component 1", "principal component 2"])
    final_df = pd.concat([principal_df, df[["target"]]], axis=1)

    # Plot the data using the principal components and the target
    import matplotlib.pyplot as plt

    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel("Principal Component 1", fontsize=15)
    ax.set_ylabel("Principal Component 2", fontsize=15)
    ax.set_title("2 component PCA", fontsize=20)

    targets = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
    colors = ["r", "g", "b"]

    # Plot each class with its own color
    for target, color in zip(targets, colors):
        indices_to_keep = final_df["target"] == target
        ax.scatter(final_df.loc[indices_to_keep, "principal component 1"],
                   final_df.loc[indices_to_keep, "principal component 2"],
                   c=color, s=50)

    ax.legend(targets)
    ax.grid()
    plt.show()

    # Print the explained variance ratio of the PCA
    print(f"The explained variance ratio is: {pca.explained_variance_ratio_}")

    Supervised data compression via linear discriminant analysis:

    LDA is a technique for constructing new features that separate the classes as well as possible. It is related to PCA, which constructs new features that capture the most variance, but LDA is supervised while PCA is unsupervised. LDA might therefore seem like the better choice for classification tasks, yet in some cases PCA works better, for example when there are only a few samples per class.
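    To make this notion of class separation concrete, here is the standard Fisher criterion that LDA maximizes (a general textbook formulation, not something stated in this article): LDA looks for a projection direction w that maximizes the ratio of between-class scatter to within-class scatter,

    J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B\, \mathbf{w}}{\mathbf{w}^{\top} S_W\, \mathbf{w}},
    \qquad S_B = \sum_{c} n_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top},
    \qquad S_W = \sum_{c} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}

    where μ_c is the mean of class c, μ is the overall mean, and n_c is the number of samples in class c.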

    Code in sklearn:

    import numpy as np
    
    import pandas as pd
    
    # loading the dataset
    
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
    
    dataset = pd.read_csv(url, names=names)
    
    # data preprocessing
    
    X = dataset.iloc[:, 0:4].values
    y = dataset.iloc[:, 4].values
    
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # feature scaling
    
    from sklearn.preprocessing import StandardScaler
    
    sc = StandardScaler()
    
    X_train = sc.fit_transform(X_train)
    
    X_test = sc.transform(X_test)
    
    # performing LDA
    
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
    
    lda = LDA(n_components=1)
    
    X_train = lda.fit_transform(X_train, y_train)
    
    X_test = lda.transform(X_test)
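    To check how much class information survives in the single LDA component, one could train a simple classifier on the transformed data. The following is an illustrative sketch using logistic regression; the choice of classifier is an assumption added here, not part of the original listing:

    # Illustrative continuation: classify on the 1D LDA features
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
    print("Accuracy on LDA-transformed features:", accuracy_score(y_test, y_pred))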

    Final Thoughts:

    For roughly uniformly distributed data, LDA usually performs better than PCA. However, if the data is highly skewed (irregularly distributed), it is often advised to use PCA, since LDA can be biased towards the majority class.

    Finally, PCA has the advantage that it can be applied to labelled as well as unlabelled data, since it does not rely on the output labels. LDA, on the other hand, needs the output classes to find the linear discriminants and therefore requires labelled data.
