The quality of the data and the amount of useful information that it contains are key factors that determine how well a machine learning algorithm can learn. Therefore, we must make sure to examine and preprocess a dataset before we feed it to a learning algorithm.
What is missing data?
It is not uncommon in real-world applications for our samples to be missing one or more values for various reasons: there could have been an error in the data collection process, certain measurements may not be applicable, or particular fields could simply have been left blank in a survey. We typically see missing values as blank spaces in our data table or as placeholder strings such as NaN.
Unfortunately, most computational tools are unable to handle such missing values or will produce unpredictable results if we simply ignore them. Therefore, we must take care of those missing values before we proceed with further analysis. But before we discuss several techniques for dealing with them, let's create a simple example DataFrame from a CSV (comma-separated values) file to get a better grasp of the problem. We will use the pandas library to read the data from the CSV file.
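The following snippet builds the example DataFrame; the cell values here are reconstructed from the outputs shown later in this section (two missing cells, one each in columns C and D):
import pandas as pd
from io import StringIO

# simple CSV with two missing cells: one in column C, one in column D
csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
10,11,12,'''

df = pd.read_csv(StringIO(csv_data))
print(df)
Output:
    A   B     C    D
0   1   2   3.0  4.0
1   5   6   NaN  8.0
2  10  11  12.0  NaN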
Using the preceding code, we read the CSV-formatted data into a pandas DataFrame via the read_csv function and notice that the two missing cells were replaced by NaN. The StringIO function in the preceding code example was used simply for the purposes of illustration: it allows us to read the string assigned to csv_data into a pandas DataFrame as if it were a regular CSV file stored on our hard drive.
For a larger DataFrame, it can be tedious to look for missing values manually. In this case, we can use the isnull method to return a DataFrame with Boolean values that indicate whether a cell contains a numeric value (False) or whether data is missing (True). Using the sum method, we can then return the number of missing values per column as follows:
df.isnull().sum()
Output:
A 0
B 0
C 1
D 1
dtype: int64
This way, we can count the number of missing values per column. In the following subsections, we will take a look at different strategies for dealing with this missing data.
Dealing with missing data by removing it
One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely; rows with missing values can be easily dropped via the dropna method:
df.dropna()
Output:
A B C D
0 1 2 3 4
Similarly, we can drop columns that have at least one NaN in any row by setting the axis argument to 1:
df.dropna(axis=1)
Output:
A B
0 1 2
1 5 6
2 10 11
The dropna method supports several additional parameters that can come in handy:
# only drop rows where all columns are NaN
df.dropna(how='all')
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
# only drop rows where NaN appear in specific columns (here: 'C')
df.dropna(subset=['C'])
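Before dropping anything, it can help to quantify how much data you would actually lose. A quick check along these lines (this snippet is our own addition, not part of the original example):
# count rows that contain at least one NaN
rows_with_nan = df.isnull().any(axis=1).sum()
print(rows_with_nan, 'of', len(df), 'rows contain missing values')
For the example DataFrame above, two of the three rows would be dropped, which illustrates how quickly row removal can eat into a small dataset.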
Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages: we may end up removing too many samples, which will make a reliable analysis impossible, or, if we remove too many feature columns, we will run the risk of losing valuable information that our classifier needs to discriminate between classes. In the next section, we will thus look at one of the most commonly used alternatives for dealing with missing values: interpolation techniques.
Imputing missing values
Often, the removal of samples or the dropping of entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. A convenient way to achieve this is by using the Imputer class from scikit-learn, as shown in the following code:
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
print(imputed_data)
Output:
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])
Here, we replaced each NaN value by the corresponding mean, which is calculated separately for each feature column. If we changed the setting axis=0 to axis=1, we'd calculate the row means instead.
Other options for the strategy parameter are median and most_frequent, where the latter replaces the missing values with the most frequent values; this is useful for imputing categorical feature values. There are several ways to deal with missing data, but mean imputation is one of the most convenient and widely used baselines.
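One version note: in scikit-learn 0.22 and later, the Imputer class has been removed; its replacement is SimpleImputer from the sklearn.impute module, which always imputes column-wise (there is no axis parameter). A minimal sketch of the same mean imputation with the newer API:
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer always works column-wise; np.nan marks the missing values
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)
print(imputed_data)
If you prefer to stay within pandas, df.fillna(df.mean()) achieves the same column-wise mean imputation, since df.mean() skips NaN values by default.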