How to deal with missing values in a dataset? (Pandas & PyTorch Code Example)
Missing values are inevitable when solving real-life problems.
There are two approaches to getting rid of these ‘NaN’ values.
- Deletion
- Imputation
The first method simply discards the rows or columns that contain missing values. The second one instead replaces each ‘NaN’ with a new value. There are a variety of techniques to calculate this new value. Let us examine some of them.
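Deletion can be sketched with pandas as follows. The table here is a made-up toy example (the column names ‘NumRooms’, ‘Alley’ and ‘Price’ and all values are assumptions for illustration, not the article's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Row-wise deletion: drop every row that contains at least one NaN.
rows_dropped = df.dropna(axis=0)

# Column-wise deletion: drop every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

print(missing_per_column)
print(rows_dropped.shape, cols_dropped.shape)
```

In this toy table every row has at least one gap, so row-wise deletion leaves no rows at all and column-wise deletion keeps only ‘Price’. That is the main risk of deletion: it can throw away most of the data.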
Imputation Techniques
a. Numerical Value Imputing
Replacing the ‘NaN’ value with the mean/median/mode value of the same column.
Considering the dataset in Fig1, we can separate input and output using iloc.
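Since the figure is not reproduced here, the split can be sketched on a stand-in table. The layout below (two input columns followed by one output column, named ‘NumRooms’, ‘Alley’ and ‘Price’) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset; the last column is assumed to be the output.
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# iloc selects by integer position: all rows, first two columns as inputs.
inputs = data.iloc[:, 0:2]

# All rows, third column as the output.
outputs = data.iloc[:, 2]

print(inputs)
print(outputs)
```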
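Mean: Imputing with the mean replaces each missing entry with the average of the observed values in the same column.
Mode: Imputing with the mode uses the most frequent value in the column.
Median: Imputing with the median uses the middle value of the sorted column, which is less sensitive to outliers than the mean.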
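A minimal sketch of the three statistics on a single numerical column. The column name ‘NumRooms’ and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with two missing entries.
rooms = pd.Series([np.nan, 2.0, 4.0, 2.0, np.nan, 7.0], name="NumRooms")

# mean(), median() and mode() all skip NaN by default.
mean_filled = rooms.fillna(rooms.mean())        # observed values 2, 4, 2, 7 -> mean 3.75
median_filled = rooms.fillna(rooms.median())    # sorted 2, 2, 4, 7 -> median 3.0

# mode() returns a Series because ties are possible; take the first entry.
mode_filled = rooms.fillna(rooms.mode().iloc[0])  # most frequent value -> 2.0

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Note that the statistics are computed before filling, so the imputed values do not influence each other.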
More elaborate approaches, such as imputing with the result of a K-NN model, can also be implemented. However, the statistical methods above are very fast solutions.
b. Categorical Value Imputing
For discrete (categorical) values, the most common approach is encoding, where the missing value is treated as a category of its own.
In our example, the categorical feature ‘Alley’ takes only 2 values: ‘Pave’ and ‘NaN’. We can see that pandas automatically creates one column for each of these values, ‘Alley_Pave’ and ‘Alley_nan’, and encodes them as 0 and 1 according to the alley type of that instance of data.
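One way to get this encoding is `pd.get_dummies` with `dummy_na=True`; this is a sketch on assumed toy data, and the article's original call may have differed. Passing `dtype=int` is also an assumption made here so that the result holds 0/1 rather than booleans.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one numerical column (already imputed) and one categorical column.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# dummy_na=True adds an indicator column for missing values,
# so NaN becomes a category of its own instead of being dropped.
encoded = pd.get_dummies(inputs, columns=["Alley"], dummy_na=True, dtype=int)

print(encoded)
```

Each row has exactly one of the indicator columns set to 1, so no information about which entries were missing is lost.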
Finally, since every column is now numerical, we can convert our values to torch tensors.
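The conversion might look like the sketch below. The frames stand in for fully imputed and encoded data (names and values are assumptions), and `float32` is chosen here only because it is PyTorch's default floating-point type.

```python
import pandas as pd
import torch

# Hypothetical, fully numerical inputs and outputs (no NaN left).
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# Go through NumPy with an explicit dtype, then build the tensors.
X = torch.tensor(inputs.to_numpy(dtype="float32"))
y = torch.tensor(outputs.to_numpy(dtype="float32"))

print(X.shape, X.dtype)
print(y.shape, y.dtype)
```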
So we can conclude that continuous and discrete data types call for different approaches. Statistical methods such as the mean, median or mode are cheap to compute and easy to apply. However, predictive models can be developed to find more realistic values.