How to deal with missing values in the dataset? (Pandas&PyTorch Code Example)

Beyza Cevik
2 min read · Dec 2, 2020

Missing values are inevitable when solving real-life problems.

Fig1: An example small dataset including ‘NaN’ values

There are two approaches to getting rid of these ‘NaN’ values.

  1. Deletion
  2. Imputation

The first method simply discards the rows or columns that contain missing values. The second replaces each ‘NaN’ with a substitute value, and there are a variety of techniques for calculating that value. Let us examine some of them.
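
As a minimal sketch of the deletion approach, the snippet below drops rows and then columns that contain ‘NaN’. The DataFrame is a hypothetical stand-in for Fig1: only ‘Alley’ and ‘Pave’ appear in this article, while the ‘NumRooms’ and ‘Price’ columns and all numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset in Fig1 (column names and values are assumed).
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Count the missing values in each column before deciding what to drop.
missing_per_column = data.isna().sum()

# Deletion by row: keep only rows without any NaN.
rows_kept = data.dropna(axis=0)

# Deletion by column: keep only columns without any NaN.
cols_kept = data.dropna(axis=1)

print(missing_per_column)
print(len(rows_kept), list(cols_kept.columns))
```

On a small dataset like this assumed one, every row contains at least one ‘NaN’, so row deletion leaves nothing behind. That is one reason imputation is often preferred.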

Imputation Techniques

a. Numerical Value Imputing

Numerical imputing replaces each ‘NaN’ value with the mean, median, or mode of the same column.

Considering the dataset in Fig1, we can separate input and output using iloc.
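
The split with iloc might look like the sketch below. It assumes the last column is the output and the remaining columns are the inputs; the column names and values are hypothetical stand-ins for Fig1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset in Fig1 (column names and values are assumed).
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# iloc selects by integer position: all rows, first two columns as inputs.
inputs = data.iloc[:, 0:2]

# All rows, third column as the output (target).
outputs = data.iloc[:, 2]

print(inputs)
print(outputs)
```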

Mean: Imputing with the mean replaces each ‘NaN’ with the average of the non-missing values in the same column, as for the dataset in Fig1.
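
A minimal sketch of mean imputing with pandas follows. The input columns are hypothetical stand-ins for Fig1; `numeric_only=True` is used so that the categorical column is skipped when the means are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical input columns (names and values are assumed for illustration).
inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# Column means; NaN entries are skipped, so NumRooms -> (2 + 4) / 2 = 3.0.
column_means = inputs.mean(numeric_only=True)

# fillna with a Series fills each column using the value whose label matches it.
# 'Alley' has no entry in column_means, so its NaN values are left untouched.
inputs_mean = inputs.fillna(column_means)

print(inputs_mean)
```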

Fig2: Imputing mean

Mode: Imputing with the mode means using the most frequent value in the same column.
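
Mode imputing could be sketched as below. The series is a made-up example, chosen so that one value is clearly the most frequent. `Series.mode()` can return several values when there is a tie, so this sketch takes the first one, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up column with a clear most frequent value (2.0).
num_rooms = pd.Series([2.0, np.nan, 2.0, 4.0, np.nan], name="NumRooms")

# mode() ignores NaN by default and returns a Series of the most frequent values.
most_frequent = num_rooms.mode().iloc[0]

# Replace every NaN with the most frequent value.
num_rooms_mode = num_rooms.fillna(most_frequent)

print(num_rooms_mode.tolist())
```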

Median: Imputing with the median means using the middle value of the sorted non-missing values in the same column.
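
A short sketch of median imputing follows. The values are made up and include one outlier, to illustrate that the median is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Made-up column containing an outlier (100.0).
values = pd.Series([1.0, np.nan, 2.0, 100.0, np.nan], name="NumRooms")

# median() skips NaN: the middle of [1, 2, 100] is 2.0.
median_value = values.median()

# For comparison, the mean is pulled upward by the outlier.
mean_value = values.mean()

# Replace every NaN with the median.
values_median = values.fillna(median_value)

print(median_value, mean_value)
print(values_median.tolist())
```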

Fig3: Imputing median

Other approaches, such as K-NN imputing, can also be implemented. However, the statistical methods above are very fast solutions.
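
For comparison, K-NN imputing might be sketched with scikit-learn's `KNNImputer`. The article does not say which library it has in mind, so using scikit-learn is an assumption here, and the array is made-up data. Each ‘NaN’ is filled with the average of that feature over the 2 nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances are computed on the features both rows have observed;
# each missing entry becomes the mean of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```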

b. Categorical Value Imputing

For discrete (categorical) values, the most common approach is encoding, in which ‘NaN’ is treated as a category of its own.

Fig4: Encoding categorical value

In our example, the categorical feature ‘Alley’ has only 2 values, ‘Pave’ and ‘NaN’. pandas automatically creates one column for each of them, ‘Alley_Pave’ and ‘Alley_nan’, and sets each to 1 or 0 according to the alley type of that row.
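
The encoding step described above can be sketched with `pd.get_dummies`. Passing `dummy_na=True` is what produces the extra ‘NaN’ indicator column. The ‘NumRooms’ column and its values are assumed for illustration, and `dtype=int` is set so that the indicators are 0/1 rather than booleans.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs after the numerical column has already been imputed.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# One indicator column per category; dummy_na=True adds one for missing values.
# Numeric columns such as NumRooms are passed through unchanged.
encoded = pd.get_dummies(inputs, dummy_na=True, dtype=int)

print(encoded)
```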

Finally, we can convert our numerical values to torch tensors.
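
A minimal sketch of the conversion is shown below, assuming the inputs have already been imputed and encoded so that every entry is numeric. The column names and values are hypothetical.

```python
import pandas as pd
import torch

# Hypothetical fully numeric inputs (after imputing and encoding) and outputs.
encoded = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# Convert to NumPy arrays with a single float dtype, then to torch tensors.
X = torch.tensor(encoded.to_numpy(dtype=float))
y = torch.tensor(outputs.to_numpy(dtype=float))

print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

The resulting tensors are float64. Many PyTorch models expect float32, in which case a `.float()` call on each tensor would be one way to convert them.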

So we can conclude that continuous and discrete data types call for different approaches. Statistical methods such as the mean, median, and mode are very efficient. However, predictive models can be developed to find more realistic values.
