Missing data is generally classified into three types:
- MCAR: stands for Missing Completely at Random. The data is missing for reasons unrelated to any values in the dataset, observed or missing, so every case has the same probability of being missing. This is often due to instrumentation problems, such as a broken sensor or a process issue that randomly drops records.
- MAR: stands for Missing at Random. Here the probability that a value is missing depends on other observed values in the dataset, but not on the missing value itself. For example, if men are less likely to report their weight, missingness in the weight column depends on the observed gender column.
- MNAR: stands for Missing Not at Random. The probability that a value is missing depends on the (unobserved) value itself, for example people with very high incomes declining to report their income.
Understanding why data is missing helps you choose the best imputation method, or decide whether to drop the affected values from your dataset altogether.
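To make the three mechanisms concrete, here is a small NumPy sketch that simulates each one on a synthetic income column. The variable names, probabilities, and thresholds are all illustrative assumptions, not taken from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 70, size=n).astype(float)
income = 1000 * age + rng.normal(0, 5000, size=n)

# MCAR: every income value has the same 10% chance of being missing,
# regardless of anything in the data.
mcar = income.copy()
mcar[rng.random(n) < 0.10] = np.nan

# MAR: missingness depends on another OBSERVED column (age),
# but not on the income value itself.
mar = income.copy()
mar[(age > 50) & (rng.random(n) < 0.30)] = np.nan

# MNAR: missingness depends on the (unobserved) income value itself:
# high earners are the ones who go missing.
threshold = np.percentile(income, 80)
mnar = income.copy()
mnar[(income > threshold) & (rng.random(n) < 0.50)] = np.nan
```

Note that with only the observed data in hand, MAR and MNAR can be hard to tell apart; the distinction comes from knowing (or assuming) how the data was collected.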
Ways of Data Imputation:
So how do you tackle missing data in your datasets?
- Just ignore it: one of the easiest options is to do nothing and let the algorithm decide. A few algorithms can simply skip missing values (e.g. LightGBM with use_missing=false), and some can factor the missing values in and learn the imputation that best reduces the training loss (e.g. XGBoost). Other algorithms, however, will throw an error when they see missing values (e.g. scikit-learn's LinearRegression). In those cases we have to handle the missing data and clean it before feeding it to the algorithm.
- Filling with zero or another constant: replacing missing values with a fixed constant such as zero, or, for categorical data, with the most frequent value of the column. It is simple, but it can introduce bias into the dataset and it ignores any correlation between features.
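A minimal sketch of both variants using scikit-learn's SimpleImputer, on made-up toy columns:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column: fill missing entries with a constant (zero).
num = np.array([[1.0], [np.nan], [3.0]])
filled_num = SimpleImputer(strategy="constant", fill_value=0).fit_transform(num)
print(filled_num.ravel())  # [1. 0. 3.]

# Categorical column: fill with the most frequent value ("red" here).
cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
filled_cat = SimpleImputer(strategy="most_frequent").fit_transform(cat)
print(filled_cat.ravel())  # ['red' 'blue' 'red' 'red']
```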
- Filling with mean/median values: a statistical approach where you calculate the mean or median of the non-missing values in a column and use it to fill that column's missing values. It is easy, fast, and works reasonably well on small numerical datasets, but it performs poorly on categorical features.
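A one-column pandas sketch of median filling (the column and its values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"height": [1.6, 1.8, None, 2.0]})
# The median of the non-missing values [1.6, 1.8, 2.0] is 1.8,
# so the missing entry is replaced with 1.8.
df["height"] = df["height"].fillna(df["height"].median())
print(df["height"].tolist())  # [1.6, 1.8, 1.8, 2.0]
```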
- Using the k-NN algorithm: k-nearest neighbours uses feature similarity to predict the values of new data points, meaning a point is assigned a value based on how closely it resembles the points in the training set. For imputation, you find the k closest neighbours of the observation with missing data and fill the gaps from the non-missing values in that neighbourhood. It can be much more accurate than mean/median or most-frequent filling, but it is computationally expensive.
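Scikit-learn ships this as KNNImputer; here is a tiny sketch on hand-made data where the imputed value can be verified by eye:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],  # value to impute
    [4.0, 8.0],
])
# Measured on the remaining feature, the two nearest rows to [3, NaN]
# are [2, 4] and [4, 8], so the NaN becomes the mean of their second
# column: (4 + 8) / 2 = 6.
filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled[2, 1])  # 6.0
```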
There are many other methods, such as using Datawig, extrapolation, and interpolation. No single method works for every dataset; you should experiment and check which method works best for yours.
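Of those, interpolation is the easiest to try with pandas, which suits ordered data such as time series. A minimal sketch on a made-up series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, 5.0])
# Linear interpolation fills the gap along the straight line
# between the surrounding known values 1.0 and 4.0.
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```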
There is also a Python library called Impyute that collects several missing-data imputation algorithms.