Chapter 4 Missing values

Below are the key observations about missing values in the dataset.

Missing values by column for the Liar dataset

##     Speaker Job Title            State Info        Venue/Location 
##                  3170                  2446                   110 
##            Subject(s)               Speaker The Party Affiliation 
##                     2                     2                     2 
##    Barely True Counts          False Counts      Half True Counts 
##                     2                     2                     2 
##    Mostly True Counts  Pants on Fire Counts                    ID 
##                     2                     2                     0 
##                 Label             Statement 
##                     0                     0

Note that the Label column (the target variable) has no missing values; neither do ID and Statement.
Most columns have exactly 2 missing values, which on closer inspection belong to the rows with IDs 5872 and 8180.
Speaker Job Title has the most missing values, with 3170 missing entries.
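A per-column count like the one above can be sketched in a few lines of plain Python. This is only an illustration: the `rows` list is invented toy data, `missing_by_column` is a hypothetical helper name, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Toy rows standing in for the dataset (values invented for illustration).
# None is assumed to mark a missing value.
rows = [
    {"ID": 1, "Label": "true", "Speaker Job Title": None, "State Info": "Texas"},
    {"ID": 2, "Label": "false", "Speaker Job Title": None, "State Info": None},
    {"ID": 3, "Label": "half-true", "Speaker Job Title": "Governor", "State Info": "Ohio"},
]


def missing_by_column(rows):
    """Return {column: number of rows where it is None}, largest count first."""
    # Start every column at zero so fully populated columns are still reported.
    counts = Counter({col: 0 for col in rows[0]})
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return dict(counts.most_common())


column_counts = missing_by_column(rows)
print(column_counts)
```

Sorting by count puts the columns that need the most attention first, which mirrors how the observations above are ordered.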

Top 30 row IDs with the most missing values

##  5872  8180   921  1137  1274  2584  2834  3622  3645  4020  5845  6058  6279 
##    11    11     3     3     3     3     3     3     3     3     3     3     3 
##  6553  6603  8125  8172  8222  8952  9323  9583  9752 10096 10434 10667 10871 
##     3     3     3     3     3     3     3     3     3     3     3     3     3 
## 11120     3     8    13 
##     3     2     2     2

The maximum number of columns with missing values in a single row is 11. The rows with IDs 5872 and 8180 each have 11 missing columns, so they contain only ID, Label and Statement.
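The per-row ranking can be sketched the same way. The snippet below is a minimal illustration on invented toy data; `rows_with_most_missing` and its `id_col` and `top` parameters are hypothetical names, and `None` is again assumed to mark a missing value.

```python
# Toy rows standing in for the dataset (values invented for illustration).
rows = [
    {"ID": 10, "Label": "true", "Speaker": "a", "Speaker Job Title": None, "State Info": None},
    {"ID": 11, "Label": "false", "Speaker": None, "Speaker Job Title": None, "State Info": None},
    {"ID": 12, "Label": "true", "Speaker": "b", "Speaker Job Title": "Senator", "State Info": "Ohio"},
]


def rows_with_most_missing(rows, id_col="ID", top=30):
    """Return up to `top` (row id, missing-column count) pairs, largest count first."""
    counts = [
        (row[id_col], sum(value is None for value in row.values()))
        for row in rows
    ]
    # A stable sort keeps the original row order among ties.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:top]


top_rows = rows_with_most_missing(rows)
print(top_rows)
```

Rows at the head of this list are the natural candidates for removal, since a row missing nearly every field carries little information beyond its statement and label.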

Heatmap

Because “Speaker Job Title” and “State Info” had the most missing values, we used heatmaps to look for patterns. We observed the following:

For “Speaker Job Title”, True and False statements were completely missing.
For “State Info”, False statements were completely missing.
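One way to check observations like these numerically is to compute, for each label, the share of rows in which a column is missing. The sketch below assumes the heatmap was broken down by Label, which the text does not state explicitly; the data is invented and `missing_share_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Toy rows standing in for the dataset (values invented for illustration).
rows = [
    {"Label": "true", "Speaker Job Title": None},
    {"Label": "false", "Speaker Job Title": None},
    {"Label": "false", "Speaker Job Title": None},
    {"Label": "half-true", "Speaker Job Title": "Governor"},
    {"Label": "half-true", "Speaker Job Title": None},
]


def missing_share_by_label(rows, column, label_col="Label"):
    """Return {label: fraction of that label's rows where `column` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        label = row[label_col]
        totals[label] += 1
        if row[column] is None:
            missing[label] += 1
    return {label: missing[label] / totals[label] for label in totals}


shares = missing_share_by_label(rows, "Speaker Job Title")
print(shares)
```

A share of 1.0 for a label corresponds to a fully shaded cell in such a heatmap, meaning the column is missing for every statement with that label.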

Missing proportions

We created these missing-value plots to identify patterns across rows.

The majority of the rows have no missing values.
Roughly 1800 rows have both Speaker Job Title and State Info missing.
Over 1000 rows have only Speaker Job Title missing, and roughly 500 rows have only State Info missing.
The remaining patterns are rare.
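The pattern counts above amount to grouping rows by the exact set of columns they are missing. A minimal sketch of that grouping follows, using invented toy data and a hypothetical `missing_patterns` helper; it is not the plotting code used for this chapter.

```python
from collections import Counter

# Toy rows standing in for the dataset (values invented for illustration).
rows = [
    {"Speaker Job Title": "Senator", "State Info": "Ohio"},
    {"Speaker Job Title": "Governor", "State Info": "Texas"},
    {"Speaker Job Title": None, "State Info": None},
    {"Speaker Job Title": None, "State Info": "Florida"},
    {"Speaker Job Title": "Mayor", "State Info": None},
]


def missing_patterns(rows):
    """Return (pattern, row count) pairs, most frequent first.

    A pattern is the sorted tuple of column names that are None in a row;
    the empty tuple stands for rows with no missing values.
    """
    patterns = Counter(
        tuple(sorted(col for col, value in row.items() if value is None))
        for row in rows
    )
    return patterns.most_common()


patterns = missing_patterns(rows)
for pattern, count in patterns:
    print(count, pattern or "(complete)")
```

Each distinct tuple corresponds to one row of a missing-pattern plot, and its count to the length of that row's bar.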