Chapter 3 Data transformation
The following steps were used to transform the dataset.
The first step in cleaning is to concatenate the three files. We merged the train, test, and validate files into a single file, fakeNewsClean.csv. We did this because the task was to explore the data through visualization, and no modelling was required; taking the three files together gives a better understanding of the overall data distribution.
While generating the new file, we also replaced the ID column with a new one containing sequential numbers from 1 to the number of rows in the data frame.
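The merge-and-renumber step can be sketched in pandas as follows; the function name and signature are illustrative (the original code is not shown), but the sequential-ID logic matches the text.

```python
import pandas as pd

def merge_and_reindex(frames):
    """Concatenate the split DataFrames (e.g. train, test, validate)
    and replace the ID column with sequential numbers 1..n."""
    merged = pd.concat(frames, ignore_index=True)
    merged["ID"] = range(1, len(merged) + 1)
    return merged
```

The merged frame would then be written out with something like `merged.to_csv("fakeNewsClean.csv", index=False)`.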
We saw that most of the columns had two missing rows. On closer inspection, the rows with IDs 5872 and 8180 were missing values in most columns, so we removed those two rows from the dataset.
For the columns Subject, Speaker Job Title, Speaker, Venue/Location, and The Party Affiliation, we chose to replace missing values with a new category, "Unknown". This category lets us see where data was missing and how the missingness relates to each of our questions.
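The two missing-data steps above can be sketched together in pandas. The row IDs come from the text; the exact column names are assumptions based on the output shown below.

```python
import pandas as pd

# Columns whose missing values are replaced with "Unknown"
# (names assumed from the missing-value listing in the text).
CATEGORICAL = ["Subject(s)", "Speaker", "Speaker Job Title",
               "Venue/Location", "The Party Affiliation"]

def handle_missing(df, mostly_empty_ids=(5872, 8180)):
    """Drop the two rows that are missing almost every column,
    then fill remaining gaps in the categorical columns."""
    df = df[~df["ID"].isin(mostly_empty_ids)].copy()
    df[CATEGORICAL] = df[CATEGORICAL].fillna("Unknown")
    return df
```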
Below are the missing-value counts for each column in the dataset.
##     Speaker Job Title            State Info        Venue/Location 
##                  3170                  2446                   110 
##            Subject(s)               Speaker The Party Affiliation 
##                     2                     2                     2 
##    Barely True Counts          False Counts      Half True Counts 
##                     2                     2                     2 
##    Mostly True Counts  Pants on Fire Counts                    ID 
##                     2                     2                     0 
##                 Label             Statement 
##                     0                     0 
After cleaning, no missing values remain:
##                    ID                 Label             Statement 
##                     0                     0                     0 
##            Subject(s)               Speaker     Speaker Job Title 
##                     0                     0                     0 
##            State Info The Party Affiliation    Barely True Counts 
##                     0                     0                     0 
##          False Counts      Half True Counts    Mostly True Counts 
##                     0                     0                     0 
##  Pants on Fire Counts        Venue/Location 
##                     0                     0 
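Per-column missing-value counts like the ones above can be produced with pandas; this small sketch uses made-up data just to show the call.

```python
import pandas as pd

# Toy frame standing in for the fake-news data.
df = pd.DataFrame({
    "Speaker Job Title": [None, "governor", None],
    "Label": ["true", "false", "half-true"],
})

# isna().sum() yields one missing-value count per column,
# matching the projection shown above.
counts = df.isna().sum()
```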
For the columns Venue/Location, Subject(s), Speaker Job Title, State Info, and The Party Affiliation, we followed the steps below for pre-processing.
For each sentence in these columns, we:
- converted the sentence to lower case
- removed all punctuation
- removed extra whitespace
- split the sentence into words on the delimiter " "
- removed stop words
- concatenated the remaining words back into a sentence
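The steps above can be sketched as a small Python function. The stop-word list here is a hand-picked sample for illustration; the original pipeline presumably used a standard stop-word list.

```python
import string

# Illustrative stop words only; not the list used in the original work.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to"}

def clean_sentence(sentence):
    """Lower-case, strip punctuation, split into words, drop stop
    words, and rejoin into a single cleaned sentence."""
    s = sentence.lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    words = s.split()  # splitting also collapses extra whitespace
    words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)
```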
For Venue/Location, Subject(s), and Speaker Job Title, we performed an extra step to group similar values together. For example, venue values such as "Tweet", "Tweets", and "Tweet!" were all merged into one group using the group_str function.
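The grouping idea can be sketched with a crude normalisation key; the actual rules used by group_str are not specified in the text, so the key below (lower-case, drop punctuation, strip a trailing plural "s") is only an assumption that happens to handle the "Tweet" example.

```python
import string

def group_key(value):
    """Crude normalisation key: lower-case, drop punctuation,
    strip a trailing plural 's'. Illustrative only."""
    key = value.lower().translate(
        str.maketrans("", "", string.punctuation)).strip()
    if len(key) > 1 and key.endswith("s"):
        key = key[:-1]
    return key

def group_values(values):
    """Map each value to the first-seen member of its group."""
    canonical = {}
    out = []
    for v in values:
        k = group_key(v)
        canonical.setdefault(k, v)
        out.append(canonical[k])
    return out
```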
Using all the steps above, we created a new file, FakeNews_Clean.csv, which will be used for discovering insights into our questions.