Chapter 2 Data sources

In the beginning of the semester, we participated in the Columbia Data Science Hackathon 2021 event, where we recieved the LIAR dataset. The topic and dataset of Fake News Detection intrigued us to reveal more insights on the relationship between dependendent and independent features. The Liar dataset has 12,800 human labeled short statements in various contexts related to politics and is evaluated by politifact.com for its truthfulness.

How the data is gathered

Politifact is a fact-checking non-profit organization. Everyday, journalists at Politifact reads transcripts, speeches, news stories, press releases, and campaign brochures. They also get statements from the television, social media, and suggestions from email. They selectively pick claims to fact-check based on the significance of the claim and its transmissibility (whether the claim be passed on easily). Fact-checking itself involves searching through Google, searching databases, and a consulting experts.

Issues

The columns of the data is quite messy. For instance, in “State Info”, there are other countries present besides the US such as China and Qatar. Subjects are made up of permutations of different spellings, which will make it difficult to visualise based on statement topics. There is also plenty of missing data, which will be investigated in Chapter 4: Missing Values of this bookdown.

Another challenge of this dataset is the level of detail, particularly for columns “Subject(s)” and “Venue”. Subjects often have multiple keywords associated with it; rarely will two rows share the exact same subjects. Venue is also specific, though not as specific as Subject(s). Ideally we can find a way to group these specific terms to make our analysis visually appealing without losing too much detail. For instance, we can categorise all political topics under “Politics”, but by doing so we lose much low-level details. This is a challenge we will have to tread through carefully.

Below are the some useful links

  1. Columbia Data Science Hackathon Page (2021) - https://cdssatcu.com/2021-data-science-hackathon

  2. Actual Dataset - https://www.cs.ucsb.edu/~william/data/liar_dataset.zip

  3. Dataset Evaluation by Politifact.com - https://www.politifact.com/

Below is the description of the features present in the dataset.

1) ID - Unique Row ID associated for each Statement.

2) Label - Truthfulness of statements: “True”, “False”, “Barely-True”, “Half-True”, “Mostly-True”, “Pants-Fire”.

3) Statement - Sentences of words varying between 5 to 5000.

4) Subject - The topic related to Statement (ex: healthcare, jobs, economy, etc).

5) Speaker - The person who has given the statement.

6) SpeakerJobTitle - The job title associated with the speaker (ex: president, attorney, etc).

7) StateInfo - The state where the statement was made.

8) Party Affiliated - The party the speaker belongs to (ex: republican, democrat, independent, etc).

9) Barely True Counts - The accumulation of Barely True counts associated with Speaker.

10) False Counts - The accumulation of False Counts associated with Speaker.

11) Half False Counts - The accumulation of Half False Counts associated with Speaker.

12) Mostly True Counts - The accumulation of Mostly True Counts associated with Speaker.

13) Pants Fire Counts - The accumulation of Pants Fire Counts associated with Speaker.

14) Venue/Location - The medium through which news was Spread (ex: television, radio, news press, speech, etc)