Chapter 3 Data transformation
When we initially pulled the YouTube data, we made several plots and observed patterns that were inconsistent with online 2020 presidential polls, so we dug into the result CSVs to investigate what was misaligned in the original YouTube data.
After random checks based on URLs, we first noticed some weird characters in the CSVs, such as extra title columns added when the pulling code was rescheduled; these were easy to clean up manually.
Then, while checking the output of the polling scripts, we noticed some misformatted and duplicate columns caused by API failures and retries. We cleaned these up in a Python script (duplicate entries) and an R script (date-time formats).
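The duplicate-entry cleanup can be sketched as follows. This is a minimal illustration rather than the script we actually ran: the column names (`video_id`, `candidate`, `title`, `views`) and the sample rows are assumptions made for the example, and only the Python standard library is used.

```python
import csv
import io


def drop_duplicate_rows(rows, key_fields=("video_id", "candidate")):
    """Keep the first row seen for each key; later repeats are dropped.

    The key fields are hypothetical column names chosen for this sketch.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Tiny in-memory CSV standing in for one result file (made-up values).
sample_csv = io.StringIO(
    "video_id,candidate,title,views\n"
    "a1,Joe Biden,Clip one,100\n"
    "a1,Joe Biden,Clip one,100\n"
    "b2,Andrew Yang,Clip two,250\n"
)
rows = list(csv.DictReader(sample_csv))
deduped = drop_duplicate_rows(rows)
print(len(rows), "->", len(deduped))
```

Keying on an identifier rather than on the whole row means a retried API call that returns the same video with a slightly different view count is still treated as a duplicate; whether that is the right choice depends on how the retries behaved.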
Next, we noticed in the aggregate-level data that some candidates had extremely unlikely total view counts, so we sorted the videos by views and spot-checked the top ones by eye. We found that some videos belonged to everyone rather than to one specific candidate (examples: “2020 Democratic Debate - SNL”, “DNC Town Hall - SNL”, “2020 November Democratic Debate in Atlanta | The Daily Show”, “2020 October Democratic Debate in Ohio | The Daily Show”, etc.). After we filtered them out of the R data frame, our plots looked better.
We also filtered out “Donald Trump”, since he is not one of the Democratic candidates competing in the ongoing Democratic nomination contest.
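The sort-then-filter logic described above can be sketched as follows. The original filtering was done in an R data frame; this Python version only illustrates the same idea. The field names (`candidate`, `title`, `views`), the helper names, and the sample rows with their view counts are assumptions made for the example. The shared titles are the ones quoted above, and the real exclusion list may be longer.

```python
# Titles quoted in the text; the full exclusion list may contain more.
SHARED_TITLES = {
    "2020 Democratic Debate - SNL",
    "DNC Town Hall - SNL",
    "2020 November Democratic Debate in Atlanta | The Daily Show",
    "2020 October Democratic Debate in Ohio | The Daily Show",
}
EXCLUDED_CANDIDATES = {"Donald Trump"}


def top_by_views(rows, n=5):
    """Return the n most-viewed rows, for manual spot checks."""
    return sorted(rows, key=lambda r: r["views"], reverse=True)[:n]


def filter_rows(rows):
    """Drop videos shared by all candidates and rows for excluded candidates."""
    return [
        r
        for r in rows
        if r["title"] not in SHARED_TITLES
        and r["candidate"] not in EXCLUDED_CANDIDATES
    ]


# Made-up rows for illustration only.
rows = [
    {"candidate": "Joe Biden", "title": "2020 Democratic Debate - SNL", "views": 9000},
    {"candidate": "Joe Biden", "title": "Hypothetical campaign clip", "views": 120},
    {"candidate": "Donald Trump", "title": "Hypothetical rally clip", "views": 700},
    {"candidate": "Andrew Yang", "title": "Hypothetical interview clip", "views": 300},
]

for row in top_by_views(rows, n=2):
    print(row["views"], row["title"])

kept = filter_rows(rows)
print([r["candidate"] for r in kept])
```

Matching on exact titles is simple but brittle; a shared video whose title differs by a single character would slip through, which is why the manual check of the top-viewed videos comes first.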
Here is the summary of video records per candidate after the cleaning:
##    Amy Klobuchar      Andrew Yang   Bernie Sanders      Cory Booker
##             4788             4800             4885             4777
##     Donald Trump Elizabeth Warren        Joe Biden    Kamala Harris
##                0             4737             4756             4945
##   Pete Buttigieg       Tom Steyer    Tulsi Gabbard
##             4742             4048             4796
Here are 10 randomly selected sample entries from our video data:
Now we can continue to aggregate the data, i.e., sum the total video views grouped by candidate, to see who is watched most on YouTube overall:
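A minimal sketch of this group-and-sum step is shown here in Python for illustration; it is not necessarily the code used for the report. The field names (`candidate`, `views`), the helper name, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def total_views_by_candidate(rows):
    """Sum views per candidate and return (candidate, total) pairs,
    sorted from most to least viewed."""
    totals = defaultdict(int)
    for row in rows:
        # Views are cast to int because values read from a CSV arrive as strings.
        totals[row["candidate"]] += int(row["views"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only.
rows = [
    {"candidate": "Joe Biden", "views": "120"},
    {"candidate": "Andrew Yang", "views": "300"},
    {"candidate": "Joe Biden", "views": "80"},
]

totals = total_views_by_candidate(rows)
for candidate, views in totals:
    print(candidate, views)
```

Because the sum is sensitive to a handful of very large videos, this step is only meaningful after the shared videos have been filtered out.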