Chapter 3 Data transformation

When we initially pulled the YouTube data, we made several plots and observed patterns that were inconsistent with online 2020 presidential polls, so we dug into the resulting CSV files to investigate what was misaligned in the original YouTube data.

After random checks based on URLs, we first noticed some weird characters in the CSV files, such as extra title columns added when the pulling code was rescheduled. These were easy to clean up manually.

Next, while checking the output of the polling scripts, we noticed some misformatted and duplicate columns, caused by API failures and retries. We cleaned these up in a Python script (duplicate entries) and an R script (date-time formats).
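As an illustration of the duplicate cleanup, here is a minimal Python sketch. It is not the project's actual script: the field names `candidate` and `video_id` and the sample rows are assumptions made for the example. It keeps the first occurrence of each (candidate, video) pair and drops the repeats that a retried API call would produce.

```python
def drop_duplicate_rows(rows, key_fields=("candidate", "video_id")):
    """Return rows with repeated keys removed, keeping the first occurrence.

    `key_fields` are hypothetical column names; adjust them to the real CSV.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # same video already recorded for this candidate
        seen.add(key)
        unique.append(row)
    return unique


# Made-up rows: the second one mimics a record written twice after a retry.
sample_rows = [
    {"candidate": "Andrew Yang", "video_id": "a1", "title": "Example A"},
    {"candidate": "Andrew Yang", "video_id": "a1", "title": "Example A"},
    {"candidate": "Joe Biden", "video_id": "b7", "title": "Example B"},
]
cleaned_rows = drop_duplicate_rows(sample_rows)
```

Keying on the pair rather than on the video alone means a video returned for two different candidates is kept once per candidate; whether that is desirable depends on how the data was pulled.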

We then noticed that some candidates had extremely unlikely total view counts in the aggregated data, so we sorted the videos by views and spot-checked the top ones. We found some videos that covered all of the candidates rather than one candidate specifically (examples: “2020 Democratic Debate - SNL”, “DNC Town Hall - SNL”, “2020 November Democratic Debate in Atlanta | The Daily Show”, “2020 October Democratic Debate in Ohio | The Daily Show”, etc.). After we filtered these out of the R data frame, our plots looked more plausible.
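The spot check and filter described above can be sketched in Python as follows. The actual filtering was done in an R data frame, so this is only an illustration: the field names `title` and `views`, the helper names, and the sample view counts are assumptions, and the exclusion set contains only the four titles named in the text.

```python
# Titles quoted in the text as examples of videos not tied to one candidate.
SHARED_TITLES = {
    "2020 Democratic Debate - SNL",
    "DNC Town Hall - SNL",
    "2020 November Democratic Debate in Atlanta | The Daily Show",
    "2020 October Democratic Debate in Ohio | The Daily Show",
}


def top_by_views(rows, n=20):
    """Return the n most-viewed rows, for manual inspection."""
    return sorted(rows, key=lambda row: row["views"], reverse=True)[:n]


def drop_shared_videos(rows, shared_titles=SHARED_TITLES):
    """Remove rows whose title exactly matches a known shared video."""
    return [row for row in rows if row["title"].strip() not in shared_titles]


# Made-up rows; the view counts are placeholders, not real data.
sample_videos = [
    {"candidate": "Cory Booker", "title": "DNC Town Hall - SNL", "views": 900},
    {"candidate": "Cory Booker", "title": "Example stump speech", "views": 40},
    {"candidate": "Tom Steyer", "title": "DNC Town Hall - SNL", "views": 900},
]
suspects = top_by_views(sample_videos, n=2)
filtered_videos = drop_shared_videos(sample_videos)
```

Exact title matching is the most conservative choice here; a pattern match on show names would catch more videos but could also remove ones that genuinely concern a single candidate.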

We also filtered out “Donald Trump”, since he is not one of the Democratic candidates in the ongoing DNC election.

Here is the summary of the video data per candidate after cleaning:

##    Amy Klobuchar      Andrew Yang   Bernie Sanders      Cory Booker 
##             4788             4800             4885             4777 
##     Donald Trump Elizabeth Warren        Joe Biden    Kamala Harris 
##                0             4737             4756             4945 
##   Pete Buttigieg       Tom Steyer    Tulsi Gabbard 
##             4742             4048             4796

Now we can look at 10 randomly selected sample entries from our video data:

We can now continue by aggregating the data, i.e., summing the total video views grouped by candidate, to see who is being watched the most on YouTube overall.
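For readers who want to see the aggregation step spelled out, the following is a minimal Python sketch of a group-by-and-sum over candidates. The field names `candidate` and `views` and the sample numbers are assumptions for illustration only, and the original analysis may well have used R for this step.

```python
from collections import defaultdict


def total_views_by_candidate(rows):
    """Sum views per candidate and return a dict ordered from most to least viewed."""
    totals = defaultdict(int)
    for row in rows:
        # int() also accepts numeric strings, as read from a CSV file.
        totals[row["candidate"]] += int(row["views"])
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Made-up rows; these are not real view counts.
sample_views = [
    {"candidate": "Bernie Sanders", "views": "120"},
    {"candidate": "Pete Buttigieg", "views": 70},
    {"candidate": "Bernie Sanders", "views": 30},
]
view_totals = total_views_by_candidate(sample_views)
```

Sorting the totals in descending order makes the result easy to feed directly into a ranked bar chart.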
