Chapter 2 Data sources
We conducted our study across three social media platforms: YouTube, Twitter, and Facebook. All of our data is original, collected directly from these networks.
Methodology: We wrote a Python web crawler to scrape and collect the data.
Obstacle: Of the many obstacles in this project, one is worth describing in detail. When scraping search-result data from YouTube for each candidate, we could not capture the full results from the lazy-loading results page: YouTube hides most of the content and loads it only as the user scrolls toward the bottom of the page, so the initial HTML DOM never contains the full data. Controlling this lazy-loading behavior from a Python crawler is also difficult.
Solution: After several attempts, we resolved this with a workaround provided by YouTube's search feature: appending the parameter '&page=N' (where N is the page number you want) to the search URL returns each page of results directly, letting us scrape the data page by page.
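For illustration, here is a minimal sketch of this workaround. The results URL mirrors the '&page=N' parameter described above; the page cap, User-Agent string, and video-ID regex are illustrative assumptions rather than our exact production script:

import re
import urllib.parse
import urllib.request

def fetch_search_page(query, n):
    # Fetch page n of the search results via the '&page=N' workaround
    url = "https://www.youtube.com/results?search_query={}&page={}".format(
        urllib.parse.quote(query), n)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def collect_video_ids(query, max_pages=10):
    ids = []
    for n in range(1, max_pages + 1):
        html = fetch_search_page(query, n)
        # Video IDs appear in watch links of the form /watch?v=<11-char id>
        ids.extend(re.findall(r"/watch\?v=([\w-]{11})", html))
    return list(dict.fromkeys(ids))  # de-duplicate while preserving order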
2.1 YouTube
We used a Python script to scrape the data and generate CSV files; details are listed below:
2.1.1 Videos
We have 52,196 data entries for YouTube videos; here is the summary per candidate (a sketch for reproducing these counts follows the table):
Candidate            Videos
Amy Klobuchar          4800
Andrew Yang            4824
Bernie Sanders         4888
Cory Booker            4788
Donald Trump           4761
Elizabeth Warren       4745
Joe Biden              4768
Kamala Harris          4960
Pete Buttigieg         4759
Tom Steyer             4095
Tulsi Gabbard          4808
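This per-candidate tally can be reproduced from the generated CSV. A minimal sketch, assuming the CSV carries a 'candidate' column (both the column name and the filename are illustrative assumptions):

import csv
from collections import Counter

# Count rows per candidate; 'youtube_videos.csv' is a hypothetical filename.
with open("youtube_videos.csv", newline="") as f:
    counts = Counter(row["candidate"] for row in csv.DictReader(f))
for candidate, n in sorted(counts.items()):
    print(candidate, n)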
Here is a preview of the data, with 10 randomly sampled entries:
2.1.2 Comments
We have 356,891 data entries for YouTube video comments; here is the summary per candidate:
Here is a preview of the data, with 10 randomly sampled entries:
2.2 Twitter
2.2.1 Tweets
We used a Python script to scrape the data and generate CSV files; details are listed below:
Here is a preview of the data, with 10 randomly sampled entries:
2.3 Facebook
2.3.1 Posts
We used a Python script to scrape the data and generate CSV files; details are listed below:
Here is a preview of the data, with 10 randomly sampled entries:
2.4 APIs
Here is a code sample showing the API we use to pull the data (the YouTube Data API v3):
import json
import urllib.error
import urllib.request

def helperFinder(VideoId):
    headers = {'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.8',
               'Cache-Control': 'max-age=0',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
               'Connection': 'keep-alive',
               'Referer': 'http://www.youtube.com/'}
    key = "<Your Google YouTube API Key>"
    helper = ("https://www.googleapis.com/youtube/v3/videos"
              "?id={}&part=snippet,statistics,recordingDetails&key={}").format(VideoId, key)
    r = None
    for _ in range(5):  # retry a few times on transient HTTP errors
        try:
            h = urllib.request.urlopen(urllib.request.Request(helper, headers=headers))
            r = json.loads(h.read())  # parse the JSON response bytes into a dict
            break
        except urllib.error.HTTPError:
            continue
    if r is None:
        print("Failure:", VideoId)
        return None
    print("Success:", VideoId)

    def findTime():
        timeStamp = r["items"][0]["snippet"]["publishedAt"]
        return timeStamp[0:10]  # keep only the YYYY-MM-DD date part

    def findLikeCount():
        try:
            likeCount = r["items"][0]["statistics"]["likeCount"]
        except KeyError:  # likes hidden by the uploader
            likeCount = 0
        return int(likeCount)

    def findDislikeCount():
        try:
            dislikeCount = r["items"][0]["statistics"]["dislikeCount"]
        except KeyError:
            dislikeCount = 0
        return int(dislikeCount)

    def findFavoriteCount():
        try:
            favoriteCount = r["items"][0]["statistics"]["favoriteCount"]
        except KeyError:
            favoriteCount = 0
        return int(favoriteCount)

    def findCommentCount():
        try:
            commentCount = r["items"][0]["statistics"]["commentCount"]
        except KeyError:  # comments disabled: mark with -1
            commentCount = -1
        return int(commentCount)

    return {"VideoId": VideoId, "time": findTime(), "likeCount": findLikeCount(),
            "dislikeCount": findDislikeCount(), "favorCount": findFavoriteCount(),
            "commentCount": findCommentCount()}