Chapter 2 Data sources

We conduct our study around three social media platforms: YouTube, Twitter, and Facebook. All of our data is original, collected directly from these social media networks.

Methodology: We wrote a Python script (a web crawler) to scrape and collect the data.

Obstacle: There were certainly many obstacles in this project; here is an interesting one. When we scraped search-result data from YouTube for each candidate, we had trouble grabbing the full data from the lazy-loading results page: large portions of the content are hidden initially and only loaded when the user scrolls to the bottom of the page. This prevented us from relying on the initial HTML DOM to scrape the complete results, and it is hard to control this lazy-loading behavior from a Python web crawler.

Solution: After several attempts, we resolved this with a workaround provided by YouTube's search feature: appending the parameter '&page=N' (where N is the page number you want) to the search URL loads that page of results directly, so we can scrape the data page by page.
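The pagination workaround above can be sketched as follows. This is a minimal illustration of how the '&page=N' parameter turns one lazy-loading results page into a list of directly fetchable URLs; the helper name `build_search_urls` is ours, not from the original crawler.

```python
from urllib.parse import urlencode

def build_search_urls(query, num_pages):
    """Build one YouTube search-results URL per page, using the
    '&page=N' pagination parameter instead of scrolling."""
    base = "https://www.youtube.com/results"
    return [f"{base}?{urlencode({'search_query': query, 'page': n})}"
            for n in range(1, num_pages + 1)]

# Example: the first three result pages for one candidate
for url in build_search_urls("Bernie Sanders", 3):
    print(url)
```

Each URL can then be fetched and parsed independently, so the crawler never has to simulate scrolling.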

2.1 YouTube

We use a Python script to scrape the data and generate CSV files; details are listed below:

2.1.1 Videos

We have 52,196 data entries for YouTube videos; here is the summary per candidate:

##    Amy Klobuchar      Andrew Yang   Bernie Sanders      Cory Booker 
##             4800             4824             4888             4788 
##     Donald Trump Elizabeth Warren        Joe Biden    Kamala Harris 
##             4761             4745             4768             4960 
##   Pete Buttigieg       Tom Steyer    Tulsi Gabbard 
##             4759             4095             4808

Here is a preview of the data, with 10 randomly sampled entries:
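The per-candidate summary and random preview above can be produced with pandas along these lines. The DataFrame below is a small synthetic stand-in for the generated CSV (the real column names may differ); `value_counts` gives the per-candidate tally and `sample` draws the random preview rows.

```python
import pandas as pd

# Synthetic stand-in for the generated videos CSV (column names are illustrative)
videos = pd.DataFrame({
    "candidate": ["Andrew Yang"] * 3 + ["Joe Biden"] * 2,
    "videoId": [f"vid{i}" for i in range(5)],
})

summary = videos["candidate"].value_counts().sort_index()  # entries per candidate
preview = videos.sample(n=2, random_state=1)               # random preview rows
print(summary)
print(preview)
```

With the real CSV, `videos.sample(n=10)` yields the 10-row preview shown in the report.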

2.1.2 Comments

We have 356,891 data entries for YouTube video comments; here is the summary per candidate:

##    Amy Klobuchar      Andrew Yang   Bernie Sanders      Cory Booker 
##            30441            40566            27892            32216 
##     Donald Trump Elizabeth Warren        Joe Biden    Kamala Harris 
##            26949            31477            27097            36559 
##   Pete Buttigieg       Tom Steyer    Tulsi Gabbard 
##            35558            29588            38548

Here is a preview of the data, with 10 randomly sampled entries:

2.2 Twitter

2.2.1 Tweets

We use a Python script to scrape the data and generate CSV files; details are listed below:

Here is a preview of the data, with 10 randomly sampled entries:

2.3 Facebook

2.3.1 Posts

We use a Python script to scrape the data and generate CSV files; details are listed below:

Here is a preview of the data, with 10 randomly sampled entries:

2.4 APIs

Here is a code sample showing the API we use to pull the data (YouTube):

import json
import urllib.error
import urllib.request

def helperFinder(VideoId, maxRetries=3):
    headers = {'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.8',
               'Cache-Control': 'max-age=0',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
               'Connection': 'keep-alive',
               'Referer': 'http://www.youtube.com/'
               }
    key = "<Your Google YouTube API Key>"
    helper = ("https://www.googleapis.com/youtube/v3/videos"
              "?id={}&part=snippet,statistics,recordingDetails&key={}").format(VideoId, key)

    # Fetch the video metadata, retrying a bounded number of times on HTTP errors
    # (the API occasionally rejects requests transiently).
    r = None
    for attempt in range(maxRetries):
        try:
            h = urllib.request.urlopen(urllib.request.Request(helper, headers=headers))
            r = json.loads(h.read())  # bytes -> dict
            break
        except urllib.error.HTTPError:
            continue
    if r is None:
        print("Failure:", VideoId)
        return None
    print("Success:", VideoId)

    def findTime():
        # "publishedAt" is an ISO-8601 timestamp; keep only the date part.
        timeStamp = r["items"][0]["snippet"]["publishedAt"]
        return timeStamp[0:10]

    def findStat(name, default):
        # Some statistics are absent (e.g. when comments or ratings are
        # disabled); fall back to a default in that case.
        try:
            return int(r["items"][0]["statistics"][name])
        except (KeyError, IndexError):
            return default

    return {"VideoId": VideoId,
            "time": findTime(),
            "likeCount": findStat("likeCount", 0),
            "dislikeCount": findStat("dislikeCount", 0),
            "favorCount": findStat("favoriteCount", 0),
            "commentCount": findStat("commentCount", -1)}
The full source code is available on GitHub.