Chapter 1 Introduction

Welcome to our project!

1.1 Background

The election of the president and the vice president of the United States is an indirect election: citizens of the United States who are registered to vote in one of the fifty U.S. states or in Washington, D.C. cast ballots not directly for those offices but for members of the Electoral College.

Every four years, the election affects the U.S. in many respects, such as employment, income taxes, and government investment. Because the U.S. plays a key role in the global economy, it certainly draws the attention of every taxpayer.

Every adult citizen can vote, and every vote counts. The question is: whom will the collective decision choose in this world of divergent views?

Vote America - Credit: Bing Images


And does every individual know enough about the candidates he or she has to choose from?

Candidates - Credit: Bing Images


1.3 Project

With all these open questions in mind, when we research online to understand the candidates' policies and who they are, we usually visit a news website, watch TV programs, go on social media such as YouTube to watch content from the candidates, their campaigns, and their followers, chat with friends on Facebook, or share tweets within our network on Twitter.

These learning processes gave us an idea. What if people have no opinion at the beginning and, while trying to understand the politics and searching online, are influenced by the content they consume? Would that affect their decisions later at the polls? And can we use a statistical model to find a connection between media behavior and polling behavior to test our hypothesis?

So we started a study that uses social media data to analyze the current status of the Democratic candidates and to help people understand the 2020 presidential polls.

We use data from the social websites YouTube, Twitter, and Facebook to examine the impact of social media on the 2020 United States presidential election.

1.3.1 The Team

Haibo Yu (M.S. in Data Science, DSI, Columbia University)

Kevin Gao (M.S. in Data Science, DSI, Columbia University)

1.3.2 The Plan

Haibo:
  1. Data collection: wrote a Python script to scrape YouTube, Facebook, and Twitter.

  2. Data preprocessing: cleaned up missing data, unified the data, and merged it into a CSV file.

  3. Data analysis: data exploration and visualization.

Kevin:
  1. Data pipeline: designed and architected the data flow and project details.

  2. Data analysis: data exploration and visualization.

  3. Data prediction*: apply machine learning, NLP, and GLMs to make predictions from our data. (*According to the final project instructions for EDAV, we have to restrict ourselves to exploratory techniques rather than modeling or prediction approaches.)
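The preprocessing step in the plan (drop rows with missing data, unify the platforms, merge into one CSV) might look roughly like the sketch below. It is only an illustration under stated assumptions: the common schema, the field names, and the sample rows are all hypothetical and are not taken from the project's actual script.

```python
import csv
import io

# Hypothetical common schema; the field names are illustrative, not the project's.
FIELDS = ["platform", "candidate", "state", "metric", "value"]

def unify(platform, rows):
    """Map raw per-platform rows onto the common schema, skipping incomplete rows."""
    unified = []
    for row in rows:
        record = {"platform": platform}
        for field in FIELDS[1:]:
            record[field] = row.get(field)
        # Treat None and empty strings as missing data and drop the row.
        if any(record[f] in (None, "") for f in FIELDS):
            continue
        record["value"] = int(record["value"])
        unified.append(record)
    return unified

def merge_to_csv(sources, out):
    """Write the unified rows from every platform into one CSV stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    total = 0
    for platform, rows in sources.items():
        for record in unify(platform, rows):
            writer.writerow(record)
            total += 1
    return total

# Made-up sample rows, only to exercise the functions.
sources = {
    "youtube": [
        {"candidate": "Candidate A", "state": "NY", "metric": "views", "value": "1200"},
        {"candidate": "Candidate B", "state": "", "metric": "views", "value": "300"},
    ],
    "twitter": [
        {"candidate": "Candidate A", "state": "CA", "metric": "retweets", "value": "45"},
    ],
}

buffer = io.StringIO()
written = merge_to_csv(sources, buffer)
merged_csv = buffer.getvalue()
```

Here the second YouTube row is dropped because its state is empty, so only two data rows reach the merged output. Writing to an in-memory buffer keeps the sketch self-contained; a real run would pass an open file instead.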

1.3.3 The Questions

Here are the main questions we hope to answer with this research.

1.3.3.0.0.1 Core questions:
  1. In each state, which candidate is more popular online in terms of views, likes, retweets, etc.?

  2. For each candidate, in which state is he or she more popular online in terms of views, likes, retweets, etc.?

  3. Overall, which candidate is more popular online?
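The three core questions all reduce to grouped sums over engagement counts. The sketch below shows one possible way to compute them, assuming each record carries a candidate, a state, and per-metric counts. The record layout, the helper names `totals_by` and `leader_per_group`, and all the numbers are hypothetical placeholders, not project data.

```python
from collections import defaultdict

# Made-up engagement records; candidate names, states, and counts are placeholders.
records = [
    {"candidate": "Candidate A", "state": "NY", "views": 1200, "likes": 90, "retweets": 30},
    {"candidate": "Candidate B", "state": "NY", "views": 800, "likes": 150, "retweets": 10},
    {"candidate": "Candidate A", "state": "CA", "views": 600, "likes": 40, "retweets": 5},
    {"candidate": "Candidate B", "state": "CA", "views": 900, "likes": 60, "retweets": 20},
]

def totals_by(records, keys, metric):
    """Sum one metric over groups defined by the given record keys."""
    totals = defaultdict(int)
    for r in records:
        totals[tuple(r[k] for k in keys)] += r[metric]
    return dict(totals)

def leader_per_group(records, group_key, choice_key, metric):
    """For each group_key value, return the choice_key value with the largest total."""
    totals = totals_by(records, [group_key, choice_key], metric)
    best = {}
    for (group, choice), total in totals.items():
        # Ties keep the first pair encountered.
        if group not in best or total > best[group][1]:
            best[group] = (choice, total)
    return {group: choice for group, (choice, _) in best.items()}

# Question 1: leading candidate in each state, by views.
top_candidate_by_state = leader_per_group(records, "state", "candidate", "views")
# Question 2: strongest state for each candidate, by views.
top_state_by_candidate = leader_per_group(records, "candidate", "state", "views")
# Question 3: overall leader across all states, by views.
overall = totals_by(records, ["candidate"], "views")
overall_leader = max(overall, key=overall.get)[0]
```

Swapping `"views"` for `"likes"` or `"retweets"` answers the same questions for another metric. How to combine several metrics into a single popularity score is a separate design choice that this sketch does not make.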

1.3.3.0.0.2 Potentially, we would like to answer these questions:
  1. Which candidate focuses most on which topics?

  2. Who will be more popular in the near future?

  3. Do our results correlate with the final election result?

1.4 References

  • [1] Pollard, Timothy D.; Chesebro, James W.; Studinski, David Paul (2009). “The Role of the Internet in Presidential Campaigns”. Communication Studies. 60 (5): 574–88. doi:10.1080/10510970903260418.

  • [2] Endres, Warnick (2004). “Text-based Interactivity in Candidate Campaign Web Sites: A Case Study from the 2002 Elections”. Western Journal of Communication. 68 (3): 322–42. doi:10.1080/10570310409374804.

  • [3] Smith, Aaron. “Pew Internet & American Life Project”. The Internet and Campaign 2010. Pew Research Center.
