R Markdown

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(ggmap)
library(RColorBrewer)
library(mfx)
## Loading required package: sandwich
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: betareg
library(foreach)
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel

In this project, I use two datasets: US 2016 election results and US County level data. I will first join both datasets, and then create a visualization that shows percentage of voters who voted for Democrats for each county in the US.

1.1 Tidy the data. Merge these datasets, retain only more interesting variables, compute additional variables you find interesting.

# For each row in Presedential results dataset, seperate out state code and county codes from fips_code
# and add new columns to the dataset
pres = pres %>%
  mutate(state_code = floor(fips_code/1000),
         county_code = fips_code %% 1000)
# Combine the datasets on state followed by county codes using Inner join

data11 = sqldf::sqldf("SELECT
                     p.state_code
                     ,p.county_code
                     ,c.region
                     ,c.division
                     ,c.stname
                     ,c.ctyname
                     ,p.total_2016
                     ,p.dem_2016
                     ,p.gop_2016
                     ,p.oth_2016
                     ,popestimate2016
                     FROM pres p
                     JOIN cdata c
                        on p.state_code = c.state
                          AND p.county_code = c.county")

# Add a new column year with 2016 as value
data11$year = 2016

# Change column names: remove 2016 from the names and give more descriptive names

colnames(data11) = c("state_code","county_code",
    "mainregion","division","state_name","county_name",
    "total_votes","dem_votes","gop_votes","other_votes",
    "population_estimate","year")

# Create factors for region and division columns
data11$mainregion = factor(data11$mainregion)
data11$division = factor(data11$division)

# Create levels for factors created above
levels(data11$mainregion) = c("Northeast","Midwest","South","West")
levels(data11$division) = c("New England","Middle Atlantic","East North Central"
                            ,"West North Central","South Atlantic","East South Central"
                            ,"West South Central","Mountain","Pacific")

data11 = data11 %>%
  mutate(perc_dem = 100*dem_votes/population_estimate)

1.2 Look at all variables in the dataset

str(data11)
## 'data.frame':    3111 obs. of  13 variables:
##  $ state_code         : num  26 48 1 48 56 20 37 37 48 21 ...
##  $ county_code        : num  41 295 127 389 17 43 183 147 497 207 ...
##  $ mainregion         : Factor w/ 4 levels "Northeast","Midwest",..: 2 3 3 3 4 2 3 3 3 3 ...
##  $ division           : Factor w/ 9 levels "New England",..: 3 7 6 7 8 4 5 5 7 6 ...
##  $ state_name         : Factor w/ 51 levels "Alabama","Alaska",..: 23 44 1 44 51 17 34 34 44 18 ...
##  $ county_name        : Factor w/ 1927 levels "Abbeville County",..: 481 1008 1803 1444 800 510 1800 1368 1887 1496 ...
##  $ total_votes        : int  18467 1322 29243 3184 2535 3366 510940 78264 24661 8171 ...
##  $ dem_votes          : int  6431 135 4486 1659 400 584 298353 40967 3412 1093 ...
##  $ gop_votes          : int  11112 1159 24208 1417 1939 2601 193607 35191 20655 6863 ...
##  $ other_votes        : int  924 28 549 108 196 181 18980 2106 594 215 ...
##  $ population_estimate: int  36202 3487 64967 14921 4679 7664 1046791 177220 64455 17722 ...
##  $ year               : num  2016 2016 2016 2016 2016 ...
##  $ perc_dem           : num  17.76 3.87 6.91 11.12 8.55 ...
summary(data11)
##    state_code     county_code        mainregion                 division  
##  Min.   : 1.00   Min.   :  1.0   Northeast: 217   West North Central:617  
##  1st Qu.:19.00   1st Qu.: 35.0   Midwest  :1054   South Atlantic    :588  
##  Median :29.00   Median : 78.0   South    :1422   West South Central:470  
##  Mean   :30.54   Mean   :103.2   West     : 418   East North Central:437  
##  3rd Qu.:46.00   3rd Qu.:133.0                    East South Central:364  
##  Max.   :56.00   Max.   :840.0                    Mountain          :281  
##                                                   (Other)           :354  
##     state_name              county_name    total_votes     
##  Texas   : 254   Washington County:  30   Min.   :     64  
##  Georgia : 159   Jefferson County :  25   1st Qu.:   4819  
##  Virginia: 133   Franklin County  :  24   Median :  10935  
##  Kentucky: 120   Jackson County   :  23   Mean   :  40909  
##  Missouri: 115   Lincoln County   :  23   3rd Qu.:  28675  
##  Kansas  : 105   Madison County   :  19   Max.   :2314275  
##  (Other) :2225   (Other)          :2967                    
##    dem_votes         gop_votes       other_votes     population_estimate
##  Min.   :      4   Min.   :    57   Min.   :     3   Min.   :     113   
##  1st Qu.:   1164   1st Qu.:  3207   1st Qu.:   165   1st Qu.:   11194   
##  Median :   3140   Median :  7117   Median :   440   Median :   26027   
##  Mean   :  19567   Mean   : 19350   Mean   :  1992   Mean   :  103623   
##  3rd Qu.:   9536   3rd Qu.: 17397   3rd Qu.:  1394   3rd Qu.:   67968   
##  Max.   :1654626   Max.   :590465   Max.   :117058   Max.   :10137915   
##                                                                         
##       year         perc_dem     
##  Min.   :2016   Min.   : 1.730  
##  1st Qu.:2016   1st Qu.: 8.296  
##  Median :2016   Median :12.045  
##  Mean   :2016   Mean   :13.700  
##  3rd Qu.:2016   3rd Qu.:17.904  
##  Max.   :2016   Max.   :48.347  
## 

Description of the dataset is as follows: This dataset shows results of US 2016 elections results for each county in the US and some of their descriptive statistics like region, division and population estimate for 2016.

I would be interested in analyzing relationship of population estimate 2016, previous election results, international immigration in 2016 with 2016 election outcomes and also see its distribution at state, county, region and division levels.

1.3 For preliminary analysis, let’s plot the percentage of votes for democrats versus the county population.

I am using scatterplots to visualize percentage of democrat votes vs country population.

# Let's see the percentage of democrat voters
# and population estimate 2016 across counties in all the states

# Create a scatter plot for population, 
# percent of democrat voters and color the points by region

data11 %>%
  ggplot() + geom_point(aes(x=population_estimate,
                            y=perc_dem,
                            color=mainregion)) +
  scale_color_brewer(palette="Set1") +
  ggtitle("Population vs % democrat voters for 2016 in US counties") +
  ylab("Percentage of democrat voters")

# Let's find which county has 10m population as per the graph below
data11 %>%
  filter(population_estimate>10000000) %>%
  dplyr::select(mainregion,division,state_name,county_name, population_estimate)
##   mainregion division state_name        county_name population_estimate
## 1       West  Pacific California Los Angeles County            10137915
# Above scatter plot is skewed due to a few outliers; there are a lot of counties with comparatively low population hence there are a lot of points in the left portion of graph


# Filter counties by population less than 1.25m
data11 %>%
  filter(population_estimate < 1250000) %>%
  ggplot() + geom_point(aes(x=population_estimate,
            y=perc_dem,
            color=mainregion)) +
  scale_color_brewer(palette="Set1") +
  ggtitle("Population vs % democrat voters for 2016 in US counties
          with population less than 1.25m") +
  ylab("Percentage of democrat voters")

Looks good, but still many points in the left portion. Majority of counties in higher population range in this plot belong to Northeast or West region. Many of these counties, especially in Northeast region has a higher than average percentage votes for democrats. Some counties here in higher population region also show similar percent of democrat voters.

# Filter counties by population less than 200k
data11 %>%
  filter(population_estimate < 200000) %>%
  ggplot() + geom_point(aes(x=population_estimate,
            y=perc_dem,
            color=mainregion),
            alpha=0.75) +
  scale_color_brewer(palette="Set1") +
  ggtitle("Population vs % democrat voters for 2016 in US counties
          with population less than 200k") +
  ylab("Percentage of democrat voters")

- Majority of counties in this plot are in Midwest and South region and some of Northeast region. Counties in South region has about average votes for democrats, while those in Northeast region have a little above average votes for democrats.

# Let's see counties with a really low population
# Filter counties by population less than 50k
data11 %>%
  filter(population_estimate < 50000) %>%
  ggplot() + geom_point(aes(x=population_estimate,
            y=perc_dem,
            color=mainregion),
            alpha=0.75) +
  scale_color_brewer(palette="Set1") +
  ggtitle("Population vs % democrat voters for 2016 in US counties
          with population less than 50k") +
  ylab("Percentage of democrat voters")

- Counties having population of less than 50k in 2016 are mostly in Northeast and South region. Many of these counties are situated in lower region of the plot, with percentage of democrat voters around 10%.

Problem 1.4 Create a map of percentage of votes for democrats.

# Take county map data
counties = map_data("county")
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
#Create a copy of data from called data14 for problem 1.4
data14 = data11

data14$state_name = tolower(data14$state_name)
data14$county_name = tolower(data14$county_name)

# Remove word county from county_name
#Reference: stackoverflow https://stackoverflow.com/questions/13093931/r-remove-last-word-from-string
data14$county_name = gsub("\\s*\\w*$", "", data14$county_name)

# Add population estimate and percentage of democrat
#voters to counties dataframe joining on state and county
data14m = sqldf::sqldf("SELECT
                     c.*
                     ,d.population_estimate
                     ,d.perc_dem
                     ,d.total_votes
                     FROM counties c
                     INNER JOIN data14 d
                        on d.state_name = c.region
                          AND d.county_name = c.subregion")
# Take state level map data
states = map_data("state")

# Look at the distribution of rounded up perc_dem
table(round(data14m$perc_dem))
## 
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  101  594 1424 2359 3787 5311 6131 5582 6228 5618 4789 4448 4342 3981 2640 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
## 3352 3348 2848 2437 2596 2194 2110 1493 1098 1171  948  485 1051  955  744 
##   32   33   34   35   36   37   38   39   40   41   42   46   47   48 
##  724  249  400  224  154   31  163   89   33  116    9   26   56  167
# Looking at above distribution, create bins as necessary
data14m$percent_democrat_voters = cut(data14m$perc_dem,
breaks = c(-Inf,5,8,10,12,18,25,35,50))

#Plot state and county map of US with counties
#colored by percentage of voters who voted for democrats in US 2016 elections

perc_dem_plot = ggplot(data = states,
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(data = data14m
               ,mapping = aes(x = long, y = lat, group = group
               ,fill = percent_democrat_voters) ,color = "white") +
  geom_polygon(color = "black", fill = NA) + scale_fill_brewer(
palette = "Blues") + ggtitle(
  "Percentage of voters who voted for democrats in US 2016 elections") + coord_fixed(1.3)
par(mfrow=2:1)
perc_dem_plot + theme_nothing(legend = TRUE)

2: Coin Tossing Game (25p)

You are offered to participate in a coing-tossing game. The rules are following: the coin is tossed until tail comes up. If tail comes up in the first flip - (T), you receive $1. If a single head pops up before the tail(H, T), you get $2. If two heads pop up(H,H, T), you receive $4. If three heads appear before the first tail(H,H,H, T), you get $8. And so forth, so if the realized sequence is n heads and thereafter a tail - (H,H,. . .,H, T), you will receive $2n. The tosses are independent.

2.1.a : What will be the expected payout in this game.

This problem is a slight variation of St. Petersburg paradox.

The expected payoff is probability of outcome multiplied its monetary payoff for all possible number of heads before a tail shows up. The player will win $1 if tail comes up in first coin toss, with probability of 0.5. The player will win $2 if tail comes up after one head, with probability of 0.25. The player will win $4 if tail comes up after two heads, with probability of 1/8 and so on.

Hence \[ expected payoff = 0.5*1 + 0.25*2 + 0.125*4 + 0.0625*8 + ... \] \[ expected payoff = 0.5+0.5+0.5+ ... \] \[ expected payoff = \infty \]

Expected payoff is infinite in this game.

2.1.b : Intuition Check: How much would you actually be willing to pay for the participation?

2.2.a : What is the probability to receive n heads and a tail?

#Let's write down the data in this DGP. Let the data be number of
#heads we receive in the given 10 plays.

headdata = c(0,0,1,1,1,2,3,3,4,6)

Assume that p is the probability of receiving a head in a single coin toss. Hence probability of getting a tail in one coin toss is 1-p

Calculating probability of getting n heads and a tail:

\[ Pr(X=n;p) = p^{n}.(1-p) \]

If we assume a fair coin, the probability p will be 0.5

Hence \[ Pr(X=n;0.5) = (0.5)^{n}.(0.5) = (0.5)^{n+1} \]

2.2.b : What is the probability to receive all 10 outcomes listed above?

Again n is number heads we received as per the data given. Since all the games are independent events, we can multiply the probabilities as follwos:

L(p) = Pr(X = 0; p) . Pr(X = 0; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 2; p) . Pr(X = 3; p) . Pr(X = 3; p) . Pr(X = 4; p) . Pr(X = 6; p)

\[ L(p) = Pr(X = 0; p) . Pr(X = 0; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 2; p) . Pr(X = 3; p) . Pr(X = 3; p) . Pr(X = 4; p) . Pr(X = 6; p) \]

Using the formula in previous question for each term we get:

\[ L(p) = p^{0} (1-p) . p^{0} (1-p) . p^{1} (1-p) . p^{1} (1-p) . p^{1} (1-p) . p^{2} (1-p) . p^{3} (1-p) . p^{3} (1-p) . p^{4} (1-p) . p^{6} (1-p) \]

\[ L(p) = p^{21} . (1-p)^{10} \]

2.2.c : What is the log-likelihood function of this data as a function of the parameter?

Let l(p) be the log-likelihood function if given data. Taking log of above function for L(p) we get,

\[ l(p) = log[L(p)] \]

\[ l(p) = log[p^{21} . (1-p)^{10}] \] \[l(p) = log[p^{21}] + log[(1-p)^{10}] \] \[l(p) = 21.log[p] + 10.log[1-p] \]

2.2.d : Let’s analytically solve this log-likelihood for the optimal probabilty p hat.

Let’s take derivative with respect to p of l(p) found in previous answer and set the derivative to 0

\[ \frac{\partial l(p)}{\partial p} = 0 \]

\[ \frac{\partial [21.log(p) + 10.log(1-p)]}{\partial p} = 0 \]

Hence after taking derivative we get: \[ 21(1/p) - 10(1/1-p) = 0 \]

\[ 21(1/p) = 10(1/1-p) \]

\[ 21 - 21p = 10p \]

\[ 21 = 31.p \]

\[ p = 21/31 \]

\[ \hat p = 0.6774 \]

This is optimal probability p hat = 0.6774

2.2.e : Let’s plot the log-likelihood as a function of p. Mark the ML estimator p hat on the figure.

#calculate log likelihood
loglik = function(p){ 21*log(p) + 10*log(1-p) }

#Creating a plot and marking ML estimator p hat

curve(loglik, 0, 1, xlab=expression(p), ylab="log-likelihood")
abline(v=0.6774, lty=3)