library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(ggmap)
library(RColorBrewer)
library(mfx)
## Loading required package: sandwich
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: betareg
library(foreach)
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
# For each row in Presedential results dataset, seperate out state code and county codes from fips_code
# and add new columns to the dataset
pres = pres %>%
mutate(state_code = floor(fips_code/1000),
county_code = fips_code %% 1000)
# Combine the datasets on state followed by county codes using Inner join
data11 = sqldf::sqldf("SELECT
p.state_code
,p.county_code
,c.region
,c.division
,c.stname
,c.ctyname
,p.total_2016
,p.dem_2016
,p.gop_2016
,p.oth_2016
,popestimate2016
FROM pres p
JOIN cdata c
on p.state_code = c.state
AND p.county_code = c.county")
# Add a new column year with 2016 as value
data11$year = 2016
# Change column names: remove 2016 from the names and give more descriptive names
colnames(data11) = c("state_code","county_code",
"mainregion","division","state_name","county_name",
"total_votes","dem_votes","gop_votes","other_votes",
"population_estimate","year")
# Create factors for region and division columns
data11$mainregion = factor(data11$mainregion)
data11$division = factor(data11$division)
# Create levels for factors created above
levels(data11$mainregion) = c("Northeast","Midwest","South","West")
levels(data11$division) = c("New England","Middle Atlantic","East North Central"
,"West North Central","South Atlantic","East South Central"
,"West South Central","Mountain","Pacific")
data11 = data11 %>%
mutate(perc_dem = 100*dem_votes/population_estimate)
str(data11)
## 'data.frame': 3111 obs. of 13 variables:
## $ state_code : num 26 48 1 48 56 20 37 37 48 21 ...
## $ county_code : num 41 295 127 389 17 43 183 147 497 207 ...
## $ mainregion : Factor w/ 4 levels "Northeast","Midwest",..: 2 3 3 3 4 2 3 3 3 3 ...
## $ division : Factor w/ 9 levels "New England",..: 3 7 6 7 8 4 5 5 7 6 ...
## $ state_name : Factor w/ 51 levels "Alabama","Alaska",..: 23 44 1 44 51 17 34 34 44 18 ...
## $ county_name : Factor w/ 1927 levels "Abbeville County",..: 481 1008 1803 1444 800 510 1800 1368 1887 1496 ...
## $ total_votes : int 18467 1322 29243 3184 2535 3366 510940 78264 24661 8171 ...
## $ dem_votes : int 6431 135 4486 1659 400 584 298353 40967 3412 1093 ...
## $ gop_votes : int 11112 1159 24208 1417 1939 2601 193607 35191 20655 6863 ...
## $ other_votes : int 924 28 549 108 196 181 18980 2106 594 215 ...
## $ population_estimate: int 36202 3487 64967 14921 4679 7664 1046791 177220 64455 17722 ...
## $ year : num 2016 2016 2016 2016 2016 ...
## $ perc_dem : num 17.76 3.87 6.91 11.12 8.55 ...
summary(data11)
## state_code county_code mainregion division
## Min. : 1.00 Min. : 1.0 Northeast: 217 West North Central:617
## 1st Qu.:19.00 1st Qu.: 35.0 Midwest :1054 South Atlantic :588
## Median :29.00 Median : 78.0 South :1422 West South Central:470
## Mean :30.54 Mean :103.2 West : 418 East North Central:437
## 3rd Qu.:46.00 3rd Qu.:133.0 East South Central:364
## Max. :56.00 Max. :840.0 Mountain :281
## (Other) :354
## state_name county_name total_votes
## Texas : 254 Washington County: 30 Min. : 64
## Georgia : 159 Jefferson County : 25 1st Qu.: 4819
## Virginia: 133 Franklin County : 24 Median : 10935
## Kentucky: 120 Jackson County : 23 Mean : 40909
## Missouri: 115 Lincoln County : 23 3rd Qu.: 28675
## Kansas : 105 Madison County : 19 Max. :2314275
## (Other) :2225 (Other) :2967
## dem_votes gop_votes other_votes population_estimate
## Min. : 4 Min. : 57 Min. : 3 Min. : 113
## 1st Qu.: 1164 1st Qu.: 3207 1st Qu.: 165 1st Qu.: 11194
## Median : 3140 Median : 7117 Median : 440 Median : 26027
## Mean : 19567 Mean : 19350 Mean : 1992 Mean : 103623
## 3rd Qu.: 9536 3rd Qu.: 17397 3rd Qu.: 1394 3rd Qu.: 67968
## Max. :1654626 Max. :590465 Max. :117058 Max. :10137915
##
## year perc_dem
## Min. :2016 Min. : 1.730
## 1st Qu.:2016 1st Qu.: 8.296
## Median :2016 Median :12.045
## Mean :2016 Mean :13.700
## 3rd Qu.:2016 3rd Qu.:17.904
## Max. :2016 Max. :48.347
##
Description of the dataset is as follows: This dataset shows results of US 2016 elections results for each county in the US and some of their descriptive statistics like region, division and population estimate for 2016.
I would be interested in analyzing relationship of population estimate 2016, previous election results, international immigration in 2016 with 2016 election outcomes and also see its distribution at state, county, region and division levels.
I am using scatterplots to visualize percentage of democrat votes vs country population.
# Let's see the percentage of democrat voters
# and population estimate 2016 across counties in all the states
# Create a scatter plot for population,
# percent of democrat voters and color the points by region
data11 %>%
ggplot() + geom_point(aes(x=population_estimate,
y=perc_dem,
color=mainregion)) +
scale_color_brewer(palette="Set1") +
ggtitle("Population vs % democrat voters for 2016 in US counties") +
ylab("Percentage of democrat voters")
# Let's find which county has 10m population as per the graph below
data11 %>%
filter(population_estimate>10000000) %>%
dplyr::select(mainregion,division,state_name,county_name, population_estimate)
## mainregion division state_name county_name population_estimate
## 1 West Pacific California Los Angeles County 10137915
# Above scatter plot is skewed due to a few outliers; there are a lot of counties with comparatively low population hence there are a lot of points in the left portion of graph
# Filter counties by population less than 1.25m
data11 %>%
filter(population_estimate < 1250000) %>%
ggplot() + geom_point(aes(x=population_estimate,
y=perc_dem,
color=mainregion)) +
scale_color_brewer(palette="Set1") +
ggtitle("Population vs % democrat voters for 2016 in US counties
with population less than 1.25m") +
ylab("Percentage of democrat voters")
Looks good, but still many points in the left portion. Majority of counties in higher population range in this plot belong to Northeast or West region. Many of these counties, especially in Northeast region has a higher than average percentage votes for democrats. Some counties here in higher population region also show similar percent of democrat voters.
# Filter counties by population less than 200k
data11 %>%
filter(population_estimate < 200000) %>%
ggplot() + geom_point(aes(x=population_estimate,
y=perc_dem,
color=mainregion),
alpha=0.75) +
scale_color_brewer(palette="Set1") +
ggtitle("Population vs % democrat voters for 2016 in US counties
with population less than 200k") +
ylab("Percentage of democrat voters")
- Majority of counties in this plot are in Midwest and South region and some of Northeast region. Counties in South region has about average votes for democrats, while those in Northeast region have a little above average votes for democrats.
# Let's see counties with a really low population
# Filter counties by population less than 50k
data11 %>%
filter(population_estimate < 50000) %>%
ggplot() + geom_point(aes(x=population_estimate,
y=perc_dem,
color=mainregion),
alpha=0.75) +
scale_color_brewer(palette="Set1") +
ggtitle("Population vs % democrat voters for 2016 in US counties
with population less than 50k") +
ylab("Percentage of democrat voters")
- Counties having population of less than 50k in 2016 are mostly in Northeast and South region. Many of these counties are situated in lower region of the plot, with percentage of democrat voters around 10%.
# Take county map data
counties = map_data("county")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
#Create a copy of data from called data14 for problem 1.4
data14 = data11
data14$state_name = tolower(data14$state_name)
data14$county_name = tolower(data14$county_name)
# Remove word county from county_name
#Reference: stackoverflow https://stackoverflow.com/questions/13093931/r-remove-last-word-from-string
data14$county_name = gsub("\\s*\\w*$", "", data14$county_name)
# Add population estimate and percentage of democrat
#voters to counties dataframe joining on state and county
data14m = sqldf::sqldf("SELECT
c.*
,d.population_estimate
,d.perc_dem
,d.total_votes
FROM counties c
INNER JOIN data14 d
on d.state_name = c.region
AND d.county_name = c.subregion")
# Take state level map data
states = map_data("state")
# Look at the distribution of rounded up perc_dem
table(round(data14m$perc_dem))
##
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 101 594 1424 2359 3787 5311 6131 5582 6228 5618 4789 4448 4342 3981 2640
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 3352 3348 2848 2437 2596 2194 2110 1493 1098 1171 948 485 1051 955 744
## 32 33 34 35 36 37 38 39 40 41 42 46 47 48
## 724 249 400 224 154 31 163 89 33 116 9 26 56 167
# Looking at above distribution, create bins as necessary
data14m$percent_democrat_voters = cut(data14m$perc_dem,
breaks = c(-Inf,5,8,10,12,18,25,35,50))
#Plot state and county map of US with counties
#colored by percentage of voters who voted for democrats in US 2016 elections
perc_dem_plot = ggplot(data = states,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(data = data14m
,mapping = aes(x = long, y = lat, group = group
,fill = percent_democrat_voters) ,color = "white") +
geom_polygon(color = "black", fill = NA) + scale_fill_brewer(
palette = "Blues") + ggtitle(
"Percentage of voters who voted for democrats in US 2016 elections") + coord_fixed(1.3)
par(mfrow=2:1)
perc_dem_plot + theme_nothing(legend = TRUE)
On the first map, I have plotted percentage of democrat voters at county level. I had to use map data at both states and county level using map_data function to draw state and county boundaries on the US map.
As expected, coastal areas have a higher percentage of democrat voters. Also Noertheast states like Illinois and Wisconsin have more democrat voters. Similarly, Coloradi and New Mexico have counties with higher percent of democrat voters.
This problem is a slight variation of St. Petersburg paradox.
The expected payoff is probability of outcome multiplied its monetary payoff for all possible number of heads before a tail shows up. The player will win $1 if tail comes up in first coin toss, with probability of 0.5. The player will win $2 if tail comes up after one head, with probability of 0.25. The player will win $4 if tail comes up after two heads, with probability of 1/8 and so on.
Hence \[ expected payoff = 0.5*1 + 0.25*2 + 0.125*4 + 0.0625*8 + ... \] \[ expected payoff = 0.5+0.5+0.5+ ... \] \[ expected payoff = \infty \]
Expected payoff is infinite in this game.
#Let's write down the data in this DGP. Let the data be number of
#heads we receive in the given 10 plays.
headdata = c(0,0,1,1,1,2,3,3,4,6)
Assume that p is the probability of receiving a head in a single coin toss. Hence probability of getting a tail in one coin toss is 1-p
Calculating probability of getting n heads and a tail:
\[ Pr(X=n;p) = p^{n}.(1-p) \]
If we assume a fair coin, the probability p will be 0.5
Hence \[ Pr(X=n;0.5) = (0.5)^{n}.(0.5) = (0.5)^{n+1} \]
Again n is number heads we received as per the data given. Since all the games are independent events, we can multiply the probabilities as follwos:
L(p) = Pr(X = 0; p) . Pr(X = 0; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 2; p) . Pr(X = 3; p) . Pr(X = 3; p) . Pr(X = 4; p) . Pr(X = 6; p)
\[ L(p) = Pr(X = 0; p) . Pr(X = 0; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 1; p) . Pr(X = 2; p) . Pr(X = 3; p) . Pr(X = 3; p) . Pr(X = 4; p) . Pr(X = 6; p) \]
Using the formula in previous question for each term we get:
\[ L(p) = p^{0} (1-p) . p^{0} (1-p) . p^{1} (1-p) . p^{1} (1-p) . p^{1} (1-p) . p^{2} (1-p) . p^{3} (1-p) . p^{3} (1-p) . p^{4} (1-p) . p^{6} (1-p) \]
\[ L(p) = p^{21} . (1-p)^{10} \]
Let l(p) be the log-likelihood function if given data. Taking log of above function for L(p) we get,
\[ l(p) = log[L(p)] \]
\[ l(p) = log[p^{21} . (1-p)^{10}] \] \[l(p) = log[p^{21}] + log[(1-p)^{10}] \] \[l(p) = 21.log[p] + 10.log[1-p] \]
Let’s take derivative with respect to p of l(p) found in previous answer and set the derivative to 0
\[ \frac{\partial l(p)}{\partial p} = 0 \]
\[ \frac{\partial [21.log(p) + 10.log(1-p)]}{\partial p} = 0 \]
Hence after taking derivative we get: \[ 21(1/p) - 10(1/1-p) = 0 \]
\[ 21(1/p) = 10(1/1-p) \]
\[ 21 - 21p = 10p \]
\[ 21 = 31.p \]
\[ p = 21/31 \]
\[ \hat p = 0.6774 \]
This is optimal probability p hat = 0.6774
#calculate log likelihood
loglik = function(p){ 21*log(p) + 10*log(1-p) }
#Creating a plot and marking ML estimator p hat
curve(loglik, 0, 1, xlab=expression(p), ylab="log-likelihood")
abline(v=0.6774, lty=3)