The following is a racing chart of COVID-19 death counts over time. I wanted to see how each country was affected by COVID-19 from January 2020 to June 2020. The goal of this report is to test the following hypothesis.


• Is the average number of deaths in the USA greater than 1000 per day?



We are going to perform some simple data analysis and hypothesis testing on the COVID-19 dataset using R and a few of its packages: dplyr for data wrangling and ggplot2 for visual analysis.

Load the required packages

library(dplyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3

Load the dataset

# Set the working directory on my PC so the CSV file can be found
setwd("C:/Users/Mufaddal/Desktop")
covid <- read.csv("Covid_19dataset.csv", header = TRUE)
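The `ï..` prefix that appears on the first column name in the output further down usually means the CSV file starts with a UTF-8 byte-order mark (BOM). A minimal sketch of one way to avoid it, assuming that is the cause here: pass `fileEncoding = "UTF-8-BOM"` to `read.csv()`. The temporary file and its two rows are made up for illustration and are not the real dataset.

```r
# Write a tiny CSV that begins with a UTF-8 byte-order mark (illustrative data only)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the three BOM bytes
writeBin(charToRaw("Province_State,Deaths\nAlabama,93\nAlaska,8\n"), con)
close(con)

# Declaring the encoding lets R drop the BOM instead of
# gluing it onto the first column name
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(demo))
```

If the real file were read this way, the first column would be called `Province_State`, so any later code that refers to `ï..Province_State` would need the plain name instead.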

Let's take a quick look at the COVID-19 dataset (last updated: 05/17/2020).

head(covid)
##   ï..Province_State Country_Region         Last_Update     Lat     Long_
## 1           Alabama             US 2020-04-12 23:18:15 32.3182  -86.9023
## 2            Alaska             US 2020-04-12 23:18:15 61.3707 -152.4044
## 3           Arizona             US 2020-04-12 23:18:15 33.7298 -111.4312
## 4          Arkansas             US 2020-04-12 23:18:15 34.9697  -92.3731
## 5        California             US 2020-04-12 23:18:15 36.1162 -119.6816
## 6          Colorado             US 2020-04-12 23:18:15 39.0598 -105.3111
##   Confirmed Deaths Recovered Active FIPS Incident_Rate People_Tested
## 1      3563     93        NA   3470    1      75.98802         21583
## 2       272      8        66    264    2      45.50405          8038
## 3      3542    115        NA   3427    4      48.66242         42109
## 4      1280     27       367   1253    5      49.43942         19722
## 5     22795    640        NA  22155    6      58.13773        190328
## 6      7307    289        NA   7018    8     128.94373         34873
##   People_Hospitalized Mortality_Rate      UID ISO3 Testing_Rate
## 1                 437       2.610160 84000001  USA     460.3002
## 2                  31       2.941176 84000002  USA    1344.7116
## 3                  NA       3.246753 84000004  USA     578.5223
## 4                 130       2.109375 84000005  USA     761.7534
## 5                5234       2.812020 84000006  USA     485.4239
## 6                1376       3.955112 84000008  USA     615.3900
##   Hospitalization_Rate
## 1          12.26494527
## 2          11.39705882
## 3                     
## 4             10.15625
## 5           22.9611757
## 6           18.8312577
nrow(covid)
## [1] 2243

Let's clean this dataset so that it only shows figures for the USA. I also want to keep just the relevant columns: state name, number of confirmed cases, deaths, recoveries, people tested, people hospitalized, and the hospitalization rate (people hospitalized as a percentage of confirmed cases).


covid_us <- covid %>%
  filter(ISO3 == "USA") %>%
  select(ï..Province_State, Confirmed, Deaths, Recovered,
         People_Tested, People_Hospitalized, Hospitalization_Rate) %>%
  rename(States = ï..Province_State)

nrow(covid_us)
## [1] 1873
head(covid_us)
##       States Confirmed Deaths Recovered People_Tested People_Hospitalized
## 1    Alabama      3563     93        NA         21583                 437
## 2     Alaska       272      8        66          8038                  31
## 3    Arizona      3542    115        NA         42109                  NA
## 4   Arkansas      1280     27       367         19722                 130
## 5 California     22795    640        NA        190328                5234
## 6   Colorado      7307    289        NA         34873                1376
##   Hospitalization_Rate
## 1          12.26494527
## 2          11.39705882
## 3                     
## 4             10.15625
## 5           22.9611757
## 6           18.8312577
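As a quick sanity check on what `Hospitalization_Rate` measures, the rows printed above can be recomputed by hand. The sketch below assumes the rate is the number of people hospitalized as a percentage of confirmed cases. The numbers are copied from the `head(covid_us)` output, and the names `check` and `rates_match` are only illustrative.

```r
# Values copied from the head(covid_us) output above (rows with a non-missing rate)
check <- data.frame(
  States               = c("Alabama", "Alaska", "Arkansas", "California", "Colorado"),
  Confirmed            = c(3563, 272, 1280, 22795, 7307),
  People_Hospitalized  = c(437, 31, 130, 5234, 1376),
  Hospitalization_Rate = c(12.26494527, 11.39705882, 10.15625, 22.9611757, 18.8312577)
)

# Assumed formula: hospitalized people as a percentage of confirmed cases
check$Recomputed <- 100 * check$People_Hospitalized / check$Confirmed

# TRUE when the recomputed values agree with the reported column
rates_match <- isTRUE(all.equal(check$Recomputed, check$Hospitalization_Rate,
                                tolerance = 1e-6))
print(check)
print(rates_match)
```

If the two columns agree, the rate is best read as a percentage of confirmed cases in each state.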

Let's perform a simple hypothesis test to verify whether the average number of deaths in the USA is greater than 1000 per day, based on the data available to us.


Let's first compute a 95% confidence interval for the population mean of the number of deaths.

t.test(covid_us$Deaths, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  covid_us$Deaths
## t = 14.775, df = 1872, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   954.1125 1246.1768
## sample estimates:
## mean of x 
##  1100.145

Based on this confidence interval, we are 95% confident that the daily average number of deaths in the US is between 954.1125 and 1246.1768.
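To see where the interval comes from, it can be rebuilt from the summary numbers that `t.test()` printed above. This is a rough sketch that uses the rounded values from the output, so the result only matches to about one decimal place. The standard error is backed out of the reported t statistic, since that test compared the mean against 0.

```r
# Summary values copied from the t.test() output above
mean_deaths <- 1100.145  # sample mean
t_zero      <- 14.775    # t statistic for the default null hypothesis (mean = 0)
df          <- 1872      # degrees of freedom (n - 1)

# t = (mean - 0) / se, so the standard error can be recovered from t
se <- mean_deaths / t_zero

# Two-sided 95% interval: mean +/- critical t value * standard error
margin <- qt(0.975, df) * se
ci <- c(lower = mean_deaths - margin, upper = mean_deaths + margin)
print(round(ci, 1))
```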

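Let us now define our null and alternative hypotheses:

• Null hypothesis (H0): the average number of deaths is less than or equal to 1000 per day.
• Alternative hypothesis (H1): the average number of deaths is greater than 1000 per day.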


# Run a one-sided, one-sample t-test against a mean of 1000

t.test(covid_us$Deaths, mu = 1000, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  covid_us$Deaths
## t = 1.345, df = 1872, p-value = 0.0894
## alternative hypothesis: true mean is greater than 1000
## 95 percent confidence interval:
##  977.6092      Inf
## sample estimates:
## mean of x 
##  1100.145

As the output shows, the p-value (0.0894) is greater than 0.05, so we fail to reject the null hypothesis. Therefore, we do not have sufficient evidence to claim that the average number of deaths is greater than 1000 per day.
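The test statistic, p-value, and one-sided lower bound reported by `t.test()` above can be reproduced from the same summary numbers. This is a sketch under the assumption that the rounded values in the printed output are accurate enough. The variable names are illustrative.

```r
# Summary values copied from the t.test() outputs above
mean_deaths <- 1100.145
se          <- 1100.145 / 14.775  # standard error, recovered from the first test
df          <- 1872
mu0         <- 1000               # hypothesized mean under the null

# One-sample t statistic against mu0
t_stat <- (mean_deaths - mu0) / se

# Upper-tail p-value for the alternative "greater"
p_value <- pt(t_stat, df, lower.tail = FALSE)

# Lower end of the one-sided 95% confidence interval
lower_bound <- mean_deaths - qt(0.95, df) * se

print(round(c(t = t_stat, p = p_value, lower = lower_bound), 4))
```

Because the p-value is above 0.05 and the lower bound falls below 1000, the data do not rule out a mean of 1000 or less at the 5% level.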