Dealing with COVID has been a big part of daily life for about a year and a half now. After extensive quarantines, social distancing, mask mandates, and more, we finally have vaccines rolling out. But this raises a question: what impact are the vaccines having on the spread of the coronavirus? With many people tired of remote work and the other changes to our lives, there is some optimism to be gained if things look like they are getting better. Now that vaccines have been available for a few months, we have a sizable amount of data with which to draw conclusions about their efficacy.
In this tutorial we are going to walk through the data science process: Data Collection, Data Management, Exploratory Data Analysis, Hypothesis Testing, Machine Learning, and the Conclusions we can draw.
We are going to analyze data from two sources. The first contains vaccination data by country and can be found at https://www.kaggle.com/gpreda/covid-world-vaccination-progress?select=country_vaccinations.csv.
The second dataset contains daily COVID-19 data by country, including variables like the number of cases and deaths. It can be found at https://www.kaggle.com/josephassaker/covid19-global-dataset?select=worldometer_coronavirus_daily_data.csv.
The first step to our analysis is data collection. We have our datasets to take from, but first we need to import the necessary libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
from scipy import stats
Now we need to read in the data.
When we read in the data, we are also going to create a deep copy of the dataframes we create in order to have an original copy should we need to go back to refer to something after altering the dataframes for analysis.
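As a quick illustration of why the deep copy matters, here is a minimal sketch on a toy frame (the column names are just for flavor, not the real CSV): mutations to the working copy leave the original intact.

```python
import pandas as pd

# A toy frame standing in for one of the real tables (not the actual CSV).
original = pd.DataFrame({'country': ['AFGHANISTAN'], 'total_vaccinations': [0.0]})
working = original.copy(deep=True)

# Mutating the deep copy leaves the original untouched.
working.loc[0, 'total_vaccinations'] = 1367.0
```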
covid_vaccine_data = pd.read_csv("country_vaccinations.csv")
vaccine_data = covid_vaccine_data.copy(deep=True)
vaccine_data
| country | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million | vaccines | source_name | source_website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2021-02-22 | 0.0 | 0.0 | NaN | NaN | NaN | 0.00 | 0.00 | NaN | NaN | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 1 | Afghanistan | AFG | 2021-02-23 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 2 | Afghanistan | AFG | 2021-02-24 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 3 | Afghanistan | AFG | 2021-02-25 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 4 | Afghanistan | AFG | 2021-02-26 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17602 | Zimbabwe | ZWE | 2021-05-08 | 657838.0 | 509274.0 | 148564.0 | 17076.0 | 19648.0 | 4.43 | 3.43 | 1.00 | 1322.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17603 | Zimbabwe | ZWE | 2021-05-09 | 684243.0 | 526066.0 | 158177.0 | 26405.0 | 22863.0 | 4.60 | 3.54 | 1.06 | 1538.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17604 | Zimbabwe | ZWE | 2021-05-10 | 690653.0 | 529360.0 | 161293.0 | 6410.0 | 21877.0 | 4.65 | 3.56 | 1.09 | 1472.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17605 | Zimbabwe | ZWE | 2021-05-11 | 709772.0 | 539526.0 | 170246.0 | 19119.0 | 21428.0 | 4.78 | 3.63 | 1.15 | 1442.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17606 | Zimbabwe | ZWE | 2021-05-12 | 730365.0 | 549797.0 | 180568.0 | 20593.0 | 22019.0 | 4.91 | 3.70 | 1.21 | 1481.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
17607 rows × 15 columns
covid_daily_data = pd.read_csv("worldometer_coronavirus_daily_data.csv")
daily_data = covid_daily_data.copy(deep=True)
daily_data
| date | country | cumulative_total_cases | daily_new_cases | active_cases | cumulative_total_deaths | daily_new_deaths | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-2-15 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 1 | 2020-2-16 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2 | 2020-2-17 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 3 | 2020-2-18 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 4 | 2020-2-19 | Afghanistan | 0.0 | NaN | 0.0 | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99459 | 2021-5-07 | Zimbabwe | 38403.0 | 5.0 | 786.0 | 1576.0 | 1.0 |
| 99460 | 2021-5-08 | Zimbabwe | 38414.0 | 11.0 | 786.0 | 1576.0 | 0.0 |
| 99461 | 2021-5-09 | Zimbabwe | 38419.0 | 5.0 | 780.0 | 1576.0 | 0.0 |
| 99462 | 2021-5-10 | Zimbabwe | 38433.0 | 14.0 | 649.0 | 1576.0 | 0.0 |
| 99463 | 2021-5-11 | Zimbabwe | 38448.0 | 15.0 | 648.0 | 1579.0 | 3.0 |
99464 rows × 7 columns
As can be seen, the first table on vaccinations starts from February 2021 while the second table on daily data starts out in 2020. Since we want to be comparing the vaccine efficacy to variables like daily new cases, we are going to filter out rows for dates we don't need.
Another important thing to realize upon inspecting the country names is that the two tables use different naming conventions. For example, one table might write 'and' where the other writes 'And'. To avoid issues with this, we are going to uppercase the entire country name in each row of both tables.
# Drop rows for dates we do not need
rows_to_drop = []
# Cycle through the daily data table by row
for index, row in daily_data.iterrows():
if daily_data.loc[index,'date'][0:4] == '2020' or daily_data.loc[index,'date'][0:6] == '2021-1' or \
(daily_data.loc[index,'date'][0:6] == '2021-2' and int(daily_data.loc[index,'date'][7:9]) < 22):
rows_to_drop.append(index)
# Capitalizing the country name
daily_data.loc[index,'country'] = daily_data.loc[index,'country'].upper()
# drop the rows
daily_data = daily_data.drop(rows_to_drop)
daily_data
| date | country | cumulative_total_cases | daily_new_cases | active_cases | cumulative_total_deaths | daily_new_deaths | |
|---|---|---|---|---|---|---|---|
| 373 | 2021-2-22 | AFGHANISTAN | 55646.0 | 29.0 | 4316.0 | 2435.0 | 2.0 |
| 374 | 2021-2-23 | AFGHANISTAN | 55664.0 | 18.0 | 4261.0 | 2436.0 | 1.0 |
| 375 | 2021-2-24 | AFGHANISTAN | 55680.0 | 16.0 | 4156.0 | 2438.0 | 2.0 |
| 376 | 2021-2-25 | AFGHANISTAN | 55696.0 | 16.0 | 3973.0 | 2442.0 | 4.0 |
| 377 | 2021-2-26 | AFGHANISTAN | 55707.0 | 11.0 | 3979.0 | 2443.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99459 | 2021-5-07 | ZIMBABWE | 38403.0 | 5.0 | 786.0 | 1576.0 | 1.0 |
| 99460 | 2021-5-08 | ZIMBABWE | 38414.0 | 11.0 | 786.0 | 1576.0 | 0.0 |
| 99461 | 2021-5-09 | ZIMBABWE | 38419.0 | 5.0 | 780.0 | 1576.0 | 0.0 |
| 99462 | 2021-5-10 | ZIMBABWE | 38433.0 | 14.0 | 649.0 | 1576.0 | 0.0 |
| 99463 | 2021-5-11 | ZIMBABWE | 38448.0 | 15.0 | 648.0 | 1579.0 | 3.0 |
17380 rows × 7 columns
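As an aside, the same filtering and uppercasing can be done without an explicit loop using vectorized pandas operations, which are generally much faster than `iterrows`. A minimal sketch on a toy frame (not the real CSV):

```python
import pandas as pd

# Toy frame with the same two columns the loop above touches.
df = pd.DataFrame({
    'date': ['2020-2-15', '2021-2-21', '2021-2-22', '2021-5-11'],
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
})

# Parse the dates once, then keep only rows on/after the vaccine table's start date.
parsed = pd.to_datetime(df['date'])
df = df[parsed >= pd.Timestamp('2021-02-22')].copy()

# Vectorized uppercasing replaces the per-row .upper() calls.
df['country'] = df['country'].str.upper()
```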
Now that we have filtered out rows containing dates that cannot correspond to the vaccinations data table, we need to filter out the dates that do not correspond to the daily data table from the vaccinations table. In this case, it is just one day, 2021-5-12. So all we need to do is go through and drop rows for that particular day.
Note that we also need to capitalize the country names here too.
# Drop rows for dates we do not need
rows_to_drop = []
# Cycle through the vaccinations data table by row
for index, row in vaccine_data.iterrows():
if vaccine_data.loc[index,'date'][0:10] == '2021-05-12':
rows_to_drop.append(index)
# Capitalizing the country name
vaccine_data.loc[index,'country'] = vaccine_data.loc[index,'country'].upper()
# drop the rows
vaccine_data = vaccine_data.drop(rows_to_drop)
vaccine_data
| country | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million | vaccines | source_name | source_website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFGHANISTAN | AFG | 2021-02-22 | 0.0 | 0.0 | NaN | NaN | NaN | 0.00 | 0.00 | NaN | NaN | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 1 | AFGHANISTAN | AFG | 2021-02-23 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 2 | AFGHANISTAN | AFG | 2021-02-24 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 3 | AFGHANISTAN | AFG | 2021-02-25 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| 4 | AFGHANISTAN | AFG | 2021-02-26 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm... | World Health Organization | https://covid19.who.int/ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17601 | ZIMBABWE | ZWE | 2021-05-07 | 640762.0 | 500422.0 | 140340.0 | 33407.0 | 20060.0 | 4.31 | 3.37 | 0.94 | 1350.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17602 | ZIMBABWE | ZWE | 2021-05-08 | 657838.0 | 509274.0 | 148564.0 | 17076.0 | 19648.0 | 4.43 | 3.43 | 1.00 | 1322.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17603 | ZIMBABWE | ZWE | 2021-05-09 | 684243.0 | 526066.0 | 158177.0 | 26405.0 | 22863.0 | 4.60 | 3.54 | 1.06 | 1538.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17604 | ZIMBABWE | ZWE | 2021-05-10 | 690653.0 | 529360.0 | 161293.0 | 6410.0 | 21877.0 | 4.65 | 3.56 | 1.09 | 1472.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
| 17605 | ZIMBABWE | ZWE | 2021-05-11 | 709772.0 | 539526.0 | 170246.0 | 19119.0 | 21428.0 | 4.78 | 3.63 | 1.15 | 1442.0 | Sinopharm/Beijing | Ministry of Health | https://twitter.com/MoHCCZim/status/1392575941... |
17488 rows × 15 columns
Checking the number of rows in each of the newly filtered tables, we see that they do not match. This could be because of an inconsistency in the number of countries. We can count the countries to make sure they match; if they do not, we can drop the countries that do not appear in both tables.
# Count the countries in each table
vaccine_countries = []
for index, row in vaccine_data.iterrows():
if vaccine_data.loc[index,'country'] not in vaccine_countries:
vaccine_countries.append(vaccine_data.loc[index,'country'])
daily_data_countries = []
for index, row in daily_data.iterrows():
if daily_data.loc[index,'country'] not in daily_data_countries:
daily_data_countries.append(daily_data.loc[index,'country'])
print('Number of countries in vaccine table: ' + str(len(vaccine_countries)))
print('Number of countries in daily data table: ' + str(len(daily_data_countries)))
Number of countries in vaccine table: 205
Number of countries in daily data table: 220
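For reference, pandas can produce these counts directly with `Series.nunique()`, which counts distinct values without an explicit loop. A sketch on toy frames (the real tables expose the same `country` column):

```python
import pandas as pd

# Toy frames standing in for the real tables.
vaccine_data = pd.DataFrame({'country': ['AFGHANISTAN', 'ALBANIA', 'ALBANIA']})
daily_data = pd.DataFrame({'country': ['AFGHANISTAN', 'ZIMBABWE']})

# nunique() counts distinct country names in one call.
print('Number of countries in vaccine table: ' + str(vaccine_data['country'].nunique()))
print('Number of countries in daily data table: ' + str(daily_data['country'].nunique()))
```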
As we suspected, there is an inconsistency in the number of countries. Therefore we can now try to find out which countries we need to drop, and drop them.
# Deciding for which countries we need to drop rows.
drop_countries = []
# Checking vaccine data
for country in vaccine_countries:
if country not in daily_data_countries:
drop_countries.append(country)
for country in daily_data_countries:
if country not in vaccine_countries:
drop_countries.append(country)
print(drop_countries)
print('\nNumber of countries to drop:' + str(len(drop_countries)))
['BONAIRE SINT EUSTATIUS AND SABA', 'BRUNEI', 'CAPE VERDE', "COTE D'IVOIRE", 'CZECHIA', 'DEMOCRATIC REPUBLIC OF CONGO', 'ENGLAND', 'ESWATINI', 'FALKLAND ISLANDS', 'GUERNSEY', 'HONG KONG', 'JERSEY', 'KOSOVO', 'MACAO', 'NAURU', 'NORTH MACEDONIA', 'NORTHERN CYPRUS', 'NORTHERN IRELAND', 'PALESTINE', 'SCOTLAND', 'SINT MAARTEN (DUTCH PART)', 'TIMOR', 'TONGA', 'TURKMENISTAN', 'TUVALU', 'UNITED KINGDOM', 'UNITED STATES', 'VIETNAM', 'WALES', 'WALLIS AND FUTUNA', 'BENIN', 'BRITISH VIRGIN ISLANDS', 'BRUNEI DARUSSALAM', 'BURKINA FASO', 'BURUNDI', 'CABO VERDE', 'CARIBBEAN NETHERLANDS', 'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHANNEL ISLANDS', 'CHINA HONG KONG SAR', 'CHINA MACAO SAR', 'COTE D IVOIRE', 'CUBA', 'CZECH REPUBLIC', 'DEMOCRATIC REPUBLIC OF THE CONGO', 'ERITREA', 'FALKLAND ISLANDS MALVINAS', 'FRENCH GUIANA', 'GUADELOUPE', 'GUINEA BISSAU', 'HAITI', 'HOLY SEE', 'LIBERIA', 'MACEDONIA', 'MADAGASCAR', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAYOTTE', 'MICRONESIA', 'REUNION', 'SAINT BARTHELEMY', 'SAINT MARTIN', 'SAINT PIERRE AND MIQUELON', 'SINT MAARTEN', 'STATE OF PALESTINE', 'SWAZILAND', 'TANZANIA', 'TIMOR LESTE', 'UK', 'USA', 'VANUATU', 'VIET NAM', 'WALLIS AND FUTUNA ISLANDS', 'WESTERN SAHARA']
Number of countries to drop:75
It seems we have encountered a large naming-convention disconnect between the two tables. Although the printed length of the drop list is 75, most countries appear twice under slightly different names, so roughly 37 countries out of about 210 are out of sync.
Based on this, one option is to go ahead with about 180 countries for the vaccine efficacy analysis and still get decent results. Another option is to select major countries from this list and fix their naming conventions so that we can proceed with a few more countries. A final option would be to manually fix each and every one of the names, which would take a long time and be inefficient. Take a moment and consider which of these options is best.
If you have all the time in the world, you may think the final option is best. However, the second option is most likely the most efficient route and the one we will take. If you examine the country names, you will see extremely relevant countries such as the United Kingdom and the United States; adding these could be beneficial to our vaccine efficacy analysis. Also consider that many of the countries in question are islands or very small, so they may not reflect general global trends and could end up being outliers. Some of them may also lag behind in providing adequate data, whether through censorship or limited data collection capacity.
Based on this, we are going to reconcile 'UNITED KINGDOM':'UK', 'UNITED STATES':'USA', 'VIETNAM':'VIET NAM', 'STATE OF PALESTINE':'PALESTINE', and 'NORTH MACEDONIA':'MACEDONIA'. So at this point, we need to go through both tables again, standardize the names we have chosen to keep, and then drop the rest.
# Filtering out the names to save
saved = ['UNITED KINGDOM','UK','UNITED STATES','USA','VIETNAM','VIET NAM','STATE OF PALESTINE','PALESTINE' \
,'NORTH MACEDONIA','MACEDONIA']
unsaved = [] # Creating a new list because of a remove error trying to remove from drop_countries
for country in drop_countries:
if country not in saved:
unsaved.append(country)
rows_to_drop = []
# Going through vaccine table to change names as necessary
# and mark rows that we need to drop
for index, row in vaccine_data.iterrows():
if vaccine_data.loc[index,'country'] == 'UK':
vaccine_data.loc[index,'country'] = 'UNITED KINGDOM'
elif vaccine_data.loc[index,'country'] == 'USA':
vaccine_data.loc[index,'country'] = 'UNITED STATES'
elif vaccine_data.loc[index,'country'] == 'VIET NAM':
vaccine_data.loc[index,'country'] = 'VIETNAM'
elif vaccine_data.loc[index,'country'] == 'STATE OF PALESTINE':
vaccine_data.loc[index,'country'] = 'PALESTINE'
elif vaccine_data.loc[index,'country'] == 'NORTH MACEDONIA':
vaccine_data.loc[index,'country'] = 'MACEDONIA'
# marking row if needed to be dropped
if vaccine_data.loc[index,'country'] in unsaved:
rows_to_drop.append(index)
# drop marked rows from vaccine table
vaccine_data = vaccine_data.drop(rows_to_drop)
# Repeat the same name change and row drop process as above but with the daily data table instead.
rows_to_drop = []
for index, row in daily_data.iterrows():
if daily_data.loc[index,'country'] == 'UK':
daily_data.loc[index,'country'] = 'UNITED KINGDOM'
elif daily_data.loc[index,'country'] == 'USA':
daily_data.loc[index,'country'] = 'UNITED STATES'
elif daily_data.loc[index,'country'] == 'VIET NAM':
daily_data.loc[index,'country'] = 'VIETNAM'
elif daily_data.loc[index,'country'] == 'STATE OF PALESTINE':
daily_data.loc[index,'country'] = 'PALESTINE'
elif daily_data.loc[index,'country'] == 'NORTH MACEDONIA':
daily_data.loc[index,'country'] = 'MACEDONIA'
# marking row if needed to be dropped
if daily_data.loc[index,'country'] in unsaved:
rows_to_drop.append(index)
daily_data = daily_data.drop(rows_to_drop)
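The same rename-then-drop step can also be expressed with `Series.replace` and `isin` instead of a per-row loop. A sketch on a toy frame, with a shortened stand-in for the full drop list:

```python
import pandas as pd

# The same rename pairs as above; 'unsaved' here is a stand-in for the full drop list.
renames = {'UK': 'UNITED KINGDOM', 'USA': 'UNITED STATES', 'VIET NAM': 'VIETNAM',
           'STATE OF PALESTINE': 'PALESTINE', 'NORTH MACEDONIA': 'MACEDONIA'}
unsaved = ['ENGLAND', 'SCOTLAND']

df = pd.DataFrame({'country': ['UK', 'ENGLAND', 'VIET NAM', 'ZIMBABWE']})

# Rename in one shot, then drop every country still marked for removal.
df['country'] = df['country'].replace(renames)
df = df[~df['country'].isin(unsaved)].reset_index(drop=True)
```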
Let's check if the rows are now in order.
print('Rows in vaccine_data: ' + str(len(vaccine_data.index)))
print('Rows in daily_data: ' + str(len(daily_data.index)))
Rows in vaccine_data: 15958
Rows in daily_data: 14220
It appears that there is still some disconnect in the number of rows between the tables. This is likely due to a deeper issue: the set of dates is not uniform across countries. Many values in the tables are also NaN ('Not a Number'), which generally indicates a missing value. Even so, after this much cleaning the tables are organized enough to explore: by taking averages for each country, we should still be able to explore the data and try to come to a conclusion about vaccine efficacy. All we need now is to make sure the countries are synchronized between the two tables.
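To get a feel for how much is missing, `isna().sum()` gives a per-column count of NaN entries. A sketch on a toy frame with deliberate gaps, mimicking two of the vaccine table's columns:

```python
import pandas as pd
import numpy as np

# Toy frame with deliberate gaps (not the real CSV).
df = pd.DataFrame({
    'daily_vaccinations': [np.nan, 1367.0, 1367.0],
    'people_fully_vaccinated': [np.nan, np.nan, 148564.0],
})

# One count of missing entries per column.
missing = df.isna().sum()
```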
Now let's repeat the same code above to double check the country counts are the same for each table, and let's also double check that the tables now have the same countries.
# Count the countries in each table
vaccine_countries = []
for index, row in vaccine_data.iterrows():
if vaccine_data.loc[index,'country'] not in vaccine_countries:
vaccine_countries.append(vaccine_data.loc[index,'country'])
daily_data_countries = []
for index, row in daily_data.iterrows():
if daily_data.loc[index,'country'] not in daily_data_countries:
daily_data_countries.append(daily_data.loc[index,'country'])
print('Number of countries in vaccine table: ' + str(len(vaccine_countries)))
print('Number of countries in daily data table: ' + str(len(daily_data_countries)))
vaccine_countries.sort()
daily_data_countries.sort()
if vaccine_countries == daily_data_countries:
print('Countries in both lists are the same.')
Number of countries in vaccine table: 180
Number of countries in daily data table: 180
Countries in both lists are the same.
As can be seen, the countries in both tables are the same once sorted. We can now continue to the next phase of the data science pipeline: Exploratory Data Analysis, or EDA.
Now comes the exploratory data analysis. This is where we try to visualize the data and notice patterns that can help us in our question. It may cause us to go back to different steps of the data science process to tweak things as necessary. We may also gain a better understanding of what we should be testing.
At this point, it is important to realize that the data science process is not necessarily a linear process from data collection to conclusion. Most of the time, a big part of the process is going back and forth between different parts of the process. For example, if visualizing a variable is problematic due to outliers, then it may help to go back to the Data Management portion of the data science process to clean the data further.
With that out of the way, we know we want to find some relationship between the increase in vaccinations and the weakening of the coronavirus. To see if the data implies that, we can start by visualizing the rates of daily cases against the rates of daily vaccinations. In order to do this, we will need to calculate the rates for each of those variables.
# calculate rate of daily cases for each country
country_rate_daily_cases = {} # ex: Country: [3,4,5]
for index, row in daily_data.iterrows():
if daily_data.loc[index, 'country'] in country_rate_daily_cases:
temp_lst = country_rate_daily_cases[daily_data.loc[index, 'country']]
temp_lst.append(daily_data.loc[index, 'daily_new_cases'])
country_rate_daily_cases[daily_data.loc[index, 'country']] = temp_lst
else:
country_rate_daily_cases[daily_data.loc[index, 'country']] = [daily_data.loc[index, 'daily_new_cases']]
# Calculate rate for every country
country_case_rate = [] # ex: [(Country,rate)]
for country, val in country_rate_daily_cases.items():
x_lst = [i for i in range(0,len(val))]
slope, intercept = np.polyfit(x_lst, val, 1)
country_case_rate.append((country,slope))
country_case_rate.sort()
# gather daily vaccinations for each country
country_rate_vac = {} # ex: Country: [3,4,5]
# create a list to keep track of skipped countries
skipped_nan = []
for index, row in vaccine_data.iterrows():
# Check for a NaN, and if present, mark the country then skip this row
if str(vaccine_data.loc[index, 'daily_vaccinations']) == 'nan':
if vaccine_data.loc[index, 'country'] not in skipped_nan:
skipped_nan.append(vaccine_data.loc[index, 'country'])
continue
if vaccine_data.loc[index, 'country'] in country_rate_vac:
temp_lst = country_rate_vac[vaccine_data.loc[index, 'country']]
temp_lst.append(vaccine_data.loc[index, 'daily_vaccinations'])
country_rate_vac[vaccine_data.loc[index, 'country']] = temp_lst
else:
country_rate_vac[vaccine_data.loc[index, 'country']] = [vaccine_data.loc[index, 'daily_vaccinations']]
# Calculate the rate for every country
country_vac_rate = [] # ex: [(Country,rate)]
for country, val in country_rate_vac.items():
x_lst = [i for i in range(0,len(val))]
slope, intercept = np.polyfit(x_lst, val, 1)
country_vac_rate.append((country, slope))
country_vac_rate.sort()
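To be explicit about what these "rates" are: `np.polyfit` with degree 1 fits a least-squares line through a country's series, with the day index as x, and the slope of that line is the rate. A quick sanity check on made-up numbers:

```python
import numpy as np

# A perfectly linear daily series rising by 2 per day.
daily_values = [10, 12, 14, 16]
x = np.arange(len(daily_values))

# Degree-1 polyfit returns (slope, intercept); the slope is the daily rate.
slope, intercept = np.polyfit(x, daily_values, 1)
```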
Just a reminder: now is a good time to make sure that the presence of a NaN did not cause any country to be skipped entirely from the rate lists. Remember, when we were gathering the daily vaccination data, we skipped a row if its daily vaccinations value was NaN.
We can examine the lengths of the two 2-tuple lists we created to see if there are any disconnects.
print('Length of vaccination rates list: ' + str(len(country_vac_rate)))
print('Length of case rates list: ' + str(len(country_case_rate)))
Length of vaccination rates list: 178
Length of case rates list: 180
So it turns out that two countries were left out. All we have to do now is find out which countries appear in the larger list (the list of case rates) but not the smaller one, and drop them. Fortunately, we marked the countries that were skipped, so we can use that to pinpoint which two to remove.
for skipped in skipped_nan:
if skipped not in country_rate_vac:
print('Completely skipped: ' + str(skipped))
Completely skipped: TAJIKISTAN
Completely skipped: YEMEN
Now we know the two countries are Tajikistan and Yemen, so we can drop them from the country_case_rate list.
# Rebuild the list without the two countries; calling remove() on a list
# while iterating over it can silently skip elements.
country_case_rate = [(country, rate) for (country, rate) in country_case_rate if country not in ('TAJIKISTAN', 'YEMEN')]
print('Length of vaccination rates list: ' + str(len(country_vac_rate)))
print('Length of case rates list: ' + str(len(country_case_rate)))
Length of vaccination rates list: 178
Length of case rates list: 178
Now we have equal rates lists for vaccinations and cases, so we can go ahead and plot. What we want to plot is the Rate of Daily Cases against the Rate of Daily Vaccinations.
scatter1_x = np.array([rate for (country, rate) in country_vac_rate])
scatter1_y = np.array([rate for (country, rate) in country_case_rate])
plt.subplots(figsize=(16,10))
plt.scatter(scatter1_x,scatter1_y)
plt.xlabel('Rate of Daily Vaccinations by Country')
plt.ylabel('Rate of Daily Cases by Country')
plt.title('Rate of Daily Cases vs Rate of Daily Vaccinations in 2021')
plt.show()
That looks awfully hard to interpret. As you have probably gathered by now, data science involves a lot of trial and error; you generally are not going to get everything perfect on the first try. By the looks of this graph, there are about five outliers, three of which are very significant. It is a good idea to identify which countries stand out so strongly. That way we can remove them and possibly research the circumstances that might cause such a significant deviation from the main cluster.
In order to accomplish this, we can go through the list of rates for vaccinations, and label the outliers with extreme values.
# Go through vaccination rates
for (country, rate) in country_vac_rate:
if (rate > 7000):
print('High vaccination rate: ' + str(country))
High vaccination rate: CHINA
High vaccination rate: INDIA
High vaccination rate: UNITED STATES
The vaccination rates for China, India, and the United States are very high, in part because of their roles in producing the vaccines. We also know from the news that India has been going through an extremely hard time with COVID. With a population of over 1 billion people, it is understandable how India can have such a high vaccination rate yet still show the worst increase in cases. See https://www.wsj.com/articles/why-indias-second-covid-19-surge-is-much-worse-than-its-first-11621071001 to learn more about India's struggle.
Now that we know the significant outliers, we can remove them.
# Filter with list comprehensions rather than calling remove() while iterating,
# which can silently skip elements.
outliers = ('CHINA', 'INDIA', 'UNITED STATES')
country_case_rate = [(country, rate) for (country, rate) in country_case_rate if country not in outliers]
country_vac_rate = [(country, rate) for (country, rate) in country_vac_rate if country not in outliers]
We can now attempt to plot the scatter plot of the Rate of Daily Cases against the Rate of Daily Vaccinations again as before.
scatter2_x = np.array([rate for (country, rate) in country_vac_rate])
scatter2_y = np.array([rate for (country, rate) in country_case_rate])
plt.subplots(figsize=(16,10))
plt.scatter(scatter2_x,scatter2_y)
plt.xlabel('Rate of Daily Vaccinations by Country')
plt.ylabel('Rate of Daily Cases by Country')
plt.title('Rate of Daily Cases vs Rate of Daily Vaccinations in 2021')
plt.show()
This still does not look good for establishing vaccine efficacy. Based on the cluster, there almost appears to be a slightly increasing trend in the rate of cases as the rate of vaccinations increases. One factor that could be distorting the view is that the rates are not segmented in time for each country. Since this is data on a highly contagious virus, different countries have been jumping in and out of surges. Perhaps, then, it is wrong to compute a single rate per country; instead we should compute multiple rates per country, one per month. This may account for jumps between surges so that the data is spread apart.
Let's go ahead and calculate the rates per month.
# calculate rate of daily cases for each country by month
country_rate_daily_cases = {} # ex: Country: [[2,3,4],[3,4,5]] ---> [(1,Country,rate), ...]
for index, row in daily_data.iterrows():
month = int(daily_data.loc[index,'date'].split('-')[1]) - 2 # subtract 2 so Feb..May map to list indices 0..3
if daily_data.loc[index, 'country'] in country_rate_daily_cases:
country_rate_daily_cases[daily_data.loc[index, 'country']][month]\
.append(daily_data.loc[index,'daily_new_cases'])
else:
country_rate_daily_cases[daily_data.loc[index,'country']] = [[],[],[],[]]
country_rate_daily_cases[daily_data.loc[index,'country']][month]\
.append(daily_data.loc[index,'daily_new_cases'])
# Calculate rate for every country
country_case_rate = []
for country, val in country_rate_daily_cases.items(): # val is a list of lists here
for val_index in range(0,len(val)):
x_lst = [i for i in range(0,len(val[val_index]))]
slope, intercept = np.polyfit(x_lst, val[val_index], 1)
country_case_rate.append((val_index,country,slope))
country_case_rate.sort()
# calculate rate of daily vaccinations for each country by month
country_rate_daily_vac = {} # ex: Country: [[2,3,4],[3,4,5]] ---> [(1,Country,rate), ...]
for index, row in vaccine_data.iterrows():
# Check for a NaN, and if present, mark the country then skip this row
if str(vaccine_data.loc[index, 'daily_vaccinations']) == 'nan':
continue
month = int(vaccine_data.loc[index,'date'].split('-')[1]) - 2 # subtract 2 so Feb..May map to list indices 0..3
if vaccine_data.loc[index, 'country'] in country_rate_daily_vac:
country_rate_daily_vac[vaccine_data.loc[index, 'country']][month]\
.append(vaccine_data.loc[index,'daily_vaccinations'])
else:
country_rate_daily_vac[vaccine_data.loc[index,'country']] = [[],[],[],[]]
country_rate_daily_vac[vaccine_data.loc[index,'country']][month]\
.append(vaccine_data.loc[index,'daily_vaccinations'])
# Calculate rate for every country
country_vac_rate = []
for country, val in country_rate_daily_vac.items():
for val_index in range(0,len(val)):
if len(val[val_index]) > 0:
x_lst = [i for i in range(0,len(val[val_index]))]
# polyfit can fail on degenerate input, so skip those months
try:
slope, intercept = np.polyfit(x_lst, val[val_index], 1)
except Exception:
continue
country_vac_rate.append((val_index, country, slope))
country_vac_rate.sort()
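One caveat on the date handling: the daily table writes months without zero padding ('2021-2-22') while the vaccine table pads them ('2021-02-22'), so fixed-offset string slicing is fragile. Splitting on '-' handles both forms uniformly; a small sketch with toy dates:

```python
import pandas as pd

dates = pd.Series(['2021-2-22', '2021-02-22', '2021-5-07', '2021-05-07'])

# Take the middle piece of YYYY-M-DD / YYYY-MM-DD, then map Feb..May to indices 0..3.
months = dates.str.split('-').str[1].astype(int)
bucket = months - 2
```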
Now let's make sure that unshared values between the two lists are removed, NaN values are removed, and outliers are removed.
# Storing the rates as values to the 2-tuple (of month and country) key.
case_rate_dict = {}
for (month,country,rate) in country_case_rate:
case_rate_dict[(month,country)] = rate
vac_rate_dict = {}
for (month,country,rate) in country_vac_rate:
vac_rate_dict[(month,country)] = rate
# removing unshared values
to_remove = []
for key, val in case_rate_dict.items():
if key not in vac_rate_dict or str(val) == 'nan':
to_remove.append(key)
for remove_item in to_remove:
case_rate_dict.pop(remove_item,None)
to_remove = []
for key, val in vac_rate_dict.items():
if key not in case_rate_dict or str(val) == 'nan':
to_remove.append(key)
for remove_item in to_remove:
vac_rate_dict.pop(remove_item,None)
if len(case_rate_dict) == len(vac_rate_dict):
print('The unshared values have been removed')
# removing outliers
to_remove = []
for key, val in vac_rate_dict.items():
if val > 20000:
to_remove.append(key)
for remove_item in to_remove:
vac_rate_dict.pop(remove_item,None)
case_rate_dict.pop(remove_item,None)
to_remove = []
for key, val in case_rate_dict.items():
if val > 1800:
to_remove.append(key)
for remove_item in to_remove:
vac_rate_dict.pop(remove_item,None)
case_rate_dict.pop(remove_item,None)
The unshared values have been removed
scatter3_x = np.array([rate for (month, country), rate in vac_rate_dict.items()])
scatter3_y = np.array([rate for (month, country), rate in case_rate_dict.items()])
plt.subplots(figsize=(16,10))
plt.scatter(scatter3_x, scatter3_y, color='green')
plt.xlabel('Rate of Daily Vaccinations by Country Per Month')
plt.ylabel('Rate of Daily Cases by Country Per Month')
plt.title('Rate of Daily Cases vs Rate of Daily Vaccinations by Month in 2021')
plt.show()
This does not look that promising. Based on the cluster, it looks as though the linear relationship wouldn't be too strong. It may also be misleading to look at the rate of vaccinations: at the beginning of a month there could be a large number of vaccinations having an impact against the spread of COVID, yet towards the end of the month there could be few, producing a negative rate for the month. It may be better to instead use the cumulative number of vaccinations per hundred people per month. We can still compare against the rate of cases, because this tells us how the spread of COVID is progressing as more people are vaccinated.
# calculate the monthly cumulative vaccinated people per hundred
monthly_cumulative_vac = {} # ex: Country: [value for Feb, Mar, Apr, May]
for index, row in vaccine_data.iterrows():
    # Check for a NaN, and if present, skip this row
    if str(vaccine_data.loc[index, 'people_fully_vaccinated_per_hundred']) == 'nan':
        continue
    # use month - 2 because we count from Feb, indexed to 0
    month = int(vaccine_data.loc[index, 'date'][6:7]) - 2
    if vaccine_data.loc[index, 'country'] not in monthly_cumulative_vac:
        monthly_cumulative_vac[vaccine_data.loc[index, 'country']] = [0, 0, 0, 0]
    monthly_cumulative_vac[vaccine_data.loc[index, 'country']][month]\
        = vaccine_data.loc[index, 'people_fully_vaccinated_per_hundred']
# Calculate rate for every country
country_vac_curr = [] # ex: [(1,Country,rate), ...]
for country, val in monthly_cumulative_vac.items():
for val_index in range(0, len(val)):
if val[val_index] != 0:
country_vac_curr.append((val_index, country, val[val_index]))
country_vac_curr.sort()
We can now follow the same cleaning process to filter out unshared values between the lists, NaN values, and outliers.
# Storing the case rates and vaccination counts as values to the 2-tuple (of month and country) key.
case_rate_dict = {}
for (month,country,rate) in country_case_rate:
case_rate_dict[(month,country)] = rate
vac_num_dict = {}
for (month,country,vac_count) in country_vac_curr:
vac_num_dict[(month,country)] = vac_count
# removing unshared values
to_remove = []
for key, val in case_rate_dict.items():
    if key not in vac_num_dict or str(val) == 'nan': # note: the new vac_num_dict, not vac_rate_dict
to_remove.append(key)
for remove_item in to_remove:
case_rate_dict.pop(remove_item,None)
to_remove = []
for key, val in vac_num_dict.items():
if key not in case_rate_dict or str(val) == 'nan':
to_remove.append(key)
for remove_item in to_remove:
vac_num_dict.pop(remove_item,None)
if len(case_rate_dict) == len(vac_num_dict):
print('The unshared values have been removed')
# removing outliers (cutoffs chosen by inspecting the extremes of the data)
to_remove = []
for key, val in vac_num_dict.items():
if val > 60:
to_remove.append(key)
for remove_item in to_remove:
vac_num_dict.pop(remove_item,None)
case_rate_dict.pop(remove_item,None)
to_remove = []
for key, val in case_rate_dict.items():
if abs(val) > 600:
to_remove.append(key)
for remove_item in to_remove:
vac_num_dict.pop(remove_item,None)
case_rate_dict.pop(remove_item,None)
The unshared values have been removed
scatter4_x = np.array([vac_count for (month, country), vac_count in vac_num_dict.items()])
scatter4_y = np.array([rate for (month, country), rate in case_rate_dict.items()])
plt.subplots(figsize=(16,10))
plt.scatter(scatter4_x, scatter4_y, color='orange')
plt.xlabel('Cumulative Number of Those Fully Vaccinated per Hundred People by Country Per Month')
plt.ylabel('Rate of Daily Cases by Country Per Month')
plt.title('Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month in 2021')
plt.show()
This looks more likely to yield a linear relationship than comparing against the rate of vaccinations. Ideally, we would hope for a negative slope on the regression line, which would indicate that the rate of cases is decreasing as the number of fully vaccinated people increases. Now that we have established a possible relationship that can show vaccine efficacy, we can move on to test it.
This is where we start to learn about hypothesis testing. But first, what is a hypothesis? A hypothesis is a claim about a single parameter, several parameters, or a probability distribution, so hypothesis testing is checking the validity of such a claim. In it, we have two scenarios. The first is the null hypothesis, denoted $H_{0}$, which is what is assumed to be true by default before the test. The second is the alternate hypothesis, denoted $H_{a}$, which is a claim generally opposite to the null hypothesis.
Generally there are about 5 steps to hypothesis testing. Let's briefly summarize them: state the null hypothesis $H_{0}$ and the alternate hypothesis $H_{a}$; choose the level of significance (Alpha) at which to perform the test; compute the test statistic from the sample data; compute the P-value associated with the test statistic; and finally compare the P-value with Alpha to draw the statistical conclusion.
You may be wondering: what are a test statistic, a P-value, and the level of significance? First, the level of significance, also known as Alpha, is related to the confidence level at which we want to perform the test. The confidence level is equal to 100(1 - Alpha)%, so when the confidence level is high, like 95%, Alpha would be 5% or 0.05. The test statistic is the standardized score of the sample statistic when the null hypothesis is assumed to be true. The P-value is the probability, assuming the null hypothesis is true, that the test statistic takes on a value at least as extreme as the one observed. In practice, we will let the machine learning libraries do the work of producing the P-value we need to make the statistical conclusion.
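To make these definitions concrete, here is a small sketch using scipy (which we imported earlier). The test statistic and degrees of freedom below are made-up numbers purely for illustration, not values from our data:

```python
from scipy import stats

# A 95% confidence level corresponds to Alpha = 100% - 95% = 0.05
alpha = 0.05

# Suppose a test produced a standardized test statistic of 2.5 with
# 100 degrees of freedom (made-up numbers, purely for illustration).
test_statistic = 2.5
degrees_of_freedom = 100

# Two-sided P-value: the probability, under the null hypothesis, of a
# value at least as extreme as the observed test statistic.
p_value = 2 * stats.t.sf(abs(test_statistic), degrees_of_freedom)

print(p_value)
print(p_value <= alpha)  # True -> we would reject the null hypothesis
```

A small P-value means the observed statistic would be very unlikely if the null hypothesis were true, which is exactly why it drives the decision in the final step.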
With that out of the way, let's start hypothesis testing. We want to test for a significant linear relationship between the x and y values using a linear regression test. Recall that regression lines take the form y = mx + b, or in other words y = $B_{1}$x + $B_{0}$, where $B_{1}$ is the slope.
We want to know whether the Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month has a significant linear relationship. Let's perform this hypothesis test at a 0.05 level of significance.
$H_{0}$: $B_{1}$ = 0
$H_{a}$: $B_{1}$ $\neq$ 0
where $B_{1}$ is the slope of the true regression line.
Let's take a moment to understand the hypotheses we just declared. If the slope is equal to 0 like in the null hypothesis, then it means that there is no significant linear relationship between the x and y variables. If the slope is not equal to 0 like in the case of the alternate hypothesis, then it means that there is a significant linear relationship between the x and y variables. Now we need to figure out the P-value. This is where we can bring in machine learning.
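For reference, the test statistic behind this slope test has a standard form, where $\hat{B}_{1}$ is the estimated slope and $SE(\hat{B}_{1})$ is its standard error; under $H_{0}$ it follows a t distribution with n - 2 degrees of freedom. (For simple linear regression with a single predictor, the overall F-test that statsmodels reports is equivalent to this t-test.)

```latex
% Test statistic for H0: B1 = 0 in simple linear regression
t = \frac{\hat{B}_{1} - 0}{SE(\hat{B}_{1})}
```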
Now let's get into the machine learning. What is machine learning? To keep it simple, machine learning is training models by fitting them with data, which enables us to predict new values based on the model. In this case, we want to create a linear regression model based on the Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month.
There are many libraries that offer ways to train linear regression models; among the well-known ones are sklearn and statsmodels. How about we use a bit of both in order to shed more light on how training models works? We can learn about sklearn at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. We already have the values laid out from our latest scatter plot. Let's modify those values so that we can use them to train the linear regression model. But before that, let's clear any NaN values that may be present. Once we train the model, we can print out the equation of the regression line.
# Clear any potential NaN's, keeping the x and y values paired up
ind = []
dep = []
for vac_count, rate in zip(scatter4_x, scatter4_y):
    # Skip the pair if either value is NaN, so ind and dep stay aligned
    if str(vac_count) != 'nan' and str(rate) != 'nan':
        ind.append([vac_count]) # sklearn expects a 2D array of features
        dep.append(rate)
lm = linear_model.LinearRegression()
lm.fit(ind, dep)
b1 = lm.coef_[0]
b0 = lm.intercept_
print('Linear regression line for:')
print('Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month is')
print('y = ' + str(b1) + 'x + ' + str(b0))
Linear regression line for: Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month is y = -1.3722702484693745x + 9.802660202474815
Take a look at the linear regression line we calculated using the sklearn library. It looks very promising, as the slope is negative, implying that the rate of daily cases is decreasing as more people are vaccinated. However, it is important to understand that linear regression models can produce equations even when there is no significant linear relationship between x and y. This is because linear regression models are mostly calculated using a least squares approximation, where the calculated regression line minimizes the sum of the squared deviations from the line.
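To make the least squares idea concrete, here is a small sketch on made-up toy data. It shows that the slope and intercept sklearn reports can be recovered directly from a least squares solve, and that a line is produced regardless of how strong the relationship is:

```python
import numpy as np
from sklearn import linear_model

# Toy data, made up for illustration: y loosely decreases with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.0, 7.5, 8.0, 5.5, 6.0, 3.5])

# Least squares "by hand": solve for [b0, b1] minimizing the sum of
# squared deviations, using a design matrix with a column of ones.
X = np.column_stack([np.ones_like(x), x])
b0_manual, b1_manual = np.linalg.lstsq(X, y, rcond=None)[0]

# The same fit through sklearn
lm = linear_model.LinearRegression()
lm.fit(x.reshape(-1, 1), y)

# Both minimize the same quantity, so the coefficients agree
print(b1_manual, lm.coef_[0])
print(b0_manual, lm.intercept_)
```

Note that both approaches happily hand back an equation; nothing in the fit itself says whether the relationship is statistically significant, which is why we still need the hypothesis test.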
Let's take a look at the regression line from the model on the graph.
scatter5_x = np.array([vac_count for (month, country), vac_count in vac_num_dict.items()])
scatter5_y = np.array([rate for (month, country), rate in case_rate_dict.items()])
figure, ax = plt.subplots(figsize=(16,10))
plt.scatter(scatter5_x, scatter5_y, color='orange')
ax.plot(scatter5_x, b1*scatter5_x + b0, color='b') # plotting regression line
plt.xlabel('Cumulative Number of Those Fully Vaccinated per Hundred People by Country Per Month')
plt.ylabel('Rate of Daily Cases by Country Per Month')
plt.title('Rate of Daily Cases vs Cumulative Number of Those Fully Vaccinated per Hundred People by Month in 2021')
plt.show()
To become more confident in the equation our model produced, let's determine the P-value. To accomplish this, we can try out another machine learning library, statsmodels, which makes it significantly easier to calculate the P-value. However, there are important considerations to make when training the model. Machine learning libraries do not always adhere to the same exact format, so it is very important to pay heed to the documentation. We can learn about statsmodels at https://www.statsmodels.org/stable/regression.html#.
ind2 = sm.add_constant(ind) # Modifying the independent variable as per the documentation
statsModel = sm.OLS(dep, ind2) # OLS is ordinary least squares as we just recently discussed
statsResults = statsModel.fit()
print('P-value: ' + str(statsResults.f_pvalue))
P-value: 0.013826826339792809
The P-value in this case would be about 0.014. Using this, let's continue on with our hypothesis test to determine the statistical significance.
To determine the statistical significance, we need to compare the P-value we calculated with the level of significance. If the P-value is less than or equal to Alpha (the level of significance), then we reject $H_{0}$ in favor of $H_{a}$. However if the P-value is greater than Alpha, then we fail to reject $H_{0}$.
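That decision rule is simple enough to capture in a tiny helper function (a sketch of our own, not part of any library):

```python
def reject_null(p_value, alpha=0.05):
    """Return True if we reject H0 at the given level of significance."""
    return p_value <= alpha

# With our P-value of about 0.014 at the 0.05 level of significance:
print(reject_null(0.014))        # True -> reject H0 in favor of Ha
# At a stricter 0.01 level, the same P-value would not be enough:
print(reject_null(0.014, 0.01))  # False -> fail to reject H0
```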
The conventional level of significance is 0.05, which is what we chose earlier. Our P-value, as we already stated, is about 0.014. Comparing them, the P-value is less than Alpha, which leads us to reject $H_{0}$ in favor of $H_{a}$.
Now that we have rejected the null hypothesis in favor of the alternate hypothesis, let's comprehend what this means in terms of vaccine efficacy. This means that based on the sample data, we have sufficient evidence to conclude that there is a significant linear relationship between the Rate of COVID Cases and the Cumulative Number of Those Fully Vaccinated per Hundred People at a 0.05 level of significance.
Overall, our linear regression model gave us the equation y = -1.3722702484693745x + 9.802660202474815, which is promising given that we also found the linear relationship to be significant. Reiterating the description of the equation, it tells us that as the number of fully vaccinated people per hundred increases, the rate of cases decreases. This is something that would potentially convince us that the vaccine is working and the number of cases is slowing down. However, it is important to understand that there are numerous factors that go into testing vaccine efficacy around the world, such as population density, government regulation of peoples' activities, COVID variants, and a lot more. In this tutorial we analyzed vaccine efficacy based on two different datasets; there were many missing values and many disconnects that had to be accounted for. In the future, a more complicated analysis would likely be more accurate with better-collected data and more datasets that give insight into things like the population density and social norms of a country.
Overall, I hope you learned a lot from this tutorial, and if you missed any resources linked throughout the tutorial you can check them out here:
Vaccine dataset: https://www.kaggle.com/gpreda/covid-world-vaccination-progress?select=country_vaccinations.csv
Daily Data dataset: https://www.kaggle.com/josephassaker/covid19-global-dataset?select=worldometer_coronavirus_daily_data.csv
India's Struggle with COVID: https://www.wsj.com/articles/why-indias-second-covid-19-surge-is-much-worse-than-its-first-11621071001
Machine Learning Libraries:
sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
statsmodels: https://www.statsmodels.org/stable/regression.html#
Thanks for reading!