import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import glob, os
import csv
import plotly.express as px
The original data come as one CSV file per day, all stored in a single folder. In this section, I loop over the folder, load each file into a pandas dataframe, and then concatenate all the dataframes into one big dataframe.
path = r"/Users/lindachen/Downloads/COVID-19-master/daily_case_updates"
all_files = glob.glob(path + "/*.csv")
daily_data= []
for file in all_files:
    daily_data.append(pd.read_csv(file))
till_feb13df = pd.concat(daily_data, ignore_index=True)
till_feb13df
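As a small variation on the loading loop above, the sketch below sorts the file list and records which file each row came from. The function name `load_daily_files` and the `source_file` column are hypothetical additions for illustration, not part of the original notebook, and the two demo files are made up.

```python
import glob
import os
import tempfile

import pandas as pd


def load_daily_files(folder):
    """Read every CSV in `folder` and stack them, tagging each row with its file name."""
    frames = []
    # sorted() makes the row order reproducible; glob's own order is arbitrary
    for file in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        frame = pd.read_csv(file)
        frame["source_file"] = os.path.basename(file)  # hypothetical helper column
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demo with two made-up files in a temporary folder
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"Confirmed": [1, 2]}).to_csv(os.path.join(demo_dir, "a.csv"), index=False)
pd.DataFrame({"Confirmed": [3]}).to_csv(os.path.join(demo_dir, "b.csv"), index=False)
demo_df = load_daily_files(demo_dir)
print(demo_df)
```

Keeping the file name makes it easier to trace an odd value back to the daily file it came from.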
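There are three things that I did in this section: I renamed the columns, converted the update timestamp from a string to a date (keeping only the date), and dropped the irrelevant columns.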
#rename
tillfeb13_newdf = till_feb13df.rename(columns={"Province/State": "state", "Country/Region": "region", "Last Update": "last_update"})
#organize data column - change datatype from string to date
tillfeb13_newdf["last_update"] = pd.to_datetime(tillfeb13_newdf["last_update"], infer_datetime_format=True)
#organize data column - only keeping the date
tillfeb13_newdf['just_date'] = tillfeb13_newdf['last_update'].dt.date
#print(tillfeb13_newdf.head())
#drop irrelevant columns
tillfeb13_cleandf = tillfeb13_newdf.drop(columns =["Suspected", "ConfnSusp", "last_update", "Notes"], axis = 1)
tillfeb13_cleandf
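If the daily files do not all write their timestamps in the same style, parsing the whole column in one call can be fragile. The sketch below parses each value on its own and keeps only the date. The two sample strings are made up for illustration, and whether the real files mix formats is an assumption.

```python
import datetime as dt

import pandas as pd

# Two timestamp styles, made up for illustration
raw = pd.Series(["1/22/2020 12:00", "2020-02-02T23:43:02"])

# Parsing each value separately avoids assuming one format for the whole column
just_date = raw.apply(lambda value: pd.to_datetime(value).date())
print(just_date.tolist())
```

This is slower than a single vectorised call, which should not matter for a few thousand rows.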
I first want to give an overview of the situation internationally, from the outbreak to Feb. 13th. To do so, I present the total confirmed cases, deaths, and recoveries in each location, taking the maximum value reported for each location over the period.
master_view = tillfeb13_cleandf.groupby("state").max().reset_index()
master_view.sort_values(by=['Confirmed'], inplace=True, ascending=False)
master_viewnew = master_view.reset_index()
#to print
master_viewnew
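The overview uses `max()` rather than `sum()` per location. A minimal sketch of why that matters, assuming the daily counts are running totals (the toy numbers are made up): summing running totals counts the same cases several times, while the maximum gives the latest total.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "just_date": ["2020-01-21", "2020-01-22", "2020-01-23"] * 2,
    "Confirmed": [10, 25, 40, 1, 1, 3],  # running totals, made up
})

latest_total = toy.groupby("state")["Confirmed"].max()    # latest running total
double_counted = toy.groupby("state")["Confirmed"].sum()  # adds the same cases repeatedly
print(latest_total)
print(double_counted)
```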
places = np.array(master_viewnew.state)
first= master_viewnew.Confirmed
second = master_viewnew.Deaths
third = master_viewnew.Recovered
plt.figure(figsize=(10,6))
ax=plt.subplot()
plt.bar(range(len(places)), first)
plt.bar(range(len(places)), second, bottom=first)
plt.bar(range(len(places)), third, bottom=(first+second))
ax.set_xticks(range(len(places)))
ax.set_xticklabels(places, rotation = 90, fontsize=8)
plt.title("\nOverview of the Coronavirus Outbreak in Different States/Provinces\n", fontsize=16)
plt.ylabel("Total Infected cases\n", fontsize=14)
plt.xlabel("Provinces/States", fontsize=14)
plt.legend(["Confirmed", "Deaths", "Recovered"], loc=1)
plt.show()
The Coronavirus outbreak is still a regional outbreak in the Hubei Province of China. It is not yet a country-wide outbreak, nor is it an international outbreak.
Due to the huge differences in absolute values between areas, it is more useful to look at rates. Therefore, in the following analysis, I will focus on comparing the growth rates in different regions.
Because Hubei province is a special case, I separate it from the rest of China. For convenience of analysis, I group the rest of China together and the regions outside of mainland China together.
#separate the dataframe into three regions
Hubei_df = tillfeb13_cleandf.loc[tillfeb13_cleandf["state"]=="Hubei"]
restChina_df = tillfeb13_cleandf.loc[(tillfeb13_cleandf["region"]=="Mainland China") & (tillfeb13_cleandf["state"] != "Hubei")]
notChina_df = tillfeb13_cleandf.loc[tillfeb13_cleandf["region"] !="Mainland China"]
#organize all the dataframes in terms of day-by-day
Hubei_bydate = Hubei_df.groupby("just_date").sum().reset_index()
restChina_bydate = restChina_df.groupby("just_date").sum().reset_index()
notChina_bydate = notChina_df.groupby("just_date").sum().reset_index()
#to print
Hubei_bydate
#write a function that returns the growth rate with any column in a given dataframe
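def growth_rate(df, column):
    rate = []
    for x in range(len(df[column])):
        if x > 0:
            # change since the previous row, as a percentage of the previous value
            new = df[column][x] - df[column][x-1]
            pct = new / df[column][x-1] * 100
            rate.append(pct)
        else:
            # the first row has no previous value to compare against
            rate.append(0)
    return rate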
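The same day-over-day percentage can also be computed with pandas' built-in `pct_change`. This is a sketch of an alternative, not the function used in the rest of the notebook. The name `growth_rate_vectorised` is hypothetical and the toy numbers are made up.

```python
import pandas as pd


def growth_rate_vectorised(df, column):
    """Day-over-day percentage change; the first row has no predecessor, so it is 0."""
    return (df[column].pct_change() * 100).fillna(0).tolist()


toy = pd.DataFrame({"Confirmed": [100, 150, 300]})
toy_rates = growth_rate_vectorised(toy, "Confirmed")
print(toy_rates)
```

Going from 100 to 150 is a 50% increase and from 150 to 300 is a 100% increase, which is what the loop version returns for the same input.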
The table below is very interesting because on some days there are huge drops in the "Confirmed" cases despite little change in the "Recovered" cases, for example on Jan. 30th, Feb. 4th and Feb. 12th. This might be due to reporting errors.
#create a new column in each dataframe that stores the growth rate of all the confirmed cases
Hubei_bydate["confirmed_growthrate"] = growth_rate(Hubei_bydate, "Confirmed")
restChina_bydate["confirmed_growthrate"] = growth_rate(restChina_bydate, "Confirmed")
notChina_bydate["confirmed_growthrate"] = growth_rate(notChina_bydate, "Confirmed")
#to print
print(Hubei_bydate)
print(restChina_bydate)
print(notChina_bydate)
plt.figure(figsize=(10,6))
ax = plt.subplot()
plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.confirmed_growthrate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.confirmed_growthrate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.confirmed_growthrate, label="Internationally", marker="o")
#ax.set_ticks = Hubei_bydate.just_date
ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
ax.set_yticks(range(-100, 500, 50))
plt.legend()
plt.title("\nGrowth Rate of Confirmed Cases\n", fontsize =16)
plt.ylabel("Rate of Growth per Day", fontsize = 14)
plt.show()
The growth rates of confirmed cases are really volatile. As noted in the observation above, this is probably due to errors in data reporting. It would be interesting to do some data treatment, for example replacing the data points that show an abnormal drop with an educated guess.
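One possible treatment, sketched here under the assumption that the confirmed counts are running totals and should therefore never decrease, is to flag any day-over-day drop and hold the previous maximum through it. This is only one of several reasonable choices (interpolation would be another), and the numbers are made up.

```python
import pandas as pd

# Made-up running total with one implausible drop on the third day
confirmed = pd.Series([100, 200, 150, 300])

dropped = confirmed.diff() < 0   # True where the total went down
treated = confirmed.cummax()     # carry the previous maximum through the drop
print(dropped.tolist())
print(treated.tolist())
```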
In Hubei Province, the growth rates have been much higher than in other areas. This aligns with the analysis that it is a regional outbreak.
To answer the question: is the growth rate exponential? In other words, regardless of which region you are in or how many cases your region has now, do all regions show a consistent growth rate in the number of infected cases? If so, then no matter how many cases your region has, you should be as concerned as if you were in Hubei.
If we compare Hubei to the other regions, the answer is uncertain because Hubei's growth rate is significantly higher. However, as mentioned before, the sudden surge in the growth rate on Jan. 31st may be due to a reporting error. In other words, no reporting may have been done for several days, which would cause a surge in the numbers once reporting resumed.
If we compare "Rest of China" to "Internationally", the answer is a clear yes. While "Rest of China" has far more confirmed cases in absolute numbers than regions outside of China, the growth rates are about the same.
As the growth rate in Hubei province is slowing down and getting closer to the growth rate in the rest of the world during the later period, it is reasonable to conclude that the virus is spreading exponentially at a rate between 100% and 150% every day.
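To make a daily growth rate easier to interpret, it can be converted into a doubling time: with a constant daily rate r, cases double every ln(2) / ln(1 + r) days. The sketch below only illustrates the arithmetic. The function name `doubling_time_days` is hypothetical and the example rates are not taken from the data.

```python
import math


def doubling_time_days(daily_growth_percent):
    """Days needed for the case count to double at a constant daily growth rate."""
    r = daily_growth_percent / 100
    return math.log(2) / math.log(1 + r)


at_100_percent = doubling_time_days(100)  # doubling every day
at_10_percent = doubling_time_days(10)    # roughly a week
print(at_100_percent, at_10_percent)
```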
The graph seems to indicate a start-peak-decline pattern, with the first peak around Jan. 23rd/24th, the second peak around Feb. 1st/2nd, and the third peak around Feb. 9th/10th. Thus, in my later analysis, I want to divide the timeline into three periods and study the distribution of the growth rates in each period to see the change in the scale of the outbreak.
In this section, instead of studying the growth rates of deaths and recovered cases, I studied the changes in the mortality rate and the recovered rate. The reason is that if there are indeed reporting errors, we will very likely see volatile lines with no clear pattern again. It is more useful to study the overall mortality rate, the recovered rate, and how they change over time.
death_rate = lambda row: (row.Deaths)/(row.Confirmed) *100
Hubei_bydate["death_rate"] = Hubei_bydate.apply(death_rate, axis=1)
restChina_bydate["death_rate"] = restChina_bydate.apply(death_rate, axis=1)
notChina_bydate["death_rate"] = notChina_bydate.apply(death_rate, axis=1)
plt.figure(figsize=(10,6))
ax = plt.subplot()
plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.death_rate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.death_rate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.death_rate, label="Internationally", marker="o")
#ax.set_ticks = Hubei_bydate.just_date
ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
plt.legend()
plt.ylabel("Rate of Death Cases / Total Confirmed Cases", fontsize =14)
plt.title("\nMortality Rate from Jan.21 to Feb 13th 2020\n", fontsize = 16)
plt.show()
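The `death_rate` lambda divides by `Confirmed` row by row. A column-wise version that also avoids dividing by zero on days with no confirmed cases could look like the sketch below. The helper name `rate_percent` is hypothetical and the toy numbers are made up.

```python
import numpy as np
import pandas as pd


def rate_percent(df, numerator, denominator="Confirmed"):
    """Column-wise ratio in percent; rows with a zero denominator become NaN."""
    denom = df[denominator].replace(0, np.nan)
    return df[numerator] / denom * 100


toy = pd.DataFrame({"Confirmed": [0, 200, 400], "Deaths": [0, 4, 10]})
toy["death_rate"] = rate_percent(toy, "Deaths")
print(toy)
```

The same helper would work for the recovered rate by passing `"Recovered"` as the numerator.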
The mortality rate in Hubei province was much higher than in other regions, but it has stopped rising and stayed stable; at the end, it even started to show signs of decline. This might be because, during the early days of the outbreak, the large number of cases in Hubei province caused many local hospitals to run out of protective equipment like masks and gloves. The local hospitals might also have been overwhelmed by the increase in cases over such a short period of time. Thus, patients could not be treated effectively.
In the later period, the Chinese government enforced strict control over the distribution of medical equipment, namely that all supplies must go to Hubei Province first, and experts dedicated their efforts to helping with the situation. Thus, the situation started to stabilize.
The mortality rate in the rest of China has remained low and stable, while the international mortality rate has not been as positive. This might be because international regions did not put measures into treating cases as effectively as China did, or it could simply be because the absolute numbers in other regions are much lower than in China and thus are not statistically significant.
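The point about small absolute numbers can be made concrete with a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the notebook computes. The counts are made up so that both samples have the same observed 2% rate.

```python
import math


def wilson_interval(events, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion, returned in percent."""
    p = events / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half) * 100, (centre + half) * 100


# Made-up counts: same observed rate, very different sample sizes
small = wilson_interval(2, 100)
large = wilson_interval(200, 10000)
print(small, large)
```

The interval for the small sample is several times wider, so an apparently high or low rate there says much less than the same rate in a large sample.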
In conclusion, the mortality rate thus far shows that the Coronavirus is not as fatal as SARS and that the death rate can be effectively controlled.
recover_rate = lambda row: (row.Recovered)/(row.Confirmed) *100
Hubei_bydate["recover_rate"] = Hubei_bydate.apply(recover_rate, axis=1)
restChina_bydate["recover_rate"] = restChina_bydate.apply(recover_rate, axis=1)
notChina_bydate["recover_rate"] = notChina_bydate.apply(recover_rate, axis=1)
plt.figure(figsize=(10,6))
ax = plt.subplot()
plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.recover_rate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.recover_rate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.recover_rate, label="Internationally", marker="o")
#ax.set_ticks = Hubei_bydate.just_date
ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
plt.legend()
plt.ylabel("Rate of Recovered / Total Confirmed Cases", fontsize =14)
plt.title("\nRecovered Rate From Jan.21st to Feb 13th\n", fontsize=16)
plt.show()
Overall, the recovered rate shows an increasing trend. However, in international regions the rate has been very unstable, and in Hubei province the recovered rate, though stable, is lower than in other regions.
Based on the news reports I have read, no region has an effective formula for treatment. The best thing a hospital can do is effectively isolate the patients and make sure no environmental factors cause them to get worse. The rest depends on the patients' own immune systems.
Thus, it is hard to draw any insights without a further look into the demographics of infected people. Nonetheless, the general conclusion is that the Coronavirus is not highly fatal.
As I mentioned in step 4, I want to divide the timeline into three periods and study the distribution of the growth rates in each period to see the change in the scale of the outbreak. In other words, though there seems to be a start-peak-slow pattern, I wonder whether the peak is getting lower each time, which would mean that the outbreak of the virus has been effectively controlled.
from scipy.stats import iqr
#Because the growth rate at index 0 is 0, I start the analysis with index 1
hubei_cases = np.array(Hubei_bydate.confirmed_growthrate[1:])
restChina_cases = np.array(restChina_bydate.confirmed_growthrate[1:])
notChina_cases = np.array(notChina_bydate.confirmed_growthrate[1:])
#some basic statistical analysis of the growth rate before looking at the distribution of growth rate
hubei_mean = np.mean(hubei_cases)
restChina_mean = np.mean(restChina_cases)
notChina_mean = np.mean(notChina_cases)
#std, IQR
hubei_std = np.std(hubei_cases)
restChina_std = np.std(restChina_cases)
notChina_std = np.std(notChina_cases)
hubei_iqr = iqr(hubei_cases)
restChina_iqr = iqr(restChina_cases)
notChina_iqr = iqr(notChina_cases)
print(hubei_mean, restChina_mean, notChina_mean)
print(hubei_iqr, restChina_iqr, notChina_iqr)
print(hubei_std, restChina_std, notChina_std)
The mean for Hubei province is a lot higher than for the other two regions, as expected. All three regions' growth rates are volatile, since their standard deviations are large. I also included the interquartile range to see the spread, since the interquartile range is not affected by outliers. The interquartile range also indicates that the spread is big. Therefore, this gives more reason to divide the analysis into different periods.
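A small sketch of why the interquartile range is less sensitive to outliers than the standard deviation. The numbers are made up, and the IQR is computed here with `np.percentile` instead of `scipy.stats.iqr` only to keep the example dependency-light.

```python
import numpy as np

# Made-up growth rates: five ordinary days and one extreme spike
rates = np.array([10, 12, 11, 13, 12, 400])

std = np.std(rates)
iqr_value = np.percentile(rates, 75) - np.percentile(rates, 25)
print(std, iqr_value)
```

The single spike pushes the standard deviation above 100, while the interquartile range stays at 1.5 because it only looks at the middle half of the values.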
plt.figure(figsize=(10, 6))
ax = plt.subplot()
plt.hist(Hubei_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, alpha=0.2, label = "Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, alpha=1, histtype='step',label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, alpha=0.2, label="Internationally")
ax.set_xticks(range(-100, 500, 25))
plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Jan 21st to Jan. 31st (1st Period)\n", fontsize=16)
plt.legend()
plt.show()
plt.figure(figsize=(10, 6))
ax = plt.subplot()
plt.hist(Hubei_bydate.confirmed_growthrate[11:18], range=(-100, 500),bins=30, alpha=0.2, label="Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[11:18], range=(-100, 500), bins =30, alpha=1, histtype='step', label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[11:18], range=(-100, 500), bins=30, alpha=0.2, label="Internationally")
ax.set_xticks(range(-100, 500, 25))
plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Feb 1st to Feb. 7th (2nd Period)\n", fontsize=16)
plt.legend()
plt.show()
plt.figure(figsize=(10, 6))
ax = plt.subplot()
plt.hist(Hubei_bydate.confirmed_growthrate[18:], range=(-100,500), bins=30, alpha=0.2, label="Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[18:], range=(-100,500), bins=30, alpha=1, histtype='step', label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[18:], range=(-100,500), bins =30, alpha=0.2, label="Internationally")
ax.set_xticks(range(-100, 500, 25))
plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Feb 8th to Feb. 13th (3rd Period)\n", fontsize=16)
plt.legend()
plt.show()
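The three histograms select their periods by row position (`[1:11]`, `[11:18]`, `[18:]`), which only works if every dataframe has exactly one row per day starting on Jan. 21st. A sketch of labelling the periods by date instead, using the boundaries from the plot titles, is below. The helper `period_of` is hypothetical and the growth values are placeholders, not real data.

```python
import datetime as dt

import pandas as pd

# Period end dates taken from the plot titles; labels are illustrative
PERIOD_ENDS = [
    ("1st period", dt.date(2020, 1, 31)),
    ("2nd period", dt.date(2020, 2, 7)),
    ("3rd period", dt.date(2020, 2, 13)),
]


def period_of(day):
    """Return the label of the first period whose end date is on or after `day`."""
    for label, end in PERIOD_ENDS:
        if day <= end:
            return label
    return "after 3rd period"


# Placeholder frame: one row per day from Jan. 21st to Feb. 13th
days = [dt.date(2020, 1, 21) + dt.timedelta(days=i) for i in range(24)]
toy = pd.DataFrame({"just_date": days, "confirmed_growthrate": range(24)})
toy["period"] = toy["just_date"].apply(period_of)
summary = toy.groupby("period")["confirmed_growthrate"].agg(["count", "median"])
print(summary)
```

With a `period` column in place, the same `groupby` can feed per-period statistics or histograms without hard-coded slice positions.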
Though one can argue that, because there are so few observations, the following analysis has no statistical significance, it might still be interesting to look at the pattern, or to see whether there is a pattern.
My observation is:
In conclusion, the number of cases is growing exponentially, at a rate between 100% and 150% EVERY DAY. Though the growth rate is showing signs of decline, it is still alarming.
If we had demographic data on infected, recovered and deceased patients, there would be potential to do a K-means cluster analysis.
Many news reports have stated that older people and people with weaker immune systems are more likely to get infected and more likely to be severely affected by the virus. This cluster analysis could also be compared with a cluster analysis of the flu, since the flu also causes deaths every year, to see how they differ. This might give the public a better indication of what they can do and whether they should panic over this virus.
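For readers unfamiliar with K-means, here is a minimal sketch of the algorithm (Lloyd's iteration) in plain NumPy. No demographic data is available in this notebook, so the two features (age and days until recovery) and all the values are hypothetical and exist only to show the mechanics.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if it has none)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels


# Hypothetical patients: [age, days until recovery]
patients = np.array([[25, 5], [30, 6], [28, 5], [70, 14], [75, 15], [72, 16]], dtype=float)
centroids, labels = kmeans(patients, k=2)
print(labels)
```

On real data the features would need to be scaled first, since K-means relies on Euclidean distance and a feature with a large range would otherwise dominate.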
All comments are welcome. Thanks!