In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import glob, os
import csv
import plotly.express as px

Step 1: Load all the datasets into one dataframe

The original data has each day as a separate CSV file, and all the files were put into one folder. In this section, I wrote a loop that loads every file in the folder into a dataframe, and then concatenated all the dataframes into one big dataframe.

In [2]:
path = r"/Users/lindachen/Downloads/COVID-19-master/daily_case_updates"
all_files = glob.glob(os.path.join(path, "*.csv"))

daily_data = []
for file in all_files:
    daily_data.append(pd.read_csv(file))

till_feb13df = pd.concat(daily_data, ignore_index=True)
In [6]:
till_feb13df
Out[6]:
Province/State Country/Region Last Update Confirmed Deaths Recovered Notes Suspected ConfnSusp
0 Hubei Mainland China 2020-02-13 14:13 48206.0 1310.0 3459.0 NaN NaN NaN
1 Guangdong Mainland China 2020-02-13 13:33 1241.0 2.0 314.0 NaN NaN NaN
2 Henan Mainland China 2020-02-13 14:53 1169.0 10.0 296.0 NaN NaN NaN
3 Zhejiang Mainland China 2020-02-13 14:13 1145.0 0.0 360.0 NaN NaN NaN
4 Hunan Mainland China 2020-02-13 11:53 968.0 2.0 339.0 NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
2886 Boston, MA US 2/1/20 19:43 1.0 0.0 0.0 NaN NaN NaN
2887 Los Angeles, CA US 2/1/20 19:53 1.0 0.0 0.0 NaN NaN NaN
2888 Orange, CA US 2/1/20 19:53 1.0 0.0 0.0 NaN NaN NaN
2889 Seattle, WA US 2/1/20 19:43 1.0 0.0 0.0 NaN NaN NaN
2890 Tempe, AZ US 2/1/20 19:43 1.0 0.0 0.0 NaN NaN NaN

2891 rows × 9 columns


Step 2: Organize the dataframe


There are three things that I did in this section:

  1. rename the columns so that they can be referred to more easily
  2. for the "Last Update" column, drop the hours, because in this analysis I only want to see how the data changes by date
  3. keep only the columns for confirmed, death, and recovered cases, since my analysis is not concerned with the others, and drop the irrelevant columns
In [3]:
#rename columns for easier reference
tillfeb13_newdf = till_feb13df.rename(columns={"Province/State": "state", "Country/Region": "region", "Last Update": "last_update"})
#change the date column's datatype from string to datetime
tillfeb13_newdf["last_update"] = pd.to_datetime(tillfeb13_newdf["last_update"])
#keep only the date, dropping the time of day
tillfeb13_newdf['just_date'] = tillfeb13_newdf['last_update'].dt.date
#print(tillfeb13_newdf.head())

#drop irrelevant columns ("axis" is redundant when "columns" is given)
tillfeb13_cleandf = tillfeb13_newdf.drop(columns=["Suspected", "ConfnSusp", "last_update", "Notes"])
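As an aside on the date handling above: `.dt.date` stores plain Python `date` objects (object dtype). A minimal sketch, with made-up timestamps, of that option next to `dt.normalize()`, which keeps the faster datetime64 dtype:

```python
import pandas as pd

# Toy timestamps (illustrative values, not from the data)
ts = pd.to_datetime(pd.Series(["2020-02-13 14:13", "2020-02-01 19:43"]))

# .dt.date yields plain Python date objects (object dtype) -- what this notebook uses
dates = ts.dt.date

# .dt.normalize() zeroes the time but keeps datetime64 dtype,
# which groups and sorts faster on large frames
midnights = ts.dt.normalize()
```

Either works for grouping by day; `dt.date` is simply easier to read in printed tables.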
In [5]:
tillfeb13_cleandf
Out[5]:
state region Confirmed Deaths Recovered just_date
0 Hubei Mainland China 48206.0 1310.0 3459.0 2020-02-13
1 Guangdong Mainland China 1241.0 2.0 314.0 2020-02-13
2 Henan Mainland China 1169.0 10.0 296.0 2020-02-13
3 Zhejiang Mainland China 1145.0 0.0 360.0 2020-02-13
4 Hunan Mainland China 968.0 2.0 339.0 2020-02-13
... ... ... ... ... ... ...
2886 Boston, MA US 1.0 0.0 0.0 2020-02-01
2887 Los Angeles, CA US 1.0 0.0 0.0 2020-02-01
2888 Orange, CA US 1.0 0.0 0.0 2020-02-01
2889 Seattle, WA US 1.0 0.0 0.0 2020-02-01
2890 Tempe, AZ US 1.0 0.0 0.0 2020-02-01

2891 rows × 6 columns



Step 3: Take a master view


I want to first give an overall view of the situation internationally, from the outbreak through Feb. 13th. Since the case counts are cumulative, I take the maximum of each location's confirmed, death, and recovered counts, which gives the latest totals per location.

In [15]:
#counts are cumulative, so the per-state max is each state's latest total
master_view = tillfeb13_cleandf.groupby("state").max().reset_index()
master_view.sort_values(by=['Confirmed'], inplace=True, ascending=False)
master_viewnew = master_view.reset_index()

#to print
master_viewnew
Out[15]:
index state region Confirmed Deaths Recovered just_date
0 22 Hubei Mainland China 48206.0 1310.0 3459.0 2020-02-13
1 14 Guangdong Mainland China 1241.0 2.0 314.0 2020-02-13
2 20 Henan Mainland China 1169.0 10.0 296.0 2020-02-13
3 59 Zhejiang Mainland China 1145.0 0.0 360.0 2020-02-13
4 23 Hunan Mainland China 968.0 2.0 339.0 2020-02-13
5 0 Anhui Mainland China 910.0 5.0 157.0 2020-02-13
6 27 Jiangxi Mainland China 872.0 1.0 170.0 2020-02-13
7 26 Jiangsu Mainland China 570.0 0.0 139.0 2020-02-13
8 9 Chongqing Mainland China 525.0 3.0 128.0 2020-02-13
9 45 Shandong Mainland China 509.0 2.0 105.0 2020-02-13
10 48 Sichuan Mainland China 451.0 1.0 104.0 2020-02-13
11 19 Heilongjiang Mainland China 395.0 9.0 33.0 2020-02-13
12 3 Beijing Mainland China 366.0 3.0 69.0 2020-02-13
13 46 Shanghai Mainland China 315.0 1.0 62.0 2020-02-13
14 12 Fujian Mainland China 279.0 1.0 57.0 2020-02-13
15 18 Hebei Mainland China 265.0 3.0 68.0 2020-02-13
16 44 Shaanxi Mainland China 229.0 0.0 46.0 2020-02-13
17 15 Guangxi Mainland China 222.0 2.0 33.0 2020-02-13
18 11 Diamond Princess cruise ship Others 175.0 0.0 0.0 2020-02-12
19 17 Hainan Mainland China 157.0 4.0 30.0 2020-02-13
20 58 Yunnan Mainland China 156.0 0.0 27.0 2020-02-13
21 16 Guizhou Mainland China 135.0 1.0 27.0 2020-02-13
22 47 Shanxi Mainland China 126.0 0.0 33.0 2020-02-12
23 29 Liaoning Mainland China 117.0 1.0 22.0 2020-02-13
24 52 Tianjin Mainland China 117.0 3.0 21.0 2020-02-13
25 13 Gansu Mainland China 87.0 2.0 34.0 2020-02-13
26 28 Jilin Mainland China 84.0 1.0 24.0 2020-02-13
27 35 Ningxia Mainland China 64.0 0.0 24.0 2020-02-13
28 57 Xinjiang Mainland China 63.0 1.0 6.0 2020-02-13
29 25 Inner Mongolia Mainland China 61.0 0.0 6.0 2020-02-13
30 10 Cruise Ship Others 61.0 0.0 0.0 2020-02-07
31 21 Hong Kong Hong Kong 53.0 1.0 1.0 2020-02-13
32 38 Qinghai Mainland China 18.0 0.0 11.0 2020-02-13
33 50 Taiwan Taiwan 18.0 0.0 1.0 2020-02-09
34 32 Macau Macau 10.0 0.0 3.0 2020-02-13
35 2 Bavaria Germany 7.0 NaN NaN 2020-02-01
36 39 Queensland Australia 5.0 0.0 0.0 2020-02-09
37 34 New South Wales Australia 4.0 0.0 2.0 2020-02-06
38 55 Victoria Australia 4.0 0.0 0.0 2020-02-01
39 5 British Columbia Canada 4.0 0.0 0.0 2020-02-07
40 6 California US 3.0 NaN NaN 2020-02-01
41 36 Ontario Canada 3.0 0.0 0.0 2020-02-01
42 41 San Diego County, CA US 2.0 0.0 0.0 2020-02-13
43 40 San Benito, CA US 2.0 0.0 0.0 2020-02-03
44 24 Illinois US 2.0 NaN NaN 2020-02-01
45 54 Toronto, ON Canada 2.0 0.0 0.0 2020-02-04
46 8 Chicago, IL US 2.0 0.0 2.0 2020-02-09
47 49 South Australia Australia 2.0 0.0 0.0 2020-02-02
48 42 Santa Clara, CA US 2.0 0.0 0.0 2020-02-03
49 53 Tibet Mainland China 1.0 0.0 1.0 2020-02-12
50 56 Washington US 1.0 NaN NaN 2020-02-01
51 33 Madison, WI US 1.0 0.0 0.0 2020-02-05
52 51 Tempe, AZ US 1.0 0.0 0.0 2020-02-01
53 43 Seattle, WA US 1.0 0.0 1.0 2020-02-09
54 37 Orange, CA US 1.0 0.0 0.0 2020-02-01
55 7 Chicago US 1.0 NaN NaN 2020-01-24
56 31 Los Angeles, CA US 1.0 0.0 0.0 2020-02-01
57 1 Arizona US 1.0 NaN NaN 2020-02-01
58 4 Boston, MA US 1.0 0.0 0.0 2020-02-01
59 30 London, ON Canada 1.0 0.0 1.0 2020-02-12
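A quick toy check (made-up numbers, not from the data) of why the per-state `.max()` recovers the latest totals when counts are cumulative:

```python
import pandas as pd

# Two states, two cumulative daily snapshots each (made-up numbers)
toy = pd.DataFrame({
    "state": ["Hubei", "Hubei", "Hunan", "Hunan"],
    "Confirmed": [270.0, 444.0, 100.0, 150.0],
    "just_date": list(pd.to_datetime(["2020-01-21", "2020-01-22"]).date) * 2,
})

# Because cumulative counts only grow, the max per state is the latest total
latest = toy.groupby("state").max().reset_index()
```

Note this assumes counts are never revised downward; a downward revision would make the max stale.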

Insights:

In [16]:
places = np.array(master_viewnew.state)
first= master_viewnew.Confirmed
second = master_viewnew.Deaths
third = master_viewnew.Recovered


plt.figure(figsize=(10,6))
ax=plt.subplot()

plt.bar(range(len(places)), first)
plt.bar(range(len(places)), second, bottom=first)
plt.bar(range(len(places)), third, bottom=(first+second))

ax.set_xticks(range(len(places)))
ax.set_xticklabels(places, rotation = 90, fontsize=8)

plt.title("\nOverview of the Coronavirus Outbreak in Different States/Provinces\n", fontsize=16)
plt.ylabel("Total Infected cases\n", fontsize=14)
plt.xlabel("Provinces/States", fontsize=14)
plt.legend(["Confirmed",  "Deaths", "Recovered"], loc=1)

plt.show()

From the graph above, I drew two important insights:

  1. The coronavirus outbreak is still a regional outbreak centered in the Hubei province of China. It is not yet a country-wide outbreak, nor is it an international outbreak.

  2. Due to the huge differences in absolute numbers between areas, it is more useful to look at rates. Therefore, the following analysis focuses on comparing the growth rates across regions.

    Because of the uniqueness of the Hubei province, I separate Hubei from the rest of China. For convenience of analysis, I group the rest of mainland China together, and the regions outside of mainland China together.


Step 4: Analysis of the Confirmed cases


In [17]:
#separate the dataframe into three regions
Hubei_df = tillfeb13_cleandf.loc[tillfeb13_cleandf["state"]=="Hubei"]
restChina_df = tillfeb13_cleandf.loc[(tillfeb13_cleandf["region"]=="Mainland China") & (tillfeb13_cleandf["state"] != "Hubei")]
notChina_df = tillfeb13_cleandf.loc[tillfeb13_cleandf["region"] !="Mainland China"]

#organize all the dataframes in terms of day-by-day
Hubei_bydate = Hubei_df.groupby("just_date").sum().reset_index()
restChina_bydate = restChina_df.groupby("just_date").sum().reset_index()
notChina_bydate = notChina_df.groupby("just_date").sum().reset_index() 

#to print
Hubei_bydate
Out[17]:
just_date Confirmed Deaths Recovered
0 2020-01-21 270.0 6.0 25.0
1 2020-01-22 444.0 17.0 28.0
2 2020-01-23 444.0 17.0 28.0
3 2020-01-24 1098.0 48.0 62.0
4 2020-01-25 2542.0 131.0 106.0
5 2020-01-26 2481.0 128.0 86.0
6 2020-01-27 5560.0 252.0 137.0
7 2020-01-28 9822.0 350.0 212.0
8 2020-01-29 11694.0 412.0 258.0
9 2020-01-30 10709.0 366.0 206.0
10 2020-01-31 5806.0 204.0 141.0
11 2020-02-01 34557.0 1142.0 846.0
12 2020-02-02 22354.0 700.0 590.0
13 2020-02-03 11177.0 350.0 300.0
14 2020-02-04 57244.0 1721.0 1712.0
15 2020-02-05 16678.0 479.0 538.0
16 2020-02-06 41777.0 1167.0 1529.0
17 2020-02-07 22112.0 618.0 867.0
18 2020-02-08 24953.0 699.0 1218.0
19 2020-02-09 83831.0 2431.0 4715.0
20 2020-02-10 61359.0 1845.0 4076.0
21 2020-02-11 65094.0 2042.0 4949.0
22 2020-02-12 33366.0 1068.0 2686.0
23 2020-02-13 96412.0 2620.0 6900.0
In [18]:
#write a function that returns the day-over-day growth rate (in %) of a given column in a dataframe
def growth_rate(df, column):
    rate = [0]  # no previous day for the first row
    for x in range(1, len(df[column])):
        new = df[column][x] - df[column][x - 1]
        rate.append(new / df[column][x - 1] * 100)
    return rate
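For reference, pandas can compute the same day-over-day rate without a Python loop via `pct_change`; a minimal sketch on the first few Hubei confirmed counts (Jan. 21-24, from the table below):

```python
import pandas as pd

# First four Hubei confirmed counts (Jan 21-24)
confirmed = pd.Series([270.0, 444.0, 444.0, 1098.0])

# Day-over-day growth rate in %, with 0 for the first day (no previous value)
rate = (confirmed.pct_change(fill_method=None) * 100).fillna(0)
```

This produces the same values as the loop above (e.g. 64.44% on the second day) and scales better on large frames.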

Observation:

The table below is very interesting because on some days there are huge drops in the "Confirmed" cases despite small changes in the "Recovered" cases, for example on Jan. 30th, Feb. 4th, and Feb. 12th. This might be due to reporting errors.

In [19]:
#create a new column in each dataframe that stores the growth rate of all the confirmed cases
Hubei_bydate["confirmed_growthrate"] = growth_rate(Hubei_bydate, "Confirmed")
restChina_bydate["confirmed_growthrate"] = growth_rate(restChina_bydate, "Confirmed")
notChina_bydate["confirmed_growthrate"] = growth_rate(notChina_bydate, "Confirmed")

#to print
print(Hubei_bydate)
print(restChina_bydate)
print(notChina_bydate)
     just_date  Confirmed  Deaths  Recovered  confirmed_growthrate
0   2020-01-21      270.0     6.0       25.0              0.000000
1   2020-01-22      444.0    17.0       28.0             64.444444
2   2020-01-23      444.0    17.0       28.0              0.000000
3   2020-01-24     1098.0    48.0       62.0            147.297297
4   2020-01-25     2542.0   131.0      106.0            131.511840
5   2020-01-26     2481.0   128.0       86.0             -2.399685
6   2020-01-27     5560.0   252.0      137.0            124.103184
7   2020-01-28     9822.0   350.0      212.0             76.654676
8   2020-01-29    11694.0   412.0      258.0             19.059255
9   2020-01-30    10709.0   366.0      206.0             -8.423123
10  2020-01-31     5806.0   204.0      141.0            -45.783920
11  2020-02-01    34557.0  1142.0      846.0            495.194626
12  2020-02-02    22354.0   700.0      590.0            -35.312672
13  2020-02-03    11177.0   350.0      300.0            -50.000000
14  2020-02-04    57244.0  1721.0     1712.0            412.158898
15  2020-02-05    16678.0   479.0      538.0            -70.865069
16  2020-02-06    41777.0  1167.0     1529.0            150.491666
17  2020-02-07    22112.0   618.0      867.0            -47.071355
18  2020-02-08    24953.0   699.0     1218.0             12.848227
19  2020-02-09    83831.0  2431.0     4715.0            235.955597
20  2020-02-10    61359.0  1845.0     4076.0            -26.806313
21  2020-02-11    65094.0  2042.0     4949.0              6.087127
22  2020-02-12    33366.0  1068.0     2686.0            -48.741820
23  2020-02-13    96412.0  2620.0     6900.0            188.952826
     just_date  Confirmed  Deaths  Recovered  confirmed_growthrate
0   2020-01-21       56.0     0.0        0.0              0.000000
1   2020-01-22      103.0     0.0        0.0             83.928571
2   2020-01-23      195.0     1.0        2.0             89.320388
3   2020-01-24      683.0     4.0        8.0            250.256410
4   2020-01-25     2156.0     8.0       20.0            215.666179
5   2020-01-26     2318.0     8.0       14.0              7.513915
6   2020-01-27     4537.0    18.0       37.0             95.729077
7   2020-01-28     6252.0    19.0       66.0             37.800309
8   2020-01-29     8124.0    23.0       95.0             29.942418
9   2020-01-30     7073.0    18.0      108.0            -12.936977
10  2020-01-31     4007.0     9.0       75.0            -43.347943
11  2020-02-01    10052.0    22.0      224.0            150.860993
12  2020-02-02    16409.0    33.0      457.0             63.241146
13  2020-02-03    12448.0    22.0      431.0            -24.139192
14  2020-02-04    20510.0    31.0      884.0             64.765424
15  2020-02-05    14563.0    23.0      817.0            -28.995612
16  2020-02-06    16849.0    28.0     1327.0             15.697315
17  2020-02-07     9084.0    18.0      880.0            -46.085821
18  2020-02-08     9947.0    25.0     1193.0              9.500220
19  2020-02-09    21049.0    66.0     2781.0            111.611541
20  2020-02-10    30412.0   104.0     4680.0             44.481923
21  2020-02-11    12503.0    55.0     2268.0            -58.887939
22  2020-02-12    22118.0    91.0     4449.0             76.901544
23  2020-02-13    22182.0   104.0     4983.0              0.289357
     just_date  Confirmed  Deaths  Recovered  confirmed_growthrate
0   2020-01-21        6.0     0.0        0.0              0.000000
1   2020-01-22        8.0     0.0        0.0             33.333333
2   2020-01-23       14.0     0.0        0.0             75.000000
3   2020-01-24       41.0     0.0        0.0            192.857143
4   2020-01-25      113.0     0.0        0.0            175.609756
5   2020-01-26      113.0     0.0        6.0              0.000000
6   2020-01-27      189.0     0.0        9.0             67.256637
7   2020-01-28      251.0     0.0       18.0             32.804233
8   2020-01-29      303.0     0.0       18.0             20.717131
9   2020-01-30      229.0     0.0       16.0            -24.422442
10  2020-01-31      602.0     1.0       48.0            162.882096
11  2020-02-01      919.0     0.0       48.0             52.657807
12  2020-02-02      536.0    11.0        0.0            -41.675734
13  2020-02-03      264.0     0.0        5.0            -50.746269
14  2020-02-04      514.0     2.0       33.0             94.696970
15  2020-02-05      217.0     3.0        0.0            -57.782101
16  2020-02-06      409.0     1.0       47.0             88.479263
17  2020-02-07      561.0    10.0        6.0             37.163814
18  2020-02-08      575.0     2.0       91.0              2.495544
19  2020-02-09      716.0     1.0       46.0             24.521739
20  2020-02-10      925.0     3.0       29.0             29.189944
21  2020-02-11      649.0     2.0      117.0            -29.837838
22  2020-02-12      962.0     5.0       95.0             48.228043
23  2020-02-13      186.0     2.0       38.0            -80.665281

Observation and Insights

In [20]:
plt.figure(figsize=(10,6))
ax = plt.subplot()

plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.confirmed_growthrate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.confirmed_growthrate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.confirmed_growthrate, label="Internationally", marker="o")

ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
ax.set_yticks(range(-100, 500, 50))

plt.legend()
plt.title("\nGrowth Rate of Confirmed Cases\n", fontsize =16)
plt.ylabel("Rate of Growth per Day", fontsize = 14)
plt.show()
  1. The growth rates of confirmed cases are really volatile; as noted in the observation, this is probably due to errors in data reporting. It would be interesting to do some data treatment, for example replacing the data points with abnormal drops with an educated guess.

  2. In the Hubei province, the growth rates have been much higher than in other areas. This aligns with the analysis that it is a regional outbreak.

    To answer the question: Is the growth exponential? In other words, no matter which region you are in or how many cases your region has now, do all regions show a consistent growth rate in the number of infected cases? If so, no matter how many cases your region has, you should be as concerned as if you were in Hubei.

    If we compare Hubei to the other regions, the answer is uncertain, because Hubei's growth rate is significantly higher. However, as mentioned before, a sudden surge in growth rate, such as the one on Jan. 31st, can be due to a reporting error: if no reporting was done for several days, numbers would surge once reporting resumed.

    If we compare "Rest of China" to "Internationally", the answer is a clear yes. While the rest of China has many more confirmed cases in absolute numbers than the regions outside of China, the growth rates are about the same.

    As the growth rate in the Hubei province slows down and gets closer to the growth rate in the rest of the world during the later period, it is reasonable to conclude that the virus is spreading exponentially at a rate between 100%-150% every day.

  3. The graph seems to indicate a start-peak-decline pattern, with the first peak around Jan. 23rd/24th, the second around Feb. 1st/2nd, and the third around Feb. 9th/10th. Thus, in my later analysis, I divide the timeline into three periods and study the distribution of the growth rates in each period to see how the scale of the outbreaks changes.



Step 5: Analysis of Death and Recovered Cases


In this section, instead of studying the growth rates of death and recovered cases, I study the change in the mortality rate and the recovery rate. The reason is that if there are indeed reporting errors, we would very likely see volatile lines with no clear pattern again. It is more useful to study the overall mortality rate, the recovery rate, and their changes over time.

Observation and Insights

In [41]:
death_rate = lambda row: (row.Deaths)/(row.Confirmed) *100 

Hubei_bydate["death_rate"] = Hubei_bydate.apply(death_rate, axis=1)
restChina_bydate["death_rate"] = restChina_bydate.apply(death_rate, axis=1)
notChina_bydate["death_rate"] = notChina_bydate.apply(death_rate, axis=1)
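The per-row `apply` above can also be written as a plain column division, which is vectorized and faster; a minimal sketch with made-up numbers:

```python
import pandas as pd

# Toy frame (made-up numbers, not from the data)
toy = pd.DataFrame({"Confirmed": [200.0, 50.0], "Deaths": [4.0, 1.0]})

# Element-wise division over whole columns; no Python-level loop needed
toy["death_rate"] = toy["Deaths"] / toy["Confirmed"] * 100
```

On frames of this size the difference is negligible, but the column form is the idiomatic pandas style.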
In [42]:
plt.figure(figsize=(10,6))
ax = plt.subplot()
plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.death_rate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.death_rate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.death_rate, label="Internationally", marker="o")

#ax.set_ticks = Hubei_bydate.just_date
ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
plt.legend()
plt.ylabel("Rate of Death Cases / Total Confirmed Cases", fontsize =14)
plt.title("\nMortality Rate from Jan.21 to Feb 13th 2020\n", fontsize = 16)
plt.show()
  1. The mortality rate in the Hubei province was much higher than in other regions, but it has slowed down and stayed stable; at the end, it even started to show a sign of decline. This might be because, during the early days of the outbreak, the large number of cases in the Hubei province caused many local hospitals to run out of protective equipment like masks and gloves. The local hospitals might also have been overwhelmed by the number of new cases arriving in such a short period of time. Thus, patients could not be effectively treated.

    In the later period, the Chinese government enforced strict control of the distribution of medical equipment, namely that all supplies must go to the Hubei province first. And experts dedicated their efforts to help with the situation. Thus, the situation started to stabilize.

  2. The mortality rate in the rest of China has remained low and stable, while the international mortality rate has not looked as positive. This might be because international regions did not put measures into treating cases as effectively as China did, or it could simply be that the absolute numbers in other regions are much lower than in China and thus not statistically significant.

In conclusion, the mortality rate thus far shows that the coronavirus is not as lethal as SARS, and the death rate can be effectively controlled.

Observation and Insights

In [43]:
recover_rate = lambda row: (row.Recovered)/(row.Confirmed) *100 

Hubei_bydate["recover_rate"] = Hubei_bydate.apply(recover_rate, axis=1)
restChina_bydate["recover_rate"] = restChina_bydate.apply(recover_rate, axis=1)
notChina_bydate["recover_rate"] = notChina_bydate.apply(recover_rate, axis=1)
In [44]:
plt.figure(figsize=(10,6))
ax = plt.subplot()
plt.plot(range(len(Hubei_bydate.just_date)), Hubei_bydate.recover_rate, label="Hubei", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), restChina_bydate.recover_rate, label="Rest of China", marker="o")
plt.plot(range(len(Hubei_bydate.just_date)), notChina_bydate.recover_rate, label="Internationally", marker="o")

#ax.set_ticks = Hubei_bydate.just_date
ax.set_xticks(range(len(Hubei_bydate.just_date)))
ax.set_xticklabels(Hubei_bydate.just_date, rotation=90)
plt.legend()
plt.ylabel("Rate of Recovered / Total Confirmed Cases", fontsize =14)
plt.title("\nRecovered Rate From Jan.21st to Feb 13th\n", fontsize=16)
plt.show()

Overall, there is an increasing trend in the recovery rate. However, in international regions the rate has been very unstable, and in the Hubei province the recovery rate, though stable, is lower than in other regions.

Based on the news reports I have read, no region has an effective formula for treatment. The best thing a hospital can do is effectively isolate the patients and make sure no environmental factors cause them to get worse. The rest depends on the patients' own immune systems.

Thus, it is hard to draw any insights without a further look into the demographics of the infected people. Nonetheless, the general conclusion is that the coronavirus is not highly lethal.



Step 6: Statistical Analysis of the growth rate of the Confirmed cases


As I mentioned in step 4, I want to divide the timeline into three periods and study the distribution of the growth rates in each period to see the change in the scale of the outbreaks. In other words, though there seems to be a start-peak-decline pattern, I wonder whether the peak is getting lower each time, which would mean the outbreak of the virus has been effectively controlled.

In [45]:
from scipy.stats import iqr
In [46]:
#Because the growth rate at index 0 is 0, I start the analysis at index 1

hubei_cases = np.array([Hubei_bydate.confirmed_growthrate[1:]])
restChina_cases = np.array([restChina_bydate.confirmed_growthrate[1:]])
notChina_cases = np.array([notChina_bydate.confirmed_growthrate[1:]])
In [47]:
#some basic statistical analysis of the growth rate before looking at the distribution of growth rate

hubei_mean = np.mean(hubei_cases)
restChina_mean = np.mean(restChina_cases)
notChina_mean = np.mean(notChina_cases)

#std, IQR
hubei_std = np.std(hubei_cases)
restChina_std = np.std(restChina_cases)
notChina_std = np.std(notChina_cases)

hubei_iqr = iqr(hubei_cases)
restChina_iqr = iqr(restChina_cases)
notChina_iqr = iqr(notChina_cases)
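The IQR here is just the 75th minus the 25th percentile, which is what `scipy.stats.iqr` computes by default. A small numpy sketch (made-up rates) of why it resists an outlier while the standard deviation does not:

```python
import numpy as np

# Made-up growth rates, plus one reporting-error-like spike
rates = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
spiked = np.append(rates, 500.0)

# IQR = Q3 - Q1; equivalent to scipy.stats.iqr with default settings
def iqr_np(x):
    return np.percentile(x, 75) - np.percentile(x, 25)
```

The spike barely widens the IQR (20 to 25) but multiplies the standard deviation roughly twelvefold, which is why the IQR is the better spread measure for this noisy data.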

Observation and Insights:

In [48]:
print(hubei_mean, restChina_mean, notChina_mean)
print(hubei_iqr, restChina_iqr, notChina_iqr)
print(hubei_std, restChina_std, notChina_std)
75.18937854754112 49.26579335806146 37.07668646428257
170.4640607619937 92.94828977602447 83.339539706183
144.51253775633018 78.18191456387466 71.04516656692313

The mean of the Hubei province is a lot higher than the other two regions, as expected. All three regions' growth rates are volatile, since their standard deviations are large. I also computed the interquartile range (IQR) to see the spread of the deviation, since the IQR is not affected by outliers. The IQR also indicates that the spread is big. Therefore, there is even more reason to divide the analysis into different periods.

Three periods:

  • First period: until Jan. 31st
  • Second period: Feb. 1st until Feb. 7th
  • Last period: Feb. 8th until Feb. 13th
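The three periods can also be labeled programmatically rather than with hard-coded index slices; a sketch using `pd.cut` on the date range, with the bin edges taken from the period boundaries above:

```python
import pandas as pd

# Every date covered by the data
dates = pd.Series(pd.date_range("2020-01-21", "2020-02-13"))

# cut uses half-open intervals (left, right], so start the first edge one day early
edges = pd.to_datetime(["2020-01-20", "2020-01-31", "2020-02-07", "2020-02-13"])
period = pd.cut(dates, bins=edges, labels=["first", "second", "last"])
```

Merging such a label column onto the by-date frames would let one `groupby("period")` instead of slicing `[1:11]`, `[11:18]`, `[18:]` by hand.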

Observation and Insights:

In [49]:
plt.figure(figsize=(10, 6))
ax = plt.subplot()
plt.hist(Hubei_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, alpha=0.2, label = "Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, histtype='step', label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[1:11], range=(-100, 500), bins=30, alpha=0.2, label="Internationally")

ax.set_xticks(range(-100, 500, 25))

plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Jan 21st to Jan. 31st (1st Period)\n", fontsize=16)
plt.legend()
plt.show()
In [50]:
plt.figure(figsize=(10, 6))

ax = plt.subplot()

plt.hist(Hubei_bydate.confirmed_growthrate[11:18], range=(-100, 500),bins=30, alpha=0.2, label="Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[11:18], range=(-100, 500), bins=30, histtype='step', label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[11:18], range=(-100, 500), bins=30, alpha=0.2, label="Internationally")

ax.set_xticks(range(-100, 500, 25))

plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Feb 1st to Feb. 7th (2nd Period)\n", fontsize=16)
plt.legend()

plt.show()
In [51]:
plt.figure(figsize=(10, 6))

ax = plt.subplot()

plt.hist(Hubei_bydate.confirmed_growthrate[18:], range=(-100,500), bins=30, alpha=0.2, label="Hubei")
plt.hist(restChina_bydate.confirmed_growthrate[18:], range=(-100,500), bins=30, histtype='step', label="Rest of China")
plt.hist(notChina_bydate.confirmed_growthrate[18:], range=(-100,500), bins =30, alpha=0.2, label="Internationally")

ax.set_xticks(range(-100, 500, 25))

plt.ylabel("Occurrence of Growth Rate", fontsize = 14)
plt.xlabel("\nGrowth rate", fontsize = 14)
plt.title("\nDistribution of Confirmed Cases' Growth Rate \n From Feb 8th to Feb. 13th (3rd Period)\n", fontsize=16)
plt.legend()

plt.show()

Though one can argue that, due to the low number of occurrences, the following analysis has no statistical significance, it might still be interesting to look for a pattern, or to see whether there is one.

My observations are:

  • In the first period, the data is a little all over the place, but the growth rates centered around 0-100%.
  • In the second period, the data got extreme: many days had negative growth rates, and there were also 2 extremely large growth rates.
  • In the third period, the extreme numbers are clearly less extreme than in the second period, and the growth rates centered around 0-60%.

In conclusion, the virus is spreading exponentially, at a rate between 100%-150% EVERY DAY. Though the rate of growth is showing a sign of decline, it is still alarming.

Future Studies:

  1. Verify the accuracy of the current dataset.
  2. It would be interesting and important to build a predictive model based on the dataset. This is also my own goal for the next step; I am still new to programming and haven't learned predictive modelling yet.
  3. If we could have demographic data on infected, recovered, and deceased patients, there is the potential to do a K-means cluster analysis.

    Many news reports have stated that older people and people with weaker immune systems are more likely to get infected, and more likely to suffer a fatal outcome from the virus. Since the flu also causes deaths every year, this cluster analysis could also be compared with a cluster analysis of the flu to see how they differ. This might give the public a better indication of what they can do and whether they should panic over this virus.

All comments are welcome. Thanks!

In [ ]: