In this problem, we are interested in studying the predictability of the need to be admitted to an Intensive Care Unit (ICU) for COVID-19 patients. The data can be loaded as follows:

# Import data
covid = read.csv("data/covid.csv")
# First look
head(covid)
##   ic sex age ldh spo2
## 1  0   0  86 251   90
## 2  0   1  71 319   95
## 3  0   0  76 113   90
## 4  0   0  90 315   90
## 5  0   0  79 402   93
## 6  0   0  52 226   95
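
Before modelling, it can be useful to take a quick look at the structure of the data. As a small optional check (not part of the original analysis), we can inspect the variable types, the sample size and whether there are missing values:

# Optional sanity checks on the data
str(covid)              # variable types and a preview of the values
dim(covid)              # number of observations and variables
colSums(is.na(covid))   # missing values per variable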

As our response variable, admission to ICU (ic), is binary (0 for no, 1 for yes), we consider fitting a logistic regression. We start by fitting an initial model with all covariates included (without interactions):

fit.glm = glm(ic ~ sex + age + ldh + spo2, data = covid, family = binomial(link="logit"))
summary(fit.glm)
## 
## Call:
## glm(formula = ic ~ sex + age + ldh + spo2, family = binomial(link = "logit"), 
##     data = covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8526  -0.7429  -0.4085   0.6182   1.8504  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  7.226545   7.504417   0.963   0.3356  
## sex          1.567246   0.772703   2.028   0.0425 *
## age         -0.013913   0.021722  -0.641   0.5218  
## ldh          0.003797   0.001868   2.033   0.0421 *
## spo2        -0.106755   0.072588  -1.471   0.1414  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.499  on 63  degrees of freedom
## Residual deviance: 58.501  on 59  degrees of freedom
## AIC: 68.501
## 
## Number of Fisher Scoring iterations: 5
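
Since the model uses a logit link, the coefficients are on the log-odds scale. As an optional aid to interpretation (not required for the selection below), we can exponentiate them to obtain odds ratios, together with Wald confidence intervals:

# Odds ratios and Wald confidence intervals (optional, for interpretation)
exp(coef(fit.glm))
exp(confint.default(fit.glm))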

We can see that some variables do not appear significant, suggesting that we may be able to find a smaller model with fewer variables. As an example, we can use the step() function to perform stepwise model selection based on the AIC, removing variables iteratively from our initial model. This approach is known as “stepwise backward AIC”; it is a heuristic method that avoids exploring all possible models. It can be done as follows:

# Stepwise backward AIC
fit.glm.aic.backward = step(fit.glm, trace = FALSE)
summary(fit.glm.aic.backward)
## 
## Call:
## glm(formula = ic ~ sex + ldh + spo2, family = binomial(link = "logit"), 
##     data = covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8765  -0.7099  -0.4200   0.6055   1.8993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  4.926065   6.417664   0.768   0.4427  
## sex          1.679296   0.757965   2.216   0.0267 *
## ldh          0.003892   0.001857   2.096   0.0361 *
## spo2        -0.092701   0.067652  -1.370   0.1706  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.499  on 63  degrees of freedom
## Residual deviance: 58.914  on 60  degrees of freedom
## AIC: 66.914
## 
## Number of Fisher Scoring iterations: 5

Using stepwise backward AIC, we indeed find a smaller model with fewer variables. Let’s check whether the AIC is reduced:

AIC(fit.glm)
## [1] 68.50092
AIC(fit.glm.aic.backward)
## [1] 66.91434
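
To see what a single backward step does under the hood, we can (as an optional aside) use drop1(), which reports the AIC obtained when each covariate is removed in turn from the initial model; step() repeats this comparison until no removal decreases the AIC:

# AIC when dropping each covariate from fit.glm (the default penalty k = 2 corresponds to the AIC)
drop1(fit.glm)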

It is also possible to consider a forward approach, starting from a simple model (here the intercept-only model) and adding variables iteratively. This can be done as follows:

# Stepwise forward AIC

# Initial model
fit.glm.initial = glm(ic ~ 1, data = covid, family = binomial(link="logit"))

# Find a model with a forward approach using the AIC
fit.glm.aic.forward = step(fit.glm.initial, 
                           scope = list(lower = formula(fit.glm.initial),
                                        upper = formula(fit.glm)), 
                           direction = "forward", trace = FALSE)

This will also give a smaller model with reduced AIC:

AIC(fit.glm)
## [1] 68.50092
AIC(fit.glm.aic.backward)
## [1] 66.91434
AIC(fit.glm.aic.forward)
## [1] 66.91434

In this case, the two approaches (i.e. forward and backward) actually select the same model:

summary(fit.glm.aic.forward)
## 
## Call:
## glm(formula = ic ~ ldh + sex + spo2, family = binomial(link = "logit"), 
##     data = covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8765  -0.7099  -0.4200   0.6055   1.8993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  4.926065   6.417664   0.768   0.4427  
## ldh          0.003892   0.001857   2.096   0.0361 *
## sex          1.679296   0.757965   2.216   0.0267 *
## spo2        -0.092701   0.067652  -1.370   0.1706  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.499  on 63  degrees of freedom
## Residual deviance: 58.914  on 60  degrees of freedom
## AIC: 66.914
## 
## Number of Fisher Scoring iterations: 5
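
The step() function can also combine the two strategies with direction = "both", allowing variables to be added or removed at each step. As an optional check (the object name fit.glm.aic.both is only used here for illustration), we can run it and compare the selected formula with the previous results:

# Stepwise selection in both directions (optional check)
fit.glm.aic.both = step(fit.glm.initial,
                        scope = list(lower = formula(fit.glm.initial),
                                     upper = formula(fit.glm)),
                        direction = "both", trace = FALSE)
formula(fit.glm.aic.both)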

Therefore, the significant variables are sex and ldh. Now let’s check how reliable our model is. If the predicted probability is larger than 0.5, we predict that the individual will be admitted to ICU (ic=1); otherwise, we predict that the individual will not be admitted to ICU (ic=0). We can then compute the in-sample classification accuracy by comparing the predicted classes to the observed values over the whole sample.

# in-sample classification accuracy
class_predict = fit.glm.aic.forward$fitted.values > 0.5
in_accuracy = mean((covid$ic == 1) == class_predict)
in_accuracy
## [1] 0.734375
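
Beyond the overall accuracy, a confusion matrix (an optional extra) shows how the predictions and the observed outcomes are distributed across the two classes:

# Confusion matrix of predicted vs observed ICU admissions (optional)
table(Predicted = class_predict, Observed = covid$ic)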

In this case, we have 73.44% in-sample classification accuracy. Is that high?

n = dim(covid)[1] # sample size
table(covid$ic)/n
## 
##      0      1 
## 0.6875 0.3125
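
The proportion of the majority class (ic = 0) is the accuracy of the naive classifier that always predicts no ICU admission. As a small sketch, this 68.75% baseline can also be computed directly:

# Accuracy of always predicting the majority class (naive baseline)
max(table(covid$ic)) / n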

So the in-sample classification accuracy (73.44%) is higher than what we would obtain by blindly predicting that no individual is admitted to ICU, i.e. always predicting the majority class (68.75%). Therefore, our model appears to be working properly.

Now let’s consider the out-of-sample classification accuracy, estimated here by 10-fold cross-validation with the cv.glm() function from the boot package.

library(boot)

# Cost function: proportion of correct classifications using a 0.5 threshold
cost = function(resp, pred){
  mean(resp == (pred > 0.5))
}
out_accuracy = cv.glm(covid, fit.glm.aic.forward, cost, K = 10)$delta[2]
out_accuracy
## [1] 0.7116699
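
Note that the K-fold split in cv.glm() is random, so the exact value will change slightly from run to run. A minimal sketch for making the result reproducible is to fix the random seed first (the seed value below is arbitrary):

# Reproducible 10-fold cross-validation (arbitrary seed)
set.seed(123)
cv.glm(covid, fit.glm.aic.forward, cost, K = 10)$delta[2]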

In this case, we have a 71.17% out-of-sample classification accuracy, which is very similar to the in-sample classification accuracy. This is because we have a relatively large number of observations (n=64) compared to the number of parameters to estimate (p=4), so the model is unlikely to be overfitting. So we again verify that our model is working properly.

Lastly, we can use the gamlss() function in the gamlss R package to check the fit of the model through its randomised quantile residuals:

library(gamlss)
fit.gamlss = gamlss(formula(fit.glm.aic.forward), data=covid, family=BI)
## GAMLSS-RS iteration 1: Global Deviance = 58.9143 
## GAMLSS-RS iteration 2: Global Deviance = 58.9143
plot(fit.gamlss)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  0.02456862 
##                        variance   =  0.7784057 
##                coef. of skewness  =  0.147583 
##                coef. of kurtosis  =  2.681554 
## Filliben correlation coefficient  =  0.9936886 
## ******************************************************************

The randomised quantile residuals have a mean close to 0, a variance reasonably close to 1, skewness close to 0, kurtosis close to 3 and a Filliben correlation coefficient close to 1, suggesting that they are approximately standard normal and providing no clear evidence against the model.