Problem 1: Students with an Infectious Disease

Suppose we are interested in the number of high school students diagnosed with an infectious disease as a function of the number of days since the initial outbreak. The data can be loaded as follows:

students = read.csv("data/students.csv")
head(students)
##   day cases
## 1   1     6
## 2   2     8
## 3   3    12
## 4   3     9
## 5   4     3
## 6   4     3

Can you find a suitable model for this dataset?

Solution

We can first visualize the data as follows:

plot(students$day, students$cases, xlim = c(0,120), ylim = c(0,12),
     xlab = "Days since initial outbreak",
     ylab = "Number of diagnosed students",
     pch = 16, col = "#F8766D99")
grid()

As we are modeling count data, we can use a Poisson regression, which models the logarithm of the expected count as a linear function of the covariates. There are two functions in R that can fit a Poisson regression. The first is the classical glm() function, specifying family = poisson.

fit.glm = glm(cases ~ day, data = students, family = poisson)
summary(fit.glm)
## 
## Call:
## glm(formula = cases ~ day, family = poisson, data = students)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00482  -0.85719  -0.09331   0.63969   1.73696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.990235   0.083935   23.71   <2e-16 ***
## day         -0.017463   0.001727  -10.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 215.36  on 108  degrees of freedom
## Residual deviance: 101.17  on 107  degrees of freedom
## AIC: 393.11
## 
## Number of Fisher Scoring iterations: 5
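
To see what the fitted model implies, we can overlay the estimated mean function \(\hat{\lambda}(\text{day}) = \exp(1.990 - 0.0175 \cdot \text{day})\) on the scatter plot. A minimal sketch, assuming the plot above is still the active graphics device:

# Overlay the fitted Poisson mean on the scatter plot
day_grid = data.frame(day = seq(0, 120, length.out = 200))
lines(day_grid$day, predict(fit.glm, newdata = day_grid, type = "response"),
      col = "darkred", lwd = 2)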

In this case, we can perform the model check as follows:

par(mfrow = c(2,2))
plot(fit.glm)

Alternatively, you can use the gamlss() function in the gamlss R package by specifying family = PO.

library(gamlss)
fit.gamlss = gamlss(cases ~ day, data = students, family = PO)
## GAMLSS-RS iteration 1: Global Deviance = 389.1082 
## GAMLSS-RS iteration 2: Global Deviance = 389.1082
summary(fit.gamlss)
## ******************************************************************
## Family:  c("PO", "Poisson") 
## 
## Call:  gamlss(formula = cases ~ day, family = PO, data = students) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.990235   0.083898   23.72   <2e-16 ***
## day         -0.017463   0.001724  -10.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  109 
## Degrees of Freedom for the fit:  2
##       Residual Deg. of Freedom:  107 
##                       at cycle:  2 
##  
## Global Deviance:     389.1082 
##             AIC:     393.1082 
##             SBC:     398.4909 
## ******************************************************************

The gamlss approach provides a nicer summary to assess how well the model fits the data at hand. If the model is adequate, the randomised quantile residuals reported below should behave approximately like an i.i.d. N(0,1) sample (mean 0, variance 1, skewness 0, kurtosis 3, and a Filliben correlation coefficient close to 1):

par(mfrow = c(1,1))
plot(fit.gamlss)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  0.03783439 
##                        variance   =  0.7860004 
##                coef. of skewness  =  -0.2081693 
##                coef. of kurtosis  =  2.384078 
## Filliben correlation coefficient  =  0.9899953 
## ******************************************************************

Problem 2: Pima Indians Diabetes

Consider a dataset on 768 women of Pima Indian heritage, all at least 21 years old. This dataset includes the following variables:

- pregnant: number of times pregnant
- glucose: plasma glucose concentration (2-hour glucose tolerance test)
- pressure: diastolic blood pressure (mm Hg)
- triceps: triceps skin fold thickness (mm)
- insulin: 2-hour serum insulin (mu U/ml)
- mass: body mass index (weight in kg/(height in m)^2)
- pedigree: diabetes pedigree function
- age: age (years)
- diabetes: diabetes diagnosis (pos or neg)

The dataset is stored in the mlbench R package and you can load it as follows:

library(mlbench)
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)
## [1] 768   9
head(PimaIndiansDiabetes)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg

Can you find a suitable model to predict the probability of a woman of Pima Indian heritage being diagnosed with diabetes based on these variables (or a subset of them)?

Solution

As we are modeling binary data (positive or negative diagnosis of diabetes), we fit a logistic regression. We start by fitting an initial model with all covariates included (without interactions):

fit.glm = glm(diabetes ~ ., data = PimaIndiansDiabetes, family = binomial(link="logit"))
summary(fit.glm)
## 
## Call:
## glm(formula = diabetes ~ ., family = binomial(link = "logit"), 
##     data = PimaIndiansDiabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5566  -0.7274  -0.4159   0.7267   2.9297  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
## pregnant     0.1231823  0.0320776   3.840 0.000123 ***
## glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
## pressure    -0.0132955  0.0052336  -2.540 0.011072 *  
## triceps      0.0006190  0.0068994   0.090 0.928515    
## insulin     -0.0011917  0.0009012  -1.322 0.186065    
## mass         0.0897010  0.0150876   5.945 2.76e-09 ***
## pedigree     0.9451797  0.2991475   3.160 0.001580 ** 
## age          0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5

We can see that among these 8 variables, some appear not to be significant, suggesting that we may find a smaller model with fewer variables. As an example, we can use the step() function to perform stepwise model selection with the AIC, removing variables iteratively from our initial model. This approach is known as “stepwise backward AIC” and is a heuristic that avoids exploring ALL possible models. Indeed, we have in this case \(2^8 = 256\) models without interactions and \(2^{36} = 68,719,476,736\) with first-order interactions. If fitting one model took 1 second, we would need over 2000 years to fit them all. Various modern methods improve on this stepwise approach for model selection, but they are beyond the scope of this class.
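
Before running it, it may help to see what a single backward step considers: drop1() reports the AIC of every model obtained by deleting one term from the current fit, and step() simply iterates this, removing the most beneficial term until no deletion lowers the AIC. A minimal sketch:

# AIC of each one-term-deletion model; step() iterates this procedure
drop1(fit.glm)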

fit.glm.aic = step(fit.glm, trace = FALSE)
summary(fit.glm.aic)
## 
## Call:
## glm(formula = diabetes ~ pregnant + glucose + pressure + insulin + 
##     mass + pedigree + age, family = binomial(link = "logit"), 
##     data = PimaIndiansDiabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5617  -0.7286  -0.4156   0.7271   2.9297  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.4051362  0.7167033 -11.727  < 2e-16 ***
## pregnant     0.1231724  0.0320688   3.841 0.000123 ***
## glucose      0.0351123  0.0036625   9.587  < 2e-16 ***
## pressure    -0.0132136  0.0051537  -2.564 0.010350 *  
## insulin     -0.0011570  0.0008142  -1.421 0.155275    
## mass         0.0900886  0.0144619   6.229 4.68e-10 ***
## pedigree     0.9475954  0.2980063   3.180 0.001474 ** 
## age          0.0147888  0.0092897   1.592 0.111393    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 760  degrees of freedom
## AIC: 739.45
## 
## Number of Fisher Scoring iterations: 5

We now find a model with 7 variables (i.e. triceps has been removed) and a slightly smaller AIC.

AIC(fit.glm)
## [1] 741.4454
AIC(fit.glm.aic)
## [1] 739.4534

It is also possible to consider a forward approach starting from a simple model. This can be done as follows:

# Initial model
fit.glm.initial = glm(diabetes ~ 1, data = PimaIndiansDiabetes, 
                      family = binomial(link="logit"))

# Find a model with a forward approach using the AIC
fit.glm.aic.forward = step(fit.glm.initial, 
                           scope = list(lower = formula(fit.glm.initial),
                                        upper = formula(fit.glm)), 
                           direction = "forward", trace = FALSE)

In this case, the two approaches (i.e. forward and backward) select the same model:

AIC(fit.glm.aic)
## [1] 739.4534
AIC(fit.glm.aic.forward)
## [1] 739.4534
summary(fit.glm.aic.forward)
## 
## Call:
## glm(formula = diabetes ~ glucose + mass + pregnant + pedigree + 
##     pressure + age + insulin, family = binomial(link = "logit"), 
##     data = PimaIndiansDiabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5617  -0.7286  -0.4156   0.7271   2.9297  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.4051362  0.7167033 -11.727  < 2e-16 ***
## glucose      0.0351123  0.0036625   9.587  < 2e-16 ***
## mass         0.0900886  0.0144619   6.229 4.68e-10 ***
## pregnant     0.1231724  0.0320688   3.841 0.000123 ***
## pedigree     0.9475954  0.2980063   3.180 0.001474 ** 
## pressure    -0.0132136  0.0051537  -2.564 0.010350 *  
## age          0.0147888  0.0092897   1.592 0.111393    
## insulin     -0.0011570  0.0008142  -1.421 0.155275    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 760  degrees of freedom
## AIC: 739.45
## 
## Number of Fisher Scoring iterations: 5
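
Since logistic regression models the log-odds of a positive diagnosis, exponentiating the coefficients gives odds ratios, which are often easier to interpret. A quick sketch:

# Odds ratios: e.g. exp(0.0351) is about 1.036, so each additional unit of
# glucose increases the odds of a positive diagnosis by about 3.6%, all else equal
exp(coef(fit.glm.aic))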

Now let’s check how reliable our model is. If the predicted probability is larger than 0.5, we predict that the individual has diabetes (positive); otherwise we predict negative. We can then compute the in-sample classification accuracy by comparing the predicted classes to the observed ones over the whole sample.

class_predict = fit.glm.aic$fitted.values > 0.5
in_accuracy = mean((PimaIndiansDiabetes$diabetes == "pos") == class_predict)
in_accuracy
## [1] 0.7838542
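
Overall accuracy hides the two types of error, so it can also be useful to cross-tabulate the predicted and observed classes. A quick sketch:

# Confusion matrix of predicted vs observed diagnoses
table(predicted = ifelse(class_predict, "pos", "neg"),
      observed = PimaIndiansDiabetes$diabetes)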

In this case, we have 78.39% in-sample classification accuracy. Is that high? To judge, we can compare it with the accuracy of blindly predicting the majority class:

table(PimaIndiansDiabetes$diabetes)/dim(PimaIndiansDiabetes)[1]
## 
##       neg       pos 
## 0.6510417 0.3489583

So the in-sample classification accuracy (78.39%) is clearly higher than the 65.10% we would obtain by blindly predicting every individual to be negative (the majority class). Therefore, our model appears to be working properly.

Now let’s estimate the out-of-sample classification accuracy using 10-fold cross-validation with cv.glm(), whose delta component contains the raw and bias-corrected estimates of prediction performance.

library(boot)
cost = function(resp, pred){
  mean(resp == (pred > 0.5))
}
out_accuracy = cv.glm(PimaIndiansDiabetes, fit.glm.aic, cost, K = 10)$delta[2]
out_accuracy
## [1] 0.7770844
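
As a sanity check that cv.glm() is measuring what we intend, the same cost function evaluated on the in-sample fitted values reproduces the in-sample accuracy computed earlier:

# Should reproduce in_accuracy (0.7838542)
cost(PimaIndiansDiabetes$diabetes == "pos", fit.glm.aic$fitted.values)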

In this case, we have 77.71% out-of-sample classification accuracy, which is very close to the in-sample accuracy. This is expected since the number of observations (\(n=768\)) is large compared to the number of parameters to estimate (\(p=8\)). So we again find that our model is working properly.

Lastly, we can use the gamlss() function to perform the model check:

fit.gamlss = gamlss(formula(fit.glm.aic), data=PimaIndiansDiabetes, family=BI)
## GAMLSS-RS iteration 1: Global Deviance = 723.4534 
## GAMLSS-RS iteration 2: Global Deviance = 723.4534
plot(fit.gamlss)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.03763598 
##                        variance   =  0.9638481 
##                coef. of skewness  =  -0.02939412 
##                coef. of kurtosis  =  3.057154 
## Filliben correlation coefficient  =  0.9985066 
## ******************************************************************

Problem 3: Breast Cancer

Let’s consider a breast cancer dataset from the mlbench R package. In this example, the objective is to predict whether a cancer is malignant or benign from biopsy details. This dataset includes 699 observations on 11 variables. You can load the data as follows:

data(BreastCancer)
dim(BreastCancer)
## [1] 699  11
head(BreastCancer)
##        Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025            5         1          1             1            2
## 2 1002945            5         4          4             5            7
## 3 1015425            3         1          1             1            2
## 4 1016277            6         8          8             1            3
## 5 1017023            4         1          1             3            2
## 6 1017122            8        10         10             8            7
##   Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           1           3               1       1    benign
## 2          10           3               2       1    benign
## 3           2           3               1       1    benign
## 4           4           3               7       1    benign
## 5           1           3               1       1    benign
## 6          10           9               7       1 malignant
table(BreastCancer$Class)
## 
##    benign malignant 
##       458       241

More information about this dataset can be found by typing ?BreastCancer in R. Can you find a suitable model to identify benign or malignant classes based on these variables (or a subset of them)?

Solution

We notice that there are some missing values in the dataset. Therefore, we start by removing the observations with missing data.
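
To locate the missing values before discarding them, we can count the NAs per column; a quick check (in this dataset the 16 missing values all occur in Bare.nuclei):

# Count missing values in each column
colSums(is.na(BreastCancer))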

BreastCancer = na.omit(BreastCancer)
dim(BreastCancer)
## [1] 683  11

After removing the missing values, we are left with 683 observations. We start our analysis by fitting an initial model with all covariates included (without interactions):

# Initial model with all covariates (excluding the Id column)
fit.glm = glm(Class ~ Cl.thickness + Cell.size + Cell.shape + Marg.adhesion +
                Epith.c.size + Bare.nuclei + Bl.cromatin + Normal.nucleoli +
                Mitoses, data = BreastCancer, family = binomial)
summary(fit.glm)
## 
## Call:
## glm(formula = Class ~ Cl.thickness + Cell.size + Cell.shape + 
##     Marg.adhesion + Epith.c.size + Bare.nuclei + Bl.cromatin + 
##     Normal.nucleoli + Mitoses, family = binomial, data = BreastCancer)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -4.716e-05  -2.100e-08  -2.100e-08   2.100e-08   6.022e-05  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -3.869e+01  2.309e+05   0.000    1.000
## Cl.thickness.L     1.059e+02  2.239e+05   0.000    1.000
## Cl.thickness.Q     6.363e+01  1.423e+05   0.000    1.000
## Cl.thickness.C     1.608e+01  2.223e+05   0.000    1.000
## Cl.thickness^4    -2.484e+01  1.381e+05   0.000    1.000
## Cl.thickness^5     6.103e+00  1.457e+05   0.000    1.000
## Cl.thickness^6     2.743e+00  1.372e+05   0.000    1.000
## Cl.thickness^7    -9.061e+00  1.362e+05   0.000    1.000
## Cl.thickness^8    -4.665e+01  7.187e+04  -0.001    0.999
## Cl.thickness^9    -3.427e+01  1.100e+05   0.000    1.000
## Cell.size.L        7.550e+01  1.987e+05   0.000    1.000
## Cell.size.Q        3.664e+01  1.555e+05   0.000    1.000
## Cell.size.C        1.124e+01  1.351e+05   0.000    1.000
## Cell.size^4       -1.961e+01  2.352e+05   0.000    1.000
## Cell.size^5       -3.314e+00  1.387e+05   0.000    1.000
## Cell.size^6       -1.896e+01  1.406e+05   0.000    1.000
## Cell.size^7       -1.854e+01  2.071e+05   0.000    1.000
## Cell.size^8        3.015e+01  2.368e+05   0.000    1.000
## Cell.size^9        8.849e+00  1.569e+05   0.000    1.000
## Cell.shape.L       7.971e+00  2.927e+05   0.000    1.000
## Cell.shape.Q      -1.180e+01  1.134e+05   0.000    1.000
## Cell.shape.C      -9.712e+00  8.956e+04   0.000    1.000
## Cell.shape^4      -3.652e+01  2.466e+05   0.000    1.000
## Cell.shape^5      -7.247e+01  1.684e+05   0.000    1.000
## Cell.shape^6      -4.502e+00  1.728e+05   0.000    1.000
## Cell.shape^7       2.170e+00  1.199e+05   0.000    1.000
## Cell.shape^8       3.628e+01  8.433e+04   0.000    1.000
## Cell.shape^9       6.191e+00  1.189e+05   0.000    1.000
## Marg.adhesion.L    4.963e+01  5.606e+05   0.000    1.000
## Marg.adhesion.Q    3.897e+00  2.566e+05   0.000    1.000
## Marg.adhesion.C   -2.299e+01  2.795e+05   0.000    1.000
## Marg.adhesion^4   -9.591e+00  5.985e+05   0.000    1.000
## Marg.adhesion^5    3.782e+01  7.330e+05   0.000    1.000
## Marg.adhesion^6    5.233e+01  6.011e+05   0.000    1.000
## Marg.adhesion^7    9.672e+00  4.422e+05   0.000    1.000
## Marg.adhesion^8    6.512e+00  2.933e+05   0.000    1.000
## Marg.adhesion^9    2.836e+01  2.041e+05   0.000    1.000
## Epith.c.size.L    -1.649e+01  4.187e+05   0.000    1.000
## Epith.c.size.Q     5.587e+01  2.287e+05   0.000    1.000
## Epith.c.size.C     5.557e+01  1.568e+05   0.000    1.000
## Epith.c.size^4     6.545e+01  1.929e+05   0.000    1.000
## Epith.c.size^5    -1.757e+01  2.691e+05   0.000    1.000
## Epith.c.size^6    -1.309e+01  2.066e+05   0.000    1.000
## Epith.c.size^7     5.877e+01  1.845e+05   0.000    1.000
## Epith.c.size^8     7.058e+01  1.486e+05   0.000    1.000
## Epith.c.size^9     8.315e+00  2.374e+05   0.000    1.000
## Bare.nuclei2       6.625e+00  1.956e+05   0.000    1.000
## Bare.nuclei3       4.188e+01  8.760e+04   0.000    1.000
## Bare.nuclei4       8.282e+01  9.880e+04   0.001    0.999
## Bare.nuclei5       4.103e+01  2.489e+04   0.002    0.999
## Bare.nuclei6       2.328e+02  5.750e+05   0.000    1.000
## Bare.nuclei7       4.607e+01  2.794e+05   0.000    1.000
## Bare.nuclei8       4.272e+01  1.480e+05   0.000    1.000
## Bare.nuclei9       1.152e+02  2.895e+05   0.000    1.000
## Bare.nuclei10      5.540e+01  1.034e+05   0.001    1.000
## Bl.cromatin2       4.600e+01  2.358e+05   0.000    1.000
## Bl.cromatin3       4.604e+01  2.271e+05   0.000    1.000
## Bl.cromatin4       9.047e+01  3.080e+05   0.000    1.000
## Bl.cromatin5       5.218e+01  1.573e+05   0.000    1.000
## Bl.cromatin6       3.742e+01  4.482e+05   0.000    1.000
## Bl.cromatin7       5.050e+01  1.897e+05   0.000    1.000
## Bl.cromatin8       5.987e+01  2.609e+05   0.000    1.000
## Bl.cromatin9       1.032e+02  3.755e+05   0.000    1.000
## Bl.cromatin10      5.455e+01  1.414e+05   0.000    1.000
## Normal.nucleoli2  -7.285e+01  2.696e+05   0.000    1.000
## Normal.nucleoli3  -7.518e+00  7.727e+04   0.000    1.000
## Normal.nucleoli4  -5.315e+00  1.547e+05   0.000    1.000
## Normal.nucleoli5  -1.185e+01  1.118e+05   0.000    1.000
## Normal.nucleoli6   4.114e+01  9.762e+04   0.000    1.000
## Normal.nucleoli7  -7.552e+01  1.229e+05  -0.001    1.000
## Normal.nucleoli8  -3.302e+01  1.344e+05   0.000    1.000
## Normal.nucleoli9   6.847e+01  3.585e+05   0.000    1.000
## Normal.nucleoli10  5.635e+01  1.921e+05   0.000    1.000
## Mitoses2           2.298e+01  1.881e+05   0.000    1.000
## Mitoses3           2.081e+01  1.113e+05   0.000    1.000
## Mitoses4          -1.031e+01  2.044e+05   0.000    1.000
## Mitoses5          -6.506e+01  1.345e+06   0.000    1.000
## Mitoses6          -1.343e+02  5.162e+05   0.000    1.000
## Mitoses7           2.372e+01  3.136e+05   0.000    1.000
## Mitoses8          -2.499e+01  2.638e+05   0.000    1.000
## Mitoses10         -1.638e+01  5.524e+05   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.8435e+02  on 682  degrees of freedom
## Residual deviance: 4.6605e-08  on 602  degrees of freedom
## AIC: 162
## 
## Number of Fisher Scoring iterations: 25

As we can see from the output, something is clearly wrong with this initial model: the residual deviance is essentially zero, the coefficient estimates are huge, and their standard errors are enormous, so no term appears significant. These are classic symptoms of a degenerate, overfitted fit. Why is that?

To fit a logistic regression, a quantity called the Events Per Variable (EPV) can give us an idea of the largest model we can consider given the number of observations. The EPV is defined as the number of occurrences of the least frequent outcome divided by the number of estimated parameters. An EPV of 10 is a widely advocated minimal criterion for sample size considerations in logistic regression analysis. However, there is a lack of evidence supporting this rule of thumb, and some studies have pointed out the urgent need for new research to provide more valid guidance; more details can be found, for example, in the literature on sample size for logistic regression.

n = min(table(BreastCancer$Class)) # occurrences of the least frequent outcome
p = 24 # number of parameters giving an EPV of about 10
EPV = n/p
EPV
## [1] 9.958333
length(fit.glm$coefficients) # actual number of parameters in the initial model
## [1] 81

In our case, in order to keep the EPV at approximately 10, we should use at most about 24 parameters, but our initial model has 81! The reason is that every covariate is a factor with up to 10 levels, so each one is expanded into up to 9 polynomial-contrast or dummy terms (e.g. Cl.thickness.L, Bare.nuclei2 in the output above). This model can therefore not be considered suitable. Let’s find a more appropriate model using a forward approach based on the AIC, starting from a simple model.

# Initial model with forward approach using AIC
fit.glm.initial = glm(Class ~ 1, data = BreastCancer, family = binomial)
forward.fit = step(fit.glm.initial, 
                   scope = list(lower = formula(fit.glm.initial),
                                upper = formula(fit.glm)), 
                   direction = "forward", trace = FALSE)

summary(forward.fit)
## 
## Call:
## glm(formula = Class ~ Cell.size + Bare.nuclei + Normal.nucleoli + 
##     Cl.thickness + Epith.c.size, family = binomial, data = BreastCancer)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.870e-04  -2.100e-08  -2.100e-08   2.100e-08   2.952e-04  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -2.101e+02  1.298e+06   0.000    1.000
## Cell.size.L        4.352e+02  3.214e+06   0.000    1.000
## Cell.size.Q        9.072e+01  3.388e+06   0.000    1.000
## Cell.size.C        3.484e+02  2.942e+06   0.000    1.000
## Cell.size^4        1.973e+02  2.184e+06   0.000    1.000
## Cell.size^5       -5.729e+01  1.395e+06   0.000    1.000
## Cell.size^6       -6.150e+01  7.597e+05   0.000    1.000
## Cell.size^7        3.009e+01  3.453e+05   0.000    1.000
## Cell.size^8        3.172e+02  1.267e+05   0.003    0.998
## Cell.size^9        1.641e+02  7.807e+04   0.002    0.998
## Bare.nuclei2       1.849e+02  7.542e+04   0.002    0.998
## Bare.nuclei3       2.781e+02  1.786e+04   0.016    0.988
## Bare.nuclei4       5.251e+02  2.845e+04   0.018    0.985
## Bare.nuclei5       3.052e+02  1.986e+04   0.015    0.988
## Bare.nuclei6       7.870e+02  1.634e+05   0.005    0.996
## Bare.nuclei7       2.316e+02  7.146e+04   0.003    0.997
## Bare.nuclei8       3.779e+02  2.066e+04   0.018    0.985
## Bare.nuclei9       8.364e+02  6.488e+06   0.000    1.000
## Bare.nuclei10      5.709e+02  2.981e+04   0.019    0.985
## Normal.nucleoli2  -2.326e+02  9.463e+04  -0.002    0.998
## Normal.nucleoli3   1.734e+02  1.687e+04   0.010    0.992
## Normal.nucleoli4   7.623e+01  1.258e+04   0.006    0.995
## Normal.nucleoli5  -7.981e+01  1.218e+04  -0.007    0.995
## Normal.nucleoli6   3.742e+02  2.458e+04   0.015    0.988
## Normal.nucleoli7  -1.962e+02  4.358e+04  -0.005    0.996
## Normal.nucleoli8  -3.963e+02  6.488e+06   0.000    1.000
## Normal.nucleoli9   3.113e+02  5.117e+04   0.006    0.995
## Normal.nucleoli10  6.950e+02  3.975e+04   0.017    0.986
## Cl.thickness.L     4.306e+02  2.501e+06   0.000    1.000
## Cl.thickness.Q     9.716e+01  1.130e+06   0.000    1.000
## Cl.thickness.C    -9.360e+01  9.817e+05   0.000    1.000
## Cl.thickness^4    -2.303e+02  2.669e+06   0.000    1.000
## Cl.thickness^5     1.877e+02  3.253e+06   0.000    1.000
## Cl.thickness^6     5.946e+01  2.778e+06   0.000    1.000
## Cl.thickness^7    -2.786e+00  1.786e+06   0.000    1.000
## Cl.thickness^8    -1.142e+02  8.517e+05   0.000    1.000
## Cl.thickness^9    -2.004e+02  2.690e+05  -0.001    0.999
## Epith.c.size.L    -3.852e+02  9.984e+04  -0.004    0.997
## Epith.c.size.Q     1.813e+02  6.512e+04   0.003    0.998
## Epith.c.size.C     1.441e+02  5.790e+04   0.002    0.998
## Epith.c.size^4     4.865e+02  9.611e+04   0.005    0.996
## Epith.c.size^5    -4.167e+01  1.085e+05   0.000    1.000
## Epith.c.size^6     1.606e+02  9.475e+04   0.002    0.999
## Epith.c.size^7     3.708e+02  6.309e+04   0.006    0.995
## Epith.c.size^8     1.486e+02  4.153e+04   0.004    0.997
## Epith.c.size^9     1.208e+02  1.543e+04   0.008    0.994
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.8435e+02  on 682  degrees of freedom
## Residual deviance: 1.1045e-06  on 637  degrees of freedom
## AIC: 92
## 
## Number of Fisher Scoring iterations: 25
length(forward.fit$coefficients)
## [1] 46

As we can see, this forward approach based on the AIC yields a model with 46 parameters, which is still far above the roughly 24 suggested by the EPV criterion, and the fit shows the same symptoms of overfitting as before. Alternatively, we can use a forward approach based on the Bayesian Information Criterion (BIC), which penalizes complex models more heavily: its penalty is \(\log(n) \approx 6.53\) per parameter here, instead of 2 for the AIC.

forward.fit.BIC = step(fit.glm.initial, 
                       scope = list(lower = formula(fit.glm.initial),
                                    upper = formula(fit.glm)), 
                       direction = "forward", 
                       k = log(nrow(BreastCancer)), # penalty k = log(n) gives the BIC
                       trace = FALSE)

summary(forward.fit.BIC)
## 
## Call:
## glm(formula = Class ~ Cell.size + Bare.nuclei, family = binomial, 
##     data = BreastCancer)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.72619  -0.08258  -0.08258   0.00002   2.48330  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.9250   449.6376   0.009 0.993035    
## Cell.size.L     13.7074   926.8888   0.015 0.988201    
## Cell.size.Q     -5.5694  1717.3389  -0.003 0.997412    
## Cell.size.C      8.7572   980.1640   0.009 0.992871    
## Cell.size^4     18.0289  1513.3939   0.012 0.990495    
## Cell.size^5      5.0943   965.9765   0.005 0.995792    
## Cell.size^6     -8.7193  1298.5416  -0.007 0.994642    
## Cell.size^7      1.4402  1352.2992   0.001 0.999150    
## Cell.size^8     10.0580  1077.5574   0.009 0.992553    
## Cell.size^9      0.3970  2350.9625   0.000 0.999865    
## Bare.nuclei2     1.4633     0.7854   1.863 0.062451 .  
## Bare.nuclei3     2.4390     0.7052   3.459 0.000543 ***
## Bare.nuclei4     2.9375     0.9198   3.194 0.001406 ** 
## Bare.nuclei5     3.3585     0.8342   4.026 5.67e-05 ***
## Bare.nuclei6    21.7317  6957.5574   0.003 0.997508    
## Bare.nuclei7     3.5982     1.2541   2.869 0.004117 ** 
## Bare.nuclei8     2.6608     1.0250   2.596 0.009436 ** 
## Bare.nuclei9    20.9855  5042.4955   0.004 0.996679    
## Bare.nuclei10    5.4373     0.8476   6.415 1.41e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 884.35  on 682  degrees of freedom
## Residual deviance: 134.94  on 664  degrees of freedom
## AIC: 172.94
## 
## Number of Fisher Scoring iterations: 19
length(forward.fit.BIC$coefficients) # Better!
## [1] 19

With BIC we obtain a model with only 19 parameters. Now let’s check how reliable this model is. We can again compute the in-sample and the out-of-sample classification accuracy.

class_predict = forward.fit.BIC$fitted.values > 0.5
in_accuracy = mean((BreastCancer$Class == "malignant") == class_predict)
in_accuracy
## [1] 0.9604685
table(BreastCancer$Class)/length(BreastCancer$Class)
## 
##    benign malignant 
## 0.6500732 0.3499268
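
As before, a confusion matrix shows where the remaining in-sample errors occur. A quick sketch:

# Confusion matrix of predicted vs observed classes
table(predicted = ifelse(class_predict, "malignant", "benign"),
      observed = BreastCancer$Class)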

So the in-sample classification accuracy (96.05%) is much higher than the 65.01% we would obtain by blindly classifying every case as benign (the majority class). Therefore, our model appears to be working properly.

Now let’s consider the out-of-sample classification accuracy.

out_accuracy = cv.glm(BreastCancer, forward.fit.BIC, cost, K = 10)$delta[2]
out_accuracy
## [1] 0.9484618

The resulting out-of-sample classification accuracy is very similar to the in-sample one, which again suggests that our model is performing reliably.

Lastly, we can again use the gamlss() function to perform the model check:

fit.gamlss = gamlss(formula(forward.fit.BIC), data = BreastCancer, family=BI)
## GAMLSS-RS iteration 1: Global Deviance = 134.9427 
## GAMLSS-RS iteration 2: Global Deviance = 134.9424
summary(fit.gamlss)
## ******************************************************************
## Family:  c("BI", "Binomial") 
## 
## Call:  gamlss(formula = formula(forward.fit.BIC), family = BI,  
##     data = BreastCancer) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  logit
## Mu Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.1250    36.9077   0.058 0.954104    
## Cell.size.L    10.7348    76.0883   0.141 0.887847    
## Cell.size.Q    -4.5250   140.9702  -0.032 0.974403    
## Cell.size.C     6.0367    80.4639   0.075 0.940219    
## Cell.size^4    11.9704   124.2184   0.096 0.923259    
## Cell.size^5     3.8053    79.2849   0.048 0.961734    
## Cell.size^6    -5.6832   106.5880  -0.053 0.957494    
## Cell.size^7     1.1240   110.9913   0.010 0.991923    
## Cell.size^8     6.8044    88.4487   0.077 0.938702    
## Cell.size^9     0.3698   192.9519   0.002 0.998471    
## Bare.nuclei2    1.4633     0.7854   1.863 0.062892 .  
## Bare.nuclei3    2.4390     0.7052   3.459 0.000578 ***
## Bare.nuclei4    2.9375     0.9198   3.194 0.001472 ** 
## Bare.nuclei5    3.3585     0.8342   4.026 6.33e-05 ***
## Bare.nuclei6   15.7316   571.1036   0.028 0.978033    
## Bare.nuclei7    3.5982     1.2541   2.869 0.004247 ** 
## Bare.nuclei8    2.6608     1.0250   2.596 0.009644 ** 
## Bare.nuclei9   14.9855   413.9059   0.036 0.971130    
## Bare.nuclei10   5.4373     0.8476   6.415 2.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  683 
## Degrees of Freedom for the fit:  19
##       Residual Deg. of Freedom:  664 
##                       at cycle:  2 
##  
## Global Deviance:     134.9424 
##             AIC:     172.9424 
##             SBC:     258.9458 
## ******************************************************************
plot(fit.gamlss)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.04395755 
##                        variance   =  1.027992 
##                coef. of skewness  =  -0.1255655 
##                coef. of kurtosis  =  2.971741 
## Filliben correlation coefficient  =  0.9989149 
## ******************************************************************