Suppose we are interested in the number of high school students diagnosed with an infectious disease as a function of the number of days from the initial outbreak. The data can be loaded as follows:
students = read.csv("data/students.csv")
head(students)
## day cases
## 1 1 6
## 2 2 8
## 3 3 12
## 4 3 9
## 5 4 3
## 6 4 3
Can you find a suitable model for this dataset?
We can first visualize the data as follows:
plot(students$day, students$cases, xlim = c(0,120), ylim = c(0,12),
xlab = "Days since initial outbreak",
ylab = "Number of diagnosed students",
pch = 16, col = "#F8766D99")
grid()
Since we are modeling count data, a Poisson regression is a natural choice. There are two functions in R that can fit a Poisson regression. The first is the classical glm() function, with family = poisson.
fit.glm = glm(cases ~ day, data = students, family = poisson)
summary(fit.glm)
##
## Call:
## glm(formula = cases ~ day, family = poisson, data = students)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.00482 -0.85719 -0.09331 0.63969 1.73696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.990235 0.083935 23.71 <2e-16 ***
## day -0.017463 0.001727 -10.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 215.36 on 108 degrees of freedom
## Residual deviance: 101.17 on 107 degrees of freedom
## AIC: 393.11
##
## Number of Fisher Scoring iterations: 5
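Since the Poisson regression uses a log link, the coefficients act multiplicatively on the expected count. A minimal sketch of how one might interpret them, reusing the fitted object above (output not shown):
# Multiplicative change in the expected number of cases per additional day
exp(coef(fit.glm)["day"])
# Expected number of cases at day = 0 (intercept on the count scale)
exp(coef(fit.glm)["(Intercept)"])
Here exp(-0.0175) is roughly 0.983, i.e. the expected number of diagnosed students decreases by about 1.7% per day, and exp(1.99) corresponds to roughly 7.3 expected cases at the start of the outbreak.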
We can then check the model using the standard diagnostic plots:
par(mfrow = c(2,2))
plot(fit.glm)
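We can also visualize how well the estimated mean function tracks the data by overlaying the fitted Poisson mean on the scatter plot; a sketch reusing the plotting code above:
par(mfrow = c(1,1)) # back to a single plotting panel
plot(students$day, students$cases, xlim = c(0,120), ylim = c(0,12),
     xlab = "Days since initial outbreak",
     ylab = "Number of diagnosed students",
     pch = 16, col = "#F8766D99")
grid()
# Fitted mean curve on the count scale: exp(beta0 + beta1 * day)
day.grid = seq(0, 120, by = 1)
lines(day.grid, predict(fit.glm, newdata = data.frame(day = day.grid),
                        type = "response"), lwd = 2)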
Alternatively, you can use the gamlss() function in the gamlss R package by specifying family = PO.
library(gamlss)
fit.gamlss = gamlss(cases ~ day, data = students, family = PO)
## GAMLSS-RS iteration 1: Global Deviance = 389.1082
## GAMLSS-RS iteration 2: Global Deviance = 389.1082
summary(fit.gamlss)
## ******************************************************************
## Family: c("PO", "Poisson")
##
## Call: gamlss(formula = cases ~ day, family = PO, data = students)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: log
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.990235 0.083898 23.72 <2e-16 ***
## day -0.017463 0.001724 -10.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 109
## Degrees of Freedom for the fit: 2
## Residual Deg. of Freedom: 107
## at cycle: 2
##
## Global Deviance: 389.1082
## AIC: 393.1082
## SBC: 398.4909
## ******************************************************************
The gamlss approach provides a more informative summary to assess how well the model fits the data at hand: for a well-specified model, the randomised quantile residuals should behave approximately like a standard normal sample (mean ≈ 0, variance ≈ 1, skewness ≈ 0, kurtosis ≈ 3), and the Filliben correlation coefficient should be close to 1.
par(mfrow = c(1,1))
plot(fit.gamlss)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = 0.03783439
## variance = 0.7860004
## coef. of skewness = -0.2081693
## coef. of kurtosis = 2.384078
## Filliben correlation coefficient = 0.9899953
## ******************************************************************
Consider a dataset on 768 women of Pima Indian heritage, all at least 21 years old. This dataset includes the following variables:
pregnant: Number of times pregnant
glucose: Plasma glucose concentration in an oral glucose tolerance test
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
mass: Body mass index (weight in kg/(height in m)^2)
pedigree: Diabetes pedigree function
age: Age of the patients (years)
diabetes: Class variable (test for diabetes)
The dataset is stored in the mlbench R package and you can load it as follows:
library(mlbench)
data(PimaIndiansDiabetes)
dim(PimaIndiansDiabetes)
## [1] 768 9
head(PimaIndiansDiabetes)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
Can you find a suitable model to predict the probability of a woman of Pima Indian heritage being diagnosed with diabetes based on these variables (or a subset of them)?
Since we are modeling binary data (positive or negative diabetes test), we fit a logistic regression. We start with an initial model that includes all covariates (without interactions):
fit.glm = glm(diabetes ~ ., data = PimaIndiansDiabetes, family = binomial(link="logit"))
summary(fit.glm)
##
## Call:
## glm(formula = diabetes ~ ., family = binomial(link = "logit"),
## data = PimaIndiansDiabetes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5566 -0.7274 -0.4159 0.7267 2.9297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4046964 0.7166359 -11.728 < 2e-16 ***
## pregnant 0.1231823 0.0320776 3.840 0.000123 ***
## glucose 0.0351637 0.0037087 9.481 < 2e-16 ***
## pressure -0.0132955 0.0052336 -2.540 0.011072 *
## triceps 0.0006190 0.0068994 0.090 0.928515
## insulin -0.0011917 0.0009012 -1.322 0.186065
## mass 0.0897010 0.0150876 5.945 2.76e-09 ***
## pedigree 0.9451797 0.2991475 3.160 0.001580 **
## age 0.0148690 0.0093348 1.593 0.111192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 759 degrees of freedom
## AIC: 741.45
##
## Number of Fisher Scoring iterations: 5
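Before simplifying the model, note that for a logistic regression the exponentiated coefficients are odds ratios. A quick sketch using the fitted object above (output not shown):
# Multiplicative change in the odds of a positive test per one-unit
# increase in each covariate
exp(coef(fit.glm))
For instance, exp(0.1232) is roughly 1.13 for pregnant: each additional pregnancy is associated with about 13% higher odds of a positive diabetes test, holding the other covariates fixed.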
We can see that among these 8 variables, some do not appear significant, suggesting that we may be able to find a smaller model with fewer variables. As an example, we can use the step() function to perform stepwise model selection based on the AIC, removing variables iteratively from our initial model. This approach is known as “stepwise backward AIC” and it is a heuristic method that avoids exploring ALL candidate models. Indeed, we have in this case \(2^8 = 256\) models without interactions and \(2^{36} = 68,719,476,736\) models once first-order interactions are included as candidates. If fitting one model took 1 second, we would need over 2000 years to fit them all. Various modern methods improve on this stepwise approach for model selection, but they are beyond the scope of this class.
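These counts, and the rough time estimate, are easy to verify directly in R:
2^8                          # candidate models with 8 covariates, no interactions
8 + choose(8, 2)             # main effects plus first-order interactions: 36 terms
2^36                         # candidate models once interactions are included
2^36 / (60 * 60 * 24 * 365)  # years needed at one second per model fit
We now run the backward stepwise selection starting from the full model: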
fit.glm.aic = step(fit.glm, trace = FALSE)
summary(fit.glm.aic)
##
## Call:
## glm(formula = diabetes ~ pregnant + glucose + pressure + insulin +
## mass + pedigree + age, family = binomial(link = "logit"),
## data = PimaIndiansDiabetes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5617 -0.7286 -0.4156 0.7271 2.9297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4051362 0.7167033 -11.727 < 2e-16 ***
## pregnant 0.1231724 0.0320688 3.841 0.000123 ***
## glucose 0.0351123 0.0036625 9.587 < 2e-16 ***
## pressure -0.0132136 0.0051537 -2.564 0.010350 *
## insulin -0.0011570 0.0008142 -1.421 0.155275
## mass 0.0900886 0.0144619 6.229 4.68e-10 ***
## pedigree 0.9475954 0.2980063 3.180 0.001474 **
## age 0.0147888 0.0092897 1.592 0.111393
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 760 degrees of freedom
## AIC: 739.45
##
## Number of Fisher Scoring iterations: 5
We now find a model with 7 variables (i.e. triceps has been removed) and a slightly smaller AIC.
AIC(fit.glm)
## [1] 741.4454
AIC(fit.glm.aic)
## [1] 739.4534
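Since the reduced model is nested within the initial one (only triceps was dropped), we can also compare the two fits with a likelihood-ratio test; a sketch (output not shown):
# Analysis of deviance: does dropping triceps significantly worsen the fit?
anova(fit.glm.aic, fit.glm, test = "Chisq")
Given the very large p-value of triceps in the full model, we would expect this test to confirm that the simpler model is adequate.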
It is also possible to consider a forward approach starting from a simple model. This can be done as follows:
# Initial model
fit.glm.initial = glm(diabetes ~ 1, data = PimaIndiansDiabetes,
                      family = binomial(link="logit"))
# Find a model with a forward approach using the AIC
fit.glm.aic.forward = step(fit.glm.initial,
                           scope = list(lower = formula(fit.glm.initial),
                                        upper = formula(fit.glm)),
                           direction = "forward", trace = FALSE)
In this case, the forward and backward approaches select the same model:
AIC(fit.glm.aic)
## [1] 739.4534
AIC(fit.glm.aic.forward)
## [1] 739.4534
summary(fit.glm.aic.forward)
##
## Call:
## glm(formula = diabetes ~ glucose + mass + pregnant + pedigree +
## pressure + age + insulin, family = binomial(link = "logit"),
## data = PimaIndiansDiabetes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5617 -0.7286 -0.4156 0.7271 2.9297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4051362 0.7167033 -11.727 < 2e-16 ***
## glucose 0.0351123 0.0036625 9.587 < 2e-16 ***
## mass 0.0900886 0.0144619 6.229 4.68e-10 ***
## pregnant 0.1231724 0.0320688 3.841 0.000123 ***
## pedigree 0.9475954 0.2980063 3.180 0.001474 **
## pressure -0.0132136 0.0051537 -2.564 0.010350 *
## age 0.0147888 0.0092897 1.592 0.111393
## insulin -0.0011570 0.0008142 -1.421 0.155275
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 760 degrees of freedom
## AIC: 739.45
##
## Number of Fisher Scoring iterations: 5
Now let’s check how reliable our model is. If the predicted probability is larger than 0.5, we predict that the individual has diabetes (positive); otherwise, we predict that the individual is negative. We can then compute the in-sample classification accuracy by comparing the predicted classes to the observed ones over the whole sample.
class_predict = fit.glm.aic$fitted.values > 0.5
in_accuracy = mean((PimaIndiansDiabetes$diabetes == "pos") == class_predict)
in_accuracy
## [1] 0.7838542
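Beyond the overall accuracy, a confusion matrix shows where the classification errors occur; a minimal sketch (output not shown):
# Cross-tabulate the 0.5-threshold predictions against the observed classes
table(predicted = ifelse(class_predict, "pos", "neg"),
      observed = PimaIndiansDiabetes$diabetes)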
In this case, we have 78.39% in-sample classification accuracy. Is that high?
table(PimaIndiansDiabetes$diabetes)/dim(PimaIndiansDiabetes)[1]
##
## neg pos
## 0.6510417 0.3489583
So the in-sample classification accuracy is higher than the roughly 65% we would obtain by blindly predicting every individual to be negative (the majority class). Therefore, our model is working properly.
Now let’s consider the out-of-sample classification accuracy.
library(boot)
# Classification accuracy used as the "cost" function for cv.glm(): the
# proportion of observations whose 0.5-thresholded predicted probability
# matches the observed response
cost = function(resp, pred){
  mean(resp == (pred > 0.5))
}
out_accuracy = cv.glm(PimaIndiansDiabetes, fit.glm.aic, cost, K = 10)$delta[2]
out_accuracy
## [1] 0.7770844
In this case, we have a 77.71% out-of-sample classification accuracy (delta[2] is the bias-adjusted cross-validation estimate), which is very similar to the in-sample classification accuracy. This is because we have a large number of observations (\(n=768\)) compared to the number of parameters to estimate (\(p=8\)). So we again verify that our model is working properly.
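Once we are satisfied with the model, we can also use it to predict the probability of diabetes for a new individual. A sketch with purely illustrative covariate values:
# Hypothetical new patient (values made up for illustration only)
new_patient = data.frame(pregnant = 2, glucose = 120, pressure = 70,
                         insulin = 80, mass = 32, pedigree = 0.5, age = 35)
predict(fit.glm.aic, newdata = new_patient, type = "response")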
Lastly, we can use the gamlss() function to perform the model check:
fit.gamlss = gamlss(formula(fit.glm.aic), data=PimaIndiansDiabetes, family=BI)
## GAMLSS-RS iteration 1: Global Deviance = 723.4534
## GAMLSS-RS iteration 2: Global Deviance = 723.4534
plot(fit.gamlss)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = -0.03763598
## variance = 0.9638481
## coef. of skewness = -0.02939412
## coef. of kurtosis = 3.057154
## Filliben correlation coefficient = 0.9985066
## ******************************************************************
Let’s consider a breast cancer dataset from the mlbench R package. In this example, the objective is to predict whether a cancer is malignant or benign from biopsy details. This dataset includes 699 observations on 11 variables. You can load the data as follows:
data(BreastCancer)
dim(BreastCancer)
## [1] 699 11
head(BreastCancer)
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025 5 1 1 1 2
## 2 1002945 5 4 4 5 7
## 3 1015425 3 1 1 1 2
## 4 1016277 6 8 8 1 3
## 5 1017023 4 1 1 3 2
## 6 1017122 8 10 10 8 7
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 1 1 3 1 1 benign
## 2 10 3 2 1 benign
## 3 2 3 1 1 benign
## 4 4 3 7 1 benign
## 5 1 3 1 1 benign
## 6 10 9 7 1 malignant
table(BreastCancer$Class)
##
## benign malignant
## 458 241
More information about this dataset can be found by typing ?BreastCancer in R. Can you find a suitable model to identify benign or malignant classes based on these variables (or a subset of them)?
We notice that there are some missing values in the dataset. Therefore, we start by removing the observations with missing data.
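Before doing so, we can check which variables actually contain missing values; a quick sketch (output not shown):
# Number of missing values per variable
colSums(is.na(BreastCancer))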
BreastCancer = na.omit(BreastCancer)
dim(BreastCancer)
## [1] 683 11
After removing the missing values, we are left with 683 observations. We start our analysis by fitting an initial model with all covariates included (without interactions):
# Initial model with all covariates (Id excluded)
fit.glm = glm(Class ~ Cl.thickness + Cell.size + Cell.shape + Marg.adhesion +
                Epith.c.size + Bare.nuclei + Bl.cromatin + Normal.nucleoli +
                Mitoses,
              data = BreastCancer, family = binomial)
summary(fit.glm)
##
## Call:
## glm(formula = Class ~ Cl.thickness + Cell.size + Cell.shape +
## Marg.adhesion + Epith.c.size + Bare.nuclei + Bl.cromatin +
## Normal.nucleoli + Mitoses, family = binomial, data = BreastCancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.716e-05 -2.100e-08 -2.100e-08 2.100e-08 6.022e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.869e+01 2.309e+05 0.000 1.000
## Cl.thickness.L 1.059e+02 2.239e+05 0.000 1.000
## Cl.thickness.Q 6.363e+01 1.423e+05 0.000 1.000
## Cl.thickness.C 1.608e+01 2.223e+05 0.000 1.000
## Cl.thickness^4 -2.484e+01 1.381e+05 0.000 1.000
## Cl.thickness^5 6.103e+00 1.457e+05 0.000 1.000
## Cl.thickness^6 2.743e+00 1.372e+05 0.000 1.000
## Cl.thickness^7 -9.061e+00 1.362e+05 0.000 1.000
## Cl.thickness^8 -4.665e+01 7.187e+04 -0.001 0.999
## Cl.thickness^9 -3.427e+01 1.100e+05 0.000 1.000
## Cell.size.L 7.550e+01 1.987e+05 0.000 1.000
## Cell.size.Q 3.664e+01 1.555e+05 0.000 1.000
## Cell.size.C 1.124e+01 1.351e+05 0.000 1.000
## Cell.size^4 -1.961e+01 2.352e+05 0.000 1.000
## Cell.size^5 -3.314e+00 1.387e+05 0.000 1.000
## Cell.size^6 -1.896e+01 1.406e+05 0.000 1.000
## Cell.size^7 -1.854e+01 2.071e+05 0.000 1.000
## Cell.size^8 3.015e+01 2.368e+05 0.000 1.000
## Cell.size^9 8.849e+00 1.569e+05 0.000 1.000
## Cell.shape.L 7.971e+00 2.927e+05 0.000 1.000
## Cell.shape.Q -1.180e+01 1.134e+05 0.000 1.000
## Cell.shape.C -9.712e+00 8.956e+04 0.000 1.000
## Cell.shape^4 -3.652e+01 2.466e+05 0.000 1.000
## Cell.shape^5 -7.247e+01 1.684e+05 0.000 1.000
## Cell.shape^6 -4.502e+00 1.728e+05 0.000 1.000
## Cell.shape^7 2.170e+00 1.199e+05 0.000 1.000
## Cell.shape^8 3.628e+01 8.433e+04 0.000 1.000
## Cell.shape^9 6.191e+00 1.189e+05 0.000 1.000
## Marg.adhesion.L 4.963e+01 5.606e+05 0.000 1.000
## Marg.adhesion.Q 3.897e+00 2.566e+05 0.000 1.000
## Marg.adhesion.C -2.299e+01 2.795e+05 0.000 1.000
## Marg.adhesion^4 -9.591e+00 5.985e+05 0.000 1.000
## Marg.adhesion^5 3.782e+01 7.330e+05 0.000 1.000
## Marg.adhesion^6 5.233e+01 6.011e+05 0.000 1.000
## Marg.adhesion^7 9.672e+00 4.422e+05 0.000 1.000
## Marg.adhesion^8 6.512e+00 2.933e+05 0.000 1.000
## Marg.adhesion^9 2.836e+01 2.041e+05 0.000 1.000
## Epith.c.size.L -1.649e+01 4.187e+05 0.000 1.000
## Epith.c.size.Q 5.587e+01 2.287e+05 0.000 1.000
## Epith.c.size.C 5.557e+01 1.568e+05 0.000 1.000
## Epith.c.size^4 6.545e+01 1.929e+05 0.000 1.000
## Epith.c.size^5 -1.757e+01 2.691e+05 0.000 1.000
## Epith.c.size^6 -1.309e+01 2.066e+05 0.000 1.000
## Epith.c.size^7 5.877e+01 1.845e+05 0.000 1.000
## Epith.c.size^8 7.058e+01 1.486e+05 0.000 1.000
## Epith.c.size^9 8.315e+00 2.374e+05 0.000 1.000
## Bare.nuclei2 6.625e+00 1.956e+05 0.000 1.000
## Bare.nuclei3 4.188e+01 8.760e+04 0.000 1.000
## Bare.nuclei4 8.282e+01 9.880e+04 0.001 0.999
## Bare.nuclei5 4.103e+01 2.489e+04 0.002 0.999
## Bare.nuclei6 2.328e+02 5.750e+05 0.000 1.000
## Bare.nuclei7 4.607e+01 2.794e+05 0.000 1.000
## Bare.nuclei8 4.272e+01 1.480e+05 0.000 1.000
## Bare.nuclei9 1.152e+02 2.895e+05 0.000 1.000
## Bare.nuclei10 5.540e+01 1.034e+05 0.001 1.000
## Bl.cromatin2 4.600e+01 2.358e+05 0.000 1.000
## Bl.cromatin3 4.604e+01 2.271e+05 0.000 1.000
## Bl.cromatin4 9.047e+01 3.080e+05 0.000 1.000
## Bl.cromatin5 5.218e+01 1.573e+05 0.000 1.000
## Bl.cromatin6 3.742e+01 4.482e+05 0.000 1.000
## Bl.cromatin7 5.050e+01 1.897e+05 0.000 1.000
## Bl.cromatin8 5.987e+01 2.609e+05 0.000 1.000
## Bl.cromatin9 1.032e+02 3.755e+05 0.000 1.000
## Bl.cromatin10 5.455e+01 1.414e+05 0.000 1.000
## Normal.nucleoli2 -7.285e+01 2.696e+05 0.000 1.000
## Normal.nucleoli3 -7.518e+00 7.727e+04 0.000 1.000
## Normal.nucleoli4 -5.315e+00 1.547e+05 0.000 1.000
## Normal.nucleoli5 -1.185e+01 1.118e+05 0.000 1.000
## Normal.nucleoli6 4.114e+01 9.762e+04 0.000 1.000
## Normal.nucleoli7 -7.552e+01 1.229e+05 -0.001 1.000
## Normal.nucleoli8 -3.302e+01 1.344e+05 0.000 1.000
## Normal.nucleoli9 6.847e+01 3.585e+05 0.000 1.000
## Normal.nucleoli10 5.635e+01 1.921e+05 0.000 1.000
## Mitoses2 2.298e+01 1.881e+05 0.000 1.000
## Mitoses3 2.081e+01 1.113e+05 0.000 1.000
## Mitoses4 -1.031e+01 2.044e+05 0.000 1.000
## Mitoses5 -6.506e+01 1.345e+06 0.000 1.000
## Mitoses6 -1.343e+02 5.162e+05 0.000 1.000
## Mitoses7 2.372e+01 3.136e+05 0.000 1.000
## Mitoses8 -2.499e+01 2.638e+05 0.000 1.000
## Mitoses10 -1.638e+01 5.524e+05 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.8435e+02 on 682 degrees of freedom
## Residual deviance: 4.6605e-08 on 602 degrees of freedom
## AIC: 162
##
## Number of Fisher Scoring iterations: 25
As we can see from the output, something is clearly wrong with this initial model: the standard errors are enormous, every p-value is essentially 1, and the residual deviance is practically zero, meaning the model fits the training data perfectly. Why is that?
For a logistic regression, a quantity called the Events Per Variable (EPV) gives an idea of the largest model we can consider given the number of observations. The EPV is defined as the number of occurrences of the least frequent outcome divided by the number of parameters (covariates) in the model. An EPV of at least 10 is a widely advocated minimal criterion for sample size considerations in logistic regression analysis. However, there is a lack of evidence supporting this rule of thumb, and some studies have pointed out the urgent need for new research to provide more valid guidance. More details can, for example, be found here.
n = min(table(BreastCancer$Class))
p = 24 # assumed number of parameters
EPV = n/p
EPV
## [1] 9.958333
length(fit.glm$coefficients) # actual number of parameters in the initial model
## [1] 81
In our case, in order to have an EPV of approximately 10, we should use at most around 24 parameters, but our initial model has 81 parameters! Therefore, this model cannot be considered suitable. Let’s find a more appropriate model using a forward approach based on the AIC, starting from a simple model.
# Initial model with forward approach using AIC
fit.glm.initial = glm(Class ~ 1, data = BreastCancer, family = binomial)
forward.fit = step(fit.glm.initial,
                   scope = list(lower = formula(fit.glm.initial),
                                upper = formula(fit.glm)),
                   direction = "forward", trace = FALSE)
summary(forward.fit)
##
## Call:
## glm(formula = Class ~ Cell.size + Bare.nuclei + Normal.nucleoli +
## Cl.thickness + Epith.c.size, family = binomial, data = BreastCancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.870e-04 -2.100e-08 -2.100e-08 2.100e-08 2.952e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.101e+02 1.298e+06 0.000 1.000
## Cell.size.L 4.352e+02 3.214e+06 0.000 1.000
## Cell.size.Q 9.072e+01 3.388e+06 0.000 1.000
## Cell.size.C 3.484e+02 2.942e+06 0.000 1.000
## Cell.size^4 1.973e+02 2.184e+06 0.000 1.000
## Cell.size^5 -5.729e+01 1.395e+06 0.000 1.000
## Cell.size^6 -6.150e+01 7.597e+05 0.000 1.000
## Cell.size^7 3.009e+01 3.453e+05 0.000 1.000
## Cell.size^8 3.172e+02 1.267e+05 0.003 0.998
## Cell.size^9 1.641e+02 7.807e+04 0.002 0.998
## Bare.nuclei2 1.849e+02 7.542e+04 0.002 0.998
## Bare.nuclei3 2.781e+02 1.786e+04 0.016 0.988
## Bare.nuclei4 5.251e+02 2.845e+04 0.018 0.985
## Bare.nuclei5 3.052e+02 1.986e+04 0.015 0.988
## Bare.nuclei6 7.870e+02 1.634e+05 0.005 0.996
## Bare.nuclei7 2.316e+02 7.146e+04 0.003 0.997
## Bare.nuclei8 3.779e+02 2.066e+04 0.018 0.985
## Bare.nuclei9 8.364e+02 6.488e+06 0.000 1.000
## Bare.nuclei10 5.709e+02 2.981e+04 0.019 0.985
## Normal.nucleoli2 -2.326e+02 9.463e+04 -0.002 0.998
## Normal.nucleoli3 1.734e+02 1.687e+04 0.010 0.992
## Normal.nucleoli4 7.623e+01 1.258e+04 0.006 0.995
## Normal.nucleoli5 -7.981e+01 1.218e+04 -0.007 0.995
## Normal.nucleoli6 3.742e+02 2.458e+04 0.015 0.988
## Normal.nucleoli7 -1.962e+02 4.358e+04 -0.005 0.996
## Normal.nucleoli8 -3.963e+02 6.488e+06 0.000 1.000
## Normal.nucleoli9 3.113e+02 5.117e+04 0.006 0.995
## Normal.nucleoli10 6.950e+02 3.975e+04 0.017 0.986
## Cl.thickness.L 4.306e+02 2.501e+06 0.000 1.000
## Cl.thickness.Q 9.716e+01 1.130e+06 0.000 1.000
## Cl.thickness.C -9.360e+01 9.817e+05 0.000 1.000
## Cl.thickness^4 -2.303e+02 2.669e+06 0.000 1.000
## Cl.thickness^5 1.877e+02 3.253e+06 0.000 1.000
## Cl.thickness^6 5.946e+01 2.778e+06 0.000 1.000
## Cl.thickness^7 -2.786e+00 1.786e+06 0.000 1.000
## Cl.thickness^8 -1.142e+02 8.517e+05 0.000 1.000
## Cl.thickness^9 -2.004e+02 2.690e+05 -0.001 0.999
## Epith.c.size.L -3.852e+02 9.984e+04 -0.004 0.997
## Epith.c.size.Q 1.813e+02 6.512e+04 0.003 0.998
## Epith.c.size.C 1.441e+02 5.790e+04 0.002 0.998
## Epith.c.size^4 4.865e+02 9.611e+04 0.005 0.996
## Epith.c.size^5 -4.167e+01 1.085e+05 0.000 1.000
## Epith.c.size^6 1.606e+02 9.475e+04 0.002 0.999
## Epith.c.size^7 3.708e+02 6.309e+04 0.006 0.995
## Epith.c.size^8 1.486e+02 4.153e+04 0.004 0.997
## Epith.c.size^9 1.208e+02 1.543e+04 0.008 0.994
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.8435e+02 on 682 degrees of freedom
## Residual deviance: 1.1045e-06 on 637 degrees of freedom
## AIC: 92
##
## Number of Fisher Scoring iterations: 25
length(forward.fit$coefficients)
## [1] 46
As we can see, this forward approach based on the AIC selects a model with 46 parameters, which is still much larger than the roughly 24 suggested by the EPV criterion, so this model is not suitable either. Alternatively, we can use a forward approach based on the Bayesian Information Criterion (BIC), which penalizes model complexity more heavily:
forward.fit.BIC = step(fit.glm.inital,
scope=list(lower = formula(fit.glm.inital),
upper=formula(fit.glm)),
direction="forward",
k = log(length(BreastCancer$Id)), # penalty k = log(n) gives the BIC
trace = FALSE)
summary(forward.fit.BIC)
##
## Call:
## glm(formula = Class ~ Cell.size + Bare.nuclei, family = binomial,
## data = BreastCancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.72619 -0.08258 -0.08258 0.00002 2.48330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.9250 449.6376 0.009 0.993035
## Cell.size.L 13.7074 926.8888 0.015 0.988201
## Cell.size.Q -5.5694 1717.3389 -0.003 0.997412
## Cell.size.C 8.7572 980.1640 0.009 0.992871
## Cell.size^4 18.0289 1513.3939 0.012 0.990495
## Cell.size^5 5.0943 965.9765 0.005 0.995792
## Cell.size^6 -8.7193 1298.5416 -0.007 0.994642
## Cell.size^7 1.4402 1352.2992 0.001 0.999150
## Cell.size^8 10.0580 1077.5574 0.009 0.992553
## Cell.size^9 0.3970 2350.9625 0.000 0.999865
## Bare.nuclei2 1.4633 0.7854 1.863 0.062451 .
## Bare.nuclei3 2.4390 0.7052 3.459 0.000543 ***
## Bare.nuclei4 2.9375 0.9198 3.194 0.001406 **
## Bare.nuclei5 3.3585 0.8342 4.026 5.67e-05 ***
## Bare.nuclei6 21.7317 6957.5574 0.003 0.997508
## Bare.nuclei7 3.5982 1.2541 2.869 0.004117 **
## Bare.nuclei8 2.6608 1.0250 2.596 0.009436 **
## Bare.nuclei9 20.9855 5042.4955 0.004 0.996679
## Bare.nuclei10 5.4373 0.8476 6.415 1.41e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 134.94 on 664 degrees of freedom
## AIC: 172.94
##
## Number of Fisher Scoring iterations: 19
length(forward.fit.BIC$coefficients) # Better!
## [1] 19
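As a double check, we can compare the BIC of the two selected models and recompute the EPV of the BIC-selected one; a sketch (output not shown):
# BIC of the AIC-selected and BIC-selected models
BIC(forward.fit)
BIC(forward.fit.BIC)
# EPV of the selected model: 239/19, roughly 12.6, above the rule of thumb of 10
min(table(BreastCancer$Class)) / length(forward.fit.BIC$coefficients)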
With BIC we obtain a model with only 19 parameters. Now let’s check how reliable this model is. We can again compute the in-sample and the out-of-sample classification accuracy.
class_predict = forward.fit.BIC$fitted.values > 0.5
in_accuracy = mean((BreastCancer$Class == "malignant") == class_predict)
in_accuracy
## [1] 0.9604685
table(BreastCancer$Class)/length(BreastCancer$Class)
##
## benign malignant
## 0.6500732 0.3499268
So the in-sample classification accuracy is much higher than the roughly 65% we would obtain by blindly classifying every case as benign (the majority class). Therefore, our model is working properly.
Now let’s consider the out-of-sample classification accuracy.
out_accuracy = cv.glm(BreastCancer, forward.fit.BIC, cost, K = 10)$delta[2]
out_accuracy
## [1] 0.9484618
The resulting out-of-sample classification accuracy is very similar to the in-sample classification accuracy, which again verifies that our model is working properly.
Lastly, we can again use the gamlss() function to perform the model check:
fit.gamlss = gamlss(formula(forward.fit.BIC), data = BreastCancer, family=BI)
## GAMLSS-RS iteration 1: Global Deviance = 134.9427
## GAMLSS-RS iteration 2: Global Deviance = 134.9424
summary(fit.gamlss)
## ******************************************************************
## Family: c("BI", "Binomial")
##
## Call: gamlss(formula = formula(forward.fit.BIC), family = BI,
## data = BreastCancer)
##
## Fitting method: RS()
##
## ------------------------------------------------------------------
## Mu link function: logit
## Mu Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1250 36.9077 0.058 0.954104
## Cell.size.L 10.7348 76.0883 0.141 0.887847
## Cell.size.Q -4.5250 140.9702 -0.032 0.974403
## Cell.size.C 6.0367 80.4639 0.075 0.940219
## Cell.size^4 11.9704 124.2184 0.096 0.923259
## Cell.size^5 3.8053 79.2849 0.048 0.961734
## Cell.size^6 -5.6832 106.5880 -0.053 0.957494
## Cell.size^7 1.1240 110.9913 0.010 0.991923
## Cell.size^8 6.8044 88.4487 0.077 0.938702
## Cell.size^9 0.3698 192.9519 0.002 0.998471
## Bare.nuclei2 1.4633 0.7854 1.863 0.062892 .
## Bare.nuclei3 2.4390 0.7052 3.459 0.000578 ***
## Bare.nuclei4 2.9375 0.9198 3.194 0.001472 **
## Bare.nuclei5 3.3585 0.8342 4.026 6.33e-05 ***
## Bare.nuclei6 15.7316 571.1036 0.028 0.978033
## Bare.nuclei7 3.5982 1.2541 2.869 0.004247 **
## Bare.nuclei8 2.6608 1.0250 2.596 0.009644 **
## Bare.nuclei9 14.9855 413.9059 0.036 0.971130
## Bare.nuclei10 5.4373 0.8476 6.415 2.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------------------------------
## No. of observations in the fit: 683
## Degrees of Freedom for the fit: 19
## Residual Deg. of Freedom: 664
## at cycle: 2
##
## Global Deviance: 134.9424
## AIC: 172.9424
## SBC: 258.9458
## ******************************************************************
plot(fit.gamlss)
## ******************************************************************
## Summary of the Randomised Quantile Residuals
## mean = -0.04395755
## variance = 1.027992
## coef. of skewness = -0.1255655
## coef. of kurtosis = 2.971741
## Filliben correlation coefficient = 0.9989149
## ******************************************************************