class: title-slide <div class="my-logo-right"></div> <br> <br> <br> <br> # Data Analytics for Pharmaceutical Sciences ## Part III: Linear Regression ### .smaller[Stéphane Guerrier, Data Analytics Lab, University of Geneva 🇨🇭] ### .smaller[Dominique-L. Couturier, Cancer Research UK, University of Cambridge 🇬🇧] ### .smaller[Yuming Zhang, Data Analytics Lab, University of Geneva 🇨🇭] <br> <img src="data:image/png;base64,#pics/liscence.png" width="25%" style="display: block; margin: auto;" /> .center[.tiny[License: [CC BY NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)]] ### .tiny[This document was prepared with the help of Lionel Voirol, Wenfei Chu and Jun Wu.] --- # .smallest[Motivating Example: Reading Ability] .panelset[ .panel[.panel-name[Problem] An educator believes that .purple2[new directed reading activities] in the classroom can help elementary school students (6-12 years old) improve their reading ability. She arranged a pilot study where some students (chosen at random) of age 6 start to take part in these activities .hi-purple2[(treatment group)], meanwhile other students continue with the .pink[classical curriculum] .hi-pink[(control group)]. The educator wishes to evaluate the effectiveness of these activities so all students are given a Degree of Reading Power (DRP) test, which assesses their reading ability. .pink[Can we conclude that these new directed reading activities can help elementary school students improve their reading ability?] ] .panel[.panel-name[Data] ```r library(idarps) data(reading) control = reading$score[reading$group == "Control"] head(control) treatment = reading$score[reading$group == "Treatment"] head(treatment) ``` ``` #> [1] 59.98881 60.39487 70.02395 41.47707 29.15497 58.84009 ``` ``` #> [1] 36.80558 36.70201 59.89362 31.72289 41.04241 51.82160 ``` ] .panel[.panel-name[Graph] <img src="data:image/png;base64,#pics/boxplot_reading.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Test] 1. .purple[Define hypotheses:] `\(H_0: \mu_T = \mu_C\)` and `\(H_a: \mu_T > \mu_C\)`. 2. .purple[Define] `\(\color{#6A5ACD}{\alpha}\)`: We consider `\(\alpha = 5\%\)`. 3. .purple[Compute p-value]: p-value = `\(97.12\%\)` (see R code tab for details). 4. .purple[Conclusion:] We have p-value > `\(\alpha\)` so we cannot reject the null hypothesis at the significance level of 5%. .hi-pink[Remark:] Notice that the p-value for the test with opposite hypotheses is actually `\(100\%-97.12\%=2.88\% < \alpha\)` (Why? 🤔). So we conclude, at the significance level of 5%, that the classical curriculum without these new directed reading activities actually improve students' reading ability more compared to the curriculum with these activities. 😬 ] .panel[.panel-name[`R` Code ] ```r t.test(treatment, control, alternative = "greater") ``` ``` #> #> Welch Two Sample t-test #> #> data: treatment and control #> t = -1.9497, df = 43.527, p-value = 0.9712 #> alternative hypothesis: true difference in means is greater than 0 #> 95 percent confidence interval: #> -10.77571 Inf #> sample estimates: #> mean of x mean of y #> 44.77579 50.56302 ``` ] ] --- # .smaller[Is our analysis comprehensive?] The educator points out that only students of 6-8 years old have participated in the new directed reading activities. In other words, in the sample she collected, the students in the treatment group are only of age 6 to 8, whereas the students in the control group vary from 6 to 12 years old. .pink[Is age a potential explanation to the difference we observe among the students' reading ability?] To make sure that the analysis is reliable, she includes the age information of the students, which can be accessed as follows: ```r treatment_age = reading$age[reading$group == "Treatment"] control_age = reading$age[reading$group == "Control"] ``` --- # .smaller[Should age be taken into account?] <img src="data:image/png;base64,#pics/reading_points.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Regression analysis] - Regression analysis corresponds to a set of statistical methods for estimating the .pink[relationships] between a response variable `\(Y\)` of primary interest (also called the *outcome variable*) and some explanatory variables `\(X_1, \ldots, X_p\)` (also called *covariates*, *regressors*, *features* or *predictors*). - The relationship between the response variable `\(Y\)` and the covariates is not .purple2[deterministic] and we model the .purple2[conditional expected value] (i.e. `\(\mathbb{E}[Y | X_1, \ldots, X_p]\)`). - Therefore, we consider the following (general) model: `$$Y_i = \mathbb{E}[Y_i | X_{i1}, \ldots, X_{ip}] + \varepsilon_i,$$`where `\(\mathbb{E}[\varepsilon_i] = 0\)` and `\(i=1,\ldots,n\)`. - .pink[Example]: `\(\mathbb{E}[\text{reading abilities}_i| \ \text{age}_i, \text{treatment}_i, \ldots]\)`. --- # .smaller[Linear regression] - The most common form of regression analysis is .pink[linear regression], in which the conditional expected value `\(\color{#373895}{\mathbb{E}[Y | X_1, \ldots, X_p]}\)` takes the form `$$\color{#373895}{\mathbb{E}[Y_i| X_{i1}, \ldots, X_{ip}]} = \color{#373895}{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}.$$` and `\(\color{#eb078e}{\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)}\)`. - Our general model can be expressed as `$$Y_i = \color{#373895}{\mathbb{E}[Y_i | X_{i1}, \ldots, X_{ip}]} + \varepsilon_i = \color{#373895}{\beta_0 + \sum_{j = 1}^p \beta_j X_{ij}} + \varepsilon_i,$$` and therefore, `$$Y_i \color{#eb078e}{ \, \sim \, \mathcal{N}}\left(\color{#373895}{\beta_0 + \sum_{j = 1}^p \beta_j X_{ij}}, \color{#eb078e}{\sigma^2}\right).$$` --- # .smaller[Linear regression] Therefore, this approach makes two (.pink[strong]) assumptions: 1. The conditional expected value `\(\mathbb{E}[Y | X_1, \ldots, X_p]\)` is assumed to be a linear function of the covariates. 2. The errors are assumed to be `\(iid\)` Gaussian, i.e. `\(\color{#eb078e}{\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)}\)`, at least when the sample size is small to medium. ⚠️ .pink[In practice, it is important to assess if these assumptions are plausible.] The parameters of the model (i.e. `\(\beta_0, \beta_1, \ldots, \beta_p\)` and `\(\sigma^2\)`) can be estimated by several methods. The most commonly used is the .pink[least squares] approach where `\(\beta_0, \beta_1, \ldots, \beta_p\)` are chosen such that `$$\small \sum_{i = 1}^n \varepsilon_i^2 = \sum_{i = 1}^n \left(Y_i- \mathbb{E}[Y_i | X_{i1}, \ldots, X_{ip}]\right)^2 = \sum_{i = 1}^n \left(Y_i- \beta_0 - \sum_{j = 1}^p \beta_j X_{ij}\right)^2$$` is .pink[minimized], which then allows to estimate `\(\sigma^2\)` further on. --- # .smaller[Anscombe's quartet] <img src="data:image/png;base64,#pics/Anscombe.png" width="80%" style="display: block; margin: auto;" /> 👋 .smallest[Source: [Wikipedia](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).] --- # .smaller[Example: Reading ability assessment] In the reading ability example, we can formulate a linear regression model (without interaction) as follows: `$$\color{#e64173}{\text{Score}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Group}_i} + \beta_2 \color{#20B2AA}{\text{Age}_i} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).$$` - `\(\color{#e64173}{\text{Score}_i}\)`: score of the DRP test of the `\(i\)`-th student. - `\(\color{#6A5ACD}{\text{Group}_i}\)`: indicator of participation of the new directed reading activities for the `\(i\)`-th student (i.e. `\(\color{#6A5ACD}{\text{Group}_i} = 1\)` if participate and `\(\color{#6A5ACD}{\text{Group}_i} = 0\)` if not participate). - `\(\color{#20B2AA}{\text{Age}_i}\)`: age of the `\(i\)`-th student (related to *.turquoise[time since start of treatment]*). With this model the two groups can be compared as the age effect is taken into account. The goal of the educator is now to assess if `\(\beta_1\)` is .pink[significantly larger than 0]. --- # .smaller[Example: Reading ability assessment] .panelset[ .panel[.panel-name[R Code] .pink[R function]: `lm(y ~ x1 + ... + xp, data = mydata)`. Here is the code for our example: ```r # Import data (if you haven't already) library(idarps) data(reading) # Fit linear regression model mod1 = lm(score ~ group + age, data = reading) summary(mod1) ``` ] .panel[.panel-name[Output] ``` #> #> Call: #> lm(formula = score ~ group + age, data = reading) #> #> Residuals: #> Min 1Q Median 3Q Max #> -9.247 -3.532 0.423 3.180 12.569 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -7.8639 3.3238 -2.366 0.0214 * #> groupTreatment 6.3771 1.3931 4.578 2.6e-05 *** #> age 6.6457 0.3695 17.985 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 4.447 on 57 degrees of freedom #> Multiple R-squared: 0.8586, Adjusted R-squared: 0.8536 #> F-statistic: 173 on 2 and 57 DF, p-value: < 2.2e-16 ``` ] ] --- # .smaller[Interpretation of coefficients] We can obtain the estimated coefficients. Specifically, - `\(\hat{\beta}_0 = -7.8639\)` represents the estimated baseline average score of the DRP test at birth (but what does it mean? 🤨). - `\(\hat{\beta}_1 = 6.3771\)` means that .pink[for a student of the same age], participating in the new directed reading activities is estimated to increase their average score of the DRP test by 6.3771. - `\(\hat{\beta}_2 = 6.6457\)` means that .pink[when a student receives the same treatment] (either participate or not in the activities), their average score increases by 6.6457 as they become 1 year older. Regression coefficients represent the mean change in the response variable .purple[for one unit of change] in the predictor variable .purple[while holding other covariates in the model constant.] --- # .smaller[Interpretation of coefficient p-values] - We notice that for each coefficient `\(\beta_j\)`, there is a corresponding p-value associated to the (Wald t-)test of `\(H_0: \beta_j = 0\)` and `\(H_a: \beta_j \neq 0\)`. - .pink[A covariate with a small p-value (typically smaller than 5%) is considered to be a significant (meaningful) addition to the model], as changes in the values of such covariate can lead to changes in the response variable. - On the other hand, a large p-value (typically larger than 5%) suggests that the corresponding covariate is not (significantly) associated with changes in the response or that we don't have enough evidence (data) to show its effect. - In this example, the coefficient p-value associated to the `group` covariate is `\(2.6 \times 10^{-3}\)`%. .purple[This suggests that taking into account the effect of `age`, the reading abilities of the students receiving the treatment is significantly .hi.purple[different] from the control group, at the significance level of 5%.] But this is not what we want! --- # .smaller[Interpretation of coefficient p-values] In the linear regression output, the coefficient p-value (which we denote as `\(p\)` below) corresponds to a two-sided test. We can use this result to compute the p-value of a one-sided test using the following relations: | | `\(\small H_a: \beta_j>0\)` | `\(\small H_a: \beta_j<0\)` | | ------------- |:-------------:| :-----:| | `\(\small \hat{\beta_j}>0\)` | `\(\small p/2\)` | `\(\small 1-p/2\)` | | `\(\small \hat{\beta_j}<0\)` | `\(\small 1-p/2\)` | `\(\small p/2\)` | In our example, `\(\beta_1 = 6.3771\)` and `\(p=2.6 \times 10^{-3}\%\)`. So the p-value of the test with hypotheses `\(H_0: \beta_1=0\)` and `\(H_a: \beta_1>0\)` is `\(2.6 \times 10^{-3}\% /2 \approx 1.3 \times 10^{-3}\% < \alpha\)`. So we can conclude that these new directed reading activities can .pink[significantly improve] students' reading ability compared to classical curriculum. .purple2[However, is our model plausible?] 🤔 --- # .smaller[How good is our model? 🤔] <img src="data:image/png;base64,#pics/points_with_mod1.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Could we use the] `\(R^2\)`.smaller[?] - The .pink[coefficient of determination], denoted as `\(R^2\)` and often referred to as R-squared, corresponds to the proportion of the variance in the response variable that is "explained" by the model. - `\(\color{#6A5ACD}{R^2}\)` .purple[gives certain information about the goodness of fit of a model.] It measures how well the regression predictions approximate the real data points. An `\(R^2\)` of 1 indicates that the regression predictions perfectly fit the data. - However, the value of `\(R^2\)` is .pink[not related to the adequacy of our model to the data]. - ⚠️ Moreover, adding new covariates to the current model .purple[always] increases `\(R^2\)`, whether the additional covariates are significant or not. Therefore, `\(R^2\)` alone cannot be used as a meaningful comparison of models with different covariates. - The .pink[adjusted] `\(\color{#e64173}{R^2}\)` is a modification of `\(R^2\)` that aims to limit this issue. --- # .smaller[Rexthor, the Dog-Bearer!] <img src="data:image/png;base64,#pics/linear_regression.png" width="90%" style="display: block; margin: auto;" /> 👋 .smallest[If you want to know more have a look [here](https://www.explainxkcd.com/wiki/index.php/1725:_Linear_Regression).] --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/diag_mod1.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] ⚠️ <img src="data:image/png;base64,#pics/diag_mod1_v2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] ⚠️ <img src="data:image/png;base64,#pics/diag2_mod1.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/qqplot_mod1.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Let's update our model] Our results suggest that the students of the group participating in these new directed reading activities progress more rapidly (which is actually more reasonable than our initial model 🤔). Therefore, we modify our model by adding an interaction term: `$$\color{#e64173}{\text{Score}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Group}_i} + \beta_2 \color{#20B2AA}{\text{Age}_i} + \beta_3 \color{#6A5ACD}{\text{Group}_i}\color{#20B2AA}{\text{Age}_i} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).$$` - `\(\color{#e64173}{\text{Score}_i}\)`: score of the DRP test of the `\(i\)`-th student. - `\(\color{#6A5ACD}{\text{Group}_i}\)`: indicator of participation of the new directed reading activities for the `\(i\)`-th student (i.e. `\(\color{#6A5ACD}{\text{Group}_i} = 1\)` if participate and `\(\color{#6A5ACD}{\text{Group}_i} = 0\)` if not participate). - `\(\color{#20B2AA}{\text{Age}_i}\)`: age of the `\(i\)`-th student (related to *.turquoise[time since start of treatment]*), The goal of the educator is now to assess if `\(\beta_1\)` and/or `\(\beta_3\)` are .pink[significantly larger than 0]. --- # .smaller[Example: Reading ability assessment] .panelset[ .panel[.panel-name[R Code] Here is the code to fit our second model: ```r # Import data (if you haven't already) library(idarps) data(reading) # Fit linear regression model mod2 = lm(score ~ group*age, data = reading) summary(mod2) ``` ] .panel[.panel-name[Output] ``` #> #> Call: #> lm(formula = score ~ group * age, data = reading) #> #> Residuals: #> Min 1Q Median 3Q Max #> -8.5033 -2.3708 0.4076 2.5478 10.1024 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -3.2489 2.7903 -1.164 0.249 #> groupTreatment -36.0307 7.5392 -4.779 1.31e-05 *** #> age 6.1207 0.3108 19.693 < 2e-16 *** #> groupTreatment:age 5.9539 1.0468 5.688 4.86e-07 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 3.572 on 56 degrees of freedom #> Multiple R-squared: 0.9104, Adjusted R-squared: 0.9056 #> F-statistic: 189.6 on 3 and 56 DF, p-value: < 2.2e-16 ``` ] ] --- # .smaller[Interpretation of coefficients] We can obtain the estimated coefficients. Specifically, - `\(\hat{\beta}_0 = -3.2489\)` represents the estimated baseline average score of the DRP test at birth (but *again* what does it mean? 🙄) - `\(\hat{\beta}_1 = -36.0307\)` means that .pink[for a student of the same age], participating in the new directed reading activities is estimated to decrease their average score of the DRP test by 36.0307 (does this make sense? 🧐). - `\(\hat{\beta}_2 = 6.1207\)` means that for students not participating to the new directed reading activities, their average score increases by 6.1207 as they become 1 year older. - `\(\hat{\beta}_3 = 5.9539\)` means that the average score of students participating in the new directed reading activities increases by 5.9539 as they become 1 year older .pink[compared to the other students]. This means that the average score of students participating to the new program increases by 12.0746 (i.e. 6.1207 + 5.9539) as they become 1 year older. --- # .smaller[Model fit] <img src="data:image/png;base64,#pics/points_with_mod2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/diag_mod2_v2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/diag2_mod2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/qqplot_mod2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model selection] In general, we prefer a .pink[parsimonious approach to modeling] in the sense that we only choose a more complex model if the benefits are ".pink[substantial]"<sup>.smallest[👋]</sup>. We want our model to be such that: 1. The model fits the data well. 2. Avoid (excessive) overfitting. Naturally these two objectives are .pink[contradictory] and there are many ways to select a suitable model (actually this is one of the most active areas of research in modern Statistics). In this class, we will only consider one (simple) approach based on the .pink[Akaike Information Criterion (AIC)]. This criterion corresponds to an .purple2[estimator of prediction error] and thereby .purple2[relative quality of statistical models for a given set of data]. .footnote[.smallest[👋 This point of view is based on .pink[Occam's razor] (or law of parsimony), the problem-solving principle stipulating that ".purple2[the simplest explanation is usually the right one]".]] --- # .smaller[Model selection] In `R`, the AIC can be computed for a given model (i.e. use the output of the function `lm(...)` in `AIC(model)`.) For example, we can compare the AIC of the previously considered models as follows: ```r AIC(mod1) # First model (no interaction) ``` ``` #> [1] 354.2688 ``` ```r AIC(mod2) # Second model (with interaction) ``` ``` #> [1] 328.9095 ``` As expected, these results suggest that the second model is more appropriate. .pink[But should we further improve it?] --- # .smaller[Let's update our model (again)] It should be reasonable that the average reading scores of the two groups are the same at the start of the program. `$$\color{#e64173}{\text{Score}_i} = \beta_0 + \beta_1 \color{#20B2AA}{(\text{Age}_i - 6)} + \beta_2 \color{#6A5ACD}{\text{Group}_i}\color{#20B2AA}{(\text{Age}_i - 6)} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).$$` - `\(\color{#e64173}{\text{Score}_i}\)`: score of the DRP test of the `\(i\)`-th student. - `\(\color{#6A5ACD}{\text{Group}_i}\)`: indicator of participation of the new directed reading activities for the `\(i\)`-th student (i.e. `\(\color{#6A5ACD}{\text{Group}_i} = 1\)` if participate and `\(\color{#6A5ACD}{\text{Group}_i} = 0\)` if not participate). - `\(\color{#20B2AA}{\text{Age}_i - 6}\)`: corresponds to the time since start of treatment of the `\(i\)`-th student. With this model the two groups can be compared as the age effect is taken into account. The goal of the educator now is (.pink[only!]) to assess if `\(\beta_1\)` is .pink[significantly larger than 0]. --- # .smaller[Example: Reading ability assessment] .panelset[ .panel[.panel-name[R Code] Here is the code to fit our third model: ```r # Import data (if you haven't already) library(idarps) data(reading) reading$age_minus_6 = reading$age - 6 # Fit linear regression model mod3 = lm(score ~ age_minus_6 + group:age_minus_6, data = reading) summary(mod3) ``` ] .panel[.panel-name[Output] ``` #> #> Call: #> lm(formula = score ~ age_minus_6 + group:age_minus_6, data = reading) #> #> Residuals: #> Min 1Q Median 3Q Max #> -8.5095 -2.4182 0.4826 2.6312 10.0853 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 33.3504 0.7904 42.192 < 2e-16 *** #> age_minus_6 6.1522 0.2604 23.625 < 2e-16 *** #> age_minus_6:groupTreatment 5.8103 0.7157 8.119 4.37e-11 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 3.542 on 57 degrees of freedom #> Multiple R-squared: 0.9103, Adjusted R-squared: 0.9072 #> F-statistic: 289.2 on 2 and 57 DF, p-value: < 2.2e-16 ``` ] .panel[.panel-name[AIC] ```r AIC(mod1) ``` ``` #> [1] 354.2688 ``` ```r AIC(mod2) ``` ``` #> [1] 328.9095 ``` ```r AIC(mod3) ``` ``` #> [1] 326.9479 ``` ] ] --- # .smaller[Model fit] <img src="data:image/png;base64,#pics/points_with_mod3.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/diag_mod3_v2.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/diag2_mod3.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Model diagnostic] <img src="data:image/png;base64,#pics/qqplot_mod3.png" width="90%" style="display: block; margin: auto;" /> --- # .smaller[Concluding remarks] - The last model we consider appears to fit the data, avoids overfitting and allows to answer whether the new reading activities are of interest. Indeed, the programs significantly improve the reading performance of the students (e.g. 5.81 more per year compared to control, p-value < 5%). - Our model .pink[assumes a linear relationship] between the response and the covariates. However, this may be incorrect. - Our model .pink[only considers independent data] (which may not be correct here). - Finally, linear regression .pink[should not be used to extrapolate], i.e. to estimate beyond the original observation range. For example, if we consider a 100 year-old person in this reading ability example, we would predict (using the third model) that the corresponding score of the DRP test would be 1157.59 and 611.45, respectively, with and without these activities. Does it really make sense? 😯 --- # .smaller[Extrapolating] <img src="data:image/png;base64,#pics/extrapolating.png" width="90%" style="display: block; margin: auto;" /> 👋 .smallest[If you want to know more have a look [here](https://www.explainxkcd.com/wiki/index.php/605:_Extrapolating).]