Revisiting the study design

Problem

  • The MSc research had some issues regarding the data and the design of the analysis.
  • Issue 1 - building height: This data only covered the city centre (two 10kmx10km grid areas: NS56, NS66). This spatial extent was not suitable to cover the entire glasgow city boundary.
  • Issue 2 - spatial aggregation: The spatial scale of LSOA was too fine to get a meaningful outcome.
  • Issue 3 - Negative Binomial (NB) Regression: NB regression is conducted for count data and assumes a mixture of Poisson distributions with mean distributed as a gamma distribution. Although the distribution of the Strava data looks quite similar to the Gamma distribution (i.e. long tail), the counts are all fairly large enough for us to ignore the discreteness and use linear models.

(Potential) Solution

  • Issue 1 ⇒ Re-Download the buildings files that covers the spatial boundary of Glasgow: N44-N47, N54-N57, N64-N67, N74-N77, and N84-N87
  • Issue 2 ⇒ Enlarge the spatial aggregation from LSOA to MSOA. Here, Scotland uses Data Zones (LSOA) and Immediate Zones (MSOA).
  • Issue 3 ⇒ Geographic Weighted Regression or Ordinary Linear Square Regression.

Data Exploration

Call Packages and data

library(tidyverse) # data wrangling and plotting
library(sf) # Manipulating spatial files
library(tmap) # Dealing with maps
library(spgwr) # GWRs

As the data were all pre-cleaned, the csv files were all put into the same column key called Name.

strava <- read_csv("Cleaned Files/strava.csv") # response
green <- read_csv("Cleaned Files/green.csv") # predictors
ptai <- read_csv("Cleaned Files/ptai.csv") # predictors
buildings <- read_csv("Cleaned Files/buildings.csv") # predictors
shp <- read_sf("Cleaned Files/Glasgow_IZ.shp") # shapefile

# merge all 
strava %>% 
  left_join(green, by = "Name") %>% 
  left_join(ptai, by = "Name") %>% 
  left_join(buildings, by = "Name") -> glasgow_df

The Figure@ref(fig:correlation) shows the correlation between Strava 2017 and Strava 2018. The correlation is `0.9975779`, which means it is nearly identical - not what I expected!

plot(glasgow_df$ride17, glasgow_df$ride18)
Strava 2017 vs Strava 2018

Strava 2017 vs Strava 2018

cor(glasgow_df$ride17, glasgow_df$ride18)
## [1] 0.9975779

Exploring the Variables

To get an immediate understanding of the variables, the best thing is to map the variables and get summary tables. First, let us look at the summary of the variables Table@ref(tab:summary-table).

glasgow_df %>% summary()
##      Name               ride17            ride18            green       
##  Length:136         Min.   :   1760   Min.   :   2275   Min.   : 1.320  
##  Class :character   1st Qu.:  32800   1st Qu.:  33606   1st Qu.: 5.780  
##  Mode  :character   Median :  79292   Median :  81615   Median : 8.068  
##                     Mean   : 128697   Mean   : 135287   Mean   : 8.837  
##                     3rd Qu.: 153040   3rd Qu.: 168255   3rd Qu.:11.670  
##                     Max.   :1312075   Max.   :1421505   Max.   :20.664  
##       PTAI            height      
##  Min.   : 245.8   Min.   : 5.667  
##  1st Qu.: 542.8   1st Qu.: 7.358  
##  Median : 808.5   Median : 8.098  
##  Mean   :1022.2   Mean   : 9.029  
##  3rd Qu.:1213.8   3rd Qu.: 9.566  
##  Max.   :4990.8   Max.   :21.133

Here we transform the data to a longer format glasgow_df_long using the pivot_longer function. This is to directly execute ggplot with facet wrapping. After the data transformation, we then merge the shapefile with the integrated data frame that is gl_shp.

glasgow_df %>% 
  rename(Strava2017 = ride17,
         Strava2018 = ride18) %>% 
  pivot_longer(!Name,
               names_to = "Type",
               values_to = "Value") -> glasgow_df_long

shp %>% 
  left_join(glasgow_df_long, by = "Name") -> gl_shp

Lets look at the strava data first. Here we see that during 2017 and 2018 people including the City Centre South, Laurieston and Tradeston, City Centre East, Finnieston and Kelvinhaugh, and Calton and Gallowgate were identified as the most reported areas.

# Strava Users
gl_shp %>% 
  filter(Type %in% c("Strava2017", "Strava2018")) %>% 
  mutate(Value2 = cut(Value,
                     breaks = c(0, 50000, 100000, 150000, 200000, 300000, +Inf),
                     labels = c("0-50", "50-100", "100-150", "150-200", "200-30", ">300"))) %>% 
  tm_shape() +
  tm_polygons("Value2", title = "Strava ('000s)", palette="-RdBu") +
  tm_facets(by = "Type", free.coords = F, free.scales = F, ncol = 2) -> gl_strava

gl_strava

The figures below show that the per cent of the greenness (by Immediate Zones) gradually tends to decrease as it goes outside the city centre. The average is 8% across the whole area but the lowest is situated in the city centre and the city south.

The height of the buildings were concentrated around the city centre. The City Centre South was the highest at 21.1% followed by City Centre East and City Centre West.

PTAI (Public Transport Availability Indicators) also tend to more clustered in the city centre (>3000) and around the major bus routes (>2000) while the north and the east were relatively lower (<1000).

# Other variables
gl_shp %>% 
  filter(Type %in% c("green", "PTAI", "height")) %>%
  tm_shape() +
  tm_polygons("Value", title = "", palette="-RdBu") +
  tm_facets(by = "Type", free.coords = F, free.scales = T, ncol = 3) +
  tm_layout(legend.position = c("right", "top"),  
            title.position = c('right', 'top')) -> gl_variable

gl_variable

OLS Regression - log transformation

model17 <- lm(log(ride17) ~ green + PTAI + height, data = glasgow_df) 
summary(model17)
## 
## Call:
## lm(formula = log(ride17) ~ green + PTAI + height, data = glasgow_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.78035 -0.59604  0.08002  0.72427  1.76984 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.4216433  0.4704823  22.151  < 2e-16 ***
## green       -0.0656807  0.0232332  -2.827  0.00543 ** 
## PTAI         0.0003435  0.0001738   1.976  0.05021 .  
## height       0.1055714  0.0503079   2.099  0.03777 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01 on 132 degrees of freedom
## Multiple R-squared:  0.3153, Adjusted R-squared:  0.2997 
## F-statistic: 20.26 on 3 and 132 DF,  p-value: 7.342e-11
residuals(model17)
##            1            2            3            4            5            6 
##  0.135919942  0.670526943  1.304312489  0.644729034  1.310050040 -2.170126101 
##            7            8            9           10           11           12 
## -0.616559434  0.186769318 -0.992017762  0.395836924  0.382950246 -0.237382366 
##           13           14           15           16           17           18 
##  0.261310638  0.746033800 -0.028220730  0.104845381  1.243101580 -0.191884841 
##           19           20           21           22           23           24 
##  0.717018818  1.250844748 -1.012700849  0.603650453  0.650193984  0.641932450 
##           25           26           27           28           29           30 
##  0.417989745 -1.270145620 -1.663419395 -0.461354261  0.107555031 -2.372796941 
##           31           32           33           34           35           36 
## -0.037508281 -0.074528380 -0.359106739  0.868173407 -0.144176683  1.327318558 
##           37           38           39           40           41           42 
##  0.775932529  0.568731380 -0.259055184  1.149408266 -0.485250263  0.216155911 
##           43           44           45           46           47           48 
## -0.767564993 -0.796870162 -0.520444416 -0.250865175 -2.780353678 -1.024349128 
##           49           50           51           52           53           54 
##  1.754635369 -2.435174049 -1.395130000 -0.586207215  1.287896849  0.037091493 
##           55           56           57           58           59           60 
##  0.088020682  0.391203952 -0.238610053 -0.276728072  0.459131120 -2.667679226 
##           61           62           63           64           65           66 
## -2.536940688 -0.591850870  1.335127075 -0.065398596 -1.274398121 -1.343570940 
##           67           68           69           70           71           72 
##  0.390423472 -0.608590419  0.843423877  1.431849596  1.267265991 -0.197921445 
##           73           74           75           76           77           78 
##  0.813957256  0.815116030 -0.808387981 -0.200014442  1.190940720  1.076184237 
##           79           80           81           82           83           84 
## -1.043505139 -0.511448027 -0.711094619 -0.956795760 -0.983510977  1.677010192 
##           85           86           87           88           89           90 
##  0.001904391  1.769843643  1.341548964 -0.396210381 -1.800924705  0.030742474 
##           91           92           93           94           95           96 
##  1.425707450 -0.195669084  0.254578002  0.640165410  0.991057030  0.207139287 
##           97           98           99          100          101          102 
## -0.514297919  0.586451723 -1.257783873 -1.106506884  1.242315480 -1.444023674 
##          103          104          105          106          107          108 
## -0.625451113 -1.198823994 -0.865268566  1.546974649  0.450987252  1.163708165 
##          109          110          111          112          113          114 
## -0.342306469  0.089096060  0.493365219  1.072947660  0.489509480  0.464487910 
##          115          116          117          118          119          120 
##  1.133359294 -0.670206320  0.072023334 -0.489271064 -0.725716621 -1.413417010 
##          121          122          123          124          125          126 
##  1.182463268 -0.196395703  0.052870528  0.673666748 -0.083751979  0.190938856 
##          127          128          129          130          131          132 
##  1.476760642 -1.962769408  0.764908618  0.190832993  0.864277834 -0.404320866 
##          133          134          135          136 
##  0.194651293  0.916943145 -0.324950757  0.450908084
exp(coef(model17)["green"])
##     green 
## 0.9364298
exp(coef(model17)["PTAI"])
##     PTAI 
## 1.000344
exp(coef(model17)["height"])
##   height 
## 1.111345
model18 <- lm(log(ride18) ~ green + PTAI + height, data = glasgow_df) 
summary(model18)
## 
## Call:
## lm(formula = log(ride18) ~ green + PTAI + height, data = glasgow_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64293 -0.68402  0.06562  0.70549  1.76134 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.2875367  0.4715255  21.818   <2e-16 ***
## green       -0.0576020  0.0232847  -2.474   0.0146 *  
## PTAI         0.0003667  0.0001742   2.105   0.0372 *  
## height       0.1134229  0.0504195   2.250   0.0261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 132 degrees of freedom
## Multiple R-squared:  0.3204, Adjusted R-squared:  0.305 
## F-statistic: 20.75 on 3 and 132 DF,  p-value: 4.485e-11
residuals(model18)
##           1           2           3           4           5           6 
##  0.14269549  0.70133233  1.25228045  0.57175425  1.32685895 -2.06094239 
##           7           8           9          10          11          12 
## -0.69197530  0.27568227 -0.91100410  0.53839889  0.48091551 -0.22270837 
##          13          14          15          16          17          18 
##  0.34883614  0.72936131 -0.02824141 -0.08431092  1.22492455 -0.16192034 
##          19          20          21          22          23          24 
##  0.69529060  1.25668680 -1.11264819  0.61570390  0.66997249  0.67447194 
##          25          26          27          28          29          30 
##  0.39601051 -1.36664472 -1.64117100 -0.52012037  0.26263258 -2.48349256 
##          31          32          33          34          35          36 
## -0.12959215 -0.14479454 -0.39948072  0.85085124 -0.04477760  1.29672144 
##          37          38          39          40          41          42 
##  0.79802979  0.58854589 -0.20556453  1.15890213 -0.58598927  0.33485400 
##          43          44          45          46          47          48 
## -0.70670609 -0.72872088 -0.62026026 -0.30106989 -2.64293199 -1.01550880 
##          49          50          51          52          53          54 
##  1.75092521 -2.13491868 -1.16803803 -0.83349604  1.40045222  0.09665470 
##          55          56          57          58          59          60 
##  0.05602764  0.47714511 -0.21347598 -0.16992831  0.47016156 -2.18988703 
##          61          62          63          64          65          66 
## -2.07668593 -0.41256081  1.32300157 -0.17461704 -1.17608792 -1.57761833 
##          67          68          69          70          71          72 
##  0.49000214 -0.46384759  0.87540220  1.42239081  1.25604181 -0.17645867 
##          73          74          75          76          77          78 
##  0.71795108  0.76202870 -0.90171341 -0.27773839  1.17915340  1.03685111 
##          79          80          81          82          83          84 
## -1.26220185 -0.86914869 -0.72584218 -0.83264698 -0.88685603  1.72108964 
##          85          86          87          88          89          90 
## -0.08281366  1.76133757  1.34409734 -0.19948692 -2.19123284  0.03364852 
##          91          92          93          94          95          96 
##  1.43134382 -0.14909499  0.18651506  0.82565358  0.98438225  0.14966778 
##          97          98          99         100         101         102 
## -0.46244627  0.55842696 -1.40142898 -1.31669698  1.12401910 -1.53919326 
##         103         104         105         106         107         108 
## -0.68136626 -1.07743307 -0.89341158  1.55339901  0.48170046  1.09049743 
##         109         110         111         112         113         114 
## -0.38345752  0.21119879  0.36696025  1.01608276  0.64728469  0.60464044 
##         115         116         117         118         119         120 
##  1.09255862 -0.76782073  0.22620488 -0.48588075 -0.78941882 -1.76100279 
##         121         122         123         124         125         126 
##  1.10745587 -0.25255957 -0.20474737  0.50112109 -0.04163603  0.07522218 
##         127         128         129         130         131         132 
##  1.62153263 -2.30606518  0.67548535  0.22161371  1.00915950 -0.34409869 
##         133         134         135         136 
##  0.17304265  0.95836567 -0.21875499  0.62478122
exp(coef(model18)["green"])
##     green 
## 0.9440256
exp(coef(model18)["PTAI"])
##     PTAI 
## 1.000367
exp(coef(model18)["height"])
##   height 
## 1.120105

Geographic Weighted Regression (GWR)

shp %>% 
  left_join(glasgow_df, by = "Name") %>% 
bind_cols(
tibble(Residuals18 = residuals(model18),
       Residuals17 = residuals(model17))) -> glasgow_gwr

plot(glasgow_gwr["Residuals17"]) # randomly distributed

plot(glasgow_gwr["Residuals18"])

glasgow_sp <- as(glasgow_gwr, "Spatial")

gwr.bandwidth <-gwr.sel(log(ride17) ~ green + PTAI + height, glasgow_sp) #estimated optimal bandwidth
## Bandwidth: 8649.304 CV score: 142.7482 
## Bandwidth: 13980.9 CV score: 142.8954 
## Bandwidth: 5354.199 CV score: 138.9338 
## Bandwidth: 3317.712 CV score: 127.933 
## Bandwidth: 2059.094 CV score: 115.1526 
## Bandwidth: 1281.226 CV score: 116.9604 
## Bandwidth: 1859.807 CV score: 113.5978 
## Bandwidth: 1736.559 CV score: 113.0917 
## Bandwidth: 1562.637 CV score: 113.305 
## Bandwidth: 1683.767 CV score: 113.0271 
## Bandwidth: 1679.924 CV score: 113.0265 
## Bandwidth: 1677.647 CV score: 113.0264 
## Bandwidth: 1677.809 CV score: 113.0264 
## Bandwidth: 1677.813 CV score: 113.0264 
## Bandwidth: 1677.813 CV score: 113.0264 
## Bandwidth: 1677.812 CV score: 113.0264 
## Bandwidth: 1677.811 CV score: 113.0264 
## Bandwidth: 1677.812 CV score: 113.0264 
## Bandwidth: 1677.812 CV score: 113.0264 
## Bandwidth: 1677.812 CV score: 113.0264 
## Bandwidth: 1677.812 CV score: 113.0264
gwr.bandwidth
## [1] 1677.812
gwr.fit1<-gwr(log(ride17) ~ green + PTAI + height, 
              data = glasgow_sp, 
              bandwidth = gwr.bandwidth, 
              se.fit=T, 
              hatmatrix=T)
## Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
gwr.fit1
## Call:
## gwr(formula = log(ride17) ~ green + PTAI + height, data = glasgow_sp, 
##     bandwidth = gwr.bandwidth, hatmatrix = T, se.fit = T)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 1677.812 
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max.
## X.Intercept.  6.98465774  9.73352953 10.77907541 11.49108338 14.03347332
## green        -0.23998838 -0.07149230 -0.02740574  0.01209931  0.13576117
## PTAI         -0.00083649  0.00015632  0.00040764  0.00058934  0.00451235
## height       -0.54505680 -0.02654695  0.05253565  0.12528696  0.27437353
##               Global
## X.Intercept. 10.4216
## green        -0.0657
## PTAI          0.0003
## height        0.1056
## Number of data points: 136 
## Effective number of parameters (residual: 2traceS - traceS'S): 42.71778 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 93.28222 
## Sigma (residual: 2traceS - traceS'S): 0.8286924 
## Effective number of parameters (model: traceS): 31.87474 
## Effective degrees of freedom (model: traceS): 104.1253 
## Sigma (model: traceS): 0.7843587 
## Sigma (ML): 0.6863148 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 371.1238 
## AIC (GWR p. 96, eq. 4.22): 315.4401 
## Residual sum of squares: 64.0598 
## Quasi-global R2: 0.6741362
results <-as.data.frame(gwr.fit1$SDF)
names(results)
##  [1] "sum.w"               "X.Intercept."        "green"              
##  [4] "PTAI"                "height"              "X.Intercept._se"    
##  [7] "green_se"            "PTAI_se"             "height_se"          
## [10] "gwr.e"               "pred"                "pred.se"            
## [13] "localR2"             "X.Intercept._se_EDF" "green_se_EDF"       
## [16] "PTAI_se_EDF"         "height_se_EDF"       "pred.se.1"
glasgow_gwr %>% 
  select(-c(green, PTAI, height)) %>% 
  bind_cols(results) -> gwr_results

qtm(gwr_results, fill = "localR2")

qtm(gwr_results, fill = "green")
## Variable(s) "green" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

qtm(gwr_results, fill = "PTAI")
## Variable(s) "PTAI" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

qtm(gwr_results, fill = "height")
## Variable(s) "height" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.