The function ggstatsplot::ggcorrmat provides a quick way to produce a publication-ready correlation matrix (aka correlogram) plot. The function can also be used for quick data exploration. In addition to the plot, it can also be used to get a correlation coefficient matrix or the associated p-value matrix. Currently, the plot can display Pearson’s r, Spearman’s rho, Kendall’s tau, and a robust correlation coefficient (percentage bend correlation; see ?WRS2::pbcor). This function is a convenient wrapper around the ggcorrplot::ggcorrplot function with some additional functionality.
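To make the coefficients listed above concrete, here is a minimal base R sketch that computes Pearson’s r, Spearman’s rho, and Kendall’s tau for a single pair of variables. It uses stats::cor and the built-in mtcars data, both of which are assumptions chosen for illustration; it does not show how ggcorrmat computes these values internally, and the robust (percentage bend) coefficient is left out because it needs the WRS2 package.

```r
# Base R sketch (not ggstatsplot internals): three classical correlation
# coefficients for one pair of variables from the built-in mtcars data.
methods <- c("pearson", "spearman", "kendall")

coefs <- sapply(
  methods,
  function(m) cor(mtcars$mpg, mtcars$wt, method = m)
)

# named numeric vector, one coefficient per method
print(round(coefs, 3))
```

All three coefficients lie between -1 and 1, but they measure different things (linear association versus rank-based association), so they need not agree in magnitude.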
We will see examples of how to use this function in this vignette with the gapminder and diamonds datasets.
To begin with, here are some instances where you would want to use ggcorrmat
-
ggplot2
Note: the following demo uses the pipe operator (%>%). If you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html
ggcorrmat
For the first example, we will use the gapminder dataset (available in the eponymous package on CRAN), which was collected by the Gapminder Foundation and provides values for life expectancy, Gross Domestic Product (GDP) per capita, and population, every five years, from 1952 to 2007, for each of 142 countries. Let’s have a look at the data-
```r
library(gapminder)
library(dplyr)

dplyr::glimpse(gapminder)
#> Rows: 1,704
#> Columns: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
```
Let’s say we are interested in studying correlation between population of a country, average life expectancy, and GDP per capita across countries only for the year 2007.
The simplest way to get a correlation matrix is to stick to the defaults-
```r
# setup
set.seed(123)
library(ggstatsplot)

# select data only from the year 2007
gapminder_2007 <- dplyr::filter(.data = gapminder::gapminder, year == 2007)

# producing the correlation matrix
ggstatsplot::ggcorrmat(
  data = gapminder_2007, # data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap # specifying correlation matrix variables
)
```
This plot can be further modified with additional arguments-
```r
ggstatsplot::ggcorrmat(
  data = gapminder_2007, # data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap, # specifying correlation matrix variables
  cor.vars.names = c(
    "Life Expectancy",
    "population",
    "GDP (per capita)"
  ),
  type = "spearman", # which correlation coefficient is to be computed
  lab.col = "red", # label color
  ggtheme = ggplot2::theme_light(), # selected ggplot2 theme
  ggstatsplot.layer = FALSE, # turn off default ggstatsplot theme overlay
  matrix.type = "lower", # correlation matrix structure
  colors = NULL, # turning off manual specification of colors
  palette = "category10_d3", # choosing a color palette
  package = "ggsci", # package to which color palette belongs
  title = "Gapminder correlation matrix", # custom title
  subtitle = "Source: Gapminder Foundation" # custom subtitle
)
```
As seen from this correlation matrix, although there is no relationship between population and life expectancy worldwide, at least in 2007, there is a strong positive relationship between life expectancy and GDP per capita, a well-established indicator of a country’s economic performance.
Given that there were only three variables, this doesn’t look that impressive. So let’s work with another example from the ggplot2 package: the diamonds dataset. This dataset contains the prices and other attributes of almost 54,000 diamonds.
Let’s have a look at the data-
```r
library(ggplot2)

dplyr::glimpse(ggplot2::diamonds)
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
```
Let’s see the correlation matrix between different attributes of the diamond and the price.
```r
# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  cor.vars = c(carat, depth:z), # note how the variables are getting selected
  cor.vars.names = c(
    "carat",
    "total depth",
    "table",
    "price",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  ggcorrplot.args = list(outline.color = "black", hc.order = TRUE)
)
```
We can make a number of changes to this basic correlation matrix. For example, since we are interested in the relationship between price and the other attributes, let’s make price the first column. Additionally, since we are running 6 correlations that are of a priori interest to us, we can adjust our threshold of significance to 0.05/6 ≈ 0.008. Finally, let’s use a non-parametric correlation coefficient. Please note that it is important to always make sure that the order in which cor.vars and cor.vars.names are entered is in sync. Otherwise, wrong column labels will be displayed.
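Dividing the significance level by the number of comparisons is the Bonferroni correction. The following base R sketch shows the arithmetic behind the 0.05/6 threshold and checks that comparing raw p-values against it gives the same decisions as comparing Bonferroni-adjusted p-values (via stats::p.adjust) against 0.05. The p-values are hypothetical and serve only as an illustration.

```r
# Bonferroni-style threshold for 6 comparisons of a priori interest
alpha <- 0.05
n_comparisons <- 6
threshold <- alpha / n_comparisons
print(signif(threshold, 3))

# hypothetical p-values, made up for illustration only
p_values <- c(0.001, 0.007, 0.009, 0.02, 0.04, 0.5)

# two equivalent ways of reaching the same decisions
significant_by_threshold <- p_values < threshold
significant_by_adjustment <- p.adjust(p_values, method = "bonferroni") < alpha

print(significant_by_threshold)
print(significant_by_adjustment)
```

Lowering the threshold, as the sig.level argument does in the call that follows, corresponds to the first of these two routes.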
```r
# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  cor.vars = c(price, carat, depth:table, x:z), # note how the variables are getting selected
  cor.vars.names = c(
    "price",
    "carat",
    "total depth",
    "table",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  type = "spearman",
  sig.level = 0.008,
  matrix.type = "lower",
  title = "Relationship between diamond attributes and price",
  subtitle = "Dataset: Diamonds from ggplot2 package",
  colors = c("#0072B2", "#D55E00", "#CC79A7"),
  pch = 7,
  # additional aesthetic arguments passed to `ggcorrmat`
  ggcorrplot.args = list(
    lab_col = "yellow",
    lab_size = 6,
    tl.srt = 90, # vertical labels for the x-axis (useful in case of long variable names)
    pch.col = "white",
    pch.cex = 14
  ),
  caption = substitute(
    paste(italic("Note"), ": Point shape denotes correlation non-significant at p < 0.008; adjusted for 6 comparisons")
  )
) + # modification outside ggstatsplot using ggplot functions
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 0.15, r = 0.15, b = 0.15, l = 0.15, unit = "cm")
    )
  )
```
As seen here, and unsurprisingly, the strongest predictor of the diamond price is its carat value, which is a unit of mass equal to 200 mg. In other words, the heavier the diamond, the more expensive it is going to be.
ggcorrmat
Another utility of ggcorrmat
is in obtaining a dataframe containing all details from statistical analyses. Such dataframes can be easily embedded in manuscripts as tables.
```r
# for reproducibility
set.seed(123)

# to get correlations
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::txhousing, size = 0.15),
  cor.vars = sales:inventory,
  output = "dataframe",
  type = "robust",
  digits = 3
)
#> # A tibble: 10 x 10
#>    parameter1 parameter2      r ci_low ci_high      t    df        p method
#>    <chr>      <chr>       <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl> <chr>
#>  1 sales      volume      0.980  0.978   0.982 172.    1202 0.       Perce~
#>  2 sales      median      0.469  0.423   0.512  18.3   1195 0.       Perce~
#>  3 sales      listings    0.930  0.921   0.937  83.0   1083 0.       Perce~
#>  4 sales      inventory  -0.385 -0.434  -0.333 -13.7   1080 0.       Perce~
#>  5 volume     median      0.550  0.509   0.588  22.7   1195 0.       Perce~
#>  6 volume     listings    0.898  0.886   0.909  67.1   1083 0.       Perce~
#>  7 volume     inventory  -0.368 -0.419  -0.316 -13.0   1080 0.       Perce~
#>  8 median     listings    0.403  0.352   0.451  14.5   1083 0.       Perce~
#>  9 median     inventory  -0.203 -0.259  -0.145  -6.80  1080 1.70e-11 Perce~
#> 10 listings   inventory  -0.163 -0.220  -0.104  -5.42  1080 7.25e- 8 Perce~
#> # ... with 1 more variable: nobs <int>
```
Note that if cor.vars is not specified, all numeric variables will be used. Moreover, you can also use abbreviations to specify what output you want in return.
```r
# for reproducibility
set.seed(123)

# show four digits in a tibble
options(pillar.sigfig = 4)

# getting the correlation coefficient matrix
ggstatsplot::ggcorrmat(
  data = iris, # all numeric variables from data will be used
  type = "np", # non-parametric
  output = "dataframe"
)
#> # A tibble: 6 x 9
#>   parameter1 parameter2     rho  ci_low   ci_high       s         p method  nobs
#>   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl> <chr>  <int>
#> 1 Sepal.Len~ Sepal.Wid~ -0.1668 -0.3185 -0.006695 656283. 4.137e- 2 Spear~   150
#> 2 Sepal.Len~ Petal.Len~  0.8819  0.8404  0.9131    66429. 3.443e-50 Spear~   150
#> 3 Sepal.Len~ Petal.Wid~  0.8343  0.7781  0.8773    93208. 4.189e-40 Spear~   150
#> 4 Sepal.Wid~ Petal.Len~ -0.3096 -0.4477 -0.1572   736637. 1.154e- 4 Spear~   150
#> 5 Sepal.Wid~ Petal.Wid~ -0.2890 -0.4294 -0.1350   725048. 3.343e- 4 Spear~   150
#> 6 Petal.Len~ Petal.Wid~  0.9377  0.9149  0.9545    35061. 8.157e-70 Spear~   150
```
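The "all numeric variables" behaviour can be mimicked in base R, which is handy for cross-checking the rho column shown above. This is only a sketch of the idea using stats::cor; it is not how ggcorrmat is implemented, and it returns a plain matrix without confidence intervals or p-values.

```r
# keep only the numeric columns of iris (drops the Species factor)
numeric_cols <- iris[sapply(iris, is.numeric)]

# Spearman rank correlation matrix for all numeric variables
rho <- cor(numeric_cols, method = "spearman")

print(round(rho, 4))
```

The off-diagonal entries of this matrix should match the pairwise rho values reported in the tibble, up to rounding.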
grouped_ggcorrmat
What if we want to do the same analysis separately for each quality of the diamond cut
(Fair, Good, Very Good, Premium, Ideal)?
ggstatsplot
provides a special helper function for such instances: grouped_ggcorrmat
. This is merely a wrapper function around ggstatsplot::combine_plots
. It applies ggcorrmat
across all levels of a specified grouping variable and then combines the list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.
```r
# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::grouped_ggcorrmat(
  # arguments relevant for ggstatsplot::ggcorrmat
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  type = "robust", # percentage bend correlation coefficient
  beta = 0.2, # bending constant
  p.adjust.method = "holm", # method to adjust p-values for multiple comparisons
  grouping.var = cut,
  title.prefix = "Quality of cut",
  cor.vars = c(carat, depth:z),
  cor.vars.names = c(
    "carat",
    "total depth",
    "table",
    "price",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  lab.size = 3.5,
  # arguments relevant for ggstatsplot::combine_plots
  title.text = "Relationship between diamond attributes and price across cut",
  title.args = list(size = 16, color = "red"),
  caption.text = "Dataset: Diamonds from ggplot2 package",
  caption.args = list(size = 14, color = "blue"),
  plotgrid.args = list(
    labels = c("(a)", "(b)", "(c)", "(d)", "(e)"),
    nrow = 3,
    ncol = 2
  )
)
```
Note that this function also makes it easy to run the same correlation matrix across different levels of a factor/grouping variable.
```r
# for reproducibility
set.seed(123)

# let's obtain correlation coefficients with their CIs
ggstatsplot::grouped_ggcorrmat(
  data = ggplot2::msleep,
  grouping.var = vore,
  output = "dataframe"
)
#> # A tibble: 60 x 11
#>    vore  parameter1 parameter2       r  ci_low  ci_high         t    df
#>    <chr> <chr>      <chr>        <dbl>   <dbl>    <dbl>     <dbl> <int>
#>  1 carni sleep_tot~ sleep_rem   0.9189  0.6864  0.9810   6.589e+0     8
#>  2 carni sleep_tot~ sleep_cyc~  0.3764 -0.7574  0.9449   7.038e-1     3
#>  3 carni sleep_tot~ awake      -1.000  -1.000  -1.000   -1.386e+3    17
#>  4 carni sleep_tot~ brainwt    -0.5244 -0.8815  0.2144  -1.629e+0     7
#>  5 carni sleep_tot~ bodywt     -0.4427 -0.7468  0.01441 -2.036e+0    17
#>  6 carni sleep_rem  sleep_cyc~  0.1216 -0.8521  0.9066   2.122e-1     3
#>  7 carni sleep_rem  awake      -0.9189 -0.9810 -0.6865  -6.589e+0     8
#>  8 carni sleep_rem  brainwt    -0.5006 -0.9331  0.5237  -1.157e+0     4
#>  9 carni sleep_rem  bodywt     -0.4786 -0.8516  0.2162  -1.542e+0     8
#> 10 carni sleep_cyc~ awake      -0.3764 -0.9449  0.7574  -7.038e-1     3
#> # ... with 50 more rows, and 3 more variables: p <dbl>, method <chr>,
#> #   nobs <int>
```
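Conceptually, a grouped analysis like this follows a split-apply-combine pattern: split the data by the grouping variable, run the same correlation analysis on each piece, and collect the results. The base R sketch below illustrates that pattern with the built-in iris data and stats::cor; it is an assumption-laden illustration of the idea, not the implementation of grouped_ggcorrmat.

```r
# split the numeric columns of iris by the grouping variable (Species)
groups <- split(iris[sapply(iris, is.numeric)], iris$Species)

# apply the same analysis (Pearson correlation matrix) to every group
cor_by_group <- lapply(groups, cor)

# one correlation matrix per level of the grouping variable
print(names(cor_by_group))
print(round(cor_by_group$setosa, 3))
```

Because the same function is applied to every group, all groups necessarily share the same settings.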
ggcorrmat
+ purrr
Although the grouped_ function is good for quickly exploring the data, it reduces the flexibility with which this function can be used. This is because the common parameters are applied to the plots for all levels of the grouping variable, and there is no way to customize the arguments for different levels of the grouping variable. We will see how this can be done using the purrr package.
See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
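To give a flavour of per-group customization without leaving this page, here is a minimal sketch that uses base R’s Map as a stand-in for the purrr approach described in the linked vignette. The per-group choice of method is hypothetical and picked purely for illustration; the example uses the built-in iris data and stats::cor rather than ggcorrmat.

```r
# split the numeric columns of iris by Species
groups <- split(iris[sapply(iris, is.numeric)], iris$Species)

# hypothetical per-group settings: a different coefficient for each level
methods <- c(
  setosa = "pearson",
  versicolor = "spearman",
  virginica = "kendall"
)

# iterate over groups and their settings in parallel
cor_custom <- Map(
  function(df, m) cor(df, method = m),
  groups,
  methods[names(groups)]
)

print(round(cor_custom$virginica, 3))
```

The key difference from the grouped_ function is that each level receives its own set of arguments instead of one shared set.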
The following tests are carried out for each type of analysis-
Type | Test | CI? |
---|---|---|
Parametric | Pearson’s correlation coefficient | Yes |
Non-parametric | Spearman’s rank correlation coefficient | Yes |
Robust | Percentage bend correlation coefficient | Yes |
Bayes Factor | Pearson’s correlation coefficient | Yes |
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues
For details, see- https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/session_info.html