RRRR Ch.10: Showing Results with Figures

Andy Kaempf

September 5, 2016

Set-up: prepare R environment

Load necessary packages

```{r}
library(knitr);     library(rmarkdown);
library(devtools);  library(repmis);
library(tidyr);     library(Zelig);
library(ggplot2);   library(googleVis);
library(WDI);
```

 

Set working directory and global chunk options

```{r label=setup}
setwd('H:/Reproducible Research/Reproducibility study group/Andy files')
knitr::opts_chunk$set(message=FALSE, warning=FALSE, fig.path='Ch10_figures/');
```

Outline of “Slidy” HTML presentation

Advantages of including dynamic figures

  1. In R Markdown or LaTeX files, figures are not copy-and-pasted like in MS Word. Instead, figures are included by linking to the image file (.png, .pdf, .jpg)
    • if an image changes, the presentation document will reflect this change the next time the markup document is compiled. No recopying and repasting is necessary.
  2. Dynamic figures are sized and aligned with markup code rather than by pointing and clicking. This takes more time up front but saves time when making updates, allows for consistent formatting for all included figures, and is reproducible
  3. Because the images are not actually loaded in markup documents, this method of linking avoids the lag time often experienced while editing a Word doc that contains many images

Non-knitted graphics in R Markdown (slide 1)

A “non-knitted” graphic is one that already exists outside the knittable document and is not created by R code in a knitr code chunk.

Use the following code in a .Rmd file to include a non-knitted image

  ![text](file path)
  

Non-knitted graphics in R Markdown (slide 2)

Markdown does not allow you to resize or position your image so you must use the HTML image (‘img’) tag and assign the image’s file path/name to the src attribute.

In .Rmd files, include images (like the below image of butterflies) that can be sized and aligned using HTML markup like so:

  <img src="four_butterflies.jpg" 
   alt="Butterflies" 
   width="500px" height="375px"></img>

Butterflies

Non-knitted graphics in R Markdown (slide 3)

The HTML src attribute can also accept a URL.

This picture of lions (centered with ‘center’ tag) was included with the code:

  <center><img src="http://thumbs.media.smithsonianmag.com
   //filer/two-male-lions-Kenya-631.jpg__800x600_q85_crop.jpg" 
   alt="Lions" 
   width="560px" height="420px"></img></center>
Lions

Non-knitted graphics in R LaTeX (slide 1)

In .Rnw files, link to images using the includegraphics command:

  \includegraphics[options]{file path}

A picture of four butterflies is included in a LaTeX document with the code:

  \begin{figure}[h]
    \caption{Caption goes here}
    \label{lab_name}
    \includegraphics[width=3in,keepaspectratio=true]{Meyer_2006.png}
    {\small{Source: \cite{meyer2006repeating}}}
  \end{figure}

By linking to this image from within the figure float environment, the image can be “floated” away from the document text at a specified location and given a caption (\caption command)

LaTeX automatically numbers figures that are cross-referenced with the \label command. Use \ref command where you want the figure’s number to appear and \pageref to show the figure’s page number

POSITION_SPEC arguments go inside brackets after \begin{figure}
– ‘h’ places the table where it is written in the text (ie, here), ‘t’ places it at the top of the page, and ‘b’ at the bottom of the page

Non-knitted graphics in R LaTeX (slide 2)

Notes about including pre-existing images with LaTeX markup

Code chunk figure options

R Markdown files: ‘out.height’ and ‘out.width’ take arguments with units of pixels (px)

R LaTeX files: ‘out.height’ and ‘out.width’ take arguments with units of inches (in), centimeters (cm), or scaled as a proportion of a page element

R default graphics – Example #1 (slide 1)

Goal: Create a scatterplot of cars’ speeds and stopping distances

Data Source: Preloaded cars dataset. Type ?cars into the R console to learn about this dataset.

Steps to produce the scatterplot:

  1. Inspect the data
  2. use plot() function to create the plot

 

Extra: plot() is a generic function, meaning it invokes UseMethod() to determine the class of its first argument and then search for a plot function specific to this class. If no class-specific function is found, plot.default() is called.

Other R default graphic commands are hist(), boxplot(), pie(), stars()

R default graphics – Example #1 (slide 2)

Step #1: inspect the data

# look at variables in cars data.frame
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

str() gives the structure of an object and lists its components – which are two numeric variables: speed and dist

Step #2: create the scatterplot

```{r cars, fig.align='center', fig.height=6, fig.width=6, dev='jpeg'}
plot(x = cars$speed, y=cars$dist, 
     main = "Scatterplot:  default R graphics", 
     xlab = "Speed (mph)", ylab = "Stopping Distance (feet)", 
     cex.lab = 1.25, cex.main = 1.5)
```

Note that a JPEG of the scatterplot is created by knitr (default is PNG for .Rmd)

R default graphics – Example #1 (slide 3)

Alternately, this scatterplot can be created by sourcing a code file hosted on GitHub

source_url("https://raw.githubusercontent.com/christophergandrud
            /Rep-Res-Examples/master/Graphs/SimpleScatter.R")

R default graphics – Example #2 (slide 1)

Goal: Create scatterplot matrix of World Bank variables

Data source: World Bank’s Development Indicators database, accessed by:

Other R packages that enable direct downloads of web data by utilizing API’s are found here. Examples include twitteR (for tweets and trending topics), quantmod (for Google and Yahoo finance data), and ZillowR (for real estate and mortgage data)

Steps to produce scatterplot matrix:
1. Load data and examine variable names
2. Subset the data (only include 2003 obs.) and remove ID variables
3. use pairs() function to create the figure

R default graphics – Example #2 (slide 2)

Step #1: load data from GitHub, list variable names, and make sure the three variables being plotted are quantitative

MainData <- source_data("https://raw.githubusercontent.com/christophergandrud
                         /Rep-Res-Examples/master/DataGather_Merge/MainData.csv")
# make sure source_data() loads a data.frame object
class(MainData)
## [1] "data.frame"

names(MainData)
## [1] "V1"                    "iso2c"                 "year"                 
## [4] "country"               "reg_4state"            "disproportionality"   
## [7] "FertilizerConsumption"

# sapply() applies a function to each element of the 1st arg and returns a vector
# sapply() is a wrapper for lapply(), which returns a list
sapply(MainData[,5:7], class)
##            reg_4state    disproportionality FertilizerConsumption 
##             "integer"             "numeric"             "numeric"

R default graphics – Example #2 (slide 3)

Step #2: subset the data and assess the quantitative variables of interest

SubData <- subset(MainData[, 5:7], MainData$year == 2003)
# compute frequency counts of discrete var's levels
table(SubData$reg_4state)
## 
##  1  2  3  4 
## 16 15 19 33

# compute summary stats for continuous vars. 
lapply(SubData[,c("disproportionality", "FertilizerConsumption")], summary)
## $disproportionality
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.050   2.485   3.985   5.734   7.738  20.810      69 
## 
## $FertilizerConsumption
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##     1.595    78.950   136.400   367.400   292.800 11570.000         2

# remove rows with very large fertilizer values
No_tail <- subset(SubData, FertilizerConsumption < 1000)

R default graphics – Example #2 (slide 4)

Step #3: create the scatterplot matrix with this code chunk:

```{r WB_plot, fig.cap="Caption is placed here", fig.align='center'}
pairs(x = No_tail, main="WB development indicators", lower.panel=NULL)
```
Caption is placed here

Caption is placed here

References for R default graphics

Intro to knitting ggplot2 graphics (slide 1)

Intro to knitting ggplot2 graphics (slide 2)

According to Hadley Wickham’s book ‘ggplot2: Elegant Graphics for Data Analysis’ (2nd edition, 2016), all plots created by ggplot2 have three key components:

  1. The data
  2. The aesthetic mapping of variables to visual features
  3. Layers that describe how to render the observations. Most layers are created with a ‘geom’ function

An example of these 3 components using the ggplot2 dataset ‘mpg’:

ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(span=0.7)

ggplot2 – Example #1 (slide 1)

Goal: Create line plot showing change over time of actual and forecasted inflation

Data source: stored on the author’s GitHub page

Download data from a GitHub page using the repmis package’s source_data() command and passing the data file URL as the lone argument  

Steps to produce the time series plot:

  1. Load data and examine data structure
  2. Reshape data with tidyr to prepare for using ggplot2
  3. Use ggplot() function to create the line plot

ggplot2 – Example #1 (slide 2)

Step #1: Download the data

```{r eval=TRUE, include=TRUE}
InflationData <- repmis::source_data("https://raw.githubusercontent.com
/christophergandrud/Rep-Res-Examples/master/Graphs/InflationData.csv");
```

Determine variable names and inspect the the first 5 observations

names(InflationData);
## [1] "Quarter"            "ActualInflation"    "EstimatedInflation"
head(InflationData,n=5);
##   Quarter ActualInflation EstimatedInflation
## 1  1969.1              NA            4.55342
## 2  1969.2             3.5            4.80281
## 3  1969.3             3.5            5.28242
## 4  1969.4             3.3            5.14525
## 5  1970.1             3.7            5.54280

There are 3 variables in this dataset:
1. Quarter: the year and fiscal quarter
2. ActualInflation: the actual U.S. inflation rate for the given quarter
3. EstimatedInflation: Federal Reserve’s inflation forecast made two quarters prior

ggplot2 – Example #1 (slide 3)

To plot a separate line for each inflation measurement, we need to restructure the data so that ActualInflation and EstimatedInflation become the values of a new variable (key) and the measures themselves are stored in a separate variable (value).

Step #2: Reshape data with tidyr package:

# reshape data from wide to long format with `gather` command 
GatheredInflation <- tidyr::gather(data=InflationData, key=key, value=value, 2:3)

# inspect the first 5 observations of the long-format dataset
head(GatheredInflation,n=5)
##   Quarter             key value
## 1  1969.1 ActualInflation    NA
## 2  1969.2 ActualInflation   3.5
## 3  1969.3 ActualInflation   3.5
## 4  1969.4 ActualInflation   3.3
## 5  1970.1 ActualInflation   3.7

ggplot2 – Example #1 (slide 4)

Step #3: Create the multi-line time series plot using ggplot()

```{r label=time_plot, fig.align='center'}
LinePlot <- ggplot2::ggplot(data = GatheredInflation, 
        aes(x = Quarter, y = value, color = variable, 
        linetype = variable)) + geom_line() + 
        scale_color_discrete(name="", labels=c("Actual","Estimated")) + 
        scale_linetype(name="", labels=c("Actual","Estimated")) + 
        xlab("\n Quarter") + ylab("Inflation\n") + theme_bw(base_size = 15)
print(LinePlot)
```

ggplot2 – Example #1 (slide 5)

Explanation of ggplot2 functions used:

You can reproduce this analysis by sourcing the R code file from GitHub at: http://bit.ly/VEvGJG. To run the entire analysis just shown, include the following R code in a chunk:

    devtools::source_url("http://bit.ly/VEvGJG")

ggplot2 – Example #2 (slide 1)

Goal: Create a caterpillar plot showing estimates of a regression model’s parameters

Data source: swiss dataset preloaded with R. This dataset provides fertility and socioeconomic indicator data for 47 Swiss provinces from 1888.

Steps to produce the caterpillar plot:

  1. Inspect the data
  2. Fit the linear regression model (see ‘Note’ below)
  3. Extract and reformat model output to prepare for plotting
  4. Use ggplot() function to create the plot

Note: A frequentist rather than a Bayesian normal linear regression model (like what is shown in the book) is fit here because the Zelig package has been modified and the author’s code no longer works (according to the book’s Errata page.)

ggplot2 – Example #2 (slide 2)

Step #1 – inspect the data

# determine number of observations and variables in data.frame
dim(swiss)
## [1] 47  6

# list variable names
names(swiss)
## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"

# show distribution of outcome measure
hist(x=swiss$Examination, xlab="% military draftees with highest score", 
    main = "Histogram of Examination variable", border="dodgerblue")

ggplot2 – Example #2 (slide 3)

Steps #2 and #3 – fit the model with lm() and reformat output

Model the percentage of draftees receiving the highest army exam score (Examination) as a function of educational level (Education), male involvement in agriculture (Agriculture), proportion of catholics (Catholic), and infant mortality rate (Infant.Mortality). Each observation is a separate Swiss province.

# fit the model
LinearModel <- lm(Examination ~ Education + Agriculture + 
                  Catholic + Infant.Mortality, data = swiss)

# create the summary object and save the coefficient estimates 
Model_coeff_DF <- data.frame(summary(LinearModel)$coefficients)

# make row.names attribute into a new column/variable
Model_coeff_DF$Coeff <- row.names(Model_coeff_DF)

# extract the confidence limits for the coefficients using confint()
CI_DF <- data.frame(confint(LinearModel))

# make row.names attribute a new column of CI data.frame
CI_DF$Coeff <- row.names(CI_DF)

# merge the two data frames to combine the point and interval estimates 
merged_DF <- merge(Model_coeff_DF,CI_DF)

# remove intercept so coefficients are plotted on a more interpretable scale
final_DF <- subset(merged_DF, Coeff != "(Intercept)")

ggplot2 – Example #2 (slide 4)

Step #4: create the caterpillar plot

```{r label=caterpillar, eval=TRUE}
CatPlot <- ggplot(data = final_DF, aes(x = reorder(Coeff, X2.5..), 
           y = Estimate, ymin = X2.5.., ymax = X97.5..)) + 
           geom_pointrange(size = 1.4) + 
           geom_hline(aes(yintercept = 0), linetype = "dotted") + 
           ggtitle("Caterpillar plot of regression estimates\n")
           xlab("Variable\n") + ylab("\nCoefficient Estimate") + 
           coord_flip() + theme_bw(base_size = 20)
print(CatPlot)
```

New ggplot2 commands:

ggplot2 – Example #2 (slide 5)

ggplot2 – Example #3 (slide 1)

Goal: Create a boxplot inside a violin plot showing blood pressure distribution by gender

Data source: Online via Univ. of Washington professor Ken Rice’s webpage

Example and code modified from the OHSU Jamboree on Data Visualization (June 2016)

Steps to produce the violin plot:

  1. Load the data
  2. Examine the variables
  3. Use ggplot() function to create the figure

ggplot2 – Example #3 (slide 2)

Step #1: Download the data in 2 ways:

  1. using read_csv() from readr package

    heart <- read_csv("http://faculty.washington.edu/kenrice
                       /heartgraphs/nhaneslarge.csv", na=".")
  2. using source_data() from repmis package

    heart <- source_data("http://faculty.washington.edu/kenrice
                          /heartgraphs/nhaneslarge.csv")

Step #2: Examine var. names and summary stats for outcome measure (by sex)

names(heart)
##  [1] "BPXSAR"    "BPXDAR"    "BPXDI1"    "BPXDI2"    "race_ethc"
##  [6] "gender"    "DR1TFOLA"  "RIAGENDR"  "BMXBMI"    "RIDAGEYR"

summary(heart[heart$gender == "Female",]$BPXSAR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   104.0   114.7   118.0   126.0   216.0

summary(heart[heart$gender == "Male",]$BPXSAR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    85.0   111.0   120.8   121.8   131.0   191.3

ggplot2 – Example #3 (slide 3)

Step #3: Create the violin plot overlaid with a boxplot

```{r label=violin, fig.align='center'} 
ggplot(heart, aes(x = gender, y=BPXSAR)) + 
geom_violin(alpha = 0.5, size = 0.8) + 
geom_boxplot(width = 0.2, size = 0.8, fill="cyan", outlier.colour=NA) + 
stat_summary(fun.y=median, geom="point", fill="white", shape=21, size=4) +
stat_summary(fun.y=mean, geom="point", fill="red", shape=21, size=4) +
labs(title="SBP distribution by sex (NHANES)", x = "Systolic BP (mmHg)", y="")
```

New ggplot2 commands:

ggplot2 – Example #3 (slide 4)

Knitted interactive graphics (slide 1)

Thus far, we have only dealt with static graphics from base R and ggplot2

googleVis is an R package that accesses Google Chart Tools via Google’s Visualization API in order to display interactive tables, figures, and maps (written in JavaScript). The Google Chart Tool utilized in this example is GeoChart

Goal: Create a choropleth map showing Fertilizer Consumption (in kilograms per hectare of arable land) for countries across the globe

Data source: Use same data from World Bank Development Indicators website that was used for the scatterplot matrix. The data can be downloaded from this website using WDI package or from GitHub with the source_data() command

Steps to produce choropleth map:
1. Subset the data.frame to concentrate on 2003 observations
2. remove countries with very small fertilizer values and log-transform
3. use googleVis package’s gvisGeoChart() function to create the map

Extra: a choropleth map uses graded differences in color shading (or various symbols or patterns) inside of defined map regions to represent the value of some continuous measurement

Knitted interactive graphics (slide 2)

Steps #1 and #2: Subset data and take natural log to reduce skew

```{r eval=TRUE}
Data_2003 <- subset(MainData, year == 2003 & FertilizerConsumption > 0.1)
Data_2003$LogConsumption <- round(log(Data_2003$FertilizerConsumption), 
                            digits = 1)
```

Step #3: Create global map of fertilizer consumption using gvisGeoChart()

```{r geo_chart, eval=TRUE, results='asis'}
FCMap <- gvisGeoChart(data = Data_2003, locationvar = "iso2c", 
         colorvar = "LogConsumption", 
         options = list(colors = "['#ECE7F2', '#A6BDDB', '#2B8CBE']", 
         width = 936, height = 600))
print(FCMap, tag = "chart")
```

Interactive choropleth map of fertilizer consumption

Knitted interactive graphics (slide 4)

Alternatively, create this interactive map by sourcing the author’s R script stored on GitHub

```{r eval=FALSE, results='asis'}
# Create and print Global map of fertilizer consumption 
devtools::source_url("http://bit.ly/VNnZxS")
```

 

References: