Andy Kaempf
September 5, 2016
Load necessary packages
```{r}
library(knitr); library(rmarkdown);
library(devtools); library(repmis);
library(tidyr); library(dplyr);
library(WDI); library(ggplot2);
library(googleVis); library(RJSONIO);
library(Hmisc); library(gridExtra);
```
Set working directory and global chunk options
```{r label=setup}
setwd('H:/Reproducible Research/Reproducibility study group/Andy files')
knitr::opts_chunk$set(message=FALSE, warning=FALSE, fig.path='Ch10_figures/');
```
knitr code chunk options for figures
ggplot2 graphics in presentation docs
googleVis package
A “non-knitted” graphic is one that already exists outside the knittable document and is not created by R code in a knitr code chunk.
Use the following code in a .Rmd file to include a non-knitted image

The ![]() syntax is similar to the []() syntax used to include a hyperlink.
Markdown does not allow you to resize or position your image, so you must use the HTML image (‘img’) tag and assign the image’s file path/name to the src attribute.
In .Rmd files, include images (like the below image of butterflies) that can be sized and aligned using HTML markup like so:
<img src="four_butterflies.jpg"
alt="Butterflies"
width="500px" height="375px"></img>
The HTML src attribute can also accept a URL.
This picture of lions (centered with ‘center’ tag) was included with the code:
<center><img src="http://thumbs.media.smithsonianmag.com//filer/two-male-lions-Kenya-631.jpg__800x600_q85_crop.jpg"
alt="Lions"
width="560px" height="420px"></img></center>
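When the same markup is needed for several images, the tag can also be generated from R. The sketch below is only an illustration: img_tag() is a hypothetical helper written for these notes (it is not part of knitr or rmarkdown), and it assumes the chunk that calls it uses results='asis' so that the printed tag is treated as raw HTML.

```r
# Hypothetical helper (not a knitr/rmarkdown function): build an HTML <img> tag.
# NULL arguments are dropped by c(), so only supplied attributes are written.
img_tag <- function(src, alt = "", width = NULL, height = NULL) {
  attrs <- c(src = src, alt = alt, width = width, height = height)
  paste0("<img ", paste0(names(attrs), '="', attrs, '"', collapse = " "), ">")
}

butterflies <- img_tag("four_butterflies.jpg", alt = "Butterflies",
                       width = "500px", height = "375px")

# In a chunk with results='asis', cat() writes the tag straight into the document
cat(butterflies, "\n")
```

The src argument can be a local file path or a URL, exactly as in the hand-written tags.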
In .Rnw files, link to images using the \includegraphics command:
\includegraphics[options]{file path}
A picture of four butterflies (Meyer 2006) is included in a LaTeX document with the code:
\begin{figure}[h]
\caption{Caption goes here}
\label{lab_name}
\includegraphics[width=3in,keepaspectratio=true]{Meyer_2006.png}
{\small{Source: \cite{meyer2006repeating}}}
\end{figure}
By linking to this image from within the figure float environment, the image can be “floated” away from the document text at a specified location and given a caption (\caption command).
LaTeX automatically numbers captioned figures, and the \label command names a figure so it can be cross-referenced. Use the \ref command where you want the figure’s number to be printed and \pageref for the figure’s page number.
POSITION_SPEC arguments go inside brackets after \begin{figure}: ‘h’ places the figure where it is written in the text (i.e., here), ‘t’ places it at the top of the page, and ‘b’ at the bottom of the page.
Notes about including pre-existing images with LaTeX markup
Load the graphicx package in the LaTeX document’s preamble (i.e., the top portion of the file, above the document environment)
width=3in sets the image width to 3 inches. Other image size or alignment options are: ‘scale’, ‘height’, ‘angle’, and ‘keepaspectratio’
scale=0.8 makes the image 80 percent of its actual size, and width=0.8\textwidth sets the image width to 80 percent of the document’s text width via the LaTeX command \textwidth
knitr will store figure files
R Markdown files: ‘out.height’ and ‘out.width’ take arguments with units of pixels (px)
R LaTeX files: ‘out.height’ and ‘out.width’ take arguments with units of inches (in), centimeters (cm), or scaled as a proportion of a page element
Goal: Create a scatterplot of cars’ speeds and stopping distances
Data Source: Preloaded cars dataset. Type ?cars into the R console to learn about this dataset.
Steps to produce the scatterplot:
Extra: plot() is a generic function, meaning it invokes UseMethod() to determine the class of its first argument and then search for a plot function specific to this class. If no class-specific function is found, plot.default() is called.
Other R default graphic commands are hist(), boxplot(), pie(), stars()
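The dispatch mechanism described for plot() can be seen in miniature with a home-made generic. This is a minimal sketch: describe_obj() and its methods are hypothetical names invented for illustration, but UseMethod() is the same mechanism plot() relies on.

```r
# Hypothetical generic: UseMethod() looks at the class of the first argument
describe_obj <- function(x, ...) UseMethod("describe_obj")

# Class-specific method, chosen when x is a data.frame
describe_obj.data.frame <- function(x, ...) "data.frame method"

# Fallback, chosen when no class-specific method exists
describe_obj.default <- function(x, ...) "default method"

from_df  <- describe_obj(cars)   # cars is a data.frame
from_int <- describe_obj(1:3)    # no method for integer, so the default is used

print(c(from_df, from_int))
```

To see which class-specific methods exist for the real generic, run methods(plot) in the console.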
Step #1: inspect the data
# look at variables in cars data.frame
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
str() gives the structure of an object and lists its components – which are two numeric variables: speed and dist
Step #2: create the scatterplot
```{r cars, fig.align='center', fig.height=6, fig.width=6, dev='jpeg'}
plot(x = cars$speed, y=cars$dist,
main = "Scatterplot: default R graphics",
xlab = "Speed (mph)", ylab = "Stopping Distance (feet)",
cex.lab = 1.25, cex.main = 1.5, pch=8)
abline(lsfit(cars$speed,cars$dist), lwd=2.5, col="red")
lines(lowess(cars$speed,cars$dist), lwd=2.5, col="blue")
legend(x=5, y=115, legend=c("linear regression","lowess smoother"),
lwd=2.5, col=c("red","blue"))
```
Note that a JPEG of the scatterplot is created by knitr (default is PNG for .Rmd)
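The red line in the scatterplot comes from lsfit(), an older least-squares interface. As a quick check (a sketch, not part of the original analysis), the more commonly used lm() gives the same intercept and slope, so abline(lm(dist ~ speed, data = cars)) would draw the same line.

```r
# Least-squares fit two ways: lsfit() as used in the plot, and lm()
fit_lsfit <- lsfit(cars$speed, cars$dist)
fit_lm    <- lm(dist ~ speed, data = cars)

# Drop the names (they differ: "Intercept"/"X" vs "(Intercept)"/"speed")
coef_lsfit <- unname(fit_lsfit$coefficients)
coef_lm    <- unname(coef(fit_lm))

# TRUE when both fits agree on intercept and slope
same_line <- isTRUE(all.equal(coef_lsfit, coef_lm))
print(same_line)
```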
A simpler version of this scatterplot can be created by sourcing a code file hosted on GitHub
source_url("https://raw.githubusercontent.com/christophergandrud/Rep-Res-Examples/master/Graphs/SimpleScatter.R")
Goal: Create scatterplot matrix of World Bank variables
Data source: World Bank’s Development Indicators database (The World Bank Group 2016), accessed by either:
WDI package, which uses the World Bank website API
repmis package’s source_data() function
Other R packages that enable direct downloads of web data through APIs include twitteR (for tweets and trending topics), quantmod (for Google and Yahoo finance data), and ZillowR (for real estate and mortgage data)
Steps to produce scatterplot matrix:
1. Load data and examine variable names
2. Subset the data (only include 2003 obs.) and remove ID variables
3. Use the pairs() function to create the figure
Step #1: load data from GitHub, list variable names, and make sure the three variables being plotted are quantitative
MainData <- source_data("https://raw.githubusercontent.com/christophergandrud/Rep-Res-Examples/master/DataGather_Merge/MainData.csv")
# make sure source_data() loads a data.frame object
class(MainData)
## [1] "data.frame"
names(MainData)
## [1] "V1" "iso2c" "year"
## [4] "country" "reg_4state" "disproportionality"
## [7] "FertilizerConsumption"
# shorten variable names with rename()
MainData2 <- dplyr::rename(MainData, Fert = FertilizerConsumption, disprop = disproportionality)
# sapply() applies a function to each element of the 1st arg and returns a vector
# sapply() is a wrapper for lapply(), which returns a list
sapply(MainData2[,5:7], class)
## reg_4state disprop Fert
## "integer" "numeric" "numeric"
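The difference between sapply() and lapply() noted in the comments above is easiest to see side by side. A small sketch using the preloaded cars data (chosen here only because it needs no download):

```r
# lapply() always returns a list, one element per column
classes_list <- lapply(cars, class)

# sapply() simplifies that list to a named character vector when it can
classes_vec <- sapply(cars, class)

str(classes_list)
print(classes_vec)
```

The simplification only happens when every result has the same length; otherwise sapply() also returns a list.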
Step #2: subset the data and assess the quantitative variables of interest
# these two commands do the same thing (only run 1 of them)
SubData <- MainData2 %>% dplyr::filter(year == 2003) %>% dplyr::select(reg_4state, disprop, Fert)
SubData2 <- subset(MainData2[, 5:7], MainData2$year == 2003)
# compute frequency counts of discrete var's levels
table(SubData$reg_4state)
##
## 1 2 3 4
## 16 15 19 33
# compute summary stats for continuous vars.
lapply(SubData[,c("disprop", "Fert")], summary, digits=3)
## $disprop
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.05 2.48 3.98 5.73 7.74 20.80 69
##
## $Fert
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.6 78.9 136.0 367.0 293.0 11600.0 2
# remove rows with very large fertilizer values
No_tail <- subset(SubData, Fert < 1000)
Extra: use the describe() function from the Hmisc package to provide extra summary info such as the number of unique values and the lowest 5 and highest 5 values
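One side effect of the subset() call that builds No_tail is worth knowing: subset() keeps only rows where the condition is TRUE, so rows with a missing Fert value are dropped along with the large ones. The sketch below uses a made-up four-row data frame (the values are hypothetical) to contrast this with bracket indexing.

```r
# Hypothetical toy data: one large value and one missing value
toy <- data.frame(Fert = c(50, 1200, NA, 300))

# subset() treats NA in the condition as FALSE, so the NA row is removed
kept_subset <- subset(toy, Fert < 1000)

# Bracket indexing keeps a row of NAs wherever the condition is NA
kept_bracket <- toy[toy$Fert < 1000, , drop = FALSE]

print(nrow(kept_subset))
print(nrow(kept_bracket))
```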
Step #3: create the scatterplot matrix with this code chunk:
```{r WB_plot, fig.cap="Caption is placed here", fig.align='center'}
pairs(x = No_tail, main="WB development indicators", lower.panel=NULL)
```
Caption is placed here
ggplot2 is an R package for creating publication-quality figures
geoms (geometric objects), which are the visual representations of data points
geom examples (listed as R function names):
According to the book ‘ggplot2: Elegant Graphics for Data Analysis, 2nd ed.’ (see Wickham 2016), all plots created by ggplot2 have three key components:
An example of these 3 components using the ggplot2 dataset ‘mpg’:
plot1 <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(span=0.3)
plot2 <- ggplot(data=mpg, aes(x=displ, y=hwy)) + geom_point() + geom_smooth(span=0.7)
gridExtra::grid.arrange(plot1, plot2, ncol=2)
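To make the three components easier to spot, the sketch below strips the example down to a single layer and labels each part. The labelling follows Wickham’s description (data, aesthetic mappings, and at least one layer, usually created with a geom function) as best recalled here; the object name p is arbitrary.

```r
library(ggplot2)

p <- ggplot(data = mpg,                          # 1. the data
            mapping = aes(x = displ, y = hwy)) + # 2. aesthetic mappings
  geom_point()                                   # 3. a layer, created by a geom

# Each geom_*() call adds one layer to the plot object
n_layers <- length(p$layers)
print(n_layers)
```

Adding geom_smooth(), as in plot1 and plot2, simply appends a second layer to the same data and mappings.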
Goal: Create line plot showing change over time of actual and forecasted inflation
Data source: stored on the author’s GitHub page
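Download data from a GitHub page using the repmis package’s source_data() command and passing the data file URL as the lone argument
Steps to produce the time series plot:
1. Download the data
2. Reshape the data with tidyr to prepare for using ggplot2
3. Create the multi-line time series plot using ggplot()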
Step #1: Download the data
```{r eval=TRUE, include=TRUE}
InflationData <- repmis::source_data("https://raw.githubusercontent.com/christophergandrud/Rep-Res-Examples/master/Graphs/InflationData.csv");
```
Determine variable names and inspect the first 5 observations
dim(InflationData)
## [1] 152 3
names(InflationData)
## [1] "Quarter" "ActualInflation" "EstimatedInflation"
head(InflationData,n=5)
## Quarter ActualInflation EstimatedInflation
## 1 1969.1 NA 4.55342
## 2 1969.2 3.5 4.80281
## 3 1969.3 3.5 5.28242
## 4 1969.4 3.3 5.14525
## 5 1970.1 3.7 5.54280
There are 3 variables in this dataset:
1. Quarter: the year and fiscal quarter
2. ActualInflation: the actual U.S. inflation rate for the given quarter
3. EstimatedInflation: Federal Reserve’s inflation forecast made two quarters prior
To plot a separate line for each inflation measurement, we need to restructure the data so that ActualInflation and EstimatedInflation become the values of a new variable (key) and the measures themselves are stored in a separate variable (value).
Step #2: Reshape data with the tidyr package:
# reshape data from wide to long format with `gather` command
GatheredInflation <- tidyr::gather(data=InflationData, key=key, value=value, 2:3)
# notice there are now twice as many obs. (hence, long format)
dim(GatheredInflation)
## [1] 304 3
# inspect the first 5 observations of the long-format dataset
head(GatheredInflation,n=5)
## Quarter key value
## 1 1969.1 ActualInflation NA
## 2 1969.2 ActualInflation 3.5
## 3 1969.3 ActualInflation 3.5
## 4 1969.4 ActualInflation 3.3
## 5 1970.1 ActualInflation 3.7
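In newer versions of tidyr (1.0.0 and later), gather() still works but pivot_longer() is the recommended way to do the same wide-to-long reshape. The sketch below applies it to a three-row toy copy of the inflation data (values taken from the head() output above), so it runs without a download; the object names are arbitrary.

```r
library(tidyr)

# Toy wide-format data: three quarters copied from the printed output
toy_wide <- data.frame(
  Quarter            = c(1969.2, 1969.3, 1969.4),
  ActualInflation    = c(3.5, 3.5, 3.3),
  EstimatedInflation = c(4.80281, 5.28242, 5.14525)
)

# Stack the two inflation columns into key/value pairs
toy_long <- pivot_longer(toy_wide,
                         cols = c(ActualInflation, EstimatedInflation),
                         names_to = "key", values_to = "value")

print(dim(toy_long))
```

Unlike gather(), pivot_longer() keeps the rows for each quarter together, so the row order differs even though the content is the same.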
Step #3: Create the multi-line time series plot using ggplot()
```{r label=time_plot, fig.align='center'}
LinePlot <- ggplot2::ggplot(data = GatheredInflation,
aes(x = Quarter, y = value, color = key, linetype = key)) +
geom_line() +
scale_color_discrete(name="", labels=c("Actual","Estimated")) +
scale_linetype(name="", labels=c("Actual","Estimated")) +
xlab("\n Quarter") + ylab("Inflation\n") + theme_bw(base_size = 15)
print(LinePlot)
```
Explanation of ggplot2 functions used:
You can reproduce this analysis by sourcing the R code file from GitHub at: http://bit.ly/VEvGJG. To run the entire analysis just shown, include the following R code in a chunk:
devtools::source_url("http://bit.ly/VEvGJG")
Goal: Create a caterpillar plot showing estimates of a regression model’s parameters
Data source: swiss dataset preloaded with R. This dataset provides fertility and socioeconomic indicator data for 47 Swiss provinces from 1888.
Steps to produce the caterpillar plot:
Note: A frequentist rather than a Bayesian normal linear regression model (like the one shown in the book) is fit here because the Zelig package has been modified and the author’s code no longer works (according to the book’s Errata page).
Step #1 – inspect the data
# determine number of observations and variables in data.frame
dim(swiss)
## [1] 47 6
# list variable names
names(swiss)
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
# show distribution of outcome measure
hist(x=swiss$Examination, xlab="% military draftees with highest score",
main = "Histogram of Examination variable", border="dodgerblue")
Steps #2 and #3 – fit the model with lm() and reformat output
Model the percentage of draftees receiving the highest army exam score (Examination) as a function of educational level (Education), male involvement in agriculture (Agriculture), proportion of Catholics (Catholic), and infant mortality rate (Infant.Mortality). Each observation is a separate Swiss province.
# fit the model
LinearModel <- lm(Examination ~ Education + Agriculture +
Catholic + Infant.Mortality, data = swiss)
# create the summary object and save the coefficient estimates
Model_coeff_DF <- data.frame(summary(LinearModel)$coefficients)
# make row.names attribute into a new column/variable
Model_coeff_DF$Coeff <- row.names(Model_coeff_DF)
# extract the confidence limits for the coefficients using confint()
CI_DF <- data.frame(confint(LinearModel))
# make row.names attribute a new column of CI data.frame
CI_DF$Coeff <- row.names(CI_DF)
# merge the two data frames to combine the point and interval estimates
merged_DF <- merge(Model_coeff_DF, CI_DF, by = "Coeff")
# remove intercept so coefficients are plotted on a more interpretable scale
final_DF <- subset(merged_DF, Coeff != "(Intercept)")
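The plotting code in Step #4 refers to columns named X2.5.. and X97.5.., which can look mysterious. They arise because data.frame() converts the confint() column names into syntactically valid R names. A short sketch, refitting the same model under the arbitrary name fit:

```r
fit <- lm(Examination ~ Education + Agriculture +
            Catholic + Infant.Mortality, data = swiss)

# confint() returns a matrix whose column names contain spaces and a % sign
ci_names_raw <- colnames(confint(fit))

# data.frame() runs the names through make.names(): "X" is prefixed and
# the invalid characters become dots
ci_names_df <- names(data.frame(confint(fit)))

print(ci_names_raw)
print(ci_names_df)
```

Passing check.names = FALSE to data.frame() would keep the original names, at the cost of needing backticks to refer to them.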
Step #4: create the caterpillar plot
```{r label=caterpillar, eval=TRUE}
CatPlot <- ggplot(data = final_DF, aes(x = reorder(Coeff, X2.5..),
y = Estimate, ymin = X2.5.., ymax = X97.5..)) +
geom_pointrange(size = 1.4) +
geom_hline(aes(yintercept = 0), linetype = "dotted") +
ggtitle("Caterpillar plot of regression estimates\n") +
xlab("Variable\n") + ylab("\nCoefficient Estimate") +
coord_flip() + theme_bw(base_size = 20)
print(CatPlot)
```
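The reorder() call inside aes() is what sorts the coefficients along the axis: it is a base R function that reorders factor levels by a second variable, here the lower confidence limit. A minimal sketch with made-up numbers (the three limits below are hypothetical, not the model’s):

```r
# Hypothetical coefficient labels and lower confidence limits
coeff <- factor(c("Education", "Agriculture", "Catholic"))
lower <- c(0.3, -0.2, -0.3)

# Levels are re-sorted from the smallest to the largest lower limit
ordered_coeff <- reorder(coeff, lower)

print(levels(coeff))          # alphabetical by default
print(levels(ordered_coeff))  # sorted by the values in `lower`
```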
New ggplot2 commands:
Goal: Create a boxplot inside a violin plot showing blood pressure distribution by gender
Data source: Online via Univ. of Washington professor Ken Rice’s webpage and graphically analyzed in a paper (Rice and Lumley 2015)
R code modified from the OHSU Jamboree on Data Visualization (June 2016)
Steps to produce the violin plot:
Step #1: Download the data in 2 ways:
using read_csv() from the readr package
heart <- readr::read_csv("http://faculty.washington.edu/kenrice/heartgraphs/nhaneslarge.csv", na=".")
using source_data() from the repmis package
heart <- source_data("http://faculty.washington.edu/kenrice/heartgraphs/nhaneslarge.csv")
Step #2: Examine var. names and summary stats for outcome measure (by sex)
names(heart)
## [1] "BPXSAR" "BPXDAR" "BPXDI1" "BPXDI2" "race_ethc"
## [6] "gender" "DR1TFOLA" "RIAGENDR" "BMXBMI" "RIDAGEYR"
summary(heart[heart$gender == "Female",]$BPXSAR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 104.0 114.7 118.0 126.0 216.0
summary(heart[heart$gender == "Male",]$BPXSAR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85.0 111.0 120.8 121.8 131.0 191.3
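Subsetting once per group works, but tapply() computes a statistic for every group in one call. The sketch below uses a six-row toy data frame with hypothetical blood pressure values so that it runs without downloading the NHANES file; only the column names match the real data.

```r
# Hypothetical toy data with the same column names as the heart data
toy_heart <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  BPXSAR = c(104, 114, 126, 111, 121, 131)
)

# One median per level of gender
medians <- tapply(toy_heart$BPXSAR, toy_heart$gender, median)
print(medians)
```

Replacing median with summary returns the full six-number summary for each group.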
Step #3: Create the violin plot overlaid with a boxplot
```{r label=violin, fig.align='center'}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = 0.5, size = 0.8) +
geom_boxplot(width = 0.2, size = 0.8, fill="cyan", outlier.colour=NA) +
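# fun= replaces the deprecated fun.y= argument (ggplot2 3.3.0 and later)
stat_summary(fun=median, geom="point", fill="white", shape=21, size=4) +
stat_summary(fun=mean, geom="point", fill="red", shape=21, size=4) +
# BPXSAR (systolic BP) is mapped to y, so its label belongs on the y axis
labs(title="SBP distribution by sex (NHANES)", x="", y="Systolic BP (mmHg)")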
```
New ggplot2 commands:
Thus far, we have only dealt with static graphics from base R and ggplot2
googleVis is an R package that accesses Google Chart Tools via Google’s Visualization API in order to display interactive tables, figures, and maps (written in JavaScript). The Google Chart Tool utilized in this example is GeoChart
Goal: Create a choropleth map showing Fertilizer Consumption (in kilograms per hectare of arable land) for countries across the globe
Data source: Use the same data from the World Bank Development Indicators website that was used for the scatterplot matrix. The data can be downloaded from this website using the WDI package or from GitHub with the source_data() command
Steps to produce choropleth map:
1. Subset the data.frame to concentrate on 2003 observations
2. Remove countries with very small fertilizer values and log-transform
3. Use the googleVis package’s gvisGeoChart() function to create the map
Extra: a choropleth map uses graded differences in color shading (or various symbols or patterns) inside of defined map regions to represent the value of some continuous measurement
Steps #1 and #2: Subset data and take natural log to reduce skew
```{r eval=TRUE}
Data_2003 <- subset(MainData, year == 2003 & FertilizerConsumption > 0.1)
Data_2003$LogConsumption <- round(log(Data_2003$FertilizerConsumption),
digits = 1)
```
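To see why the log transform helps the color scale, the sketch below applies the same round(log()) step to the five-number summary of Fert reported earlier (minimum, quartiles, and maximum). On the raw scale the maximum is roughly 40 times the third quartile; on the log scale the values are spread far more evenly.

```r
# Min, 1st quartile, median, 3rd quartile, and max of Fert from the summary output
fert <- c(1.6, 78.9, 136, 293, 11600)

# Same transformation as the chunk above: natural log, rounded to 1 decimal
log_fert <- round(log(fert), digits = 1)

print(log_fert)
```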
Step #3: Create global map of fertilizer consumption using gvisGeoChart()
```{r geo_chart, eval=TRUE, results='asis'}
FCMap <- gvisGeoChart(data = Data_2003, locationvar = "iso2c",
colorvar = "LogConsumption",
options = list(colors = "['#ECE7F2', '#A6BDDB', '#2B8CBE']",
width = 936, height = 600))
print(FCMap, tag = "chart")
```
Alternatively, create this interactive map by sourcing the author’s R script stored on GitHub
```{r eval=FALSE, results='asis'}
# Create and print Global map of fertilizer consumption
devtools::source_url("http://bit.ly/VNnZxS")
```
The chunk option results='asis' must be set in order to have the interactive image (instead of the JavaScript code) appear in the HTML presentation document
Interactive JavaScript graphics cannot be directly incorporated in R LaTeX documents
References for JavaScript interactive graphics:
googleVis website
Meyer, Axel. 2006. “Repeating Patterns of Mimicry.” PLoS Biol 4 (10). Public Library of Science: e341.
Rice, Kenneth, and Thomas Lumley. 2015. “Effective Graphs for Data Display: Recommendations for Authors.”
The World Bank Group. 2016. “World Development Indicators.” http://data.worldbank.org/data-catalog/world-development-indicators.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Springer Science & Business Media.