Lab 4

Published

February 10, 2011


Objectives

  1. Introduce generation of graphics using the ggplot2 R package

  2. Practice generating plots

  3. Review for Quiz 1

Graphics in R Using ggplot2

While base R has graphical capabilities, we have decided to present material using ggplot2 for a few reasons:

  1. it generates a moderately nice plot “right off the shelf”,
  2. the syntax is consistent regardless of the type of plot you are trying to generate, and
  3. it is a powerful tool for data visualization and presentation used by many statisticians and scientists today.

Overview

“ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.” -(ggplot2 homepage)

An R package is a bundled collection of code, data, and documentation that extends the core functionality of the R language to perform specific tasks. Think of packages as specialized “apps” for your workspace: they let you share and reuse tools, like ggplot2 for visualization, across different projects without reinventing the wheel. Installed packages are stored in a directory on your computer called a “library”, ready to be loaded whenever you need them.¹

For any app that you’d like to use, you must first download it to your device, then open it any time you need it. In terms of packages, you must first install the package, then load it every time you need it. Just like apps are only downloaded once, you only need to install the package on your device once. Copy and paste the code below into your script and run it.

install.packages("ggplot2") # "downloading the app", do this once

Loading ggplot2

Running the code below is necessary to access the functionality of ggplot2, just like opening an app.

library(ggplot2) # "opening the app", do this every time
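
The install-once, load-every-time pattern can also be written so that a script is safe to rerun. The sketch below is one possible way to do it, not something the course requires: it uses base R's requireNamespace() to check whether ggplot2 is already installed and only calls install.packages() when it is missing. It assumes your computer can reach a CRAN mirror if an install is needed.

```r
# Install ggplot2 only when it is missing, then load it for this session
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")  # "downloading the app": happens once per computer
}
library(ggplot2)               # "opening the app": happens once per R session
```

With this version at the top of a script, rerunning the script does not download the package again.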

Grammar of Graphics

The framework of ggplot2 is quite intuitive. Consider the unconscious steps you take when you sketch a plot of some data. You probably begin by drawing the axes² and labeling them with numbers that show the scale. Then, if you’re visualizing data³, you add points, bars, or lines to show what your data looks like. If we wanted to summarize this into three general steps, they might be:

  1. Prepare/set/choose your data

  2. Define your aesthetics: this is what your plot will look like.
     a. Define axes
     b. Define variable colors

  3. Add geometries (points, bars, lines, etc.)

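General structure
# dataset, x_var, and y_var are placeholders for your own data and variables
ggplot(data = dataset,            # step 1: set your data
       aes(x = x_var, y = y_var)  # step 2: define aesthetics
       ) +
  geom_boxplot()                  # step 3: add a geometry
# Other geometries, such as geom_bar() or geom_histogram(), take the place of
# geom_boxplot(); choose the one that matches the plot you want.

Additional Geometries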

Additional geometries will come up throughout the semester and will appear in these yellow-colored boxes. You are also welcome to learn new geometries in the ggplot2 documentation⁴ for your own research interests, but you will only be assessed on the code presented in lecture and lab.
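
To make the three steps concrete before you try them yourself, here is a minimal sketch that fills in the general structure with real names. It uses mtcars, a dataset that ships with R, purely so the example runs without downloading anything; the choice of dataset and of the variables cyl and mpg is an assumption made for illustration and is not part of the lab data.

```r
library(ggplot2)

# mtcars is built into R: mpg = miles per gallon, cyl = number of cylinders
example_plot <- ggplot(data = mtcars,                     # step 1: set your data
                       aes(x = factor(cyl), y = mpg)) +   # step 2: define aesthetics
  geom_boxplot()                                          # step 3: add a geometry

example_plot  # printing the object draws the plot
```

Wrapping cyl in factor() tells ggplot2 to treat the number of cylinders as categories, so one box is drawn per group.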

Your Turn

Problem 1

Using the tips dataset, construct a histogram visualizing the distribution of the total bill amounts (variable = TotBill).

# code to get you started!
library(ggplot2)
tips <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/tips/tips.txt')

Solution
ggplot(tips, aes(x = TotBill)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that geom_histogram() only requires one continuous variable to be supplied to the x-aesthetic to show the frequency distribution.
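
The message printed under the solution is ggplot2 pointing out that it fell back on a default of 30 bins. A minimal sketch of choosing the bin width yourself is shown below. It uses the built-in mtcars data so that it runs offline, and the width of 2 is an arbitrary value picked for illustration; the same binwidth argument works with TotBill in the tips data.

```r
library(ggplot2)

# mtcars is built into R; mpg is a continuous variable
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)  # each bar spans 2 units of mpg

hist_plot
```

Because binwidth is supplied, ggplot2 no longer needs to guess, and the stat_bin() message does not appear.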

Problem 2

Now using the tips dataset, construct a boxplot visualizing the distribution of the total bill categorized by day of the week.

Which day has the highest median TotBill?

Solution
ggplot(tips, aes(x = Day, y = TotBill)) +
  geom_boxplot() 
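
Reading a median off a boxplot can be confirmed numerically. The sketch below shows one way to do that with base R's tapply(); it uses the built-in chickwts data (chick weight by feed type) as a stand-in so it runs without a download. For the lab data, the analogous call would be tapply(tips$TotBill, tips$Day, median).

```r
# chickwts is built into R: weight is numeric, feed is categorical
feed_medians <- tapply(chickwts$weight, chickwts$feed, median)
feed_medians  # one median per feed type

# Name of the group with the largest median
names(which.max(feed_medians))
```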

Problem 3

Finally, facet your boxplots from the previous problem by the variable Sex. Discuss the interpretation of this figure with your neighbor or a TA.

Solution
ggplot(tips, aes(x = Day, y = TotBill)) +
  geom_boxplot() +
  facet_wrap(~Sex) 

Remember:

  • the variable inside facet_wrap always needs to be a categorical variable

  • don’t forget the tilde ~ !

Problem 4

Using the lister dataset, construct a bar plot of the number of subjects in each treatment group (variable = ‘Group’).

Since you will be assessed on being able to read in data, it’s important to start learning!

Navigate to the datasets page on the course website, find the dataset, access the information icon on the far right, and copy and paste the code to read in your data.

If you’re totally lost on this one, ask your instructor to walk you through it. :)

Solution
# code to read in the data
lister <- read.delim('https://github.com/IowaBiostat/data-sets/raw/main/lister/lister.txt')

# construct the plot
ggplot(lister, aes(x = Group)) +
  geom_bar()

Notice how geom_bar() automatically counts the occurrences of each level of a categorical variable.
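
To see the automatic counting for yourself, you can compare the bar heights with a table you compute by hand. This is a small sketch using the built-in mtcars data, with cyl converted to a categorical variable; the object names are made up for the example. layer_data() is a ggplot2 helper that returns the values a layer computed.

```r
library(ggplot2)

# Treat the number of cylinders as a categorical variable
mtcars_cat <- transform(mtcars, cyl = factor(cyl))

bar_plot <- ggplot(mtcars_cat, aes(x = cyl)) +
  geom_bar()

# Counts computed by hand ...
table(mtcars_cat$cyl)

# ... match the bar heights that geom_bar() computed
layer_data(bar_plot)$count
```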

Problem 5

Again using the lister dataset, construct a side-by-side barplot showing the Outcome (Survived vs Died) for each Group.

Use fill = Group inside aes() to color the bars by the Group variable automatically. This also adds a legend to the plot.

For the side-by-side view, use position = "dodge" as an argument of geom_bar().

Solution
ggplot(lister, aes(x = Outcome, fill = Group)) +
  geom_bar(position = "dodge")

If you flew through these exercises and are waiting for the rest of us to catch up, go back through each plot you made and polish it. Try the following:

  1. Add a main plot title

  2. Add axis labels

  3. Color the bars by group in the bar plot

  4. Change the theme of your plot. You might try one of the following:

    • theme_minimal()

    • theme_classic()

    • theme_bw()

    • find another theme online

  5. When showing our data, we often use a plot in conjunction with a summary table.

    • look up the documentation for xtabs by typing ?xtabs in your console (bottom right pane). Can you create a contingency table for the data you displayed in Problem 5?

Solution (5)
# look up xtabs
?xtabs 
# Use xtabs to count the frequencies of Outcome within each Group
lister_table <- xtabs(~ Group + Outcome, data = lister)

# Print the table to the console
lister_table
         Outcome
Group     Died Survived
  Control   16       19
  Sterile    6       34
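
Items 1 to 4 of the polishing list above can be sketched in a single plot. The example below is only an illustration: it uses the built-in mtcars data rather than the lister data, and the title, axis labels, and choice of theme_minimal() are placeholders to replace with your own. labs() sets the title, axis labels, and legend title.

```r
library(ggplot2)

# Convert two mtcars variables to categorical (am: 0 = automatic, 1 = manual)
mtcars_cat <- transform(mtcars,
                        cyl = factor(cyl),
                        am  = factor(am, labels = c("Automatic", "Manual")))

polished <- ggplot(mtcars_cat, aes(x = cyl, fill = am)) +  # fill colors bars by group
  geom_bar(position = "dodge") +
  labs(title = "Cars by number of cylinders and transmission",
       x = "Number of cylinders",
       y = "Number of cars",
       fill = "Transmission") +
  theme_minimal()                                          # swap in another theme here

polished
```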

Quiz Review

Reminders for the Quiz
  • Quizzes take place in class and will replace the last half-hour of lecture.

  • Quizzes are open-book and open-note.

  • You may not use a laptop, cellphone, or any device capable of communication or internet access.

  • Bring a calculator (your phone is not a substitute).

-Reference: Syllabus

Practice

  1. A study published in the Journal of Wildlife Biology investigated the effectiveness of two treatments for a skin infection in platypuses. Researchers conducted a randomized controlled experiment involving 600 infected platypuses housed in wildlife rehabilitation centers across Australia.

Each platypus was first classified by infection severity (mild or severe). Within each severity group, platypuses were randomly assigned to receive one of two treatments:

  • Topical Ointment (TO)

  • Oral Antibiotics (OA)

Trained personnel administered the assigned treatment for 4 weeks and then examined the infection. The numbers of successful treatments and total cases, by infection severity and treatment, are shown in the table below:

Infection Severity   Treatment   Successful   Total
Mild                 TO                 180     200
Mild                 OA                  72      80
Severe               TO                  40     120
Severe               OA                 144     200

    a. Which treatment has the higher overall success rate?

    b. Which infection severity (mild or severe) has the higher overall success rate?

    c. Why does randomization of treatment within severity groups help reduce confounding in this study?

    d. Is the comparison of treatments still subject to confounding? Explain briefly.

  2. In sports analytics, it is now common to analyze thousands of performance metrics simultaneously. For example, an analyst studying the Winter Olympics might compare performance data from medal-winning athletes and non-medal-winning athletes in order to identify metrics that are significantly associated with winning a medal.

The analyst is therefore testing a separate null hypothesis for each performance metric that is measured.

Suppose that an analyst examines 5,000 performance metrics, of which 40 are truly associated with winning a medal. Suppose further that the analyst’s hypothesis tests have:

  • Type I error rate of 5%, and

  • Type II error rate of 25%.

    a. How many times did the analyst correctly reject the null hypothesis?

    b. How many type I errors were made?

  3. Suppose an investigator conducts a survey asking drivers whether they always use their car’s turn signals when changing lanes. The survey is conducted via in-person interviews at a busy intersection. What kind of bias is this study subject to?

Solutions
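
The arithmetic for the first two practice problems can be checked in R. The sketch below is one possible way to set it up and is not an official answer key: the object and column names are invented for this example, and the counts for the second problem are expected counts that follow from the stated error rates.

```r
# Practice 1: enter the table from the problem
platypus <- data.frame(
  severity   = c("Mild", "Mild", "Severe", "Severe"),
  treatment  = c("TO", "OA", "TO", "OA"),
  successful = c(180, 72, 40, 144),
  total      = c(200, 80, 120, 200)
)

# Overall success rate by treatment (pooled over severity)
by_treatment <- aggregate(cbind(successful, total) ~ treatment,
                          data = platypus, FUN = sum)
by_treatment$rate <- by_treatment$successful / by_treatment$total
by_treatment

# Overall success rate by severity (pooled over treatment)
by_severity <- aggregate(cbind(successful, total) ~ severity,
                         data = platypus, FUN = sum)
by_severity$rate <- by_severity$successful / by_severity$total
by_severity

# Practice 2: expected counts implied by the error rates
n_metrics <- 5000   # hypotheses tested
n_true    <- 40     # metrics truly associated with winning a medal
alpha     <- 0.05   # type I error rate
beta      <- 0.25   # type II error rate

correct_rejections <- n_true * (1 - beta)          # true associations detected
type_I_errors      <- (n_metrics - n_true) * alpha # true nulls wrongly rejected

correct_rejections
type_I_errors
```

Comparing the pooled rates in by_treatment with the rates within each severity group is a useful starting point for the confounding questions.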

Footnotes

  1. Explanation assisted by Google Gemini 3

  2. the horizontal and vertical lines, typically labeled X and Y

  3. which we will be doing a fair amount of in this class

  4. see ggplot Cheatsheet for advanced geometries