Lab 5

Published

February 17, 2018

Download the R code

It is recommended that you create your own R script and follow along step by step with your instructor. Then, after the lab, you can reference the correct code from this file.

Objectives

  1. Learn coding skills:

    1. Simple data manipulation

    2. Create scatter plots

    3. Find the regression line (construct a linear model)

  2. Explore possible relationships

We are going to be exploring the tips dataset from the course website. Before looking at the code below, can you successfully read in the data as a dataset called “tips”? Hint: first find the dataset on the course website.

tips <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/tips/tips.txt')

Simple Data Manipulation

Sometimes it is useful to create new variables in a dataset that are a function of one or more other variables. Using the $ operator, you can add new variables to the dataset by specifying a new name using the syntax shown below.

New Code: Create New Variables Using the $ Operator
dataset$new_variable <- value

In the United States, we usually base the tip on a percentage of the total bill amount. Let’s add a new variable to the tips dataset showing the percent tipped for each bill.

# calculate percent tipped for each bill
## using the $ operator
tips$tip_perc <- tips$Tip / tips$TotBill * 100

## the with() function simplifies this by specifying the dataset to be used
tips$tip_perc <- with(tips, Tip / TotBill * 100)

# look at a quick summary of our new variable
summary(tips$tip_perc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.564  12.913  15.477  16.080  19.148  71.034 

Scatter Plots

As discussed in lecture, when we are interested in visualizing the connection between two continuous variables, we can use a scatter plot.

New Geometry: geom_point()

Recall the general structure of a ggplot graph from Lab 4 and note the new geometry.

ggplot(data = dataset,           # step 1: set your data
       aes(x = x_var, y = y_var) # step 2: define aesthetics
       ) +
  geom_point()                 # step 3: add geometries

Exercise 1

Using the structure above, create a scatter plot of the variables TotBill and Tip to visualize the relationship between the total bill and the amount tipped. What kind of association is there between these two variables?

Solution
library(ggplot2)
ggplot(tips, 
       aes(x=TotBill, y=Tip)
      ) +
  geom_point()

Looking at this plot, we can see a positive association between the variables, along with a good amount of variation. In other words, it seems that as the bill increases, the tip does as well. We can use a linear model to find the equation of the straight line that best fits the points in this scatterplot. We can then use that model to add the line directly to our scatterplot.

The Regression Line

Recall that the regression line can be represented by the equation,

\[ y = \alpha + \beta x \]

where \(\alpha\) is the intercept and \(\beta\) is the slope. This is called a linear model. In R, the function to construct a linear model is lm(), and has a syntax pattern that closely matches the linear regression equation.

New Function: lm()
# syntax
lm(y_variable ~ x_variable, data = dataset)

Exercise 2

Create a linear model corresponding to the scatterplot we made. Our model will be using the Tip amount as the outcome of interest (y) and TotBill as a predictor (x). Print out the model and take turns with your neighbor interpreting the slope term.

Solution
my_model <- lm(Tip ~ TotBill, data = tips)
my_model

Call:
lm(formula = Tip ~ TotBill, data = tips)

Coefficients:
(Intercept)      TotBill  
     0.9203       0.1050  

Interpretation:

Printing the model itself gives you the intercept and slope. The “TotBill” coefficient, i.e., the slope, tells us that for every additional dollar a meal costs, the waiter can expect about 10.5 cents more in tip. The intercept tells us that, in theory, a bill of $0 would come with a 92-cent tip. (A $0 bill is not possible, so interpreting the model at 0 doesn’t make much sense.)
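
As a quick check on this interpretation, the sketch below plugs the rounded coefficients from the printed output into the regression equation. The helper name predict_tip is made up for this illustration.

```r
# Coefficients as printed in the model output (rounded to four decimals)
intercept <- 0.9203
slope <- 0.1050

# Hypothetical helper: predicted tip (in dollars) for a given total bill
predict_tip <- function(bill) intercept + slope * bill

predict_tip(20)                    # predicted tip for a $20 bill
predict_tip(21) - predict_tip(20)  # one extra dollar on the bill adds the slope
```

The second line returns the slope itself, which is exactly the "per additional dollar" reading of the TotBill coefficient.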

We can add the regression line to our plot as follows:

ggplot(tips, 
       aes(x=TotBill, y=Tip)
      ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

By default, geom_smooth() fits a “loess” curve (locally estimated scatterplot smoothing) to a dataset of this size. The details are far beyond the scope of this course, but you might try Wikipedia for an introduction if you are interested. For this course, we are only interested in the linear model method of geom_smooth().

se stands for standard error. If we leave se = FALSE out of the function, geom_smooth() draws a shaded band around the regression line that reflects its standard error. We turn it off for simplicity and neatness.

Finding Slope with Correlation

We can also calculate the slope using the output from the cor() function. By itself, the cor() function calculates the correlation coefficient between two vectors (shown below using the tip and total bill amounts). When we multiply that output by the ratio of the standard deviations of those two variables, we end up with the slope of the model.

# Correlation between tip and total bill: 
tip_bill_corr <- cor(tips$Tip, tips$TotBill)
tip_bill_corr
[1] 0.6757341
tip_bill_corr * sd(tips$Tip) / sd(tips$TotBill)
[1] 0.1050245
Note

The ratio takes the sd of our y variable as the numerator, and the sd of our x variable as the denominator. This is analogous to “rise over run”, as some of you might have learned years ago.
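
The sketch below checks this relationship on a small made-up dataset (the numbers are invented for illustration and are not the real tips data). It also relies on the fact that the least-squares line passes through the point of averages, which lets us recover the intercept by hand.

```r
# Made-up bills and tips for illustration only (not the real tips data)
toy <- data.frame(
  TotBill = c(10.5, 16.0, 21.3, 24.8, 30.1, 42.7),
  Tip     = c(1.5, 2.5, 3.0, 3.2, 5.0, 5.5)
)

fit <- lm(Tip ~ TotBill, data = toy)

# Slope from the correlation and the two standard deviations (sd of y over sd of x)
slope_from_cor <- cor(toy$TotBill, toy$Tip) * sd(toy$Tip) / sd(toy$TotBill)

# Intercept chosen so the line passes through (mean of x, mean of y)
intercept_from_means <- mean(toy$Tip) - slope_from_cor * mean(toy$TotBill)

coef(fit)                                # intercept and slope from lm()
c(intercept_from_means, slope_from_cor)  # the same two numbers by hand
all.equal(unname(coef(fit)), c(intercept_from_means, slope_from_cor))
```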

While the correlation between two variables is the same regardless of order, this is not true for regression equations. Remember also that when regression is used for prediction, the order of the variables entered into lm() will change the slope and intercept terms.

Example

Code
# note: par(mfrow = ...) only affects base R graphics, so the two ggplot2 plots are drawn separately
ggplot(tips, 
       aes(x=TotBill, y=Tip)
      ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Code
ggplot(tips, 
       aes(x=Tip, y=TotBill)
      ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

We can see visually that the slopes change, and we can also see this in the output of the two lm() calls.

lm(Tip ~ TotBill, tips)

Call:
lm(formula = Tip ~ TotBill, data = tips)

Coefficients:
(Intercept)      TotBill  
     0.9203       0.1050  
lm(TotBill ~ Tip, tips)

Call:
lm(formula = TotBill ~ Tip, data = tips)

Coefficients:
(Intercept)          Tip  
      6.750        4.348  
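
The second line is not simply the first line flipped around. If it were, its slope would be 1 divided by the first slope. A minimal check using the rounded numbers printed above suggests that the product of the two fitted slopes is close to the squared correlation rather than 1.

```r
# Values copied from the printed output (rounded)
slope_tip_on_bill <- 0.1050   # from lm(Tip ~ TotBill)
slope_bill_on_tip <- 4.348    # from lm(TotBill ~ Tip)
r <- 0.6757341                # from cor(tips$Tip, tips$TotBill)

1 / slope_tip_on_bill                  # slope we would get by just flipping the first line
slope_tip_on_bill * slope_bill_on_tip  # product of the two fitted slopes
r^2                                    # squared correlation
```

The two slopes would only be reciprocals of each other if the correlation were exactly 1 or -1.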

Exploring Relationships

Let’s explore more interesting questions about tipping behavior. Obviously the tip amount will increase as the total bill amount increases, so we will use the tip_perc variable created earlier (the tip as a percentage of the bill) to compare groups more fairly.

  1. Does the total bill affect the percentage that people tip?

    1. Make a plot

    2. Fit a model using the lm() function. What are the slope and intercept?

Solution 1
ggplot(tips, aes(x=TotBill, y=tip_perc)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(
    y="Tip Rate", 
    x="Total Bill", 
    title = "Tip Rate Decreases as Bill Increases"
  )
`geom_smooth()` using formula = 'y ~ x'

cor(tips$TotBill, tips$tip_perc)
[1] -0.3386241
model2 <- lm(tip_perc ~ TotBill, data = tips)
model2

Call:
lm(formula = tip_perc ~ TotBill, data = tips)

Coefficients:
(Intercept)      TotBill  
    20.6766      -0.2323  

We can see a moderate negative association between the total bill and the percent tipped. Why might this be the case?

  2. What role does gender play in this data?

    1. Are men more likely to pick up the check than women?

    Hint: We can see if men are more likely than women to pick up the check by simply comparing the proportions of men and women who paid the bill. We can look at a table that breaks down the proportion of males and females by using the table() and prop.table() functions.

    2. Does this depend on if the meal is lunch or dinner?
Solution 2

Part a

gender_tab <- table(tips$Sex) 
gender_tab # number of men/women

  F   M 
 87 157 
prop.table(gender_tab) # proportion of men and women

        F         M 
0.3565574 0.6434426 

We can see that men cover the bill about 2/3 of the time and women about 1/3 of the time. Next, we want to know whether this behavior differs by the time of day the meal happens.

Part b

gender_vs_time <- table(tips$Sex, tips$Time)
prop.table(gender_vs_time, 2) # proportion of female/male given time of day
   
          Day     Night
  F 0.5147059 0.2954545
  M 0.4852941 0.7045455
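
The second argument of prop.table() controls which totals the proportions are taken over. The sketch below uses a small made-up table (not the real tips data) to contrast the two choices.

```r
# Made-up payer records for illustration only (not the real tips data)
toy <- data.frame(
  Sex  = c("F", "F", "M", "M", "M", "F", "M", "M"),
  Time = c("Day", "Day", "Day", "Night", "Night", "Night", "Night", "Night")
)

tab <- table(toy$Sex, toy$Time)

prop.table(tab, 1)           # margin = 1: proportions within each row (Sex)
prop.table(tab, 2)           # margin = 2: proportions within each column (Time)
colSums(prop.table(tab, 2))  # each Time column sums to 1
```

Using margin 2 matches the question asked here: given the time of day, what share of payers are men or women?
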
  3. How might we show the relationship from question 2, part b, in a plot?
Solution 3

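A stacked barplot shows how many bills men and women paid at each time of day. (Adding position = "fill" inside geom_bar() would rescale each bar to proportions, matching the table in part b.)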

ggplot(tips, aes(x=Time, fill=Sex)) +
  geom_bar()

Practice Problems

Further Data Exploration

For the following questions, use the methods we’ve learned about in previous labs to perform your own comparisons about different groups in the data and state your conclusion.

  1. Do smokers tip differently than nonsmokers?

Hint: Refer to Lab 4 to solve this problem, or ask your instructor. We have 1 continuous outcome and 1 categorical grouping variable.

Answer
by(tips$tip_perc, tips$Smoker, summary)
tips$Smoker: No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.68   13.69   15.56   15.93   18.50   29.20 
------------------------------------------------------------ 
tips$Smoker: Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.564  10.677  15.385  16.320  19.506  71.034 
ggplot(tips, aes(x=Smoker, y=tip_perc)) +
  geom_boxplot() +
  labs(
    title = "Do smokers tip differently than nonsmokers?",
    x="Smoker in Party?", 
    y = "Percent Tipped"
  )

After analyzing the table and side-by-side boxplots, it seems that smokers and non-smokers tip about the same percentage on their bills. They both had similar median tipping percentages around 15.5%. From the boxplot, we can see there is more variation in tipping behavior for smokers with two very high tips. Overall, I would conclude that smokers do not tip differently than nonsmokers.

  2. Does tipping behavior change at lunch versus dinner?

We can apply the same type of analysis to this question as to the question above. Our new categorical grouping variable is Time.

Answer
by(tips$tip_perc, tips$Time, summary)
tips$Time: Day
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  7.296  13.915  15.408  16.413  19.392  26.631 
------------------------------------------------------------ 
tips$Time: Night
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.564  12.319  15.540  15.952  18.821  71.034 
ggplot(tips, aes(x=Time, y=tip_perc)) +
  geom_boxplot() +
  labs(
    x="Time", 
    y="Tip Rate", 
    title = "Does tipping behavior change at lunch versus dinner?"
  )

The results look very similar to the previous question. We see there are some extreme tip percentages at night, so we should examine the medians as they are more robust to outliers. The medians between meals during the day and meals at night look very similar (approximately 15.5% for both). The box plots look to be about the same shape and spread, besides the few outliers. It seems that there is little difference in tipping percentages based on the time of day the meal occurred.
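
To illustrate why the median is the safer summary here, the sketch below uses a handful of made-up tip percentages (not the real tips data) and adds one extreme value.

```r
# Made-up tip percentages for illustration only
typical <- c(12, 14, 15, 16, 18)
with_outlier <- c(typical, 71)   # add one extreme tip percentage

c(mean = mean(typical), median = median(typical))
c(mean = mean(with_outlier), median = median(with_outlier))
```

The single extreme value pulls the mean up by several percentage points, while the median barely moves.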

  3. Does tipping behavior differ by days of the week?

Again, this question can be completed by using a side-by-side boxplot and summary of the tipping percentages broken out by day.

Answer
by(tips$tip_perc, tips$Day, summary)
tips$Day: Fri
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.36   13.37   15.56   16.99   19.66   26.35 
------------------------------------------------------------ 
tips$Day: Sat
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.564  12.386  15.183  15.315  18.827  32.573 
------------------------------------------------------------ 
tips$Day: Sun
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.945  11.998  16.110  16.690  18.789  71.034 
------------------------------------------------------------ 
tips$Day: Thu
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  7.296  13.821  15.385  16.128  19.269  26.631 
ggplot(tips, aes(x=Day, y=tip_perc)) +
  geom_boxplot()+
  labs(
    x="Day", 
    y="Tip Rate", 
    title = "Does tipping behavior differ by days of the week?"
  )

Overall, tipping behavior appears consistent across the 4 days measured in this dataset. The median tip rates are all about 15-16%, and the variability is similar as well. Note that Sunday has two high outliers and Saturday has a couple of higher tips. In general, however, we can conclude that tipping behavior does not differ much by day of the week.

Regression Review

  1. Suppose a table is 1 standard deviation above average in terms of total bill. How many dollars above average in terms of tip would you expect it to be?
Answer

We multiply the correlation between the two variables by the standard deviation of tip. Recall the notation from lecture slides:

\[ \tilde{y} = \overbrace{r}^{\text{correlation}} \tilde{x} \]

cor(tips$TotBill, tips$Tip) * sd(tips$Tip)
[1] 0.9349715
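
Why this works, sketched on made-up numbers (not the real tips data): moving 1 standard deviation in total bill changes the predicted tip by the slope times sd(TotBill). Because the slope equals the correlation times sd(Tip) / sd(TotBill), the sd(TotBill) terms cancel.

```r
# Made-up bills and tips for illustration only (not the real tips data)
toy <- data.frame(
  TotBill = c(10.5, 16.0, 21.3, 24.8, 30.1, 42.7),
  Tip     = c(1.5, 2.5, 3.0, 3.2, 5.0, 5.5)
)

fit <- lm(Tip ~ TotBill, data = toy)

# Shortcut: 1 SD above average in x means r SDs above average in y
shortcut <- cor(toy$TotBill, toy$Tip) * sd(toy$Tip)

# Same quantity from the fitted slope: change in predicted tip per 1 SD of total bill
from_slope <- unname(coef(fit)["TotBill"]) * sd(toy$TotBill)

c(shortcut = shortcut, from_slope = from_slope)
```
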
  2. Suppose a table's tip is $2 above the average tip. How many dollars above the average total bill would you expect its bill to be?
  • Hint: This is the same process as found on slide 6 of the Regression lecture notes.
Answer
Zx <- 2 / sd(tips$Tip)
Zy <- Zx * cor(tips$Tip,tips$TotBill)
Zy * sd(tips$TotBill)
[1] 8.695428
  3. Suppose a table's total bill is $10 above the average total bill. What would we expect the tip to be?
Answer
# By hand:
Zx <- 10 / sd(tips$TotBill)
Zy <- Zx * cor(tips$TotBill,tips$Tip)
(y <- mean(tips$Tip) + Zy * sd(tips$Tip))
[1] 4.048524
# Using the model:
# if you have questions about this code, ask your instructor
model <- lm(Tip ~ TotBill, tips)
model$coefficients[1] + model$coefficients[2]*(mean(tips$TotBill)+10)
(Intercept) 
   4.048524 
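# Using the rounded coefficients from the printed output
# (the rounding is why this answer differs slightly from the two above):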
0.9203 + 0.1050 * (mean(tips$TotBill) + 10)
[1] 4.047824
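
The same kind of prediction can also be made with predict(), which avoids indexing the coefficients by position. A minimal sketch on made-up numbers (not the real tips data), where new_table is a name invented for this example:

```r
# Made-up bills and tips for illustration only (not the real tips data)
toy <- data.frame(
  TotBill = c(10.5, 16.0, 21.3, 24.8, 30.1, 42.7),
  Tip     = c(1.5, 2.5, 3.0, 3.2, 5.0, 5.5)
)

fit <- lm(Tip ~ TotBill, data = toy)

# A new table whose bill is $10 above the average bill;
# the column name must match the predictor used in lm()
new_table <- data.frame(TotBill = mean(toy$TotBill) + 10)

by_predict <- unname(predict(fit, newdata = new_table))
by_hand    <- unname(coef(fit)[1] + coef(fit)[2] * new_table$TotBill)

c(by_predict = by_predict, by_hand = by_hand)
```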