Lab 5
tips <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/tips/tips.txt')
It is recommended that you create your own R script and follow along step by step with your instructor. Then, after the lab, you can reference the correct code from this file.
Objectives
Learn coding skills:
Simple data manipulation
Create scatter plots
Finding regression line (Constructing a Linear Model)
Explore possible relationships
We are going to be exploring the tips dataset from the course website. Before looking at the code below, can you successfully read in the data as a dataset called “tips”? Hint: first find the dataset on the course website.
Simple Data Manipulation
Sometimes it is useful to create new variables in a dataset that are a function of one or more other variables. Using the $ operator, you can add new variables to the dataset by specifying a new name using the syntax shown below.
$ Operator
dataset$new_variable
In the United States, we usually base our tip on a percentage of the total bill amount. Let’s add a new variable to the tips dataset showing the percent tipped for each bill.
# calculate percent tipped for each bill
## using the $ operator
tips$tip_perc <- tips$Tip / tips$TotBill * 100
## the with() function simplifies this by specifying the dataset to be used
tips$tip_perc <- with(tips, Tip / TotBill * 100)
# look at a quick summary of our new variable
summary(tips$tip_perc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.564  12.913  15.477  16.080  19.148  71.034
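As a quick check that the two approaches are interchangeable, the sketch below builds a tiny made-up data frame and adds the same column both ways. The name toy and its values are hypothetical and are not taken from the tips data.

```r
# Hypothetical toy data; these values are made up for illustration only
toy <- data.frame(TotBill = c(10, 20, 40),
                  Tip     = c(2, 3, 4))

# Method 1: refer to each column with the $ operator
toy$perc_dollar <- toy$Tip / toy$TotBill * 100

# Method 2: with() looks up the column names inside toy
toy$perc_with <- with(toy, Tip / TotBill * 100)

# Print the data frame and confirm the two new columns agree
toy
all.equal(toy$perc_dollar, toy$perc_with)
```

Either style works; with() mainly saves typing when a formula uses several columns from the same dataset.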
Scatter Plots
As discussed in lecture, when we are interested in visualizing the connection between two continuous variables, we can use a scatter plot.
geom_point()
Recall the general structure of a ggplot graph from Lab 4 and note the new geometry.
ggplot(data = dataset,              # step 1: set your data
       aes(x = x_var, y = y_var)    # step 2: define aesthetics
) +
  geom_point()                      # step 3: add geometries
Exercise 1
Using the structure above, create a scatter plot of the variables TotBill and Tip to visualize the relationship between the total bill and the amount tipped. What kind of association is there between these two variables?
Solution
library(ggplot2)
ggplot(tips,
       aes(x = TotBill, y = Tip)
) +
  geom_point()
The Regression Line
Recall that the regression line can be represented by the equation,
\[ y = \alpha + \beta x \]
where \(\alpha\) is the intercept and \(\beta\) is the slope. This is called a linear model. In R, the function to construct a linear model is lm(), and has a syntax pattern that closely matches the linear regression equation.
lm()
# syntax
lm(y_variable ~ x_variable, data = dataset)
Exercise 2
Create a linear model corresponding to the scatterplot we made. Our model will be using the Tip amount as the outcome of interest (y) and TotBill as a predictor (x). Print out the model and take turns with your neighbor interpreting the slope term.
Solution
my_model <- lm(Tip ~ TotBill, data = tips)
my_model
Call:
lm(formula = Tip ~ TotBill, data = tips)
Coefficients:
(Intercept) TotBill
0.9203 0.1050
Interpretation:
Printing the model gives you the intercept and slope. The “TotBill” coefficient, i.e., the slope, tells us that for every additional dollar a meal costs, the waiter can expect about 10.5 cents more in tip. The intercept tells us that, in theory, a bill of $0 would come with a 92-cent tip. (A $0 bill is not possible, so interpreting the model at 0 doesn’t make much sense.)
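If you want the intercept and slope as numbers you can compute with, rather than reading them off the printout, coef() returns them as a named vector. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not the tips data) and plugs the coefficients into the equation \(y = \alpha + \beta x\) by hand.

```r
# Hypothetical toy data (made-up values, not the tips dataset)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

toy_model <- lm(Tip ~ TotBill, data = toy)

# coef() returns a named vector: the intercept first, then the slope
b <- coef(toy_model)
intercept <- unname(b["(Intercept)"])
slope     <- unname(b["TotBill"])

# Slope expressed in cents of tip per extra dollar on the bill
slope * 100

# Expected tip for a $25 bill, straight from y = alpha + beta * x
intercept + slope * 25
```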
We can add the regression line to our plot as follows:
ggplot(tips,
       aes(x = TotBill, y = Tip)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
What does method = "lm", se = FALSE mean?
For a dataset of this size, the default method in geom_smooth is a “loess” curve (locally estimated scatterplot smoothing). The details are far beyond the scope of this course, but you might try Wikipedia for an introduction if you are interested. For this course, we are only interested in the linear model method of geom_smooth.
se stands for standard error. If you leave se = FALSE out of the function, geom_smooth draws a shaded band around the regression line showing its uncertainty (based on the SE). We turn it off for simplicity and neatness.
Finding Slope with Correlation
We can also calculate the slope using the output from the cor() function. By itself, cor() calculates the correlation coefficient between two vectors (shown below using the tip and total bill amounts). When we multiply that output by the ratio of the standard deviations of those two variables, we end up with the slope of the model.
# Correlation between tip and total bill:
tip_bill_corr <- cor(tips$Tip, tips$TotBill)
tip_bill_corr
[1] 0.6757341
tip_bill_corr * sd(tips$Tip) / sd(tips$TotBill)
[1] 0.1050245
The ratio takes the sd of our y variable as the numerator, and the sd of our x variable as the denominator. This is analogous to “rise over run”, as some of you might have learned years ago.
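To see that this identity is not specific to the tips data, the sketch below repeats the calculation on a small made-up data frame (toy and its values are hypothetical) and compares it with the slope reported by lm(). It also recovers the intercept, using the fact that the least-squares line passes through the point of means.

```r
# Hypothetical toy data (made-up values, not the tips dataset)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

# Slope from the correlation: r * sd(y) / sd(x)
r <- cor(toy$Tip, toy$TotBill)
slope_from_r <- r * sd(toy$Tip) / sd(toy$TotBill)

# Slope and intercept from the fitted linear model
toy_coefs <- coef(lm(Tip ~ TotBill, data = toy))
slope_from_lm     <- unname(toy_coefs["TotBill"])
intercept_from_lm <- unname(toy_coefs["(Intercept)"])

# The two slopes agree
all.equal(slope_from_r, slope_from_lm)

# The intercept follows from the means: alpha = mean(y) - beta * mean(x)
intercept_from_means <- mean(toy$Tip) - slope_from_r * mean(toy$TotBill)
all.equal(intercept_from_means, intercept_from_lm)
```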
While the correlation between two variables is the same regardless of order, this is not true for regression equations. Remember also that when regression is used for prediction, the order of the variables entered into lm() will change the slope and intercept terms.
Example
# note: par(mfrow) only affects base R graphics, so the two ggplot2 plots below are still drawn separately
par(mfrow = c(1,2))
ggplot(tips,
       aes(x = TotBill, y = Tip)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(tips,
       aes(x = Tip, y = TotBill)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
We can see visually that the slopes change, and we can also see this in the output of the two lm() calls.
lm(Tip ~ TotBill, tips)
Call:
lm(formula = Tip ~ TotBill, data = tips)
Coefficients:
(Intercept) TotBill
0.9203 0.1050
lm(TotBill ~ Tip, tips)
Call:
lm(formula = TotBill ~ Tip, data = tips)
Coefficients:
(Intercept) Tip
6.750 4.348
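Note that the two slopes above are not reciprocals of each other (1 / 0.1050 is about 9.5, not 4.348). Instead, their product equals the squared correlation: 0.1050 × 4.348 and 0.6757341² both come to about 0.457. The sketch below checks the same relationship on a small made-up data frame; toy and its values are hypothetical.

```r
# Hypothetical toy data (made-up values, not the tips dataset)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

# Slope when Tip is the outcome, and slope when TotBill is the outcome
slope_tip_on_bill <- unname(coef(lm(Tip ~ TotBill, data = toy))["TotBill"])
slope_bill_on_tip <- unname(coef(lm(TotBill ~ Tip, data = toy))["Tip"])

# Simply inverting the first slope does NOT give the second one
1 / slope_tip_on_bill
slope_bill_on_tip

# The product of the two slopes equals the squared correlation
r <- cor(toy$Tip, toy$TotBill)
all.equal(slope_tip_on_bill * slope_bill_on_tip, r^2)
```

The two slopes would be exact reciprocals only if the correlation were exactly 1 or -1.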
Exploring Relationships
Let’s explore more interesting questions about tipping behavior. Obviously the tip amount will increase as the total bill amount increases. We will use the tip_perc variable created earlier that tells us the tip percentage to more fairly make comparisons between groups.
Does the total bill affect the percentage that people tip?
- Make a plot
- Fit a model using the lm function. What are the slope and intercept?
Solution 1
ggplot(tips, aes(x = TotBill, y = tip_perc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    y = "Tip Rate",
    x = "Total Bill",
    title = "Tip Rate Decreases as Bill Increases"
  )
`geom_smooth()` using formula = 'y ~ x'
cor(tips$TotBill, tips$tip_perc)
[1] -0.3386241
model2 <- lm(tip_perc ~ TotBill, data = tips)
model2
Call:
lm(formula = tip_perc ~ TotBill, data = tips)
Coefficients:
(Intercept) TotBill
20.6766 -0.2323
We can see a moderate negative association between the total bill and the percent tipped. Why might this be the case?
What role does gender play in this data?
- Are men more likely to pick up the check than women?
Hint: We can see if men are more likely than women to pick up the check by simply comparing the proportions of men and women who paid the bill. We can look at a table that breaks down the proportion of males and females by using the table() and prop.table() functions.
- Does this depend on whether the meal is lunch or dinner?
Solution 2
Part a
gender_tab <- table(tips$Sex)
gender_tab # number of men/women
F M
87 157
prop.table(gender_tab) # proportion of men and women
F M
0.3565574 0.6434426
We can see that men cover the bill about 2/3 of the time and women about 1/3 of the time. Now, we are wondering how this behavior might differ based on the time of day the meal happens.
Part b
gender_vs_time <- table(tips$Sex, tips$Time)
prop.table(gender_vs_time, 2) # proportion of female/male given time of day
Day Night
F 0.5147059 0.2954545
M 0.4852941 0.7045455
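The second argument of prop.table() controls which totals the proportions are based on, which is easy to mix up. The sketch below uses a small table of made-up counts (toy_tab is hypothetical, not the tips data) to compare the three options.

```r
# Hypothetical counts, made up for illustration (not the tips data)
toy_tab <- matrix(c(3, 1,
                    2, 4),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Sex  = c("F", "M"),
                                  Time = c("Day", "Night")))

# No margin: each cell divided by the grand total
overall_props <- prop.table(toy_tab)

# Margin 1: each ROW sums to 1 (proportion of day/night within each sex)
row_props <- prop.table(toy_tab, 1)

# Margin 2: each COLUMN sums to 1 (proportion of F/M within each time of day)
col_props <- prop.table(toy_tab, 2)

col_props
colSums(col_props)
```

Margin 2 is the one used above, because the question conditions on the time of day.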
- How might we show the relationship from part b in a plot?
Solution 3
A barplot can show the relative differences in bill payment by gender.
ggplot(tips, aes(x = Time, fill = Sex)) +
  geom_bar()
Practice Problems
Further Data Exploration
For the following questions, use the methods we’ve learned about in previous labs to perform your own comparisons about different groups in the data and state your conclusion.
- Do smokers tip differently than nonsmokers?
Hint: Refer to Lab 4 to solve this problem, or ask your instructor. We have 1 continuous outcome and 1 categorical grouping variable.
Answer
by(tips$tip_perc, tips$Smoker, summary)
tips$Smoker: No
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.68 13.69 15.56 15.93 18.50 29.20
------------------------------------------------------------
tips$Smoker: Yes
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.564 10.677 15.385 16.320 19.506 71.034
ggplot(tips, aes(x = Smoker, y = tip_perc)) +
  geom_boxplot() +
  labs(
    title = "Do smokers tip differently than nonsmokers?",
    x = "Smoker in Party?",
    y = "Percent Tipped"
  )
After analyzing the table and side-by-side boxplots, it seems that smokers and non-smokers tip about the same percentage on their bills. They both had similar median tipping percentages around 15.5%. From the boxplot, we can see there is more variation in tipping behavior for smokers with two very high tips. Overall, I would conclude that smokers do not tip differently than nonsmokers.
- Does tipping behavior change at lunch versus dinner?
We can apply the same type of analysis to this question as to the question above. Our new categorical grouping variable is Time.
Answer
by(tips$tip_perc, tips$Time, summary)
tips$Time: Day
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.296 13.915 15.408 16.413 19.392 26.631
------------------------------------------------------------
tips$Time: Night
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.564 12.319 15.540 15.952 18.821 71.034
ggplot(tips, aes(x = Time, y = tip_perc)) +
  geom_boxplot() +
  labs(
    x = "Time",
    y = "Tip Rate",
    title = "Does tipping behavior change at lunch versus dinner?"
  )
The results look very similar to the previous question. We see there are some extreme tip percentages at night, so we should examine the medians as they are more robust to outliers. The medians between meals during the day and meals at night look very similar (approximately 15.5% for both). The box plots look to be about the same shape and spread, besides the few outliers. It seems that there is little difference in tipping percentages based on the time of day the meal occurred.
- Does tipping behavior differ by days of the week?
Again, this question can be completed by using a side-by-side boxplot and summary of the tipping percentages broken out by day.
Answer
by(tips$tip_perc, tips$Day, summary)
tips$Day: Fri
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.36 13.37 15.56 16.99 19.66 26.35
------------------------------------------------------------
tips$Day: Sat
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.564 12.386 15.183 15.315 18.827 32.573
------------------------------------------------------------
tips$Day: Sun
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.945 11.998 16.110 16.690 18.789 71.034
------------------------------------------------------------
tips$Day: Thu
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.296 13.821 15.385 16.128 19.269 26.631
ggplot(tips, aes(x = Day, y = tip_perc)) +
  geom_boxplot() +
  labs(
    x = "Day",
    y = "Tip Rate",
    title = "Does tipping behavior differ by days of the week?"
  )
Overall, tipping behavior appears consistent across the 4 days measured in this dataset. The median tip rates are all about 15-16%, and the variability is similar across days as well. Note that Sunday has two high outliers and Saturday has a couple of higher tips too. In general, however, we can conclude that tipping behavior does not differ by day of the week.
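When only one or two statistics per group are needed, tapply() and aggregate() are compact alternatives to by(). The sketch below is one possible approach on a small made-up data frame (toy and its values are hypothetical, not the tips data); it also shows why the medians were preferred above when outliers are present.

```r
# Hypothetical toy data (made-up values, not the tips dataset)
toy <- data.frame(
  Day      = c("Fri", "Fri", "Sat", "Sat", "Sat", "Sun"),
  tip_perc = c(15, 17, 10, 14, 30, 16)
)

# tapply() returns one value per group as a named vector
group_medians <- tapply(toy$tip_perc, toy$Day, median)
group_means   <- tapply(toy$tip_perc, toy$Day, mean)
group_medians
group_means

# aggregate() returns the same medians as a data frame, one row per group
median_table <- aggregate(tip_perc ~ Day, data = toy, FUN = median)
median_table
```

In this toy example the single 30% tip on Saturday pulls that day's mean well above its median, while the median barely reacts.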
Regression Review
- Suppose a table is 1 standard deviation above average in terms of total bill. How many dollars above average in terms of tip would you expect it to be?
Answer
We multiply the correlation between the two variables by the standard deviation of tip. Recall the notation from lecture slides:
\[ \tilde{y} = \overbrace{r}^{correlation} \tilde{x} \]
cor(tips$TotBill, tips$Tip) * sd(tips$Tip)
[1] 0.9349715
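In the original units, the standardized relationship can be unpacked as follows. The symbols \(\bar{x}\), \(\bar{y}\), \(s_x\), and \(s_y\) (the sample means and standard deviations of total bill and tip) are notation assumed here for illustration and may differ from the lecture slides.

\[ \tilde{x} = \frac{x - \bar{x}}{s_x}, \qquad \tilde{y} = \frac{y - \bar{y}}{s_y}, \qquad \tilde{y} = r \tilde{x} \;\Longrightarrow\; y - \bar{y} = r \, s_y \, \tilde{x} \]

Setting \(\tilde{x} = 1\) (one standard deviation above the average bill) leaves \(y - \bar{y} = r \, s_y\), which is the quantity computed in the code: about 93 cents.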
- Suppose a table’s tip is $2 above the average tip. How many dollars above the average total bill would you expect it to be?
- Hint: This is the same process as found on slide 6 of the Regression lecture notes.
Answer
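# here the tip plays the role of x and the total bill plays the role of y
Zx <- 2 / sd(tips$Tip)                  # $2 above average tip, in standard units
Zy <- Zx * cor(tips$Tip,tips$TotBill)   # predicted total bill, in standard units
Zy * sd(tips$TotBill)                   # convert back to dollars
[1] 8.695428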
- Suppose a table is $10 above the average total bill. What would we expect the tip to be?
Answer
# By hand:
Zx <- 10 / sd(tips$TotBill)
Zy <- Zx * cor(tips$TotBill,tips$Tip)
(y <- mean(tips$Tip) + Zy * sd(tips$Tip))
[1] 4.048524
# Using the model:
# if you have questions about this code, ask your instructor
model <- lm(Tip ~ TotBill, tips)
model$coefficients[1] + model$coefficients[2]*(mean(tips$TotBill)+10)
(Intercept)
   4.048524
# Using the raw numbers:
# (the result differs slightly from the two above because the printed coefficients are rounded)
0.9203 + 0.1050 * (mean(tips$TotBill) + 10)
[1] 4.047824
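The same prediction can also be obtained with predict(), which avoids indexing the coefficients by position. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not the tips data) and also checks a handy shortcut: because the regression line passes through the point of means, a bill $10 above average is predicted to have a tip 10 × slope above the average tip.

```r
# Hypothetical toy data (made-up values, not the tips dataset)
toy <- data.frame(TotBill = c(10, 20, 30, 40),
                  Tip     = c(2, 3, 5, 6))

toy_model <- lm(Tip ~ TotBill, data = toy)

# predict() needs a data frame whose column name matches the predictor
new_bill <- data.frame(TotBill = mean(toy$TotBill) + 10)
pred <- unname(predict(toy_model, newdata = new_bill))

# Shortcut: average tip plus 10 times the slope
shortcut <- mean(toy$Tip) + unname(coef(toy_model)["TotBill"]) * 10

c(pred = pred, shortcut = shortcut)
all.equal(pred, shortcut)
```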