Lab 7
# code to get you started
pearson <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/pearson/pearson.txt')
Objectives:
Practice with Binomial
Review for Quiz
Binomial Distribution
Recall that the binomial distribution is given by the following probability function:
\[ P(X = x) = \frac{n!}{x!(n-x)!} \pi^x(1-\pi)^{n-x}, \quad x = 0, 1, \ldots, n \]
where \(n\) is the total number of trials, \(x\) is the number of successes, and \(\pi\) is the probability of success on each trial.
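As a quick check of the formula, the sketch below evaluates it term by term and compares the result with R's built-in dbinom(). The values n = 10, x = 3, and a success probability of 0.5 are illustrative assumptions, not part of the lab.

```r
# Illustrative values (assumed for this example only)
n <- 10      # number of trials
x <- 3       # number of successes
p <- 0.5     # probability of success on each trial (pi in the formula)

# Binomial probability written out term by term
by_hand <- factorial(n) / (factorial(x) * factorial(n - x)) * p^x * (1 - p)^(n - x)

# The same probability from R's built-in binomial mass function
built_in <- dbinom(x, size = n, prob = p)

all.equal(by_hand, built_in)
```

In practice dbinom() is preferable, since the factorials overflow for large n.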
Practice Problems
1
A few weeks ago, we examined the Pearson dataset to show the correlation between the heights of 1078 fathers and their fully grown sons. Suppose we want to investigate how many sons were taller than their fathers. Conduct a test of the null hypothesis that the proportion of sons taller than their fathers is equal to the proportion of sons shorter than their fathers.
Hint: First find how many sons are taller than their fathers.
There are a couple ways we can solve this problem.
We can approach the problem step by step: find how many sons are taller, find how many observations we have, and put these values into binom.test.
# create a vector of true and falses
# this is interpreted in R as 1's and 0's
sons_taller <- with(pearson, Son > Father)
binom.test(
sum(sons_taller), # sum up all the 1's = 686
length(sons_taller) # use total number of observations = 1078
)
Exact binomial test
data: sum(sons_taller) and length(sons_taller)
number of successes = 686, number of trials = 1078, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.6068424 0.6651381
sample estimates:
probability of success
0.6363636
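Alternatively, you can use xtabs() and pipe the result into binom.test() for a quick and clean result. Note that xtabs() lists the FALSE count first, so binom.test() treats "son not taller than father" (392 of 1078) as the success. Because the null value is 0.5, the p-value is identical to the one above; the estimate and confidence interval are those of the complementary proportion (0.3636 = 1 - 0.6364). Use whichever approach is more intuitive to you.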
xtabs(~ Son > Father, pearson) |>
binom.test()
Exact binomial test
data: xtabs(~Son > Father, pearson)
number of successes = 392, number of trials = 1078, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3348619 0.3931576
sample estimates:
probability of success
0.3636364
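The two outputs above report complementary proportions because binom.test() treats the first count it is given as the number of successes. A minimal sketch, using only the counts shown in the outputs (686 sons taller, 392 not taller), illustrates that the p-value does not depend on which group is labeled the success when the null value is 0.5:

```r
taller     <- 686   # sons taller than their fathers (from the first output)
not_taller <- 392   # remaining sons (from the second output)

# A length-2 vector is read as c(successes, failures)
fit_taller     <- binom.test(c(taller, not_taller))
fit_not_taller <- binom.test(c(not_taller, taller))

# Ratio of the two p-values is 1 under the symmetric null of 0.5
isTRUE(all.equal(fit_taller$p.value / fit_not_taller$p.value, 1))

# The two estimates are complements of each other
unname(fit_taller$estimate) + unname(fit_not_taller$estimate)
```

To make xtabs() report the proportion of taller sons directly, one option is to pass the counts to binom.test() in the order (TRUE, FALSE).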
This problem may be especially helpful for Homework 6.
2
Recall: The binomial distribution has the following characteristics:
- There are a specific number of trials (n), each with a binary outcome
- The n trials are independent
- The probability of success (\(\pi\)) is constant with each trial
For which of the following scenarios could we apply a binomial distribution?
- The number of jackpots in 1,000 pulls of a slot machine
Answer
Yes.
- The number of people who get sick in a 5-person household
Answer
No. One person getting sick likely changes the probability that others in the household get sick, so the trials are not independent (this fails the 2nd condition).
- The number of free throws Caitlin Clark makes in 10 attempts
Answer
Yes. We assume Caitlin’s probability of a successful throw won’t change for each throw.
- The number of questions a student answers correctly on a multiple choice test (they are not randomly guessing).
Answer
No. If a student is not randomly guessing, the probability of success changes from question to question depending on their level of knowledge for that question.
3
Suppose we have a standard 6-sided die, and we roll the die 10 times. Consider the following questions.
- What’s the probability that we get at least two rolls of 4?
Answer
We can find this probability by taking 1 - the probability that we get strictly less than 2 rolls of 4.
# Recall that pbinom(q, ...) gives the probability of getting less than or equal to q.
# Hence, for our first argument we put 1 instead of 2.
1 - pbinom(1, size = 10, prob = 1/6)
[1] 0.5154833
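The complement calculation above can be cross-checked in two other ways, sketched here: by summing the individual probabilities of 2 through 10 fours with dbinom(), and by asking pbinom() for the upper tail directly.

```r
n <- 10
p <- 1/6

via_complement <- 1 - pbinom(1, size = n, prob = p)                  # 1 - P(X <= 1)
via_sum        <- sum(dbinom(2:n, size = n, prob = p))               # P(X = 2) + ... + P(X = 10)
via_upper_tail <- pbinom(1, size = n, prob = p, lower.tail = FALSE)  # P(X > 1)

c(via_complement, via_sum, via_upper_tail)
```

All three expressions describe the same event, "at least two 4's", so they should agree up to floating-point error.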
- Suppose that we roll five 4’s. Using the above techniques, find the p-value. Do you believe that the die is biased?
Answer
We can do this simply using the binom.test function.
binom.test(5, 10, p = 1/6)$p.value
[1] 0.01546197
The p-value is small, indicating evidence that the die may be biased.
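To see where this p-value comes from, the sketch below rebuilds it from dbinom(): the two-sided exact test adds up the probabilities of every outcome that is no more likely under the null than the observed five 4's. The small tolerance factor is included only to guard against floating-point ties.

```r
n  <- 10
p0 <- 1/6
observed <- 5

probs <- dbinom(0:n, size = n, prob = p0)        # P(X = 0), ..., P(X = 10) under the null
p_obs <- dbinom(observed, size = n, prob = p0)   # probability of the observed count

# Sum the probabilities of outcomes at least as unlikely as the observed one
# (the factor 1 + 1e-7 is a tolerance for floating-point ties)
p_value <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_value
binom.test(observed, n, p = p0)$p.value
```

Here only the counts 5 through 10 qualify, because every count from 0 to 4 is more probable than 5 when the die is fair.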
Quiz Review
Probability
Visual
Suppose the probability that
a potato is a Yukon Gold is 1/3.
a potato is mashed, given that it was Yukon Gold, is 3/4.
a potato is mashed, given that it was NOT Yukon Gold, is 1/2.
- What is the probability that a potato is both Yukon Gold AND mashed?
Answer
Multiplication Rule:
\(P(Mashed \cap Yukon) = P(Mashed | Yukon)*P(Yukon)\)
= (3/4)(1/3) = 1/4
- What is the probability that a potato is mashed?
Answer
Law of Total Probability & Multiplication Rule:
\(P(Mashed) = P(Mashed | Yukon)*P(Yukon) + P(Mashed | Yukon^C)*P(Yukon^C)\)
= (3/4)(1/3) + (1/2)(1-1/3) = 7/12
- What is the probability that a potato is Yukon Gold, given that it is mashed?
Answer
\(P(Yukon | Mashed) = \frac{P(Yukon \cap Mashed)}{P(Mashed)}\)
= (1/4) / (7/12) = 3/7
- What is the probability that a potato is Yukon Gold OR mashed?
Answer
Addition Rule:
\(P(Yukon \cup Mashed) = P(Yukon) + P(Mashed) - P(Yukon \cap Mashed)\)
= (1/3) + (7/12) - (1/4) = 2/3
- Assuming picking potatoes involves independent events, what is the probability that I pick two Yukon Golds in a row (with replacement)?
Answer
\(P(Yukon)*P(Yukon) = (1/3)^2 = 1/9\)
- Is the event that a given potato is Yukon Gold independent of the event that a given potato is mashed?
Answer
No. \(P(Yukon|Mashed) = \frac{3}{7} \ne \frac{1}{3} = P(Yukon)\)
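The potato calculations above can be reproduced in a few lines of R, which is a convenient way to check the arithmetic. The variable names are illustrative.

```r
p_yukon            <- 1/3   # P(Yukon)
p_mash_given_yukon <- 3/4   # P(Mashed | Yukon)
p_mash_given_other <- 1/2   # P(Mashed | not Yukon)

p_both             <- p_mash_given_yukon * p_yukon                  # multiplication rule: 1/4
p_mashed           <- p_both + p_mash_given_other * (1 - p_yukon)   # law of total probability: 7/12
p_yukon_given_mash <- p_both / p_mashed                             # conditional probability: 3/7
p_either           <- p_yukon + p_mashed - p_both                   # addition rule: 2/3
p_two_yukon        <- p_yukon^2                                     # two independent draws: 1/9

# Independence would require P(Yukon | Mashed) to equal P(Yukon)
isTRUE(all.equal(p_yukon_given_mash, p_yukon))
```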
Summary Statistics
- Estimate the median of blood pressure both before and after surgery.
Answer
The median blood pressure before surgery is about 130 and the median blood pressure after surgery is approximately 80.
- Are there any outliers? If so, identify them and state how they impact the mean and the median?
Answer
The “before surgery” blood pressures have one outlier at about 220. This does not affect the median, but it has a large effect on the mean.
- What percentage of individuals had a systolic blood pressure lower than 80 post surgery?
Answer
About 50% (since the median is at about 80)
- 75 percent of individuals had a pre-surgery systolic blood pressure lower than what?
Answer
About 160 (the 3rd quartile is at about 160)
Histograms
- Are the data symmetrically distributed, right-skewed, or left-skewed?
Answer
The data are skewed right (the tail trails off to the right).
- How does the mean compare to the median in this data?
Answer
The mean will be higher than the median (since the data are right-skewed).
- Estimate the number of college students that got between 10 and 12 hours of sleep, on average.
Answer
Approximately 40 college students slept between 10 and 12 hours, on average, in this data set. The y-axis denotes the frequency (number of college students) and we can see that the bars of the histogram each contain 2 hours of sleep. This means we can find the number of college students that got between 10 and 12 hours of sleep by looking at the height (frequency) of the 5th bar from the left.
Correlation and Regression
A study investigated the effect of classical music for studying on test scores for students in a biostatistics class. The data is given below: time spent is in minutes per week; test scores given as percentage.
Mean Time Listening to Classical Music: 54.44
Std Dev of Time Listening to Classical Music: 7.3
Mean Test Scores: 80.36
Std Dev of Test Scores: 9.9
Correlation: 0.3919
- If we have a student who listens to classical music 1 standard deviation less than the average student, how many standard deviations below average would we predict their exam score to be?
Answer
\(Z_{score} = r*Z_{time}\)
\(Z_{score} = 0.3919 \times (-1) = -0.3919\), i.e., the predicted exam score is 0.3919 standard deviations below average.
- If we have a student who scores 3 standard deviations below average on their exam, how much time do we predict that they spend listening to classical music per week?
Answer
\(Z_{time} = r*Z_{score}\)
\(Z_{time} = 0.3919 \times (-3) = -1.1757\), i.e., 1.1757 standard deviations below average for music
\(\text{predicted time} = \text{avg. time} + SD_{time} * Z_{time}\)
\(\text{predicted time} = 54.44 + 7.3 \times (-1.1757) = 45.86\) mins
- If a student listens to 5 minutes more classical music per week than average, what is their predicted test score?
Answer
Method 1: Using the Correlation
\(Z_{x} = \frac{x - \overline{x}}{SD_x}\)
\(Z_{time} = \frac{5}{7.3} = 0.685\)
\(Z_{score} = r*Z_{time}\)
\(Z_{score} = 0.3919*0.685 = 0.268\)
\(\text{predicted score} = \text{avg. score} + SD_{score} * Z_{score}\)
\(y = 80.36 + 9.9*0.268 = 83.017\)
Method 2: Using the Regression Equation
\(y = \overline{y} + \hat{\beta} *(x-\overline{x})\)
\(y = 80.36 + 0.531768*5 = 83.019\)
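Both prediction methods can be sketched in R from the summary statistics alone. The slope here is computed as r * SD_score / SD_time, which is an assumption about how the regression coefficient relates to the summary table; it gives about 0.5315, slightly different from the 0.531768 used above, presumably because that value came from the unrounded data.

```r
mean_score <- 80.36   # mean test score
sd_score   <- 9.9     # SD of test scores
sd_time    <- 7.3     # SD of listening time (minutes per week)
r          <- 0.3919  # correlation

# Method 1: work in standard units
z_time  <- 5 / sd_time                      # 5 minutes above average, in SDs
z_score <- r * z_time                       # predicted SDs above average for the test score
pred_1  <- mean_score + sd_score * z_score

# Method 2: regression line through the point of averages
slope  <- r * sd_score / sd_time            # assumed formula for the slope
pred_2 <- mean_score + slope * 5

c(method_1 = pred_1, method_2 = pred_2)
```

The two methods are algebraically the same calculation, so any difference between hand-computed answers comes from rounding intermediate values.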
Contingency Table
Below is data about a fictitious HIV rapid-diagnostic test. Please fill in the rest of the table and answer the following questions. Assume the prevalence in this population is 0.1%.
| Disease | Test Positive | Test Negative | Total |
|---|---|---|---|
| Present | 2,970 | 30 | 3,000 |
| Absent | 11,000 | 539,000 | 550,000 |
| Total | 13,970 | 539,030 | 553,000 |
- Find the sensitivity of the test.
Answer
Sensitivity = \(P(Test+ | D+) = \frac{2970}{3000} = 0.99\)
- Find the specificity of the test.
Answer
Specificity = \(P(Test- | D-) = \frac{539000}{550000} = 0.98\)
- Find the false positive rate.
Answer
\(P(Test+ | D-) = 1 - P(Test- | D-) = 1 - 0.98 = 0.02\)
- Find the positive predictive value. \(P(D^+|T^+)\)
Answer
Using Bayes’ rule, we find the positive predictive value: \(P(D+|Test+) = \frac{P(Test+ | D+)P(D+)}{P(Test+ | D+)P(D+) + P(Test+ | D-)P(D-)} = \frac{(0.99)(0.001)}{(0.99)(0.001) + (0.02)(0.999)} = 0.04721\)
- Find the negative predictive value. \(P(D^-|T^-)\)
Answer
Negative predictive value = \(P(D-|Test-) = \frac{P(Test-| D-)P(D-)}{P(Test- | D-)P(D-) + P(Test- | D+)P(D+)} = \frac{(0.98)(1-0.001)}{(0.98)(1-0.001) + (1-0.99)(0.001)} \approx 0.99999\)
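The table-based and Bayes' rule calculations above can be sketched in R as follows. Note that the predictive values use the stated prevalence of 0.1% rather than the proportion of diseased subjects in the table (3,000 / 553,000), which is why Bayes' rule is needed. Variable names are illustrative.

```r
# Counts from the table
tp <- 2970;  fn <- 30       # disease present: test positive / test negative
fp <- 11000; tn <- 539000   # disease absent:  test positive / test negative

sensitivity <- tp / (tp + fn)    # P(Test+ | D+)
specificity <- tn / (tn + fp)    # P(Test- | D-)
false_pos   <- 1 - specificity   # P(Test+ | D-)

prevalence <- 0.001              # stated population prevalence (0.1%)

# Bayes' rule with the stated prevalence
ppv <- sensitivity * prevalence /
  (sensitivity * prevalence + false_pos * (1 - prevalence))
npv <- specificity * (1 - prevalence) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

round(c(sensitivity = sensitivity, specificity = specificity, ppv = ppv, npv = npv), 5)
```

Changing prevalence in this sketch shows how strongly the positive predictive value depends on how common the disease is, even when sensitivity and specificity stay fixed.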