Lab 8
Objectives
Explore and Visualize the Central Limit Theorem
Practice probability calculations using the Normal distribution
CLT Activity
Recall from class that the central limit theorem states that as the sample size increases toward infinity, the distribution of the sample mean approaches a normal distribution with \(mean = \mu\) and \(SD = \frac{\sigma}{\sqrt{n}}\), regardless of what the underlying distribution looks like (provided it has a finite variance).
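In symbols, the same statement can be sketched as follows, assuming independent draws \(X_1, \dots, X_n\) from a population with mean \(\mu\) and finite standard deviation \(\sigma\). The notation \(\bar{X}_n\) for the sample mean is introduced here for illustration only.

\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
E(\bar{X}_n) = \mu, \qquad
SD(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \text{ as } n \to \infty
\]

The mean and SD formulas hold for any \(n\); only the normal shape is an approximation, and it improves as \(n\) grows.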
We are going to simulate the central limit theorem using the TRG (triglyceride) readings in the nhanes dataset. Let us assume that these data represent the population, meaning that our population is defined as the 3026 women in the study rather than, say, all women in the US. First, let’s look at the distribution.
The Intuition
Remember the distribution above is describing our population. In reality, when we are running an experiment we likely don’t know what the true distribution looks like. Now suppose you tested 10 women from the population (sample size = 10) and found their average triglyceride level. How close would you expect your sample mean to be to the population mean?
We’ll discuss this more next time. For now, if we consider repeating our experiment (randomly sampling 10 women and measuring their triglyceride levels) again, then again, and again, we can start getting an idea of the variability of our sample means. By the laws of probability, we can have an idea of the general variability of our sample mean after only collecting one sample!
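The idea of repeating the experiment many times can be sketched in R. This is a minimal illustration, not the lab's own code: `population` is a hypothetical right-skewed stand-in generated with `rgamma` (it is not the nhanes data), and the seed and the 5000 repetitions are arbitrary choices. It uses `sample`, which is introduced in the next section.

```r
# Stand-in population: 3026 right-skewed values (hypothetical, NOT the nhanes data)
set.seed(8)
population <- rgamma(3026, shape = 3, scale = 40)

# Repeat the experiment 5000 times: draw 10 values, keep the sample mean
sample_means <- replicate(5000, mean(sample(population, 10)))

# Population mean and SD (SD divides by N because this is the whole population)
pop_mean <- mean(population)
pop_sd <- sqrt(mean((population - pop_mean)^2))

# Compare the simulated sample means with what the CLT predicts
c(simulated_mean = mean(sample_means), population_mean = pop_mean)
c(simulated_sd = sd(sample_means), predicted_sd = pop_sd / sqrt(10))
```

If the sketch behaves as expected, the two means should nearly agree and the spread of the sample means should be close to the population SD divided by \(\sqrt{10}\), even though the stand-in population is skewed.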
sample function
We can use the sample function to randomly sample from any set of values in R.
Let’s take a look at a simple example:
names <- c( # put some names in the vector
"maggie",
"joe",
"jill",
"steve",
"Karen"
)
n <- 2 # choose some number you want to sample
sample(names, n) # randomly choose n people from the group (population)
Note for the TA
Feel free to put whatever values you’d like into the names vector. It could be a vector of numbers, places, or other things. However, the intent of using the names of students in each lab is to home in on the idea of randomly drawing from a “population”. It is easy to think of an individual class as a population from which we sample a subset of students.
Copy and paste the code below into your R script
Run block 1 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_10 on the Google spreadsheet.
Run block 2 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_30.
Run block 3 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_300.
Observe with the class as we visualize the distribution of these sample means.
# read in the data
nhanes <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/lipids/lipids.txt')
# block 1
sample10 <- sample(nhanes$TRG, 10)
mean(sample10)
# block 2
sample30 <- sample(nhanes$TRG, 30)
mean(sample30)
# block 3
sample300 <- sample(nhanes$TRG, 300)
mean(sample300)
Demo Results
What do we observe as the size of our sample gets bigger?
Sample size 10
| samp_mean | samp_se | true_mean | true_se |
|---|---|---|---|
| 120.0844 | 17.56 | 116.9451 | 21.48553 |
Sample size 30
| samp_mean | samp_se | true_mean | true_se |
|---|---|---|---|
| 116.5146 | 11.52376 | 116.9451 | 12.40468 |
Sample size 300
| samp_mean | samp_se | true_mean | true_se |
|---|---|---|---|
| 116.6196 | 4.711131 | 116.9451 | 3.922703 |
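The true_se column can be checked against the formula \(SD = \frac{\sigma}{\sqrt{n}}\). The sketch below assumes the three tables correspond, in order, to sample sizes 10, 30, and 300; the variable names are illustrative, and the only input is the true_se value copied from the first table.

```r
# true_se for n = 10, copied from the first results table
true_se_10 <- 21.48553

# Implied population SD, since true_se = sigma / sqrt(n)
sigma <- true_se_10 * sqrt(10)

# Standard errors predicted for the three sample sizes used in the lab
n <- c(10, 30, 300)
predicted_se <- sigma / sqrt(n)
round(predicted_se, 5)
```

The predicted values for n = 30 and n = 300 match the true_se entries in the second and third tables, which is the \(\frac{1}{\sqrt{n}}\) shrinkage the theorem describes: tripling the sample size cuts the standard error by a factor of \(\sqrt{3}\).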
Practice Problems
Use the nhanes dataset from above.
- Find the probability that a randomly selected LDL measurement has a value above 123.
Answer
mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)
# standardize x = 123
z <- (123 - mean_LDL) / sd_LDL
# use pnorm to find area under standard normal curve
1 - pnorm(z)
[1] 0.3254162
- Find the probability that a randomly selected LDL measurement has a value between 118 and 126.
Answer
mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)
# standardize x1= 118 and x2= 126
z1 <- (118 - mean_LDL) / sd_LDL
z2 <- (126 - mean_LDL) / sd_LDL
pnorm(z2) - pnorm(z1)
[1] 0.0812601
- Find the LDL measurement that only 20% of participants exceeded.
Answer
mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)
# find z value such that 20% is higher
z1 <- qnorm(0.2, lower.tail = FALSE)
# alternatively, ask for the 80th percentile directly
z1 <- qnorm(0.8)
# unstandardize (put onto original scale)
z1 * sd_LDL + mean_LDL
[1] 136.9375
- Find the probability that a sample of 50 LDL measurements will have a mean greater than 123.
Answer
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)
# standardize x=123
z <- (123 - mean_LDL) / se_LDL
# looking for probability greater than
1 - pnorm(z)
[1] 0.0006861629
- Find the probability that a sample of 50 LDL measurements will have a mean between 118 and 126.
Answer
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)
# standardize x=118 and x=126
z1 <- (118 - mean_LDL) / se_LDL
z2 <- (126 - mean_LDL) / se_LDL
# find area between these z values
pnorm(z2) - pnorm(z1)
[1] 0.0133539
- What two values contain the middle 95% of sample means of LDL measurements for samples of size 50?
Answer
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)
# find z values that mark the quantiles 0.025 and 0.975
z_lower <- qnorm(0.025)
z_upper <- qnorm(0.975)
# transform z values onto scale of the data
x1 <- z_lower * se_LDL + mean_LDL
x2 <- z_upper * se_LDL + mean_LDL
c(lower_bound = x1, upper_bound = x2) |> round(digits = 3)
lower_bound upper_bound
     96.853     116.715
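The answers above standardize by hand, which mirrors the z-table method from class. As a cross-check, `pnorm` and `qnorm` also accept `mean` and `sd` arguments, so the standardizing step can be left to R. The sketch below uses hypothetical summary values (`mean_x`, `sd_x`) chosen only to resemble an LDL-like variable; they are not computed from the nhanes data.

```r
# Hypothetical summary values (NOT computed from the nhanes data)
mean_x <- 107
sd_x <- 36

# Upper-tail probability: standardizing by hand vs. passing mean and sd to pnorm
z <- (123 - mean_x) / sd_x
by_hand <- 1 - pnorm(z)
direct <- pnorm(123, mean = mean_x, sd = sd_x, lower.tail = FALSE)
all.equal(by_hand, direct)

# Quantile: unstandardizing by hand vs. passing mean and sd to qnorm
q_by_hand <- qnorm(0.8) * sd_x + mean_x
q_direct <- qnorm(0.8, mean = mean_x, sd = sd_x)
all.equal(q_by_hand, q_direct)

# For the mean of a sample of 50, pass the standard error as sd
se_x <- sd_x / sqrt(50)
pnorm(123, mean = mean_x, sd = se_x, lower.tail = FALSE)
```

Both routes give the same answers. The hand-standardized version makes the role of the z-score explicit, while the direct version is shorter and leaves fewer places to make an arithmetic slip.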