Lab 8

Published

March 10, 2026

Objectives

Explore and Visualize the Central Limit Theorem
Practice probability calculations using the Normal distribution

CLT Activity

Recall from class that the central limit theorem states that as the sample size increases toward infinity, the distribution of the sample means approaches a normal distribution with \(mean = \mu\) and \(SD = \frac{\sigma}{\sqrt{n}}\), regardless of what the underlying distribution looks like (provided it has a finite variance).

We are going to simulate the central limit theorem using the TRG readings of the nhanes dataset. Let us assume that this data represents the population, meaning that our population is defined as the 3026 women in the study as opposed to perhaps all women in the US. First, let’s look at the distribution.

The Intuition

Remember the distribution above is describing our population. In reality, when we are running an experiment we likely don’t know what the true distribution looks like. Now suppose you tested 10 women from the population (sample size = 10) and found their average triglyceride level. How close would you expect your sample mean to be to the population mean?

We’ll discuss this more next time. For now, if we consider repeating our experiment (randomly sampling 10 women and measuring their triglyceride levels) again, then again, and again, we can start getting an idea of the variability of our sample means. By the laws of probability, we can have an idea of the general variability of our sample mean after only collecting one sample!
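The idea of repeating the experiment many times can be sketched with replicate. This is only an illustration: the population below is a simulated, right-skewed stand-in created with rgamma (it is not the nhanes data), and the seed, the population size of 3000, and the 5000 repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "population" (NOT the nhanes TRG values)
population <- rgamma(3000, shape = 2, scale = 50)

# Repeat the experiment 5000 times:
# draw 10 values at random and record their mean
sample_means <- replicate(5000, mean(sample(population, 10)))

mean(population)    # population mean
mean(sample_means)  # center of the sample means, close to the population mean

sd(population)      # spread of individual values
sd(sample_means)    # spread of the sample means, noticeably smaller

# Optional: look at the shape of the sample means
hist(sample_means, main = "Means of samples of size 10")
```

The sample means cluster around the population mean, and they vary much less than the individual values do.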

Brief look at sample function

We can use the sample function to randomly sample from any set of values in R.

Let’s take a look at a simple example:

names <- c( # put some names in the vector
  "maggie", 
  "joe", 
  "jill", 
  "steve",
  "Karen"
)

n <- 2 # choose how many people you want to sample
sample(names, n) # randomly choose n people from the group (population)
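A few properties of sample are worth seeing directly. The sketch below assumes the same five names as above, stored in a vector called people (a hypothetical name, chosen here to avoid masking R's built-in names function). The seed value is arbitrary.

```r
people <- c("maggie", "joe", "jill", "steve", "Karen")

# Setting the same seed before each draw reproduces the same random sample
set.seed(42)
first_draw <- sample(people, 2)
set.seed(42)
second_draw <- sample(people, 2)
identical(first_draw, second_draw)  # TRUE

# By default sampling is without replacement, so nobody is drawn twice
shuffled <- sample(people, 5)
anyDuplicated(shuffled)  # 0, meaning no repeats

# With replace = TRUE the same person can be drawn more than once,
# so the sample may be larger than the population
with_repeats <- sample(people, 8, replace = TRUE)
length(with_repeats)  # 8
```

Without replace = TRUE, asking for more than five people from this vector would produce an error.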

Note for the TA

Feel free to input whatever values you’d like into the names vector. It could be a vector of numbers, places, or other things. However, the intent of using the names of students in each lab is to home in on the idea of randomly drawing from a “population”. It is easy to think of an individual class as a population from which we sample a subset of students.

Instructions

Copy and paste the code below into your R script.
Run block 1 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_10 on the Google spreadsheet.
Run block 2 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_30.
Run block 3 three times and record the mean of each run.
Walk up to the TA computer and input your three means into the column titled samp_means_300.
Observe with the class as we visualize the distribution of these sample means.

# read in the data
nhanes <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/lipids/lipids.txt')

# block 1
sample10 <- sample(nhanes$TRG, 10)
mean(sample10) 

# block 2
sample30 <- sample(nhanes$TRG, 30)
mean(sample30)

# block 3
sample300 <- sample(nhanes$TRG, 300)
mean(sample300)

Demo Results

What do we observe as the size of our sample gets bigger?

Sample size 10:

samp_mean	samp_se	true_mean	true_se
120.0844	17.56	116.9451	21.48553

Sample size 30:

samp_mean	samp_se	true_mean	true_se
116.5146	11.52376	116.9451	12.40468

Sample size 300:

samp_mean	samp_se	true_mean	true_se
116.6196	4.711131	116.9451	3.922703
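A summary like the one above could be produced roughly as follows. This is a sketch under assumptions, not the code used for the class demo: summarise_clt is a hypothetical helper, the population is a simulated stand-in rather than the nhanes TRG values, and the sample means are simulated instead of read from the class spreadsheet. Here true_se is taken to be the population SD divided by the square root of n.

```r
set.seed(2026)  # arbitrary seed for reproducibility

# Hypothetical right-skewed population (NOT the nhanes TRG values)
population <- rgamma(3000, shape = 2, scale = 50)

# Hypothetical helper: simulate `reps` sample means of size n and
# compare their mean and SD with the values the CLT predicts
summarise_clt <- function(pop, n, reps = 2000) {
  means <- replicate(reps, mean(sample(pop, n)))
  data.frame(
    n         = n,
    samp_mean = mean(means),        # average of the sample means
    samp_se   = sd(means),          # observed spread of the sample means
    true_mean = mean(pop),          # population mean
    true_se   = sd(pop) / sqrt(n)   # spread predicted by sigma / sqrt(n)
  )
}

results <- do.call(
  rbind,
  lapply(c(10, 30, 300), summarise_clt, pop = population)
)
results
```

In this sketch samp_se shrinks as n grows and stays close to true_se. Because sample draws without replacement from a finite population, samp_se tends to fall slightly below true_se at n = 300.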

Practice Problems

Use the nhanes dataset from above.

Find the probability that a randomly selected LDL measurement has a value above 123.

Answer

mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)

# standardize x = 123
z <- (123 - mean_LDL) / sd_LDL

# use pnorm to find area under standard normal curve
1 - pnorm(z)

[1] 0.3254162
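Standardizing by hand is useful for seeing the mechanics, but pnorm can also take the mean and standard deviation directly. The sketch below uses made-up values for mu and sigma (they are not the values computed from nhanes) to check that the two routes agree.

```r
# Hypothetical mean and SD, for illustration only
mu <- 100
sigma <- 30
x <- 123

# Route 1: standardize, then use the standard normal curve
z <- (x - mu) / sigma
p_standardized <- 1 - pnorm(z)

# Route 2: give pnorm the mean and SD, and ask for the upper tail
p_direct <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)

all.equal(p_standardized, p_direct)  # TRUE
```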

Find the probability that a randomly selected LDL measurement has a value between 118 and 126.

Answer

mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)

# standardize x1= 118 and x2= 126
z1 <- (118 - mean_LDL) / sd_LDL
z2 <- (126 - mean_LDL) / sd_LDL

pnorm(z2) - pnorm(z1)

[1] 0.0812601

Find the LDL measurement that only 20% of participants had higher than.

Answer

mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)

# find z value such that 20% of the area is above it
z1 <- qnorm(.2, lower.tail = FALSE)

# alternatively, by symmetry of the standard normal curve
z1 <- abs(qnorm(.2))

# unstandardize (put onto original scale)
z1 * sd_LDL + mean_LDL

[1] 136.9375
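As with pnorm, qnorm accepts the mean and standard deviation directly, which skips the step of unstandardizing. The values of mu and sigma below are made up for illustration and are not the nhanes values. The last line plugs the cutoff back into pnorm as a sanity check.

```r
# Hypothetical mean and SD, for illustration only
mu <- 100
sigma <- 30

# Route 1: find the z value, then put it back on the original scale
cutoff_from_z <- qnorm(0.2, lower.tail = FALSE) * sigma + mu

# Route 2: ask for the 80th percentile on the original scale
cutoff_direct <- qnorm(0.8, mean = mu, sd = sigma)

all.equal(cutoff_from_z, cutoff_direct)  # TRUE

# Sanity check: the area above the cutoff should be 0.2
upper_tail <- pnorm(cutoff_direct, mean = mu, sd = sigma, lower.tail = FALSE)
upper_tail
```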

Find the probability that a sample of 50 LDL measurements will have a mean greater than 123.

Answer

mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

# standardize x=123
z <- (123 - mean_LDL) / se_LDL

# looking for probability greater than
1 - pnorm(z)

[1] 0.0006861629
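The probability for a sample mean is far smaller than the probability for a single measurement because the standard error shrinks as n grows. The sketch below illustrates this with made-up values of mu and sigma (not the nhanes values), using the fact that pnorm is vectorized over its sd argument.

```r
# Hypothetical mean and SD, for illustration only
mu <- 100
sigma <- 30
x <- 123

n <- c(1, 10, 50)          # n = 1 corresponds to a single measurement
se <- sigma / sqrt(n)      # standard error for each sample size

# Probability that the mean of n measurements exceeds x
p_above <- pnorm(x, mean = mu, sd = se, lower.tail = FALSE)

data.frame(n = n, se = se, p_above = p_above)
```

With the mean below x, the probability of exceeding x drops quickly as the sample size increases.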

Find the probability that a sample of 50 LDL measurements will have a mean between 118 and 126.

Answer

mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

# standardize x=118 and x=126
z1 <- (118 - mean_LDL) / se_LDL
z2 <- (126 - mean_LDL) / se_LDL

# find area between these z values
pnorm(z2) - pnorm(z1)

[1] 0.0133539

What two values contain the middle 95% of sample means of LDL measurements for samples of size 50?

Answer

mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

# find z values that mark the quantiles 0.025 and 0.975
z_lower <- qnorm(0.025)
z_upper <- qnorm(0.975)

# transform z values onto scale of the data
x1 <- z_lower * se_LDL + mean_LDL
x2 <- z_upper * se_LDL + mean_LDL

c(lower_bound = x1, upper_bound = x2) |> round(digits = 3)

lower_bound upper_bound 
     96.853     116.715
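One way to check an interval like this is by simulation: draw many samples of size 50 and count how often the sample mean lands inside the bounds. The sketch below assumes a normally distributed population with made-up values of mu and sigma (not the nhanes values). The seed and the number of repetitions are arbitrary.

```r
set.seed(8)  # arbitrary seed for reproducibility

# Hypothetical population parameters, for illustration only
mu <- 100
sigma <- 30
n <- 50
se <- sigma / sqrt(n)

# Bounds for the middle 95% of sample means, on the original scale
bounds <- qnorm(c(0.025, 0.975), mean = mu, sd = se)

# Simulate 10000 sample means from samples of size n
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Proportion of simulated sample means that fall inside the bounds
coverage <- mean(sample_means > bounds[1] & sample_means < bounds[2])
coverage
```

The proportion should come out close to 0.95, varying slightly from run to run if the seed is changed.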

Handwritten solutions