Lab 14

Published

April 28, 2026

Download R code

Objectives

  1. Transformations
    • Log Transformations
    • Rank Transformations
      • Mann-Whitney (Wilcoxon Rank Sum) Test
      • Wilcoxon Signed Rank
      • Spearman’s Rank Correlation
  2. ANOVA
    • Multiple Comparisons (Bonferroni, Tukey HSD)
    • Calculating \(R^2\)
  3. Quiz Review

Computational Material

Let’s review the dataset we examined in our previous lab–a Peruvian study looking at infant mortality as a result of diarrhea-induced dehydration.

library(ggplot2)
theme_set(theme_classic())
diarrhea <- read.delim("https://raw.githubusercontent.com/IowaBiostat/data-sets/main/diarrhea/diarrhea.txt")

Transformations

Log transform

Examine the visualizations below. How does the distribution of stool output change after applying the log transformation?

ggplot(diarrhea, aes(x=Stool, y = Group)) +
  labs(
    title = "Effect of Bismuth Salicylate", 
    x = "Infant Stool Output (ml/kg)"
  )+
  geom_boxplot(fill = c("grey", "snow2"))

ggplot(diarrhea, aes(x=log(Stool), y = Group)) +
  labs(
    title = "Effect of Bismuth Salicylate", 
    x = "Infant Stool Output (ml/kg)\nLog Scale"
  )+
  geom_boxplot(fill = c("grey", "snow2"))

ggplot(diarrhea, aes(x = Stool)) + 
  geom_histogram(fill = "snow2", 
                 color = "black", bins = 20) + 
  facet_wrap(~Group)

ggplot(diarrhea, aes(x = log(Stool))) + 
  geom_histogram(fill = "snow2", 
                 color = "black", bins = 20) + 
  facet_wrap(~Group)

Since the data look relatively normal after the transformation, we can perform a t-test on the transformed observations.

t.test(log(Stool) ~ Group, data = diarrhea)

    Welch Two Sample t-test

data:  log(Stool) by Group
t = 2.8182, df = 164.68, p-value = 0.005421
alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
95 percent confidence interval:
 0.1023326 0.5812590
sample estimates:
  mean in group Control mean in group Treatment 
               5.212370                4.870574 

Notice that the p-value here (\(p= 0.005421\)) is smaller than when we performed the Welch test in lab 12, which was \(p=0.02638\). This occurs because testing the transformed data results in a more powerful test1.

Rank Transform

Mann-Whitney Test

Now let’s apply a non-parametric approach to analyzing this data. Remember that the Mann-Whitney2 test makes no distributional assumption on the data because it only uses the ranks.

wilcox.test(Stool ~ Group, data = diarrhea)

    Wilcoxon rank sum test with continuity correction

data:  Stool by Group
W = 4452, p-value = 0.005573
alternative hypothesis: true location shift is not equal to 0

We interpret these results as follows:

There is strong evidence that the shift in median stool output for infants on the treatment is not equal to zero.

Wilcoxon Signed-Rank test

For one sample continuous data, or when our data is paired, we can use the Wilcoxon Signed-Rank test. Let’s revisit the paired dataset from the Oatbran study.

oatbran <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/oatbran/oatbran.txt')
wilcox.test(oatbran$CornFlakes, oatbran$OatBran, paired=TRUE)

    Wilcoxon signed rank exact test

data:  oatbran$CornFlakes and oatbran$OatBran
V = 93, p-value = 0.008545
alternative hypothesis: true location shift is not equal to 0
Paired t-test
t.test(oatbran$CornFlakes, oatbran$OatBran, paired = TRUE)$p.value
[1] 0.005277683
t.test(oatbran$CornFlakes, oatbran$OatBran, paired = TRUE)$conf.int
[1] 0.1284606 0.5972537
attr(,"conf.level")
[1] 0.95
How does the pvalue compare to the wilcoxon test?

Spearman Correlation

Again using the Oatbran dataset, we see a clear visual correlation pattern in the plot. There are no outliers in this dataset that would lead us to consider a rank transformation, but what if there were? Let’s place an outlier in the oatbran dataset and observe what happens to the Pearson Correlation3.

Code
# create an outlier and attach it to oatbran dataset
outlier <- list(CornFlakes = c(2.5), OatBran = c(6))
oatbran_outlier <- rbind(oatbran, outlier)

# generate plots with and without outlier
ggplot(oatbran, aes(x=CornFlakes, y  = OatBran)) +
  geom_point(size = 3)
ggplot(oatbran_outlier, aes(x = CornFlakes, y = OatBran)) + 
  geom_point(size = 3) +
  geom_point(aes(x=outlier$CornFlakes, y = outlier$OatBran), 
             color = "red", size = 3) +
  geom_text(
    aes(x=outlier$CornFlakes, y = outlier$OatBran - .25), 
    label = "Outlier", 
    color = "red", 
    size = 7
  )

We want to test this correlation for the null, \(H_0:\) True correlation \(=0\). How does the Pearson correlation compare to the Spearman?

pears <- cor.test(oatbran_outlier$CornFlakes, oatbran_outlier$OatBran, method = "pearson")
pears$estimate
      cor 
0.5241722 
pears$p.value
[1] 0.04488176
spear <- cor.test(oatbran_outlier$CornFlakes, oatbran_outlier$OatBran, method = "spearman")
spear$estimate
      rho 
0.6023237 
spear$p.value
[1] 0.01749349

Discuss these results with your instructor.

ANOVA

flicker <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/flicker/flicker.txt')

model <- aov(Flicker ~ Color , data = flicker)
summary(model)
            Df Sum Sq Mean Sq F value Pr(>F)  
Color        2  23.00  11.499   4.802 0.0232 *
Residuals   16  38.31   2.394                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(R^2\)

From lecture we know that,

\[ R^2 = \frac{RSS_0-RSS_1}{RSS_0} = 1 - \frac{RSS_1}{RSS_0} = \frac{\text{SS Group}}{\text{SS Total}} \] Which are all ways we can compute \(R^2\). Let’s use the summary output to calculate \(R^2\).

\[ R^2 = \frac{SS Group}{SS Total} = \quad ... \]
Answer \[ = \frac{23.0}{38.31+23.0} = \frac{23}{61.1} = 0.376 \]

The notation above defines \(RSS_0\) as the residual sum of squares from the null model and \(RSS_1\) as the residual sum of squares from the full model. Calculating \(R^2\) using two models should yield the same as our manual calculation above. We already fit a full model above, let’s fit a null model.

null_model <- aov(Flicker ~ 1, data = flicker)

# Now calculate the residual sums of squares for each model and plug them into the formula written above
RSS_0 <- sum(null_model$residuals^2)
RSS_1 <- sum(model$residuals^2)
(RSS_0 - RSS_1) / RSS_0
[1] 0.3751145

We interpret this value as 37.5% of the variation in individual flicker frequency is explained by the model.

You can also have R calculate \(R^2\) for you by using the lm() function and calling r.squared. This works because ANOVA is a linear model with categorical predictor variables.

summary(lm(Flicker ~ Color , data = flicker))$r.squared
[1] 0.3751145

This value matches what we computed manually above.

Our p-value from the model indicates a significant finding: at least one group mean is different from the others. But which? Let’s plot the data:

Code
ggplot(flicker, aes(Color, Flicker, fill = Color)) +
  scale_fill_manual( # this maps my chosen colors to the variable-level names
    values = c("Blue" = "steelblue3", 
               "Brown" = "brown4", 
               "Green" = "darkgreen")
    )+
  geom_boxplot() +
  guides(fill = "none") # removes my legend, since the labels are on x axis

This boxplot indicates that those with blue eyes have the best flicker detection, while those with brown eyes have the worst; green is somewhere in between. However, we cannot say whether this difference is significant enough to conclude that blue eyes are better at detecting flicker than brown eyes (as well as other comparisons with green eyes) without performing pairwise tests. Anytime we perform multiple tests, we should account for them to maintain statistical validity and academic integrity.

Multiple Comparisons

We will cover these in detail in lecture on Thursday. For now, we will just present the code that you can use to perform these corrections.

Performing a Tukey’s “Honest Significant Difference” is straightforward in R. This function performs pairwise comparisons of the group means and returns the estimated difference, a confidence interval for each comparison, and an adjusted p-value for that test. Using this addresses the concern of inflating Type I error rate.

TukeyHSD(model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Flicker ~ Color, data = flicker)

$Color
                 diff        lwr       upr     p adj
Brown-Blue  -2.579167 -4.7354973 -0.422836 0.0183579
Green-Blue  -1.246667 -3.6643959  1.171063 0.3994319
Green-Brown  1.332500 -0.9437168  3.608717 0.3124225

Interpretation

There is strong evidence that Blue-eyed individuals have a higher critical flicker frequency than brown-eyed individuals. Confidence interval of the difference \(brown - blue\): 95%CI(0.423, 4.735)

Another way is to use Bonferroni adjustment and in R we use the pairwise.t.test() function.

pairwise.t.test(flicker$Flicker, flicker$Color, p.adj = "bonf")

    Pairwise comparisons using t tests with pooled SD 

data:  flicker$Flicker and flicker$Color 

      Blue  Brown
Brown 0.021 -    
Green 0.606 0.451

P value adjustment method: bonferroni 

Here, R automatically scales your p-value so you can compare it to 0.05 like usual. If you ran the code above with the argument p.adj = “none”, you would have to calculate the Bonferroni-adjusted alpha level (here it would be .05/3 = .017) and compare results to 0.017 as a threshold for rejection.

pairwise.t.test(flicker$Flicker, flicker$Color, p.adj = "none")

    Pairwise comparisons using t tests with pooled SD 

data:  flicker$Flicker and flicker$Color 

      Blue   Brown 
Brown 0.0071 -     
Green 0.2020 0.1504

P value adjustment method: none 

From the p-values we can see flicker detection is significantly better for blue eyes compared to brown eyes. However, there is not sufficient evidence to conclude that flicker significantly differs between brown and green, and green and blue eyes.

Quiz Review

Problem 1

A wildlife study examined 793 groundhogs involved in burrow cave-in incidents. Researchers were interested in whether reinforcing burrows reduced injury risk. Of the 147 groundhogs that used reinforced burrows, 17 suffered injuries. Of the remaining groundhogs that used non-reinforced burrows, 428 did not suffer injuries.

  1. Make a contingency table for the data.

  2. Without running any tests, does there appear to be a benefit to using reinforced burrows? (Hint: calculate the odds ratio.)

  3. Make a 95% CI for the OR found in part b.

  4. What are the expected counts for the contingency table?

  5. Test if there is an association between burrow type and injuries.

Problem 2

A study compared the fire efficiency (measured in flames per gallon of fuel) of Western dragons (sample 1) to Eastern dragons (sample 2). The sample size for Western dragons was 249 with a sample mean of 20.145 and sample standard deviation of 6.415. Eastern dragons had a sample size of 79 with a sample mean of 30.481 and sample standard deviation of 6.108. The pooled standard deviation is 6.343.

  1. If we assume the data are normally distributed, what test should we run? What else might we consider when determining the most appropriate test?

  2. Conduct a t-test comparing the two group means and interpret the results. Assume Student’s t-test is appropriate.

  3. Further analysis shows that two of the Western dragons in the sample produced fewer than 5 flames per gallon of fuel. How might this affect the test results? How might you remedy this issue?

Handwritten solution

Footnotes

  1. Remember, we are not manipulating data. We are simply changing the way that it is measured. Think of it like changing the measurement tool in an experiment↩︎

  2. also called the Wilcoxon Rank Sum test↩︎

  3. this is the typical correlation coefficient that you are used to for two continuous variables↩︎