BIOS:4120 List of Functions

Modified

April 28, 2026

Please advise your TA if Date Modified is not current with the most recent lecture.

This document is intended to summarize all the functions necessary to complete the homeworks and ultimately the computational assessment at the end of the semester. You are discouraged from using AI tools to generate your code because it is likely to give you code that you don’t actually need for this class. There are hundreds of functions in R and many ways to accomplish the same task. For simplicity, try to use only the code provided here. Know that it is possible to complete any task that will be asked of you.

Good luck!

Operators

scalar_object <- 1
vector_object <- c(1,2,3,4,5)
data_object <- read.delim(...)
  • This assigns a scalar value, vector, or dataset into a named “object”
  • Once assigned, you will see the new, named object appear in your environment
  • names of objects cannot contain dashes (-)
  • Conventionally, we use underscores if our object name has multiple words
  • see Tidyverse style guide for more details

The dollar sign operator ($) allows us to select variables from a dataset. Some common uses are:

  • assigning a variable from a dataset to an object for simple referencing
  • calling specific variables within functions or plots (example below)
  • creating new variables inside an existing dataset to facilitate analysis

Suppose we have a dataset called dataset with variables: var1 is continuous, var2 is categorical. We can find the mean of the continuous variable in two ways:

continuous_var <- dataset$var1
mean(continuous_var)

## OR ##
mean(dataset$var1)

Logical Operators evaluate the truth of a statement and return a value of TRUE or FALSE.

We can apply logical operators to vectors to do many things, but some useful tasks for this class are:

  • creating a new variable in a dataset that indicates when an existing variable meets a condition
    • Say we have a dataset about cardiovascular health and we want a cutoff on the blood pressure variable, bp, that separates high from low values. We can use an expression such as dataset$bp > 200 inside a statement that creates a new variable indicating which patients have high blood pressure
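
As a minimal sketch of the idea above, the following builds a small made-up dataset and stores the result of the comparison as a new variable. The data values and the column name high_bp are hypothetical and only for illustration.

```r
# Hypothetical example data; the values are invented for illustration
dataset <- data.frame(bp = c(120, 210, 185, 240))

# The comparison returns one TRUE/FALSE value per row
dataset$high_bp <- dataset$bp > 200

dataset$high_bp
sum(dataset$high_bp)  # TRUE counts as 1, so this counts the patients above the cutoff
```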

Utility Functions

by(dataset$var, dataset$group, mean) 
  • Applies a function to a variable “var” split by levels of “group”

  • the output is organized by the levels of the grouping variable “group”; to summarize without grouping, call the function directly, e.g. mean(dataset$var)

with(dataset, mean(variable)) 
  • Evaluates an expression in an environment constructed from data

  • Allows you to specify your dataset only once instead of using the form dataset$variable every time

  • can be used with any other functions to accomplish a task
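
A short sketch of by() and with() on a made-up dataset may help here. The data values are hypothetical; the variable names var and group simply mirror the templates above.

```r
# Hypothetical example data; the values are invented for illustration
dataset <- data.frame(
  var   = c(10, 12, 20, 22),
  group = c("a", "a", "b", "b")
)

# Mean of var within each level of group
by(dataset$var, dataset$group, mean)

# Overall mean of var, naming the dataset only once
with(dataset, mean(var))
```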

lm(outcome_var ~ explanatory_var, data = dataset)
  • constructs a linear model of the form

\[ \overbrace{Y}^{\text{Outcome}} = \underbrace{\alpha + \beta \overbrace{X}^{\text{Explanatory var}}}_{\text{Linear Predictor}} \]

  • the order of the variables matters!
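
To illustrate why the order matters, here is a sketch using mtcars, a dataset that ships with R. Choosing mpg as the outcome and wt as the explanatory variable is an assumption made only for this example.

```r
# mpg is treated as the outcome, wt as the explanatory variable
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)          # estimated intercept (alpha) and slope (beta)

# Swapping the order fits a different model, with wt as the outcome
fit_swapped <- lm(wt ~ mpg, data = mtcars)
coef(fit_swapped)
```

The two sets of coefficients differ because each model predicts a different variable.
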
choose(n, r)
  • binomial coefficient given by the formula

\[ \frac{n!}{r!(n-r)!} \]

  • A “combination” is a selection of items from a set where order is unimportant (see the Binomial lecture)

  • n is the total number of items in the set

  • r is the number of items selected from the set
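
As a quick check, choose() can be compared against the factorial formula directly. The values n = 5 and r = 2 are arbitrary illustration values.

```r
n <- 5
r <- 2

choose(n, r)                                      # 10
factorial(n) / (factorial(r) * factorial(n - r))  # same value, computed from the formula
```
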

sum(x)
  • sums a numeric variable x
subset(x, condition)
  • x is the vector or dataset that we want to subset
  • condition is the logical expression indicating the subset we want from the vector or dataset
    • a logical expression means we use symbols like less than (<), greater than (>), is equal to (==), is not equal to (!=) some value
    • only the rows that meet the condition will be kept in the subset dataset
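
A small sketch of subset() on both a dataset and a vector follows. The data and the cutoff of 45 are hypothetical.

```r
# Hypothetical example data; the values are invented for illustration
dataset <- data.frame(
  id  = 1:5,
  age = c(34, 51, 47, 62, 29)
)

# Keep only the rows where age is greater than 45
older <- subset(dataset, age > 45)
older

# subset() also works on a single vector
subset(dataset$age, dataset$age != 47)
```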

For Distributions

dbinom(x, n, prob) 
  • binomial density function

    • x = number of successes

    • n = number of trials

    • prob = probability of success

  • calculates the probability of a particular outcome given parameters \(n\) and \(\pi\).

  • can be supplied a vector of outcomes (for example, x = 0:3) to return the probability of each outcome

pbinom(x, n, prob)
  • cumulative distribution function: sums the probabilities from the lower (left) tail up to and including x, giving \(P(X \le x)\)
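
The relationship between dbinom() and pbinom() can be sketched as follows. The values n = 10 and prob = 0.5 are arbitrary illustration values.

```r
n    <- 10
prob <- 0.5

dbinom(3, n, prob)         # P(X = 3)
dbinom(0:3, n, prob)       # P(X = 0), P(X = 1), P(X = 2), P(X = 3)

pbinom(3, n, prob)         # P(X <= 3)
sum(dbinom(0:3, n, prob))  # same value, adding the individual probabilities

1 - pbinom(3, n, prob)     # P(X > 3), by the complement rule
```
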
pnorm(q)
  • where q is the “quantile”, or number of standard deviations away from 0

  • calculates the area to the left of q under a standard normal density curve (mean = 0, sd = 1)

  • to find the area to the right of q, we can use the complement rule 1-pnorm(q)

qnorm(p)
  • finds the quantile/percentile given a probability p

    • i.e., to find the 60th percentile, we input p = 0.60 (a probability between 0 and 1, not 60) and R returns the value on the standard normal curve that has 60% of the area to its left
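
pnorm() and qnorm() undo each other, which this short sketch illustrates. The inputs 1.96 and 0.60 are arbitrary illustration values.

```r
pnorm(1.96)         # area to the left of 1.96, about 0.975
1 - pnorm(1.96)     # area to the right of 1.96, about 0.025

qnorm(0.60)         # 60th percentile of the standard normal, about 0.253
pnorm(qnorm(0.60))  # back to the probability we started with
```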

The t‑distribution is always centered at zero, but how spread out it is depends on the degrees of freedom. Fewer degrees of freedom mean more variability in our estimate, so the distribution looks wider and has heavier tails.

pt(q, df)
  • calculates the area to the left of q under the t distribution

  • where q is the “quantile”, or number of standard deviations away from 0

  • df is the degrees of freedom

    • for a paired sample t test, \(df = n-1\) where \(n\) is the number of pairs

    • for independent sample t test, and assuming equal variances in both groups, \(df = n_1 + n_2 - 2\) where \(n_1\) is the size of group 1 and \(n_2\) is the size of group 2

qt(p, df)
  • finds the quantile/percentile given a probability p
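
As a sketch of how pt() and qt() are typically combined, the following uses a hypothetical test statistic of 2.1 from a paired design with 15 pairs. Both numbers are assumptions for illustration.

```r
t_stat <- 2.1     # hypothetical test statistic
df     <- 15 - 1  # paired design with 15 pairs

pt(t_stat, df)                  # area to the left of the statistic
2 * (1 - pt(abs(t_stat), df))   # two-sided p-value

qt(0.975, df)                   # critical value that leaves 2.5% in the upper tail
```
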
power.t.test(
  n = NULL, 
  delta = NULL, 
  sd = 1, 
  power = NULL, 
  type = "paired"
)
  • calculates either power or sample size

    • When using the power.t.test() function, the parameter you do not specify will be the parameter the function will calculate. But you must specify 3 of these 4 parameters: delta, sd, n, power.
  • delta: expected difference in means

  • sd: expected standard deviation of the data; usually comes from other studies

  • n: sample size (per group for a two-sample test, number of pairs for a paired test); only specify if you’re solving for power

  • power: only specify if you’re solving for n

  • type: possible values are "two.sample", "one.sample", and "paired"; adjust this argument to fit your data appropriately
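
The two typical uses can be sketched as below. The planning values (delta = 5, sd = 10, power = 0.80, n = 40) are hypothetical.

```r
# Solve for n: leave n unspecified and supply delta, sd, and power
res_n <- power.t.test(delta = 5, sd = 10, power = 0.80, type = "paired")
res_n$n      # required number of pairs; round up to the next whole number

# Solve for power: leave power unspecified and supply n, delta, and sd
res_power <- power.t.test(n = 40, delta = 5, sd = 10, type = "paired")
res_power$power
```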

pchisq(q, df = 1)
  • calculates the area to the left of q, a \(\chi^2\) test statistic

  • for our class, we will only use df=1 for \(2\times 2\) tables
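
Because pchisq() returns the area to the left, a p-value for a \(\chi^2\) test statistic is usually taken from the right tail with the complement rule. The statistic 3.84 below is a hypothetical value chosen for illustration.

```r
chi_sq <- 3.84  # hypothetical test statistic

pchisq(chi_sq, df = 1)      # area to the left, about 0.95
1 - pchisq(chi_sq, df = 1)  # area to the right (the p-value), about 0.05
```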

Statistical Tests

aov_model <- aov(y ~ x, data = yourdata)
  • y is the continuous response variable
  • x is the categorical predictor variable
  • this function performs the classic Analysis of Variance

Multiple Comparisons

TukeyHSD(aov_model)

This function performs pairwise comparisons of the group means and returns the estimated difference, a confidence interval for each comparison, and an adjusted p-value for that test. This addresses the concern of an inflated Type I error rate that occurs with multiple testing.
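
A sketch of the full workflow is shown here using PlantGrowth, a dataset that ships with R (plant weight under a control and two treatments). Using this dataset is an assumption made only for illustration.

```r
# weight is the continuous response, group is the categorical predictor
aov_model <- aov(weight ~ group, data = PlantGrowth)
summary(aov_model)  # ANOVA table with the overall F test

# Pairwise comparisons with adjusted p-values
tukey <- TukeyHSD(aov_model)
tukey
```
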

binom.test(x, n)
binom.test(x, n)$conf.int
binom.test(x, n)$p.value
  • performs an exact binomial test of whether a proportion differs from a null value and reports the p-value and a confidence interval (the null value is 0.5 by default; change it with the p argument)

  • appending $conf.int at the end will return only the confidence interval

  • appending $p.value will return only the p-value
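
A brief sketch with hypothetical counts (7 successes in 20 trials) shows each form. The null value of 0.3 in the last line is likewise an arbitrary choice.

```r
x <- 7   # hypothetical number of successes
n <- 20  # hypothetical number of trials

binom.test(x, n)                   # full output, null proportion 0.5
binom.test(x, n)$conf.int          # confidence interval only
binom.test(x, n)$p.value           # p-value only
binom.test(x, n, p = 0.3)$p.value  # test against a null proportion of 0.3 instead
```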

chisq.test(
  x, 
  correct = FALSE
)
  • x is a table or matrix of counts

  • correct=FALSE turns off the default continuity correction, which is beyond the scope of this course

  • See xtabs section for how to create a table from a dataset

  • If you need to manually construct a \(2 \times 2\) table before you run the test, use the following and replace the NA’s:

tbl <- rbind(
  c(NA, NA),
  c(NA, NA)
)

chisq.test(tbl, correct = FALSE)
fisher.test(x)
  • x is a \(2 \times 2\) contingency table

  • See xtabs section for how to create a table from a dataset

  • If you need to manually construct a \(2 \times 2\) table before you run the test, build it with rbind() as in the chisq.test() template above, then pass it to fisher.test()
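
A sketch of the manual approach, following the same rbind() pattern as the chisq.test() template. The counts are hypothetical and stand in for the NA placeholders.

```r
# Hypothetical 2 x 2 table of counts (rows = groups, columns = outcomes)
tbl <- rbind(
  c(8, 2),
  c(1, 5)
)

fisher.test(tbl)          # full output
fisher.test(tbl)$p.value  # p-value only
```
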

For continuous data, either one sample, paired data, or two groups.

t.test(x, y, 
       mu = 0, 
       paired = FALSE, 
       var.equal = FALSE
       )

# Alternatively, the formula syntax
t.test(continuous_variable ~ grouping_variable, 
       data = dataset,
       mu = 0, 
       var.equal = FALSE
       )
  • x is a vector of continuous data values

  • y is an optional vector of continuous data values. This means that you do not have to supply it to the function for it to run

  • paired = FALSE is the default, change this to TRUE if your data is paired

  • mu is a number indicating the hypothesized value of the mean, \(\mu_0\) (or the hypothesized difference in means if we are performing a two sample test)

  • var.equal indicates whether we assume the variances (and standard deviations) of the two groups to be equal

    • Student’s two sample t-test assumes the standard deviations are equal
    • Welch’s two sample t-test does not make that assumption
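
The variants can be sketched with sleep, a dataset that ships with R in which the same 10 subjects were measured under two drugs. Using this dataset, and the object names group1 and group2, are assumptions made only for illustration.

```r
# extra = change in hours of sleep; group = which drug was given
group1 <- subset(sleep$extra, sleep$group == 1)
group2 <- subset(sleep$extra, sleep$group == 2)

t.test(group1, mu = 0)                    # one-sample test
t.test(group1, group2, var.equal = TRUE)  # Student's two-sample test
t.test(group1, group2)                    # Welch's two-sample test (default)
t.test(group1, group2, paired = TRUE)     # paired test
```
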
wilcox.test(
  x, y = NULL, paired = FALSE, exact = TRUE
)
  • x (and optionally, y) are the data vectors to input. You may need to use $ to select the columns of interest
  • this function is similar to t.test in that we can use it for similar types of data and scientific questions and the syntax for our purposes is nearly identical
  • Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.
  • also accepts the formula notation quantitative_response ~ categorical_grouping_variable in place of x and y
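
A sketch of both input styles, again using the built-in sleep dataset as an assumed example. Here exact = FALSE is chosen because these data contain tied values.

```r
group1 <- subset(sleep$extra, sleep$group == 1)
group2 <- subset(sleep$extra, sleep$group == 2)

# Two data vectors
res_xy <- wilcox.test(group1, group2, exact = FALSE)
res_xy

# Formula notation gives the same test
res_formula <- wilcox.test(extra ~ group, data = sleep, exact = FALSE)
res_formula$p.value
```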

Summary Statistics

For Categorical Variables

(Descriptive Statistics Slide 7)

xtabs(~ x + y, data = df)

# three-way table
xtabs(~ x + y + z, data = df) 
  • x and y correspond to variables of a dataset; use with() to easily reference variable names from a dataset

  • ~ x + y creates a two-way contingency table of counts (a 2x2 table when x and y each have two levels)

  • ~ x + y + z creates a three-way contingency table for three categorical variables

proportions(table) 
  • Converts a table of counts into a table of proportions

  • Can be used in conjunction with xtabs
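
A sketch combining xtabs() and proportions(), using the built-in mtcars dataset as an assumed example (cyl has three levels and am has two, so the table is 3 x 2).

```r
# Two-way table of counts: cylinders by transmission type
tbl <- xtabs(~ cyl + am, data = mtcars)
tbl

proportions(tbl)              # each cell as a share of the grand total
proportions(tbl, margin = 1)  # each cell as a share of its row total
```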

For Continuous variables

median(x) 
  • Calculates the middle value of a numeric variable
mean(x) 
  • Calculates the arithmetic average of a numeric variable
sd(x)
  • Calculates the sample standard deviation of a numeric variable
quantile(x, probs = c(q1, q2, q3, ...)) 
  • Produces sample quantiles corresponding to given probabilities
cor(x, y) 
  • Computes the correlation coefficient between two numeric variables

  • See Correlation Slide 13 for an example

summary(dataset$variable) 
  • gives the five-number summary (minimum, Q1, median, Q3, maximum) plus the mean of a continuous variable
min(dataset$variable) 
  • gives minimum of continuous variable
max(dataset$variable) 
  • gives maximum of continuous variable
sort(dataset$variable) 
  • sorts variable in ascending order by default

  • can be continuous or categorical
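
The summary functions above can be tried on a small made-up vector. The values of x and y are hypothetical.

```r
# Hypothetical example data; the values are invented for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)
y <- c(1, 3, 2, 6, 8, 9, 12)

mean(x)    # 6
median(x)  # 5
sd(x)
quantile(x, probs = c(0.25, 0.5, 0.75))
cor(x, y)

summary(x)
min(x)     # 2
max(x)     # 11
sort(x, decreasing = TRUE)  # descending order instead of the default
```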

ggplot2

library(ggplot2) # always load the library
# Choose the ONE geom that matches your plot type; facets are optional
ggplot(data, aes(x, y)) +
  geom_bar() +        # bar chart
  geom_histogram() +  # histogram
  geom_point() +      # scatterplot
  geom_boxplot() +    # box and whisker plot
  facet_wrap(~var) +  # split plot into a multi-panel layout by a variable
  facet_grid(~var)    # like facet_wrap, but arranges panels in a strict grid of rows and/or columns
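
In practice each plot uses one geom. The sketch below builds three separate plots from the built-in mtcars dataset. The dataset, the variable choices, and the object names are assumptions made only for illustration.

```r
library(ggplot2)

# Scatterplot, with one panel per number of cylinders
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

# Histogram: only an x variable is needed
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Boxplot of a continuous variable by group; factor() treats cyl as categorical
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

p_scatter  # typing the object name displays the plot
```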