BIOS:4120 List of Functions

Modified

April 28, 2026

Please advise your TA if Date Modified is not current with the most recent lecture.

How to Use this Document

This document summarizes all the functions necessary to complete the homeworks and, ultimately, the computational assessment at the end of the semester. You are discouraged from using AI tools to generate your code because they are likely to give you code that you don’t actually need for this class. There are hundreds of functions in R and many ways to accomplish the same task. For simplicity, try to use only the code provided here. Know that every task you will be asked to do can be completed with these functions.

Good luck!

Operators

Assignment ( <- )

scalar_object <- 1
vector_object <- c(1,2,3,4,5)
data_object <- read.delim(...)

This assigns a scalar value, vector, or dataset into a named “object”
Once assigned, you will see the new, named object appear in your environment
names of objects cannot contain dashes (-)
Conventionally, we use underscores if our object name has multiple words
see Tidyverse style guide for more details

( $ )

The dollar sign operator ($) allows us to select variables from a dataset. Some common uses are:

assigning a variable from a dataset to an object for simple referencing
calling specific variables within functions or plots (example below)
creating new variables inside an existing dataset to facilitate analysis

Suppose we have a dataset called dataset with variables: var1 is continuous, var2 is categorical. We can find the mean of the continuous variable in two ways:

continuous_var <- dataset$var1
mean(continuous_var)

## OR ##
mean(dataset$var1)

Logical ( < , > , <= , >= , == , !=)

Logical Operators evaluate the truth of a statement and return a value of TRUE or FALSE.

We can apply logical operators to vectors to do many things, but some useful tasks for this class are:

creating a new variable in a dataset that indicates when an existing variable meets a condition
- Say we have a dataset about cardiovascular health and we want a cutoff for the blood pressure variable, bp, that separates high from low values. The expression dataset$bp > 200 returns TRUE for each patient whose blood pressure is above 200, so it can be assigned to a new variable that indicates which patients have high blood pressure
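The blood pressure example can be sketched as follows. This is a minimal illustration only: the data values and the new variable name high_bp are made up for this example.

```r
# Hypothetical data: five patients with a blood pressure measurement
dataset <- data.frame(bp = c(120, 210, 180, 250, 200))

# The comparison returns one TRUE/FALSE value per row
dataset$bp > 200

# Store the result as a new variable inside the dataset
dataset$high_bp <- dataset$bp > 200

# TRUE counts as 1, so sum() gives the number of patients above the cutoff
sum(dataset$high_bp)
```

Note that a value of exactly 200 is not flagged, because > is a strict comparison; use >= to include the cutoff itself.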

Utility Functions

by

by(dataset$var, dataset$group, mean)

Applies a function to a variable “var” split by levels of “group”
the “group” argument is required; the output is organized by the levels of the grouping variable
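A small sketch of by() in action, assuming a made-up dataset whose column names var and group mirror the template above:

```r
# Hypothetical data: a continuous variable and a two-level grouping variable
dataset <- data.frame(
  var   = c(10, 12, 14, 20, 22, 24),
  group = c("a", "a", "a", "b", "b", "b")
)

# Mean of var computed separately within each level of group
group_means <- by(dataset$var, dataset$group, mean)
group_means
```

Any other summary function from this document, such as median or sd, can be used in place of mean.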

with

with(dataset, mean(variable))

Evaluates an expression in an environment constructed from data
Allows you to specify your dataset only once instead of using the form dataset$variable every time
can be used with any other functions to accomplish a task
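To illustrate, the two forms below should give the same answer. The dataset and its values are hypothetical.

```r
# Hypothetical dataset with one numeric variable
dataset <- data.frame(variable = c(2, 4, 6, 8))

# Using with(): the dataset is named once
mean_with <- with(dataset, mean(variable))

# Using the dollar sign operator
mean_dollar <- mean(dataset$variable)

mean_with
mean_dollar
```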

lm

lm(outcome_var ~ explanatory_var, data = dataset)

constructs a linear model of the form

\[ \overbrace{Y}^{\text{Outcome}} = \underbrace{\alpha + \beta \overbrace{X}^{\text{Explanatory var}}}_{\text{Linear Predictor}} \]

the order of the variables matters: the outcome goes to the left of ~ and the explanatory variable goes to the right
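A minimal sketch of fitting the model, assuming made-up data that lie exactly on the line Y = 1 + 2X. The coef() function is not part of this function list; it is used here only to display the estimated intercept and slope.

```r
# Hypothetical data lying exactly on the line y = 1 + 2x
dataset <- data.frame(
  explanatory_var = c(1, 2, 3, 4, 5),
  outcome_var     = c(3, 5, 7, 9, 11)
)

# Outcome on the left of ~, explanatory variable on the right
fit <- lm(outcome_var ~ explanatory_var, data = dataset)

# Estimated intercept (alpha) and slope (beta)
coef(fit)
```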

choose

choose(n, r)

binomial coefficient given by the formula

\[ \binom{n}{r} = \frac{n!}{r!(n-r)!} \]

A “combination” is a selection of items from a set where order is unimportant (see the Binomial lecture)
n is the total number of items in the set
r is the number of items selected from the set
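As a quick check, choose() can be compared against the factorial formula. The values n = 5 and r = 2 are arbitrary, and factorial() is a base R function that is not otherwise needed for this class.

```r
n <- 5
r <- 2

# Number of ways to choose 2 items from 5 when order does not matter
choose(n, r)   # 10

# The same value computed directly from the factorial formula
factorial(n) / (factorial(r) * factorial(n - r))
```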

sum

sum(x)

sums a numeric variable x

subset

subset(x, condition)

x is the vector or dataset that we want to subset
condition is the logical expression indicating the subset we want from the vector or dataset
- a logical expression means we use symbols like less than(<), greater than(>), is equal to(==), is not equal to(!=) some value
- only the rows that meet the condition will be kept in the subset dataset
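A short sketch of subsetting both a dataset and a vector. The data and the names older and big are hypothetical.

```r
# Hypothetical dataset
dataset <- data.frame(
  id  = 1:6,
  age = c(25, 40, 31, 58, 19, 47)
)

# Keep only the rows where age is at least 40
older <- subset(dataset, age >= 40)
older

# subset() also works on a plain vector
x <- c(3, 8, 1, 9)
big <- subset(x, x > 5)
big
```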

For Distributions

Binomial

dbinom(x, n, prob)

binomial density function
- x = number of successes
- n = number of trials
- prob = probability of success
calculates the probability of a particular outcome given parameters $n$ and $\pi$.
can be supplied a vector of outcomes (for example, x = 0:3), in which case it returns one probability per outcome
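A brief sketch of dbinom() with a single outcome and with a vector of outcomes, assuming an arbitrary scenario of 10 trials with success probability 0.5:

```r
# P(X = 3) for 10 trials with success probability 0.5
dbinom(3, 10, 0.5)

# A vector of outcomes returns one probability per outcome
probs <- dbinom(0:10, 10, 0.5)
probs

# The probabilities of all possible outcomes add up to 1
sum(probs)
```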

pbinom(x, n, prob)

sums the probabilities starting from the lower tail (left), giving the probability of x or fewer successes, $P(X \le x)$
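The relationship between pbinom() and dbinom() can be sketched as follows, again assuming an arbitrary scenario of 10 trials with success probability 0.5:

```r
# P(X <= 3): area in the lower tail
lower <- pbinom(3, 10, 0.5)
lower

# The same value, built by adding up the individual probabilities
sum(dbinom(0:3, 10, 0.5))

# P(X >= 4) is the complement of P(X <= 3)
upper <- 1 - pbinom(3, 10, 0.5)
upper
```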

Normal

pnorm(q)

where q is the “quantile”, or number of standard deviations away from 0
calculates the area to the left of q under a standard normal density curve (mean = 0, sd = 1)
to find the area to the right of q, we can use the complement rule: 1 - pnorm(q)
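For example, with an arbitrary value of q = 1.96:

```r
# Area to the left of 1.96 under the standard normal curve (about 0.975)
left <- pnorm(1.96)
left

# Area to the right, using the complement rule (about 0.025)
right <- 1 - pnorm(1.96)
right
```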

qnorm(p)

finds the quantile/percentile given a probability p
- i.e., to find the 60th percentile, we input p = 0.60 (a proportion between 0 and 1, not 60) to tell R that the probability is 60%, and R returns the associated quantile under the standard normal curve
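A small sketch showing that qnorm() and pnorm() undo each other. The name z60 is made up for this example.

```r
# 60th percentile of the standard normal distribution (about 0.253)
z60 <- qnorm(0.60)
z60

# Plugging the quantile back into pnorm() recovers the probability
pnorm(z60)
```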

Student’s t

The t‑distribution is always centered at zero, but how spread out it is depends on the degrees of freedom. Fewer degrees of freedom mean more variability in our estimate, so the distribution looks wider and has heavier tails.

pt(q, df)

calculates the area to the left of q under the t distribution
where q is the “quantile”, typically a t test statistic
df is the degrees of freedom
- for a paired sample t test, $df = n-1$ where $n$ is the number of pairs
- for independent sample t test, and assuming equal variances in both groups, $df = n_1 + n_2 - 2$ where $n_1$ is the size of group 1 and $n_2$ is the size of group 2

qt(p, df)

finds the quantile/percentile given a probability p
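A hedged sketch of how pt() and qt() might be used together, assuming 9 degrees of freedom (for example, 10 pairs) and a made-up test statistic of 2.5. The two-sided p-value calculation assumes a positive test statistic.

```r
# Critical value that leaves 2.5% in the upper tail with 9 df (about 2.262)
t_crit <- qt(0.975, df = 9)
t_crit

# pt() undoes qt(): the area to the left of the critical value is 0.975
pt(t_crit, df = 9)

# Two-sided p-value for a hypothetical (positive) test statistic of 2.5
p_two_sided <- 2 * (1 - pt(2.5, df = 9))
p_two_sided
```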

power.t.test(
  n = NULL, 
  delta = NULL, 
  sd = 1, 
  power = NULL, 
  type = "paired"
)

calculates either power or sample size
- When using the power.t.test() function, the parameter you do not specify will be the parameter the function will calculate. But you must specify 3 of these 4 parameters: delta, sd, n, power.
delta: expected difference in means
sd: expected standard deviation of the data; usually comes from other studies
n: sample size (per group for a two sample test); only specify if you’re solving for power
power: only specify if you’re solving for n
type: possible values are "two.sample", "one.sample", and "paired"; adjust this argument to fit your data appropriately
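The two uses can be sketched as follows. The values of delta, sd, n, and power are invented for illustration, and the result names res_n and res_p are hypothetical.

```r
# Solve for sample size: leave n unspecified
res_n <- power.t.test(delta = 5, sd = 10, power = 0.80, type = "two.sample")
res_n$n        # required sample size per group (round up)

# Solve for power: leave power unspecified
res_p <- power.t.test(n = 64, delta = 5, sd = 10, type = "two.sample")
res_p$power
```

Because a fraction of a subject cannot be enrolled, the computed n is normally rounded up to the next whole number.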

$\chi^2$ (Chi-squared)

pchisq(q, df = 1)

calculates the area to the left of q, a $\chi^2$ test statistic; the p-value for a $\chi^2$ test is the area to the right, 1 - pchisq(q, df = 1)
for our class, we will only use df=1 for $2\times 2$ tables
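As an illustration with a made-up test statistic of 3.84:

```r
# Area to the left of the test statistic (about 0.95)
left <- pchisq(3.84, df = 1)
left

# p-value: area to the right of the test statistic (about 0.05)
p_value <- 1 - pchisq(3.84, df = 1)
p_value
```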

Statistical Tests

aov

aov_model <- aov(y ~ x, data = yourdata)

y is the continuous response variable
x is the categorical predictor variable
this function performs the classic Analysis of Variance

Multiple Comparisons

TukeyHSD(aov_model)

This function performs pairwise comparisons of the group means and returns the estimated difference, a confidence interval for each comparison, and an adjusted p-value for that test. This addresses the concern of inflating Type I error rate that occurs with multiple testing.
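A sketch of the full ANOVA workflow, using R's built-in PlantGrowth dataset (plant weight for one control and two treatment groups) as a stand-in for your own data. The choice of dataset is an assumption made for this example only.

```r
# Fit the ANOVA: continuous response ~ categorical predictor
aov_model <- aov(weight ~ group, data = PlantGrowth)

# ANOVA table with the F statistic and overall p-value
summary(aov_model)

# Pairwise comparisons with adjusted p-values
tukey <- TukeyHSD(aov_model)
tukey
```

With three groups there are three pairwise comparisons, each reported with its difference (diff), confidence limits (lwr, upr), and adjusted p-value (p adj).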

binom.test

binom.test(x, n)
binom.test(x, n)$conf.int
binom.test(x, n)$p.value

calculates the test statistic and p-value for a test of whether a proportion differs from 0.5 (the default null value, which can be changed with the p argument)
appending $conf.int at the end will return only the confidence interval
appending $p.value will return only the p-value
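For instance, assuming 7 successes in 10 trials (numbers chosen only for illustration):

```r
# Test whether the true proportion differs from 0.5
result <- binom.test(7, 10)
result

# Pull out individual pieces of the result
result$p.value
result$conf.int

# Test against a different null value using the p argument
binom.test(7, 10, p = 0.3)$p.value
```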

chisq.test

chisq.test(
  x, 
  correct = FALSE
)

x is a table or matrix of counts
correct = FALSE turns off the default continuity correction, which is beyond the scope of this course
See xtabs section for how to create a table from a dataset
If you need to manually construct a $2 \times 2$ table before you run the test, use the following and replace the NA’s:

tbl <- rbind(
  c(NA, NA),
  c(NA, NA)
)

chisq.test(tbl, correct = FALSE)
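A filled-in version of the template, with made-up counts, might look like this:

```r
# Hypothetical 2 x 2 table of counts
tbl <- rbind(
  c(20, 30),
  c(30, 20)
)

res <- chisq.test(tbl, correct = FALSE)
res

# The reported p-value is the upper tail area of the chi-squared distribution
res$statistic
1 - pchisq(res$statistic, df = 1)
```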

fisher.test

fisher.test(x)

x is a $2 \times 2$ contingency table
See xtabs section for how to create a table from a dataset
If you need to manually construct a $2 \times 2$ table before you run the test, build it with rbind() exactly as shown in the chisq.test section, then pass it to fisher.test()
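A minimal sketch with a small table of made-up counts, the kind of situation where Fisher's exact test is typically used:

```r
# Hypothetical 2 x 2 table with small counts
tbl <- rbind(
  c(8, 2),
  c(1, 5)
)

res <- fisher.test(tbl)
res

# Pull out only the p-value
res$p.value
```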

t.test

For continuous data, either one sample, paired data, or two groups.

t.test(x, y, 
       mu = 0, 
       paired = FALSE, 
       var.equal = FALSE
       )

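# Alternatively, the formula syntax (variables are looked up in `data`)
# paired is left out here: newer versions of R may reject it alongside a formula
t.test(continuous_variable ~ grouping_variable, 
       data = dataset,
       mu = 0, 
       var.equal = FALSE
       )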

x is a vector of continuous data values
y is an optional vector of continuous data values. This means that you do not have to supply it to the function for it to run
paired = FALSE is the default, change this to TRUE if your data is paired
mu is a number indicating the hypothesized value of the mean, $\mu_0$ (or of the difference in means if we are performing a two sample test)
var.equal indicates whether we assume the variances (and standard deviations) of the two groups to be equal
- Student’s two sample t-test assumes the standard deviations are equal
- Welch’s two sample t-test does not make that assumption
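The variants above can be sketched with R's built-in sleep dataset (extra hours of sleep for 10 subjects under two drugs). Using this dataset, and treating it as either independent or paired, is an assumption made purely to show the syntax.

```r
# Split the outcome by group using subset()
group1 <- subset(sleep$extra, sleep$group == 1)
group2 <- subset(sleep$extra, sleep$group == 2)

# One sample test of group1 against a hypothesized mean of 0
one_sample <- t.test(group1, mu = 0)

# Welch's two sample test (the default, var.equal = FALSE)
welch <- t.test(group1, group2)

# Student's two sample test (equal variances assumed)
student <- t.test(group1, group2, var.equal = TRUE)

# Paired test (each subject measured under both drugs)
paired <- t.test(group1, group2, paired = TRUE)

# Degrees of freedom differ by test
student$parameter   # n1 + n2 - 2
paired$parameter    # number of pairs - 1
```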

wilcox.test

wilcox.test(
  x, y = NULL, paired = FALSE, exact = TRUE
)

x (and optionally, y) are the data vectors to input. You may need to use $ to select the columns of interest
this function is similar to t.test in that we can use it for similar types of data and scientific questions and the syntax for our purposes is nearly identical
Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.
also accepts the formula notation quantitative_response ~ categorical_grouping_variable in place of x and y
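A short sketch using two made-up samples with no tied values, so that the exact test applies cleanly:

```r
# Hypothetical data: every value in x is smaller than every value in y
x <- c(1.1, 2.3, 3.8, 4.2, 5.9)
y <- c(6.4, 7.7, 8.1, 9.5, 10.2)

# Two sample Wilcoxon (Mann-Whitney) test
res <- wilcox.test(x, y)
res

res$statistic
res$p.value
```

Since the two samples do not overlap at all, the test statistic W is 0 and the p-value is as small as it can be for samples of this size.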

Summary Statistics

For Categorical Variables

xtabs

(Descriptive Statistics Slide 7)

xtabs(~ x + y, data = df)

# three-way table
xtabs(~ x + y + z, data = df)

x and y correspond to variables of a dataset; the data argument tells xtabs which dataset to look in, so the variable names can be written directly
~ x + y creates a two-way contingency table of counts (a $2 \times 2$ table when both variables have two levels)
~ x + y + z creates a three-way contingency table for three categorical variables

proportions

proportions(table)

Converts a table of counts into a table of proportions
Can be used in conjunction with xtabs
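A sketch of xtabs and proportions used together, assuming a tiny made-up dataset with two categorical variables:

```r
# Hypothetical dataset with two categorical variables
df <- data.frame(
  x = c("yes", "yes", "no", "no", "yes", "no"),
  y = c("a", "b", "a", "a", "b", "b")
)

# Two-way table of counts
tbl <- xtabs(~ x + y, data = df)
tbl

# Convert the counts to proportions of the overall total
proportions(tbl)
```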

For Continuous variables

median

median(x)

Calculates the middle value of a numeric variable

mean

mean(x)

Calculates the arithmetic average of a numeric variable

sd

sd(x)

Calculates the sample standard deviation of a numeric variable

quantile

quantile(x, probs = c(q1, q2, q3, ...))

Produces sample quantiles corresponding to given probabilities
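For example, the quartiles of a small made-up vector:

```r
# Hypothetical data: the integers 1 through 9
x <- 1:9

# 25th, 50th, and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# The 50th percentile matches the median
median(x)
```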

cor

cor(x, y)

Computes the correlation coefficient between two numeric variables
See Correlation Slide 13 for an example

summary

summary(dataset$variable)

gives the five-number summary (minimum, first quartile, median, third quartile, maximum) plus the mean of a continuous variable

min

min(dataset$variable)

gives minimum of continuous variable

max

max(dataset$variable)

gives maximum of continuous variable

sort

sort(dataset$variable)

sorts variable in ascending order by default
can be continuous or categorical

ggplot2

ggplot2 Structure

library(ggplot2) # always load the library
# Template: keep only the geom and facet lines that fit your plot
ggplot(data, aes(x, y)) +
  geom_bar() +        # bar chart
  geom_histogram() +  # histogram
  geom_point() +      # scatterplot
  geom_boxplot() +    # box and whisker plot
  facet_wrap(~var) +  # split plot into a multi-panel layout by a variable
  facet_grid(~var)    # similar to facet_wrap, but arranges panels in a strict grid