BIOS:4120 List of Functions
Please advise your TA if Date Modified is not current with the most recent lecture.
This document is intended to summarize all the functions necessary to complete the homeworks and ultimately the computational assessment at the end of the semester. You are discouraged from using AI tools to generate your code because they are likely to give you code that you don’t actually need for this class. There are hundreds of functions in R and many ways to accomplish the same task. For simplicity, try to use only the code provided here. Know that it is possible to complete any task that will be asked of you.
Good luck!
Operators
scalar_object <- 1
vector_object <- c(1,2,3,4,5)
data_object <- read.delim(...)
- The assignment operator (<-) assigns a scalar value, vector, or dataset to a named “object”
- Once assigned, you will see the new, named object appear in your environment
- Names of objects cannot contain dashes (-)
- Conventionally, we use underscores if our object name has multiple words
- See the Tidyverse style guide for more details
The dollar sign operator ($) allows us to select variables from a dataset. Some common uses are:
- assigning a variable from a dataset to an object for simple referencing
- calling specific variables within functions or plots (example below)
- creating new variables inside an existing dataset to facilitate analysis
Suppose we have a dataset called dataset with variables: var1 is continuous, var2 is categorical. We can find the mean of the continuous variable in two ways:
continuous_var <- dataset$var1
mean(continuous_var)
## OR ##
mean(dataset$var1)
Logical Operators evaluate the truth of a statement and return a value of TRUE or FALSE.
We can apply logical operators to vectors to do many things, but some useful tasks for this class are:
- creating a new variable in a dataset that indicates when an existing variable meets a condition
  - Say we have a dataset about cardiovascular health and we want to create a cutoff value for bp (blood pressure) that defines high vs. low values. We can use an expression such as dataset$bp > 200 within a statement that creates a new variable indicating which patients have high blood pressure
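The blood pressure example can be sketched as follows. This is a minimal illustration that assumes a small made-up dataset; the values and the new variable name high_bp are hypothetical.

```r
# Hypothetical data for illustration only
dataset <- data.frame(bp = c(120, 210, 180, 250))

# TRUE where bp exceeds the cutoff of 200, FALSE otherwise
dataset$high_bp <- dataset$bp > 200
dataset$high_bp
```

The new variable is stored inside the dataset, so it can be used later for grouping or subsetting.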
Utility Functions
by(dataset$var, dataset$group, mean)
- Applies a function to a variable “var” split by levels of “group”
- The output is organized by the levels of the grouping variable; to summarize without grouping, call the function directly, as in mean(dataset$var)
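A minimal sketch of by(), assuming a made-up dataset with a numeric column var and a two-level column group (all values are hypothetical):

```r
# Hypothetical data: three observations in each of two groups
dataset <- data.frame(
  var   = c(1, 2, 3, 4, 5, 6),
  group = c("a", "a", "a", "b", "b", "b")
)

# Mean of var computed separately within each level of group
group_means <- by(dataset$var, dataset$group, mean)
group_means
```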
with(dataset, mean(variable))
- Evaluates an expression in an environment constructed from data
- Allows you to specify your dataset only once instead of using the form dataset$variable every time
- can be used with any other functions to accomplish a task
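As a small sketch (the dataset and its var1 column are made up for illustration), with() and the dollar sign form give the same answer:

```r
# Hypothetical data for illustration only
dataset <- data.frame(var1 = c(2, 4, 6, 8))

mean_with   <- with(dataset, mean(var1))  # dataset named once
mean_dollar <- mean(dataset$var1)         # dataset$ form

mean_with == mean_dollar
```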
lm(outcome_var ~ explanatory_var, data = dataset)
- constructs a linear model of the form
\[ \overbrace{Y}^{\text{Outcome}} = \underbrace{\alpha + \beta \overbrace{X}^{\text{Explanatory var}}}_{\text{Linear Predictor}} \]
- the order of the variables matters! The outcome goes to the left of ~ and the explanatory variable goes to the right
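A minimal sketch of fitting a linear model, assuming made-up data constructed so that the outcome is exactly 2 + 3 times the explanatory variable (the data are hypothetical and chosen so the estimates are easy to check):

```r
# Hypothetical data lying exactly on the line Y = 2 + 3X
dataset <- data.frame(explanatory_var = c(1, 2, 3, 4, 5))
dataset$outcome_var <- 2 + 3 * dataset$explanatory_var

# Outcome on the left of ~, explanatory variable on the right
fit <- lm(outcome_var ~ explanatory_var, data = dataset)

# Estimated intercept (alpha) first, then slope (beta)
coef(fit)
```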
choose(n, r)
- binomial coefficient given by the formula
\[ \binom{n}{r} = \frac{n!}{r!(n-r)!} \]
- A “combination” is a selection of items from a set where order is unimportant (see Binomial lecture)
- n is the total number of items in the set
- r is the number of items selected from the set
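As a quick check of the formula, the sketch below compares choose() with the factorial expression for one arbitrary pair of values (n = 5 and r = 2 are chosen only for illustration):

```r
n <- 5
r <- 2

by_function <- choose(n, r)
by_formula  <- factorial(n) / (factorial(r) * factorial(n - r))

by_function  # 10 ways to pick 2 items out of 5
by_formula
```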
sum(x) - sums a numeric variable x
subset(x, condition)
- x is the vector or dataset that we want to subset
- condition is the logical expression indicating the subset we want from the vector or dataset
  - a logical expression means we use symbols like less than (<), greater than (>), is equal to (==), is not equal to (!=) some value
  - only the rows that meet the condition will be kept in the subset dataset
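A minimal sketch of subset() on a dataset and on a vector, assuming a made-up dataset with id and age columns (all values are hypothetical):

```r
# Hypothetical data for illustration only
dataset <- data.frame(id = 1:5, age = c(34, 51, 47, 29, 62))

# Keep only the rows where age is greater than 45
older <- subset(dataset, age > 45)
older

# subset() also works on a single vector
ages_not_29 <- subset(dataset$age, dataset$age != 29)
ages_not_29
```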
For Distributions
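dbinom(x, n, prob)
- binomial density function
- x = number of successes
- n = number of trials (R itself names this argument size, so pass it by position as shown or write size = n)
- prob = probability of success
- calculates the probability of a particular outcome given parameters \(n\) and \(\pi\)
- x can also be a vector of outcomes, in which case one probability is returned for each outcome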
pbinom(x, n, prob) - sums the probabilities starting from the lower tail (left), giving \(P(X \le x)\)
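The relationship between the two binomial functions can be sketched with an arbitrary example of 4 trials and success probability 0.5 (values chosen only for illustration):

```r
# P(X = 2) in 4 trials with probability of success 0.5
single <- dbinom(2, 4, 0.5)

# A vector of outcomes returns one probability per outcome
several <- dbinom(0:2, 4, 0.5)

# P(X <= 2): the lower tail, which equals the sum of the probabilities above
lower_tail <- pbinom(2, 4, 0.5)

single
sum(several)
lower_tail
```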
pnorm(q)
- where q is the “quantile”, or number of standard deviations away from 0
- calculates the area to the left of q under a standard normal density curve (mean = 0, sd = 1)
- to find the area to the right of q, we can use the complement rule 1-pnorm(q)
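A short sketch of the left and right areas, using q = 1.96 as an arbitrary illustrative value:

```r
q <- 1.96

left_area  <- pnorm(q)      # area to the left of q
right_area <- 1 - pnorm(q)  # complement rule: area to the right of q

left_area
right_area
left_area + right_area      # the two areas always add to 1
```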
qnorm(p)
- finds the quantile/percentile given a probability p
  - i.e., to find the 60th percentile, we input p = 0.60 (a proportion between 0 and 1, not 60) to tell R that the probability is 60% and we want to know the associated quantile under the standard normal curve
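As a brief sketch, qnorm() and pnorm() undo each other; the 60th percentile is used here only as an example:

```r
# Quantile with 60% of the standard normal curve to its left
q60 <- qnorm(0.60)
q60

# Feeding the quantile back into pnorm() recovers the probability
pnorm(q60)
```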
The t‑distribution is always centered at zero, but how spread out it is depends on the degrees of freedom. Fewer degrees of freedom mean more variability in our estimate, so the distribution looks wider and has heavier tails.
pt(q, df)
- calculates the area to the left of q under the t distribution
- where q is the “quantile”, or number of standard deviations away from 0
- df is the degrees of freedom
  - for a paired sample t test, \(df = n-1\) where \(n\) is the number of pairs
  - for independent sample t test, and assuming equal variances in both groups, \(df = n_1 + n_2 - 2\) where \(n_1\) is the size of group 1 and \(n_2\) is the size of group 2
qt(p, df) - finds the quantile/percentile given a probability p
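The sketch below illustrates both t distribution functions. The test statistic 2.1 and the degrees of freedom are made-up values; the last lines show the heavier tails with fewer degrees of freedom described above.

```r
# Hypothetical test statistic from a paired test with 10 pairs (df = 9)
t_stat <- 2.1
df <- 9

left_area   <- pt(t_stat, df)                 # area to the left
two_sided_p <- 2 * (1 - pt(abs(t_stat), df))  # area in both tails

# The 97.5th percentile moves toward the normal value as df grows
crit_small_df <- qt(0.975, df = 5)
crit_large_df <- qt(0.975, df = 30)
crit_normal   <- qnorm(0.975)

c(crit_small_df, crit_large_df, crit_normal)
```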
power.t.test(
  n = NULL,
  delta = NULL,
  sd = 1,
  power = NULL,
  type = "paired"
)
- calculates either power or sample size
- When using the power.t.test() function, the parameter you do not specify will be the parameter the function will calculate. But you must specify 3 of these 4 parameters: delta, sd, n, power.
- delta: expected difference in means
- sd: expected standard deviation of the data; usually comes from other studies
- n: sample size; only specify if you’re solving for power
- power: only specify if you’re solving for n
- type: possible values are "two.sample", "one.sample", and "paired"; adjust this argument to fit your data appropriately
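A minimal sketch of both uses of power.t.test(), assuming a paired design with an expected difference of 5 and a standard deviation of 10 (these planning values are hypothetical):

```r
# Solve for n: leave n unspecified, supply delta, sd, and power
size_result <- power.t.test(delta = 5, sd = 10, power = 0.80, type = "paired")
size_result$n
ceiling(size_result$n)  # round up to a whole number of pairs

# Solve for power: leave power unspecified, supply n, delta, and sd
power_result <- power.t.test(n = 34, delta = 5, sd = 10, type = "paired")
power_result$power
```

Printing size_result or power_result without the $ shows the full summary of inputs and the computed value.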
pchisq(q, df = 1)
- calculates the area to the left of q, a \(\chi^2\) test statistic
- for our class, we will only use df=1 for \(2\times 2\) tables
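As a sketch with a made-up test statistic of 3.84, the area to the right follows from the complement rule, in the same way as for pnorm():

```r
# Hypothetical chi-square test statistic from a 2 x 2 table
q <- 3.84

left_area  <- pchisq(q, df = 1)      # area to the left of q
right_area <- 1 - pchisq(q, df = 1)  # area to the right of q

left_area
right_area
```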
Statistical Tests
aov_model <- aov(y ~ x, data = yourdata)
- y is the continuous response variable
- x is the categorical predictor variable
- this function performs the classic Analysis of Variance
Multiple Comparisons
TukeyHSD(aov_model)
- This function performs pairwise comparisons of the group means and returns the estimated difference, a confidence interval for each comparison, and an adjusted p-value for that test. This addresses the concern of inflating Type I error rate that occurs with multiple testing.
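A minimal sketch of the ANOVA and Tukey steps together. It uses PlantGrowth, a dataset built into R with a continuous weight variable and a three-level group variable; it is used here only for illustration and is not part of the course materials.

```r
# One-way ANOVA: continuous response ~ categorical predictor
aov_model <- aov(weight ~ group, data = PlantGrowth)
summary(aov_model)

# Pairwise comparisons of the three group means
tukey_result <- TukeyHSD(aov_model)
tukey_result

# One row per pair of groups: difference, interval limits, adjusted p-value
tukey_result$group
```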
binom.test(x, n)
binom.test(x, n)$conf.int
binom.test(x, n)$p.value
- calculates the test statistic and p-value for a test of whether a proportion differs from 0.5 (the default; the null value can be changed with the p argument)
- appending $conf.int at the end will return only the confidence interval
- appending $p.value will return only the p-value
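A short sketch assuming made-up data of 7 successes in 20 trials; the null value of 0.3 in the second call is also hypothetical:

```r
# Test whether the true proportion differs from 0.5 (the default)
result <- binom.test(7, 20)
result$estimate   # sample proportion, 7/20
result$conf.int   # confidence interval only
result$p.value    # p-value only

# Same data, testing against a null proportion of 0.3 instead
binom.test(7, 20, p = 0.3)$p.value
```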
chisq.test(
  x,
  correct = FALSE
)
- x is a table or matrix of counts
- correct=FALSE turns off the default continuity correction, which is beyond the scope of this course
- See xtabs section for how to create a table from a dataset
- If you need to manually construct a \(2 \times 2\) table before you run the test, use the following and replace the NA’s:
tbl <- rbind(
  c(NA, NA),
  c(NA, NA)
)
chisq.test(tbl, correct = FALSE)
fisher.test(x)
- x is a \(2 \times 2\) contingency table
- See xtabs section for how to create a table from a dataset
- If you need to manually construct a \(2 \times 2\) table before you run the test, build tbl with rbind() in the same way as for chisq.test() and then run fisher.test(tbl)
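A minimal sketch of both tests on a manually constructed table. The four counts are made up for illustration; each c() supplies one row of the table.

```r
# Hypothetical 2 x 2 table of counts
tbl <- rbind(
  c(20, 10),
  c(12, 18)
)

# Chi-square test without the continuity correction
chisq_result <- chisq.test(tbl, correct = FALSE)
chisq_result$statistic
chisq_result$p.value

# Fisher's exact test on the same table
fisher_result <- fisher.test(tbl)
fisher_result$p.value
```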
For continuous data, either one sample, paired data, or two groups.
t.test(x, y,
  mu = 0,
  paired = FALSE,
  var.equal = FALSE
)
# Alternatively, the formula syntax
# (paired is left out here: recent versions of R may reject it in the formula syntax)
t.test(continuous_variable ~ grouping_variable,
  data = dataset,
  mu = 0,
  var.equal = FALSE
)
- x is a vector of continuous data values
- y is an optional vector of continuous data values. This means that you do not have to supply it to the function for it to run
- paired = FALSE is the default; change this to TRUE if your data are paired
- mu is a number indicating the hypothesized value of the mean, \(\mu_0\) (or difference of means if we are performing a two sample test)
- var.equal indicates whether we assume the variances (and standard deviations) are equal
  - Student’s two sample t-test assumes the standard deviations are equal (var.equal = TRUE)
  - Welch’s two sample t-test does not make that assumption (var.equal = FALSE, the default)
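The options above can be sketched with sleep, a dataset built into R that records a continuous variable extra for the same 10 subjects under two conditions (group 1 and group 2). It is used here purely as an illustration.

```r
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# Student's two sample t-test: df = n1 + n2 - 2
student <- t.test(x, y, var.equal = TRUE)
student$parameter

# Welch's two sample t-test (the default): df is adjusted downward
welch <- t.test(x, y)
welch$parameter

# Paired t-test: df = number of pairs - 1
paired_result <- t.test(x, y, paired = TRUE)
paired_result$parameter
paired_result$p.value
```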
wilcox.test(
  x, y = NULL, paired = FALSE, exact = TRUE
)
- x (and optionally, y) are the data vectors to input. You may need to use $ to select the columns of interest
- this function is similar to t.test in that we can use it for similar types of data and scientific questions, and the syntax for our purposes is nearly identical
- Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as the ‘Mann-Whitney’ test
- also accepts the formula notation quantitative_response ~ categorical_grouping_variable in place of x and y
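A minimal sketch of a two-sample Wilcoxon test on made-up data. The values are hypothetical and contain no ties, so the exact p-value can be computed.

```r
# Hypothetical measurements from two independent groups
x <- c(1.1, 2.3, 3.7, 4.2, 5.9)
y <- c(6.4, 7.8, 8.1, 9.5, 10.2)

result <- wilcox.test(x, y, paired = FALSE, exact = TRUE)
result$statistic  # W = 0 because every x value is below every y value
result$p.value
```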
Summary Statistics
For Categorical Variables
(Descriptive Statistics Slide 7)
xtabs(~ x + y, data = df)
# three-way table
xtabs(~ x + y + z, data = df)
- x and y correspond to variables of a dataset; use with() to easily reference variable names from a dataset
- ~ x + y creates a two-way contingency table of counts (a 2x2 table when x and y each have two levels)
- ~ x + y + z creates a three-way contingency table for three categorical variables
proportions(table)
- Converts a table of counts into a table of proportions
- Can be used in conjunction with xtabs
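A short sketch combining the two functions. It uses mtcars, a dataset built into R, where am and vs are two-level categorical variables; the dataset is only an illustration.

```r
# Two-way table of counts
counts <- xtabs(~ am + vs, data = mtcars)
counts

# Each cell as a proportion of the grand total
proportions(counts)

# Proportions within each row (each row adds to 1)
proportions(counts, margin = 1)
```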
For Continuous variables
median(x) - Calculates the middle value of a numeric variable
mean(x) - Calculates the arithmetic average of a numeric variable
sd(x) - Calculates the sample standard deviation of a numeric variable
quantile(x, probs = c(q1, q2, q3, ...)) - Produces sample quantiles corresponding to given probabilities
cor(x, y) - Computes the correlation coefficient between two numeric variables
- See Correlation Slide 13 for an example
summary(dataset$variable) - gives the 5 number summary of a continuous variable, plus the mean
min(dataset$variable) - gives minimum of continuous variable
max(dataset$variable) - gives maximum of continuous variable
sort(dataset$variable) - sorts variable in ascending order by default
- can be continuous or categorical
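The summary functions above can be sketched on two small made-up vectors (the values are hypothetical):

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

median(x)
mean(x)
sd(x)
quantile(x, probs = c(0.25, 0.50, 0.75))
cor(x, y)
summary(x)
min(x)
max(x)
sort(y)                     # ascending order
sort(y, decreasing = TRUE)  # descending order
```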
ggplot2
library(ggplot2) # always load the library
# Template: start with ggplot() and add only the layers your plot needs
ggplot(data, aes(x, y)) +
  geom_bar() + # bar chart
  geom_histogram() + # histogram
  geom_point() + # scatterplot
  geom_boxplot() + # box and whisker plot
  facet_wrap(~var) + # split plot into a multi-panel layout by a variable
  facet_grid(~var) # like facet_wrap, but arranges the panels in a strict grid of rows and/or columns
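In practice a plot uses one geom at a time. The sketch below is one possible combination, assuming the ggplot2 package is installed and using the built-in mtcars dataset purely for illustration:

```r
library(ggplot2)

# Scatterplot of mpg against wt, one panel per value of cyl
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

scatter_plot
```

Swapping geom_point() for another geom changes the plot type; for example, a histogram needs only an x variable inside aes().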