dplyr

Introduction to tabular data

We will be working with data from the Portal Project.
Long-term experimental study of small mammals in Arizona.

Packages

Main way that reusable code is shared in R
Combination of code, data, and documentation
R has a rich ecosystem of packages for data manipulation & analysis
We’ve already installed the packages for this demo
Even if we’ve installed a package it is not automatically available to do analysis with
This is because different packages may have functions with the same names
So we don’t want to have to worry about all of the packages we’ve installed every time we write a piece of code
Load all of the functions in the package:
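
A minimal sketch of loading a package, assuming dplyr is one of the packages installed for this demo. The `filter` name clash shown in the comments is one example of why packages are not loaded automatically.

```r
# Assumes the dplyr package is already installed
library(dplyr)

# Both the stats package (part of base R) and dplyr define a function called filter().
# After library(dplyr), a plain filter() call refers to the dplyr version.
# Either one can always be named explicitly with package::function:
environmentName(environment(filter))   # which package a plain filter() comes from
stats::filter                          # the other function with the same name
```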

Loading and viewing the dataset

Data on small mammal surveys is available from ratdat in the surveys data frame
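
One possible way to load and look at the data, assuming the ratdat package is installed and makes `surveys` available once it is loaded:

```r
# Assumes the ratdat package is already installed
library(ratdat)

head(surveys)   # first few rows of the table
str(surveys)    # column names and the type of each column
```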

Basic `dplyr`

Modern data manipulation library for R

Select

Select a subset of columns.

They can occur in any order.
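
A short sketch of `select()`, using the column names mentioned elsewhere in these notes (plot_id, species_id, weight). The first argument is the data frame, followed by the columns to keep.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Keep three columns
select(surveys, plot_id, species_id, weight)

# The same columns in a different order; the result follows the order we list them in
select(surveys, weight, species_id, plot_id)
```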

Do Shrub Volume Data Basics 1-2.

Mutate

Add new columns with calculated values using mutate()
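
As an illustration only, a sketch that adds a weight column in kilograms. The column name `weight_kg` is made up for this example, and it assumes weight is recorded in grams.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# weight_kg is a hypothetical new column; assumes weight is in grams
mutate(surveys, weight_kg = weight / 1000)
```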

If we look at surveys now will it contain the new column?
Open surveys
All of these commands produce new values, data frames in this case
To store them for later use we need to assign them to a variable

Or we could overwrite the existing variable if we don’t need it
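
A sketch of both options. The variable name `surveys_plus` and the column `weight_kg` are hypothetical, and weight is assumed to be in grams.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Store the result in a new variable; surveys itself is unchanged
surveys_plus <- mutate(surveys, weight_kg = weight / 1000)

# Or overwrite the existing variable if the original is no longer needed
surveys <- mutate(surveys, weight_kg = weight / 1000)
```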

Do Shrub Volume Data Basics 3.

Arrange

We can sort the data in the table using arrange
To sort the surveys table by weight
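
A minimal sketch, assuming dplyr and ratdat are loaded as elsewhere in these notes:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Sort rows from lightest to heaviest; rows with a missing weight go at the end
arrange(surveys, weight)
```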

We can reverse the order of the sort by “wrapping” weight in another function, desc for “descending”
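
The descending sort could look like this:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Sort rows from heaviest to lightest; missing weights still go at the end
arrange(surveys, desc(weight))
```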

We can also sort by multiple columns, so if we wanted to sort first by plot_id and then by date
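
A sketch of a multi-column sort. It assumes the date is stored in three separate columns named year, month, and day; those column names are not given in these notes.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Sort by plot first; within each plot, sort by date
# (year, month, day are assumed column names)
arrange(surveys, plot_id, year, month, day)
```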

Do Shrub Volume Data Basics 4.

Filter

Use filter() to get only the rows that meet certain criteria.
Combine the data frame to be filtered with a series of conditional statements.
Column, condition, value
To filter the data frame to only keep the data on species DS
- Type the name of the function, filter
- Parentheses
- The name of the data frame we want to filter, surveys
- The column we want to filter on, species_id
- The condition, which is == for “is equal to”
- And then the value, "DS"
- DS here is a string, not a variable or a column name, so we enclose it in quotation marks
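
Putting the steps above together gives something like:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# function(data frame, column condition value)
filter(surveys, species_id == "DS")
```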

Like with vectors we can have a condition that is “not equal to” using “!=”
So if we wanted the data for all species except “DS”
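
A minimal sketch of the “not equal to” filter:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Keep every row whose species is anything other than DS
filter(surveys, species_id != "DS")
```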

We can also filter on multiple conditions at once
In computing we combine conditions in two ways: “and” & “or”
“and” means that all of the conditions must be true
Do this in dplyr using additional comma-separated arguments
So, to get the data on species “DS” for the year 1995:
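
A sketch using comma-separated conditions. It assumes the survey year is stored in a column named year.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Both conditions must be true for a row to be kept
# (year is an assumed column name)
filter(surveys, species_id == "DS", year == 1995)
```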

Alternatively we can use the & symbol, which stands for “and”
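
The same filter written with `&`, again assuming a column named year:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# & joins the two conditions into a single argument
# (year is an assumed column name)
filter(surveys, species_id == "DS" & year == 1995)
```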

This approach is mostly useful for building more complex conditions
“or” means that one or more of the conditions must be true
Do this using |
To get data on all of the Dipodomys species
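
A sketch of an “or” filter. It assumes the Dipodomys species are coded DS, DM, and DO in species_id; only DS is named in these notes.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Keep a row if any one of the three conditions is true
# (DM and DO are assumed species codes)
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")
```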

Do Shrub Volume Data Basics 5-7.

Filtering null values

One of the common tasks we use filter for is removing null values from data
Based on what we learned before it’s natural to think that we do this by using the condition weight != NA
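
The natural first attempt looks like this. It runs without an error but keeps no rows, because comparing anything to NA gives NA rather than TRUE or FALSE, and filter() only keeps rows where the condition is TRUE.

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Returns a data frame with 0 rows
filter(surveys, weight != NA)

# The comparison itself is NA, not TRUE or FALSE
5 != NA
NA == NA
```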

Why didn’t that work?
Null values like NA are special
We don’t want to accidentally say that two “missing” things are the same
- We don’t know if they are
So use special commands
is.na() checks if the value is NA
So if we wanted all of the data where the weight is NA
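
A minimal sketch using `is.na()`:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Keep only the rows where weight is missing
filter(surveys, is.na(weight))
```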

We’ll learn more about why this works in the same way as the other conditional statements when we study conditionals in detail later in the course
To remove null values we combine this with ! for “not”
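
Adding `!` in front flips the condition:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Keep only the rows where weight is not missing
filter(surveys, !is.na(weight))
```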

So !is.na(weight) is conceptually the same as “weight != NA”
It is common to combine a null filter with other conditions using “and”
For example we might want all of the data on a species, keeping only the rows that have a weight
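
One way this could look, using species DS as the example:

```r
# Assumes dplyr and ratdat are already installed
library(dplyr)
library(ratdat)

# Rows for species DS that also have a recorded weight
filter(surveys, species_id == "DS", !is.na(weight))
```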

Do Shrub Volume Data Basics 8.