dplyr
Introduction to tabular data
- We will be working with data from the Portal Project.
- Long-term experimental study of small mammals in Arizona.
Packages
- Main way that reusable code is shared in R
- Combination of code, data, and documentation
- R has a rich ecosystem of packages for data manipulation & analysis
- We’ve already installed the packages for this demo
- Even if we’ve installed a package it is not automatically available to do analysis with
- This is because different packages may have functions with the same names
- So we don’t want to have to worry about all of the packages we’ve installed every time we write a piece of code
- To use a package, load all of its functions with `library()`:
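A minimal sketch of loading the packages for this demo, assuming `dplyr` and `ratdat` (the package these notes name as the source of the `surveys` data) are already installed:

```r
# Attach each package so its functions and data can be used by name
library(dplyr)   # data manipulation functions
library(ratdat)  # Portal Project teaching data, including surveys
```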
Loading and viewing the dataset
- Data on small mammal surveys is available from `ratdat` in the `surveys` data frame
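One way to take a first look at the data might be the following sketch. `head()` and `str()` are base R functions chosen here for illustration; the notes do not say which viewing command was used in the demo.

```r
library(ratdat)

# Show the first few rows of the surveys data frame
head(surveys)

# Summarize the structure: one line per column with its type
str(surveys)
```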
Basic dplyr
- Modern data manipulation library for R
Select
- Select a subset of columns.
- They can occur in any order.
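As a sketch, selecting the date columns could look like this. The column names `year`, `month`, and `day` are assumed from the `ratdat` version of `surveys`.

```r
library(dplyr)
library(ratdat)

# Keep only the year, month, and day columns
select(surveys, year, month, day)

# The same columns, returned in a different order
select(surveys, month, day, year)
```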
Mutate
- Add new columns with calculated values using `mutate()`
- If we look at `surveys` now, will it contain the new column?
- Open `surveys`
- All of these commands produce new values, data frames in this case
- To store them for later use we need to assign them to a variable
- Or we could overwrite the existing variable if we don’t need it
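The points above can be sketched as follows. The new column name `hindfoot_length_cm` and the variable name `surveys_plus` are illustrative, and the conversion assumes `hindfoot_length` is recorded in millimeters.

```r
library(dplyr)
library(ratdat)

# mutate() returns a new data frame; surveys itself is not changed
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
"hindfoot_length_cm" %in% names(surveys)   # FALSE

# Store the result in a new variable for later use
surveys_plus <- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)

# Or overwrite the existing variable if the original is no longer needed
surveys <- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
```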
Arrange
- We can sort the data in the table using `arrange`
- To sort the `surveys` table by weight
- We can reverse the order of the sort by “wrapping” `weight` in another function, `desc`, for “descending”
- We can also sort by multiple columns, so if we wanted to sort first by `plot_id` and then by date
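A sketch of the three sorts described above. It assumes the date in `surveys` is stored in separate `year`, `month`, and `day` columns, so sorting "by date" means sorting by all three in that order.

```r
library(dplyr)
library(ratdat)

# Sort by weight, smallest first
arrange(surveys, weight)

# Sort by weight, largest first
arrange(surveys, desc(weight))

# Sort by plot, then by date within each plot
arrange(surveys, plot_id, year, month, day)
```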
Filter
- Use `filter()` to get only the rows that meet certain criteria.
- Combine the data frame to be filtered with a series of conditional statements.
- Column, condition, value
- To filter the data frame to only keep the data on species `DS`
  - Type the name of the function, `filter`
  - Parentheses
  - The name of the data frame we want to filter, `surveys`
  - The column we want to filter on, `species_id`
  - The condition, which is `==` for “is equal to”
  - And then the value, `"DS"`
  - `DS` here is a string, not a variable or a column name, so we enclose it in quotation marks
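Putting those pieces together, the call would look something like this:

```r
library(dplyr)
library(ratdat)

# function(data frame, column condition value)
filter(surveys, species_id == "DS")
```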
- Like with vectors we can have a condition that is “not equal to” using `!=`
- So if we wanted the data for all species except “DS”
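A sketch of the “not equal to” version of the same filter:

```r
library(dplyr)
library(ratdat)

# Keep every row whose species_id is something other than "DS"
filter(surveys, species_id != "DS")
```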
- We can also filter on multiple conditions at once
- In computing we combine conditions in two ways: “and” & “or”
- “and” means that all of the conditions must be true
- Do this in `dplyr` using additional comma-separated arguments
- So, to get the data on species “DS” for the year 1995:
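A sketch of the comma-separated form, assuming the survey year is stored in a column named `year`:

```r
library(dplyr)
library(ratdat)

# Both conditions must be true: species is "DS" and the year is 1995
filter(surveys, species_id == "DS", year == 1995)
```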
- Alternatively we can use the `&` symbol, which stands for “and”
- This approach is mostly useful for building more complex conditions
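The same “and” filter written with `&` might look like this; it should return the same rows as the comma-separated form.

```r
library(dplyr)
library(ratdat)

# & joins the two conditions into a single argument
filter(surveys, species_id == "DS" & year == 1995)
```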
- “or” means that one or more of the conditions must be true
- Do this using `|`
- To get data on all of the Dipodomys species
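A sketch of an “or” filter. The species codes `"DM"` and `"DO"` are an assumption here: alongside `"DS"`, they are taken to be the Dipodomys codes in the Portal data.

```r
library(dplyr)
library(ratdat)

# Keep a row if its species_id matches any one of the three codes
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")
```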
Filtering null values
- One of the common tasks we use `filter` for is removing null values from data
- Based on what we learned before it’s natural to think that we do this by using the condition `weight != NA`
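Trying that condition directly would look like this sketch. Any comparison with `NA` evaluates to `NA` rather than `TRUE`, and `filter()` keeps only rows where the condition is `TRUE`, so no rows come back.

```r
library(dplyr)
library(ratdat)

# Runs without error, but returns a data frame with zero rows
filter(surveys, weight != NA)
```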
- Why didn’t that work?
- Null values like `NA` are special
- We don’t want to accidentally say that two “missing” things are the same
- We don’t know if they are
- So use special commands
- `is.na()` checks if the value is `NA`
- So if we wanted all of the data where the weight is `NA`
- We’ll learn more about why this works in the same way as the other conditional statements when we study conditionals in detail later in the course
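A sketch of filtering for the rows with a missing weight:

```r
library(dplyr)
library(ratdat)

# is.na(weight) is TRUE for each row where weight is missing
filter(surveys, is.na(weight))
```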
- To remove null values we combine this with `!` for “not”
- So `!is.na(weight)` is conceptually the same as `weight != NA`
- It is common to combine a null filter with other conditions using “and”
- For example we might want all of the records for a species that have a weight
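Both uses can be sketched as follows, with `"DS"` as the example species:

```r
library(dplyr)
library(ratdat)

# Drop every row where weight is missing
filter(surveys, !is.na(weight))

# Keep only "DS" records that have a recorded weight
filter(surveys, species_id == "DS", !is.na(weight))
```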