dplyr

Introduction to tabular data

  • We will be working with data from the Portal Project.
  • Long-term experimental study of small mammals in Arizona.

Packages

  • Main way that reusable code is shared in R
  • Combination of code, data, and documentation
  • R has a rich ecosystem of packages for data manipulation & analysis
  • We’ve already installed the packages for this demo
  • Even if we’ve installed a package, it is not automatically available to do analysis with
  • This is because different packages may have functions with the same names
  • So we don’t want to have to worry about all of the packages we’ve installed every time we write a piece of code
  • Load all of the functions in the package:
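
A minimal sketch of loading a package, assuming the package in question is dplyr (the one this lesson covers) and that it is already installed:

```r
# Attach dplyr so its functions (select, mutate, arrange, filter, ...)
# can be called by name for the rest of the session
library(dplyr)
```

The call has to be repeated in every new R session, because attaching a package lasts only for that session.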

    

Loading and viewing the dataset

  • Data on small mammal surveys is available from ratdat in the surveys data frame
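
A sketch of loading and inspecting the data, assuming ratdat is installed and makes surveys available by name once the package is attached:

```r
# Attach ratdat so the surveys data frame can be referred to by name
library(ratdat)

# Show the first few rows, then a compact summary of every column
head(surveys)
str(surveys)
```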

    

Basic dplyr

  • Modern data manipulation library for R

Select

  • Select a subset of columns.

    
  • They can occur in any order.
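
As a rough illustration of both points, assuming surveys has year, month, and day columns (these names are not given in the notes and are taken here as an assumption about ratdat):

```r
library(dplyr)
library(ratdat)

# Keep only three columns of surveys
dates <- select(surveys, year, month, day)

# The result follows the order the columns are listed in,
# not the order they have in surveys
dates_reordered <- select(surveys, month, day, year)
names(dates_reordered)
```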

    

Do Shrub Volume Data Basics 1-2.

Mutate

  • Add new columns with calculated values using mutate()

    
  • If we look at surveys now will it contain the new column?
  • Open surveys
  • All of these commands produce new values, data frames in this case
  • To store them for later use we need to assign them to a variable

    
  • Or we could overwrite the existing variable if we don’t need it
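
The steps above might look like the following sketch. The hindfoot_length column and the new column name hindfoot_length_cm are assumptions chosen for illustration, as is the idea that the length is recorded in millimetres:

```r
library(dplyr)
library(ratdat)

# Compute a new column; the result is shown but not saved anywhere
# (assumes hindfoot_length is in mm, so dividing by 10 gives cm)
head(mutate(surveys, hindfoot_length_cm = hindfoot_length / 10))

# surveys itself is unchanged, so the new column is not in it yet
"hindfoot_length_cm" %in% names(surveys)

# Assign the result to a new variable to keep it for later use
surveys_plus <- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)

# Or overwrite the existing variable if the original is no longer needed
surveys <- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
```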

    

Do Shrub Volume Data Basics 3.

Arrange

  • We can sort the data in the table using arrange
  • To sort the surveys table by weight

    
  • We can reverse the order of the sort by “wrapping” weight in another function, desc for “descending”

    
  • We can also sort by multiple columns, so if we wanted to sort first by plot_id and then by date
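
A sketch of the three sorts described above. It assumes the date is stored across separate year, month, and day columns, which the notes do not state:

```r
library(dplyr)
library(ratdat)

# Sort by weight, smallest first (rows with a missing weight go last)
by_weight <- arrange(surveys, weight)

# Wrap the column in desc() to sort largest first
by_weight_desc <- arrange(surveys, desc(weight))

# Sort by plot_id, then by date within each plot
# (date assumed to be split into year, month, and day)
by_plot_date <- arrange(surveys, plot_id, year, month, day)
```

As with the other dplyr functions, arrange() returns a new data frame and leaves surveys untouched.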

    

Do Shrub Volume Data Basics 4.

Filter

  • Use filter() to get only the rows that meet certain criteria.
  • Combine the data frame to be filtered with a series of conditional statements.
  • Column, condition, value
  • To filter the data frame to only keep the data on species DS
    • Type the name of the function, filter
    • Parentheses
    • The name of the data frame we want to filter, surveys
    • The column we want to filter on, species_id
    • The condition, which is == for “is equal to”
    • And then the value, "DS"
    • DS here is a string, not a variable or a column name, so we enclose it in quotation marks
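
Putting those pieces together gives something like this sketch (the variable name ds_data is only illustrative):

```r
library(dplyr)
library(ratdat)

# function(data frame, column condition value)
# Keep only the rows whose species_id is exactly "DS"
ds_data <- filter(surveys, species_id == "DS")
head(ds_data)
```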

    
  • Like with vectors we can have a condition that is “not equal to” using “!=”
  • So if we wanted the data for all species except “DS”
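
A minimal sketch of the “not equal to” filter (not_ds is an illustrative name):

```r
library(dplyr)
library(ratdat)

# Keep every row whose species_id is something other than "DS"
# (rows where species_id is missing are dropped as well, because
# the comparison gives NA rather than TRUE for them)
not_ds <- filter(surveys, species_id != "DS")
head(not_ds)
```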

    
  • We can also filter on multiple conditions at once
  • In computing we combine conditions in two ways: “and” & “or”
  • “and” means that all of the conditions must be true
  • Do this in dplyr using additional comma-separated arguments
  • So, to get the data on species “DS” for the year 1995:

    
  • Alternatively we can use the & symbol, which stands for “and”

    
  • This approach is mostly useful for building more complex conditions
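
Both ways of writing “and” can be sketched as follows, assuming the survey year is stored in a column named year:

```r
library(dplyr)
library(ratdat)

# Comma-separated conditions must all be true for a row to be kept
ds_1995 <- filter(surveys, species_id == "DS", year == 1995)

# The same filter written as a single condition joined with &
ds_1995_amp <- filter(surveys, species_id == "DS" & year == 1995)
```

The two calls keep the same rows; the & form becomes useful once it has to be mixed with other operators inside one condition.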

  • “or” means that one or more of the conditions must be true

  • Do this using |

  • To get data on all of the Dipodomys species
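
A sketch of an “or” filter. The species codes "DS", "DM", and "DO" are assumed here to be the Dipodomys codes in this dataset; the notes do not list them:

```r
library(dplyr)
library(ratdat)

# Keep a row if any one of the conditions is true
# (codes assumed to stand for the Dipodomys species)
dipodomys <- filter(
  surveys,
  species_id == "DS" | species_id == "DM" | species_id == "DO"
)
```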


    

Do Shrub Volume Data Basics 5-7.

Filtering null values

  • One of the common tasks we use filter for is removing null values from data
  • Based on what we learned before it’s natural to think that we do this by using the condition weight != NA
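
Trying that condition directly can be sketched like this (no_rows is an illustrative name):

```r
library(dplyr)
library(ratdat)

# Comparing any value with NA gives NA, never TRUE,
# and filter() only keeps rows where the condition is TRUE
no_rows <- filter(surveys, weight != NA)

# Count how many rows survived the filter
nrow(no_rows)
```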

    
  • Why didn’t that work?
  • Null values like NA are special
  • We don’t want to accidentally say that two “missing” things are the same
    • We don’t know if they are
  • So use special commands
  • is.na() checks if the value is NA
  • So if we wanted all of the data where the weight is NA
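
A minimal sketch of selecting the rows with a missing weight (missing_weight is an illustrative name):

```r
library(dplyr)
library(ratdat)

# is.na(weight) is TRUE for rows with a missing weight, FALSE otherwise,
# so filter() keeps exactly the rows where the weight is NA
missing_weight <- filter(surveys, is.na(weight))
head(missing_weight)
```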

    
  • We’ll learn more about why this works the same way as the other conditional statements when we study conditionals in detail later in the course

  • To remove null values we combine this with ! for “not”
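
A sketch of dropping the rows with a missing weight (has_weight is an illustrative name):

```r
library(dplyr)
library(ratdat)

# ! flips TRUE and FALSE, so this keeps the rows where weight is not NA
has_weight <- filter(surveys, !is.na(weight))
head(has_weight)
```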


    
  • So !is.na(weight) does what we intended “weight != NA” to do
  • It is common to combine a null filter with other conditions using “and”
  • For example, we might want all of the data on a species where a weight was recorded
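
Combining a null filter with another condition might look like this sketch, using species "DS" as the example species (ds_weights is an illustrative name):

```r
library(dplyr)
library(ratdat)

# Both conditions must hold: the species is "DS" and the weight is present
ds_weights <- filter(surveys, species_id == "DS", !is.na(weight))
head(ds_weights)
```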

    

Do Shrub Volume Data Basics 8.