Data Frames and Lists

Andrew Singleton

harp training 16/02/2022

Data frames

  • Data frames are used to store tabular data
  • R’s built-in class is data.frame
  • Most data in harp are stored as data frames
  • Data frames should be tidy
    • Columns are variables
    • Rows are observations
    • Cells are values
  • Each column is limited to a single data type
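
As a minimal sketch of a tidy data frame, the following builds a small table in base R. The column names and values are hypothetical and chosen only for illustration.

```r
# Hypothetical example data: one row per observation, one column per variable
df <- data.frame(
  SID  = c(1001, 1010, 1047),        # station ID (numeric)
  temp = c(271.3, 275.8, 268.9),     # temperature in Kelvin (numeric)
  name = c("A", "B", "C"),           # station name (character)
  stringsAsFactors = FALSE           # keep strings as character in older R versions
)

# Each column holds a single data type
col_classes <- sapply(df, class)
print(col_classes)
```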

Tibbles

  • Tibbles are a class of data frame with a more user friendly print method
  • All harp data frames are of class tbl_df
  • The print method becomes very important with list columns
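
A short sketch of converting a base data frame to a tibble, assuming the tibble package is installed. The data are hypothetical.

```r
library(tibble)

# Hypothetical example data
df  <- data.frame(SID = c(1001, 1010), temp = c(271.3, 275.8))
tbl <- as_tibble(df)

# A tibble is still a data frame, with extra classes on top
print(class(tbl))

# The tibble print method shows the dimensions and the type of each column
print(tbl)
```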

Accessing Data

  • Columns can be accessed by
    • df$<column> or
    • df[["<column>"]]
  • If you must use position, rows come first
    • df[1, 3] : row 1, column 3
  • All rows / columns are selected by leaving that index empty
    • df[1, ] : All columns for row 1
    • df[, 3] : All rows for column 3
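
The access patterns above can be tried on a small tibble. This is a sketch with hypothetical data; note that with a tibble, single-bracket indexing always returns a tibble, whereas a base data.frame would return a vector for df[, 3].

```r
library(tibble)

# Hypothetical example data
df <- tibble(
  SID  = c(1001, 1010, 1047),
  temp = c(271.3, 275.8, 268.9),
  obs  = c(271.0, 276.1, 269.4)
)

by_dollar <- df$temp        # column as a plain vector
by_name   <- df[["temp"]]   # same result, column name given as a string

cell    <- df[1, 3]         # row 1, column 3 (a 1 x 1 tibble)
row_one <- df[1, ]          # all columns for row 1
col_3   <- df[, 3]          # all rows for column 3 (a one-column tibble)
```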

dplyr and tidyr

  • dplyr and tidyr are packages for manipulating and tidying data
  • Along with ggplot2 they form the backbone of the tidyverse
  • harp is designed to integrate with the tidyverse and harnesses many of its methods

Selecting columns

  • Use select(df, col, col, col)
  • Use a selection helper
    • starts_with
    • ends_with
    • contains
    • matches
    • num_range
    • where
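
A sketch of each selection helper in use, assuming dplyr 1.0.0 or later. The column names are hypothetical, loosely modelled on an ensemble forecast table, and are not taken from a real harp data set.

```r
library(dplyr)
library(tibble)

# Hypothetical example data
df <- tibble(
  SID       = c(1001, 1010),
  fc_mbr001 = c(271.3, 275.8),
  fc_mbr002 = c(270.9, 276.2),
  obs       = c(271.0, 276.1)
)

by_name   <- select(df, SID, obs)                           # explicit column names
by_start  <- select(df, starts_with("fc"))                  # names beginning with "fc"
by_end    <- select(df, ends_with("002"))                   # names ending with "002"
by_part   <- select(df, contains("mbr"))                    # names containing "mbr"
by_regex  <- select(df, matches("_mbr[[:digit:]]{3}$"))     # names matching a regular expression
by_range  <- select(df, num_range("fc_mbr", 1:2, width = 3)) # fc_mbr001, fc_mbr002
by_type   <- select(df, where(is.numeric))                  # columns for which the function is TRUE
```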

Extract a column

  • Use pull(df, col)
  • Same as df$col, but useful in pipelines
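
A small sketch of pull at the end of a pipeline, using hypothetical data. The filter step is there only to show why pull is convenient when the data frame is the result of earlier steps.

```r
library(dplyr)
library(tibble)

# Hypothetical example data
df <- tibble(SID = c(1001, 1010, 1047), temp = c(271.3, 275.8, 268.9))

# Extract the temp column from the filtered rows and average it
mean_temp <- df %>%
  filter(SID != 1010) %>%
  pull(temp) %>%
  mean()

# Outside a pipeline, pull(df, temp) and df$temp give the same vector
same_vector <- identical(pull(df, temp), df$temp)
```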

Modify or add columns

  • Use mutate(df, col = f(col))
  • Modify a single column
mutate(df, temp = temp - 273.15)
  • Make new columns (operations are done in sequence, so later columns can use earlier ones)
mutate(
  df,
  fc      = fc - 273.15,
  bias    = fc - obs,
  sq_bias = bias ^ 2
)
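
To see the sequential evaluation in action, the same mutate call can be run on a small table. This is a sketch with hypothetical values; fc is assumed to be in Kelvin and obs in degrees Celsius.

```r
library(dplyr)
library(tibble)

# Hypothetical example data: forecast in Kelvin, observation in degrees Celsius
df <- tibble(fc = c(275.15, 270.15), obs = c(1.5, -2.0))

res <- mutate(
  df,
  fc      = fc - 273.15,  # fc is converted first ...
  bias    = fc - obs,     # ... so bias uses the converted fc
  sq_bias = bias ^ 2      # and sq_bias uses the new bias column
)
```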

Scoped mutates

  • Use across() with column names or a selection helper, plus a function or formula
mutate(df, across(c(fc, obs), round))
mutate(df, across(c(fc, obs), ~.x - 273.15))
mutate(df, across(contains("mbr"), ~.x - 273.15))
mutate(df, across(matches("_mbr[[:digit:]]{3}"), ~.x - 273.15))
mutate(df, across(num_range("_mbr", 1:5, width = 3), ~.x - 273.15))
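
A runnable sketch of a scoped mutate, assuming dplyr 1.0.0 or later. The column names and values are hypothetical; only the selected columns are changed and all others are left as they are.

```r
library(dplyr)
library(tibble)

# Hypothetical example data, all temperatures in Kelvin
df <- tibble(
  SID       = c(1001, 1010),
  fc_mbr001 = c(273.15, 283.15),
  fc_mbr002 = c(274.15, 284.15),
  obs       = c(275.15, 285.15)
)

# Convert every member column and obs to degrees Celsius in one step
df_c <- mutate(df, across(c(contains("mbr"), obs), ~ .x - 273.15))
```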

Filtering

  • Use filter(df, condition)
  • Returns rows where the condition is TRUE
filter(df, temp <= 0)
filter(df, temp <= 0, precip > 0)
filter(df, date == str_datetime_to_datetime(2022021600))
filter(df, SID %in% c(1001, 1010, 1047))
filter(df, between(temp, -20, -10))
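
The filters above can be checked against a small table. This is a sketch with hypothetical data; the date example is left out because it depends on a harp helper function.

```r
library(dplyr)
library(tibble)

# Hypothetical example data
df <- tibble(
  SID    = c(1001, 1010, 1047, 1050),
  temp   = c(-15, 2, -3, 0),
  precip = c(0, 1.2, 0.4, 0)
)

freezing      <- filter(df, temp <= 0)                     # SIDs 1001, 1047, 1050
freezing_wet  <- filter(df, temp <= 0, precip > 0)         # conditions are combined with AND
some_stations <- filter(df, SID %in% c(1001, 1010, 1047))  # membership test
very_cold     <- filter(df, between(temp, -20, -10))       # between() includes both limits
```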

Grouping and summarizing

  • Use group_by to define groups
  • Use summarize to get a single value for groups
  • Use the pipe %>% to send the result to the next function
group_by(df, year, month) %>%
  summarize(
    num_cases = n(),
    temp_mean = mean(temp, na.rm = TRUE),
    temp_sd   = sd(temp, na.rm = TRUE),
    temp_max  = max(temp, na.rm = TRUE),
    temp_min  = min(temp, na.rm = TRUE)
  )
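
A runnable sketch of the same pattern on hypothetical data. The .groups = "drop" argument (dplyr 1.0.0 or later) is an addition here so that the result is returned ungrouped.

```r
library(dplyr)
library(tibble)

# Hypothetical example data with one missing temperature
df <- tibble(
  year  = c(2021, 2021, 2022, 2022),
  month = c(12, 12, 1, 1),
  temp  = c(-2, NA, -8, -4)
)

stats <- df %>%
  group_by(year, month) %>%
  summarize(
    num_cases = n(),                       # counts rows, including those with NA
    temp_mean = mean(temp, na.rm = TRUE),  # NA values are dropped before averaging
    .groups   = "drop"
  )
```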
    

Lists

  • A list is like a vector, BUT
    • each element can have its own type / class
    • elements can have length > 1
    • elements can be named (this is true for vectors too, but less common)
    • an element of a list can be another list
  • A lot of harp data exists as lists of data frames
  • 3D gridded data (geolist) are lists of 2D fields of gridded data (geofield)
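
A minimal sketch of a list that mixes types, lengths and nesting, in base R. All names and values are hypothetical.

```r
# Hypothetical example: each element has its own type and length
li <- list(
  id      = 1001,                                   # numeric, length 1
  members = c("mbr000", "mbr001"),                  # character, length 2
  obs     = data.frame(SID = 1001, temp = 271.3),   # a data frame
  nested  = list(a = 1, b = "two")                  # another list
)

# Length of each element (for a data frame this is the number of columns)
element_lengths <- lengths(li)

# Compact overview of the top-level structure
str(li, max.level = 1)
```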

Accessing list elements

  • An element is extracted, or set, by name:
    • li$name
    • li[["name"]]
  • or by position
    • li[[i]]
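
A short sketch of extracting and setting list elements in base R, using hypothetical data. It also shows single brackets for contrast, since they return a list rather than the element itself.

```r
# Hypothetical example data
li <- list(fc = c(271.3, 275.8), obs = c(271.0, 276.1))

fc_by_dollar <- li$fc         # extract by name
obs_by_name  <- li[["obs"]]   # extract by name given as a string
first        <- li[[1]]       # extract by position
sub_list     <- li[1]         # single brackets give a list of length 1

li$bias      <- li$fc - li$obs      # add a new element by name
li[["fc_c"]] <- li$fc - 273.15      # add another, using double brackets
```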

Working with lists

  • A list is created with list()
  • Other types can be coerced into a list with list() or as.list() (they work a bit differently: list(x) wraps x as a single element, as.list(x) makes each element of x a list element)
  • Lists can be joined by the concatenation function c()
  • lapply can be used to apply a function to all elements of a list
  • The purrr package is great for working with lists
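
The points above can be sketched in base R as follows. The vector and its names are hypothetical.

```r
# Hypothetical named vector
x <- c(a = 1, b = 2, c = 3)

wrapped  <- list(x)      # a list of length 1 whose only element is x
split_up <- as.list(x)   # a list of length 3, one element per element of x

# Join two lists with c()
joined <- c(split_up, list(d = 4))

# Apply a function to every element; the result is again a list
doubled <- lapply(joined, function(v) v * 2)
```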

Data frames and lists

  • A data frame is actually a special case of a list in which all elements must have the same length
  • A column of a data frame can be a list
    • This is where the print method for tibbles really helps
    • Operations on list columns need to use lapply, or a map function from the purrr package
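
A sketch of a tibble with a list column, assuming the tibble, dplyr and purrr packages are installed. The column names and values are hypothetical and do not represent a real harp data structure.

```r
library(tibble)
library(dplyr)
library(purrr)

# Hypothetical example: each cell of `members` holds a vector of its own length
df <- tibble(
  SID     = c(1001, 1010),
  members = list(c(271.3, 270.9, 272.1), c(275.8, 276.2))
)

# The tibble print method shows a compact summary of each list cell
print(df)

df <- mutate(
  df,
  members_c = lapply(members, function(m) m - 273.15),  # list in, list out
  ens_mean  = map_dbl(members, mean),                   # one number per row
  n_members = lengths(members)                          # length of each cell
)

# A data frame is itself a list of equal-length columns
df_is_list <- is.list(df)
```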

Up Next

Troubleshooting harp Installation