Data Frames and Lists

Andrew Singleton

harp training 16/02/2022

Data frames

Data frames are used to store tabular data
R’s built-in class is data.frame
Most data in harp are stored as data frames
Data frames should be tidy
- Columns are variables
- Rows are observations
- Cells are values
Each column is limited to a single data type
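
As a minimal sketch of the points above, the following builds a small data frame and checks the type of each column. The station IDs and temperatures are made-up values for illustration only.

```r
# Hypothetical toy data: one row per observation, one column per variable
df <- data.frame(
  SID  = c(1001, 1010, 1047),
  temp = c(271.3, 274.8, 268.9)
)

# Each column holds a single data type
col_types <- sapply(df, class)
print(col_types)

# Mixing types in one vector (and so in one column) forces coercion
# to a common type, here character
mixed <- c(1001, "unknown")
print(class(mixed))
```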

Tibbles

Tibbles are a class of data frame with a more user-friendly print method
All harp data frames are of class tbl_df
This becomes very important with list columns
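
A short sketch of what "a class of data frame" means in practice, assuming the tibble package is installed. The values are made up.

```r
library(tibble)

# Hypothetical toy data
df <- tibble(SID = c(1001, 1010), temp = c(271.3, 274.8))

# A tibble carries the tbl_df class on top of data.frame
df_class <- class(df)
print(df_class)

# So anything that works on a data.frame also accepts a tibble
print(is.data.frame(df))

# The print method shows dimensions and column types
print(df)
```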

Accessing Data

Columns can be accessed by
- df$<column> or
- df[["<column>"]]
If you must use position, rows come first
- df[1, 3] : row 1, column 3
To select all rows or all columns, leave that index empty
- df[1, ] : All columns for row 1
- df[, 3] : All rows for column 3
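
The access patterns above can be tried on a toy tibble. The data are made up; note that, for a tibble, positional indexing with single brackets always returns a tibble rather than a bare vector.

```r
library(tibble)

# Hypothetical toy data
df <- tibble(
  SID       = c(1001, 1010, 1047),
  lead_time = c(0, 3, 6),
  temp      = c(271.3, 274.8, 268.9)
)

# Access a column by name: both forms return the same vector
temp_a <- df$temp
temp_b <- df[["temp"]]
print(identical(temp_a, temp_b))

# Access by position: rows first, then columns
cell      <- df[1, 3]  # row 1, column 3 (a 1 x 1 tibble)
first_row <- df[1, ]   # all columns for row 1
third_col <- df[, 3]   # all rows for column 3 (a 3 x 1 tibble)
print(first_row)
```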

dplyr and tidyr

dplyr and tidyr are packages for manipulating and tidying data
Along with ggplot2 they form the backbone of the tidyverse
harp is designed to integrate with the tidyverse and harnesses many of its methods
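
As an illustration of "tidying", here is a sketch using tidyr's pivot_longer to turn one-column-per-member data into one row per observation. The column names (model_mbr001, ...) and values are assumptions made for this example, not taken from a real harp data set.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per ensemble member
wide <- tibble(
  SID          = c(1001, 1010),
  model_mbr001 = c(271.3, 274.8),
  model_mbr002 = c(271.9, 275.2)
)

# Tidy form: the member becomes a variable, the forecast a single column
long <- pivot_longer(
  wide,
  cols      = contains("mbr"),
  names_to  = "member",
  values_to = "fc"
)
print(long)
```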

Selecting columns

Use select(df, col, col, col)
Use a selection helper
- starts_with
- ends_with
- contains
- matches
- num_range
- where
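
A sketch of the selection helpers on a toy tibble. The column names are assumptions chosen so that each helper has something to match.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
df <- tibble(
  SID          = c(1001, 1010),
  station      = c("A", "B"),
  obs          = c(272.1, 275.0),
  model_mbr001 = c(271.3, 274.8),
  model_mbr002 = c(271.9, 275.2)
)

by_name   <- names(select(df, SID, obs))
by_start  <- names(select(df, starts_with("model")))
by_end    <- names(select(df, ends_with("002")))
by_substr <- names(select(df, contains("mbr")))
by_regex  <- names(select(df, matches("_mbr[[:digit:]]{3}")))
by_range  <- names(select(df, num_range("model_mbr", 1:2, width = 3)))
by_type   <- names(select(df, where(is.numeric)))

print(by_substr)
print(by_type)
```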

Extract a column

Use pull(df, col)
Same as df$col, but useful in pipelines
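
For example, pull can end a pipeline to return a plain vector. This is a sketch with made-up data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (temperatures in Kelvin)
df <- tibble(
  SID  = c(1001, 1010, 1047),
  temp = c(271.3, 274.8, 268.9)
)

# Station IDs with temperatures at or below freezing
cold_sids <- df %>%
  filter(temp <= 273.15) %>%
  pull(SID)

print(cold_sids)
```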

Modify or add columns

Use mutate(df, col = f(col))
Modify a single column

mutate(df, temp = temp - 273.15)

Make new columns (operations are done in series)

mutate(
  df,
  fc      = fc - 273.15,
  bias    = fc - obs,
  sq_bias = bias ^ 2
)
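
A runnable version of the pattern above on made-up data, assuming fc is in Kelvin and obs is already in degrees Celsius. Because the operations run in sequence, bias uses the converted fc and sq_bias uses the new bias column.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: fc in Kelvin, obs in degrees Celsius
df <- tibble(
  fc  = c(275.15, 270.15),
  obs = c(1.5, -2)
)

res <- mutate(
  df,
  fc      = fc - 273.15,  # convert to degrees Celsius
  bias    = fc - obs,     # uses the converted fc
  sq_bias = bias ^ 2      # uses the bias column created just above
)
print(res)
```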

Scoped mutates

Use across() with column names or a selection helper, and a function or formula

mutate(df, across(c(fc, obs), round))
mutate(df, across(c(fc, obs), ~.x - 273.15))
mutate(df, across(contains("mbr"), ~.x - 273.15))
mutate(df, across(matches("_mbr[[:digit:]]{3}"), ~.x - 273.15))
mutate(df, across(num_range("_mbr", 1:5, width = 3), ~.x - 273.15))

Filtering

Use filter(df, condition)
Returns rows where the condition is TRUE

filter(df, temp <= 0)
filter(df, temp <= 0, precip > 0)
filter(df, date == str_datetime_to_datetime(2022021600))
filter(df, SID %in% c(1001, 1010, 1047))
filter(df, between(temp, -20, -10))
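
A sketch on made-up data showing two details of filter: conditions separated by commas are combined with AND, and rows where the condition evaluates to NA are dropped.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (temp in degrees Celsius, one missing value)
df <- tibble(
  SID    = c(1001, 1010, 1047, 1052),
  temp   = c(-5, 2, -12, NA),
  precip = c(0.4, 1.2, 0, 0.1)
)

# Comma-separated conditions must all be TRUE
freezing_wet   <- filter(df, temp <= 0, precip > 0)
freezing_wet_2 <- filter(df, temp <= 0 & precip > 0)

# The row with temp = NA is not returned
very_cold <- filter(df, between(temp, -20, -10))

print(freezing_wet)
print(very_cold)
```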

Grouping and summarizing

Use group_by to define groups
Use summarize to get a single value per group
Use the pipe %>% to send the result to the next function

group_by(df, year, month) %>%
  summarize(
    num_cases = n(),
    temp_mean = mean(temp, na.rm = TRUE),
    temp_sd   = sd(temp, na.rm = TRUE),
    temp_max  = max(temp, na.rm = TRUE),
    temp_min  = min(temp, na.rm = TRUE)
  )
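
The same pattern on made-up data, as a sketch. The .groups = "drop" argument is an addition here: it returns an ungrouped result, whereas by default summarize keeps the result grouped by all but the last grouping variable.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (temp in degrees Celsius, one missing value)
df <- tibble(
  year  = c(2021, 2021, 2022, 2022, 2022),
  month = c(12, 12, 1, 1, 2),
  temp  = c(-3, -5, -8, NA, 1)
)

res <- df %>%
  group_by(year, month) %>%
  summarize(
    num_cases = n(),                       # rows per group, including NA
    temp_mean = mean(temp, na.rm = TRUE),  # NA values are ignored
    .groups   = "drop"
  )
print(res)
```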

Lists

A list is like a vector, BUT
- each element can have its own type / class
- elements can have length > 1
- elements can be named (this is true for vectors too, but less common)
- an element of a list can be another list
- A lot of harp data exists as lists of data frames
- 3d gridded data (geolist) are lists of 2d fields of gridded data (geofield)
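
A sketch of a list whose elements differ in type and length, using base R only. The element names and values are made up.

```r
# Hypothetical list: each element has its own type and length
li <- list(
  station = "a hypothetical station",
  sids    = c(1001, 1010, 1047),
  nested  = list(a = 1, b = "two")   # a list inside a list
)

elem_classes <- sapply(li, class)
elem_lengths <- lengths(li)

print(elem_classes)
print(elem_lengths)
```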

Accessing list elements

An element is extracted, or set, by name:
- li$name
- li[["name"]]
or by position
- li[[i]]
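
A sketch of extracting and setting elements with made-up values. The comparison with single brackets is an addition: li["name"] returns a list of length one, not the element itself.

```r
# Hypothetical list
li <- list(temp = c(271.3, 274.8), units = "K")

# Extract by name or by position
units_by_name <- li$units
temp_by_name  <- li[["temp"]]
second_elem   <- li[[2]]

# Set by name
li$units      <- "degC"
li[["temp"]]  <- li[["temp"]] - 273.15

# Single brackets keep the list wrapper; double brackets do not
print(class(li["temp"]))
print(class(li[["temp"]]))
```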

Working with lists

A list is created with list()
Other types can be coerced into list with list() or as.list() (they work a bit differently…)
Lists can be joined by the concatenation function c()
lapply can be used to apply a function to all elements of a list
The purrr package is great for working with lists
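
A sketch of the functions above, assuming the purrr package is installed. It shows one way list() and as.list() differ: list(x) wraps x as a single element, while as.list(x) makes each element of x its own list element.

```r
library(purrr)

x <- 1:3

wrapped  <- list(x)     # one element holding the whole vector
split_up <- as.list(x)  # three elements, one per value

# Join lists with c()
joined <- c(list(a = 1), list(b = "two"))

# Apply a function to every element: base R and purrr
squares     <- lapply(split_up, function(v) v ^ 2)  # returns a list
squares_dbl <- map_dbl(split_up, ~ .x ^ 2)          # returns a numeric vector

print(length(wrapped))
print(length(split_up))
print(squares_dbl)
```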

Data frames and lists

A data frame is actually a special case of a list where all elements have to have the same length
A column of a data frame can be a list
- This is where the print method for tibbles really helps
- Operations on list columns need to use lapply, or a map function from the purrr package
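
A sketch of a list column with made-up data. The members column is a hypothetical name; each cell holds a vector of a different length, and purrr map functions are used to compute one value per row.

```r
library(tibble)
library(dplyr)
library(purrr)

# Hypothetical tibble with a list column
df <- tibble(
  SID     = c(1001, 1010),
  members = list(c(271.3, 272.0, 270.8), c(274.8, 275.1))
)

# A data frame is itself a list of equal-length columns
print(is.list(df))

# Operate on each cell of the list column
df <- mutate(
  df,
  n_members = map_int(members, length),
  ens_mean  = map_dbl(members, mean)
)

# The tibble print method summarizes each list cell instead of dumping it
print(df)
```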

Up Next

Troubleshooting harp Installation