harpIO only includes functionality to read observations from
vobs files, though observations can come in many formats. One
simple format is a comma-separated values (CSV) text file. harp
is designed so that you can write your own functions for harp’s
read functions to call, meaning that the data you work with in
harp always have the same format regardless of the file formats they came
from. This vignette demonstrates how to write a function that
read_obs_file() or read_obs() can call to
read CSV files that follow a particular format.
Let’s say our CSV data have columns for station ID (“stationId”), station elevation (“stationElevation”), and columns for the variables 2m temperature (“TA”), 10m wind speed (“FF10”), 1-hour precipitation (“RR1”), 2m relative humidity (“RH2”) and mean sea level pressure (“MSLP”), and that there is one file for each observation time. We’ll begin by making a function to generate some fake data following these conventions.
library(harpIO)
#> Loading required package: harpCore
#>
#> Attaching package: 'harpCore'
#> The following object is masked from 'package:stats':
#>
#> filter
library(tibble)
make_fake_data <- function(num_stn) {
  set.seed(num_stn) # so stationElevation is the same every time
  station_elevs <- round(runif(num_stn, 0, 350))
  set.seed(Sys.time())
  tibble::tibble(
    stationId        = seq_len(num_stn),
    stationElevation = station_elevs,
    TA   = round(runif(num_stn, 5, 35), 1),
    FF10 = round(runif(num_stn, 0.25, 35), 1),
    RR1  = round(round(rgamma(num_stn, 1, 1)) * runif(num_stn, 1, 2), 1),
    RH2  = round(runif(num_stn, 0.5, 1), 2),
    MSLP = round(runif(num_stn, 970, 1030))
  )
}
And then another function that calls make_fake_data() and writes those data to CSV files for a range of date-times.
make_csv_files <- function(dttm, num_stn, file_path) {
  invisible(lapply(
    dttm,
    \(x) {
      df <- make_fake_data(num_stn)
      file_name <- file.path(file_path, paste0("obs_", x, ".csv"))
      write.table(df, file_name, sep = ",", row.names = FALSE)
    }
  ))
}
And run the function to create 24 hours of fake data, saving to tempdir().
Now that we have a bunch of CSV files, we need a function that will read
them in and return data in a format that read_obs_file()
and read_obs() can use. There are a couple of important
things to remember: the function should return a named list of two data
frames, one for the data themselves and
one that contains information about the units and accumulation times of
the observations. We will call these data frames “synop” for the data
and “synop_params” for the information about units. “synop” is chosen so
that harp knows that these are near-surface observations akin to
official SYNOP meteorological observations. We should also use a
parameter naming convention that harp knows (see
show_param_defs()), as well as the column names “SID” for
station ID and “elev” for station elevation. Note that we use namespaces
for all functions from packages to future-proof the function in case we
want to use it in a package later.
read_csv_obs <- function(file_name, dttm, parameter = NULL, ...) {
  # read the csv data using read.csv
  obs_data <- read.csv(file_name)
  # add a column for the date-time
  obs_data <- dplyr::mutate(
    obs_data, valid_dttm = dttm, .before = dplyr::everything()
  )
  # Change column names to names harp knows about using psub()
  # We wrap in suppressWarnings as we don't want it to warn about
  # substitutions that don't exist in case not all files contain all the
  # expected column names.
  colnames(obs_data) <- suppressWarnings(psub(
    colnames(obs_data),
    c("stationId", "stationElevation", "TA", "FF10", "RR1", "RH2", "MSLP"),
    c("SID", "elev", "T2m", "S10m", "AccPcp1h", "RH2m", "Pmsl")
  ))
  # Only return parameters that have been asked for
  if (length(parameter) > 0 && !all(is.na(parameter))) {
    parameter <- stats::na.omit(parameter)
    obs_data <- dplyr::select(
      obs_data,
      dplyr::any_of(c("valid_dttm", "SID", "elev", parameter))
    )
  }
  # Make a data frame for parameter units
  obs_units <- tibble::tribble(
    ~parameter, ~accum_hours, ~units,
    "T2m"     , 0           , "degC",
    "S10m"    , 0           , "m/s",
    "AccPcp1h", 1           , "kg/m^2",
    "RH2m"    , 0           , "fraction",
    "Pmsl"    , 0           , "hPa"
  )
  # Filter the obs_units data frame to only those parameters that are in the
  # data
  obs_units <- dplyr::filter(obs_units, .data$parameter %in% colnames(obs_data))
  # return the data as a named list
  list(synop = obs_data, synop_params = obs_units)
}
We can now test the read_csv_obs() function on its own
or called by read_obs_file() by setting
file_format = "csv_obs".
read_csv_obs(file.path(tempdir(), "obs_2023112812.csv"), 2023112812)
#> $synop
#> valid_dttm SID elev T2m S10m AccPcp1h RH2m Pmsl
#> 1 2023112812 1 35 17.4 10.2 0.0 0.63 984
#> 2 2023112812 2 171 6.2 6.6 0.0 0.77 1017
#> 3 2023112812 3 127 32.0 23.8 0.0 0.89 974
#> 4 2023112812 4 147 10.2 17.1 3.4 0.95 1011
#> 5 2023112812 5 105 35.0 3.1 1.9 0.73 1012
#> 6 2023112812 6 52 23.5 21.5 1.0 0.70 1026
#> 7 2023112812 7 315 19.0 29.0 1.9 0.87 1012
#> 8 2023112812 8 78 11.4 16.3 1.2 0.76 1025
#> 9 2023112812 9 338 12.9 16.8 2.0 0.53 1011
#> 10 2023112812 10 49 10.9 16.1 0.0 0.50 1026
#> 11 2023112812 11 23 8.1 0.5 0.0 0.73 1008
#> 12 2023112812 12 139 7.8 7.7 2.5 0.94 978
#> 13 2023112812 13 190 5.1 15.3 1.5 0.68 1028
#> 14 2023112812 14 308 9.2 24.4 1.6 0.90 994
#> 15 2023112812 15 78 5.7 32.8 0.0 0.82 996
#> 16 2023112812 16 322 10.1 28.1 1.7 0.82 990
#> 17 2023112812 17 88 25.9 25.3 1.2 0.94 1006
#> 18 2023112812 18 291 29.6 20.9 5.4 0.58 1003
#> 19 2023112812 19 213 9.6 26.5 0.0 0.95 1025
#> 20 2023112812 20 151 19.5 30.1 7.3 0.78 985
#> 21 2023112812 21 54 26.7 1.6 0.0 0.65 994
#> 22 2023112812 22 205 19.6 33.9 1.1 0.79 1028
#> 23 2023112812 23 12 5.3 25.1 0.0 0.65 1011
#> 24 2023112812 24 115 24.9 24.3 1.1 0.95 997
#> 25 2023112812 25 88 14.6 6.3 0.0 0.85 1014
#> 26 2023112812 26 104 16.9 24.1 0.0 0.91 991
#> 27 2023112812 27 167 11.3 17.7 0.0 0.76 995
#> 28 2023112812 28 90 29.8 9.5 0.0 0.75 976
#> 29 2023112812 29 284 31.1 5.8 0.0 0.85 993
#> 30 2023112812 30 164 23.4 2.9 0.0 0.58 989
#>
#> $synop_params
#> # A tibble: 5 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
#> 3 AccPcp1h 1 kg/m^2
#> 4 RH2m 0 fraction
#> 5 Pmsl 0 hPa
read_obs_file(
  file.path(tempdir(), "obs_2023112812.csv"), NULL,
  dttm = 2023112812, file_format = "csv_obs"
)
#> $synop
#> valid_dttm SID elev T2m S10m AccPcp1h RH2m Pmsl
#> 1 1701172800 1 35 17.4 10.2 0.0 0.63 984
#> 2 1701172800 2 171 6.2 6.6 0.0 0.77 1017
#> 3 1701172800 3 127 32.0 23.8 0.0 0.89 974
#> 4 1701172800 4 147 10.2 17.1 3.4 0.95 1011
#> 5 1701172800 5 105 35.0 3.1 1.9 0.73 1012
#> 6 1701172800 6 52 23.5 21.5 1.0 0.70 1026
#> 7 1701172800 7 315 19.0 29.0 1.9 0.87 1012
#> 8 1701172800 8 78 11.4 16.3 1.2 0.76 1025
#> 9 1701172800 9 338 12.9 16.8 2.0 0.53 1011
#> 10 1701172800 10 49 10.9 16.1 0.0 0.50 1026
#> 11 1701172800 11 23 8.1 0.5 0.0 0.73 1008
#> 12 1701172800 12 139 7.8 7.7 2.5 0.94 978
#> 13 1701172800 13 190 5.1 15.3 1.5 0.68 1028
#> 14 1701172800 14 308 9.2 24.4 1.6 0.90 994
#> 15 1701172800 15 78 5.7 32.8 0.0 0.82 996
#> 16 1701172800 16 322 10.1 28.1 1.7 0.82 990
#> 17 1701172800 17 88 25.9 25.3 1.2 0.94 1006
#> 18 1701172800 18 291 29.6 20.9 5.4 0.58 1003
#> 19 1701172800 19 213 9.6 26.5 0.0 0.95 1025
#> 20 1701172800 20 151 19.5 30.1 7.3 0.78 985
#> 21 1701172800 21 54 26.7 1.6 0.0 0.65 994
#> 22 1701172800 22 205 19.6 33.9 1.1 0.79 1028
#> 23 1701172800 23 12 5.3 25.1 0.0 0.65 1011
#> 24 1701172800 24 115 24.9 24.3 1.1 0.95 997
#> 25 1701172800 25 88 14.6 6.3 0.0 0.85 1014
#> 26 1701172800 26 104 16.9 24.1 0.0 0.91 991
#> 27 1701172800 27 167 11.3 17.7 0.0 0.76 995
#> 28 1701172800 28 90 29.8 9.5 0.0 0.75 976
#> 29 1701172800 29 284 31.1 5.8 0.0 0.85 993
#> 30 1701172800 30 164 23.4 2.9 0.0 0.58 989
#>
#> $synop_params
#> # A tibble: 5 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
#> 3 AccPcp1h 1 kg/m^2
#> 4 RH2m 0 fraction
#> 5 Pmsl 0 hPa
read_obs_file(
  file.path(tempdir(), "obs_2023112812.csv"), c("T2m", "S10m"),
  dttm = 2023112812, file_format = "csv_obs"
)
#> $synop
#> valid_dttm SID elev T2m S10m
#> 1 1701172800 1 35 17.4 10.2
#> 2 1701172800 2 171 6.2 6.6
#> 3 1701172800 3 127 32.0 23.8
#> 4 1701172800 4 147 10.2 17.1
#> 5 1701172800 5 105 35.0 3.1
#> 6 1701172800 6 52 23.5 21.5
#> 7 1701172800 7 315 19.0 29.0
#> 8 1701172800 8 78 11.4 16.3
#> 9 1701172800 9 338 12.9 16.8
#> 10 1701172800 10 49 10.9 16.1
#> 11 1701172800 11 23 8.1 0.5
#> 12 1701172800 12 139 7.8 7.7
#> 13 1701172800 13 190 5.1 15.3
#> 14 1701172800 14 308 9.2 24.4
#> 15 1701172800 15 78 5.7 32.8
#> 16 1701172800 16 322 10.1 28.1
#> 17 1701172800 17 88 25.9 25.3
#> 18 1701172800 18 291 29.6 20.9
#> 19 1701172800 19 213 9.6 26.5
#> 20 1701172800 20 151 19.5 30.1
#> 21 1701172800 21 54 26.7 1.6
#> 22 1701172800 22 205 19.6 33.9
#> 23 1701172800 23 12 5.3 25.1
#> 24 1701172800 24 115 24.9 24.3
#> 25 1701172800 25 88 14.6 6.3
#> 26 1701172800 26 104 16.9 24.1
#> 27 1701172800 27 167 11.3 17.7
#> 28 1701172800 28 90 29.8 9.5
#> 29 1701172800 29 284 31.1 5.8
#> 30 1701172800 30 164 23.4 2.9
#>
#> $synop_params
#> # A tibble: 2 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
Note the difference in the valid_dttm column.
read_csv_obs() returns the date-times that were given to
it, whereas read_obs_file() converts the
valid_dttm column to UNIX epoch time format, that is, the
number of seconds since 00:00:00 on 1 January 1970. This is because it is
the most efficient way to store and process date-time data.
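The conversion can be checked with base R alone; a quick sketch (harp performs the equivalent conversion internally):

```r
# 2023112812 corresponds to 2023-11-28 12:00 UTC; its UNIX epoch time
# matches the valid_dttm values returned by read_obs_file() above.
as.numeric(as.POSIXct("2023-11-28 12:00:00", tz = "UTC"))
#> [1] 1701172800
```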
If we want to use the observations throughout harp, it is most efficient
to convert them to SQLite format, which is much quicker to access and
filter to what we need using read_point_obs(). This is done
by calling read_obs() and setting output_format_opts.
read_obs(
  seq_dttm(2023112800, 2023112823),
  NULL,
  file_format        = "csv_obs",
  file_path          = tempdir(),
  file_template      = "obs_{YYYY}{MM}{DD}{HH}.csv",
  output_format_opts = obstable_opts(file.path(tempdir(), "OBSTABLE"))
)
#>
#> ***
#> New SQLITE obs file created: /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> ***
#> Writing to: SYNOP in /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> No data returned. Set `return_data = TRUE` to return data.
And then read the data back in using read_point_obs().
read_point_obs(
  seq_dttm(2023112803, 2023112819), "T2m",
  obs_path = file.path(tempdir(), "OBSTABLE"), stations = c(4, 6, 9)
)
#> Getting T2m.
#>
#> Reading: /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> # A tibble: 51 × 5
#> valid_dttm SID elev T2m units
#> <dttm> <int> <dbl> <dbl> <chr>
#> 1 2023-11-28 03:00:00 4 147 10.2 degC
#> 2 2023-11-28 03:00:00 6 52 23.5 degC
#> 3 2023-11-28 03:00:00 9 338 12.9 degC
#> 4 2023-11-28 04:00:00 4 147 10.2 degC
#> 5 2023-11-28 04:00:00 6 52 23.5 degC
#> 6 2023-11-28 04:00:00 9 338 12.9 degC
#> 7 2023-11-28 05:00:00 4 147 10.2 degC
#> 8 2023-11-28 05:00:00 6 52 23.5 degC
#> 9 2023-11-28 05:00:00 9 338 12.9 degC
#> 10 2023-11-28 06:00:00 4 147 10.2 degC
#> # ℹ 41 more rows