harpIO only includes functionality to read observations from
vobs files, though observations can come in many formats. One
simple format is a comma-separated values (CSV) text file. harp
is designed so that you can write your own functions for harp’s
read functions to call, meaning that the data you work with in
harp always have the same format regardless of the file formats they came
from. This vignette demonstrates how to write a function that
read_obs_file() or read_obs() can call to
read CSV files that follow a particular format.
Let’s say our CSV data have columns for station ID (“stationId”), station elevation (“stationElevation”), and columns for the variables 2m temperature (“TA”), 10m wind speed (“FF10”), 1-hour precipitation (“RR1”), 2m relative humidity (“RH2”) and mean sea level pressure (“MSLP”), and that there is one file for each observation time. We’ll begin by making a function to generate some fake data following these conventions.
library(harpIO)
#> Loading required package: harpCore
#>
#> Attaching package: 'harpCore'
#> The following object is masked from 'package:stats':
#>
#> filter
library(tibble)
make_fake_data <- function(num_stn) {
  set.seed(num_stn) # so stationElevation is the same every time
  station_elevs <- round(runif(num_stn, 0, 350))
  set.seed(Sys.time())
  tibble::tibble(
    stationId        = seq_len(num_stn),
    stationElevation = station_elevs,
    TA   = round(runif(num_stn, 5, 35), 1),
    FF10 = round(runif(num_stn, 0.25, 35), 1),
    RR1  = round(round(rgamma(num_stn, 1, 1)) * runif(num_stn, 1, 2), 1),
    RH2  = round(runif(num_stn, 0.5, 1), 2),
    MSLP = round(runif(num_stn, 970, 1030))
  )
}
And then another function that calls make_fake_data() and writes those data to CSV files for a range of date-times.
make_csv_files <- function(dttm, num_stn, file_path) {
  invisible(lapply(
    dttm,
    \(x) {
      df <- make_fake_data(num_stn)
      file_name <- file.path(file_path, paste0("obs_", x, ".csv"))
      write.table(df, file_name, sep = ",", row.names = FALSE)
    }
  ))
}
And run the function to create 24 hours of fake data, saving to tempdir().
Now that we have a bunch of CSV files, we need a function that will read
them in and return data in a format that read_obs_file()
and read_obs() can use. There are a couple of important
things to remember: the function should return a named list of two data
frames, one for the data themselves and
one that contains information about the units and accumulation times of
the observations. We will call these data frames “synop” for the data
and “synop_params” for the information about units. “synop” is chosen so
that harp knows that these are near-surface observations akin to
official SYNOP meteorological observations. We should also use a
parameter naming convention that harp knows (see
show_param_defs()), as well as the column names “SID” for
station ID and “elev” for station elevation. Note that we use namespaces
for all functions from packages to future-proof the function in case we
want to use it in a package later.
read_csv_obs <- function(file_name, dttm, parameter = NULL, ...) {
  # read the csv data using read.csv
  obs_data <- read.csv(file_name)
  # add a column for the date-time
  obs_data <- dplyr::mutate(
    obs_data, valid_dttm = dttm, .before = dplyr::everything()
  )
  # Change column names to names harp knows about using psub()
  # We wrap in suppressWarnings as we don't want it to warn about
  # substitutions that don't exist in case not all files contain all the
  # expected column names.
  colnames(obs_data) <- suppressWarnings(psub(
    colnames(obs_data),
    c("stationId", "stationElevation", "TA", "FF10", "RR1", "RH2", "MSLP"),
    c("SID", "elev", "T2m", "S10m", "AccPcp1h", "RH2m", "Pmsl")
  ))
  # Only return parameters that have been asked for
  if (length(parameter) > 0 && !all(is.na(parameter))) {
    parameter <- stats::na.omit(parameter)
    obs_data <- dplyr::select(
      obs_data,
      dplyr::any_of(c("valid_dttm", "SID", "elev", parameter))
    )
  }
  # Make a data frame for parameter units
  obs_units <- tibble::tribble(
    ~parameter, ~accum_hours, ~units,
    "T2m"     , 0           , "degC",
    "S10m"    , 0           , "m/s",
    "AccPcp1h", 1           , "kg/m^2",
    "RH2m"    , 0           , "fraction",
    "Pmsl"    , 0           , "hPa"
  )
  # Filter the obs_units data frame to only those parameters that are in the
  # data
  obs_units <- dplyr::filter(obs_units, .data$parameter %in% colnames(obs_data))
  # return the data as a named list
  list(synop = obs_data, synop_params = obs_units)
}
We can now test the read_csv_obs() function on its own
or called by read_obs_file() by setting
file_format = "csv_obs".
read_csv_obs(file.path(tempdir(), "obs_2023112812.csv"), 2023112812)
#> $synop
#> valid_dttm SID elev T2m S10m AccPcp1h RH2m Pmsl
#> 1 2023112812 1 35 17.4 10.2 0.0 0.63 984
#> 2 2023112812 2 171 6.2 6.6 0.0 0.77 1017
#> 3 2023112812 3 127 32.0 23.8 0.0 0.89 974
#> 4 2023112812 4 147 10.2 17.1 3.4 0.95 1011
#> 5 2023112812 5 105 35.0 3.1 1.9 0.73 1012
#> 6 2023112812 6 52 23.5 21.5 1.0 0.70 1026
#> 7 2023112812 7 315 19.0 29.0 1.9 0.87 1012
#> 8 2023112812 8 78 11.4 16.3 1.2 0.76 1025
#> 9 2023112812 9 338 12.9 16.8 2.0 0.53 1011
#> 10 2023112812 10 49 10.9 16.1 0.0 0.50 1026
#> 11 2023112812 11 23 8.1 0.5 0.0 0.73 1008
#> 12 2023112812 12 139 7.8 7.7 2.5 0.94 978
#> 13 2023112812 13 190 5.1 15.3 1.5 0.68 1028
#> 14 2023112812 14 308 9.2 24.4 1.6 0.90 994
#> 15 2023112812 15 78 5.7 32.8 0.0 0.82 996
#> 16 2023112812 16 322 10.1 28.1 1.7 0.82 990
#> 17 2023112812 17 88 25.9 25.3 1.2 0.94 1006
#> 18 2023112812 18 291 29.6 20.9 5.4 0.58 1003
#> 19 2023112812 19 213 9.6 26.5 0.0 0.95 1025
#> 20 2023112812 20 151 19.5 30.1 7.3 0.78 985
#> 21 2023112812 21 54 26.7 1.6 0.0 0.65 994
#> 22 2023112812 22 205 19.6 33.9 1.1 0.79 1028
#> 23 2023112812 23 12 5.3 25.1 0.0 0.65 1011
#> 24 2023112812 24 115 24.9 24.3 1.1 0.95 997
#> 25 2023112812 25 88 14.6 6.3 0.0 0.85 1014
#> 26 2023112812 26 104 16.9 24.1 0.0 0.91 991
#> 27 2023112812 27 167 11.3 17.7 0.0 0.76 995
#> 28 2023112812 28 90 29.8 9.5 0.0 0.75 976
#> 29 2023112812 29 284 31.1 5.8 0.0 0.85 993
#> 30 2023112812 30 164 23.4 2.9 0.0 0.58 989
#>
#> $synop_params
#> # A tibble: 5 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
#> 3 AccPcp1h 1 kg/m^2
#> 4 RH2m 0 fraction
#> 5 Pmsl 0 hPa
read_obs_file(
  file.path(tempdir(), "obs_2023112812.csv"), NULL,
  dttm = 2023112812, file_format = "csv_obs"
)
#> $synop
#> valid_dttm SID elev T2m S10m AccPcp1h RH2m Pmsl
#> 1 1701172800 1 35 17.4 10.2 0.0 0.63 984
#> 2 1701172800 2 171 6.2 6.6 0.0 0.77 1017
#> 3 1701172800 3 127 32.0 23.8 0.0 0.89 974
#> 4 1701172800 4 147 10.2 17.1 3.4 0.95 1011
#> 5 1701172800 5 105 35.0 3.1 1.9 0.73 1012
#> 6 1701172800 6 52 23.5 21.5 1.0 0.70 1026
#> 7 1701172800 7 315 19.0 29.0 1.9 0.87 1012
#> 8 1701172800 8 78 11.4 16.3 1.2 0.76 1025
#> 9 1701172800 9 338 12.9 16.8 2.0 0.53 1011
#> 10 1701172800 10 49 10.9 16.1 0.0 0.50 1026
#> 11 1701172800 11 23 8.1 0.5 0.0 0.73 1008
#> 12 1701172800 12 139 7.8 7.7 2.5 0.94 978
#> 13 1701172800 13 190 5.1 15.3 1.5 0.68 1028
#> 14 1701172800 14 308 9.2 24.4 1.6 0.90 994
#> 15 1701172800 15 78 5.7 32.8 0.0 0.82 996
#> 16 1701172800 16 322 10.1 28.1 1.7 0.82 990
#> 17 1701172800 17 88 25.9 25.3 1.2 0.94 1006
#> 18 1701172800 18 291 29.6 20.9 5.4 0.58 1003
#> 19 1701172800 19 213 9.6 26.5 0.0 0.95 1025
#> 20 1701172800 20 151 19.5 30.1 7.3 0.78 985
#> 21 1701172800 21 54 26.7 1.6 0.0 0.65 994
#> 22 1701172800 22 205 19.6 33.9 1.1 0.79 1028
#> 23 1701172800 23 12 5.3 25.1 0.0 0.65 1011
#> 24 1701172800 24 115 24.9 24.3 1.1 0.95 997
#> 25 1701172800 25 88 14.6 6.3 0.0 0.85 1014
#> 26 1701172800 26 104 16.9 24.1 0.0 0.91 991
#> 27 1701172800 27 167 11.3 17.7 0.0 0.76 995
#> 28 1701172800 28 90 29.8 9.5 0.0 0.75 976
#> 29 1701172800 29 284 31.1 5.8 0.0 0.85 993
#> 30 1701172800 30 164 23.4 2.9 0.0 0.58 989
#>
#> $synop_params
#> # A tibble: 5 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
#> 3 AccPcp1h 1 kg/m^2
#> 4 RH2m 0 fraction
#> 5 Pmsl 0 hPa
read_obs_file(
  file.path(tempdir(), "obs_2023112812.csv"), c("T2m", "S10m"),
  dttm = 2023112812, file_format = "csv_obs"
)
#> $synop
#> valid_dttm SID elev T2m S10m
#> 1 1701172800 1 35 17.4 10.2
#> 2 1701172800 2 171 6.2 6.6
#> 3 1701172800 3 127 32.0 23.8
#> 4 1701172800 4 147 10.2 17.1
#> 5 1701172800 5 105 35.0 3.1
#> 6 1701172800 6 52 23.5 21.5
#> 7 1701172800 7 315 19.0 29.0
#> 8 1701172800 8 78 11.4 16.3
#> 9 1701172800 9 338 12.9 16.8
#> 10 1701172800 10 49 10.9 16.1
#> 11 1701172800 11 23 8.1 0.5
#> 12 1701172800 12 139 7.8 7.7
#> 13 1701172800 13 190 5.1 15.3
#> 14 1701172800 14 308 9.2 24.4
#> 15 1701172800 15 78 5.7 32.8
#> 16 1701172800 16 322 10.1 28.1
#> 17 1701172800 17 88 25.9 25.3
#> 18 1701172800 18 291 29.6 20.9
#> 19 1701172800 19 213 9.6 26.5
#> 20 1701172800 20 151 19.5 30.1
#> 21 1701172800 21 54 26.7 1.6
#> 22 1701172800 22 205 19.6 33.9
#> 23 1701172800 23 12 5.3 25.1
#> 24 1701172800 24 115 24.9 24.3
#> 25 1701172800 25 88 14.6 6.3
#> 26 1701172800 26 104 16.9 24.1
#> 27 1701172800 27 167 11.3 17.7
#> 28 1701172800 28 90 29.8 9.5
#> 29 1701172800 29 284 31.1 5.8
#> 30 1701172800 30 164 23.4 2.9
#>
#> $synop_params
#> # A tibble: 2 × 3
#> parameter accum_hours units
#> <chr> <dbl> <chr>
#> 1 T2m 0 degC
#> 2 S10m 0 m/s
Note the difference in the valid_dttm column.
read_csv_obs() returns the date-times that were given to
it, whereas read_obs_file() converts the
valid_dttm column to UNIX epoch time format, that is, the
number of seconds since 00:00:00 on 1 January 1970. This is because it is
the most efficient way to store and process date-time data.
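The conversion can be checked with base R alone; a quick sketch (harp performs the equivalent conversion internally):

```r
# 2023112812 corresponds to 2023-11-28 12:00 UTC; its UNIX epoch time
# matches the valid_dttm values returned by read_obs_file() above.
as.numeric(as.POSIXct("2023-11-28 12:00:00", tz = "UTC"))
#> [1] 1701172800
```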
If we want to use the observations throughout harp, it is most efficient
to convert them to SQLite format, which is much quicker to access and
filter to what we need using read_point_obs(). This is done
by calling read_obs() and setting output_format_opts.
read_obs(
  seq_dttm(2023112800, 2023112823),
  NULL,
  file_format        = "csv_obs",
  file_path          = tempdir(),
  file_template      = "obs_{YYYY}{MM}{DD}{HH}.csv",
  output_format_opts = obstable_opts(file.path(tempdir(), "OBSTABLE"))
)
#>
#> ***
#> New SQLITE obs file created: /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> ***
#> Writing to: SYNOP in /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> No data returned. Set `return_data = TRUE` to return data.
And then read the data back in using read_point_obs().
read_point_obs(
  seq_dttm(2023112803, 2023112819), "T2m",
  obs_path = file.path(tempdir(), "OBSTABLE"), stations = c(4, 6, 9)
)
#> Getting T2m.
#>
#> Reading: /tmp/RtmpsHBatv/OBSTABLE/OBSTABLE_2023.sqlite
#> # A tibble: 51 × 5
#> valid_dttm SID elev T2m units
#> <dttm> <int> <dbl> <dbl> <chr>
#> 1 2023-11-28 03:00:00 4 147 10.2 degC
#> 2 2023-11-28 03:00:00 6 52 23.5 degC
#> 3 2023-11-28 03:00:00 9 338 12.9 degC
#> 4 2023-11-28 04:00:00 4 147 10.2 degC
#> 5 2023-11-28 04:00:00 6 52 23.5 degC
#> 6 2023-11-28 04:00:00 9 338 12.9 degC
#> 7 2023-11-28 05:00:00 4 147 10.2 degC
#> 8 2023-11-28 05:00:00 6 52 23.5 degC
#> 9 2023-11-28 05:00:00 9 338 12.9 degC
#> 10 2023-11-28 06:00:00 4 147 10.2 degC
#> # ℹ 41 more rows