One of the most important capabilities of harpIO is reading data from
NWP and climate models. The original solution to this task was to have
separate functions for different, but related tasks.
read_det_interpolate()
and
read_eps_interpolate()
to read data from multiple files for
deterministic and ensemble forecasts respectively and to interpolate to
point locations, and read_grid()
to read gridded data from
a single file. This solutions means the user having to learn several
functions, each with their own arguments and output structures. There is
also a lack of flexibility in, for example, reading from both
deterministic and ensemble forecast files at the same time.
We have attempted to streamline this process and to make it as
generic as possible with the read_forecast()
function,
which we hope results in a more strightforward user interface. The
function is designed to both read the forecast data from multiple files
and, where required, transform the data by interpolation or regridding.
This means that the output is always a class harp_fcst list of data
frames whether the data are at points, a 2d grid or a vertical cross
section.
Furthermore, many arguments as passed as lists generated by
*_opts()
functions. This means that you can use tab
autocompletion to get the correct argument names, and can store various
options combinations as variables.
Example data
For all of these examples we will use data provided in the harpData package, which, if you haven’t done so already, you can install with:
remotes::install_github("harphub/harpData")
Deterministic forecasts
The simplest application of read_forecast()
is to read
deterministic forecasts. At a minimum, the start date, end date, a name
for the forecast model (fcst_model
) and the parameter to be
read in are needed as arguments. The data we will use are grib files
from the AROME-Arctic model that come with harpData. By default the
function looks in the current working directory for the files that have
the “vfld” template (see show_file_templates()
). Therefore,
we need to also pass the file_path
and
file_template
arguments. Finally, the default behaviour of
read_forecast()
is to not return any data - this is because
one of its main roles is to write the data to output files in another
format and the volumes of data involved could easily become too large to
hold in memory - so we should also set
return_data = TRUE
.
#> Loading required package: harpCore
#>
#> Attaching package: 'harpCore'
#> The following object is masked from 'package:stats':
#>
#> filter
read_forecast(
2018071000,
"AROME_Arctic",
"T2m",
file_path = system.file("grib/AROME_Arctic", package = "harpData"),
file_template = "harmonie_grib_fp",
return_data = TRUE
)
#> ::deterministic gridded forecast:: # A tibble: 17 × 11
#> fcst_model fcst_dttm lead_time parameter valid_dttm
#> <chr> <dttm> <dbl> <chr> <dttm>
#> 1 AROME_Arctic 2018-07-10 00:00:00 0 T2m 2018-07-10 00:00:00
#> 2 AROME_Arctic 2018-07-10 00:00:00 3 T2m 2018-07-10 03:00:00
#> 3 AROME_Arctic 2018-07-10 00:00:00 6 T2m 2018-07-10 06:00:00
#> 4 AROME_Arctic 2018-07-10 00:00:00 9 T2m 2018-07-10 09:00:00
#> 5 AROME_Arctic 2018-07-10 00:00:00 12 T2m 2018-07-10 12:00:00
#> 6 AROME_Arctic 2018-07-10 00:00:00 15 T2m 2018-07-10 15:00:00
#> 7 AROME_Arctic 2018-07-10 00:00:00 18 T2m 2018-07-10 18:00:00
#> 8 AROME_Arctic 2018-07-10 00:00:00 21 T2m 2018-07-10 21:00:00
#> 9 AROME_Arctic 2018-07-10 00:00:00 24 T2m 2018-07-11 00:00:00
#> 10 AROME_Arctic 2018-07-10 00:00:00 27 T2m 2018-07-11 03:00:00
#> # ℹ 7 more rows
#> # ℹ 6 more variables: step_range <chr>, level_type <chr>, level <int>,
#> # units <chr>, fcst_cycle <chr>, fcst <geolist>
The last column in the data frame, named AROME_Arctic_det is the forecast data, which are stored as geofield objects - these are simply 2d arrays with attributes that describe the projection, time and parameter information of the data. For all deterministic models the name of the column is the forecast model name with a suffix “_det”.
Ensemble forecasts
Reading ensemble forecasts works in exactly the same way. harpData
includes output from the AROME-Arctic ensemble model, AAEPS, for the
same dates as before, but this time stored in a single netcdf file.
Since the data are in a netcdf file, and netcdf files can use many
different dimension names and variable names,
read_forecast()
needs some information about the structure
of the netcdf file. This structure information is generated with the
function netcdf_opts()
, and since the data come from an EPS
model produced by MET Norway we can use one of the built it options
sets: netcdf_opts(options_set = "met_norway_eps")
. All lead
times and ensemble members are in the same file, so if you want to read
all members from the file you don’t have to pass the members as an
argument.
read_forecast(
2018071000,
"AAEPS",
"T2m",
file_path = system.file("netcdf/AAEPS", package = "harpData"),
file_template = "fc{YYYY}{MM}{DD}{HH}.nc",
file_format_opts = netcdf_opts(options_set = "met_norway_eps"),
return_data = TRUE
)
#> ::ensemble gridded forecast:: # A tibble: 5 × 14
#> fcst_model fcst_dttm lead_time parameter valid_dttm
#> <chr> <dttm> <dbl> <chr> <dttm>
#> 1 AAEPS 2018-07-10 00:00:00 0 T2m 2018-07-10 00:00:00
#> 2 AAEPS 2018-07-10 00:00:00 3 T2m 2018-07-10 03:00:00
#> 3 AAEPS 2018-07-10 00:00:00 6 T2m 2018-07-10 06:00:00
#> 4 AAEPS 2018-07-10 00:00:00 9 T2m 2018-07-10 09:00:00
#> 5 AAEPS 2018-07-10 00:00:00 12 T2m 2018-07-10 12:00:00
#> # ℹ 9 more variables: level_type <chr>, level <dbl>, units <chr>,
#> # fcst_cycle <chr>, AAEPS_mbr000 <geolist>, AAEPS_mbr001 <geolist>,
#> # AAEPS_mbr002 <geolist>, AAEPS_mbr003 <geolist>, AAEPS_mbr004 <geolist>
You will see that there is a warning about missing lead times. This is because the default behaviour is to ask for lead times 0 - 48 hours, every three hours. The forecast data are in columns with the name of the forecast model with a suffix “_mbrXXX” where XXX is the member number. We could also ask for a limited set of lead times and ensemble members:
read_forecast(
2018071000,
"AAEPS",
"T2m",
lead_time = seq(0, 6, 3),
members = c(0, 2, 4),
file_path = system.file("netcdf/AAEPS", package = "harpData"),
file_template = "fc{YYYY}{MM}{DD}{HH}.nc",
file_format_opts = netcdf_opts(options_set = "met_norway_eps"),
return_data = TRUE
)
#> ::ensemble gridded forecast:: # A tibble: 3 × 12
#> fcst_model fcst_dttm lead_time parameter valid_dttm
#> <chr> <dttm> <dbl> <chr> <dttm>
#> 1 AAEPS 2018-07-10 00:00:00 0 T2m 2018-07-10 00:00:00
#> 2 AAEPS 2018-07-10 00:00:00 3 T2m 2018-07-10 03:00:00
#> 3 AAEPS 2018-07-10 00:00:00 6 T2m 2018-07-10 06:00:00
#> # ℹ 7 more variables: level_type <chr>, level <dbl>, units <chr>,
#> # fcst_cycle <chr>, AAEPS_mbr000 <geolist>, AAEPS_mbr002 <geolist>,
#> # AAEPS_mbr004 <geolist>
Lagged ensembles
For lagged ensembles, data are available in harpData for the CMEPS
model in vfld
format. vfld
files are text
files with gridded data already interpolated to points. The CMEPS
ensemble is quite complex with a number of members available each hour,
but tied to a control member every three hours. harpData has members 0,
1, 3, 4, 5 and 6 where members 0 and 1 are run at hours 0, 3, 6, …, 21;
members 3 and 4 at hours 1, 4, 7, …, 22; and members 5 and 6 at hours 2,
5, 8, …, 23. This means that members 3 and 4 are two hours behind
members 0 and 1, and members 5 and 6 are one hour behind members 0 and
1. The lags
argument is used to pass the lagging
information to read_forecast
and the lags aer assumed to be
in the same order as the members
argument. Note also that
the dates for CMEPS data are different to the ones that have been used
before.
read_forecast(
seq_dttm(2019021700, 2019021718, "6h"),
"CMEPS_prod",
"T2m",
lead_time = seq(0, 12, 3),
members = c(0, 1, 3, 4, 5, 6),
lags = c(0, 0, 2, 2, 1, 1),
file_path = system.file("vfld", package = "harpData"),
file_template = "vfld_eps",
return_data = TRUE
)
#> ::ensemble point forecast:: # A tibble: 22,980 × 16
#> fcst_model fcst_dttm lead_time parameter SID lat lon units
#> <chr> <dttm> <dbl> <chr> <int> <dbl> <dbl> <chr>
#> 1 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1001 70.9 -8.67 K
#> 2 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1010 69.3 16.1 K
#> 3 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1014 69.2 17.9 K
#> 4 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1015 69.6 17.8 K
#> 5 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1018 69.2 16.0 K
#> 6 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1023 69.1 18.5 K
#> 7 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1025 69.7 18.9 K
#> 8 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1026 69.7 18.9 K
#> 9 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1027 69.7 18.9 K
#> 10 CMEPS_prod 2019-02-17 00:00:00 0 T2m 1033 70.2 19.5 K
#> # ℹ 22,970 more rows
#> # ℹ 8 more variables: valid_dttm <dttm>, fcst_cycle <chr>,
#> # CMEPS_prod_mbr000 <dbl>, CMEPS_prod_mbr001 <dbl>,
#> # CMEPS_prod_mbr003_lag2h <dbl>, CMEPS_prod_mbr004_lag2h <dbl>,
#> # CMEPS_prod_mbr005_lag1h <dbl>, CMEPS_prod_mbr006_lag1h <dbl>
You will see that a number of members have NA - this is because
certain combinations of forecast time and lead time do not exist for
those members. When data are in read in from SQLite files using
read_point_forecast()
this information is adjusted to take
care of the lagged members. When working with point data, the most
efficient workflow is to use read_forecast()
to write the
data to files and then read from the files.