Creating an R Data Package

Author

TheCoatlessProfessor

Published

January 17, 2016

Intro

The previous entry discussed CRAN’s R package policy, specifically the size limit on R data packages. This post addresses the best way to create a data package that can be distributed via CRAN. To do so, we reflect upon the different methods used to construct data packages on CRAN. The next entry deals with constructing an external repository when the size of the package is considerably over the 5 MB limit imposed by CRAN.

The Advent of the Data Package

As was shown in the previous entry, many packages listed on CRAN are below the threshold of 5 MB. Therefore, datasets referenced within these packages are either very small or are hosted in an external package. That is to say, authors create a stand-alone data package, which is then uploaded alongside the original package to CRAN. This jibes with the recommendation that:

Where a large amount of data is required (even after compression), consideration should be given to a separate data-only package which can be updated only rarely (since older versions of packages are archived in perpetuity).

As a result, there is a considerable number of data packages on CRAN. Some examples of data-specific packages are:

faraway, HistData, HSAUR, Ecdat, learnEDA, ElemStatLearn, abc.data, wpp2012, astrodatR, ocedata, TH.data, USAboundaries, mapdata, hglm.data, MM2Sdata, synbreedData, usdanutrients, babynames, fueleconomy, nasaweather, nycflights13

Note that some of these packages are associated with textbooks, while others are complementary packages that tie into a package containing only statistical methodology.

Moreover, there are some R package developers that prefer to forgo bundling data within a package. Instead, the developers prefer that the data is downloaded into the working directory or a temporary directory at a later time via download.file(). Examples of the remote download are: raster, dataRetrieval, and FedData.
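As a rough illustration of the download-at-a-later-time approach, here is a minimal sketch. The helper name fetch_remote_data() is hypothetical and is not taken from raster, dataRetrieval, or FedData. To keep the sketch runnable offline, it is exercised with a local file:// URL, which assumes a Unix-like path; a real package would point at an http(s) address.

```r
# Hypothetical helper: download a data file once into a directory
# (tempdir() by default) and reuse the cached copy afterwards.
fetch_remote_data = function(url, destdir = tempdir()) {
  dest = file.path(destdir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files (e.g. .rda) intact on Windows
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Stand-in for a remote file: a small CSV written to a temporary folder.
src_dir = file.path(tempdir(), "remote-source")
dir.create(src_dir, showWarnings = FALSE)
src = file.path(src_dir, "example.csv")
write.csv(head(iris, 4), src, row.names = FALSE)

# "Download" it into a separate cache folder and read it back in.
cache_dir = file.path(tempdir(), "cache")
dir.create(cache_dir, showWarnings = FALSE)
local_copy = fetch_remote_data(paste0("file://", src), destdir = cache_dir)
example = read.csv(local_copy)
```

Checking file.exists() before downloading means repeated calls within a session do not hit the network again.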

Constructing a Data Package

With that being said, let’s talk about how to construct an efficient data package. We’ll talk about how the package’s files should be structured, documenting data sets, storage and compression of data, and LazyLoad. The discussion of these points will revolve around imudata, an R data package that contains observations from an Inertial Measurement Unit (IMU).

File Structure

The file structure of a data package is typically like so:

- datapackage
  |- data
     |- datalist
     |- example.rda
  |- man
     |- example.Rd
     |- datapackage.Rd
  |- DESCRIPTION
  |- NAMESPACE
  |- LICENSE

That is:

  • Data is placed into the /data directory.
  • Help documentation for the data set (example.Rd) and package (datapackage.Rd) is placed into /man.
  • The traditional DESCRIPTION, NAMESPACE, and LICENSE files.

The file structure of a data package is reminiscent of a regular R package. The main difference is that a data-specific package ships no R code. Furthermore, there is a /data directory that is filled with data sets. Within the /data directory, there is also a file called datalist. This file lists the names of the data sets shipped with the package. It is meant to speed up installation, as creating this list can be slow for packages with huge datasets. You can automatically generate this list using:

tools::add_datalist("~/path/to/pkg/source", force = TRUE)

In the case of the imudata package, the contents of the datalist file generated are:

cont.imu1
imu6
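To make the datalist format concrete, the following sketch writes such a file by hand for a throwaway /data directory. It is not how tools::add_datalist() is implemented; the directory, file, and object names are made up for illustration. A line is just the data set name when the file provides an object of the same name, and "name: obj1 obj2" when it provides differently named objects.

```r
# Build a throwaway /data directory with two .rda files.
data_dir = file.path(tempdir(), "datapackage", "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

example = head(iris, 4)
save(example, file = file.path(data_dir, "example.rda"))

# A file whose objects are named differently from the file itself
# (used here only to show the second line format).
first = 1:3
second = letters[1:3]
save(first, second, file = file.path(data_dir, "bundle.rda"))

# Work out which objects a file provides and format its datalist line.
datalist_entry = function(path) {
  env = new.env()
  objects = sort(load(path, envir = env))
  file_name = tools::file_path_sans_ext(basename(path))
  if (identical(objects, file_name)) {
    file_name
  } else {
    paste0(file_name, ": ", paste(objects, collapse = " "))
  }
}

rda_files = sort(list.files(data_dir, pattern = "\\.rda$", full.names = TRUE))
entries = vapply(rda_files, datalist_entry, character(1), USE.NAMES = FALSE)
writeLines(entries, file.path(data_dir, "datalist"))
```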

However, note that sometimes it is advantageous to include a /R directory. The imudata package has a /R directory that is used to house roxygen2 markup. The roxygen2 markup is preferable to manually writing out the .Rd files. That is, one can generate the .Rd documentation found under the /man directory using #' @tag markup. Thus, the structure of imudata is:

- imudata
  |- data
     |- cont.imu1.rda
     |- datalist
     |- imu6.rda
  |- man
     |- cont.imu1.Rd
     |- imu6.Rd
     |- imudata.Rd
  |- R
     |- cont.imu1.R
     |- imu6.R
     |- imudata.R
  |- DESCRIPTION
  |- NAMESPACE
  |- LICENSE

Constructing a data package in this way provides the following:

  1. The data is exportable to other packages (hence, stand-alone).
  2. Documentation is available for the user to peruse.
  3. The user is able to easily load the data via data("dataset").

Note: Do not construct a standalone data package if functions within the statistical methods package need to use data that has been pre-computed (e.g. T-tables, Z-tables, F-tables, color tables, et cetera). In these cases, the data should be shipped with the methods package itself and stored in /R/sysdata.rda. This data is not exported, nor does it need any documentation. To construct it, see the “Saving data with save()” section.

Documenting Data Sets with roxygen2

Let’s take a look at how to document a data set using roxygen2. This example is from imudata/R/imu6.R:

#' @title IMU Data from an XSens MTi-G sensor
#' @description This data set contains gyroscope and accelerometer data from an XSens MTi-G sensor.
#' @format A \code{matrix} with 873,684 observations and 6 columns. The columns are defined as follows:
#' \describe{
#'  \item{\code{Gyro. X}}{Gyroscope X-Axis}
#'  \item{\code{Gyro. Y}}{Gyroscope Y-Axis}
#'  \item{\code{Gyro. Z}}{Gyroscope Z-Axis}
#'  \item{\code{Accel. X}}{Accelerometer X-Axis}
#'  \item{\code{Accel. Y}}{Accelerometer Y-Axis}
#'  \item{\code{Accel. Z}}{Accelerometer Z-Axis}
#' }
#' @source Geodetic Engineering Laboratory (TOPO), Swiss Federal Institute of Technology Lausanne (EPFL)
#' @author JJB
#' @examples
#' \dontrun{
#' data(imu6)
#' summary(imu6)
#' }
"imu6"

Documenting with this approach requires the name of the data set to be specified as a string underneath the roxygen2 markup. Furthermore, the version of roxygen2 used must be greater than 4.0. (As of this writing, it is on version 5.0.1.)

Alternatively, if you are averse to having one .R file per data set, another acceptable practice is to create one large .R file called data.R that contains all of the data documentation.

To generate the .Rd files run:

# install.packages("devtools")

devtools::document(roclets=c('rd', 'collate', 'namespace'))

Saving Data

Data found within /data is allowed to come in a lot of different forms. Specifically, there are:

  • Plain R code (.R)
  • Delimited text files written by write.* (.csv)
  • Binary R files written by save() (.rda)

Some of these forms are considerably better than others. In particular, storing data as a .csv allows fewer observations to be included in a package than storing it as a .rda.

For the most part, data is expected to be saved within /data. To access the file path of this folder within an installed package (here, imudata), use:

data.path = system.file("data", package = "imudata")
data.path
# F:/Program Files/R/R-3.2.3/library/imudata/data
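If you want to try the lookup without installing a data package first, here is a small sketch using the datasets package, which ships with R. The package name notARealPackage is deliberately made up to show what happens when nothing is found.

```r
# "datasets" ships with every R installation, so it is safe to inspect.
data.path = system.file("data", package = "datasets")
data.path

# Files stored in the installed package's data directory.
list.files(data.path)

# A package that is not installed (hypothetical name) yields an empty
# string rather than an error, so check the result before using it.
missing.path = system.file("data", package = "notARealPackage")
nzchar(missing.path)
```

Testing the result with nzchar() is a cheap guard before building file paths from it.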

Saving Data with Plain R Code

To save data in plain R code, one creates a data dump of the R object (e.g. a data.frame, matrix, list, vector, et cetera) via:

dput(dataset)

This creates an R command to generate a structure() with the object’s information. Take, for example, its usage on the first 4 observations of the iris data set; the call is followed by the contents it writes to data/iris4.R:

dput(head(iris,4), file="data/iris4.R")
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
4L), class = "data.frame")

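To load this into R:

# Load data set. data() source()s the .R file, so this only works if the
# file assigns the object, e.g. iris4 <- structure(...)
data("iris4")

# Source the R code in; the evaluated object is returned in $value
iris4 = source(file.path(data.path,"iris4.R"))$value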

This approach is not at all ideal since R must then reassemble the object. From personal experience, I’ve seen instances where old PCs took about 30 minutes to load the spam data set from structure form into R, whereas the same machine could load it from a .rda in seconds.

Using plain R code should be reserved to generating reproducible examples that communicate a data set which is hard to replicate with a few R commands. It should not be used to store data for an R package.
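For the reproducible-example use case, a related sketch uses dump() instead of dput(). The difference is that dump() writes the assignment along with the structure() call, so sourcing the file re-creates the object under its original name. The temporary file and the restored environment below are just for illustration.

```r
# A small object standing in for a real data set.
iris4 = head(iris, 4)

# dump() writes "iris4 <- structure(...)", so the assignment is part of the file.
r_file = file.path(tempdir(), "iris4.R")
dump("iris4", file = r_file)

# Re-create the object in a fresh environment by sourcing the file.
restored = new.env()
source(r_file, local = restored)

# Compare the restored copy with the original.
all.equal(iris4, restored$iris4)
```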

Saving Data with write.*

R offers two options for writing files: write.table and write.csv. The latter is just a version of the former with certain parameters preset. By default, write.table will use a space as the separator for data whereas write.csv will use a comma (,) as the separator. (write.csv2 is customized for European usage where the , represents the decimal point and the separator between fields is then a semicolon ;)

R objects can be written to .csvs using:

write.table(dataset, file="data/dataset.csv", sep = ",", row.names = FALSE)
write.csv(dataset, file="data/dataset.csv", row.names = FALSE)

Files can be read from the R package into R using:

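# Handles read. Note: data() reads .csv files with sep = ";",
# so it expects semicolon-separated fields
data("dataset")

# Custom read
read.table(file.path(data.path,"dataset.csv"), header = TRUE, sep = ",")
read.csv(file.path(data.path,"dataset.csv"))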

This format is helpful since humans can easily inspect the contents of the file by opening it in Excel, Numbers, LibreOffice Calc, or a text editor. The downside is that the file is larger than one would like, since every value is stored as text. Thus, the total number of observations possible to ship is lower.
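To get a feel for the size difference, here is a minimal sketch that writes the same data frame as a .csv and as a .rda and compares the file sizes. The data is simulated, so the exact ratio is only indicative and will differ for real data sets.

```r
# Simulated data standing in for real measurements (assumption: numeric columns).
set.seed(1)
dataset = data.frame(x = rnorm(10000), y = rnorm(10000))

csv_file = file.path(tempdir(), "dataset.csv")
rda_file = file.path(tempdir(), "dataset.rda")

write.csv(dataset, file = csv_file, row.names = FALSE)
save(dataset, file = rda_file)

# On-disk size in bytes for each format.
sizes = file.size(c(csv_file, rda_file))
names(sizes) = c("csv", "rda")
sizes
```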

Saving data with save()

The save() function offers the ability to take an R object and represent it in binary format that a computer can read but not a human. R Binary Files, .rda, are used to store the R object in its exact state.

To save an R object* in this manner use:

save(dataset, file="data/dataset.rda")

Users can then subsequently load the data using:

data("dataset")

# Attach the objects stored in the .rda file to the search path
attach(file.path(data.path,"dataset.rda"))

# Custom load
load(file.path(data.path,"dataset.rda"))

There are considerable benefits to saving using a .rda file. First, the file is smaller than its plain-text equivalent. Second, the binary representation can be compressed even further. Overall, this is the preferred method to save and ship data.

* Note: save() is able to accommodate more than 1 data set. This is particularly useful if you are shipping pre-computed data for a function to use within a package via R/sysdata.rda. e.g.

save(ds1, ds2, file="R/sysdata.rda")

Otherwise, do not save more than 1 data set into a .rda file!
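As a sketch of the sysdata.rda case, the following saves two made-up lookup tables into one file and loads them back. The table names and contents are hypothetical, and a temporary file stands in for R/sysdata.rda.

```r
# Two hypothetical lookup tables a package's functions might need internally.
z_table = data.frame(p = c(0.90, 0.95, 0.99),
                     z = qnorm(c(0.90, 0.95, 0.99)))
t_table = data.frame(df = 1:5, t = qt(0.975, df = 1:5))

# In a real package the target would be R/sysdata.rda; a temp file is used here.
sysdata_file = file.path(tempdir(), "sysdata.rda")
save(z_table, t_table, file = sysdata_file)

# load() returns the names of the objects it restored.
internal = new.env()
loaded = load(sysdata_file, envir = internal)
loaded
```

Loading into a separate environment, as above, is a handy way to inspect what a .rda file contains without overwriting objects in your workspace.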

Compressing Data

To maximize the number of observations in a data package, the data needs to be compressed. Compression is done through an algorithm that builds a dictionary of commonly occurring patterns. The file shrinks because each long, repeated pattern is replaced by a much shorter reference to its dictionary entry.

The save() function offers three different types of compression: "gzip", "bzip2" or "xz". By default, .rda files are created using gzip compression. Different compression algorithms may yield better results. To figure out the current compression for data sets in /data use:

tools::checkRdaFiles("~/path/to/package/data")

The output for imudata is:

                                         size ASCII compress version
F:/GitHub/imudata/data/cont.imu1.rda  3809716 FALSE       xz       2
F:/GitHub/imudata/data/imu6.rda      10696124 FALSE       xz       2

The size here is given in bytes.

To automatically resave the data using the best compression possible, use:

tools::resaveRdaFiles("~/path/to/package/data", compress="auto")

However, I often find that this does not work as well as I had hoped. After resaving, the output is:

                                         size ASCII compress version
F:/GitHub/imudata/data/cont.imu1.rda  3832934 FALSE     gzip       2
F:/GitHub/imudata/data/imu6.rda      10663784 FALSE       xz       2

Note that the gzip version of cont.imu1 (3,832,934 bytes) is larger than the earlier xz version (3,809,716 bytes). The xz-compressed imu6 data set, however, came out slightly smaller than under the previous xz compression (10,663,784 vs. 10,696,124 bytes).

I personally prefer using the xz compression algorithm. Hence:

tools::resaveRdaFiles("~/path/to/package/data", compress="xz")

Overall, compression is probably the step that does the most to determine whether the data can be included within a package or whether you are forced to seek out an alternative route.
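To compare the three algorithms on your own data before committing to one, a sketch along these lines may help. The data set here is simulated and the output directory is a temporary folder, so which algorithm wins is only indicative; real data may favor a different one.

```r
# Simulated data (assumption); real data sets may favor a different algorithm.
set.seed(1)
dataset = data.frame(id = 1:50000,
                     group = sample(letters, 50000, replace = TRUE))

out_dir = file.path(tempdir(), "compression-demo")
dir.create(out_dir, showWarnings = FALSE)

# Save the same object once per algorithm supported by save().
for (method in c("gzip", "bzip2", "xz")) {
  save(dataset, file = file.path(out_dir, paste0(method, ".rda")),
       compress = method)
}

# checkRdaFiles() reports size (bytes), ASCII flag, compression type,
# and format version for every .rda file in the directory.
info = tools::checkRdaFiles(out_dir)
info[order(info$size), ]
```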

LazyData load

LazyData means that datasets will be lazily loaded. More specifically, the data sets will not use any memory until they are first used, and the user can refer to them by name without first calling data("dataset"). Thus, it is ideal to include a line in the package’s DESCRIPTION file containing:

LazyData: true
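For context, here is a sketch of a bare-bones DESCRIPTION containing that line, written and read back from R with write.dcf() and read.dcf(). Every field value other than the LazyData line is a hypothetical placeholder, and a real DESCRIPTION needs more fields (author, license, and so on) to pass R CMD check.

```r
# Hypothetical field values; only the LazyData line comes from the text above.
desc = data.frame(Package = "datapackage",
                  Type = "Package",
                  Title = "Example Data Package",
                  Version = "0.1.0",
                  LazyData = "true",
                  stringsAsFactors = FALSE)

desc_file = file.path(tempdir(), "DESCRIPTION")
write.dcf(desc, file = desc_file)

# Print the file, then read the LazyData field back to confirm it is set.
cat(readLines(desc_file), sep = "\n")
lazy = read.dcf(desc_file, fields = "LazyData")[1, "LazyData"]
```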