R Data Packages in External Data Repositories using the Additional_repositories field

Author

TheCoatlessProfessor

Published

January 17, 2016

Intro

In the prior series entry on data packages, there was a discussion about how to create an R data package. In this final entry in the series, the goal is to address the unthinkable: a data package rejected from CRAN. Rejected data packages are particularly problematic as they show up as a missing dependency of the statistical methodology package under R CMD check. Fear not though, one can still use the data package that was rejected by CRAN with R’s convenient installation command. However, a few things do change… One of them is that you need to create your own version of CRAN!

Finding a Distribution Method

There are three scenarios that I’ve encountered: the good, the bad, and the ugly.

In the good scenario, upon saving the data set as a .rda and applying some form of compression, the data package is under the 5 MB limit. In the bad scenario, the data package was permitted onto CRAN after a petition was put forward to receive an exception to the size constraint. In the ugly scenario, you have to set up your own CRAN-esque repository to hold your package.
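To see which scenario applies, the compressed size can be checked before building the package. The snippet below is a minimal sketch: the data frame is a made-up stand-in for the real data, and compress = "xz" is just one of the compression options that save() accepts.

```r
# Hypothetical data set standing in for the real data
dataset <- data.frame(x = rnorm(1e4), y = rnorm(1e4))

# Save as .rda with xz compression (an assumption; gzip and bzip2 also work)
rda_path <- file.path(tempdir(), "dataset.rda")
save(dataset, file = rda_path, compress = "xz")

# Size in megabytes, to compare against the 5 MB limit
size_mb <- file.size(rda_path) / 1024^2
size_mb < 5
```

If the result is FALSE even with the strongest compression, the bad or the ugly scenario is likely what lies ahead.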

The good news is that for all of these methods, we are able to use the traditional:

install.packages("package.name")

Creating an External Data Repository

To create an external repository, you need access to a web server that is able to host and serve files to the general public. By default, everyone on GitHub has this capability. So, I’ll present the web server version first, followed by the GitHub version.

If possible, do try to obtain a web server. The benefits of using your own web server are:

  1. Web traffic statistics regarding package downloads akin to metacran’s usage of RStudio’s CRAN logs.
  2. You do not have to worry about a 1 GB repository limit.
  3. You can use an FTP client instead of git to update your packages.

The downsides:

  1. Paying for server and/or bandwidth.
  2. Managing the server.

External R Package Repository File Structure (Web Server)

The file structure for an external R package repository is as follows:

- www (e.g. /)
  |- bin
     |- windows
        |- contrib
           |- 3.2
              |- datapackage_1.0.0.zip
              |- PACKAGES
              |- PACKAGES.gz
     |- macosx
        |- contrib
           |- 3.2
              |- datapackage_1.0.0.tgz
              |- PACKAGES
              |- PACKAGES.gz
  |- src
     |- contrib
        |- datapackage_1.0.0.tar.gz
        |- PACKAGES
        |- PACKAGES.gz

So, the structure has*:

  • Any package sources (e.g. .tar.gz) in /src/contrib/
  • Any package binaries (e.g. windows: .zip, macosx: .tgz) in /bin/<os>/contrib/3.2/

* Note: It is only advisable to have a /bin directory if the package involves compilation. Otherwise, it is better to serve packages as sources, since a source package is smaller than its binary counterpart. This is especially the case with data packages.
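The layout above can be created from within R. This is only a sketch: the folder name datarepo and its location under tempdir() are placeholders for wherever the repository is staged before being uploaded to the web server.

```r
# Hypothetical local folder that is later uploaded to the web server root
repo_root <- file.path(tempdir(), "datarepo")

# Source packages always live in src/contrib
dir.create(file.path(repo_root, "src", "contrib"),
           recursive = TRUE, showWarnings = FALSE)

# Binary folders are only needed if binaries are served (see the note above)
r_version <- "3.2"
for (os in c("windows", "macosx")) {
  dir.create(file.path(repo_root, "bin", os, "contrib", r_version),
             recursive = TRUE, showWarnings = FALSE)
}

# Inspect the resulting directory tree
list.dirs(repo_root, full.names = FALSE)
```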

The PACKAGES and PACKAGES.gz files can be generated automatically using:

# Generate Source information
tools::write_PACKAGES("~/path/to/src/contrib/")

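# Generate Windows binary information
# (type is needed on non-Windows machines, where the default is "source")
tools::write_PACKAGES("~/path/to/bin/windows/contrib/3.2", type = "win.binary")

# Generate OS X binary information
tools::write_PACKAGES("~/path/to/bin/macosx/contrib/3.2", type = "mac.binary")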

There are some slight differences between the src and the bin PACKAGES files. As an example, let’s look at the imudata package in the SMAC-Group data repository:

The PACKAGES file in bin

Package: imudata
Version: 1.0.0
Depends: R (>= 3.2)
Suggests: gmwm
License: file LICENSE

The PACKAGES file in src

Package: imudata
Version: 1.0.0
Depends: R (>= 3.2)
Suggests: gmwm
License: file LICENSE
MD5sum: 88c542836a5d27f95085bb4e689ef736
NeedsCompilation: no

The primary difference between src and bin is that the src file has an MD5 hash and a field indicating whether compilation is necessary.
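Since PACKAGES files are in DCF format, the comparison can also be done programmatically with read.dcf(). The sketch below writes the two listings shown above to temporary files (the file names are arbitrary) and reports the fields found only in the src version.

```r
# Fields shared by both PACKAGES files shown above
common <- c("Package: imudata",
            "Version: 1.0.0",
            "Depends: R (>= 3.2)",
            "Suggests: gmwm",
            "License: file LICENSE")

# Write the bin and src variants to temporary files
bin_file <- tempfile("PACKAGES_bin")
src_file <- tempfile("PACKAGES_src")
writeLines(common, bin_file)
writeLines(c(common,
             "MD5sum: 88c542836a5d27f95085bb4e689ef736",
             "NeedsCompilation: no"),
           src_file)

# read.dcf() returns a character matrix with one column per field
bin <- read.dcf(bin_file)
src <- read.dcf(src_file)

# Fields present only in the src file
setdiff(colnames(src), colnames(bin))
# → "MD5sum" "NeedsCompilation"
```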

Redux External R Package Repository File Structure (GitHub)

Dirk Eddelbuettel wrote a really nice package called drat that automates a majority of this process. The package primarily sets up the git repository and helps create the structure for src and bin files.

Below is a script that uses drat.


# Install drat and git2r
# install.packages(c("drat","git2r"))

## Create an empty repository
drat::initRepo(name = "datarepo", basepath = "~/GitHub/")

# Store the basepath + name
options(dratRepo = "~/GitHub/datarepo")
# Using this will disable the need to enter "repoDir"
# I've left it in just to illustrate...

# Add a package to it
drat::insertPackage("~/path/to/pkg/src/datapkg_1.0.0.tar.gz",  # Path to src
              repodir = "~/GitHub/datarepo",                   # Location of git repo: not needed if dratRepo is set
              action = "prune",                                # Remove old package version
              commit = TRUE)                                   # Commit to repo

# (Optional) Remove old packages from repo at a later time
# drat::pruneRepo()

## Push Repository onto GitHub

# Open repository
repo = git2r::repository("~/GitHub/datarepo")

# Authenticate (a plain-text password is not secure; prefer an SSH key)
cred = git2r::cred_user_pass("username", "password")
  
# Push changes in local repository to GitHub
git2r::push(repo, "origin", "refs/heads/gh-pages", credentials = cred)

# Add the repository to local R session for use with install.packages()
drat::addRepo("datarepo","http://<username>.github.io/datarepo")

Since git2r is sometimes shaky regarding file commits and push events, you may want to use GitHub Desktop or command-line git to push the repository onto GitHub.

Adding an External Repository to a Package

Now that the repository is set up, all you need to do is… update the statistical methodology package’s DESCRIPTION file, add an .onLoad() function, and appropriately attach data as needed.

This is radio nowhere, is there anybody alive out there?

In order for R to be aware of an external package that the current statistical methodology package depends upon, one must modify the statistical methodology package’s DESCRIPTION file so that it contains the following two lines:

Suggests: 
    datapackage
Additional_repositories: http://location-of/datarepo

The Additional_repositories line allows R to verify the existence of a package during the R CMD check process. This does NOT add the repository to your getOption("repos") list. It merely states that the package is available in the ether of the internet. Make sure that the Additional_repositories URL points at the root directory of the repository and NOT at /src/contrib/package_x.y.z.tar.gz.
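The reason the root directory is required becomes clearer when looking at utils::contrib.url(), which is what R uses to turn a repository root into the folder it actually searches. A small illustration, with the same placeholder URL as above:

```r
# Placeholder repository root, as used in the DESCRIPTION example
repo <- "http://location-of/datarepo"

# R appends the contrib path itself, so only the root should be supplied
utils::contrib.url(repo, type = "source")
# → "http://location-of/datarepo/src/contrib"

# With a real URL, the repository index can be inspected directly
# (needs network access, so it is left commented out here):
# available.packages(contriburl = utils::contrib.url(repo, type = "source"))
```

If the URL already ended in /src/contrib, R would append the path a second time and look in a location that does not exist.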

The reason for using the Suggests field instead of Enhances is due to a guideline in the Package Dependencies section of the Writing R Extensions manual. In this case, the data set is used within a check (e.g. in examples or tests) and, thus, should be listed under Suggests. Plus, the guidelines behind Enhances describe packages that the current package merely enhances, such as by providing methods for their classes.

Enabling the External Repository for use with install.packages()

To be able to use install.packages(), the repos option in options() must be updated so that R knows to look within that repository. To do so, I advocate creating an .onLoad() function to modify the list of repositories when the package is loaded. This is accomplished with:

.onLoad <- function(libname, pkgname) {
  repos = getOption("repos")
  repos["<NAME_REPO>"] = "http://location-of/datarepo"
  options(repos = repos)
  invisible(repos)
}

Therefore, when the package is loaded, all one needs to do is:

install.packages("datapackage")

And voilà! The data package is installed. Note that the data package is NOT installed automatically along with the other dependencies. Users or the package developer must initiate the install.
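For a one-off install, the repository can also be passed straight to install.packages() without touching options(). The helper below is a hypothetical sketch, not part of any package: with_data_repo(), the name datarepo, and the URL are all placeholders. Unlike the .onLoad() version, it leaves an existing entry with the same name alone.

```r
# Hypothetical helper: returns the current repos plus the data repository,
# keeping any entry the user has already set under that name
with_data_repo <- function(name = "datarepo",
                           url = "http://location-of/datarepo",
                           repos = getOption("repos")) {
  if (!name %in% names(repos)) {
    repos[name] <- url
  }
  repos
}

# Example with an explicit starting set of repositories
with_data_repo(repos = c(CRAN = "https://cloud.r-project.org"))

# One-off install without modifying options() (needs network access):
# install.packages("datapackage", repos = with_data_repo())
```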

Checking for the Data Package

Since the data package may or may not be installed, it is ideal to check whether its namespace can be loaded. If it can, then the data can be loaded. Otherwise, you will need to initiate the download of the package via:

if (!requireNamespace("datapackage", quietly = TRUE)) {
  install.packages("datapackage")
}
# Load data
data("dataset", package="datapackage")
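If installing on the user’s behalf feels too intrusive, a variation is to stop with an informative message instead. The function below is a hypothetical sketch (load_dataset() is a made-up name); it loads the data into a private environment and returns it rather than attaching it to the global workspace.

```r
# Hypothetical helper: returns a data set from an optional data package,
# failing with a hint instead of installing anything automatically
load_dataset <- function(name, package = "datapackage") {
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("Package '", package, "' is required for this data set. ",
         "Install it with install.packages(\"", package, "\").",
         call. = FALSE)
  }
  # Load into a private environment to avoid cluttering the workspace
  env <- new.env()
  utils::data(list = name, package = package, envir = env)
  env[[name]]
}

# Demonstrated with a data package that ships with R
head(load_dataset("mtcars", package = "datasets"))
```

Inside the statistical methodology package, the call would simply be load_dataset("dataset"), with the default package argument pointing at the data package.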