Dynamically Retrieving the Latest Pandoc from GitHub and Deploying on the Illinois Campus Cluster

Categories: programming, linux

Author: TheCoatlessProfessor

Published: May 4, 2020

Today, I had an epiphany about generating a report inside of the Illinois Campus Cluster (ICC) with R through rmarkdown. For a long time, I’ve wanted to avoid downloading data from the cluster to my personal computer and then generating the report there. Alas, my priorities never aligned with solving this problem, as downloading data was quick inside of a University building; but, with COVID-19, my internet is no longer Mazda’s “zoom zoom” fast. I would wager that downloading the same amount of data now takes about as long as placing it on a flash drive and shipping it overnight via FedEx.

The epiphany was simple:

What if we had pandoc on the cluster?

For those who aren’t familiar with pandoc, the software serves as a universal document converter or, in more relatable terms, it is the “Swiss Army knife” for moving between different document formats. Alas, pandoc is written in Haskell, which wasn’t available on the cluster. So, under usual operating principles on the cluster, I would have to build pandoc from source, and that would require setting up a Haskell environment, or so I thought…

But, wait… There’s a binary! In the Linux section of pandoc’s installation instructions, there is a standalone binary package for the amd64 architecture that contains both pandoc and pandoc-citeproc. Both binaries are statically linked and have no dynamic dependencies or dependencies on external data files. Huzzah! Let’s try out the binary…

Dynamic retrieval script

First, we want to always get the latest version of a software release from GitHub. With a quick Google, we land at Hanwen Wu’s “One Liner to Download the Latest Release from Github Repo” gist. We dropped the final wget from that one-liner, though, so the URL can be captured in a variable and reused when unpacking with tar.

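# Determine the download URL of the latest release from GitHub
# (tr strips the quotes and the space that cut leaves in front of the URL)
LATEST_RELEASE_URL=$(curl -s https://api.github.com/repos/jgm/pandoc/releases/latest | grep "browser_download_url.*amd64.tar.gz" | cut -d : -f 2,3 | tr -d '" ')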
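
To see what each stage of that pipeline contributes, here’s a minimal sketch that runs the same grep/cut/tr chain on a hand-written fragment shaped like the API response, so nothing is fetched from the network. The fragment, asset names, and version number are assumptions made up for the demo, and `SAMPLE_JSON` and `SAMPLE_URL` are hypothetical variable names.

```shell
# Illustrative stand-in for a slice of the API response; the asset names and
# version number are assumptions for the demo, not fetched from GitHub.
SAMPLE_JSON='      "name": "pandoc-2.9.2.1-linux-amd64.tar.gz",
      "browser_download_url": "https://github.com/jgm/pandoc/releases/download/2.9.2.1/pandoc-2.9.2.1-linux-amd64.tar.gz"
      "browser_download_url": "https://github.com/jgm/pandoc/releases/download/2.9.2.1/pandoc-2.9.2.1-macOS.zip"'

# Same stages as the one-liner, reading from the sample instead of curl:
#   grep  keeps only the download line for the amd64 tarball
#   cut   splits on ':' and keeps fields 2-3, i.e. the URL (which contains a ':' itself)
#   tr    deletes the double quotes, plus the space cut leaves in front of the URL
SAMPLE_URL=$(printf '%s\n' "$SAMPLE_JSON" \
  | grep "browser_download_url.*amd64.tar.gz" \
  | cut -d : -f 2,3 \
  | tr -d '" ')

echo "$SAMPLE_URL"
```

Keeping fields 2 and 3 matters because the URL’s own `https:` contains the delimiter; taking only field 2 would cut the URL off right after the scheme.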

Next, we’ll need to extract the archive’s filename from the URL, download the file, and unpack it into our local binary location.

# Destination directory (bin will be created inside)
DESTDIR=~/project-stat/

# Retrieve filename
PANDOC_FILENAME="${LATEST_RELEASE_URL##*/}"

# Download the latest pandoc version
wget -q ${LATEST_RELEASE_URL}

# Unpack into $DESTDIR, dropping the versioned top-level folder
# so that the binaries land in $DESTDIR/bin
tar xvzf "${PANDOC_FILENAME}" --strip-components 1 -C "${DESTDIR}"
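
If the `##*/` expansion and `--strip-components` look a bit magical, the sketch below tries both on a throwaway tarball built locally. Everything in it is hypothetical (the example.com URL, the `tool-1.0` layout, and the variable names); it only assumes the release archive keeps its files under a single versioned top-level folder.

```shell
# Hypothetical URL used only to illustrate the parameter expansion.
URL="https://example.com/downloads/tool-1.0-linux-amd64.tar.gz"

# "##*/" removes the longest prefix matching "*/", leaving just the filename.
FILENAME="${URL##*/}"
echo "${FILENAME}"

# Build a throwaway tarball laid out as <name>-<version>/bin/<binary>.
WORKDIR=$(mktemp -d)
mkdir -p "${WORKDIR}/tool-1.0/bin" "${WORKDIR}/dest"
echo "placeholder binary" > "${WORKDIR}/tool-1.0/bin/tool"
tar czf "${WORKDIR}/${FILENAME}" -C "${WORKDIR}" tool-1.0

# Without --strip-components the file would land in dest/tool-1.0/bin/tool;
# stripping one leading path component puts it in dest/bin/tool instead.
tar xzf "${WORKDIR}/${FILENAME}" --strip-components 1 -C "${WORKDIR}/dest"
ls "${WORKDIR}/dest/bin"
```

Note that `-C` expects the destination directory to exist already, which is why the sketch creates `dest` before extracting.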

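From there, we need to add the directory containing the binary to the front of the PATH variable. To do so, place in ~/.bashrc:

export PATH="${HOME}/project-stat/bin:${PATH}"

Replace ${HOME}/project-stat with the appropriate directory. Note that ${HOME} is used instead of ~ because the tilde is not expanded inside double quotes. Afterwards, start a new shell or run source ~/.bashrc so the change takes effect.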
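
For a slightly more defensive take on that PATH line, here’s a sketch that only prepends the directory when it isn’t already listed and then reports whether the shell can see pandoc. `PANDOC_BIN` is a hypothetical variable name, and the directory is assumed to match the destination used when unpacking.

```shell
# Hypothetical install location; substitute the directory used when unpacking.
PANDOC_BIN="${HOME}/project-stat/bin"

# Prepend the directory only when it is not already listed in PATH,
# so re-sourcing ~/.bashrc does not keep growing the variable.
case ":${PATH}:" in
  *":${PANDOC_BIN}:"*) ;;                 # already present, nothing to do
  *) PATH="${PANDOC_BIN}:${PATH}" ;;
esac
export PATH

# Report which pandoc (if any) the shell now resolves.
if command -v pandoc >/dev/null 2>&1; then
  echo "pandoc found at: $(command -v pandoc)"
else
  echo "pandoc not found on PATH"
fi
```

Wrapping both sides in colons lets the `case` pattern match the directory whether it sits at the start, middle, or end of PATH.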

Then, open up R and trigger the render of the report using rmarkdown:

rmarkdown::render("path/to/RmarkdownFile.Rmd")
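
Since cluster jobs usually run without an interactive R session, the same render can be triggered from the shell. Below is a rough sketch, assuming Rscript is available on the cluster and the rmarkdown package is installed; `render_report` is a hypothetical helper name, not something provided by R or rmarkdown.

```shell
# Hypothetical helper: render an R Markdown file non-interactively.
render_report() {
  rmd_file="$1"

  # Bail out early with a readable message if pandoc is not on PATH.
  if ! command -v pandoc >/dev/null 2>&1; then
    echo "pandoc not found on PATH; check the export in ~/.bashrc" >&2
    return 1
  fi

  # Pass the file as a trailing argument so the path needs no quoting inside R.
  Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "${rmd_file}"
}

# Usage (uncomment inside a job script):
# render_report path/to/RmarkdownFile.Rmd
```

The early check is there because a missing pandoc is the most likely failure on a fresh login, and it is cheaper to catch in the shell than after R has started.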

Once the report is generated, the next step would be to send an e-mail with the report attached at the end of the simulation.

Fin

In short, this post showed how to dynamically retrieve the latest version of pandoc from GitHub, extract it into a local bin directory on the cluster, and add that bin directory to the PATH so that R can find pandoc when generating documents with rmarkdown.

Acknowledgements

Special thanks to NCSA’s Weddie Jackson, who has been my go-to person for getting the R version on the cluster upgraded every so often and who triggered the epiphany with an e-mail. (He bumped the cluster version to R 4.0.0 as well.)