Intro
Oftentimes I receive inquiries on how to deploy R packages or conduct simulation studies on the Illinois Campus Cluster (ICC). After writing a few responses, I realized that this information would probably benefit not only the Illinois R community but also the larger R community if it were more widely available. The information is primarily a pointed discussion of using R non-interactively (e.g. from the command line, shell, or terminal) that follows from "Invoking R from the Command Line" and "Scripting with R" in Appendix B Invoking R of An Introduction to R. Below is a collation of previous discussions I’ve had with various personnel on campus regarding clustered use of R.
Command Line R
To begin, let’s start with the options that are available when launching R via the command line, shell, or terminal. In particular, we have `R`, `Rscript`, and `R CMD BATCH`. There is a fourth option written by Dirk Eddelbuettel called `littler`; however, I have yet to experiment with it, as it is not part of a standard base R installation.
The differences between the base R options are relatively large.
For starters, using `R` by itself indicates that an interactive session should be spawned and the commands should be executed from within it. This is particularly bad on a cluster, as the script could stall if it requires user input to advance (e.g. code behind `if (interactive())`). Furthermore, the output is directed straight back to the R session in the terminal window; no output file is generated. Thus, this method is not preferred when deploying to a cluster. However, for personal use, this provides a GUI-free interaction with R that focuses on computational and not graphical results (e.g. no plotting).
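One way to keep a script usable both locally and on a cluster is to branch on `interactive()`. Below is a minimal sketch; the prompt and the fallback value are my own illustrations:

```r
# Prompt for input only when R is running interactively; otherwise fall back
# to a default so the script does not stall on a cluster.
if (interactive()) {
  n = as.numeric(readline("Number of observations: "))
} else {
  n = 10  # hypothetical non-interactive default
}

rnorm(n)
```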
With this being said, there are really only two options for cluster-based use: `R CMD BATCH` and `Rscript`.
The difference between the two can be stated succinctly as:
`R CMD BATCH`:

- Requires an input file (e.g. `helloworld.R`)
- Saves to an output file (e.g. running `helloworld.R` yields `helloworld.R.Rout`)
- By default, echoes both input and output statements inline (i.e. as if you were actually typing them into the console)
- Is not able to write output to stdout
`Rscript`:

- Similar to bash scripts
- Requires the use of a shebang (`#!/usr/bin/Rscript`) to run the file directly
- Requires the file to be made executable before it can be run directly (`chmod +x script.r`)
- Output from `print()` and `cat()` is sent directly to stdout; no additional file is made
- Able to issue one-line expressions (e.g. `Rscript -e "print('hi!')"`)
To further emphasize the differences between the two, let’s create two short examples.
For `R CMD BATCH`, create a file called `hellobatch.R` with contents:
print("Hello Batch World!")
To run this using `R CMD BATCH`, use:
$ R CMD BATCH hellobatch.R
This yields a file called `hellobatch.R.Rout` in the same directory as `hellobatch.R` with contents:
print("Hello Batch World!")
## [1] "hello world"
proc.time()
## user system elapsed
## 0.401 0.021 0.422
Under `Rscript`, we’ll use `hellorscript.R` with contents:
#!/usr/bin/Rscript
print("Hello Batch World!")
To run the script directly (i.e. as `./hellorscript.R`), we must first make the file executable:
$ chmod +x hellorscript.R
Then, we can run the file with either:
$ Rscript hellorscript.R
$ ./hellorscript.R
Doing so will produce output directly in the terminal, e.g.
$ Rscript hellorscript.R
[1] "Hello Batch World!"
Personally, I opt for `R CMD BATCH` over `Rscript`. Though, a considerable number of folks prefer the latter, as its input and output (I/O) options are more aligned with the tenets of Unix. I may flip flop on this later. Stay tuned for an update.
Update: `Rscript` is sooo much better than `R CMD BATCH`.
Each of these commands responds to many different options or flags. The options presented below are truncated from the full list, as these are the ones I’ve found most relevant when working with R in a non-interactive state. To access the full list of options, type:
$ R --help
Options:
| Option | Description |
|---|---|
| `--save` | Do save workspace at the end of the session |
| `--no-save` | Don’t save it |
| `--no-environ` | Don’t read the site and user environment files |
| `--no-site-file` | Don’t read the site-wide Rprofile |
| `--no-init-file` | Don’t read the user R profile |
| `--vanilla` | Combine `--no-save`, `--no-restore`, `--no-site-file`, `--no-init-file`, and `--no-environ` |
| `--no-readline` | Don’t use readline for command-line editing |
| `-q`, `--quiet` | Don’t print startup message |
| `--silent` | Same as `--quiet` |
| `--slave` | Make R run as quietly as possible |
| `--interactive` | Force an interactive session |
| `--verbose` | Print more information about progress |
| `--args` | Skip the rest of the command line |
| `-f FILE`, `--file=FILE` | Take input from ‘FILE’ |
| `-e EXPR` | Execute ‘EXPR’ and exit |
The main options that I end up using when executing jobs on the cluster are:

$ R CMD BATCH --no-save --quiet --slave $HOME/folder/script.R
Thus, the R session is not saved on close, there are no startup messages, and there is no command line echo, respectively. Only the results are saved within `script.Rout`.
The same can be accomplished with:
$ Rscript $HOME/folder/script.R > script.Rout
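Note that `>` captures only standard output. Messages, warnings, and errors are written to standard error, so to fold those into the same file you can also redirect stderr:

$ Rscript $HOME/folder/script.R > script.Rout 2>&1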
Note: Under this approach, you may further need to suppress package startup messages using:
suppressPackageStartupMessages(library(yourpackagehere))
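For instance, a script header along these lines keeps the captured output clean (the package names are placeholders):

```r
# Wrap library() calls so startup banners do not clutter script.Rout.
suppressPackageStartupMessages({
  library(yourpackagehere)  # placeholder
  library(anotherpackage)   # placeholder
})
```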
Set up a local R library for installing and loading R packages
The R package library directory is traditionally used in a system-wide manner. However, on a cluster, more than one user is using the system at any given moment, and each user has unique needs. That is to say, one user may want to keep a package at an earlier version while another wants to be as up-to-date as possible. As a result, there is no ideal arrangement for a shared resource other than having users maintain their own installs of R packages. Thus, each user must create and maintain their own library, and packages must be installed BEFORE being used.
To create your own library, you will need to do the following:
# Create a directory for your R packages
# Note: This counts against your 2 GB home dir limit on ICC
mkdir ~/Rlibs
# Load the R modulefile
# You may want to specify version e.g. R/3.2.2
module load R
# Set the R library environment variable (R_LIBS) to include your R package directory
export R_LIBS=~/Rlibs
To ensure that the `R_LIBS` variable remains set even after logging out, run the following command to permanently add it to the environment (i.e. this modifies your `.bashrc` file, which is loaded on startup).
# Quote 'EOF' so $R_LIBS is expanded at login, not when writing the file
cat <<'EOF' >> ~/.bashrc
if [ -n "$R_LIBS" ]; then
    export R_LIBS=~/Rlibs:$R_LIBS
else
    export R_LIBS=~/Rlibs
fi
EOF
(The above code snippets are based upon the ICC R Help Docs.)
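After logging back in (or running `source ~/.bashrc`) and loading the R module, you can verify that R sees the new library with:

$ Rscript -e ".libPaths()"

Your `~/Rlibs` directory should appear in the output.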
To add packages in the future to your private library, use:
# Use the install.packages function to install your R package.
$ Rscript -e "install.packages('devtools', '~/Rlibs', 'http://ftp.ussg.iu.edu/CRAN/')"
Note: You will need to install packages prior to queuing the script.
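If you maintain a list of required packages, a small helper can make this step repeatable before queuing jobs. Here is a minimal sketch, reusing the library location and CRAN mirror above (the package list is a hypothetical example):

```r
# Install any packages that are not already present in the local library.
pkgs = c("devtools", "ggplot2")  # hypothetical list of required packages

installed = rownames(installed.packages(lib.loc = "~/Rlibs"))
missing   = pkgs[!pkgs %in% installed]

if (length(missing) > 0) {
  install.packages(missing, lib = "~/Rlibs",
                   repos = "http://ftp.ussg.iu.edu/CRAN/")
}
```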
Another nice feature of this approach is the ability to use `devtools` to install packages from external repositories (e.g. GitHub, BitBucket):
$ Rscript -e "devtools::install_github('coatless/bmisc')"
Passing arguments
Woah boy, this one is a doozie. In addition to base R, there are many different options on CRAN… It seems as if the consensus is really around `optparse`.
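For the curious, here is a minimal sketch of what an `optparse` version of the sampler shown below might look like; the flag names and defaults are my own invention:

```r
library(optparse)

# Define named command line flags with defaults and help strings.
option_list = list(
  make_option(c("-n", "--num"), type = "integer", default = 5,
              help = "number of observations"),
  make_option(c("-m", "--mean"), type = "double", default = 0,
              help = "mean of the normal distribution")
)

# Parse the flags supplied on the command line.
opt = parse_args(OptionParser(option_list = option_list))

rnorm(n = opt$num, mean = opt$mean)
```

Such a script could then be called as, e.g., `Rscript sampler_optparse.R --num 5 --mean 100`.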
I’m more of a simpleton when it comes to batch arguments and just use the commands packaged with base R. My normal file construction with passed args is:
sampler.R
# Expect command line args at the end.
args = commandArgs(trailingOnly = TRUE)

# Extract and cast as numeric from character
rnorm(n = as.numeric(args[1]), mean = as.numeric(args[2]))
Then call the file with:
$ Rscript sampler.R 5 100 > sampler.Rout
Alternatively, you can use:

$ R CMD BATCH "--args 5 100" sampler.R

(With `R CMD BATCH`, the arguments must be passed via `--args`; otherwise the first trailing value would be treated as the output file name.)
You should receive 5 observations from a normal distribution centered at 100.
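If you want the script to fail fast when the expected arguments are missing, a small guard can be added near the top of `sampler.R`, just after `commandArgs()` (a sketch; the usage message is my own):

```r
# Stop immediately with a usage hint if both arguments were not supplied.
if (length(args) != 2) {
  stop("Usage: Rscript sampler.R <n> <mean>")
}
```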
PBS Files
In order to submit to the Illinois Campus Cluster, one must create a PBS file to be used with `qsub`. Alternatively, one can define a cluster job within a one-line statement via `qsub`, but I’m not a huge fan of that approach.
There are two “modes” to PBS files: Single Jobs and Array Jobs. The Array Job is preferred, as it groups work together instead of sending multiple individual jobs.
For an Array Job, there are three important parts:

- `inputs.txt`: List of parameter values to use.
- `sim_job.R`: Script governing the desired computations.
- `sim_array_job.pbs`: Controls how the job is executed on the cluster.
inputs.txt
To customize the job, create an `inputs.txt` file that specifies a configuration for each Job Array ID, where each line corresponds to an array ID.
In the case of our normal simulation, we have:
5 100
9 2.3
42 4.8
where the first column denotes the number of observations and the second column denotes the mean. Each line acts as an individual job and will be given an ARRAY_ID corresponding to the line number, e.g. Job 1 has parameters 5 and 100, Job 2 has parameters 9 and 2.3, and so on…
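As an aside, the same lookup can also be done on the R side instead of in the PBS file. A hypothetical sketch, assuming `inputs.txt` lives in your home directory:

```r
# Read the line of inputs.txt that matches this job's array ID.
jobid  = as.integer(Sys.getenv("PBS_ARRAYID"))
params = readLines("~/inputs.txt")[jobid]

# Split the line on whitespace and cast to numeric.
args = as.numeric(strsplit(params, " ")[[1]])
```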
sim_array_job.pbs
In this case, the batch file sets up various PBS qualities: the duration the script can run (walltime), the number of machines (nodes) and cores to request on each machine (ppn), the name of the job, the queue the job should run on, and the I/O options.
#!/bin/bash
#
## Set the maximum amount of runtime to 4 Hours
#PBS -l walltime=04:00:00
## Request one node with `nodes` and one core with `ppn`
#PBS -l nodes=1:ppn=1
#PBS -l naccesspolicy=shared
## Name the job
#PBS -N jobname
## Queue in the secondary queue
#PBS -q secondary
## Run with job array indices 1 through 3 (one per line of inputs.txt).
#PBS -t 1-3
## Merge standard output into error output
#PBS -j oe
######################################
## Grab the job id from an environment variable and create a directory for the
## data output
export CUSTOM_JOBID=`echo "$PBS_JOBID" | cut -d"[" -f1`
mkdir ${PBS_O_WORKDIR}/${CUSTOM_JOBID}
cd ${PBS_O_WORKDIR}/${CUSTOM_JOBID}
# Load R
module load R/3.4.2
## Grab the appropriate line from the input file.
## Put that in a shell variable named "PARAMS"
export PARAMS=`cat ${HOME}/inputs.txt |
sed -n ${PBS_ARRAYID}p`
## Run R script based on the array number.
Rscript $HOME/sim_job.R $PARAMS
sim_job.R
This script is responsible for computing and saving the results.
# Expect command line args at the end.
args = commandArgs(trailingOnly = TRUE)

# Obtain the ID being accessed from the array
jobid = as.integer(Sys.getenv("PBS_ARRAYID"))

# Set seed for reproducibility
set.seed(jobid)

# Extract and cast as numeric from character
result = rnorm(n = as.numeric(args[1]), mean = as.numeric(args[2]))

# Save the draws into the job's output directory (file name is illustrative)
saveRDS(result, file = paste0("sim_result_", jobid, ".rds"))
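With all three files in place, the array job can then be submitted from a login node with:

$ qsub sim_array_job.pbs

and its status checked with `qstat`.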