Intro

One downside of having many programming languages is that each has its own niche in academia and industry, so data does not flow between them as easily as spice. For example, engineers have an affinity for Python, while statisticians are in love with R. Processing data in one language and then running an algorithm in another is therefore a headache. Recently, support has begun to emerge for a standard data.frame format shared between Python and R via the Feather initiative, a joint work between Wes McKinney of pandas fame and Hadley Wickham, who is behind the majority of user-oriented R developments in the last half decade (ggplot2, dplyr, tidyr, rvest, and so on…), using an underlying columnar memory specification provided by Apache Arrow. Unfortunately, Feather does not target NumPy arrays, which is where much of the data lives in some engineering applications. To that end, Dirk Eddelbuettel of Rcpp fame wrote a nice package called RcppCNPy that enables loading and writing 1D and 2D NumPy arrays within R. For example:

numpy_r_ex.R

# install.packages("RcppCNPy")
library("RcppCNPy")

# Set seed for reproducibility
set.seed(1337)

# Generate data in R
vec = rnorm(100)
mat = matrix(vec, nrow = 25, ncol = 4)

# Write to file
npySave("vec.npy", vec)
npySave("mat.npy", mat)

# Load 
vec2 = npyLoad("vec.npy")
mat2 = npyLoad("mat.npy")

# Check equality
all.equal(vec,vec2)

all.equal(mat,mat2)

However, trying to load an array with three or more dimensions produces the error:

“Unsupported dimension in npyLoad”

The fault lies primarily with the Rcpp data types, which do not scale to \(N\)-D arrays with \(N \ge 4\). Rcpp also has no object export in place for an object with 3 dimensions. Hence, no \(N\)-D array with \(N > 2\) can be loaded into R or written to a .npy binary using RcppCNPy.

Thus, \(N\)-D arrays with \(N > 2\) saved using numpy.save seemed to exist only within the Python environment. Until now…

Generating NumPy data to use

Before we can begin transferring data into R, we must first have some data in NumPy binary form (.npy), written using numpy.save. In this case, I’ve opted to generate two 4D arrays, each with dimensions \(3 \times 4 \times 5 \times 2\) and containing values in \([0,1)\), via numpy.random.random.

gen_numpy.py

import numpy as np

# Generate two 4D arrays (3x4x5x2) of random values in [0, 1)
a = np.random.random((3,4,5,2))
b = np.random.random((3,4,5,2))

# Save 
np.save('a_patches_z1.npy', a)
np.save('b_patches_z1.npy', b)
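
Before moving on, it can be worth confirming that a saved array comes back with the same shape and values. The following is a minimal sketch of such a check, not part of the original workflow; it writes to a temporary directory (an assumption made here so that no real data is touched) rather than to the working directory.

```python
import os
import tempfile

import numpy as np

# Scratch directory used only for this sketch (hypothetical location)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a_patches_z1.npy")

    # Same shape as the arrays generated in gen_numpy.py
    a = np.random.random((3, 4, 5, 2))
    np.save(path, a)

    # Reload the binary file fully into memory
    a2 = np.load(path)

# The round trip should preserve shape, values, and the [0, 1) range
same_shape = a2.shape == (3, 4, 5, 2)
same_values = np.array_equal(a, a2)
in_range = bool((a2 >= 0).all() and (a2 < 1).all())
print(same_shape, same_values, in_range)
```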

Convert the Data to an R readable object

With this data in hand, let’s view the NumPy 2 R Object script (n2r.py). The script has two sections. The first enables the user to feed in parameters via the command line. The second uses the rpy2 package within Python to convert NumPy arrays to R objects.

Command Line Interface to the Script

The command line options are defined as follows:

n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>

With actual values we have:

n2r.py -i /Users/James/Desktop/lidar -f _patches_ -e rout

Note: The export directory is placed within the input directory and, thus, we have R objects in /Users/James/Desktop/lidar/rout.
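
To make the three options concrete, here is a rough sketch of the same interface written with argparse. This is an alternative illustration only: the actual script uses getopt, and the helper name parse_args is made up for this example. The defaults are the ones hard-coded in n2r.py.

```python
import argparse


def parse_args(argv):
    """Parse the three n2r.py options (hypothetical argparse variant)."""
    parser = argparse.ArgumentParser(prog="n2r.py")
    parser.add_argument("-i", "--indir",
                        default="/Users/James/Desktop/lidar_data",
                        help="directory holding the .npy files")
    parser.add_argument("-f", "--fname", default="_patches_",
                        help="substring that file names must contain")
    parser.add_argument("-e", "--expdir", default="R_data",
                        help="export directory created inside the input directory")
    return parser.parse_args(argv)


# Same values as the example invocation above
args = parse_args(["-i", "/Users/James/Desktop/lidar",
                   "-f", "_patches_", "-e", "rout"])

# The export directory is nested inside the input directory
export_path = "%s/%s" % (args.indir, args.expdir)
print(export_path)
```

One thing argparse gives for free is a generated -h message, so the usage string does not have to be repeated by hand.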

The NumPy binary to R object script `n2r.py`

The first order of business is to have the function set default parameter values. The second is to process all files within the directory whose names contain a specific sequence (e.g. _patches_). The third is to export these objects with a .gzip extension so that R is able to read them via load(). The reason for using .gzip instead of .rda is that, when we tried to export using .rda, a lot of unexpected complications caused the writing of the file to be prolonged and then fail.
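
The matching and export steps can be sketched without NumPy or rpy2 by looking only at the strings involved. The helper names below (match_files, r_object_name, r_save_command) and the directory listing are invented for this illustration; the script performs the same steps inline.

```python
import os


def match_files(files, fname):
    """Return the sorted file names that contain the substring fname."""
    return sorted(f for f in files if fname in f)


def r_object_name(numpy_fname):
    """Strip the extension; the remainder becomes the R object name."""
    return os.path.splitext(numpy_fname)[0]


def r_save_command(name, path_to_data, export_dir):
    """Build the R save() call that writes one object to a .gzip file."""
    return "save(%s, file='%s/%s/%s.gzip', compress=TRUE)" % (
        name, path_to_data, export_dir, name)


# Hypothetical directory listing
files = ["b_patches_z1.npy", "notes.txt", "a_patches_z1.npy"]

matched = match_files(files, "_patches_")
commands = [r_save_command(r_object_name(f), "/Users/James/Desktop/lidar", "rout")
            for f in matched]
for command in commands:
    print(command)
```

Note that the substring test does not look at the extension, so any file whose name contains the sequence is picked up; a stricter filter could also require that the name ends in .npy.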

With that being said, here’s the script:

n2r.py

import os, sys, getopt
import numpy as np
import re

from rpy2.robjects import r
from rpy2.robjects.numpy2ri import numpy2ri

"""
Conversion function for .npy files 

@author : Avinash Balakrishnan

Commandline argumentation 

@author: JJB
"""

def main(argv):
   # Declare some default values
   dirname = '/Users/James/Desktop/lidar_data'
   fname = '_patches_'
   export_dir = 'R_data'
   
   # Try to parse the arguments
   try:
      opts, args = getopt.getopt(argv,"hi:f:e:",["indir=","fname=","expdir="])
   except getopt.GetoptError as err:
      print(str(err))
      print('n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>')
      sys.exit(2)
   # Set the correct values
   for opt, arg in opts:
      if opt == '-h':
         print('n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>')
         sys.exit()
      elif opt in ("-i", "--indir"):
         dirname = arg
      elif opt in ("-f", "--fname"):
         fname = arg
      elif opt in ("-e", "--expdir"):
         export_dir = arg
   # Call function
   convert_numpy(dirname, fname, export_dir)


def convert_numpy(path_to_data, fname, export_dir):
    """Convert NumPy N-D array to R object

    Keyword arguments:
    path_to_data -- full dir path to data
    fname        -- partial file name to match
    export_dir   -- Name of export dir added to data dir
    """  
    # Create a directory path
    if not os.path.exists("%s/%s" % (path_to_data,export_dir)):
        os.makedirs("%s/%s" % (path_to_data,export_dir))

    # Get list of files in the directory
    files = os.listdir(path_to_data)
    
    # Sort out which files are of each type
    numpy_files = sorted([f for f in files if fname in f])

    # Begin process conversion
    for numpy_fname in numpy_files:
        
        # Load in 4D Numpy Array
        d = np.load("%s/%s" % (path_to_data, numpy_fname))
        
        # Remove the file extension of .npy binary
        file_name = re.sub(r'\.npy$', '', numpy_fname)
        
        # Convert the numpy object to R
        ro = numpy2ri(d)
        
        # Assign the name
        r.assign("%s" % file_name,ro)
        
        # Export to .gzip readable by R's load() 
        r("save(%s, file='%s/%s/%s.gzip', compress=TRUE)" % (file_name,path_to_data,export_dir,file_name))
        
if __name__ == "__main__":
   main(sys.argv[1:])

After running this script, we now have objects R would recognize via the load() command.

load("a_patches_z1.gzip")

Convert from an R object to another R object for better storage

Now, I’m not necessarily a huge fan of the .gzip extension. I would prefer the file to be identifiable as an R object by its extension alone. So, I’ve added another function, run within R, that changes the format once more from .gzip to R’s .rda format. Again, this is mainly because saving with the .rda extension from within rpy2 did not work.

gzip_to_rda.R

#' Change file format from GZIP to RDA 
#' 
#' Modifies file format from .gzip to R's binary format .rda
#' @param indir  A \code{string} with the location of the data directory 
#' @param fname  A \code{string} that contains commonalities between files.
#' @param outdir A \code{string} representing the out directory to save to.
#' @examples
#' gzip_to_rda("/Users/James/Desktop/lidar/rout", "_patches_","/Users/James/Desktop/lidar/rda")
gzip_to_rda = function(indir, fname, outdir){
  
  # Grab a list of files within the directory
  m = list.files(path = indir, pattern=fname)
  
  # Make output dir
  if(!dir.exists(outdir)) {
    if(outdir != "."){
      dir.create(outdir, recursive = TRUE)
    }
  }
  
  # Load in each file
  for(i in seq_along(m)){
    # Obtain file names
    f = tools::file_path_sans_ext(m[i])
    
    # Create an absolute link to file and load
    load(file.path(indir,m[i]))
    
    # Save file 
    save(list=f, file = file.path(outdir,paste0(f,".rda")))
    
    # Remove the loaded object from memory
    rm(list = f)
  }
}

Credit

This post has a code contribution by Avinash Balakrishnan, a Master's student in the Department of Statistics and a Graduate Research Assistant (GRA) who is working with me during Spring 2016 at the University of Illinois at Urbana-Champaign.