Intro
By default, many analysts export results as a CSV file and share it with their colleagues. CSV files are preferred in this instance because they show “diffs” between versions and require only minimal software, like a text editor, to open and edit. However, CSV files are problematic when it comes to sparse matrices, as one of my fellow PhD students discovered recently (generating more than roughly 500 GB of data). For a refresher on sparse matrices, see the prior post on the benefits of using sparse matrices. In cases like this, it would be better to save the object itself through one of R’s binary file formats. This logic extends to large data sets and simulation results as well.
The goal of this post is to highlight the different binary file formats offered by R, version compatibility, and compression differences.
Formats
First, we begin with an overview of the different kinds of R binary files that are available.
- .rda / .RData is “R Data”
  - Description: Save and restore one or more named objects into an environment.
  - Notes: Useful for storing workspaces and multiple R objects as-is. As an example, see the save.image() function, which is called when the workspace is saved upon closing an R session.
- .rds is “R Data Single”
  - Description: Save and load a single R object to a binary file.
  - Notes: Great for exporting a single result and loading it into a new variable.
- .rdx and .rdb
  - Description: .rdx contains the index while .rdb stores the objects for an R database used in lazy loading.
  - Notes: Primarily for R’s internal usage, though benefits exist around delayed assignment through the use of promises for large data.
Next, we show how to create each kind of binary file and how to read it back in.
.rda / .RData
# Define values
fruit = "apple"
toad = "ribbit"
# Save R objects
save(fruit, toad, file = "all_objects.rda")
# Remove objects in environment
rm(list = ls())
# Load objects from disk
load("all_objects.rda")
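One detail the snippet above does not show: load() invisibly returns the names of the objects it restored, and it accepts an envir argument so the objects need not land in the global workspace. A minimal sketch, where the temporary file and the restored environment are illustrative names rather than part of the original example:

```r
# Objects to store; same values as in the example above
fruit = "apple"
toad = "ribbit"

# Write both objects to a temporary .rda file
rda_path = tempfile(fileext = ".rda")
save(fruit, toad, file = rda_path)

# Restore into a dedicated environment instead of the global workspace
restored = new.env()
loaded_names = load(rda_path, envir = restored)

# load() returns the names of the objects it created
print(loaded_names)

# The values are available inside the chosen environment
print(get("fruit", envir = restored))
```

Loading into a separate environment avoids silently overwriting objects that happen to share a name with something in the file.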
.rds
# Define a value
life = 42L
# Save a single R object
saveRDS(life, file = "myobj.rds")
# Remove objects in environment
rm(list = ls())
# Read in the object from disk and
# assign it to a new variable
my_age = readRDS(file = "myobj.rds")
my_age
# [1] 42
life
# Error: object 'life' not found
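Compression can be chosen at write time. As a rough sketch, saveRDS() accepts compress = FALSE or one of "gzip", "bzip2", or "xz". The sample vector and temporary paths below are made up for illustration, and the actual sizes will vary with the data:

```r
# Highly repetitive sample data, chosen so that compression has something to gain
values = rep(seq_len(100), times = 1000)

# One temporary file per compression setting
settings = c("none", "gzip", "bzip2", "xz")
paths = vapply(settings, function(s) tempfile(fileext = ".rds"), character(1))

saveRDS(values, file = paths[["none"]], compress = FALSE)
saveRDS(values, file = paths[["gzip"]], compress = "gzip")
saveRDS(values, file = paths[["bzip2"]], compress = "bzip2")
saveRDS(values, file = paths[["xz"]], compress = "xz")

# File sizes in bytes, labelled by compression setting
sizes = setNames(file.size(paths), settings)
print(sizes)

# readRDS() detects the compression on its own, so every file
# restores the same object
restored = lapply(paths, readRDS)
```

Stronger compression generally trades write time for disk space, so it is worth timing on your own data before settling on a setting.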
.rdx and .rdb
# Save R objects into an environment
my_lazy_env = new.env(parent = emptyenv())
my_lazy_env$my_df = data.frame(x = 1, y = 2)
my_lazy_env$grades = data.frame(pct = 95, letter = "A")
# Store database in folder
dir.create("data-db")
# Save objects inside a LazyLoadDB
# Requires an environment and the name of a file.
tools:::makeLazyLoadDB(my_lazy_env, "data-db/my_lazyload_db")
# Remove objects in environment
rm(list = ls())
# Load objects from disk
lazyLoad("data-db/my_lazyload_db")
# NULL
Note: Using .rdx and .rdb requires the objects to be saved into an environment first; that environment is then supplied as the argument used to construct the lazy-load database. Moreover, note the use of three colons, e.g. :::, to access makeLazyLoadDB in tools. This means that the function is not exported from the tools package and should be considered internal.
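To see the delayed assignment in action, a lazy-load database can be attached to its own environment rather than the global workspace. This is a sketch only: makeLazyLoadDB is internal to tools and may change without notice, and the temporary directory and environment names here are placeholders.

```r
# Collect the objects to store in an environment
source_env = new.env(parent = emptyenv())
source_env$my_df = data.frame(x = 1, y = 2)
source_env$grades = data.frame(pct = 95, letter = "A")

# Write the .rdx/.rdb pair under a temporary directory
db_dir = tempfile("data-db")
dir.create(db_dir)
db_base = file.path(db_dir, "my_lazyload_db")
tools:::makeLazyLoadDB(source_env, db_base)

# List the files that were written
print(list.files(db_dir))

# Attach the database to a fresh environment; each object is set up as a
# promise and only read from disk when it is first accessed
target_env = new.env()
lazyLoad(db_base, envir = target_env)
print(ls(target_env))
print(target_env$grades)
```

Because nothing is read until an object is touched, attaching a large database this way is cheap even when the objects themselves are not.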
R Binary File Versions and Compatibilities
That said, saving into R’s binary format introduces compatibility issues: some versions of R use a newer variant of the binary format that older versions cannot read. To control the version an object is saved in, pass version = 2 or version = 3 when writing the object via save() or saveRDS().
The following table provides information as to when the different versions came into service.
R Version | Default Binary Version |
---|---|
R 3.6.0 - Present | 3 (readable and writable since R 3.5.0) |
R 1.4.0 - R 3.5.3 | 2 |
R 0.99.0 - R 1.3.1 | 1 |
More information about version differences can be found in Section 1.8: Serialization Formats of the R Internals manual.
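As a brief illustration of pinning the format, the sketch below writes the same object under both versions. The data frame and temporary paths are invented for the example. Version 2 is the safer choice when the file has to be opened by an older R installation.

```r
scores = data.frame(id = 1:3, pct = c(95, 82, 77))

# save() and saveRDS() both accept a version argument
rda_v2 = tempfile(fileext = ".rda")
rds_v2 = tempfile(fileext = ".rds")
save(scores, file = rda_v2, version = 2)
saveRDS(scores, file = rds_v2, version = 2)

# Request the newer format explicitly
rds_v3 = tempfile(fileext = ".rds")
saveRDS(scores, file = rds_v3, version = 3)

# A current R session reads either version back
from_v2 = readRDS(rds_v2)
from_v3 = readRDS(rds_v3)
print(identical(from_v2, from_v3))
```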
When looking at an R binary file, note that version information is stored within the first line of the written file, under the scheme of X for binary serialization and A for ASCII serialization.
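A quick way to check this is to read the first few bytes of a file. The helper below, read_header, is a hypothetical name. It relies on gzfile() to decompress transparently, since save() compresses by default. To my understanding, files written by save() begin with a four-character tag such as RDX2 or RDX3 followed by a newline, where the letter marks the serialization type and the digit marks the format version; treat that layout as an assumption to verify against the R Internals manual.

```r
# Hypothetical helper: return the first 5 bytes of a file as text.
# gzfile() also reads uncompressed files, so it covers both cases.
read_header = function(path) {
  con = gzfile(path, open = "rb")
  on.exit(close(con))
  rawToChar(readBin(con, what = "raw", n = 5))
}

x = 1:10

binary_v2 = tempfile(fileext = ".rda")
binary_v3 = tempfile(fileext = ".rda")
ascii_v3 = tempfile(fileext = ".rda")

save(x, file = binary_v2, version = 2)
save(x, file = binary_v3, version = 3)
save(x, file = ascii_v3, version = 3, ascii = TRUE)

# print() shows the trailing newline as an escape sequence
headers = vapply(c(binary_v2, binary_v3, ascii_v3), read_header,
                 character(1), USE.NAMES = FALSE)
print(headers)
```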
Fin
In short, give R binary files a shot if you are looking for reduced file size and don’t mind losing the ability to inspect the data without opening it in R.