Intro
The goal of this post is to demonstrate how to load the Criteo data set from the associated Kaggle competition into R using the ff and ffbase packages. I’ll also present a way to model the data using the biglm package, which requires you to clean the data before running the modeling command.
To follow this guide, you will need a computer with AT LEAST 4 GB of RAM. Preferably, you should also save the data to a magnetic hard drive and NOT a solid state drive (SSD). This remark comes from personal experience of quickly burning out an SSD when manipulating big data.
Background
The data set is about 10 GB, with 45,840,617 observations and 40 variables. Yes, you read that correctly… The data set has over 45 million observations!
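To see why a file this size calls for an on-disk tool like ff, here is a rough estimate of the RAM the data would need as an ordinary data frame. It assumes the column layout used in the load script (14 integer columns and 26 factor columns) and 4 bytes per value, which is how R stores integers and factor codes. The variable names are mine, and the figure ignores the extra storage for factor levels, so treat it as a lower bound.

```r
# Rough in-memory size of the Criteo training set as a regular data.frame.
n_obs      <- 45840617   # observations, as stated above
n_int_cols <- 14         # Label + I1..I13, read as integers
n_fac_cols <- 26         # C1..C26, read as factors (integer codes)
bytes_per_value <- 4     # R integers and factor codes use 4 bytes each

bytes_total <- n_obs * (n_int_cols + n_fac_cols) * bytes_per_value
gib_total   <- bytes_total / 1024^3

# Size in GiB, excluding factor level tables
round(gib_total, 2)
```

On a machine with 4 GB of RAM, that estimate alone is more than the available memory, before R makes any working copies.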
A note…
The file paths provided below correspond to the storage drive that I use for working with big data. You may not have an F:/ drive. You may need to place big data on your C:/ drive or within a specific volume on OS X or Linux.
As a result, before running the script, please change the file paths so that they match your machine!
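One way to make that change in a single place is to keep the base directory in a variable and build every other path from it. This is only a sketch: data_dir, train_file, temp_dir and save_stem are names I made up for illustration, and the scripts in this post use the literal paths instead.

```r
# Hypothetical base directory; edit this one line to match your machine.
data_dir <- "F:/bigdata/kaggle_criteo/dac"

# Derived paths used later in the post (file.path joins pieces with "/").
train_file <- file.path(data_dir, "train.txt")        # raw data file
temp_dir   <- file.path(data_dir, "temp")             # ff temporary directory
save_stem  <- file.path(data_dir, "ffdata", "ffdac")  # stem for saved ff files

train_file
```

With this in place, the literal paths in the scripts could be swapped for these variables, e.g. options(fftempdir = temp_dir).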
Load Script
# Any package that is required by the script below is given here:
# Check to see if packages are installed, if not install.
inst_pkgs = load_pkgs = c("ff","ffbase","biglm")
inst_pkgs = inst_pkgs[!(inst_pkgs %in% installed.packages()[,"Package"])]
if(length(inst_pkgs)) install.packages(inst_pkgs)
# Dynamically load packages
pkgs_loaded = lapply(load_pkgs, require, character.only=TRUE)
# Set Working Directory to where big data is
setwd("F:/bigdata/kaggle_criteo/dac/")
# Check temporary directory ff will write to (avoid placing on a drive with SSD)
getOption("fftempdir")
# Set new temporary directory
options(fftempdir = "F:/bigdata/kaggle_criteo/dac/temp")
# Load in the big data
ffx = read.table.ffdf(file="train.txt", # File Name
                      sep="\t", # Tab separator is used
                      header=FALSE, # No variable names are included in the file
                      fill = TRUE, # Pad rows that have too few fields (missing values become NA)
                      colClasses = c(rep("integer",14),rep("factor",26)) # Specify the import type of the data
                      )
# Assign names to column
colnames(ffx) = c("Label",paste0("I",1:13),paste0("C",1:26))
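Since the full load takes a while, it can help to try the same read call on a tiny file first. The sketch below builds a made-up 20-row file with the same 40-column layout and reads it with the same arguments as above. The toy values and the names toy, toy_file and toy_ffx are my own and are not part of the Criteo data.

```r
library(ff)

# Build a tiny tab-separated file: 14 integer columns, then 26 categorical ones.
set.seed(1)
n <- 20
toy <- data.frame(
  matrix(sample(0:9, n * 14, replace = TRUE), nrow = n),
  matrix(sample(c("a1", "b2", "c3"), n * 26, replace = TRUE), nrow = n)
)
toy_file <- tempfile(fileext = ".txt")
write.table(toy, toy_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Same read call as for train.txt, pointed at the toy file.
toy_ffx <- read.table.ffdf(file = toy_file, sep = "\t", header = FALSE,
                           fill = TRUE,
                           colClasses = c(rep("integer", 14), rep("factor", 26)))
colnames(toy_ffx) <- c("Label", paste0("I", 1:13), paste0("C", 1:26))

dim(toy_ffx)  # rows and columns of the resulting ffdf
```

If this runs cleanly, the package setup and column specification are probably fine. To my knowledge read.table.ffdf also accepts an nrows argument, so a partial read of the real train.txt should be possible as well.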
Quicker load on subsequent runs
Instead of recreating the ffdf object each time we open the workspace, we opt to save the ffdf object using:
# Export created R Object by saving files
ffsave(ffx, # ffdf object
file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Permanent Storage location
# Last name in the path is the name for the file you want
# e.g. ffdac.Rdata and ffdac.ff etc.
rootpath="F:/bigdata/kaggle_criteo/dac/temp") # Temporary write directory
# where data was initially loaded via the options(fftempdir) statement
After the ffdf object has been saved, we are able to open the ffdf object using:
# Load Data R object on subsequent runs (saves ~ 20 mins)
ffload(file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Load data from archive
overwrite = TRUE) # Overwrite any existing files with new data
Note, if we modify the ffdf object, for example by cleaning the data, we need to RESAVE the object! Otherwise, our modifications will not be stored in the permanent directory and will only exist for the duration of the R session.
Sample modeling using ffbase’s biglm hook.
One of the nice benefits of using ffbase is the many options it has for working with ff data. In particular, there is a wrapper that allows us to feed information into the biglm package without having to worry about converting the ffdf object into a bigmemory matrix.
# Model
# Get predictor variable names (only 1 categorical is included)
data_variables = colnames(ffx)[c(-1,-(18:40))]
# Create model formula statement
model_formula = as.formula(paste0("Label ~", paste0(data_variables, collapse="+")))
## YOU MUST CLEAN THE DATA BEFORE RUNNING THE REGRESSION! RUNNING THE REGRESSION WITH MISSING VALUES WILL YIELD AN ETA ERROR!
# Use a modified version of bigglm so that bigglm will not try to convert to a regular data.frame
model_out = bigglm.ffdf(model_formula, family=binomial(), data=ffx, chunksize=100, na.action=na.exclude)
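The script above warns that the data must be cleaned first but does not show how. One possible approach, which is not necessarily the one I would settle on for the competition, is to walk each integer column chunk by chunk and fill in the missing values. The sketch below does this on a toy ff vector standing in for a column such as ffx$I1. Replacing NA with 0 is a placeholder assumption; a median or a separate "missing" indicator may suit the data better.

```r
library(ff)

# Toy ff integer vector with missing values (stand-in for a real column).
v <- ff(c(1L, NA, 3L, NA, 5L, 6L))

# Process one chunk at a time so only that chunk is held in RAM.
for (idx in chunk(v)) {
  block <- v[idx]               # pull the chunk into memory
  block[is.na(block)] <- 0L     # placeholder imputation: NA -> 0
  v[idx] <- block               # write the cleaned chunk back to the ff file
}

v[]  # read the whole (small) vector back to inspect it
```

After cleaning the real columns this way, remember to resave the ffdf object with ffsave so the changes persist.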