Oddly enough, I spend a lot of my time talking about different kinds of data structures and when they should be used. Recently, one of my fellow graduate students discovered why using sparse matrices instead of regular matrices are beneficial. Principally, the sparse matrix format reduces the size of the overall data structure as it discards uneventful values. For instance, if an entry value is zero or very close to it, then retaining such values wouldn’t be informative. Instead, there would be a preference to only store the locations inside the matrix that have non-zero entries.
The Matrices War: Regular Matrices vs. Sparse Matrices
To start the discussion, let’s create a regular matrix that is “sparse”. For data, let’s aim to use the binomial distribution to replicate a coin flip under a low probability. Thus, heads would be a 1 and tails would be a 0. So, we’re only going to be interested in knowing when heads lands.
# Set a seed for reproducibilityset.seed(9134)# Generate a matrix containing 1's and 0's with# a low probability of 1.standard_matrix =matrix(rbinom(100*100, size =1, prob = .06), nrow =100)# Show the first four rows and columnsstandard_matrix[1:4, 1:4]
If we convert the matrix to a sparse matrix, the zero entries will be discarded while the non-zero entries are preserved and stored by coordinate. To convert to a sparse matrix, either use the as() or Matrix() functions that reside in the MatrixR package.
# Load the Matrix packagelibrary("Matrix")# Convert the regular matrix to a sparse matrixsparse_matrix =as(standard_matrix, "sparseMatrix")# Alternatively, we could use:# Matrix(sparse_matrix, sparse = TRUE)# Preview the sparse matrixsparse_matrix[1:4, 1:4]
4 x 4 sparse Matrix of class "dgCMatrix"
[1,] . . . .
[2,] . . . .
[3,] . . . 1
[4,] . . . .
Notice, wherever there was a 0 entry a period, e.g. ., now acts as a placeholder. As a result of discarding zero entries, the size of the matrix decreases greatly.
# Standard matrixprint(object.size(standard_matrix), units="Kb")
In short, sparse matrices are incredibly powerful for storing data that has a small amount of entries. Consider using a sparse matrix if you are encountering memory issues.