Porting R’s table function to C++

programming
cpp
Author

TheCoatlessProfessor

Published

May 5, 2016

Intro

Have you ever wanted to see if you can write the part of the table() function’s frequency summation for one vector in C++?! Well, that’s exactly what I did by accident. As some of the best results come from pure accidents, I felt it only right to add a post describing how to obtain such a result.

The Plan

First of all, we need to find the unique values for the vector of numbers. To do this, we opt to store the number being counted as a key within a std::map and increment the value each time we observe that number.

Next, we need to export the data from the std::map to a std::vector<std::pair<double,int> > so that we can sort by the number of occurrences.

With a sorted vector in hand by std::pair<double,int>, we need to then write a list structure to export into R. This is the case for because Rcpp does not support std::set<T>.

The Implementation

#include <Rcpp.h>

// Save on the typing... 
typedef std::pair<double, int>  ptype;


// A comparison function to rank values in descending order
bool compare_values(const ptype &p1, const ptype &p2)
{
  return p1.second > p2.second;
}

// Get the top number of observations
// [[Rcpp::export]]
Rcpp::List table_cpp(const Rcpp::NumericVector & v, bool sort_data = false)
{

  // Create a map
  std::map<double, int> Elt;
  
  Elt.clear();
  
  // Fill the map with occurrences per number.
  for (int i = 0; i != v.size(); ++i) {
  Elt[ v[i] ] += 1;
  }
  
  // Get how many unique elements exist... 
  unsigned int n_obs = Elt.size();
  
  // Switch map to a vector so that we can sort by value
  std::vector< ptype > sorted_Elt(Elt.begin(), Elt.end());
  
  if(sort_data){
    // Perform the sort with a custom sort function.
    std::sort(sorted_Elt.begin(), sorted_Elt.end(), compare_values);
  }
  // Else, return. 
  
  // Stop here if you do not need to import into R.
  // Why? There is no ability to export a set w/ a pair into R. *cries*
  
  // Recast for R using Rcpp::*Vectors to avoid another copy)      
  Rcpp::NumericVector result_keys(n_obs);
  Rcpp::IntegerVector result_vals(n_obs);
  
  unsigned int count = 0;
  
  // Need to use iterators to access objects
  for( std::vector< ptype >::iterator it = sorted_Elt.begin(); it != sorted_Elt.end(); ++it )
  {
    // Move them into split vectors
    result_keys(count) = it->first;
    result_vals(count) = it->second;
    
    count++;
  }
  
  return Rcpp::List::create(Rcpp::Named("lengths") = result_vals,
  Rcpp::Named("values") = result_keys);
}

Short Test

Let’s verify that it works by running over some data:

# Set seed for reproducibility
set.seed(223)
(a = sample(seq(0, 1, by = .2), 10, replace = T))
 [1] 0.0 0.0 0.2 1.0 0.4 1.0 0.8 0.6 0.2 0.2

Generates:

     [1] 0.8 0.2 0.0 0.6 0.4 1.0 0.6 0.2 0.8 0.8

And now we seek to obtain the occurrence information:

# Call our function
table_cpp(a)
$lengths
[1] 2 3 1 1 1 2

$values
[1] 0.0 0.2 0.4 0.6 0.8 1.0

Gives us:

$lengths
[1] 3 2 2 1 1 1

$values
[1] 0.8 0.2 0.6 0.0 0.4 1.0

Let’s check by looking at table()’s output very quickly.

table(a)
  0 0.2 0.4 0.6 0.8   1 
  1   2   1   2   3   1 

And now, we go to the winchester to grab a pint and wait for this to …