Talkin’ Bout A Big Data Revolution

Big Data is here and there is no turning back.

No longer are there whispers about Big Data… Big Data is being yelled from the mountaintops, thanks in part to the attention the NSA programs have drawn. Not only that, but students are rising up to demand that Big Data courses (e.g., Hadoop, Storm, Pig, Hive) be offered so that they are more attractive as potential new hires.

Companies appreciate the pressure students are exerting to be trained in Big Data, because those companies hold vast troves of data on each individual. They are desperately seeking to capitalize on the data portfolios they have built up instead of letting them bit rot in a dark server room. But are the companies actually performing a Big Data analysis? Or is it just another “Oooh, they’re doing it… So, we’re doing it! Even though we’re not” ’cause you want to be in the cool kids club ’n stuff.

To describe this new struggle, the term “Jabberwocky” comes to mind. “Jabberwocky” was used in Better Off Ted (S1E12). To companies and students, this is the next thing, this is something new, this is tomorrow land. But it really isn’t. The fact that “Big Data” can be broken down into keywords such as velocity, volume, and variety has lowered the threshold for what counts as “Big Data.”

Truth be told, “Big Data” was available back in the 1980s (e.g., AA loyalty programs and preferred shopper cards), and since then no one has really gotten a clue about “Big Data.” In fact, the theory and hardware behind it are still being actively developed as we speak. In the interim, we try to extend the modeling paradigm by creating data structures that let us look at high volumes of data (large N) and globs of predictors (large P).

Recently, I gave a talk at the Big Data and Analytics Council at UIUC that centered on supervised learning (primarily logistic regression) with large amounts of data in R. The slides are here: Supervised Learning with Big Data using R (R Code). Throughout the talk, the main focus was showing the current modeling paradigm and how it can extend to working with large N through R’s bigmemory, biganalytics, and biglm packages.
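
To give a flavor of what “extending to large N” can look like, here is a minimal sketch (not the code from the talk) of a chunked logistic regression using biglm’s bigglm(). The simulated data, the variable names (x1, x2, y), the coefficients, and the chunk size are all assumptions made up for illustration.

```r
# Sketch only: simulated data stands in for a real large-N data set.
library(biglm)

set.seed(42)
n <- 100000

# Two hypothetical predictors
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Made-up "true" coefficients, used only to simulate a binary response
eta <- -0.5 + 1.2 * dat$x1 - 0.8 * dat$x2
dat$y <- rbinom(n, size = 1, prob = plogis(eta))

# bigglm() works through the rows in chunks (here 10,000 at a time)
# instead of handing the whole data set to the fitting routine at once
fit <- bigglm(y ~ x1 + x2, data = dat,
              family = binomial(), chunksize = 10000)

summary(fit)
```

Here the data still sits in an ordinary in-memory data frame, so this only illustrates the chunked fitting step. For data that does not fit comfortably in RAM, the same idea would be paired with a file-backed big.matrix from bigmemory, for which biganalytics provides a bigglm wrapper.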