Two Free Tools to Play with Big Data

The Pew Internet & American Life Project opened a recent report on the future of big data with the following line: “We swim in a sea of data…and the sea level is rising rapidly.” Just how quickly are humans creating data? According to IBM, humans create an incomprehensible 2.5 quintillion bytes of data each day, and 90% of extant data has been created in just the past two years! This rising tide of data is particularly daunting for businesses that now must capture, store and analyze terabytes and petabytes of data instead of kilobytes and gigabytes. One unsurprising consequence of this trend is that data analyst jobs are hot. A simple search on Indeed.com for “data analy*” (which matches both “data analysis” and “data analyst”) reveals over 275 thousand unique opportunities in the U.S., nearly half of which pay in excess of $70K/year. Many of these jobs require knowledge of statistics and, typically, some enterprise data-analysis software such as SPSS.

One potential barrier to entry for students wanting to compete in the data economy is the cost of access to proprietary data analysis software, which raises the question: are there any robust, open-source alternatives to programs like SPSS? There are indeed, and in this post I would like to profile two of the most common and powerful open-source statistical software packages: R and PSPP.

R:

R is a statistical computing language and environment that allows users to conduct statistical analysis on large data sets and generate publication-quality visualizations. R normally holds its data in memory, so the amount it can analyze is limited by the RAM available, but it can still handle substantially more data than Excel, which is capped at 1,048,576 rows per worksheet (65,536 in versions before Excel 2007).

R operates on a Spartan Graphical User Interface (GUI) that looks very similar to a Unix terminal; that is, users typically type lines of code directly into a console. Programming novices, like your author, need not worry, though. R has enough built-in functions to spare users from writing too much code. Folks with programming skills, on the other hand, will find some similarities to the C programming language and will likely be writing their own functions in relatively short order.
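To give a feel for what typing into the R console looks like, here is a minimal sketch that leans on built-in functions and on mtcars, one of the practice data sets that ships with R. The variable names and the small std_error helper are my own choices for illustration, not anything R requires.

```r
# Load a practice data set that ships with R (no download needed).
data(mtcars)

# Built-in functions cover the usual summary statistics.
mpg_mean <- mean(mtcars$mpg)   # average miles per gallon
mpg_sd   <- sd(mtcars$mpg)     # standard deviation of miles per gallon

# Correlation between car weight and fuel economy.
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

# Fit a simple linear regression of fuel economy on weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(summary(fit))

# Writing your own function takes one line.
# std_error is a hypothetical helper name chosen for this example.
std_error <- function(x) sd(x) / sqrt(length(x))
mpg_se <- std_error(mtcars$mpg)
```

Each line can be typed at the console one at a time. Typing a variable name on its own, such as mpg_mean, prints its value.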

R Pros:

  • Robust set of statistical functions
  • Creates publication-quality statistical plots with ease
  • Huge community of developers and tons of freely available practice data sets to play with

R Cons:

  • Users must conquer their fear of the command line

PSPP:

PSPP is open-source statistical computing software for the analysis of sampled data. It looks almost identical to SPSS and, like R, allows users to perform a variety of statistical tests on huge data sets (over 1 billion variables and over 1 billion cases!). Unlike R, PSPP can be operated either directly from the command line or through a more user-friendly GUI (PSPP’s GUI is a dead ringer for SPSS).

PSPP is great for students who want to get SPSS experience but don’t have access to the proprietary software. This is no small point. Searching LinkedIn, Simply Hired and Indeed.com for jobs that require SPSS proficiency yields thousands of opportunities in a wide variety of industries. Furthermore, PSPP’s GUI might prove less intimidating for users who are uncomfortable working from a command line.

PSPP Pros:

  • Can crunch numbers in ridiculously big datasets
  • Has a user-friendly GUI
  • Is basically a free version of SPSS

PSPP Cons:

  • PSPP graphics are not as slick as those in R

How to get R and PSPP

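Download R from the R Project website: https://www.r-project.org/

Download PSPP from the GNU PSPP website: https://www.gnu.org/software/pspp/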