Maybe it’s time you asked yourself that hard question: "Should I really be using Hadoop?"

By on 24 Aug 2014

It has become a trend among startups to use Hadoop even when the data is 1 TB or smaller, which makes people's lives difficult simply because they don't know of a better option. In an interview I was asked to analyze an 800 MB file of count data in which the value zero can't occur. I used R's VGAM package to solve the problem, and when I explained my process to the interviewer, he looked at me as if I'd told him that my dog had done the analysis. He was shocked that I hadn't used Hadoop. I wondered why he thought I should have used something like Hadoop for such a straightforward problem that didn't involve Big Data.

Let me say straight off: when Big Data isn't in the picture, you don't need to fight Hadoop's Java memory errors, its file fragmentation, and the slew of other annoyances that come with it. It's not worth the headache. Hadoop may have become a buzzword, but it is not the only tool worth using for data analysis.

Choose what’s right for you

Every computation written for Hadoop can also be written in SQL, R, Clojure, or a Python script. You don't need Hadoop if you are sure that your data is not going to grow, unlike, say, log data, which keeps accumulating.
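To make that concrete, here is a minimal sketch of a word count, the job usually used to introduce MapReduce, written as a plain Python script. The word-count task and the names `map_words` and `word_count` are my own illustration, not something from a particular Hadoop job, and the sketch assumes the input fits comfortably on one machine.

```python
from collections import Counter


def map_words(line):
    """'Map' step: split one line into lowercase words."""
    return line.lower().split()


def word_count(lines):
    """'Reduce' step: sum the occurrences of each word across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(map_words(line))
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; a real script would iterate over a file.
    sample = ["Hadoop is a tool", "R is a tool too"]
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```

Because `word_count` accepts any iterable of lines, passing an open file object instead of a list would process the file one line at a time.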

If your data is nicely formatted, SQL would be the best choice: it is pretty fast and its abstractions leak very little, whereas Hadoop is a leaky abstraction. However, if I were working with econometrics or biometrics on a small dataset, I would use R, since CRAN (the Comprehensive R Archive Network) has lots of packages for those fields. The only problem I face with R is that it loads data as objects in memory, which is a poor fit if I am working with numeric data on a machine with 4 GB of RAM. If a file has 2,000,000 rows and 200 columns, it takes roughly 2,000,000 x 200 x 8 bytes per numeric value = 3,200,000,000 bytes. Dividing by 1,024 twice gives about 3,051.76 MB, and once more gives about 2.98 GB. Since R commonly makes copies of objects while it works on them, you should budget roughly double that, or about 6 GB of RAM. Hence, if you're thinking about using R, you should first ask yourself whether you have enough memory.
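The back-of-the-envelope estimate above can be sketched as a small helper. The function names and the `overhead_factor` parameter are hypothetical; the factor of 2 is just the rule of thumb from the paragraph, not a measured value, and the sketch assumes every cell is an 8-byte numeric.

```python
def estimate_numeric_memory(rows, cols, bytes_per_value=8, overhead_factor=2.0):
    """Return (raw_bytes, working_bytes) for a dense, all-numeric table.

    overhead_factor is an assumed rule of thumb for working copies,
    not a measured property of R.
    """
    raw = rows * cols * bytes_per_value
    return raw, int(raw * overhead_factor)


def to_gb(n_bytes):
    """Convert bytes to GB using 1,024-based units, as in the text."""
    return n_bytes / 1024 ** 3


if __name__ == "__main__":
    raw, working = estimate_numeric_memory(2_000_000, 200)
    print(f"raw: {to_gb(raw):.2f} GB, working estimate: {to_gb(working):.2f} GB")
```

If the working estimate is larger than the RAM you have, that is a sign to look at one of the other options before reaching for R.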

If the data is not nicely formatted, too big for R, and too small to justify Hadoop, your best options are a Ruby script, a Python script, or Clojure to read, process, and write the data. Working directly with the filesystem means less overhead, since Hadoop (and MapReduce) call for multiple servers. Clojure has the great advantage of running on the JVM, which means it can use any Java-based library immediately. Clojure also has some really good libraries that make it a great language for data analytics: Storm handles real-time processing, whereas Incanter is a Clojure-based, R-like platform for statistical computing and graphics.
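As a rough illustration of the read, process, and write approach, the sketch below streams a messy CSV file row by row, so memory use stays small no matter how large the file is. The function name `clean_rows`, the file paths, and the rule that a "bad" row is one with the wrong number of fields are all assumptions made for the example.

```python
import csv


def clean_rows(in_path, out_path, expected_fields):
    """Copy rows with the expected field count to out_path.

    Rows are read and written one at a time. Whitespace around each
    field is stripped. Returns (kept, dropped) row counts.
    """
    kept = dropped = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) != expected_fields:
                dropped += 1  # malformed or blank line: skip it
                continue
            writer.writerow([field.strip() for field in row])
            kept += 1
    return kept, dropped
```

A call such as `clean_rows("raw.csv", "clean.csv", 200)` (hypothetical file names) would leave a tidied file that is easier to load into SQL or R afterwards.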

When to use Hadoop

Hadoop supports multiple outputs and scales as the data grows, which makes it a very strong choice for genuinely Big Data, say anything larger than 5 TB. Also, if your data is small for the time being but you're pretty sure it will grow to that kind of size, you should use Hadoop.
