Maybe it’s time you asked yourself that hard question: “Should I really be using Hadoop?”

It has become a trend among startups to use Hadoop even when the data is 1TB or smaller, making people’s lives difficult for lack of a better-known option. In an interview I was asked to analyze an 800M file of count data in which the value zero couldn’t occur. I used R’s VGAM package to solve the problem, and when I explained my process to the interviewer, he looked at me as if I’d told him that my dog had done the analysis. He was shocked that I hadn’t used Hadoop. I wondered why he thought I should have used a tool like Hadoop for such a straightforward problem that didn’t involve Big Data.
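The analysis itself fits in a few lines. The post used R’s VGAM; purely as an illustration of the same idea, here is a hedged sketch of maximum-likelihood fitting for a zero-truncated Poisson (counts where zero cannot occur) in Python with scipy. The function name and setup are my own, not from the original interview:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

def fit_zero_truncated_poisson(counts):
    """MLE of lambda for a zero-truncated Poisson (the value 0 cannot occur)."""
    counts = np.asarray(counts)
    n = len(counts)

    def nll(lam):
        # log P(X = x | X > 0) = log pmf(x; lam) - log(1 - P(X = 0))
        return -(poisson.logpmf(counts, lam).sum()
                 - n * np.log1p(-np.exp(-lam)))

    res = minimize_scalar(nll, bounds=(1e-6, counts.max() + 10.0),
                          method="bounded")
    return res.x
```

No cluster required: the whole fit is one bounded one-dimensional optimization.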

Let me say straight off: when Big Data isn’t in the picture, you don’t need to fight Hadoop’s Java memory errors, its file fragmentation, and the slew of other annoyances that accompany its use; it’s not worth the headache. Hadoop may have become a buzzword, but it is not the only tool worth using for data analysis.

Choose what’s right for you

Every computation written in Hadoop can also be written in SQL, R, Clojure, or a Python script. You don’t need Hadoop if you are sure your data isn’t going to grow, unlike, say, log data, which accumulates indefinitely.
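To make that concrete: the canonical Hadoop “hello world”, word counting, is a few lines of plain Python when the data fits on one machine (a sketch, not a drop-in MapReduce replacement):

```python
from collections import Counter

def word_count(lines):
    """The classic MapReduce word-count example, as an ordinary script."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # whitespace tokenization, kept deliberately simple
    return counts
```

For a file of any size that fits on disk, `word_count(open("data.txt"))` streams line by line with no cluster involved.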

If your data is nicely formatted, SQL is the best choice: it’s fast and a fairly tight abstraction, whereas Hadoop’s abstractions are notoriously leaky. If I were working with econometrics or biometrics on a small dataset, however, I would use R, since CRAN (the Comprehensive R Archive Network) has lots of packages for both. The one problem I face with R is that it loads data as objects in memory, which matters if I’m working with 4GB of RAM and numeric data. A file with 2,000,000 rows and 200 columns takes roughly 2,000,000 × 200 × 8 bytes per numeric = 3,200,000,000 bytes ≈ 3,051.8 MB ≈ 2.98 GB, and in practice R needs roughly double that, so at least 6GB. Hence, if you’re thinking about using R, first ask yourself whether you have enough memory.
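That back-of-the-envelope calculation can be captured in a tiny helper (a sketch; R’s real overhead varies by object type, so treat the doubling as a rule of thumb):

```python
def r_numeric_footprint_gb(rows, cols, bytes_per_value=8):
    """Rough in-memory size of an all-numeric R data frame (8-byte doubles)."""
    return rows * cols * bytes_per_value / 1024**3

# The example from the text: 2,000,000 rows x 200 numeric columns.
base = r_numeric_footprint_gb(2_000_000, 200)   # about 2.98 GB resident
needed = 2 * base                               # rule of thumb: budget double
```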

If your data is not nicely formatted, and is too big for R but too small to justify Hadoop, your best options are a Ruby, Python, or Clojure script to read, process, and write the data. Working directly with the filesystem carries far less overhead, since Hadoop (and MapReduce generally) calls for multiple servers. Clojure has the great advantage of running on the JVM, which means it can use any Java-based library immediately. It also has some really good libraries of its own for data analytics: Storm is a library for real-time processing, while Incanter is a Clojure-based, R-like platform for statistical computing and graphics.

When to use Hadoop

Hadoop has multi-output support and is built to scale, which is a strong reason to run it on genuinely Big Data, say larger than 5TB. Likewise, if your data is small for now but you’re confident it will grow to that scale, you should use Hadoop.


2 thoughts on “Maybe it’s time you asked yourself that hard question: “Should I really be using Hadoop?””

  1. Sadly there are not enough articles on this. Personally, I think it should be the case that 2/3 of articles are on how to use Hadoop, with the other 1/3 on the harsh realities behind the hype.

    What I would be interested in is why this is happening. Is it that some data engineers are like that guy who interviewed you, and fail to see the business side of things? Or is it that some business analysts dismiss what’s being said here as “technical”, as if it were an aside to business analysis (which would be ironic, considering what you’ve presented here is business analysis at its best and most to the point), while arrogantly quoting newspaper articles or some formal “Big Data” studies? Or both?


  2. The software world is always awash in hype and FUD. The more data science paves its way into everyday software use cases, the more such misconceptions will fade away. I guess it’s on us to educate others, improve the scene, and remove all the FUD.

