One common misconception about using Hadoop is that use Hadoop if your data is l...

yuanchuan · on Jan 19, 2015

It is the buzz surrounding Hadoop that makes people misunderstand its use and capabilities. I have met non-technical analysts who want RDBMS performance on Hadoop. They expect queries over hundreds of GB of data to finish in seconds to minutes.

I always give this analogy to people who misunderstand Hadoop: would you crack an egg with a stone or with a spoon?

Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that describes their capabilities and use cases.

Ultimately, it is cost vs. efficiency. Hadoop can solve all data problems, and so can an RDBMS. It is an engineering tradeoff that people have to make.

sleepythread · on Jan 19, 2015

I totally agree with you. Capability "LIKE" will drive Hadoop adoption. Hadoop should not be seen as a replacement for an RDBMS. These are two different tools made for different purposes.

pacala · on Jan 19, 2015

> They expect seconds to minutes scale queries on hundreds of GB of data.

Use BigQuery from Google.

yuanchuan · on Jan 20, 2015

On-premise cluster.

Cloud solutions are totally out due to the nature of the data. Not everything can be done in the cloud.

If you have such a huge amount of data, the total time it takes to transfer it there and compute on it is not competitive with an on-premise solution, unless all your data already lives in the cloud.
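To put rough numbers on the transfer part, here is a back-of-the-envelope sketch. The 500 GB figure and the link speeds are assumptions picked for illustration, not numbers from my setup, and the function name is made up. It also assumes the link is fully saturated, which real uploads rarely manage.

```python
def transfer_seconds(data_gb: float, link_mbps: float) -> float:
    """Ideal time to move data_gb gigabytes over a link_mbps megabit/s link."""
    bits = data_gb * 8 * 1000**3          # decimal gigabytes -> bits
    return bits / (link_mbps * 1000**2)   # megabits per second -> bits per second


# Hypothetical dataset size and uplink speeds, chosen only for illustration.
for link_mbps in (100, 1000, 10000):
    hours = transfer_seconds(500, link_mbps) / 3600
    print(f"500 GB over {link_mbps} Mbit/s: {hours:.1f} h")

# 500 GB over 100 Mbit/s: 11.1 h
# 500 GB over 1000 Mbit/s: 1.1 h
# 500 GB over 10000 Mbit/s: 0.1 h
```

That is transfer time alone, before any query runs, so whether it matters depends on how often the data has to be moved and how fat your pipe is.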

pacala · on Jan 20, 2015

I would look into https://spark.apache.org/ then. You can get quite good performance out of it, but you need to put more effort into babysitting your data.