Though Hadoop is all the rage in big data circles, it may not be the best fit for every situation and sometimes should be set aside in favor of alternatives, a big data expert said at a recent conference.
Dean Wampler, principal consultant at Think Big Analytics, a big data solution integrator, said Hadoop needs to become more approachable for developers. In fact, Wampler compared Hadoop to Enterprise JavaBeans (EJB) of Java 2 Platform, Enterprise Edition (J2EE) fame, a technology long known for being complex and onerous for programmers. The Spring Framework helped ease that pain.
"Hadoop is the EJB of our time," Wampler said while speaking at the QCon New York 2012 conference. "It does work, but I think it has a lot of issues in the way it's implemented and in its limitations. I sense there are things out there coming that will be like the Spring Framework was for EJB."
If Rod Johnson, creator of the Spring Framework and founder of SpringSource (now part of VMware), was the angry man of Java, Wampler just might be the slightly irritated man of Hadoop. Johnson would appear at Java-oriented conferences and rail about the unnecessary complexities of EJB to packed audiences. He later parlayed that into a fortune when VMware acquired SpringSource for $420 million in 2009.
"I don't consider myself the reincarnation of Rod Johnson, partly because I haven't implemented an alternative myself," Wampler told eWEEK. "Hadoop is a better technology than EJB, simply because it works better than EJB ever worked, but I see unfortunate parallels."
Wampler said the Hadoop Java API is too verbose and object-oriented. "The API is too invasive and it obscures the application logic," he said. "I call this API the assembly language of MapReduce programming."
Moreover, Wampler added, "Writing applications in the Java API has the same problems that the EJB APIs had; your application logic is mixed in with messy infrastructure code, rather than cleanly separated and more modular. There are insufficient abstractions for all the design problems people face. One of Rod Johnson's innovations was the separation of application logic and infrastructure code. Another innovation was using less heavyweight infrastructure, which both scales down and up better than bloated middleware. Hadoop doesn't scale down very well, which may seem like a nonissue, but I've found that the best architectures have this property of scaling over a wide range."
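To make the complaint concrete, here is a minimal sketch of the classic word-count mapper written against the standard Hadoop Java API; it is our illustration, not code from Wampler's talk. Almost everything in it, from the Writable wrapper types and generic type parameters to the Context object, is framework plumbing surrounding a single line of actual application logic.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Classic word-count mapper: for each input line, emit (word, 1).
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // the one line of actual domain logic
    }
  }
}
```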
Wampler noted that his remarks apply to MapReduce generally, not just Hadoop. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce.
For his part, Wampler argues that "mathematics is the best way to work with data and functional programming [FP] is the closest software paradigm to mathematics. The benefits of object-oriented programming are of little use, and OOP is missing crucial concepts that FP provides. This affects both users' applications and the implementation of Hadoop. I see lots of bloated code in the Hadoop code base and missing features that a good FP implementation would fix."
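For contrast, the same computation can be expressed as a pipeline of functional collection operations. The sketch below uses Java 8 streams, which arrived after this 2012 talk and is not code Wampler showed, purely to suggest the concision he has in mind: the transformation pipeline is essentially the whole program, with no framework plumbing interleaved.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FunctionalWordCount {

  // Word count as a pipeline: split lines into words, group identical
  // words, count each group.
  static Map<String, Long> wordCount(Stream<String> lines) {
    return lines
        .flatMap(line -> Arrays.stream(line.split("\\s+")))
        .filter(word -> !word.isEmpty())
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  public static void main(String[] args) {
    System.out.println(wordCount(Stream.of("the quick brown fox", "the lazy dog")));
  }
}
```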
Essentially, Wampler recommends that organizations carefully assess their needs, because they might not need Hadoop at all. He also recommends looking at alternatives such as Storm, a Hadoop-like technology. Storm makes it easy to write and scale complex real-time computations on a cluster of computers, doing for real-time processing what Hadoop did for batch processing.
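As a rough sketch of that idea, the following wires up a small Storm topology that maintains a running per-word count over a live stream of words. It is our example, assembled from Storm's public TopologyBuilder API and the TestWordSpout test class that ships with Storm, not something Wampler presented; the point is that counts update continuously as tuples arrive, rather than at the end of a batch job.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class RollingWordCountTopology {

  // Keeps a running count per word and re-emits (word, count) for every
  // tuple that streams through.
  public static class WordCount extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // TestWordSpout ships with Storm and emits random words continuously.
    builder.setSpout("words", new TestWordSpout(), 2);
    // fieldsGrouping routes the same word to the same task, so per-word
    // counts stay consistent when the bolt is parallelized.
    builder.setBolt("counter", new WordCount(), 4)
        .fieldsGrouping("words", new Fields("word"));

    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("rolling-word-count", conf, builder.createTopology());
  }
}
```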