An expert in big data solutions says Hadoop is good enough for the task of handling big data. However, Hadoop imposes onerous complications on users, much as Enterprise JavaBeans did on enterprise Java developers.
Though Hadoop is all the rage in big data circles, it may not be the best
solution for every situation and should sometimes be set aside in favor of
alternatives, a big data expert said at a recent event.
Dean Wampler, principal consultant at Think Big Analytics, a big data solution
integrator, said Hadoop needs to become more approachable for developers. In
fact, Wampler compared Hadoop to Enterprise JavaBeans (EJB) of Java 2 Platform,
Enterprise Edition (J2EE) fame, which has been known for being complex and
onerous for programmers. The Spring Framework helped ease that pain.
"Hadoop is the EJB of our time," Wampler said while speaking at the QCon New
York 2012 conference. "It does work, but I think it has a lot of issues in the
way it's implemented and its limitations. I sense there are things out there
coming that will be like the Spring Framework was for EJB."
If Rod Johnson, creator of the Spring Framework and founder of SpringSource,
now part of VMware, was the angry man of Java, Wampler just might be the
slightly irritated man of Hadoop. Johnson would appear at Java-oriented
conferences and rail about the unnecessary complexities of EJB to packed
audiences. He later parlayed that into a fortune when VMware acquired
SpringSource for $420 million in 2009.
"I don't consider myself the reincarnation of Rod Johnson, partly because I
haven't implemented an alternative myself," Wampler told eWEEK. "Hadoop is a
better technology than EJB, simply because it works better than EJB ever
worked, but I see unfortunate parallels."
Wampler said the Hadoop Java API is too verbose and object-oriented. "The API
is too invasive and it obscures the application logic," he said. "I call this
API the assembly language of MapReduce programming."
Moreover, Wampler added, "Writing applications in the Java API has the same
problems that the EJB APIs had; your application logic is mixed in with messy
infrastructure code, rather than cleanly separated and more modular. There are
insufficient abstractions for all the design problems people face. One of Rod
Johnson's innovations was the separation of application logic and
infrastructure code. Another innovation was using less heavyweight
infrastructure, which both scales down and up better than bloated middleware.
Hadoop doesn't scale down very well, which may seem like a nonissue, but I've
found that the best architectures have this property of scaling over a wide
range."
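
The separation Wampler describes can be made concrete with a minimal sketch, written here in Python rather than against the Hadoop Java API. The names `map_words`, `sum_counts`, and `run_job` are hypothetical, chosen for illustration, and are not part of Hadoop. The application logic sits in two plain functions, while a small in-process runner stands in for the infrastructure that a cluster framework would supply.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


# Application logic: two plain functions with no framework types.
def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Collapse all counts seen for one word into a single total."""
    return word, sum(counts)


# Infrastructure: a hypothetical in-process runner (an assumption made
# for this sketch, not a Hadoop API). It knows nothing about words.
def run_job(records: Iterable[str], mapper: Callable, reducer: Callable) -> dict:
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


result = run_job(["big data", "big cluster"], map_words, sum_counts)
print(result)
```

Because the runner never inspects the data it moves, the same two functions could in principle be handed to a different runner, and the whole job also "scales down" to a single process for testing.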
Wampler noted that his remarks apply to MapReduce generally, not just Hadoop.
Apache Hadoop is an open-source software framework that supports
data-intensive distributed applications. It enables applications to work with
thousands of computationally independent computers and petabytes of data.
Hadoop was derived from Google's MapReduce and Google File System papers.
For his part, Wampler argues that "mathematics is the best way to work with
data and functional programming [FP] is the closest software paradigm to
mathematics. The benefits of object-oriented programming are of little use and
OOP is missing crucial concepts that FP provides. This affects both users'
applications and the implementation of Hadoop. I see lots of bloated code in
the Hadoop code base and missing features that a good FP implementation would
fix."
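
The link between functional programming and mathematics can be sketched in a few lines of Python. This is an illustration only, not code from Hadoop or from Wampler's talk, and the helper names `count_words` and `merge` are hypothetical. The point it tries to show is that merging partial counts is associative, so partial results can be combined in any grouping, which is the property that lets a reduce step be split across machines.

```python
from collections import Counter
from functools import reduce


def count_words(line: str) -> Counter:
    """Pure function: one line of text in, its word counts out."""
    return Counter(line.lower().split())


def merge(left: Counter, right: Counter) -> Counter:
    """Combine two partial counts into a new Counter."""
    return left + right


lines = ["big data", "big cluster", "data data"]
partials = [count_words(line) for line in lines]

# Fold the partial results left to right, starting from the empty
# Counter, which acts as the identity for these positive counts.
total = reduce(merge, partials, Counter())

# Associativity: regrouping the merges gives the same answer.
regrouped = merge(partials[0], merge(partials[1], partials[2]))
assert total == regrouped
```

Nothing in the sketch mutates shared state, so each `count_words` call could run on a different node and the results could be merged in whatever order they arrive.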
Essentially, Wampler recommended that organizations carefully assess their
needs because they might not need Hadoop. He also recommended looking at
alternatives such as Storm, a distributed system for real-time computation.
Storm makes it easy to write and scale complex real-time computations on a
cluster of computers, doing for real-time processing what Hadoop did for batch
processing.