Though Hadoop is all the rage in big data circles, it may not be the best fit for every situation and should sometimes be set aside in favor of alternatives, a big data expert said at a recent event.
Dean Wampler, principal consultant at Think Big Analytics, a big data solution integrator, said Hadoop needs to become more approachable for developers. In fact, Wampler compared Hadoop to the Enterprise JavaBeans (EJB) of Java 2 Platform, Enterprise Edition (J2EE) fame, which was notorious for being complex and onerous for programmers. The Spring Framework helped ease that pain.
"Hadoop is the EJB of our time," Wampler said while speaking at the QCon New York 2012 conference. "It does work, but I think it has a lot of issues in the way it's implemented and in its limitations. I sense there are things out there coming that will be like the Spring Framework was for EJB."
If Rod Johnson, creator of the Spring Framework and founder of SpringSource (now part of VMware), was the angry man of Java, Wampler just might be the slightly irritated man of Hadoop. Johnson would appear at Java-oriented conferences and rail about the unnecessary complexities of EJB to packed audiences. He later parlayed that into a fortune when VMware acquired SpringSource for $420 million in 2009.
"I don't consider myself the reincarnation of Rod Johnson, partly because I haven't implemented an alternative myself," Wampler told eWEEK. "Hadoop is a better technology than EJB, simply because it works better than EJB ever worked, but I see unfortunate parallels."
Wampler said the Hadoop Java API is too verbose and object-oriented. "The API is too invasive, and it obscures the application logic," he said. "I call this API the assembly language of MapReduce programming."
Moreover, Wampler added, "Writing applications in the Java API has the same problems that the EJB APIs had: your application logic is mixed in with messy infrastructure code, rather than cleanly separated and more modular. There are insufficient abstractions for all the design problems people face." One of Rod Johnson's innovations was the separation of application logic from infrastructure code. Another was using less heavyweight infrastructure, which scales both down and up better than bloated middleware. "Hadoop doesn't scale down very well, which may seem like a nonissue, but I've found that the best architectures have this property of scaling over a wide range."
Wampler noted that his remarks apply to MapReduce generally, not just Hadoop. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop was derived from Google's MapReduce.
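For readers unfamiliar with the model, MapReduce boils down to three phases: a map step that emits key/value pairs, a shuffle that groups values by key, and a reduce step that folds each group into a result. The sketch below is a minimal in-memory illustration of the idea using word counting, the canonical MapReduce example; it is plain Java with no Hadoop dependencies, and all class and method names are illustrative, not part of any real API.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: each line of input emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: fold each group of counts into a total.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data big ideas", "big data tools");
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) emitted.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((k, v) -> result.put(k, reduce(v)));
        System.out.println(result); // {big=3, data=2, ideas=1, tools=1}
    }
}
```

In a real cluster, the map and reduce steps run in parallel across many machines and the shuffle moves data over the network; the structure of the computation, however, is the same.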
For his part, Wampler argued, "Mathematics is the best way to work with data, and functional programming [FP] is the closest software paradigm to mathematics. The benefits of object-oriented programming are of little use, and OOP is missing crucial concepts that FP provides. This affects both users' applications and the implementation of Hadoop. I see lots of bloated code in the Hadoop code base and missing features that a good FP implementation would fix."
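To illustrate the style Wampler is advocating, here is word counting, the canonical MapReduce example, expressed with Java 8 Streams: the mapping, grouping and reducing that MapReduce frameworks spread across separate mapper and reducer classes collapse into a single functional pipeline. This is a hedged sketch using only the standard JDK, not any Hadoop API, and the class name is illustrative.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountFP {
    // Word count as one functional pipeline: split, filter, group, count.
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("Big data, big ideas, big tools"));
    }
}
```

The pipeline has no mutable accumulators and no infrastructure classes mixed into the application logic, which is the separation of concerns Wampler argues MapReduce APIs lack.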
Essentially, Wampler recommends that organizations carefully assess their needs because they might not need Hadoop. He also recommended looking at alternatives like Storm, a Hadoop-like technology. Storm makes it easy to write and scale complex real-time computations on a cluster of computers, doing for real-time processing what Hadoop did for batch processing.
Organizations Should Assess Whether They Need Hadoop
However, Hadoop is the big dog in the big data world. Attendance at Hadoop-related events continues to grow. Cloudera, a leading Hadoop distributor and service provider, has hosted annual Hadoop World conferences to ever-larger sold-out crowds in New York. The event has outgrown two different venues.
And at the last Hadoop World last November, Accel Partners announced a $100 million fund to invest in big data companies. At the show, Ping Li, a partner at Accel, announced its Big Data Fund, calling it "incredibly important given the explosion of big data." The new initiative's goal is to fund transformative early-stage and growth companies throughout the big data ecosystem, from next-generation storage and data management platforms to a wide range of revolutionary software applications and services, including data analytics, business intelligence, collaboration, mobile, vertical applications and more, Accel said.
"As organizations increasingly struggle to extract value from an ever-expanding sea of data, more and more of them are turning to Hadoop," said Stephen O'Grady, an analyst with RedMonk.
Yet despite its popularity, implementers and analysts alike agree that Hadoop needs help to become more palatable for the enterprise.
"We've been working with customers to help them use Hadoop to solve various problems," said Mike Olson, CEO of Cloudera. "Hadoop on its own is not enough to tackle the big data analysis problems and other problems they face."
In other words, "it takes a village," Olson said.
"Big data will require a big ecosystem to make its way into the enterprise," said Tony Baer, principal analyst at Ovum. "Enterprises demand a marketplace of tools, skills and services to take advantage of Hadoop. Cloudera is leveraging its early jumpstart in the Hadoop market with an effective partnering program that is showing true results."
"As interest in Hadoop expands from early adopters to mainstream enterprise and government users, we are increasingly seeing the focus shift from development and testing of the core distribution to understanding potential use cases, and to the value-added tools and services that will enable and accelerate enterprise adoption," said Matt Aslett, senior analyst at 451 Research.
"There's a lot of talk in the industry about making Apache Hadoop deployments easier; however, Cloudera's approach encompasses the entire lifespan of systems, not just the initial setup," Olson said. "Workloads shift, teams change and the types of questions you want to ask change over time. You should be able to easily manage your systems while your usage of Hadoop evolves and grows."
"I also work for a company betting on Hadoop, and we are helping clients be successful with it," Wampler said. "I believe it is a good, but not great, first-generation technology that is meeting needs today. It's fair to say Hadoop is the only game in town. However, I also believe it will be replaced over time with technologies that are more modular, scalable and flexible to address the growing variety of big data applications, just as Spring eventually replaced EJB."
Wampler reiterated that Hadoop reminds him of EJB in almost every way. "And just as the Spring Framework brought an essential rethinking to Enterprise Java, there is an essential rethink that needs to happen in big data," he said. Wampler, an aficionado of the Scala functional programming language, said the Scala community is well-positioned to create change in the big data world.
"I think we should stop using Java," he said. "The object-oriented model is not really the right approach for data. We can take our existing Java code, and we can start writing Clojure or Scala or JRuby. I'm not slamming the JVM [Java Virtual Machine], I'm slamming the language."