An increasing number of jumbo-size enterprise data sets, along with all the technology needed to create, store, network, analyze, archive and retrieve them, are considered “big data.” This massive amount of information is pushing the limits of storage, servers and security, creating an immense problem that IT departments must address.
So what’s the tipping point? When does average-size data become big data?
eWEEK’s crack at this definition, with help from research firm Gartner, goes like this: “Big data refers to the volume, variety and velocity of structured and unstructured data pouring through networks into processors and storage devices, along with the conversion of such data into business advice for enterprises.”
These elements can be broken down into three distinct categories: volume, variety and velocity.
Volume (terabytes, petabytes and eventually exabytes): The increasing amount of business data, created by both humans and machines, is straining IT systems, which are struggling to store, secure and make accessible all that information for future use.
Variety: Big data is also about the increasing number of data types that need to be handled differently from simple email, data logs and credit card records. These include sensor- and other machine-gathered data for scientific studies, health care records, financial data and rich media: photos, graphic presentations, music, audio and video.
Velocity: It’s about the speed at which this data moves from endpoints into processing and storage.
Big Data: Tools, Processes and Procedures
“In simplest terms, the phrase [big data] refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities,” analyst Dan Kusnetzky of the Kusnetzky Group wrote in his blog. “Does this mean terabytes, petabytes or even larger collections of data?
“The answer offered by [IT] suppliers is ‘yes.’ They would say, ‘You need our product to manage and make the best use of that mass of data.’ Just thinking about the problems created by the maintenance of huge, dynamic sets of data gives me a headache.”
In addition to volume, variety and velocity, there’s another “v” that fits into the big data picture: value. Accurate analysis of big data provides value by helping businesspeople make the right decision at the right time.
Whole Sets, Instead of Subsets
Historically, data analytics software hasn’t had the capability to take a large data set and use all of it, or at least most of it, to compile a complete analysis for a query. Instead, it has relied on representative samplings, or subsets, of the information to render these reports, even though analyzing more information produces more accurate results.
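The difference between sampled and full-set analysis can be illustrated with a small, hypothetical sketch: a query answered from a 10 percent subset can miss rare but important records entirely, which is why analyzing the whole set gives a more accurate answer.

```python
# Hypothetical illustration (not from any real analytics product): answering
# a "total sales" query from a 10 percent sample versus the full data set.
import random

random.seed(1)
# 10,000 transactions; a handful are very large "whale" purchases that a
# small sample can easily miss or over-represent.
full = [100] * 9990 + [1_000_000] * 10

sample = random.sample(full, len(full) // 10)   # a 10 percent subset

full_total = sum(full)            # exact answer from the whole set
estimated_total = sum(sample) * 10  # sample estimate, scaled back up

print(full_total)        # the true value
print(estimated_total)   # varies with which rows happened to be sampled
```

Whether the estimate lands near the truth depends entirely on whether the rare high-value rows made it into the sample, which is the weakness full-set engines avoid.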
That approach is changing with the emergence of new big data analytics engines, such as Apache Hadoop, LexisNexis’ HPCC Systems and 1010data’s cloud-based analytics service. These new platforms are causing “the disappearing role of summarization,” said Tim Negris, senior vice president of 1010data. “With regard to big data, it’s one thing to just suck [data] in and put it somewhere, but it’s quite another thing to actually make use of it.
“One of the barriers to this is that most of the database makers, like Oracle and others, require a good deal of work [to prepare the data] prior to actually doing anything with it. We eliminate that and put the data directly in the hands of the analysts.”
Hadoop and HPCC Systems do that, as well. All three platforms provide complete looks at big data sets. Instead of a team of analysts spending days or weeks preparing the parameters for data subsets, and then taking 1, 2 or 10 percent samplings, all the data can be analyzed at one time, in real time.
Why bother? Because data sitting in storage arrays and cloud accounts represents unrefined value in its most basic form. If interpreted properly, the stories, guidelines and essential information buried in storage and databases can open the eyes of business executives as they make strategic decisions for their company.
Management consultant and venture capitalist Peter Cohan, president of Peter S. Cohan & Associates and a faculty member at Babson College in Wellesley, Mass., recently gave a cogent example of this in a Forbes article: Walmart wanted to find out the biggest-selling items people bought before a hurricane hit.
The No. 1 answer, batteries, was not a surprise. But the unexpected No. 2 item was Kellogg’s Pop-Tarts. It turns out that those sugar-boost pastries are great for emergencies: They last a long time, don’t require refrigeration or preparation, and are easy to carry and store.
As a result of this intelligence, Walmart can now stock up on Pop-Tarts in its Gulf Coast stores ahead of storm season. This is where the reach of new-generation business analytics tools shines: in directly helping enterprises make smart decisions.
Hadoop’s Stock in Trade
Apache Hadoop, an open-source software framework, has proved to be the data prospector with the most market traction in the last five years. Originally created by current Cloudera Architect and Apache Software Foundation Chairman Doug Cutting while he worked at Yahoo, Hadoop got its name from a stuffed elephant (an appropriate image for so-called big data) belonging to Cutting’s son.
Hadoop processes large caches of data by breaking them into smaller, more accessible batches and distributing them to multiple servers to analyze. (Agility is a vital attribute: It’s like cutting your food into smaller pieces for easier consumption.) Hadoop then processes queries and delivers the requested results in far less time than old-school analytics software, most often minutes instead of hours or days.
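The split-process-combine pattern described above is the essence of MapReduce, the programming model behind Hadoop. A minimal sketch in plain Python (a toy illustration, not Hadoop’s actual API) shows the idea with a word count, the canonical MapReduce example:

```python
# Toy sketch of the MapReduce pattern Hadoop applies: split the input into
# chunks, process each chunk independently (the "map" step, which Hadoop
# runs in parallel across servers), then merge the partial results (the
# "reduce" step).
from collections import Counter

def map_chunk(chunk):
    """Map step: count the words in one chunk of the data."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """Reduce step: merge per-chunk counts into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# A "large" data set split into smaller batches, standing in for the file
# blocks Hadoop distributes across a cluster.
chunks = ["big data big analytics", "data moves fast", "big big data"]
partials = [map_chunk(chunk) for chunk in chunks]
result = reduce_counts(partials)
print(result["big"])   # prints 4: each chunk was counted alone, then merged
```

Because each map call touches only its own chunk, the work scales out across as many machines as there are chunks, which is what lets Hadoop answer queries in minutes rather than days.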
“The analysts at Gartner and IDC have described big data as being about the volume, velocity and variety of data, and those are the things that draw people to Hadoop as a system,” said Cloudera Vice-President of Products Charles Zedlewski.
After Cutting and his internal Yahoo team came up with the Hadoop code, it was tested and used extensively within the Yahoo IT system for several years. The company subsequently released the code to the open-source community, which enabled a whole new IT sector: the productization of Hadoop.
Giving Away the Code
Why give away the code? Because when Cutting and Yahoo developed, tested and ran the base code in-house, they learned how complicated it was to use. They immediately saw that the money-earning future of the software would come from surrounding services: an intuitive user interface, customized deployments and additional features.
In March 2009, startup Cloudera was the first independent company to take the open-source code and productize the Hadoop analytics engine with its CDH (Cloudera’s Distribution, including Apache Hadoop) and Cloudera Enterprise. An impressive group of investors and advisors teamed up to launch the company, including VMware founder and former CEO Diane Greene, Flickr co-founder Caterina Fake, former MySQL CEO Marten Mickos, LinkedIn President Jeff Weiner and Facebook CFO Gideon Yu.
Since Cloudera’s debut, a handful of top-tier companies and startups have crafted their own versions of Hadoop based on the freely available open-source architecture.
This is truly a new-generation enterprise IT competition. It’s similar to a relay race in that all the contestants have the same type of baton (Hadoop code) and have to compete based strictly on their own speed, agility and creativity. Currently, the race is on among a new set of competitors attempting to market big data analytics to the most enterprises in the most effective way.
Big Bet at Big Blue
IBM, the first large systems maker to use the engine, provides its Hadoop-based InfoSphere BigInsights in basic and enterprise editions. But the company has even bigger plans.
Speaking to a Computer History Museum audience Aug. 4 in Mountain View, Calif., CEO Sam Palmisano said Big Blue is putting a heavy R&D emphasis on new-generation data analytics, describing it as one of the company’s “big bets,” a project that requires at least a $100 million investment. At the same event, IBM Fellow and Computer Science Research Director Laura Haas said that IBM Labs is far beyond the big data research mode and is into “exadata” analytics research. “We’re working on some very, very interesting things in this area,” Haas told eWEEK.
While Haas wasn’t at liberty to discuss details of the plans, Palmisano revealed this in his Aug. 4 presentation: “In about a year from now, you’ll be starting to see the fruits of our ‘big bet’ on big data. The work we’ve been doing for the last several years with Watson [the IBM computer that won Jeopardy! matches against two human champions] will move into products that will be used for a great many purposes, including health care, science and financial applications.
“Our engineers say they’re not far away from building a supercomputer about the size of a human brain that can fit into a shoebox.” Now that’s squeezing big data into a small package.
Other Hadoop Distributions
Newcomer MapR Technologies released a distributed file system and MapReduce engine, the MapR Distribution for Apache Hadoop. It also partnered with storage and security giant EMC to provide another enterprise Hadoop package for that company’s customers.
Another vendor, Platform Computing, launched support for the Hadoop MapReduce application programming interface in its Symphony software. And Silicon Graphics International offers Hadoop-optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.
Announced Aug. 4, the latest Hadoop edition is a Dell/Cloudera configuration for Apache Hadoop. This consists of Hadoop, the Cloudera Enterprise interface, Dell Crowbar software, and a Dell PowerEdge C2100 server with a PowerConnect 6248 48-port Gigabit Ethernet Layer 3 switch. Joint service and support and a deployment guide are included.
Alternatives Coming to the Fore
Alternatives are also surfacing. In addition to 1010data’s cloud service, LexisNexis Risk Solutions, which has been using its own home-developed, large-scale analytics system for 10 years, recently announced that it is sharing some of its intellectual property with the open-source community as an alternative to Hadoop.
The risk-management and fraud-detection service provider has made its data-intensive supercomputing platform available under a dual license through an open-source spinoff called HPCC Systems; the platform can manage, sort, link and analyze billions of records within seconds.
“We think the time is right to do this, and we believe that HPCC Systems will take big data computing to the next level,” said CEO James Peck.
Child of the Mother Ship
In June 2011, Yahoo (which created Hadoop) and Benchmark Capital formed a separate company called Hortonworks, named after Horton, the elephant character in Dr. Seuss’s books. Though run by several former Yahoo employees, Hortonworks will remain independent of its mother ship from a business perspective and will develop its own commercial edition.
Yahoo CTO Raymie Stata, a key figure in this transition, is responsible for all IT development at the company. Even though Hadoop has moved to a new home, Stata told eWEEK that Yahoo doesn’t consider the new company a “spinout.”
“We will have more people within Yahoo working on Hadoop and related technologies than there will be at Hortonworks,” Stata said. “We see this as increasing the investment that’s being made in Hadoop.
“We’re taking some of our key talent and using it to seed Hortonworks, so some employees will be moving from Yahoo to the new company. But this is not downsizing, and it’s not a spinout. It’s increasing the investment in Hadoop. Yahoo will continue to be a major contributor to all aspects of Hadoop going forward.”
Yahoo’s Vision for Hadoop
Stata explained that Yahoo has always had a vision of Hadoop becoming the industry standard in big data analytics software, but it also knew Hadoop would have to establish its own business entity.
One of the main reasons for creating Hortonworks, Stata said, is that Yahoo had already seen what the future holds for enterprise analytics (thanks to its six-year-long Hadoop development stage) and knew what would work. It saw that the need for big data analytics would soon become so widespread that a dedicated company would be necessary to focus solely on that-not on the advertising and Web services businesses that are Yahoo’s meal ticket.
“We have been running a truly enterprise deployment of Hadoop, and I don’t think anybody does that: It’s a departmental solution today,” Stata said. “But it’s not going to be six years before other people are doing enterprise [analytics as Yahoo does]. That gap between Yahoo and the rest of the user base is shrinking.
“It’s great to have an independent company that can have this relationship with Yahoo and see pain points that are on the road ahead. We now need to look at other customers, bring that input in and synthesize it with Yahoo’s more futuristic view. Obviously, an independent company with a commercial mandate is going to do it a lot better than an open-source team inside Yahoo.”
“What we do on Hadoop ultimately creates value for our shareholders,” Stata concluded. “If Hadoop becomes the de facto industry standard for big data processing, that’s goodness for us. That’s been our mission in being so open in the development of Hadoop. We’re getting to the last mile now, and it’s all set up to reach that stature.”