Yahoo Adds Workflows, Authentication to Hadoop
SANTA CLARA, Calif. - Yahoo, which is running 38,000 Apache Hadoop Web servers,
and counting, among its hundreds of thousands of other servers, is quickly
building an ecosystem around its heavy-lifting cloud computing software platform.
Hadoop, an open-source
project created by Yahoo developers in 2005 that became an Apache project in
late 2006, has engaged dozens of core developers, hundreds of contributors and
thousands of interested IT folks since then.
The Hadoop software layer handles control and scaling of Yahoo's exponentially
increasing volumes of data. In only five years, the company has taken Hadoop
from a 20-server prototype in Yahoo Labs to the world's largest Web server
deployment running in production across Yahoo's global network.
Attendance at the annual Hadoop Summit at the Santa Clara Convention Center has
grown in parallel with the amount of data Yahoo has to process each day, if not
quite as fast. The first Hadoop event attracted about 300 people in 2008, and
attendance rose to about 600 last year. This year, more than 1,000 people
crammed into the convention center ballroom here on June 29.
So there's much interest in how this software works. Thousands of Websites are
experiencing problems in dealing with the processing and storage of a deluge of
business and personal data, and there is a lot to learn from how Yahoo is
approaching this.
At the summit, Yahoo announced two key enhancements to the beta-level platform:
Hadoop with Security, and Oozie, a new workflow engine.
"Hadoop with Security is Hadoop integrated with Kerberos [authentication
securityware], which amounts to a set of security updates that enable much
stronger authentication," Blake Irving, Yahoo's new chief product officer,
told summit attendees. "Hadoop with Security brings more secure
collaboration and sharing of authenticated data."
Hadoop with Security also sets the stage for secure cloud computing multitenancy
by providing authenticated secure access and processing of sensitive data, Irving
said.
Oozie, which integrates with Hadoop with Security, is the platform's new
open-source workflow management and coordination engine for developers managing
jobs running on Hadoop servers. The jobs it coordinates include Hadoop
Distributed File System, Pig and MapReduce operations, and it is designed for
Yahoo's internal compute-intensive use cases that require managing complex work
processes and ETL (extraction, transformation and loading) on a global scale,
Irving said.
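To make the idea of a workflow engine concrete, here is a minimal sketch in Python of the core job such an engine performs: taking a set of jobs with dependencies and working out an order in which they can run. It is an illustration only and is not Oozie's API or definition format; the job names and the `run_order` function are hypothetical.

```python
from collections import deque

# Hypothetical ETL workflow: each job maps to the set of jobs it must wait for.
workflow = {
    "extract_clicks": set(),
    "extract_profiles": set(),
    "join_pig": {"extract_clicks", "extract_profiles"},
    "aggregate_mapreduce": {"join_pig"},
    "load_report": {"aggregate_mapreduce"},
}


def run_order(jobs):
    """Return job names in an order that respects every dependency."""
    # Copy the dependency sets so the caller's workflow is left untouched.
    remaining = {name: set(deps) for name, deps in jobs.items()}
    # Jobs with no dependencies can start immediately.
    ready = deque(sorted(name for name, deps in remaining.items() if not deps))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Finishing a job may unblock the jobs that were waiting on it.
        for name in sorted(remaining):
            deps = remaining[name]
            if job in deps:
                deps.remove(job)
                if not deps:
                    ready.append(name)
    if len(order) != len(jobs):
        raise ValueError("workflow contains a cycle or an unknown dependency")
    return order


if __name__ == "__main__":
    for step, job in enumerate(run_order(workflow), start=1):
        print(step, job)
```

A production engine adds scheduling, retries and authentication on top of this ordering step, but the dependency graph is the part developers describe when they define a workflow.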
Both Hadoop with Security and Oozie are available as free downloads.
Yahoo currently uses Hadoop only for internal purposes, but because the software
is open source, it is freely available for anybody to use.
Yahoo originally used Hadoop for specific science projects, but the software
quickly morphed into the enterprise-class platform it is today, which the
company uses to personalize its own user experiences. Hadoop plays a key role
in Yahoo's home page, Yahoo Search, Yahoo Mail and other properties by
remembering user preferences, among other things.
"Businesses across all sectors are looking for ways to leverage the vast
quantities of data they are accumulating, and Apache Hadoop is an efficient
solution for processing data at scale," said IDC
analyst Melanie Posey. "Hadoop has matured and is now becoming an
enterprise-ready cloud computing technology with the addition of Kerberos
authentication. Now organizations of various sizes can leverage Yahoo's Hadoop
investment and deployments to run it on their own systems and build out their
own Hadoop deployments without starting from scratch on internal science
experiments."
How else can Yahoo monetize this considerable five-year investment? Yahoo is
not, and never has been, in the software-selling business.
"We're already monetizing Hadoop every day," Shelton Shugar, senior
vice president of cloud computing at Yahoo, told eWEEK. "We use it to
optimize our advertising and ad placement businesses here at Yahoo, and it's a
very important ingredient in our overall IT environment."
On the idea of commercializing Hadoop, or parts of it, with special Yahoo
"secret sauce" of some kind, Shugar told eWEEK that the company has
certainly talked about it.
"But at this time, we don't have any plans to use Hadoop in that
way," he said.
In other news from the summit, Cloudera announced a new distribution of its own Hadoop implementation.
