SANTA CLARA, Calif. - Yahoo, which is running 38,000 Apache Hadoop Web servers, and counting, among its hundreds of thousands of other servers, is quickly building quite an ecosystem for its heavy-lifting cloud computing software platform.
Hadoop, an open-source project created by Yahoo developers in 2005 that became an Apache project in late 2006, has engaged dozens of core developers, hundreds of contributors and thousands of interested IT folks since then.
The Hadoop software layer handles control and scaling of Yahoo’s exponentially increasing volumes of data. In only five years, the company has taken Hadoop from a 20-server prototype in Yahoo Labs to the world’s largest Web server deployment running in production across Yahoo’s global network.
Attendance at the annual Hadoop Summit at the Santa Clara Convention Center has been growing in parallel with the amount of data Yahoo has to process each day, if not quite as fast. The first Hadoop event attracted about 300 people in 2008; attendance rose to about 600 last year, and this year more than 1,000 people crammed into the convention center ballroom here on June 29.
So there’s much interest in how this software works. Thousands of Websites are struggling to process and store a deluge of business and personal data, and there is a lot to learn from how Yahoo is approaching this.
At the summit, Yahoo announced two key enhancements to the beta-level platform: Hadoop with Security, and Oozie, a new workflow engine.
“Hadoop with Security is Hadoop integrated with Kerberos [authentication securityware], which amounts to a set of security updates that enable much stronger authentication,” Blake Irving, Yahoo’s new chief product officer, told summit attendees. “Hadoop with Security brings more secure collaboration and sharing of authenticated data.”
Hadoop with Security also sets the stage for secure cloud computing multitenancy by providing authenticated secure access and processing of sensitive data, Irving said.
Oozie, which integrates with Hadoop with Security, is the platform’s new open-source workflow management and coordination engine for developers managing jobs running on Hadoop servers. It supports jobs that use the Hadoop Distributed File System, Pig and MapReduce, and it is designed for Yahoo’s internal compute-intensive use cases that require managing complex work processes and ETL (extraction, transformation and loading) on a global scale, Irving said.
Both Hadoop with Security and Oozie are available as free downloads.
Yahoo currently uses Hadoop only for internal purposes, but because the software is open source, it is freely available for anybody to use.
Yahoo originally used Hadoop for specific science projects, but the software quickly morphed into the enterprise-class platform it is today, which Yahoo uses to personalize its users’ experiences. Hadoop plays a key role in Yahoo’s home page, Yahoo Search, Yahoo Mail and other services by remembering user preferences, among other things.
“Businesses across all sectors are looking for ways to leverage the vast quantities of data they are accumulating, and Apache Hadoop is an efficient solution for processing data at scale,” said IDC analyst Melanie Posey. “Hadoop has matured and is now becoming an enterprise-ready cloud computing technology with the addition of Kerberos authentication. Now organizations of various sizes can leverage Yahoo’s Hadoop investment and deployments to run it on their own systems and build out their own Hadoop deployments without starting from scratch on internal science experiments.”
How else can Yahoo monetize this considerable five-year investment? Yahoo is not, and never has been, in the software-selling business.
“We’re already monetizing Hadoop every day,” Shelton Shugar, senior vice president of cloud computing at Yahoo, told eWEEK. “We use it to optimize our advertising and ad placement businesses here at Yahoo, and it’s a very important ingredient in our overall IT environment.”
On the idea of commercializing Hadoop, or parts of it, with special Yahoo “secret sauce” of some kind, Shugar told eWEEK that the company has certainly talked about it.
“But at this time, we don’t have any plans to use Hadoop in that way,” he said.
In other news from the summit, Cloudera announced a new distribution of its own Hadoop implementation.