SANTA CLARA, Calif. - Yahoo, which is running 38,000 Apache Hadoop Web servers and counting among its hundreds of thousands of other servers, is quickly building quite an ecosystem for its heavy-lifting cloud computing software platform.
Hadoop, an open-source project created by Yahoo developers in 2005 that became an Apache project in late 2006, has engaged dozens of core developers, hundreds of contributors and thousands of interested IT folks since then.
The Hadoop software layer handles control and scaling of Yahoo’s exponentially increasing volumes of data. In only five years, the company has taken Hadoop from a 20-server prototype in Yahoo Labs to the world’s largest Web server deployment running in production across Yahoo’s global network.
The attendance at the annual Hadoop Summit at the Santa Clara Convention Center has been growing in parallel to the amount of data Yahoo has to process each day, if not quite as fast. The first Hadoop event attracted about 300 people in 2008, and it increased to about 600 last year, while this year more than 1,000 people crammed into the convention center ballroom here on June 29.
So there’s much interest in how this software works. Thousands of Websites are experiencing problems in dealing with the processing and storage of a deluge of business and personal data, and there is a lot to learn from how Yahoo is approaching this.
At the summit, Yahoo announced two key enhancements to the beta-level platform: Hadoop with Security, and Oozie, a new workflow engine.
“Hadoop with Security is Hadoop integrated with Kerberos [authentication securityware], which amounts to a set of security updates that enable much stronger authentication,” Blake Irving, Yahoo’s new chief product officer, told summit attendees. “Hadoop with Security brings more secure collaboration and sharing of authenticated data.”
Hadoop with Security also sets the stage for secure cloud computing multitenancy by providing authenticated secure access and processing of sensitive data, Irving said.
Oozie, which integrates with Hadoop with Security, is the platform’s new open-source workflow management and coordination engine for developers managing jobs running on Hadoop servers. It includes support for Hadoop Distributed File System, Pig and MapReduce jobs, and it is designed for Yahoo’s internal compute-intensive use cases that require managing complex work processes and ETL (extraction, transformation and loading) on a global scale, Irving said.
Both Hadoop with Security and Oozie are available as free downloads.
Yahoo currently uses Hadoop only for internal purposes, but because the software is open source, it is freely available for anybody to use.
Yahoo originally used Hadoop for specific science projects, but the software quickly morphed into the enterprise-class platform the company uses today to personalize its users’ experiences. Hadoop plays a key role in Yahoo’s home page, Yahoo Search, Yahoo Mail and other services by remembering user preferences, among other things.
“Businesses across all sectors are looking for ways to leverage the vast quantities of data they are accumulating, and Apache Hadoop is an efficient solution for processing data at scale,” said IDC analyst Melanie Posey. “Hadoop has matured and is now becoming an enterprise-ready cloud computing technology with the addition of Kerberos authentication. Now organizations of various sizes can leverage Yahoo’s Hadoop investment and deployments to run it on their own systems and build out their own Hadoop deployments without starting from scratch on internal science experiments.”
How else can Yahoo monetize this considerable five-year investment? Yahoo is not, and never has been, in the software-selling business.
“We’re already monetizing Hadoop every day,” Shelton Shugar, senior vice president of cloud computing at Yahoo, told eWEEK. “We use it to optimize our advertising and ad placement businesses here at Yahoo, and it’s a very important ingredient in our overall IT environment.”
On the idea of commercializing Hadoop, or parts of it, with special Yahoo “secret sauce” of some kind, Shugar told eWEEK that the company has certainly talked about it.
“But at this time, we don’t have any plans to use Hadoop in that way,” he said.
In other news from the summit, Cloudera announced a new distribution of its own Hadoop implementation.
