Processing Massive Amounts of Data
Yahoo's M45 cluster runs Hadoop, an open-source distributed file system
and parallel execution environment that enables its users to process
massive amounts of data. Apache Hadoop is an open-source project of the
Apache Software Foundation, and Yahoo engineers have been its primary
contributors to date.
"Hadoop powers many of our most broadly used and complex systems at
Yahoo, from Web search to optimizing content for the home page," said
Shelton Shugar, senior vice president of cloud computing at Yahoo, in a
statement. "Continuing to invest in the open-source community and in
technologies like Hadoop is an important element in our efforts to
drive breakthroughs in Internet-scale computing and ultimately to
continually improve the quality of the consumer experience of Yahoo. By
partnering with these top educational institutions to share our M45
cluster and our technical expertise, we hope to further key insights
into the next generation of systems software research and development."
Shankar Sastry, dean of the College of Engineering at the University
of California, Berkeley, said: "Access to the cluster is a first step
in helping us analyze the vast amounts of societal-scale information
available on the Web, such as voting records, online news sources and
polling data. The Yahoo cluster will also enable us to conduct
computationally intensive econometrics research, combining economic
theory with statistics to analyze and test large-scale economic
relationships."
"Our partnership with Yahoo will enable us to attack problems
ranging from wildlife preservation and biodiversity, to balancing
socio-economic needs and the environment, to large-scale deployment and
management of renewable energy sources," said Bob Constable, dean of
the Faculty of Computing and Information Science at Cornell University.
"Our vision is to improve upon current technology through the
processing of large data sets," said Jim Kurose, dean of the College of
Natural Sciences and Mathematics at the University of Massachusetts,
Amherst. "Yahoo's supercomputing cluster will enable us to do
data-intensive research on a large set of scanned books drawn from the
Internet Archive's million-book collection. The latter includes 8.5
terabytes of text and half a petabyte of scanned images. Research on
such large datasets would not be possible without the use of clusters
like the one Yahoo is offering us access to."
The partnership with these universities is the next step in expanding
Yahoo's support for cloud-computing research, the company said. In July
2008, Yahoo joined forces with HP, Intel, the University of Illinois at
Urbana-Champaign, the Infocomm Development Authority (IDA) in
Singapore, and the Karlsruhe Institute of Technology (KIT) in Germany
to create Open Cirrus, a global, multi-data-center, open-source test
bed for advancing cloud computing research and education. The
partnership with Illinois also includes the National Science Foundation
(NSF), creating a cloud computing cluster that is available to the
entire NSF academic community, Yahoo officials said. The
international partnership promotes open collaboration among industry,
academia and governments by removing the financial and logistical
barriers to research in data-intensive, Internet-scale computing.
Because the Yahoo M45 cluster is part of the Open Cirrus cloud computing
test bed, the universities named above will also gain access to, and
become part of, the Open Cirrus community.








