The overall system is managed by the Google File System, which features a master that handles metadata. Data transfers occur directly between clients and chunkservers, files are broken into 64MB chunks, and each chunk is replicated on three machines for safety, Hoelzle said. "The machines are cheap and not reliable, so we take our files and put them into chunks and spread them across a few machines and randomly distribute copies," he said. "So you need to have a master that tells you where the chunks are." The master will look at one chunkserver, and if it gets no response it assumes that server is dead and seeks out the next one.
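The chunk-and-replica scheme Hoelzle describes can be illustrated with a small sketch. This is not Google's code: the `Master` class, the `pick_live_replica` helper, the server names and the `is_alive` callback are all hypothetical, and the only figures taken from the article are the 64MB chunk size, the three copies per chunk, the random placement and the try-the-next-server behavior.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as described in the article
REPLICAS = 3                   # each chunk is stored on three machines


class Master:
    """Toy metadata master (hypothetical): tracks which servers hold each chunk."""

    def __init__(self, servers, seed=0):
        self.servers = list(servers)
        self.locations = {}  # (filename, chunk_index) -> list of server names
        self.rng = random.Random(seed)

    def add_file(self, filename, size_bytes):
        # Ceiling division: a partial final chunk still needs its own entry.
        num_chunks = max(1, -(-size_bytes // CHUNK_SIZE))
        for index in range(num_chunks):
            # Randomly pick three distinct servers for this chunk's copies.
            self.locations[(filename, index)] = self.rng.sample(self.servers, REPLICAS)
        return num_chunks

    def lookup(self, filename, offset):
        # Translate a byte offset into a chunk index and its replica list.
        index = offset // CHUNK_SIZE
        return index, self.locations[(filename, index)]


def pick_live_replica(replicas, is_alive):
    """Try replicas in order, skipping any server that does not respond."""
    for server in replicas:
        if is_alive(server):
            return server
    raise RuntimeError("no live replica for this chunk")


# Example: a 200MB file on ten hypothetical chunkservers.
master = Master([f"cs{i}" for i in range(10)])
chunk_count = master.add_file("crawl.dat", 200 * 1024 * 1024)
index, replicas = master.lookup("crawl.dat", 130 * 1024 * 1024)
dead = {replicas[0]}  # pretend the first replica has failed
server = pick_live_replica(replicas, lambda s: s not in dead)
```

In this sketch the master holds only the location table. The file data itself would travel directly between the client and the chosen chunkserver, which matches the division of labor described above.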
Hoelzle said there are more than 30 clusters at Google, with clusters as large as 2,000 machines to address a petabyte-sized file system. "You'd like to be able to write an application that can run on 1,000 machines in parallel," he said. Google's MapReduce framework provides automatic and efficient parallelization, fault tolerance, I/O scheduling and status monitoring. "MapReduce basically does a grep over the Web on a thousand machines," Hoelzle said. Grep is a Unix/Linux utility that searches one or more input files for lines containing a match to a specified pattern. As for scheduling, the Google system has one master and many workers, and tasks are assigned to workers dynamically, Hoelzle said. The master assigns each map task to a free worker.
MapReduce has broad applicability, Hoelzle said. "It's parallel to Eclipse. If you have a good tool that is easy to use, your users come out of the woodwork. In the first year we had hundreds of MapReduce jobs being written. Our production index system is written on top of MapReduce." As a demonstration, Hoelzle produced a diagram of the Google activity around searches of the term "eclipse" over the last few years. The diagram showed three spikes, all in line with the occurrence of a solar eclipse.
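Hoelzle's "grep over the Web" description of MapReduce can be sketched on a single machine. The example below is an illustration only, not Google's framework or API: worker threads stand in for machines, Python's standard `ThreadPoolExecutor` stands in for the master that hands each map task to a free worker, and the function names and sample pages are invented for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def map_grep(task):
    """Map step: scan one input split and emit (name, line_no, line) matches."""
    name, text, pattern = task
    regex = re.compile(pattern)
    return [
        (name, line_no, line)
        for line_no, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def reduce_concat(partial_results):
    """Reduce step: merge the per-split match lists into one sorted list."""
    merged = [match for part in partial_results for match in part]
    return sorted(merged)


def distributed_grep(splits, pattern, workers=4):
    # The executor plays the role of the master here: each map task is
    # picked up by whichever worker thread becomes free next.
    tasks = [(name, text, pattern) for name, text in splits.items()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_grep, tasks))
    return reduce_concat(partials)


# Hypothetical input: two tiny "pages" standing in for splits of the Web.
splits = {
    "page1": "solar eclipse today\nnothing here",
    "page2": "Eclipse IDE release\nlunar eclipse next month",
}
matches = distributed_grep(splits, r"eclipse")
```

The pattern is case-sensitive, so the "Eclipse IDE release" line is not matched. The sketch leaves out the fault tolerance, I/O scheduling and status monitoring that the article attributes to the real framework.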