eBay: Sold on Grid

eBay Senior VP Marty Abbott spells out how the online auction company has reconstructed its IT architecture and rebuilt its data centers on a gridlike model.

Following the disastrous outages of 1999, online auction company eBay Inc. launched an overhaul of its IT infrastructure, rebuilding its data centers around a grid-type architecture and rewriting its applications in Java 2 Platform, Enterprise Edition. That process was completed in July. Now Marty Abbott, senior vice president of technology, is tasked with maintaining uptime worldwide, around the clock, even as eBay handles more transactions than ever before and expands into new markets. Abbott, a graduate of the U.S. Military Academy who worked at Motorola Inc. and Gateway Inc. before coming to eBay in 1999, laid out eBay's IT strategy to eWEEK Executive Editor Stan Gibson in an interview at eBay's San Jose, Calif., operations center.

How far have you come from 1999 until now?

Our job was to move from crisis to world-class in short order. To do that, we went about deconstructing and reconstructing the site while it was live, while transactions were happening. In the course of about two years, we re-architected the infrastructure and the software for the site.

Before, all the transactions hit one massive database, which was the point of contention, along with some storage problems. Most of the meltdown was due to a monolithic application addressing everything in a single, monolithic database. One application held all the functions of eBay, with the exception of search.

But it didn't work.

It didn't work because of the limitations of SMP [symmetric multiprocessing] systems. High-end scalable multiprocessor systems, first of all, are very costly. It's one of the first things we learned. Two, they represent a single point of failure from which you can recover with clustering, but with downtime.

So the answer was what?

Job One was to eliminate the single point of failure in the huge, monolithic system. Then, reconstruction of the application to ensure that we had fault isolation and that processes and tasks of like size and cost weren't congesting and competing with each other. We disaggregated the monolithic system and ensured scale and fault tolerance.
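The fault-isolation idea Abbott describes, separating work by function so that unlike tasks never compete for the same resources, can be sketched in a few lines of Java. This is a hypothetical illustration, not eBay's actual code; the pool names, server names, and the `ServiceRouter` class are invented for the example.

```java
import java.util.Map;

// Hypothetical sketch: each functional area gets its own isolated pool
// of servers, so a failure or overload in one area (say, bidding)
// cannot congest or take down another (say, search).
public class ServiceRouter {
    private final Map<String, String[]> pools = Map.of(
        "bidding", new String[] {"bid-01", "bid-02"},
        "listing", new String[] {"list-01", "list-02"},
        "search",  new String[] {"search-01", "search-02"}
    );

    // Pick a server within the pool for this function; real systems
    // would use health checks and load-aware selection instead of
    // this simple modulo rotation.
    public String route(String function, int requestId) {
        String[] pool = pools.get(function);
        if (pool == null) {
            throw new IllegalArgumentException("unknown function: " + function);
        }
        return pool[requestId % pool.length];
    }

    public static void main(String[] args) {
        ServiceRouter router = new ServiceRouter();
        System.out.println(router.route("bidding", 7)); // bid-02
    }
}
```

The point of the structure is that the pools share nothing: scaling the bidding tier means adding servers to one array, and losing a server degrades only its own function.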

A good way to think about it is that it's one of the first examples of grid computing. It's an array of systems, each of which has a service component that answers to another system: fault tolerance meant to allow for scale. As a matter of fact, we would have potential vendors and partners come in and try to sell us on the idea of grid computing and we'd say, "It sounds an awful lot like what we were doing. We didn't know there was a name for it."

So you went from how many servers to how many?

We went from one huge back-end system and four or five very large search databases. Search used to update in 6 to 12 hours from the time frame in which someone would place a bid or an item for sale. Today, updates are usually less than 90 seconds. The front end in October '99 was a two-tiered system with [Microsoft Corp.] IIS [Internet Information Services] and ISAPI [Internet Server API]. The front ends were about 60 [Windows] NT servers. Fast-forward to today. We have 200 back-end databases, all of them in the 6- to 12-processor range, as opposed to having tens of processors before. Not all those are necessary to run the site. We have that many for disaster recovery purposes and for data replication.
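Going from one huge database to 200 modest ones implies some rule for deciding which database holds which data. A common way to do this, shown here purely as an illustration and not as eBay's actual scheme, is to hash a key such as an item ID to a shard number. The `ShardPicker` class and its key choice are assumptions for the sketch.

```java
// Hypothetical sketch: spread data across many small databases instead
// of one large SMP machine by hashing a key to a shard index.
public class ShardPicker {
    private final int shardCount;

    public ShardPicker(int shardCount) {
        this.shardCount = shardCount;
    }

    // Map an item id to one of the back-end databases. The double
    // modulo guards against negative hash values.
    public int shardFor(long itemId) {
        return (int) ((Long.hashCode(itemId) % shardCount + shardCount) % shardCount);
    }

    public static void main(String[] args) {
        ShardPicker picker = new ShardPicker(200); // e.g. 200 back-end databases
        System.out.println(picker.shardFor(123456789L)); // 189
    }
}
```

Replicas for disaster recovery, as Abbott mentions, would then be copies of each shard rather than copies of one monolith, which is what lets the total server count grow well past what the live site strictly needs.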
