eBay: Sold on Grid

By Stan Gibson  |  Posted 2004-08-30

Following the disastrous outages of 1999, online auction company eBay Inc. launched an overhaul of its IT infrastructure, rebuilding its data centers around a grid-type architecture and rewriting its applications in Java 2 Platform, Enterprise Edition. That process was completed in July. Now Marty Abbott, senior vice president of technology, is tasked with maintaining uptime worldwide, around the clock, even as eBay handles more transactions than ever before and expands into new markets. Abbott, a graduate of the U.S. Military Academy who worked at Motorola Inc. and Gateway Inc. before coming to eBay in 1999, laid out eBay's IT strategy to eWEEK Executive Editor Stan Gibson in an interview at eBay's San Jose, Calif., operations center.

How far have you come from 1999 until now?

Our job was to move from crisis to world-class in short order. To do that, we went about deconstructing and reconstructing the site—while it was live, while transactions were happening. In the course of about two years, we re-architected the infrastructure and the software for the site.

Before, all the transactions hit one massive database, which was the main point of contention, along with some storage problems. Most of the meltdown was due to a monolithic application addressing everything in a single, monolithic database. A single application handled all of eBay's functions, with the exception of search.

But it didn't work.

It didn't work because of the limitations of SMP [symmetric multiprocessing] systems. High-end scalable multiprocessor systems, first of all, are very costly. It's one of the first things we learned. Two, they represent a single point of failure from which you can recover with clustering, but with downtime.

So the answer was what?

Job One was to eliminate the single point of failure in the huge, monolithic system. Then, reconstruction of the application to ensure that we had fault isolation and that processes and tasks of like size and cost weren't congesting and competing with each other. We disaggregated the monolithic system and ensured scale and fault tolerance.

A good way to think about it is that it's one of the first examples of grid computing. It's an array of systems, each of which has a service component that answers to another system: fault tolerance meant to allow for scale. As a matter of fact, we would have potential vendors and partners come in and try to sell us on the idea of grid computing and we'd say, "It sounds an awful lot like what we were doing. We didn't know there was a name for it."
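The disaggregation Abbott describes can be illustrated with a small sketch. This is not eBay's actual code—just a minimal Java example of the fault-isolation idea: work is hash-partitioned across an array of independent back-end services, so traffic for one item can never congest, or take down, the other partitions. The class and shard names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of hash-partitioning requests across independent
// back-end services, so a failure or hot spot in one partition is
// isolated from the rest of the grid.
public class PartitionedGrid {
    private final List<String> shards = new ArrayList<>();

    public PartitionedGrid(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add("db-shard-" + i);
        }
    }

    // Deterministically map an item to exactly one shard; the other
    // shards never see traffic for this item.
    public String shardFor(String itemId) {
        int bucket = Math.floorMod(itemId.hashCode(), shards.size());
        return shards.get(bucket);
    }

    public static void main(String[] args) {
        PartitionedGrid grid = new PartitionedGrid(4);
        System.out.println(grid.shardFor("auction-12345"));
    }
}
```

Because the mapping is deterministic, any front-end server can compute the right shard without shared state—which is also what lets capacity grow by adding shards rather than buying a bigger SMP box.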

So you went from how many servers to how many?

We went from one huge back-end system and four or five very large search databases. Search used to take 6 to 12 hours to update from the time someone placed a bid or listed an item for sale. Today, updates are usually less than 90 seconds. The front end in October '99 was a two-tiered system with [Microsoft Corp.] IIS [Internet Information Services] and ISAPI [Internet Server API]. The front ends were about 60 [Windows] NT servers. Fast-forward to today. We have 200 back-end databases, all of them in the 6- to 12-processor range, as opposed to having tens of processors before. Not all those are necessary to run the site. We have that many for disaster recovery purposes and for data replication.

Where are they located?

We have two data centers in Santa Clara County [Calif.], one data center in Sacramento [Calif.] and one in Denver. When you address eBay or make a request of eBay, you have an equal chance of hitting any of those four.
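The "equal chance" routing Abbott mentions can be sketched very simply. This is an assumption-laden illustration, not eBay's traffic-management system (which in practice would sit in DNS and load balancers): a request is assigned to one of the four named data centers uniformly at random.

```java
import java.util.List;
import java.util.Random;

// Sketch of uniform routing across active data centers: every request
// has an equal chance of landing in any of the four sites.
public class DataCenterPicker {
    static final List<String> CENTERS = List.of(
        "santa-clara-1", "santa-clara-2", "sacramento", "denver");

    static String pick(Random rng) {
        // nextInt(4) returns 0..3 with equal probability.
        return CENTERS.get(rng.nextInt(CENTERS.size()));
    }

    public static void main(String[] args) {
        System.out.println(pick(new Random()));
    }
}
```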

So you have, say, 50 database servers per site?

Approximately. And it takes about 50 or so to run the site. Not including search systems.

Do Denver and Santa Clara mirror Sacramento, for example?

No. We've taken a unique approach with respect to our infrastructure. In a typical disaster recovery scenario, you have to have 200 percent of your capacity—100 percent in one location, 100 percent in another location—which is cost-ineffective. We have three centers, each with 50 percent of the traffic, actually 55 percent, adding in some bursts.
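A back-of-the-envelope check of that economics, using the figures quoted in the interview: three active centers each provisioned for about 55 percent of total load means 165 percent of capacity is purchased instead of 200 percent, yet losing any one site still leaves roughly 110 percent of the needed capacity. The sketch below just encodes that arithmetic.

```java
// Capacity arithmetic for N active sites, each provisioned to carry a
// given share of total load, versus a classic 2x active/standby pair.
public class CapacityMath {
    // Capacity remaining after the loss of any single center,
    // expressed as a fraction of total load (1.0 = full load).
    static double survivingCapacity(int centers, double perCenterShare) {
        return (centers - 1) * perCenterShare;
    }

    public static void main(String[] args) {
        double gridModel = survivingCapacity(3, 0.55); // 2 * 0.55 = 1.10
        double classicDr = survivingCapacity(2, 1.00); // 1 * 1.00 = 1.00
        System.out.printf("grid: %.2f, classic DR: %.2f%n",
                          gridModel, classicDr);
        // Total provisioned: 3 * 0.55 = 1.65x of load vs 2.00x --
        // cheaper, and still full capacity after losing one site.
    }
}
```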

What hardware platform are you on now?

We use Sun [Microsystems Inc.] systems, as we did before. We use Hitachi Data Systems [Corp.] storage on Brocade [Communications Systems Inc.] SANs [storage area networks] running Oracle [Corp.] databases and partner with Microsoft for the [Web server] operating system. IBM provides front and middle tiers, and we use WebSphere as the application server running our J2EE code—the stuff that is eBay. The code has also been migrated from C++ to Java, for the most part. Eighty percent of the site runs on Java within WebSphere.

Did you look at Microsoft .Net?

We did, and we thought that using a J2EE-compliant system gave us a lot more flexibility with respect to the other components in our architecture. .Net was definitely a viable solution, and Microsoft's a wonderful partner. But we wanted the extra flexibility of being able to port to other application servers and other underlying infrastructures.

How long will this architecture last?

We believe the infrastructure we have today will allow us to scale nearly indefinitely. There are always little growth bumps, new things that we experience, and not a whole lot of folks from whom we can learn. But using the principles of scaling out, rather than scaling up; disaggregating wherever possible; attempting to avoid state, because state is very costly and increases your failure rate; partnering with folks like Microsoft and IBM, Sun, Hitachi Data Systems, where they feel they have skin in the game and are actually helping us to build something; and then investing in our people, along with commodity hardware and software—applying those principles, we think we can go indefinitely.
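The "avoid state" principle above is worth a concrete illustration. The sketch below is an assumption—method and parameter names are hypothetical, not eBay's API—but it shows the idea: when a request carries everything needed to serve it and the handler stores nothing in-process, any of N interchangeable servers can handle it, and a crashed server loses no session data.

```java
// Sketch of a stateless request handler: no instance fields, no
// session storage -- all inputs arrive with the request itself.
public class StatelessHandler {
    static String handle(String userId, String itemId, long bidCents) {
        if (bidCents <= 0) {
            return "REJECTED";
        }
        // Any identical server could have produced this same answer;
        // nothing here depends on which machine got the request.
        return "ACCEPTED:" + userId + ":" + itemId + ":" + bidCents;
    }

    public static void main(String[] args) {
        System.out.println(handle("u1", "item42", 1500));
    }
}
```

Statelessness is what makes "scaling out" cheap: capacity grows by adding commodity servers behind a load balancer, and failure of any one of them raises error rates only momentarily instead of losing user sessions.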

Would it make more sense to go with a commodity Intel [Corp.]-architecture back end?

We're in a continuous state of re-evaluation, and we're not afraid to swap out where necessary. With the help of Sun people, we've tuned the applications to take advantage of the benefits of their systems. There would be a bit of work to change, but should the time come when we believe there's a significant benefit, we would probably make that move.

Sun's recent strategy, with their alliance with AMD [Advanced Micro Devices Inc.], shows that they are willing to move toward the commodity space. We're actually running a pilot of that architecture: AMD-based Sun servers with a Linux variant.

Are you happy with your uptime? Is there room for improvement?

eBay has a wonderful culture of continual process improvement, of continually raising the bar. I don't think we're ever going to be happy. The biggest step-function improvement we made was the elimination of the scheduled maintenance period. So now eBay is available 24-by-7. The systems are never off.

How do you know if something isn't working for a customer?

We use proxy measuring systems, using partners like Keynote [Systems Inc.], to measure ISP performance, and Gomez [Inc.], which measures availability to the customer.

We also have internal monitoring systems that monitor about 45 cities around the world, where we perform transactions against our site and count the error rates.
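The internal monitoring Abbott describes amounts to synthetic transactions plus error-rate accounting. A minimal sketch of that accounting follows—the city names and counts are made up for illustration, and real probes would of course issue live HTTP transactions rather than take counts as arguments.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-city error-rate accounting for synthetic monitoring:
// probes in each city run test transactions and report how many failed.
public class ProbeStats {
    static double errorRate(int attempts, int errors) {
        if (attempts == 0) {
            return 0.0; // no data yet; avoid divide-by-zero
        }
        return (double) errors / attempts;
    }

    public static void main(String[] args) {
        Map<String, Double> rates = new HashMap<>();
        rates.put("london", errorRate(1000, 3));
        rates.put("tokyo", errorRate(1000, 12));
        rates.forEach((city, r) ->
            System.out.printf("%s error rate: %.3f%n", city, r));
    }
}
```

Tracking the rate per city, rather than a single global number, is what lets an operator distinguish "the site is down" from "one ISP or region has a routing problem."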

A key strategy for eBay appears to be international expansion. How do you support that?

We deliver the content for most countries from the U.S. The exceptions are Korea and China, which have their own platforms. In the other 28 countries, when you list an item for sale or when you attempt to bid or buy an item, that comes back to the U.S. We distribute the content around the world through a content delivery network. We put most of the content that's downloaded—except for the dynamic pieces—in a location near where you live. That's about 95 percent of the activity, making the actions or requests that come back to eBay in the U.S. very lightweight. A page downloads in the U.K. in about the same time that it downloads in the U.S., thanks to our partner Akamai [Technologies Inc.], whose content delivery network resides in just about every country, including China.

Do PayPal [Inc.] transactions that take place in a European country still come back here?

They do—for many of the same reasons. You get to use your capital investment many times over in any country to which you would move. However, PayPal is not yet at the maturity that eBay is. It's an acquisition that we made in October 2002, and it still isn't quite up to where eBay is in availability and reliability.

Are PayPal transactions happening on the eBay infrastructure?

It's a separate platform, but it's starting to look much like eBay's. Our intent is to keep it separate, but the same principles we apply to eBay, we also apply to PayPal. We're starting to disaggregate their systems, trying to ensure that we have appropriate fault isolation and redundancy. They're in our Denver facility and in one of our Santa Clara County facilities.

What about security and phishing?

We ensure that in every piece of our infrastructure and architecture, we have security built in. To date, we've never been hacked. We've never had customer information retrieved from the eBay systems. We've never had an intrusion into our systems of which I'm aware. However, a customer, through an act of phishing, may have given information away. Phishing is an industry problem that happens outside of our platforms, outside of our control. Nevertheless, we attempt to keep people secure through education, making sure they have appropriate firewalls and anti-virus software. We've invested in things like the eBay toolbar to make sure that people stay safe anywhere they go on the Internet. The toolbar includes a feature, introduced in February, called Account Guard. It helps people recognize, reject and report back to us when they're on a site that purports to be eBay or PayPal but isn't. If you're on a spoof site, it will flash red.
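One plausible core of a spoof-site check like the one Account Guard performs is comparing the page's hostname against the genuine domains. To be clear, this is assumed logic for illustration only—the real toolbar's detection is not described in the interview—but it captures why lookalike URLs such as `ebay.com.evil.example` get flagged.

```java
import java.net.URI;
import java.util.Set;

// Illustrative spoof-site check: a URL that is not on a genuine eBay or
// PayPal domain (or a subdomain of one) gets flagged as suspicious.
public class SpoofCheck {
    static final Set<String> GENUINE = Set.of("ebay.com", "paypal.com");

    static boolean isSuspicious(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return true;
        }
        for (String domain : GENUINE) {
            // Match the apex domain exactly, or a true subdomain of it.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "ebay.com" appears in the hostname, but the registered domain
        // is evil.example -- a classic phishing trick.
        System.out.println(isSuspicious("http://ebay.com.evil.example/login"));
    }
}
```

The key detail is anchoring the match at the end of the hostname: a naive substring test on "ebay.com" would wrongly trust the phishing URL in `main`.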

Are you using any specific business intelligence applications?

Until about a year and a half ago, our approach was to dump all our data into our data warehouse and let our very intelligent employees figure out what to do with it. Then we had an epiphany: not all accessors of the data warehouse were created equal. We started to take more of a marketing approach to figure out what different groups needed. We started giving people the access privileges and education appropriate to their needs. We also have data marts sitting off to the side of the warehouse. With each user class came tools specific to that class. So deep-analytics users have access to SAS [Institute Inc.] databases, for instance. Most folks in the middle tier have access to a number of tools, from either Informatica [Corp.] or homegrown ROLAP [relational OLAP] tools. And, finally, at the pinnacle of the pyramid are the executive users, who get information and knowledge delivered to their desks, specific to their business and their needs.

The next vision for this thing is to give the 450,000 people who make their living in whole or in part on eBay information from the data warehouse to help them understand what products are selling the best and what they should put more of on the site.

Looking ahead, as the year progresses, is there any big project that really stands out for you?

We're in a continual improvement process, where our community tells us in real time what's working and what isn't. We've got 114 million people working on our behalf, telling us what to do. We're the heart, the brain, the soul. The greatest thing about this place is that we get real-time feedback from everyone. That's why we have the community boards and why our engineers go out and talk to the community.

Is there one example that has been a result of this process?

My eBay 2.0, which started up this past spring, is a great example. But just about everything we build comes from the community.
