E-Trade VP Talks Open-Source

 
 
By eweek  |  Posted 2006-01-29
 
 
 


To say that E-Trade Financial has embraced open source is putting it mildly. The financial services firm's open-source journey began in 2001, when it replaced several of its Sun Microsystems Solaris-based servers with IBM x86 Linux-based systems.

E-Trade saved millions of dollars annually in the process, and has extended its open-source interests to middleware and even the use of the community development process as a model for its own internal development.

To learn how E-Trade leveraged open source and how other organizations can apply its lessons, eWEEK Labs Senior Analyst Jason Brooks spoke at length with E-Trade's Vice President of Architecture, Lee Thompson, for eWEEK's January Road Map.

What started E-Trade on this journey?

[Prior to Sept. 11, 2001,] trading volumes were above 300,000 average trades a day, and that dropped to something in the range of 55,000 a day. Our cash flow was affected, obviously, quite dramatically.

You could go into an E-Trade data center, and you could imagine yourself looking down the aisles of shelves. Instead of food, you would see Sun Enterprise 4500s—three full rows of them. And it was costing quite a lot of money to keep those things running.

Around 2001, I had my architecture department allocated for a project that was canceled. We kept on playing with Linux.

I knew that Amazon had already moved over to Linux, and Yahoo was on a FreeBSD stack since its inception. And so I knew that, at some point, architecturally, an open-source operating system would be of some interest.

Around that time, in 2001, the Red Hat 7.2 kernel came out, which had support for SMP and a 32-bit message queue for shared memory. And, all of a sudden, our application booted.

I went to the CIO and said, "I have some people not allocated to a project right now; I'd like a chance to look at this further and do a feasibility study."

And so we grabbed a bunch of our architects and we ran like crazy and got a full stack of our application. …We grabbed Apache, and we grabbed the Jakarta Tomcat servlet system, and we ported a representative stack of our application—our authentication, quote services, product services, some of our trading services and the servlets that rendered the HTML—over to this new stack.

So, you can imagine on the left-hand side of your piece of paper, iPlanet Web server, iPlanet app server, Tuxedo on Solaris and Sybase on Solaris. And over on the right side of the picture, for this feasibility study, we did Apache, Tomcat, Tuxedo on Linux and Sybase on Solaris, and we demonstrated that to the CIO and the rest of the technology management on Halloween 2001, a quite memorable meeting.

We ran some load testing on it, and we knew when it fell over. The way the Sun systems worked, we could keep adding more and more test users on the Sun box and it would just keep cranking along; it didn't really elbow. The Linux box was much faster, and then, somewhere around 180 users, it would elbow. … But before 180 users it was much faster.

I demonstrated this on Halloween 2001, and it was pretty simple. These grocery store rows of Sun 4500s were costing about a quarter of a million dollars apiece, and we could run 300 to 400 users on them. And here's this little box that cost us about $4,000. It was running faster, but you could only run 180 people on it.

Josh Levine, the CIO at the time, said, "Well, then we just need to get two, for $8,000. Do it. Put a box in production." So it was a fairly short meeting.
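The arithmetic behind that short meeting can be sketched in a few lines of Python. This is only an illustration built from the figures quoted above: the 350-user figure is an assumed midpoint of the quoted 300 to 400 range, and the sketch counts hardware purchase price only, not power, support or staff costs.

```python
import math

# Figures quoted in the interview; SUN_USERS is an assumed midpoint.
SUN_COST = 250_000    # dollars per Sun Enterprise 4500, as quoted
SUN_USERS = 350       # assumption: midpoint of the quoted 300 to 400 users
LINUX_COST = 4_000    # dollars per x86 Linux box, as quoted
LINUX_USERS = 180     # users at which the Linux box "elbowed"


def cost_per_user(box_cost: float, users: int) -> float:
    """Hardware dollars spent per concurrent test user."""
    return box_cost / users


sun = cost_per_user(SUN_COST, SUN_USERS)
linux = cost_per_user(LINUX_COST, LINUX_USERS)

# How many Linux boxes it takes to carry one Sun box's worth of users.
boxes_needed = math.ceil(SUN_USERS / LINUX_USERS)

print(f"Sun:   ${sun:,.2f} per user")
print(f"Linux: ${linux:,.2f} per user")
print(f"Ratio: about {sun / linux:.0f}x")
print(f"Linux boxes to match one Sun box: {boxes_needed} (${boxes_needed * LINUX_COST:,})")
```

Under these assumptions, two Linux boxes cover one Sun box's load, which matches the CIO's "two, for $8,000."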

What was the hardware at that time?

It was IBM x330s, and then we started buying x335s. At the time, I remember reading something like [IBM proposing] putting a billion dollars of investment into open source and Linux, and moving a lot of the IBM software over. And they were kind of going around and asking companies if they wanted to look into an open-source stack, and we had this research going at the same time, and we were like, "Yeah, we'd love to look at that with you." So we had IBM helping us with that prototype. And of course we used IBM hardware.

The Switching-Over Process


How did the switching over process—to the cheaper hardware and Linux—go?

What we did was, we forked the code, and so the Linux was on a branch, and instead of trying to do the whole stack, we decided to put one box in production. There was a lot of activity at the time when we were rebranding the company, if I remember correctly, from E-Trade to E-Trade Financial, and releasing a new Web site and spending a lot of marketing dollars on a Super Bowl ad. And so here's this architecture department working on a new platform, and there was a lot of concern about disrupting the major push for a new look and feel on the Web site, so what we did is we forked the code. We were constantly merging the changes that were going on for the new Web site into the other branch.

The GCC [GNU Compiler Collection] version that we were using was catching a lot of syntax problems in C code that the Sun compiler was not catching. Anytime you put C code into a new compiler, it seems like syntax always gets caught differently, and so we were fixing all these little things.

We went through a whole test cycle, and we put one box in production [in December 2001]. The market opened, and I think all the rows of Sun 4500s were warming up, and I think they hit about 25 percent CPU utilization because we take our customer load and spray it across these load balancers and partition the load up so all the machines are working in tandem. [The Linux box] was running about 4 to 5 percent CPU—I mean, almost nothing.

Almost everybody you'll talk to in financial services constantly overdeploys hardware, just because you don't take any risks with that. Everything about our technology is risk avoidance, risk mitigation.

And so, here's this little $4,000 box, playing, literally, with the big boys of computing, and doing very, very well. Everybody was pretty floored by this. A lot of the systems engineers who do our production operation were all going over to the box and couldn't believe how well this little machine was doing.

Again, I think a lot of the timing was almost perfect—one, there was the investment in open source that IBM was doing; two, the emergence of the 32-bit IPC stack in Linux, and also the support for SMP in the Linux kernel; and, three, our precipitous drop in revenue. It just was one of these projects that, from conception to production, took really only about 12 to 13 weeks.

Hardware


So you started out small, in terms of hardware. How did it grow from there?

We had this one box in production—the plan was to pull that box out, and then do this cutover of our Web site from the E-Trade look and feel to the E-Trade Financial look and feel, which coincided with the Super Bowl that year. Everyone was so pleased with the performance and rock-solid reliability that we were seeing, that we left it in. Operations said, we want to look at this, we want to keep it in, we want to study it.

We had a lot of work to do from this mid-December date in 2001 to merge all the changes back into the main code base. We did so, I think, over New Year's Eve that year, and merged the code back, and now we're back to a single code base. And so all the changes that were done for the Linux variation of the stack were merged back into the Solaris code base, and now we're shipping one product out to two different operating systems. And we continued to do that until we got the rest of the stack done, which was the Java application and all the configuration parameters worked out for Apache and OpenSSL, which are different from the way you run the encryption on the iPlanet side.

So we went through that, through spring, bought, I think, one, maybe one and a half racks of servers. I think a rack is 42 servers from the floor to the top. … We plugged all those servers in, installed the apps, and then in the load balancers, kicked over load. Ten percent, 25 percent, 50 percent and then 100 percent of our load went from the gigantic rows of Sun 4500s over to the open-source stack, which was Red Hat, Apache, Tomcat, Tuxedo on Linux.
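A staged cutover like the one described here can be sketched as percentage-based routing. The interview does not say how E-Trade's load balancers were configured, so everything below is a hypothetical illustration: the `bucket` and `route` functions and the session IDs are invented, and only the 10, 25, 50 and 100 percent stages come from the text.

```python
import hashlib

STAGES = [10, 25, 50, 100]  # percent of load sent to the new stack, as quoted


def bucket(session_id: str) -> int:
    """Map a session ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(session_id: str, linux_percent: int) -> str:
    """Return which pool serves this session at the given cutover stage."""
    return "linux" if bucket(session_id) < linux_percent else "solaris"


# Hypothetical session IDs, used only to show the observed split per stage.
sessions = [f"user-{i}" for i in range(10_000)]
for pct in STAGES:
    share = sum(route(s, pct) == "linux" for s in sessions) / len(sessions)
    print(f"target {pct:3d}% -> observed {share:.1%}")
```

Because each session hashes to a fixed bucket, a session that lands on the new stack at 10 percent stays there at every later stage, which keeps each step of the ramp easy to reason about and easy to roll back.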

One of the most dramatic things that happened is, almost everybody who runs some kind of commerce facility on the Internet looks at [benchmarks from] Gomez and Keynote, which are these point-of-presence scanners on the Internet, where they have all these little agents all over the Internet that will run through a sequence of queries on some kind of commerce facility—like if you were going to do reservations for travel or go to Amazon and buy something, or if you want to trade a stock on E-Trade.

On E-Trade, what Keynote does, for instance, is they log in, get a quote, they place a trade, they cancel a trade, and they log out.
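The shape of such a scripted transaction can be sketched as a list of timed steps. This is not Keynote's agent or API, which the interview does not describe; the `run_transaction` helper and the `stub` step are hypothetical, and a real probe would issue HTTP requests where the stub merely sleeps.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_transaction(steps: List[Step]) -> Dict[str, float]:
    """Run each step in order and record its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, action in steps:
        start = time.perf_counter()
        action()
        timings[name] = time.perf_counter() - start
    # The total is computed before the "total" key exists, so it sums only the steps.
    timings["total"] = sum(timings.values())
    return timings


def stub() -> None:
    """Hypothetical placeholder for a real request against the site."""
    time.sleep(0.01)


# The five steps named in the interview, each backed by the placeholder.
STEPS: List[Step] = [
    ("log in", stub),
    ("get quote", stub),
    ("place trade", stub),
    ("cancel trade", stub),
    ("log out", stub),
]

timings = run_transaction(STEPS)
for name, seconds in timings.items():
    print(f"{name:12s} {seconds:.3f}s")
```

The "total" figure is the kind of end-to-end number the interview refers to when comparing stacks.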

On the Solaris stack (a quarter million dollars a box, huge rows of machines), [Keynote measured the transaction] running about 8 to 9 seconds for us. After we were 100 percent on Linux, we were running, I believe, in the 4 to 5 seconds range. This was kind of an inflection point for me, technically.

So we're now at summer of 2002, and at this point, I realized, this is a much, much bigger phenomenon than simply taking [down the] dramatic cost of the data center, which it definitely was: millions and millions and millions of dollars came out of our expenses to run our facility.

This was very welcome news from our business perspective, considering the diminished trading volume that we were incurring. So that was a fantastically successful project. However, something else was also going on, and I did a deep dive on open source at this particular time. I started running lots of different distros. I ended up running Gentoo. Personally, I run the Gentoo distro.

The Gentoo Phenomenon


Is that right? I have Gentoo running on my laptop right now.

OK, so you know the phenomenon. The phenomenon is, the amount of change that you are sustaining on a Gentoo system is orders of magnitude larger than the amount of change that a typical proprietary operating system from anybody (Solaris, HP-UX, mainframes, whatever) [would go through].

Whatever operating system, the rate of patches coming out of the vendor is much lower than what you enjoy on, you know, my Gentoo laptop or your Gentoo machine.

And then I started looking, kind of watching this, obviously, from a technology management perspective. … If you can sustain change faster than somebody else, you're going to survive, and the person who can't sustain the change is not going to evolve, and they're going to die off. This is almost more important a realization than the direct cost savings, which are still phenomenal.

And so, I started looking at our own code base, and kind of, reflectively, going, "OK, how much change are we sustaining on our own code base?" I kind of indirectly compared that to, say, Kernel.org or some of the Apache.org projects, and it's much, much lower. And this is kind of scary.

So what is the secret sauce of this sustainable change rate? Because, the Linux kernel is fairly stable, right? E-Trade is fairly stable, and we have the data to prove that, but how can they sustain a higher rate of change than me? Are they smarter? Are the open-source developers smarter? And I kind of sat on some open-source projects, and fixed some bugs and things like that.

You find out that [developers on these projects] are really smart people, but E-Trade has really smart people, so that's not the answer.

But the methodology is different, and that's the secret sauce, in my mind. The methodology for developing open-source code is completely different.

I could point you at a Wall Street Journal article [at that time in which] Microsoft described a couple thousand developers checking in source code, and it takes a week to build, which is kind of a high-level review of that article. And this is a pretty common process you find in a lot of corporations. There is no open-source project structured like that.

All the open-source projects are structured where there are a limited number of committers. You know Apache got its name because, during the early days of Apache, you couldn't even submit a bug unless you'd submitted a patch; thus, Apache. And so, through their patch-centric, restricted committer access to source code … the aggregate rate of change to that source base goes dramatically higher.

Challenges


That rate of change can be a challenge. Can it be too much, and make things too complicated? I'm referring more to the high volume of change. As any Gentoo user can tell you, you emerge something, and then things don't work, and you have to sort through to fix it.

Yeah, I've been running Gentoo since the 2002 to 2003 time frame, and I've had several issues. I've said to myself, well, you know, the change rate is worth it. Change destabilizes, but change is good, and that's kind of a classic problem. I don't want to suffer from the innovator's dilemma at E-Trade. I want to keep pushing this company very, very hard. So I want to drive change. The downside of that is if you try to change, you can destabilize the system.

[Gentoo's Chief Architect] Daniel Robbins always wanted to do a server variant of Gentoo, which the project, I don't think, ever started, but it's always been something that was kind of on the mind of the Gentoo community: that there should be the top-of-tree distro, and behind it something a little more stable, almost exactly mirroring what the Fedora community project is and the Red Hat AS series of servers.

So, here I am, the guy who's trying to push change. I work on a Gentoo box, while our production system is Red Hat AS 3.4, which is very stable. And so that's kind of a good way of balancing aggressive change and stability, in our mind.

But back to that WSJ article. In some regards, [the Microsoft situation it describes] is the same as at E-Trade. We have a very large code base with a large number of committers, and [there is] the probability of conflicting change occurring. We have a very complicated application, nowhere near as complicated as a project like Microsoft's Longhorn, but complicated enough that you'll have, say, our stock options business, which is an employee benefit.

They have some developers there who can make a change, and then somebody in our cash transfer business puts a change in, and they conflict, and we have to resolve that conflict in our build process.

The way the open-source project does that is that the guys who are submitting the change for the employee benefits site would submit a patch, and the other team doing cash transfer would submit a patch, and the committers would look at both and go, "You know, that might conflict. We should probably do one versus the other."
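The review step described here, where a committer compares two pending patches before either lands, can be sketched as an overlap check. This is a simplified illustration and not how any particular project or E-Trade does it: the `Patch` type, the team names and the file paths are all hypothetical, and real conflicts can also be semantic rather than line-level.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (first_line, last_line)


@dataclass
class Patch:
    author: str
    hunks: Dict[str, List[Range]]  # file path -> changed line ranges


def overlaps(a: Range, b: Range) -> bool:
    """True when two inclusive line ranges share at least one line."""
    return a[0] <= b[1] and b[0] <= a[1]


def conflicts(p: Patch, q: Patch) -> List[str]:
    """Return the files in which the two patches touch overlapping lines."""
    shared = sorted(set(p.hunks) & set(q.hunks))
    return [
        path
        for path in shared
        if any(overlaps(a, b) for a in p.hunks[path] for b in q.hunks[path])
    ]


# Hypothetical patches from the two teams mentioned in the interview.
benefits = Patch("stock-options team", {"accounts/balance.c": [(40, 60)]})
transfer = Patch(
    "cash-transfer team",
    {"accounts/balance.c": [(55, 70)], "transfer/wire.c": [(1, 20)]},
)

print(conflicts(benefits, transfer))
```

The point of the patch-centric model is that this comparison happens before the changes reach the shared tree, rather than being discovered in the build.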

And this is the way that you now do your development at E-Trade?

Let me back up a little bit. I think IBM kicked off the open-source phenomenon in a big way. Of course, Richard Stallman kicked it off in the '80s, but what's happened is it's a disruptive technology event. The first time we looked, the first time we saw Mosaic, in '93 or '94, we thought, "This changes everything." It was a disruptive technology event. The way I see things architecturally, with open source, is it's a disruptive technology event.

There's such a predominant availability of raw material, and that raw material is code, and the change rate of that code is so high. You know it, because you're a Gentoo user; you're subjecting yourself to that rate of change. It's blowing out all the processes that corporations have established for dealing with change. It's an order of magnitude larger.

So, the game is: How do you change your technology development and change processes to adapt to this realization that there's a massive amount of raw material, code, and it changes very, very quickly? That's probably the biggest architectural effort in the company right now.

And so you're at the beginning stages of trying to take that on? And you're looking at open-source projects as examples of how other large projects are dealing with that?

Absolutely correct. And, by the way, Microsoft has a similar issue. Who did they hire recently, over the summer?

Daniel Robbins.

That's right, Daniel Robbins.

You know, if you look at pluggable strategy in software, like a servlet or a Web service or, in this case, just a tarball of code, you look at the number of packages that are in a, say, Red Hat or SUSE distro: it's about 1,200 to 1,400. Fedora's pretty good; I think it's in the 3,000 to 4,000 range. Nice, 4,000 packages. Gentoo has 14,000 packages. BSD ports, as well, 14,000 to 15,000 packages.

Debian, also.

Yes, Debian packages, also quite good. I did a deep dive on distro tools, and actually got a lot of the E-Trade stack running with small source-code bases. I would say, hands down, our developers hated it. And so, we studied it, and we looked at the OpenPKG thing. The other thing we're looking at is using, not a distro automation tool to run what I'm calling the package-centric approach to software development, but a development automation tool like Maven to drive the creation of distros. So we're working with a lot of the Maven guys right now. We'll let you know how it goes.

Desktop Linux


Where else in your organization do you use open source or might you use it? Any desktop Linux plans?

I run rdesktop into Terminal Server, which gets me into the regular Office stack, which is great. It works fine. We keep on thinking about, for instance, our customer service reps. They run an internal tool that's called Genie, and it's all Java-based, so we keep kicking around the idea of starting up a project to run a Linux distro desktop.

I'm a fan of Knoppix. I think it could just run off a CD. There are so many compelling reasons to do so from a security defense perspective. You try to send an attack on a Knoppix system and you're attacking RAM, not a hard drive. So that's something we keep kicking around. I'd say there's a very high likelihood we kick something off in 2006 in that regard.

The savings for desktop Linux probably wouldn't be as dramatic as they were for the data center. What other motivations would lead you to adopt desktop Linux?

No, the economy of scale isn't there. But it's good organizationally, because we run a common technology stack, and the technical community at E-Trade can build up better, more detailed knowledge about one system. We have to know how to defend a Linux deployment on the Internet, for instance, and we have to know how to defend a Microsoft deployment on the desktop.

Some of those defenses are common between the two, and it just seems to make sense that if I'm committed to the Linux server strategy in my production system, for our customers, it would make a lot of sense, organizationally and technically, to use the same defense strategy for my associates.

You're running Red Hat AS. Are you happy with it? Have you thought of moving to another distro? Maybe a non-commercial one?

All the time. One thing I do like is what Red Hat does from a distro point of view: They establish a distro (and when I say distro, I mean a set of packages that are thought to be compatible), and then the open-source guys will keep moving along, and they frequently will backport bug fixes from the top of tree back into their more stable base of packages. … All the Linux distros do a very good job at this style of taking something that, almost by definition, has a change rate that's almost too high for an enterprise to deal with and kind of cherry-picking and going back into a set of packages that are somewhat frozen. You're not doing functional improvement on that distro, you're doing defect fixes.

So it seems to be a good way to balance the change rate and stability, kind of a happy go-between. We've tested SUSE, and it's great. We've tested Red Hat, and it's great.

Are you using the Apache that Red Hat is shipping as part of the distro? What pieces are you customizing, and where are you sticking with the distro maker's packages?

There is a feature that wasn't supported in the Red Hat distro that my application needed, so I'm now creating my own Apache package, immediately.

We have a package management and deployment methodology that predates RPM. In 1999, we were buying lots and lots of Sun hardware, to keep up with the dramatic ramp-up in business. As soon as the boxes came off of the dock, we already had the configuration ready, so the machine would be plugged in, go through a stress test to make sure it didn't fail out of the box, and then we would push the application stack to it and the configuration to it, put it in the load balancer, and it'd be trading.

This methodology is still consistent today on the Linux stack. … When you go to smaller packages of code, you have to get the dependency chaining right. Your organization has to be able to assert, "I depend on our ID tool 1.2.whatever," right? And then somebody else, maybe in QA or maybe ops, says, "This version of our ID tool doesn't work right; you need a different version."

And so you need a community process for defining the distro—meaning, the set of packages that are thought to work together—which is exactly what Gentoo does, a community process for establishing a distro.
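The "dependency chaining" problem described above can be sketched as a consistency check over a proposed set of package versions. This is a minimal illustration under stated assumptions, not E-Trade's tooling or Gentoo's Portage: the package names, version numbers and the prefix-matching rule (so that "1.2" accepts "1.2.whatever") are all hypothetical.

```python
from typing import Dict, List, Tuple

# package -> version chosen for the distro (names and versions are hypothetical)
Distro = Dict[str, str]
# (package, version) -> list of (dependency, required version prefix)
Requirements = Dict[Tuple[str, str], List[Tuple[str, str]]]


def check_distro(distro: Distro, requires: Requirements) -> List[str]:
    """Return readable problems; an empty list means the set is consistent."""
    problems: List[str] = []
    for pkg, version in sorted(distro.items()):
        for dep, prefix in requires.get((pkg, version), []):
            chosen = distro.get(dep)
            if chosen is None:
                problems.append(f"{pkg} {version} needs {dep} {prefix}.x, which is missing")
            elif not (chosen == prefix or chosen.startswith(prefix + ".")):
                problems.append(f"{pkg} {version} needs {dep} {prefix}.x, distro has {chosen}")
    return problems


# "I depend on our ID tool 1.2.whatever"
requires: Requirements = {("quote-service", "3.0"): [("id-tool", "1.2")]}

distro_ok: Distro = {"quote-service": "3.0", "id-tool": "1.2.7"}
distro_bad: Distro = {"quote-service": "3.0", "id-tool": "1.3.0"}

print(check_distro(distro_ok, requires))
print(check_distro(distro_bad, requires))
```

A check like this only reports disagreements. Deciding which version goes into the set is the community process the interview describes.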

The Development Process


At E-Trade, how large is the community of developers, and how do you manage them? Are there mailing lists, etc.?

Yes, there's an architectural review process and an engineering deployment process. There's a system engineering practice that defines the packages that make up an application stack, and it's a process that we're going to tool up to look more like what we're learning from the way open-source teams work on it.

You were talking earlier about Solaris. I kept thinking about what Sun's been doing with open source: OpenSolaris, talking about making Java Enterprise System open source. What's your take on all of this?

It's great.

Have you done any testing of Solaris 10 on the sort of hardware on which you're running Linux?

No. Right now, we're concentrating on what's running in production, and making that better, and trying to match our change rate to what we're seeing as the upstream change rate hitting us. That's pretty much the thing we're studying the most right now, which is so far away from an operating system.

So you're happy with what Red Hat is doing for you, and focusing more on your own application?

Yes, and the numbers are just phenomenal. We have a lot of internal metrics on our technology, and we review them all the time. And all those metrics are very, very good. As for the external measurements, we won all the Keynote awards, for both banking and brokerage technology, in the last two years, and we're very, very pleased with that.

Your technology is obviously a very big differentiator, but are there any parts of your internal stack that might be a fit to open source, to tap into the forces that are building your foundational software? Do you ever think of that?

Every day. We've benefited in so many ways from what community processes have taught us, and I think it's just natural that there might be something that we do that the open-source community may be interested in. I'm actually talking with OSDL [Open Source Development Labs] about just such an idea. … We've been a consumer, not a producer. What if E-Trade became a producer? What would that look like? We've been asking for some help on ideas like that.

Opening Components


What are some of the components that you might open up in that way?

We've been an SOA [service-oriented architecture] since 1997. The good thing about that (you know, being ahead of the curve in computational topology) is that, a decade later, you find out that, hey, we made the right guess. It was a guess.

But there's a problem with it: You go look at an open-source project like Axis, and it's all Java. Well, guess what? We did all our SOA in C.

This is a real weird thing, because, here we made the right guess, but we can't leverage the community knowledge that's going into open-source containers because most of the good ones are Java.

The container we are running is an inversion-of-control pluggable strategy for services, with a lot of features that are pointed at a data center, like measurement and operability of services. We don't see anything like that that will run C code. We might make it an open-source project. I don't know.
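For readers unfamiliar with the term, an inversion-of-control container with built-in measurement can be sketched in a few lines. E-Trade's container is written in C and is not public, so the sketch below is only a generic illustration of the pattern in Python; the `ServiceContainer` class, its methods and the "quote" service are hypothetical.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, List


class ServiceContainer:
    """Toy container: services plug in by name and the container owns invocation."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., Any]] = {}
        self._latencies: Dict[str, List[float]] = defaultdict(list)

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        """Plug a service into the container under a name."""
        self._services[name] = handler

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        """Invoke a service and record how long the call took, even on failure."""
        handler = self._services[name]
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            self._latencies[name].append(time.perf_counter() - start)

    def stats(self, name: str) -> Dict[str, float]:
        """Expose simple per-service measurements for operations staff."""
        samples = self._latencies[name]
        return {"calls": len(samples), "total_seconds": sum(samples)}


container = ServiceContainer()
# Hypothetical quote service; a real one would look up market data.
container.register("quote", lambda symbol: {"symbol": symbol, "price": 0.0})

result = container.call("quote", "XYZ")
print(result, container.stats("quote"))
```

The service code never times or logs itself. Because the container controls every call, measurement and operability come for free to anything plugged into it.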

Have you been thinking about switching over to Java?

We've been a huge Java shop since very early, 1997. Again, we kind of guessed that this idea of an app server was a good idea, so we used this project called Kiva, which runs all of our model view control strategy for services. You can imagine, inside E-Trade, there's a service for quotes, there's a service for portfolio positions. Invoking those services for data, and running a model view control strategy to generate HTML, this is the idea around some of these app servers like Kiva.

So we used Kiva, and then Kiva turned into the Netscape app server, which turned into the iPlanet app server, and some of that code we wrote in 1997 now runs in Tomcat, under a servlet.

So it's a layered architecture, and you can say that 50 percent of our code base is Java, and 50 percent is these Web services that run in C.

It's currently up to about 7 million lines of code, and, so, we thought, why don't we just port all that C code over to Java? We actually seriously thought about it, and the answer is, it's very reliable, it's stable. We took all the nose bleeds to get it that way ...

In your consideration of open sourcing some of your code, are you looking at it as a way of "giving back," or do you think you'd get something out of having the community involved in development?

I think that a good architecture for computation has no concept of language. There's a big push on SOA, and we've believed in it since the late '90s. And the Java community moves so fast; it seems like they're way ahead of the game in developing the concepts surrounding SOA.

If SOA is the right idea, and I think it is the right idea, then there will be implementations of SOA in every language, and maybe I can help kick that off.
