Database Legend: How Real-Time Data Analysis Will Transform Society

 
 
By Lisa Vaas  |  Posted 2005-02-15
 
 
 



Mike Stonebraker is a database superstar. Not only is the former UC/Berkeley computer science professor the father of the popular relational databases Ingres and Postgres, he was also the founder of Illustra Information Technologies Inc., acquired by Informix, which in turn was acquired by IBM.

The next project for this database pioneer takes shape in the form of StreamBase Systems Inc., a company that's churning out software designed to process, analyze and act on real-time data "within milliseconds of its arrival." Stonebraker is StreamBase's founder and chief technology officer.

StreamBase announced its Stream Processing Engine at the DEMO conference on Monday in Scottsdale, Ariz. eWEEK.com Database Editor Lisa Vaas recently got a chance to talk with Stonebraker about the issue of real-time data analysis, about how it leaves relational databases in its dust and, most importantly, how this cutting-edge technology is poised to transform our society. Financial services comes to mind, of course, but what really fires up Stonebraker are prospects like revolutionizing the care of emergency-room patients, the care of soldiers on the front lines or simply the ability to find your child when she's lost at Disney World.

You've said that streaming data on the fly is something that ordinary relational databases can't handle. Why?

Here's a quick, simple little problem. This was a pilot we were asked to do early on. [It was] a large mutual funds company. They subscribe to every feed on the planet, [including feeds such as Reuters]. They have a current application that watches each feed to determine if the data is late, so they can say, "Don't trust Reuters now, the feed is screwed up."

They defined "late" as [when the] inter-arrival time of ticks between the same stocks is greater than a certain number. You see an IBM tick, and if you don't see another IBM tick in x seconds, it's an indication of late data.

They wanted to issue an alarm if you saw a late tick. Then they wanted to say, "If you see 100 late ticks that are coming from the feed vendor, then ring the red telephone."
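The late-tick rule described above can be sketched in a few lines of Python. This is only an illustration, not StreamBase's implementation or the fund company's C++ code: the `Tick` and `LateTickDetector` names and the five-second threshold are assumptions, and the sketch notices lateness only when the next tick for a symbol finally arrives, rather than firing a timer after x seconds of silence.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    symbol: str       # e.g. "IBM"
    timestamp: float  # arrival time in seconds (hypothetical field)


class LateTickDetector:
    """Flags a tick as late when the gap since the previous tick
    for the same symbol exceeds max_gap seconds."""

    def __init__(self, max_gap: float):
        self.max_gap = max_gap
        self.last_seen = {}  # symbol -> timestamp of most recent tick

    def observe(self, tick: Tick) -> bool:
        previous = self.last_seen.get(tick.symbol)
        self.last_seen[tick.symbol] = tick.timestamp
        if previous is None:
            return False  # first tick for this symbol, nothing to compare
        return tick.timestamp - previous > self.max_gap


# Usage: max_gap=5.0 is an assumed threshold, not a figure from the interview.
detector = LateTickDetector(max_gap=5.0)
ticks = [Tick("IBM", 0.0), Tick("MSFT", 1.0), Tick("IBM", 3.0), Tick("IBM", 10.0)]
late_ticks = [t for t in ticks if detector.observe(t)]
```

Only per-symbol state is kept (one timestamp per stock), which matches the point made later in the interview that nothing here needs to be stored long term.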

The current application is written on top of bare metal in C++. They were unhappy with the performance of the current application, and it was hard to maintain. And expensive.


On this application, they said, "How fast can you go?" We processed about 150,000 messages per second on this, on a $1,500 PC, a commodity piece of hardware. Their current production application does about 3,000 messages per second. The best we could get out of one of the very popular relational databases was 900 messages per second.



In round numbers, we're two orders of magnitude faster than the elephants. And the two orders of magnitude are on identical hardware. If you normalize for clock speed of our production application vs. theirs, we're one order of magnitude faster.
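As a quick check on the round numbers, the throughput figures quoted in the interview (150,000, 3,000 and 900 messages per second) can be compared directly. The variable and function names below are illustrative. The raw ratio against the existing C++ application comes out at 50x; the "one order of magnitude" figure applies after a clock-speed normalization whose factor is not given in the interview, so it is not reproduced here.

```python
import math

# Throughput figures as quoted in the interview (messages per second).
streambase_rate = 150_000  # StreamBase pilot on a $1,500 PC
cpp_rate = 3_000           # existing C++ production application
rdbms_rate = 900           # best result from a popular relational database


def orders_of_magnitude(fast: float, slow: float) -> float:
    """Base-10 logarithm of the speedup ratio."""
    return math.log10(fast / slow)


speedup_vs_rdbms = streambase_rate / rdbms_rate  # roughly 167x
speedup_vs_cpp = streambase_rate / cpp_rate      # 50x before any normalization

print(f"vs. relational database: {speedup_vs_rdbms:.0f}x "
      f"({orders_of_magnitude(streambase_rate, rdbms_rate):.1f} orders of magnitude)")
print(f"vs. C++ application:     {speedup_vs_cpp:.0f}x "
      f"({orders_of_magnitude(streambase_rate, cpp_rate):.1f} orders of magnitude)")
```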

What accounts for this speed gain?

There are three big reasons: One, the elephants store the data. There's no need to store the data. One of the characteristics of real-time, streaming data, it's like IT sushi. It has high value right now, and the value decays very quickly. There's no need to keep the data around for the long term in some sort of repository. That just takes up time, latency and resources to do that.

Reason No. 2 is when you're looking for the inter-arrival time between ticks, that's a time-series notion. When you're doing real-time stream processing, we have time-oriented primitives in the bottom of the screen. … We have extended SQL to something we call StreamSQL, which has extra stuff in it. … We've had to add another notion to SQL, the notion of time windows. You can do SQL-like calculations over time windows. Do them in real time as data is flying by. …
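The idea of computing an aggregate over a time window as data flies by can be illustrated with a small Python sketch. This is not StreamSQL syntax, which the interview does not show; the `SlidingTimeWindow` class and the ten-second width are assumptions chosen for illustration.

```python
from collections import deque


class SlidingTimeWindow:
    """Keeps (timestamp, value) pairs from the last `width` seconds and
    maintains a running total so the average is cheap to read."""

    def __init__(self, width: float):
        self.width = width
        self.items = deque()
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        self.items.append((timestamp, value))
        self.total += value
        # Evict entries that have fallen out of the window.
        while self.items and self.items[0][0] <= timestamp - self.width:
            _, old_value = self.items.popleft()
            self.total -= old_value

    def average(self) -> float:
        return self.total / len(self.items) if self.items else 0.0


# Usage: average price over the last 10 seconds (assumed width).
window = SlidingTimeWindow(width=10.0)
window.add(0.0, 100.0)
window.add(4.0, 102.0)
window.add(12.0, 104.0)  # evicts the tick from t=0
current_average = window.average()  # covers the ticks at t=4 and t=12
```

Old values are discarded as soon as they leave the window, so memory use is bounded by the window width rather than by the length of the stream.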

[Finally,] if you want to count to 100, which is what this [application] had to do in order to decide to ring the red phone, the most efficient way to do that is with four lines of C++. In this application, it makes sense to mix small amounts of code in a general-purpose environment with database-oriented processing steps. We can do that in our architecture: freely intermix C++ with our StreamSQL primitives. The relational guys all run client/server, and C++ code has to run in the client in a separate place from the server. So the client/server architecture slows you down on this style of application.
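A rough Python analogue of that mix is a query-like filtering step feeding a tiny piece of procedural counting code in the same process, with no client/server hop in between. The function names, the event dictionaries and the choice to fire on every 100th late tick (rather than only once) are assumptions made for this sketch; the interview describes C++ mixed with StreamSQL, not this code.

```python
from typing import Callable, Iterable, Iterator


def late_only(events: Iterable[dict]) -> Iterator[dict]:
    """Query-like step: keep only events already marked late."""
    return (e for e in events if e["late"])


def every_nth(events: Iterable[dict], n: int,
              action: Callable[[int], None]) -> int:
    """Small procedural step: call `action` each time n more events arrive.
    Returns the total number of events seen."""
    count = 0
    for _ in events:
        count += 1
        if count % n == 0:
            action(count)
    return count


# Usage: a made-up stream where every other tick is late.
stream = ({"symbol": "IBM", "late": i % 2 == 0} for i in range(500))
alarms = []  # stands in for "ring the red telephone"
total_late = every_nth(late_only(stream), 100, alarms.append)
```

Because both steps are generators and plain functions in one process, each event passes from the filter to the counter without being copied across a network boundary.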

What types of enterprises need this type of fast analysis?

Financial services, industrial process control, monitoring oil refineries, the government: Military and homeland security is full of this style of application. We've been talking to one of the three-letter agencies. The guys who won't give you their business cards. They're monitoring Arabic chatter. When the czar of homeland security says, "The chatter has changed," there's a real-time system processing incoming feeds, computing statistics on incoming Arabic language streams, to actually determine that. They started yakking with us on piloting that application.

Another example: network monitoring, for DOS [denial of service] attacks. Fraud detection.



Another very large financial services company is exploring piloting another application with us. They're terrified the really bad guys, who do credit card fraud and identity theft, will target financial services. This company wants to monitor their worldwide network and watch application-level events. For example, they want to watch every log-in to their systems and watch for suspicious events such as the same user logged in more than once from two IP addresses more than a mile apart.
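The log-in rule in that example can be sketched as follows. Everything here is illustrative: the `Login` and `LoginMonitor` names are hypothetical, each event is assumed to arrive with latitude and longitude already resolved from the IP address by some geolocation lookup that is not shown, and log-out or session expiry is left out for brevity.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Login:
    user: str
    ip: str
    lat: float  # assumed to come from an IP geolocation lookup (not shown)
    lon: float


def miles_between(a: Login, b: Login) -> float:
    """Great-circle distance via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


class LoginMonitor:
    """Flags a log-in when the same user already has a session from a
    different IP address more than max_miles away."""

    def __init__(self, max_miles: float = 1.0):
        self.max_miles = max_miles
        self.active = {}  # user -> earlier log-ins still considered active

    def on_login(self, login: Login) -> bool:
        sessions = self.active.setdefault(login.user, [])
        suspicious = any(
            s.ip != login.ip and miles_between(s, login) > self.max_miles
            for s in sessions
        )
        sessions.append(login)
        return suspicious


# Usage with made-up addresses and approximate Boston / New York coordinates.
monitor = LoginMonitor(max_miles=1.0)
first = monitor.on_login(Login("alice", "192.0.2.10", 42.36, -71.06))
second = monitor.on_login(Login("alice", "198.51.100.7", 40.71, -74.01))
```

In this run the first log-in is not flagged and the second one is, since the two locations are far more than a mile apart.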

RFID [radio frequency ID] must pose big opportunities for this type of real-time data analysis, right?

What's coming is a microsensor revolution. The cost of microsensors is being driven down at a vast rate. … One of my favorite applications: I have kids, I've taken them to Disneyland and Disney World. It's a stressful situation. It's a crowded place, and you don't want to lose your kids, and it's awfully easy to lose them. The paper wristband you wear will turn into an electronic tag, and that will allow parents to dock at a kiosk so you can say, "Exactly where are my kids, so I can go get them?"

Another example: Mass General Hospital in Boston is very interested in getting hospital personnel to wear electronic tags. If there's a code blue, now, they issue a global alarm, and everybody lines up at the door of the person who has the emergency. If they knew where everybody was, they could dispatch the right person more efficiently.

The military is very interested in tagging all soldiers and all vehicles [so they can] monitor medical vital signs in real time.

There will be incredible social good from medical monitoring that will be possible from wireless technology downstream of cheap microprocessing technology.

The current database vendors are all selling one-size-fits-all, with a single engine being good for everything. I think at least in streaming data it isn't true, since there's just a huge performance problem with the one-size-fits-all model. … The one-size-fits-all paradigm is getting stretched. It will be interesting to see how it unfolds in the next few years.
