Amazon SimpleDB a Solid Choice for Simple Web-based Data Storage

 
 
By Jeff Cogswell  |  Posted 2009-02-05
 
 
 



In December, Amazon released the beta version of its SimpleDB product. SimpleDB is part of the suite of tools that makes up Amazon Web Services, or AWS.

It has been in the works for quite some time; indeed, Amazon opened an early-access program that people began signing up for a year earlier. Over that year, Amazon made many tweaks and improvements to SimpleDB, apparently listening to the concerns of the people trying it out.

Now, with this new beta release (which anybody can sign up for), we can see what the final product will most likely look like. I thought I'd take it for a test drive, and since I'm primarily a Visual Studio developer, my tool of choice was the C#/Visual Studio library. (Amazon offers several official libraries for other platforms, such as Java, Perl, and PHP.)

Fitting In

SimpleDB takes an unusual approach to storage, giving your cloud-based applications a place to store simple data. The approach is similar to that of a spreadsheet and definitely not relational. (It reminds me of Google's BigTable, which is available in a smaller form, called DataStore, through the Google App Engine.) Amazon also has several other storage offerings should SimpleDB not fit your needs. For example, the company offers Amazon Simple Storage Service (S3) for storing files, and within its main cloud platform, EC2 (Elastic Compute Cloud), you can run any of several database servers, including SQL Server. You're not limited to just SimpleDB.

Overview

The idea behind SimpleDB (like that of Google's DataStore) is fast reading. Most Web sites, though not all, need to retrieve information far more quickly than they need to save it. Amazon's own site is one example. People want to browse for books and other products, and they want the pages to come up quickly. Little data is stored beyond your browsing history, and nobody wants to wait for these pages to load. However, when people are, for example, entering a message into a forum on Amazon, the posting might have a very short but noticeable delay, and people seem to be forgiving of that. In general, most people seem to be happy with fast reads and maybe-not-quite-as-fast writes. That's what SimpleDB (like Google DataStore) offers.

While most databases today are relational (and implement the SQL language), SimpleDB is definitely not relational. Instead of creating data in tables with identical rows, you create sets of data (called domains) that contain data items. Each item can have multiple "attributes," where each attribute is given a name.

This is where things get quite different from a traditional, relational database. The sample docs in SimpleDB give a pretty good example. Suppose you create a domain that stores information on products. One product might be an article of clothing and would have attributes such as color and size. Another product might be something altogether different, such as a car engine. That product wouldn't have color and size, but rather something like make, model and year.

In a relational database, this would either require two separate tables or a single table with empty columns for those attributes that do not apply. (For example, size would be left empty for a row containing information about an engine.) But in SimpleDB, size wouldn't even exist for the engine item, and make and model wouldn't exist for a sweater. Yet the engine and the sweater can both be stored in a single domain.

Thus, you could have the following data in a single domain:

Item1: Category=Clothes; SubCategory=Sweater; Name=Cathair Sweater; Color=Siamese; Size=Small, Medium, Large.

Item2: Category=Car Parts; SubCategory=Engine; Name=Turbos; Make=Audi; Model=S4; Year=2000, 2001, 2002.

Notice that the attributes Category, SubCategory and Name are used in both data items, while the other attributes are unique to each item. Also notice that you can store more than one value for an attribute. For Item1, the sweater, the Size attribute has three values: Small, Medium and Large. For Item2, the engine, the Year attribute also has three values: 2000, 2001 and 2002.
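To make the data model concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not the SimpleDB API: the names domain, put_attribute and find_items are hypothetical, and the values are kept as plain strings for simplicity.

```python
from collections import defaultdict

# A domain modeled as: item name -> attribute name -> set of values.
# Illustration only; nothing here talks to SimpleDB.
domain = defaultdict(lambda: defaultdict(set))


def put_attribute(item, name, *values):
    """Add one or more values to a named attribute of an item."""
    domain[item][name].update(values)


def find_items(name, value):
    """Return the names of items whose attribute `name` contains `value`."""
    return sorted(
        item for item, attrs in domain.items() if value in attrs.get(name, ())
    )


# Item1: the sweater
put_attribute("Item1", "Category", "Clothes")
put_attribute("Item1", "SubCategory", "Sweater")
put_attribute("Item1", "Name", "Cathair Sweater")
put_attribute("Item1", "Color", "Siamese")
put_attribute("Item1", "Size", "Small", "Medium", "Large")

# Item2: the engine, with a different set of attributes
put_attribute("Item2", "Category", "Car Parts")
put_attribute("Item2", "SubCategory", "Engine")
put_attribute("Item2", "Name", "Turbos")
put_attribute("Item2", "Make", "Audi")
put_attribute("Item2", "Model", "S4")
put_attribute("Item2", "Year", "2000", "2001", "2002")

# Size simply does not exist for the engine; there is no empty column.
engine_has_size = "Size" in domain["Item2"]
```

In this sketch, find_items("Year", "2001") picks out Item2 only, and engine_has_size is False because the attribute was never stored for that item.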

The Libraries and Request Methods


 

I mentioned that I chose the C# library. However, I want to be clear: These libraries are written by Amazon, but they are not the only libraries available. The interface to SimpleDB is available in either of two forms: as a Web service (using SOAP) or through REST (Representational State Transfer). While the C# library includes several classes and methods for interacting with the SimpleDB data stored on Amazon's servers, behind the scenes the library is constructing simple URLs, sending them to the Web servers and waiting for a response. The response comes back in the form of XML, and the C# library parses this XML and stores it in a collection.
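As a rough sketch of what "constructing simple URLs" might look like, the Python below assembles a query string for a GetAttributes call. Everything in it is an assumption made for illustration: the endpoint is guessed from the host in the response namespace, the parameter names are modeled on the response element names, and the function build_get_attributes_url is hypothetical. A real request would also need to carry credentials and a signature, which are left out here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint, taken from the host in the response namespace.
ENDPOINT = "https://sdb.amazonaws.com/"


def build_get_attributes_url(domain_name, item_name):
    """Build an unsigned, illustrative request URL.

    The parameter names are assumptions for this sketch; authentication
    parameters are deliberately omitted.
    """
    params = {
        "Action": "GetAttributes",
        "DomainName": domain_name,
        "ItemName": item_name,
        "Version": "2007-11-07",
    }
    # urlencode escapes spaces and other reserved characters for us.
    return ENDPOINT + "?" + urlencode(params)


url = build_get_attributes_url("Products", "Item 1")

# Round-trip the query string to show what the server would see.
decoded = parse_qs(urlsplit(url).query)
```

The point is only that a client library is a thin layer over string building and an HTTP call, which is why writing your own wrapper is feasible.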

(Incidentally, if you're serious about REST, there is a lot of discussion online about whether Amazon broke unofficial REST rules in creating its interface. I'm not going to cover that here. But the information is online if you google SimpleDB REST.)

Since the C# library simply creates REST requests and reads responses, you could actually create your own class library. I imagine over time we'll see more (and better) libraries, as the one that Amazon created really isn't too sophisticated. But since this library is just a wrapper around REST calls, I'm not going to use this library to pass judgment on the SimpleDB product (and I don't recommend you do so either).

I do, however, have one concern that I want to raise: The responses are all in XML. And while XML works great (and those of you who read my blogs and follow me on Twitter know that I'm a proponent of it), XML is a shortcoming here. Returning a set of two numbers, say 1 and 2, takes a significant amount of space, including the XML header line and several XML tags, like so:

<?xml version="1.0"?>
<GetAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
  <GetAttributesResult>
    <Attribute>
      <Name>Height</Name>
      <Value>1</Value>
    </Attribute>
    <Attribute>
      <Name>Width</Name>
      <Value>2</Value>
    </Attribute>
  </GetAttributesResult>
  <ResponseMetadata>
    <RequestId>12345678-1234-1234-1234-123456789012</RequestId>
    <BoxUsage>0.0000012345</BoxUsage>
  </ResponseMetadata>
</GetAttributesResponse>

That's a lot of wasted space to get back two integers. Two integers can be stored in just a few bytes; this response is more than 400 bytes. If you're doing a huge amount of data retrieval, that can really add up. (And that means you'll want to make sure you request the data in batches rather than individually. Multiple data items can be returned in a single XML response.) Unfortunately (perhaps even unfairly), Amazon bills you for data going both ways: data moving into its servers and data coming out. You'll definitely want to count your beans carefully.
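To put a number on the overhead, here is a small Python sketch that parses a response of this shape with the standard library and compares its total size with the size of the values it carries. The function parse_attributes and the variable names are hypothetical; this is not how the C# library does it, only one way to read the same XML.

```python
import xml.etree.ElementTree as ET

# The sample response from the article, stored as bytes.
RESPONSE = b"""<?xml version="1.0"?>
<GetAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
  <GetAttributesResult>
    <Attribute><Name>Height</Name><Value>1</Value></Attribute>
    <Attribute><Name>Width</Name><Value>2</Value></Attribute>
  </GetAttributesResult>
  <ResponseMetadata>
    <RequestId>12345678-1234-1234-1234-123456789012</RequestId>
    <BoxUsage>0.0000012345</BoxUsage>
  </ResponseMetadata>
</GetAttributesResponse>
"""

# Every element lives in this namespace, so lookups must be qualified.
NS = {"sdb": "http://sdb.amazonaws.com/doc/2007-11-07/"}


def parse_attributes(xml_bytes):
    """Return {attribute name: [values]} from a GetAttributes-style response."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for attr in root.findall(".//sdb:Attribute", NS):
        name = attr.find("sdb:Name", NS).text
        value = attr.find("sdb:Value", NS).text
        # An attribute may appear more than once (multiple values).
        result.setdefault(name, []).append(value)
    return result


attrs = parse_attributes(RESPONSE)

# Bytes of actual values versus bytes on the wire.
payload_bytes = sum(len(v) for values in attrs.values() for v in values)
overhead_bytes = len(RESPONSE) - payload_bytes

print(attrs, payload_bytes, len(RESPONSE))
```

Here the two values occupy two bytes as text, while the whole response runs to several hundred, which is the argument for fetching items in batches.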

Experience with SimpleDB

The C# library, while somewhat simple, worked quite well, and I was able to easily put data onto the SimpleDB servers. I ran code to create domains, add data to and remove data from the domains, and list the data in the domains. It worked well. There's really not a lot more to it. SimpleDB is just that: simple. It's for storing data. It's your job to decide what data your site needs to upload.

Of course, working in C# presents an interesting consideration: When dealing with the Web, C# is normally a server-side language. That means you might have your own Web site hosted either on Amazon's EC2 or on your own servers. The site would be an ASP.NET site, and you would create C# code that interacts with the Amazon SimpleDB servers. Your users, then, would be browsing your site without realizing that, behind the scenes, your site is storing its data on Amazon's servers.

But is that reasonable? Does it really make sense to be hosting your own site but not your own data? And does it make sense to have your server make connections to another server on another network to get the data? I can't answer that for your own situation, but it is a question that would need to be answered.

The other option for server-side code is to have your site hosted on EC2. Then it starts to make more sense. EC2 supports Windows servers and you can create ASP.NET sites. (And you can use other platforms and languages on EC2 as well, if you don't want to use ASP.NET.) This makes a bit more sense, but the server still needs to connect to SimpleDB and process the raw XML data.

Of course, another possibility is to push into Web 2.0 and create a site that generates Web pages that include JavaScript code; these pages could then use AJAX to connect directly to SimpleDB. But there are some serious security risks here, because you probably don't want your users' browsers writing data directly to your SimpleDB domains. Further, by default, browsers don't even let JavaScript use AJAX to connect to sites whose URL domains differ from the one hosting the Web page itself.

Another option, then, is to use a hybrid approach in which your server makes the requests to the SimpleDB server but then returns the XML directly to the client browser, letting the browser process the XML using XPath and XSL transformations.

These are all serious issues that you'll need to explore when building a site that makes use of SimpleDB. Typically, I imagine people will be hosting their server-side code on EC2, as that's certainly what Amazon has in mind. From there you'll have to weigh the pros and cons of how to process the data and what to make the browser do, while factoring in the security issues.

Summary: A Unique Approach

In general, SimpleDB is a good approach for data that doesn't need to be relational. Will it work for all situations? No. For example, the usual textbook example of a fully normalized customer, products and sales database would not lend itself well to this approach, unless you want to do the joins manually by reading a customer ID from a customer domain, then searching the sales domain for product IDs based on that customer ID, and then searching the product domain for the list of products the customer purchased. That would be a lot of work when a simple two- or three-line SQL join would do the job nicely with a relational database.

However, for cases where you need to quickly look up data that doesn't need to be joined (such as a list of products matching a certain set of criteria), SimpleDB would work quite well.
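The manual join described above can be sketched in a few lines of Python. The three dictionaries stand in for three domains, and all of the data, attribute names and the function products_for_customer are made up for illustration; against the real service, each of the three steps would be a separate round trip.

```python
# Three stand-in "domains": item name -> attributes. All data is invented.
customers = {
    "C1": {"Name": "Alice"},
    "C2": {"Name": "Bob"},
}
sales = {
    "S1": {"CustomerId": "C1", "ProductId": "P1"},
    "S2": {"CustomerId": "C1", "ProductId": "P2"},
    "S3": {"CustomerId": "C2", "ProductId": "P1"},
}
products = {
    "P1": {"Name": "Cathair Sweater"},
    "P2": {"Name": "Turbos"},
}


def products_for_customer(customer_name):
    """Join customers -> sales -> products by hand, one lookup per step."""
    # Step 1: read the customer ID from the customer domain.
    customer_ids = [
        cid for cid, attrs in customers.items() if attrs["Name"] == customer_name
    ]
    # Step 2: search the sales domain for product IDs for those customers.
    product_ids = [
        sale["ProductId"]
        for sale in sales.values()
        if sale["CustomerId"] in customer_ids
    ]
    # Step 3: look up each product in the product domain.
    return [products[pid]["Name"] for pid in product_ids]


alice_products = products_for_customer("Alice")
```

Each step is simple on its own, but the application now owns the logic (and the extra requests) that a relational database would handle in a single query.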

Senior Editor Jeff Cogswell can be reached at jeffrey.cogswell@ZiffDavisEnterprise.com.
