Over the past few years, NoSQL
databases have received a great deal of attention, with countless attestations
of their virtues in enabling consumer-facing, Web-based businesses to manage
fast-growing user demand and make use of the huge quantities of data that their
users create.
It's clear that NoSQL adoption
has paid dividends to the Twitters and Netflixes of the world. But it's been
less apparent just how much attention mainstream organizations ought to pay to
the trend, since relational databases are familiar and well-entrenched, and
since many well-established solutions exist for scaling relational databases.
Despite the heavy focus on the
virtues of NoSQL for "Internet-scale" businesses, the products and services
covered under the NoSQL umbrella are well worth consideration by organizations
of all sizes, not as across-the-board replacements for relational databases but
as additional tools for meeting business goals.
NoSQL refers
to a broad class of database products that tend not to expose SQL interfaces.
What separates these products from traditional databases has less to do with
SQL and more to do with a departure from the relational model. In particular,
these databases do away with fixed schemas, which can be beneficial when
developing applications with changing requirements. For this reason,
"non-relational" is a better, if less broadly referenced, handle for this group
of products.
One of the canonical documents
describing the design concepts and rationales for non-relational databases is Amazon.com's
2007 paper on its "Dynamo" data store, which the company developed to meet its
internal service-level requirements.
The paper describes how
traditional relational database management systems, with their focus on
prioritizing data consistency above write-operation availability, proved
ill-suited to the Web retailer's needs in the context of Amazon's
infrastructure, which consists of large numbers of commodity servers of
varying capacities. For Amazon, blocking customers from adding new items to
their carts while waiting for separate application nodes to get in sync was too
high a price to pay, so Dynamo was designed to boost availability by
de-prioritizing consistency.
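The trade-off described above can be illustrated with a toy model. The sketch below is not Amazon's implementation; the CartReplica class and reconcile function are hypothetical names invented for illustration. It assumes two replicas that each accept "add to cart" writes without coordinating, and a later merge step that takes the union of the two carts.

```python
# Toy model only: CartReplica and reconcile are hypothetical names,
# not part of Dynamo or any real library.
class CartReplica:
    def __init__(self):
        self.items = set()

    def add_item(self, item):
        # The write always succeeds locally; no other replica is consulted.
        self.items.add(item)


def reconcile(a, b):
    """Merge two divergent carts by taking the union of their items."""
    merged = a.items | b.items
    a.items = set(merged)
    b.items = set(merged)
    return merged


replica_a, replica_b = CartReplica(), CartReplica()
replica_a.add_item("book")        # one write lands on replica A
replica_b.add_item("headphones")  # a concurrent write lands on replica B

# Until the replicas sync, they disagree: available, but not consistent.
diverged = replica_a.items != replica_b.items

# After reconciliation, both replicas hold every item the customer added.
cart = reconcile(replica_a, replica_b)
```

The customer is never blocked from adding an item; the cost is that the two replicas temporarily hold different views of the cart.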
While the scale of Amazon's
infrastructure and user base is unusual (as is Amazon's capacity for
rolling its own data store solution), the need to prioritize certain
application characteristics above others is common to every organization.
Today's crop of non-relational database products provides businesses with more
options without requiring that they create solutions from scratch.
There are several different
types of non-relational databases that fall under the NoSQL umbrella, including
key-value stores, document-oriented databases, columnar databases and graph
databases, each with its own data model, scaling strategy and use cases.
Pinning down particular NoSQL
databases into a specific category can get confusing, as some of the categories
tend to blend into each other. For understanding the broad categories of NoSQL
data stores, I found a paper by Rick Cattell helpful. In it, the former
Sun Microsystems database architect breaks down the options into key-value
stores, document stores and extensible-record stores.
In a key-value store,
individual records amount to some arbitrary lump of information, indexed by a
key. These systems typically do not interpret the data themselves, leaving that
function to the application. Riak, which is supported by Basho Technologies,
and Oracle's Berkeley DB are examples of popular key-value stores.
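As a rough illustration of that division of labor, here is a minimal in-memory sketch. KeyValueStore is a hypothetical class, not the API of Riak or Berkeley DB. The store holds opaque bytes, and the application is responsible for serializing and interpreting them (JSON is an arbitrary choice here).

```python
import json


class KeyValueStore:
    """Hypothetical in-memory store: values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)


store = KeyValueStore()
profile = {"name": "Ada", "plan": "pro"}

# The application, not the store, decides how the value is encoded.
store.put("user:42", json.dumps(profile).encode("utf-8"))

raw = store.get("user:42")                  # still just bytes
loaded = json.loads(raw.decode("utf-8"))    # the application interprets them
```

Because the store never looks inside the value, the only lookup it can offer is by key.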
In a document store, records
are documents that consist of a variable number of named
attributes of various types, such as integers, strings and nested objects.
Document-oriented databases tend to recognize the structure of the data they
store and have more querying functionality than key-value stores. MongoDB, from
10gen, and Apache CouchDB, which is supported by Couchbase, are examples of popular
document stores.
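To make the contrast with key-value stores concrete, the following sketch models documents as plain Python dictionaries and queries them by attribute. The get_path and find helpers are hypothetical and only imitate the general idea; they are not the query API of MongoDB or CouchDB.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # attribute is absent from this document
        current = current[part]
    return current


def find(collection, criteria):
    """Return the documents whose attributes equal every value in criteria."""
    return [
        doc for doc in collection
        if all(get_path(doc, path) == value for path, value in criteria.items())
    ]


# Documents in one collection need not share the same attributes.
customers = [
    {"name": "Ada", "age": 36, "address": {"city": "London"}},
    {"name": "Grace", "address": {"city": "New York"}, "tags": ["navy"]},
    {"name": "Linus", "age": 28},
]

londoners = find(customers, {"address.city": "London"})
aged_36 = find(customers, {"age": 36})
```

The store can answer these questions only because it understands the structure of each record, which a key-value store does not.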
Extensible-record stores, which
are also known as wide-column stores, provide a data model similar to
relational databases, but with a focus on organizing data into columns (rather
than rows) and column families (rather than tables). Apache Cassandra, which is
supported by DataStax, and Apache HBase, supported by Cloudera, are examples of
popular extensible-record stores.
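One way to picture this data model is as nested maps: row key, then column family, then column. The sketch below is a simplified assumption for illustration; WideColumnTable is a hypothetical class and does not reflect how Cassandra or HBase store data on disk.

```python
from collections import defaultdict


class WideColumnTable:
    """Hypothetical model: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are declared up front; columns inside them are not.
        self.column_families = set(column_families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Read every column of one family for one row, without creating rows.
        return dict(self._rows.get(row_key, {}).get(family, {}))


users = WideColumnTable(["profile", "activity"])
users.put("u1", "profile", "name", "Ada")
users.put("u1", "profile", "email", "ada@example.com")
users.put("u2", "profile", "name", "Grace")
users.put("u2", "activity", "last_login", "2011-06-01")

u1_profile = users.get_family("u1", "profile")
u1_activity = users.get_family("u1", "activity")  # this row has no such columns
```

Each row can carry a different set of columns within a family, which is what makes the records "extensible."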
More
important than worrying about which bucket a given non-relational database fits
into is focusing on the particular set of features it offers: in
particular, which controls it offers for balancing availability, consistency
and fault-tolerance, how it handles scaling and which interfaces it provides
for accessing data.
For example, Apache Cassandra
enables administrators and developers to set their desired trade-off between
availability and consistency on a per-query basis. To maximize consistency,
Cassandra can hold off on reporting a write complete or responding to a read
until all of the nodes that hold a replica of the data have responded. To
maximize availability, the system can complete an operation as soon as any one
node completes a write or responds to a read. There are also several gradations
in between to reach a balance and to provide for resiliency in case nodes fail.
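The arithmetic behind these gradations can be sketched in a few lines. The level names ONE, QUORUM and ALL match Cassandra consistency levels, but the functions below are hypothetical helpers written for illustration, not Cassandra's API, and they assume a simple single-datacenter setup with a fixed replication factor.

```python
def required_acks(level, replication_factor):
    """How many replicas must respond before an operation is reported done."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when every read set must overlap every write set (R + W > N)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


def survives_failures(level, replication_factor, failed_replicas):
    """True when enough replicas remain alive to satisfy the level."""
    return replication_factor - failed_replicas >= required_acks(
        level, replication_factor
    )


RF = 3  # assumed replication factor for this example
quorum_is_consistent = reads_see_latest_write("QUORUM", "QUORUM", RF)
one_is_consistent = reads_see_latest_write("ONE", "ONE", RF)
all_survives_one_failure = survives_failures("ALL", RF, failed_replicas=1)
quorum_survives_one_failure = survives_failures("QUORUM", RF, failed_replicas=1)
```

With three replicas, the strictest setting cannot complete an operation once a single replica is down, while a majority setting keeps reads and writes overlapping and still tolerates one failure.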
MongoDB provides for scaling
out across nodes in a cluster through auto-partitioning. If a data set grows
too large for a single machine, MongoDB can chunk up the collection and
distribute it across the nodes assigned to it, with distributed replica sets to
recover from a node failure.
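The general idea of chunking a collection by key range and spreading the chunks across machines can be sketched as follows. This is a deliberately simplified toy, not MongoDB's actual balancer: ShardedCollection, the tiny chunk size and the round-robin placement are all assumptions made for illustration, and replica sets are left out entirely.

```python
import bisect


class ShardedCollection:
    """Hypothetical toy: key-range chunks that split when they grow too big."""

    def __init__(self, shard_names, max_chunk_size=4):
        self.shard_names = list(shard_names)
        self.max_chunk_size = max_chunk_size
        # Chunks are kept in key order; each chunk is a sorted list of keys.
        self.chunks = [[]]

    def insert(self, key):
        # Pick the first chunk whose highest key covers this key,
        # falling back to the last chunk for keys beyond every range.
        index = len(self.chunks) - 1
        for i, chunk in enumerate(self.chunks):
            if chunk and key <= chunk[-1]:
                index = i
                break
        chunk = self.chunks[index]
        bisect.insort(chunk, key)
        # Split an oversized chunk in half, keeping overall key order.
        if len(chunk) > self.max_chunk_size:
            mid = len(chunk) // 2
            self.chunks[index:index + 1] = [chunk[:mid], chunk[mid:]]

    def placement(self):
        # Assumed policy: hand chunks to shards round-robin.
        result = {name: [] for name in self.shard_names}
        for i, chunk in enumerate(self.chunks):
            result[self.shard_names[i % len(self.shard_names)]].append(chunk)
        return result


collection = ShardedCollection(["shard-a", "shard-b"])
for key in range(1, 11):
    collection.insert(key)

layout = collection.placement()
```

As the data set grows, no single chunk exceeds the size limit, and each machine ends up holding only part of the collection.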
Among the primary challenges
for administrators working to wrap their minds around NoSQL databases are the
differences in accessing data stored in these systems. Because these products
differ so much from one another, they have no direct equivalent to the role SQL
plays in the relational world. Rather, most non-relational databases provide
bindings for accessing data from multiple programming languages.
There are a number of SQL-like
querying languages that have sprouted up to offer higher-level data access, such
as Google's GQL for its App Engine platform as a service (PaaS), MongoDB's Mongo
Query Language, Cassandra Query Language and the nascent UnQL (Unstructured
Query Language). For Apache Hadoop-based systems, Apache Pig and Apache Hive
offer two separate routes for working with data from a higher level.
In my own efforts to better
understand the differences in accessing data on relational and non-relational
data stores, I've found the open-source Django-nonrelational project helpful.
Django is a Python framework for building Web-based applications that sports an
object-relational mapping layer for abstracting the differences between
separate relational databases. Django-nonrelational supports Google's App Engine
datastore, and offers in-development backend support for Cassandra and MongoDB.
For administrators and
developers familiar with Django, experimenting with the various backends
provides a hands-on reference for the differences between relational and
non-relational stores, and between some of the different NoSQL systems.
As Editor in Chief of eWEEK Labs, Jason Brooks manages the Labs team and is responsible for eWEEK's print edition. Brooks joined eWEEK in 1999, and has covered wireless networking, office productivity suites, mobile devices, Windows, virtualization, and desktops and notebooks. Jason's coverage is currently focused on Linux and Unix operating systems, open-source software and licensing, cloud computing and Software as a Service. Follow Jason on Twitter at jasonbrooks, or reach him by email at jbrooks@eweek.com.