Making data more directly accessible to developers, with greater freedom to use it in ways not anticipated when it was compiled, is the hottest fashion statement in development technology. With rapid refinement of tools for working with XML on data's supply side and with data manipulation becoming better integrated with mainstream code on the demand side, there's a trunk full of ways for developers to win.
The business-suit sleekness of an enterprise application is today often marred by ugly seams, where a procedural language has been clumsily stitched to nonprocedural SQL statements manipulating relational data. Further opportunities for ugliness arise with the growing use of XML, a nonrelational data representation—but one that programmers would like to query with the same ease and performance they get from a SQL database.
Most programming languages depend on sequence and hierarchy for data reference and navigation, for example in accessing arrays. The tables of relational databases have neither sequence nor hierarchy, and that's just the beginning of the differences that make it difficult to put together a unified look.
Donald Chamberlin, an IBM fellow who co-invented SQL in the 1970s with colleague Raymond Boyce, summarized key differences between relational data and XML in an IBM Systems Journal article in October 2002. Relational data, Chamberlin observed, “tend to have a regular structure, which allows the descriptive metadata for these data to be stored in a separate catalog. XML data, in contrast, are often quite heterogeneous, and distribute their metadata throughout the document.”
Relational tables typically have a flat structure, Chamberlin noted: each column of each record represents a single-valued instance of some fundamental data type, such as an integer or a character string. In the time since Chamberlin's paper, databases have adopted structured types to handle more complex multivalued data, and foreign keys have always enabled one-to-many relationships among tables. Even so, lookups of arbitrary attribute values in a database using these facilities are not as easy as with XML documents using multilevel nested elements.
“XML documents have an intrinsic order, whereas relational data are unordered except where an ordering can be derived from data values,” Chamberlin added in the paper, further observing that a relational data set typically has a value in almost every column of every record while XML data are often “sparse”—with many of their possible data elements left blank.
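The contrast Chamberlin draws can be seen in a small example. The following is a minimal sketch in Python using only the standard library; the customer document, its element names and the helper function are hypothetical, invented here for illustration. It shows nested, ordered and sparse XML being navigated directly, with no fixed set of columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the second customer has no <phone>
# elements at all (sparse data), and the first has two, in document order.
CUSTOMERS_XML = """
<customers>
  <customer id="1">
    <name>Ada</name>
    <phone type="home">555-0100</phone>
    <phone type="work">555-0101</phone>
  </customer>
  <customer id="2">
    <name>Grace</name>
  </customer>
</customers>
"""


def phones_by_customer(root):
    """Map each customer name to its phone numbers, in document order."""
    result = {}
    for customer in root.findall("customer"):
        name = customer.findtext("name")
        result[name] = [phone.text for phone in customer.findall("phone")]
    return result


root = ET.fromstring(CUSTOMERS_XML)
phones = phones_by_customer(root)

# A path expression reaches into the nested structure in one step.
work_phones = [p.text for p in root.findall("./customer/phone[@type='work']")]
```

A flat relational table would need either a separate phone table or a fixed number of phone columns, many of them empty, to hold the same information.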
But there's still great demand for XML's versatility and the ease that it offers for repurposing data for multiple tasks. XML addresses a genuine need.
Data's XQuery adventure
Developers confronting an XML-demarcated data stream have been faced, until recently, with three unattractive choices. At one extreme, programmers could burn machine cycles upfront to “shred,” as the process is popularly called, the multilevel XML into a collection of relational tables. At the other extreme, they could defer all processing by storing the entire XML corpus as a single chunk of textual data, delving into it as needed later on.
Both these approaches leave open the door to use of standards-based technologies down the line. The third choice, in contrast, limits that flexibility by adopting the XML representations of a specific database platform.
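The first of these choices, shredding, can be sketched briefly. The example below is an illustration only, written in Python with the standard-library sqlite3 and ElementTree modules; the document, table names and column names are assumptions made for this sketch and do not come from any product mentioned here. It pays the parsing cost upfront, splitting a nested document into two flat tables joined by a foreign key, after which ordinary SQL applies.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
CUSTOMERS_XML = """
<customers>
  <customer id="1">
    <name>Ada</name>
    <phone type="home">555-0100</phone>
    <phone type="work">555-0101</phone>
  </customer>
  <customer id="2">
    <name>Grace</name>
  </customer>
</customers>
"""


def shred(xml_text, conn):
    """Split nested XML into two flat tables (schema is an assumption)."""
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE phone ("
        "customer_id INTEGER REFERENCES customer(id), "
        "position INTEGER, kind TEXT, number TEXT)"
    )
    root = ET.fromstring(xml_text)
    for customer in root.findall("customer"):
        customer_id = int(customer.get("id"))
        conn.execute(
            "INSERT INTO customer VALUES (?, ?)",
            (customer_id, customer.findtext("name")),
        )
        # Document order is not implicit in a table, so store it explicitly.
        for position, phone in enumerate(customer.findall("phone")):
            conn.execute(
                "INSERT INTO phone VALUES (?, ?, ?, ?)",
                (customer_id, position, phone.get("type"), phone.text),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
shred(CUSTOMERS_XML, conn)

# Once shredded, the data is queried with a conventional join.
work_numbers = conn.execute(
    "SELECT c.name, p.number FROM customer c "
    "JOIN phone p ON p.customer_id = c.id "
    "WHERE p.kind = 'work' ORDER BY c.id"
).fetchall()
```

Note the explicit position column: the intrinsic order of the XML elements survives shredding only if it is recorded as a data value.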
A better way, though, has been devised by IBM's Chamberlin and others with their development of XQuery, both a query and a transformation language. Released on Sept. 15 as a World Wide Web Consortium working-draft specification, XQuery goes beyond queries to meet the needs of data integration and reformatting. (The working-draft XQuery specification is online at www.w3.org/TR/xquery, while an informative page of FAQs is at www.stylusstudio.com/xquery/xquery_faq.html.)
Before it reached even its current intermediate level of formalization, XQuery had already attracted widespread interest from developers. As reported in eWEEK in March, a privately sponsored survey of 550 developers found more than half already using XQuery and another third planning to use it within the year.
That survey was commissioned by the DataDirect Technologies division of Progress Software Corp., which released last month its DataDirect XQuery 1.0 embeddable component for Java application developers. The product works with databases from IBM, Microsoft Corp., Oracle Corp. and Sybase Inc., as well as MySQL AB.
Also available from DataDirect is the Stylus Studio IDE (integrated development environment) for XQuery programming, combining XML editing with XQuery debugging and other related capabilities.
Meanwhile, the increasingly mainstream environment of Microsoft's C# may soon offer access to nearly any kind of data source with a single syntax and with the full syntactic and semantic support of Microsoft's Visual Studio environment. Microsoft's LINQ (Language Integrated Query) technology, previewed at the company's Professional Developers Conference in Los Angeles last month, extends C# (and Visual Basic .Net, as well as potentially any other .Net language) to unify interactions with objects, XML-formatted data and relational data under a single umbrella.
“A developer who is working with a collection of customer objects and wants to query a list of names from that collection would need to write five or 10 lines of code to accomplish that simple task with today's technology. With LINQ, that list can be obtained by writing just one line of code, Select Name From Customers,” said Microsoft Technical Fellow Anders Hejlsberg in a statement accompanying his conference presentation.
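The difference Hejlsberg describes can be illustrated by analogy. The sketch below is Python, not C# or LINQ, and the Customer class and sample data are hypothetical; it contrasts a hand-written loop over a collection of objects with a one-line, query-style expression built into the language that yields the same list of names.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical customer object with just two fields.
    name: str
    city: str


customers = [
    Customer("Ada", "London"),
    Customer("Grace", "New York"),
    Customer("Edsger", "Austin"),
]


def names_with_loop(items):
    """Imperative style: build the result list step by step."""
    names = []
    for item in items:
        names.append(item.name)
    return names


def names_with_query(items):
    """Query style: state what is wanted in a single expression."""
    return [customer.name for customer in items]


loop_names = names_with_loop(customers)
query_names = names_with_query(customers)
```

Both functions return the same list; the second simply leaves the iteration mechanics to the language, which is the kind of saving the quote points to.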
eWEEK Labs regards all such proposals with suspicion: Database lore is rife with tales of poorly optimized queries that interacted with disparate table sizes or varying data bandwidths to produce crippling performance problems. To the degree that LINQ succeeds in enabling data abstraction, it will do so by concealing underlying data representation and by putting the actual means of data access—which Hejlsberg derides as “plumbing”—into a box labeled “No Developer-Serviceable Parts Inside.”
One must recall, however, that exactly the same concerns were raised when C replaced assembly language programming or, for that matter, when databases themselves were introduced into general enterprise use. The question that most developers will want answered is, “Will LINQ get me the application I want in less time, for less money, while Moore's Law offsets the likely performance overheads?”
The other issue accurately raised by some observers is that of developer lock-in. Rather than being adept at writing SQL and injecting it into any development environment, a developer may instead become proficient only at working with a universal abstraction of data that's available only within the .Net environment. Whether LINQ is a better-tailored suit or a straitjacket is very much yet to be seen.
Technology Editor Peter Coffee can be reached at peter_coffee@ziffdavis.com.