VorteXML Turns Data Into XML

Datawatch server is easy to use but has holes in functionality.

Datawatch Corp.'s new VorteXML Server 1.0, which started shipping last month, provides a flexible template-based system to extract data from the straw of undifferentiated text files and turn it into XML gold.

The server's sweet spot is with organizations that have collections of plain-text or HTML files (such as invoices, reports, confirmation e-mail messages or log files) that they want to turn into the more usable XML data format.

However, eWeek Labs' tests also found a number of important limitations that make this product more difficult to deploy than it should be and that could point users toward products from close rivals Whitehill Technologies Inc. and ItemField Inc.

Our main concern is that the server generates no warnings when it detects formatting errors in input text files. For example, when we added a dollar sign before a numeric value, the value lost its cents digits and was silently rounded down to the nearest 10th of a dollar, even though we had tagged the text data as a numeric value with two decimal places.

Data fields in our test files that did not have the correct input format (we tried both numeric and date fields) were simply skipped, leaving empty elements in our output XML. As a result of our feedback, Datawatch will add a key error-checking feature to its next VorteXML Server release, enabling administrators to prevent the generation of empty elements or attributes.
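The kind of check we were looking for can be sketched in a few lines of Python. This is an illustration only, not VorteXML Server's behavior or API: the names `parse_amount` and `convert_fields` and the strict two-decimal rule are assumptions made for the example. Instead of rounding or skipping a malformed value, the sketch collects a warning that a conversion job could log.

```python
import re
from decimal import Decimal

# Hypothetical strict rule: optional sign, digits, exactly two decimal places.
_AMOUNT = re.compile(r"[+-]?\d+\.\d{2}")

def parse_amount(raw: str) -> Decimal:
    """Return the value as a Decimal or raise ValueError; never guess."""
    text = raw.strip()
    if not _AMOUNT.fullmatch(text):
        raise ValueError(f"not a two-decimal numeric value: {raw!r}")
    return Decimal(text)

def convert_fields(fields: dict) -> tuple:
    """Convert every field, collecting warnings for values that fail to parse."""
    values, warnings = {}, []
    for name, raw in fields.items():
        try:
            values[name] = parse_amount(raw)
        except ValueError as exc:
            warnings.append(f"{name}: {exc}")
    return values, warnings

# A clean value converts; one with a stray dollar sign produces a warning.
values, warnings = convert_fields({"total": "19.95", "tax": "$1.60"})
```

With the made-up sample input above, `values` holds only the clean `total` field and `warnings` holds one message naming the `tax` field, so the bad value is reported rather than silently altered.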

VorteXML Server also has little flexibility in the data formats or platforms it supports. Only ASCII or ANSI text files can be imported; it lacks input filters for nontext data types such as Microsoft Corp. Word documents, rich-text-format documents or Adobe Systems Inc.'s PDF. ItemField's ContentMaster has more flexibility in this area.

Although VorteXML Server supports two older XML metadata formats (Document Type Definition and XML Data Reduced), it doesn't support the current standard, the much more powerful XML Schema. Text, numeric and date data types are supported, and XML Schema support is planned for a future update.

The server is moderately priced at $7,999 per server for up to two CPUs and $1,999 for each additional two CPUs. A copy of Datawatch's $599 VorteXML Windows-based desktop text file-to-XML conversion tool is required to create the import templates VorteXML Server uses.

VorteXML Server needs the full Microsoft stack: Windows 2000 or higher and SQL Server 7.0 or later. (A copy of the free Microsoft Data Engine is included for those who don't already have a copy of SQL Server.) Microsoft's IIS (Internet Information Services) is required if VorteXML Server's Simple Object Access Protocol interface is used.

Converting nonstructured formats such as text files into a structured format such as XML is inherently a hard problem to solve. VorteXML Server's strongest feature is its VorteXML desktop tool, which uses an intuitive, flexible "painting" system to highlight data fields in input text files. (VorteXML can do the XML conversion itself but only on a single input file.)

VorteXML provides a mechanism to identify data fields through a combination of nearby text field labels, delimiters and absolute line position. It also has an expression language (although not a full programming language) to perform variable manipulation.
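The label-based part of that approach can be illustrated with a short Python sketch. It is not VorteXML's template format or expression language: the `TEMPLATE` dictionary, the `extract` function, the element names and the sample invoice text are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template: element name -> (label text, pattern for the value).
TEMPLATE = {
    "invoice": ("Invoice No:", r"\d+"),
    "date": ("Date:", r"\d{4}-\d{2}-\d{2}"),
    "total": ("Total:", r"\d+\.\d{2}"),
}

def extract(text: str, template: dict = TEMPLATE) -> ET.Element:
    """Build an XML element from the values found right after each label."""
    root = ET.Element("record")
    for name, (label, pattern) in template.items():
        match = re.search(re.escape(label) + r"\s*(" + pattern + r")", text)
        if match:  # fields that are not found are omitted, not left empty
            ET.SubElement(root, name).text = match.group(1)
    return root

# Made-up sample input, for illustration only.
sample = "Invoice No: 1042\nDate: 2002-03-18\nTotal: 249.50\n"
print(ET.tostring(extract(sample), encoding="unicode"))
```

A real template would also need the delimiter and line-position rules described above; this sketch covers only the nearby-label case.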

VorteXML handles HTML data in an unusual way: HTML files are preparsed to extract only text between tags, and this text is marked with a tag sequence number generated by VorteXML. The sequence number makes it easy to select items that appear only once in a file, but we had to resort to trickery to extract a list of items without the list's column heading (which, with the loss of its heading tag, was not differentiated from the list items). Preserving tag metadata, such as tag type or attribute values, would make this process easier.
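The difference that tag metadata makes can be shown with Python's standard `html.parser` module. This sketch is our own illustration and makes no claim about how VorteXML parses HTML or numbers its text runs; the `TextRuns` class and the sample table are hypothetical.

```python
from html.parser import HTMLParser

class TextRuns(HTMLParser):
    """Collect (sequence number, enclosing tag, text) for each text run."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.runs = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag = self.stack[-1] if self.stack else None
            self.runs.append((len(self.runs) + 1, tag, text))

# Made-up sample: a one-column table with a heading and two items.
html = ("<table><tr><th>Item</th></tr>"
        "<tr><td>Widget</td></tr><tr><td>Gadget</td></tr></table>")
parser = TextRuns()
parser.feed(html)
parser.close()

# Sequence numbers only: the heading looks like any other text run.
flat = [(n, text) for n, _, text in parser.runs]
# With the tag type kept, the heading is trivial to exclude.
items = [text for _, tag, text in parser.runs if tag == "td"]
```

In `flat`, "Item" is just run number 1 and cannot be told apart from the list entries without extra rules; in `items`, filtering on the `td` tag leaves only "Widget" and "Gadget".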

Once we had a template created, we used VorteXML Server's management console to define a conversion project with input and output file directories and an associated conversion template.

Conversion was a beautifully simple matter of just dropping input files into the input directory. The new files were automatically detected and moved to a processed directory; matching XML files showed up shortly thereafter in the output directory. The ability to put output data directly into a relational database would be a good future addition.
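The drop-folder pattern described here is simple to sketch in Python. The code below is an assumption-laden illustration, not a description of VorteXML Server's internals: `process_once`, the directory arguments and the placeholder `convert` function (which just wraps each non-blank line in a `<line>` element) are all hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

def convert(text: str) -> ET.Element:
    """Placeholder conversion: wrap each non-blank line in a <line> element."""
    root = ET.Element("document")
    for line in text.splitlines():
        if line.strip():
            ET.SubElement(root, "line").text = line.strip()
    return root

def process_once(input_dir: Path, processed_dir: Path, output_dir: Path) -> list:
    """Convert every .txt file waiting in input_dir; return the XML files written."""
    for folder in (processed_dir, output_dir):
        folder.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(input_dir.glob("*.txt")):
        root = convert(src.read_text(encoding="ascii", errors="replace"))
        out = output_dir / (src.stem + ".xml")
        out.write_text(ET.tostring(root, encoding="unicode"), encoding="utf-8")
        # Move the source out of the input folder so it is not converted twice.
        shutil.move(str(src), str(processed_dir / src.name))
        written.append(out)
    return written
```

A scheduler, or a loop that calls `process_once` and then sleeps for a few seconds, would give the "files are automatically detected" behavior; writing the result to a relational database instead of a file would be a change to the output step only.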

Performance was slow in tests using VorteXML Server's included Microsoft Data Engine database: Converting 100 files took 33 minutes on a dual Intel Corp. Pentium III server (results that Datawatch didn't see in its replications of our tests). Switching to SQL Server 2000 reduced our processing time to 2 minutes, more than an order of magnitude faster.

West Coast Technical Director Timothy Dyck is at timothy_dyck@ziffdavis.com.