VorteXML Turns Data Into XML

 
 
By Timothy Dyck  |  Posted 2003-01-20
 
 
 


Datawatch Corp.'s new VorteXML Server 1.0, which started shipping last month, provides a flexible, template-based system to extract data from the straw of undifferentiated text files and turn it into XML gold.

The server's sweet spot is with organizations that have collections of plain-text or HTML files (such as invoices, reports, confirmation e-mail messages or log files) that they want to turn into the more usable XML data format.

However, eWeek Labs' tests also found a number of important limitations that make this product more difficult to deploy than it should be and could point users toward products from close rivals Whitehill Technologies Inc. and ItemField Inc.

Our main concern is that the server generates no warnings when it detects formatting errors in input text files. For example, when we added a dollar sign before a numeric value, the value lost its cents digits and was silently rounded down to the nearest 10th of a dollar, even though we had tagged the text data as a numeric value with two decimal places.

Data fields in our test files that did not have the correct input format (we tried both numeric and date fields) were simply skipped, leaving empty elements in our output XML. As a result of our feedback, Datawatch will add a key error-checking feature to its next VorteXML Server release, enabling administrators to prevent the generation of empty elements or attributes.
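To make the concern concrete, here is a minimal Python sketch of the kind of strict field checking we would have liked to see. It is not VorteXML's behavior or API; the function names, the `FieldError` exception, the invoice element names and the MM/DD/YYYY date format are all assumptions made for illustration. The point is only that a malformed field raises an error instead of being rounded or dropped.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
import xml.etree.ElementTree as ET


class FieldError(ValueError):
    """Raised when an input field does not match its declared format (hypothetical)."""


def parse_money(raw):
    # Tolerate one leading dollar sign and thousands separators,
    # but insist on exactly two decimal places.
    text = raw.strip()
    if text.startswith("$"):
        text = text[1:]
    text = text.replace(",", "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise FieldError(f"not a numeric value: {raw!r}") from None
    if value.as_tuple().exponent != -2:
        raise FieldError(f"expected two decimal places: {raw!r}")
    return value


def parse_date(raw):
    # Assumed input format: MM/DD/YYYY. Output is an ISO 8601 date string.
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date().isoformat()
    except ValueError:
        raise FieldError(f"not a MM/DD/YYYY date: {raw!r}") from None


def build_invoice(fields):
    # Any FieldError propagates, so no empty element is ever written.
    invoice = ET.Element("invoice")
    ET.SubElement(invoice, "date").text = parse_date(fields["date"])
    ET.SubElement(invoice, "total").text = str(parse_money(fields["total"]))
    return invoice
```

With this approach, a value such as "$19.99" keeps its cents, while a value such as "12.5" or a misformatted date stops the conversion with a message an administrator can act on.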

VorteXML Server also has little flexibility in the data formats or platforms it supports. Only ASCII or ANSI text files can be imported; it lacks input filters for nontext data types such as Microsoft Corp. Word documents, rich-text-format documents or Adobe Systems Inc.'s PDF. ItemField's ContentMaster has more flexibility in this area.

Although VorteXML Server supports two older XML metadata formats (Document Type Definition and XML-Data Reduced), it doesn't support the current standard, the much more powerful XML Schema. Text, numeric and date data types are supported, and XML Schema support is planned for a future update.

The server is moderately priced at $7,999 per server for up to two CPUs and $1,999 for each additional two CPUs. A copy of Datawatch's $599 VorteXML Windows-based desktop text file-to-XML conversion tool is required to create the import templates VorteXML Server uses.

VorteXML Server needs the full Microsoft stack: Windows 2000 or higher and SQL Server 7.0 or later. (A copy of the free Microsoft Data Engine is included for those who don't already have a copy of SQL Server.) Microsoft's IIS (Internet Information Services) is required if VorteXML Server's Simple Object Access Protocol interface is used.

Converting nonstructured formats such as text files into a structured format such as XML is inherently a hard problem to solve. VorteXML Server's strongest feature is its VorteXML desktop tool, which uses an intuitive, flexible "painting" system to highlight data fields in input text files. (VorteXML can do the XML conversion itself but only on a single input file.)

VorteXML provides a mechanism to identify data fields through a combination of nearby text field labels, delimiters and absolute line position. It also has an expression language (although not a full programming language) to perform variable manipulation.
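The three ways of locating a field (nearby label, delimiter and absolute line position) can be sketched in a few lines of Python. This is an illustration of the general technique only, not of VorteXML's template format; the sample report, the helper names and the output element names are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample input: a fixed header line, two labeled fields, then delimited rows.
REPORT = """\
ACME SUPPLY CO.
Invoice No: 10442
Date: 01/20/2003
Widget|4|12.50
Gasket|10|0.75
"""


def by_line(lines, number):
    # Absolute position: take the whole of a fixed, 1-based line.
    return lines[number - 1].strip()


def by_label(lines, label):
    # Nearby label: take whatever follows "<label>:" on the first matching line.
    pattern = re.compile(re.escape(label) + r"\s*:\s*(.+)")
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1).strip()
    raise LookupError(f"label not found: {label}")


def by_delimiter(lines, delimiter, names):
    # Delimiter: keep each line that splits into the expected number of fields.
    for line in lines:
        parts = line.split(delimiter)
        if len(parts) == len(names):
            yield dict(zip(names, (part.strip() for part in parts)))


def to_xml(text):
    lines = text.splitlines()
    root = ET.Element("invoice")
    ET.SubElement(root, "vendor").text = by_line(lines, 1)
    ET.SubElement(root, "number").text = by_label(lines, "Invoice No")
    ET.SubElement(root, "date").text = by_label(lines, "Date")
    for row in by_delimiter(lines, "|", ("item", "quantity", "price")):
        item = ET.SubElement(root, "item", quantity=row["quantity"], price=row["price"])
        item.text = row["item"]
    return ET.tostring(root, encoding="unicode")
```

Calling `to_xml(REPORT)` returns a single `invoice` element with the vendor, number and date as child elements and one `item` element per delimited row.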

VorteXML handles HTML data in an unusual way: HTML files are preparsed to extract only text between tags, and this text is marked with a tag sequence number generated by VorteXML. The sequence number makes it easy to select items that appear only once in a file, but we had to resort to trickery to extract a list of items without the list's column heading (which, with the loss of its heading tag, was not differentiated from the list items). Preserving tag metadata, such as tag type or attribute values, would make this process easier.
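The following Python sketch shows why the heading problem arises and how keeping the tag type would help. It uses the standard library's `html.parser`; the `TextRuns` class, the `extract_runs` helper and the sample page are our own inventions, and the numbering scheme is only an approximation of what the article describes, not VorteXML's actual preparser.

```python
from html.parser import HTMLParser


class TextRuns(HTMLParser):
    """Collect each run of text between tags, numbered in document order."""

    def __init__(self):
        super().__init__()
        self.runs = []    # (sequence number, enclosing tag, text)
        self._open = []   # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        # Simplification: void tags such as <br> are not special-cased.
        self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to the matching open tag, tolerating sloppy HTML.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self._open[-1] if self._open else None
            self.runs.append((len(self.runs) + 1, enclosing, text))


def extract_runs(html):
    parser = TextRuns()
    parser.feed(html)
    parser.close()
    return parser.runs


PAGE = (
    "<table><tr><th>Item</th></tr>"
    "<tr><td>Widget</td></tr><tr><td>Gasket</td></tr></table>"
)

runs = extract_runs(PAGE)
# With only sequence numbers, the heading looks like any other entry.
text_only = [(seq, text) for seq, _tag, text in runs]
# With the tag type preserved, the heading is trivially excluded.
items = [text for _seq, tag, text in runs if tag == "td"]
```

Here `text_only` holds three numbered strings with nothing to separate "Item" from the two list entries, while `items` contains just "Widget" and "Gasket".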

Once we had a template created, we used VorteXML Server's management console to define a conversion project with input and output file directories and an associated conversion template.

Conversion was a beautifully simple matter of just dropping input files into the input directory. The new files were automatically detected and moved to a processed directory; matching XML files showed up shortly thereafter in the output directory. The ability to put output data directly into a relational database would be a good future addition.
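For readers weighing an in-house alternative, the drop-directory workflow described above is simple to approximate. The Python sketch below is a rough illustration under our own assumptions (a polling loop, `.txt` inputs, ASCII text, and a caller-supplied `convert` function); it says nothing about how VorteXML Server detects files internally.

```python
import shutil
import time
from pathlib import Path


def process_once(inbox, processed, outbox, convert):
    """Convert every .txt file currently in inbox; return the XML paths written."""
    processed.mkdir(parents=True, exist_ok=True)
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(inbox.glob("*.txt")):
        xml = convert(source.read_text(encoding="ascii"))
        target = outbox / (source.stem + ".xml")
        target.write_text(xml, encoding="utf-8")
        # Move the input only after its output has been written.
        shutil.move(str(source), str(processed / source.name))
        written.append(target)
    return written


def watch(inbox, processed, outbox, convert, interval=5.0):
    # Poll forever; a production service would also need logging,
    # error handling and protection against half-written input files.
    while True:
        process_once(Path(inbox), Path(processed), Path(outbox), convert)
        time.sleep(interval)
```

A single pass can be run with `process_once(Path("in"), Path("done"), Path("out"), my_converter)`, where `my_converter` is any function that takes the file's text and returns an XML string.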

Performance was slow in tests using VorteXML Server's included Microsoft Data Engine database: Converting 100 files took 33 minutes on a dual Intel Corp. Pentium III server (results that Datawatch didn't see in its replications of our tests). Switching to SQL Server 2000 reduced our processing time to 2 minutes, more than an order of magnitude faster.

West Coast Technical Director Timothy Dyck is at timothy_dyck@ziffdavis.com.


Executive Summary: VorteXML Server 1.0

Usability Excellent
Capability Good
Performance Good
Interoperability Poor
Manageability Fair
Scalability Fair
Security Good

VorteXML makes what can be a difficult job—turning text data into XML—straightforward. The tool is easy to use and will do the hoped-for job in many situations. However, the 1.0 release has a significant number of functional gaps that make it difficult for administrators to detect when input text files contain formatting errors.

COST ANALYSIS

At $8,000 per server, VorteXML isn't that expensive, but for one-off jobs we would turn first to text processing languages such as Perl, sed or awk.

(+) Easy, powerful text file template definition tool and expression language; automatic file-based import system eases data input; Web services interface.

(-) No mechanism to alert administrators to bad data in import files; no import filters included, so only straight text files can be imported; HTML files are parsed in a way that discards most tag metadata; doesn't support XML Schema; requires Windows 2000, IIS and Microsoft SQL Server (or Microsoft Data Engine).

EVALUATION SHORT LIST

  • In-house development of small programs to do text transformation
  • Whitehill Technologies' xml Transport
  • ItemField's ContentMaster
  • Data Junction Corp.'s Data Junction Content Extractor
  • vortexml.datawatch.com
