Data by Design

Content categorization systems help pinpoint data and improve the usability of Web sites and applications.

Critical enterprise software platforms (including portals, content and document management systems, customer relationship management applications, and e-commerce systems) generate reams of data. But what good is that data if employees and business partners can't find the information they need or even figure out what they need?

Categorization engines address exactly that problem, working with each of these enterprise applications to make it possible to pinpoint content.

A good categorization engine will take any data—from Web pages to Microsoft Word documents to Adobe Acrobat files to dynamically generated database content—and apply it to a category taxonomy. Done right, this can improve the success of searches, content access and the general usability of Web sites and enterprise applications.

While these content categorization systems may sound esoteric, their value is proven. One need only compare the success of Web sites that are easily searchable with the performance of those that are not. In many ways, the early success of the Yahoo Inc. site was based on its excellent categorization, which gave it an edge over many of its search-based competitors.

Although the goal of any company looking to accurately categorize content is the same, there are many different ways to get there. For this eValuation, eWeek Labs looked at three applications that take very different approaches to categorizing content.

Applied Semantics Inc.'s Auto-Categorizer 1.1 takes perhaps the most unusual approach. The product leverages a massive ontology of conceptual meanings that captures the relationships among terms and concepts.

Auto-Categorizer makes it easy to create and refine taxonomies and categorize content accordingly. Grounded in XML and with strong support for most scripting and development languages, Auto-Categorizer can be easily integrated with most systems.

Interwoven Inc.'s MetaTagger 3.0 is heavily focused on categorizing content as it is created, specifically as it is created within Interwoven's TeamSite content management application. Like other products that are tied to a content management system, MetaTagger will suggest categories to content authors as they are publishing content. However, MetaTagger can also categorize content outside the content management system and can even assist with taxonomy creation.

The most traditional approach to content categorization is found in Thunderstone Software LLC's Texis Categorizer 4.1. Using "training sets" of content within categories, Texis Categorizer can be fine-tuned to consistently (and constantly) categorize content within a taxonomy. The product's use of standard SQL queries and Common Gateway Interface scripts makes it easy to integrate with any application.
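For readers unfamiliar with the training-set idea, the following Python sketch shows the general technique in miniature: sample documents are filed under categories, and new content is assigned to the category whose samples it most resembles. This is a toy naive Bayes classifier written for illustration only; it is not how Texis Categorizer or any product reviewed here is implemented, and the class name, method names and sample documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TrainingSetCategorizer:
    """Toy multinomial naive Bayes categorizer (hypothetical, illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()                  # every word seen during training

    def train(self, category, text):
        """Add one sample document to a category's training set."""
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def scores(self, text):
        """Return a log-probability score for each trained category."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Prior: share of training documents filed under this category.
            score = math.log(self.doc_counts[category] / total_docs)
            for token in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def categorize(self, text):
        """Return the best-scoring category (requires at least one trained document)."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Made-up training documents, loosely echoing the kinds of test content described here.
categorizer = TrainingSetCategorizer()
categorizer.train("science", "cells divide during mitosis in the biology lab")
categorizer.train("science", "the physics lecture covered energy and momentum")
categorizer.train("insurance", "the health insurance policy covers hospital claims")
categorizer.train("insurance", "premium payments and insurance coverage for dependents")

print(categorizer.categorize("insurance claims for hospital visits"))  # prints "insurance"
```

Fine-tuning in this model simply means adding or correcting sample documents in a category's training set and categorizing again, which is why the quality of the training sets matters as much as the engine itself.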

There's really only one way to test content categorization applications, and that's with lots of content. We provided each vendor with three sets of content: one consisting of content from a university science course; another comprising government health and insurance documents; and a third, by far the largest with more than 1,000 documents, consisting of content generated by analysts here at eWeek Labs.

Each set consisted of a variety of content, including Web pages; Microsoft Word, Excel and PowerPoint files; Acrobat files; straight text; and multimedia files. Along with content, we supplied each vendor with suggested basic taxonomies and training sets. We asked the vendors not only to categorize the content but also to show, step by step, how they did it. (Most vendors will interview a company about its taxonomy and provide training on the categorization system.)

At the same time, we evaluated each product here in the Labs, submitting content, accessing administration options, creating scripts and configuration files, and evaluating integration options.

In the end, all the vendors did an effective job, with each leveraging its respective strengths: Auto-Categorizer in category refinement, MetaTagger in content creation and Texis Categorizer in categorization fine-tuning.

Some of the most interesting differences among the products were in how each got to the end point. Those differences illustrate not just the various ways in which content can be categorized but also the key differentiators for determining which system is the best choice for an organization or for a specific application within it.

This eValuation is not meant to be an all-encompassing look at categorization applications but instead is intended to give businesses insight into the different types of categorization applications.

There are many other categorization applications available on the market, with most search engine vendors providing some form of categorization engine.

Another product category to keep in mind when looking at content categorization is taxonomy generation and management applications. These products make it possible for companies with large and complex content sets to create effective and accurate taxonomies. They also often specialize in taxonomies that are geared toward specific industry segments. An example of this type of vendor is Saqeware Inc., which provides a wide set of taxonomy creation and management services.

East Coast Technical Director Jim Rapoza can be reached at

Other Articles in this eValuation:

  • Reviews: Three Paths to Sorting Content
  • eVal Scorecard: Content Categorization
  • Standards Target Categorization