Content categorization doesn't seem that hard, at least until you start doing it. When human beings can disagree about where content belongs in even a small taxonomy, the task starts to seem impossible for software.
The good news is that all three of the programs eWeek Labs tested in this eValuation provided excellent results, both in our labs-based tests and in the work that the vendors did with the content. This is especially impressive given that the performance of all products in this area tends to improve over time.
We tested Applied Semantics Inc.'s Auto-Categorizer 1.1, Interwoven Inc.'s MetaTagger 3.0 and Thunderstone Software LLC's Texis Categorizer 4.1. Although each of these products is very different in its design and in its approach to categorization, we were able to focus on key areas that will concern businesses implementing a categorization system.
We looked at each product's ability to import, edit and possibly create taxonomies, how each product deals with legacy and new content, how categorization is trained and refined, and how each product integrates with other enterprise systems.
In many ways, each of the products that we tested represents a different categorization approach. Businesses interested in categorization should look at each product not only as a stand-alone system but also as a representation of different means to a categorization end.
For example, Interwoven's MetaTagger 3.0 works only with the Interwoven TeamSite content management platform. However, organizations that are not running TeamSite might still find that the approach of integrating categorization with a content management system would best meet their needs.
For other companies, the open approach of Texis Categorizer 4.1 will provide the best opportunity to easily integrate categorization with a wide number and variety of applications and sites.
Auto-Categorizer, meanwhile, offers an ontology-based approach that, while currently limited to areas that the ontology addresses, provides a highly focused and flexible method of creating accurate categorizations.
Whatever the approach, there are several steps that businesses can follow to take some of the pain out of implementing categorization systems. It's extremely important, for example, to know where all content is stored and how it is generated. Some content can be categorized after it is created, but in some cases it might work better to suggest categories during the authoring phase.
Most content categorization system vendors will send analysts to assist organizations in creating taxonomies. This can be extremely helpful, especially for businesses that have no in-house expertise or those that don't tend to classify their content in a standard way. However, it's important for IT managers to come to such a meeting with at least a basic plan in place. Otherwise, you'll end up with the same taxonomy built for other companies similar to yours.
Finally, businesses should know which of their systems will need to integrate with the categorization platform. Both the technical and design aspects of categorization vary radically depending on the systems involved.
In addition, the extent to which a product supports open standards and common development languages will greatly affect the ease with which the product can be integrated with other applications.
Auto-Categorizer 1.1
The main point of differentiation for Applied Semantics' Auto-Categorizer is its application of the company's massive ontology. For a few years now, Applied Semantics has been growing this ontology through extensive sweeps of the Web and through input by linguistics experts. At this point, with more than 1.2 million terms, the ontology can effectively map concepts and meaning to terms, making it highly effective for categorization.
Applied Semantics typically provides Auto-Categorizer as a pre-installed appliance, although companies can also choose to install the server software (which runs on Linux, Solaris and Windows systems) themselves. To administer the system during tests, eWeek Labs accessed it remotely using the Taxonomy Administrator client, which runs only on Windows.
Using this client, we were able to easily create a unique taxonomy for our needs. The product also includes a number of industry-specific taxonomies that can be used as is or customized as needed.
After we created our taxonomy, we mapped each category to one or more concepts in the ontology. For example, for a category such as “mental health,” we could map to concepts such as “depression” and “mental health care” (see screen).
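The mapping step can be pictured with a minimal Python sketch. It is not Applied Semantics' implementation: the dictionary, the "nutrition" category and the function name are hypothetical, and the concepts detected in a document are assumed to arrive as a plain list of strings.

```python
from collections import Counter

# Hypothetical category-to-concept mapping, modeled on the "mental health"
# example in the text. The "nutrition" entry is invented for illustration.
CATEGORY_CONCEPTS = {
    "mental health": {"depression", "mental health care"},
    "nutrition": {"diet", "vitamins"},
}

def score_categories(detected_concepts, category_concepts=CATEGORY_CONCEPTS):
    """Count how many detected concepts map to each category.

    Returns (category, hits) pairs, highest hit count first.
    Categories with no matching concepts are left out.
    """
    scores = Counter()
    for category, concepts in category_concepts.items():
        hits = sum(1 for concept in detected_concepts if concept in concepts)
        if hits:
            scores[category] = hits
    return scores.most_common()
```

Mapping several concepts to one category is what lets a document about depression land under "mental health" even if the category name never appears in the text.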
Auto-Categorizer also allows multiple subcategories, which can either be exposed to end-user queries or kept internal to improve granularity. In tests, it was easy to update categories on a regular basis.
Within the Taxonomy Administrator client is the Gist tool. Using this tool, we could test the effectiveness of categorizations by entering sample content and determining if the correct categories were displayed.
Content can be tested with the Gist tool either by cutting and pasting into the Gist client or by referring to a URL. The Gist tool will then display which categories and concepts it relates to the content being tested.
All data input and output for Auto-Categorizer is done through XML, which makes it possible to communicate with almost any system. For integration with other applications, the product has APIs for C, Java, Perl and Visual Basic.
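To show why an XML-only interface is easy to integrate with, here is a rough Python sketch of a client-side exchange. The element and attribute names (categorize, document, category, name, score) are assumptions made up for this example; the product's actual XML schema is not described here.

```python
import xml.etree.ElementTree as ET

def build_request(doc_id, text):
    """Wrap a document in a hypothetical XML request envelope."""
    root = ET.Element("categorize")                    # assumed element name
    doc = ET.SubElement(root, "document", id=doc_id)   # assumed element and attribute
    doc.text = text
    return ET.tostring(root, encoding="unicode")

def parse_response(xml_text):
    """Read (category name, score) pairs from a hypothetical XML response."""
    root = ET.fromstring(xml_text)
    return [
        (cat.get("name"), float(cat.get("score")))
        for cat in root.iter("category")               # assumed element name
    ]
```

Any language with an XML parser can do the same, which is the practical meaning of being able to communicate with almost any system.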
Pricing for Auto-Categorizer ranges from $140,000 to $160,000, par for the course in this product category.
Because of the nature of the ontology and how the system works, Applied Semantics focuses Auto-Categorizer on specific industry segments. The first iteration (the one we tested) is focused on publishing organizations; a future release will support pharmaceuticals.
MetaTagger 3.0
Interwoven has built its name on Web content management, so it's no surprise that its categorization product, MetaTagger 3.0, is tightly integrated with the Interwoven TeamSite content management platform (see screen).
This is both positive and negative, however. Negative, because only companies that are already using or are planning to implement TeamSite will be able to use MetaTagger. Positive, because although almost all categorization tools can integrate with content management applications, few have the same level of integration that MetaTagger has with TeamSite.
To run MetaTagger, we had to first install TeamSite, which runs on Windows server platforms and Solaris. TeamSite was easy to install, especially for a content management system, and we were soon up and running.
Much of the initial setup of MetaTagger—including taxonomy configuration and the designation of training sets—is done by editing XML-based configuration files. Taxonomies can also be derived from directory structures. Using the MetaSource Editor client, we could easily view, customize and fine-tune category taxonomies.
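Deriving a taxonomy from a directory structure is simple to illustrate. The Python sketch below is a generic version of the idea, not MetaTagger's own logic; the function name and the sample paths are hypothetical, and each directory level is assumed to stand for one category level.

```python
from pathlib import PurePosixPath

def taxonomy_from_paths(paths):
    """Build a nested dict of categories from the directory part of each path.

    File names are ignored; every directory level becomes a category level.
    """
    tree = {}
    for path in paths:
        node = tree
        for part in PurePosixPath(path).parent.parts:
            if part == "/":          # skip the root marker of absolute paths
                continue
            node = node.setdefault(part, {})
    return tree
```

For example, the paths reviews/storage/a.html and reviews/security/b.html would yield a "reviews" category with "storage" and "security" subcategories, which can then be refined by hand.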
Once everything is set up, MetaTagger can be accessed from the browser-based administration interface or from the command line.
Legacy content can be categorized through the use of a command-line batch tool, but the main focus of MetaTagger is categorizing and correctly tagging content as it is created. In both our own tests and in the results generated by Interwoven, we could see how users inside TeamSite could leverage MetaTagger to effectively categorize content as it entered the content management system.
Within the TeamSite interface, we could easily view all category and taxonomy information. It was very simple to accept suggestions and make necessary changes, and if edits changed content, we could generate categories on the fly.
In addition to using categories from a taxonomy, MetaTagger can suggest related topics pulled from 4,600 terms. In tests, this proved useful for determining keywords for Web content, and MetaTagger did a good job of managing some nontraditional content such as multimedia files.
With multimedia files that have good built-in tagging, such as MP3 files, MetaTagger can pull information directly from the file. It is also possible for authors to directly relate multimedia content to specific categories.
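As an illustration of what built-in tagging looks like, the Python sketch below reads an ID3v1 tag, the fixed 128-byte block that many MP3 files carry at the end of the file. Whether MetaTagger reads this particular tag version is an assumption; the function name is hypothetical.

```python
def read_id3v1(data: bytes):
    """Return title, artist, album and year from an ID3v1 tag, or None.

    ID3v1 occupies the last 128 bytes of the file and starts with b"TAG",
    followed by fixed-width fields: title (30), artist (30), album (30),
    year (4), comment (30) and a one-byte genre code.
    """
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    tag = data[-128:]

    def field(start, length):
        raw = tag[start:start + length]
        # Fields are padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
    }
```

A categorizer can feed fields like these into the same pipeline it uses for text, which is why well-tagged media is far easier to classify than untagged media.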
Any company interested in MetaTagger has probably already made a six-figure investment in the TeamSite content management system. MetaTagger costs from $85,000 to $110,000 per server, depending on the customer deployment.
Texis Categorizer 4.1
Traditionally, categorization applications have come from search engine vendors such as Thunderstone, and these applications still make up the biggest chunk of the categorization market.
Texis Categorizer is an excellent example of this type of categorization application, with a great deal of flexibility in its implementation and with the ability to easily integrate with other systems, especially Web-based applications.
Texis Categorizer runs on almost anything, from Windows servers to most flavors of Unix. We ran our test system on a Linux box.
The main components underlying Texis Categorizer are the Texis SQL database and the Vortex scripting engine, which uses standard CGI (Common Gateway Interface) scripting. Almost any Web developer will be able to jump into this system very quickly.
Once the initial scripts are set up, which includes defining the category taxonomy, much of the remaining work is done in an easy-to-use, browser-based interface.
For each category in the taxonomy, Thunderstone recommends using a training set of about 20 documents. For example, in the eWeek Labs taxonomy, we would use 20 reviews of storage products to train the application on how to categorize content on storage.
However, even if the training proves incomplete, Texis Categorizer makes it easy to fine-tune categories. During tests, for example, we could load uncategorized content into the interface and adjust categories as needed. With each new piece of content, we could see the accuracy of the categorization improve (see screen).
In addition, if we needed to change categorization information, we could uncategorize content that had already been processed by the system, then re-enter it.
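The train, correct and retrain cycle can be sketched in a few lines of Python. This is a generic word-frequency categorizer with cosine similarity, swapped in for illustration only; it is not the algorithm Texis Categorizer uses, and every name in it is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class CentroidCategorizer:
    def __init__(self):
        # One accumulated word-count profile per category.
        self.profiles = defaultdict(Counter)

    def train(self, category, text):
        """Add a training document to a category's profile."""
        self.profiles[category].update(tokenize(text))

    def untrain(self, category, text):
        """Remove a previously trained document, e.g. before re-entering it."""
        self.profiles[category].subtract(tokenize(text))

    def categorize(self, text):
        """Return the best-matching category, or None if nothing matches."""
        doc = Counter(tokenize(text))
        best, best_score = None, 0.0
        for category, profile in self.profiles.items():
            score = cosine(doc, profile)
            if score > best_score:
                best, best_score = category, score
        return best
```

Each additional training document sharpens a category's profile, which mirrors the improvement we saw as we fed more content into the system.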
The price of Texis Categorizer is well below that of many competing products: $10,000 for the Texis engine and $10,000 for Categorizer.