Many companies believe that they’ll be able to implement enterprise search platforms without putting much work into them. But, according to Yves Schabes, president of Teragram, an effective enterprise search system requires an initial investment of time, which will pay dividends once your content management system is integrated and automated within your enterprise search system.
As consumers, all we see when surfing the Web is the final layer of search, so we assume that enterprise search should just be that easy.
However, the effectiveness of internet search ranking relies heavily on the availability of naturally occurring metadata, which is generated through Internet hyperlinking. Each time someone links text to a Web page, the linked text is interpreted by Internet search engines as metadata about this particular page, thus impacting a page’s ranking on the Web search results.
Unlike the Internet, an enterprise has no textual links between documents, and no implicitly created metadata that a search engine can use.
The success of an enterprise search deployment therefore relies heavily on the automatic creation of metadata. To achieve accurate page ranking in enterprise search, three things must occur: the assignment of metatags to content, the creation of taxonomies, and occasional checks and balances from an IT professional or information architect.
1) <i>Install a system to automate the creation of metadata for existing content, and new content as it’s added to the server.</i> Once a set of metadata tags has been defined, newly created documents go through an automatic metadata creation step. A metadata generation server program can be accessed through various programming interfaces, such as a Java API or SOAP APIs.
Several kinds of metadata (e.g., people, locations, dates, product types, and relationships between entities) can be generated automatically through the use of metadata management tools. The metadata generation server API will pull documents from the document source (disk share, document management system or content management system) and produce the metadata automatically for each document. The document metadata can be stored either in a metadata repository (a database associating a document identifier with its metadata), alongside the document in a document management system, or within a structured document (i.e.,
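The pull, generate and store flow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Teragram’s API: the gazetteers `KNOWN_PEOPLE` and `KNOWN_LOCATIONS`, the ISO-only date pattern, and the functions `generate_metadata` and `build_repository` are all hypothetical names, and an in-memory dictionary stands in for the metadata repository.

```python
import re
from typing import Dict, List

# Hypothetical gazetteers (assumption): a real deployment would rely on far
# larger lists or on a trained entity extractor.
KNOWN_PEOPLE = {"Yves Schabes", "Emmanuel Roche"}
KNOWN_LOCATIONS = {"Cambridge", "Philadelphia"}

# Assumption: only ISO-style dates (YYYY-MM-DD) are recognized in this sketch.
DATE_PATTERN = re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b")


def generate_metadata(text: str) -> Dict[str, List[str]]:
    """Produce metadata for one document by simple string and regex matching."""
    return {
        "people": sorted(p for p in KNOWN_PEOPLE if p in text),
        "locations": sorted(loc for loc in KNOWN_LOCATIONS if loc in text),
        "dates": sorted(set(DATE_PATTERN.findall(text))),
    }


def build_repository(documents: Dict[str, str]) -> Dict[str, Dict[str, List[str]]]:
    """Associate each document identifier with its generated metadata."""
    return {doc_id: generate_metadata(text) for doc_id, text in documents.items()}


# Stand-in for documents pulled from a disk share or content management system.
docs = {
    "memo-001": "Yves Schabes met the team in Cambridge on 2008-03-14.",
    "memo-002": "Quarterly review scheduled for 2008-06-30 in Philadelphia.",
}
repository = build_repository(docs)
print(repository["memo-001"])
```

Keying the repository by document identifier mirrors the "metadata repository" storage option; the same dictionary could just as easily be written alongside each document instead.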
2) Categorize information into logical groups based on folksonomies, taxonomies and ontologies. Automatic categorization software can help with this process by “reading” the metatags assigned in step one and grouping words, phrases, entities and events into proper bins, depending on relationships.
The software can then extract keywords, entities and concepts automatically from documents, and instantaneously associate topics with each document. The ability to cross-reference documents in enterprise search is invaluable, especially when attempting to find related documents. Index these topics and categories with associated documents to create an efficient data structure that allows for fast retrieval. By giving users an alternative, faceted search (taxonomy) interface to supplement the standard keyword search they’re used to, you’re more apt to achieve greater recall.
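As a rough sketch of how categorization and indexing fit together, the Python below assigns documents to taxonomy categories and builds a category-to-document index that a faceted interface could query. The `TAXONOMY` contents and the functions `categorize`, `build_facet_index` and `faceted_search` are hypothetical, and simple term matching stands in for whatever categorization software is actually deployed.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy (assumption): category name -> terms that trigger it.
TAXONOMY = {
    "drugs": {"aspirin", "ibuprofen"},
    "symptoms": {"headache", "fever"},
}


def categorize(text):
    """Return every category whose trigger terms appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {cat for cat, terms in TAXONOMY.items() if words & terms}


def build_facet_index(documents):
    """Map each category to the set of document ids assigned to it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for category in categorize(text):
            index[category].add(doc_id)
    return index


def faceted_search(index, documents, category, keyword=None):
    """Browse by category, optionally narrowing the results by keyword."""
    hits = index.get(category, set())
    if keyword:
        hits = {d for d in hits if keyword.lower() in documents[d].lower()}
    return sorted(hits)


docs = {
    "d1": "Aspirin reduced the patient's fever within hours.",
    "d2": "Ibuprofen dosage guidelines for adults.",
    "d3": "Persistent headache reported after the trial.",
}
index = build_facet_index(docs)
drug_docs = faceted_search(index, docs, "drugs")
headache_docs = faceted_search(index, docs, "symptoms", keyword="headache")
print(drug_docs, headache_docs)
```

Because the index is built once and looked up by category, browsing a facet does not require rescanning every document, which is the "efficient data structure" the step calls for.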
3) Define what you want to understand from the documents in advance, and check that automated systems align with these goals. For example, a pharmaceutical company needs to define types of drugs, symptoms, etc., while a financial services company needs to define quarterly earnings statements, market capitalization, stock tickers, etc.
As terms within an organization evolve, and new terms enter a company’s vernacular, automatic metadata generation systems and taxonomies may need to be re-evaluated. Terms may need to be added, and new cross-references assigned. Perform a systematic human check of your automated search and content management tools at least once per quarter.
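One way to support the quarterly human check is to flag terms that recur in recent documents but are missing from the taxonomy, so a reviewer can decide whether to add them. The sketch below is an assumption about how such a report might be produced; `candidate_terms`, the `STOPWORDS` list and the `min_docs` threshold are hypothetical and would need tuning for real content.

```python
import re
from collections import Counter

# Hypothetical stopword list (assumption): common words to ignore.
STOPWORDS = {"the", "to", "for", "is", "were", "under", "next"}


def candidate_terms(documents, known_terms, min_docs=2):
    """List terms absent from the taxonomy that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in documents:
        words = set(re.findall(r"[a-z]+", text.lower()))
        # Count each unknown term once per document (document frequency).
        doc_freq.update(words - known_terms - STOPWORDS)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


recent_docs = [
    "The biosimilar launch slipped to next quarter.",
    "Pricing for the biosimilar is under review.",
    "Aspirin sales were flat.",
]
known_terms = {"aspirin"}
new_terms = candidate_terms(recent_docs, known_terms)
print(new_terms)
```

The output is a review list rather than an automatic update: a person still decides which candidates belong in the taxonomy and how they should be cross-referenced.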
Steps one and two are the most time-consuming at the start of the search project, but are unlikely to need revisiting unless the system is achieving poor recall. Automatic categorization, metadata generation and taxonomy management will allow you to build a semantic search system in your organization, where relevant documents replace nebulous keywords.
Dr. Yves Schabes co-founded multilingual natural language technology company Teragram Corporation with Dr. Emmanuel Roche in 1997. Dr. Schabes has spent the past fifteen years working on issues relating to natural language processing and computer science. He is the author, or editor, of more than fifty international scientific publications, including co-editor, with Emmanuel Roche, of Finite-State Language Processing (1997, MIT Press, Cambridge MA).
Dr. Schabes is also an Associate to the Division of Applied Science, Harvard University, Cambridge, MA. Prior to founding Teragram, Dr. Schabes was a Senior Scientist at Mitsubishi Electric Research Laboratories in Cambridge, MA. He received a Ph.D. in Computer Science from the University of Pennsylvania, Philadelphia, PA, in 1990 and a Master of Science in Electrical Engineering from l’École Supérieure d’Électricité (France) in 1985.