Analysts say that following the Ascential acquisition—only the latest in a string of acquisitions over the past years—IBM is well on its way to becoming the leader in the field of data integration.
But analysts and customers also say that they haven’t yet heard the full metadata story. That’s important, because metadata is the glue that binds it all together. As vendors seek to tie together disparate integration tools such as ETL (extraction, transformation and loading), EAI (enterprise application integration) and data profiling into one integrated platform, this metadata story is crucial.
Whether you’re talking about operational metadata that describes data’s journey from source to data warehouse (including data origin and travel speed) or business process metadata that describes what a “transaction” is and what business processes it encompasses, customers will need to know details of how that metadata is handled.
How automatic will discovery be? How deep will metadata descriptions be?
Database Editor Lisa Vaas sat down with Nelson Mattos, vice president of strategy for IBM’s Information Integration Solutions, to get this metadata story.
You say that many of the answers to the metadata questions are held within the history of IBM’s Information Management strategy. Could you give us a summary?
The first step of our metadata strategy was to focus on operational metadata.
What I mean by that is the metadata that is necessary for my integration platform to be able to operate. It describes the fact that I have, [say,] an Oracle database that I’m integrating, that that Oracle database runs on an AIX machine, that the AIX machine has 2GB of memory, that the speed of the network that connects my integration platform to Oracle can move so many bits per second, and so on and so forth.
The whole focus of the operational metadata effort was so that early adopters can use [this] information technology in well-defined projects: [i.e.,] where the data understanding was known, so I know that I’m building, say, a portal [to deliver] an integrated view of customers, and I know the data is coming from Siebel or Oracle [etc.].
Obviously, we have always understood that business metadata and semantic metadata are very, very critical for integration, both for runtime of the integration platform as well as for the tools that business users, data integrators, administrators, data architects and developers will use.
Another key aspect of our metadata strategy is the delivery of the open-source Eclipse tool framework. Eclipse provided tool integration at the metadata level, particularly for application developers.
Proof that this has been very successful: We have hundreds of vendors providing Eclipse plug-ins so that these tools can interoperate at the metadata level, and not just within the context of [any given] vendor’s product. A customer can use the Eclipse model framework across products from distinct vendors. Common metadata was absolutely the key point that enabled that success.
Then we said OK, step No. 3 was to expand the operational metadata and tool metadata from Eclipse to the world of information integration.
We had kicked off the Project Serrano a couple of years ago. … What Serrano does is extend that Eclipse framework to allow data architects and developers to find, to model, to visualize, to relate and to develop data assets, so that they can very quickly understand what data assets exist, and so they can very quickly discover relationships between data assets. So you have a table that represents customers. The tools can discover the relationship between that customer table and a transaction table, such as an order entry table. Whether [data relating to] those customers are in Oracle, the transaction is in DB2 or SQL Server [or what have you].
I can discover the assets, the relationships, and I can use that to accelerate or speed up development of applications. The developer does not need to do any of that.
Second, because it’s all built into the metadata infrastructure, I can reuse it. If I have a new application, I don’t need to go and ask the system to go rediscover the transaction table or any of that. All of that is available for reuse.
Finally, you can now use this metadata to assess, What would be the impact to my whole application, to my infrastructure, if I make changes to the Oracle system, or to the DB2 environment?
That’s what we are delivering with Serrano.
Then we acquired Ascential. It was already in the path of unifying metadata for all their tools, so that the profiling, quality, ETL [tools] could all leverage the same metadata infrastructure, combined with a single, unified repository that will be coming out in the “Hawk” release.
So now that we have acquired Ascential, it’s easy for us to unify this roadmap. Why? Ascential was focused on providing a common repository. We were focused on extending the Eclipse tool framework. Both things can sit on top of each other.
With that I have infrastructure to support the operational metadata, the architect and the developer, in Serrano and Ascential, and Ascential had support for the business user as part of their Hawk release.
People are wondering, though, How is IBM going to manage the metadata in a federated environment?
When I build a data warehouse, I am extracting data from different systems and, in general, heterogeneous systems. The only difference between federated and warehouse environments is I do consolidation before I access the data: I move all data to one place. In the case of federation, I leave the data where it is and access it when [the request] comes in. In both cases, I need to know where the data is stored, where it’s coming from, and what kinds of transformations I need to do. Whether I decide to extract and transform beforehand or in real time, the problem from the metadata perspective is exactly the same: I need to understand Oracle tables, DB2 tables, the relations between them and what transformations I need to do.
People tend to think federated means ignoring the metadata problem. That’s absolutely not the case. I have to invest in the metadata infrastructure for both data warehouses and federation.
To what depth does the metadata go?
Our goal is to be able to handle the operational metadata: e.g. which systems … [and] development metadata: e.g. what is the data model I’m dealing with? What relations … and business metadata: e.g. what kind of summaries I’m seeing, what kinds of aggregations, [etc. for the business user].
Business managers don’t need to know the plumbing or that it’s coming from Oracle or a Siebel system. They want to see what is their profitability.
When a new application comes online, how does that new metadata get reconciled with the existing metadata? Is it automatic, or are we talking about an army of people dedicated to metadata?
As you probably know and analysts know, there have been a lot of efforts in the early ’90s around metadata management. These have not been successful because the metadata repository was independent of the tools. It was kind of a tool by itself, not fully integrated with the information management system. Every time I changed the information management system, say a database, somebody had to update in the metadata repository that the table had been modified.
That’s not the approach we have. Our metadata repository is fully integrated with the information integration platform. [It has] all the tooling that consumers of metadata need to use that. Because it’s integrated with the information management infrastructure, I can discover metadata automatically. I can monitor when changes happen or a new application gets added. If it’s a change, I can do analysis and tell the consumer of metadata what needs to be modified.
You don’t need hundreds of human beings sitting there extracting information from the system. All that happens, basically, automatically.
As I understand it, the problem with metadata is it’s a negotiation between human beings as to what a given entity, say, a transaction, encompasses: what business processes it encompasses. It’s never cut and dried.
There has been a lot of research in the last few years. [We’ve gained] the ability to develop taxonomy that can model these different terms across the enterprise, and relations between them, in a hierarchical fashion. We also have things like versioning. Let’s say a version evolves over time. Maybe I have a different version when I employ my Siebel platform.
[We’re] allowing terminology, business objects, to evolve over time using versioning. The whole notion of taxonomy and the ability to do semantics did not exist in the early ’90s when metadata systems were developed.
None of this sounds easy.
Integration is not an easy problem. I am not saying that this is a piece of cake. It is not. And this is why IBM, years ago, decided to go after this space, knowing we’d have to make significant investments in this space.
We have over 1,500 people working in information integration today. We have a history of taking on the big challenges.
I feel very comfortable that we are on the right path. With the delivery of Hawk, we’ll be able to show that [IBM] can take a handle on metadata challenges in support of the four major consumers: business users, data architects, developers and administrators.
Think of it: who [else] has spent $1.1 billion on technology for information integration [as IBM spent on Ascential]? Has anybody come [even] close to spending that much money?