Here’s the latest edition of a new occasional feature in eWEEK called IT Science, in which we look at what really happens at the intersection of new-gen IT and legacy systems.
These articles describe industry solutions. The idea is to look at real-world examples of how new-gen IT products and services are making a difference in production each day. Most of them are success stories, but there will also be others about projects that blew up. We’ll have IT integrators, system consultants, analysts and other experts helping us with these as needed.
We’ve published similar articles in the past, but the format has evolved. We’ll keep them short and clean, and we’ll add relevant links to other eWEEK articles, whitepapers, video interviews and, occasionally, outside expertise as needed to tell the story.
An important feature, however, is this: We will report whatever ROI is available in each article, whether it’s bottom-line income, labor hours saved or some other valued business asset.
Today’s IT Science Feature: Deutsche Börse
Germany-based Deutsche Börse AG (or Deutsche Börse Group) is a marketplace organizer for the trading of shares and other securities, as well as a transaction services provider. Information for this edition of IT Science was provided by Konrad Sippel, head of the Content Lab and senior advisor at Deutsche Börse.
Name the problem being solved: The Content Lab of Deutsche Börse is a data-driven R&D team that collects, analyzes and enriches data from the entire value chain of trading, clearing and settlement. In this function, the team consumes all sorts of structured and unstructured data sets from various areas of the business. Typically, the ingestion, cleansing and normalization of these data sets occupies a large part of the data scientists’ time on any given project. We brought in Trifacta to help our data science team with data ingestion, cleansing and preparation, so the team can spend its time generating insights and analytics rather than formatting tables.
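To make that time sink concrete, here is a minimal Python/pandas sketch of the kind of hand-coded cleansing and normalization work being described. The column names, formats and rules are hypothetical illustrations, not Deutsche Börse’s actual data or pipeline:

```python
import pandas as pd
from datetime import datetime

# Hypothetical raw securities data with the usual defects: stray
# whitespace, mixed casing, mixed date formats, inconsistent
# decimal separators and missing values.
raw = pd.DataFrame({
    "isin": [" DE0005810055", "de0005810055 ", None],
    "trade_date": ["2018-03-01", "01.03.2018", "2018/03/01"],
    "price": ["101.5", "101,5", ""],
})

# Normalize identifiers: strip whitespace, uppercase, and drop
# rows that have no identifier at all.
raw["isin"] = raw["isin"].str.strip().str.upper()
raw = raw.dropna(subset=["isin"])

def parse_date(value):
    """Try each known source format in turn; None if all fail."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue
    return None

# Coerce dates from several source formats into one canonical type.
raw["trade_date"] = raw["trade_date"].map(parse_date)

# Normalize decimal separators and cast prices to floats
# (unparseable values become NaN rather than crashing the run).
raw["price"] = pd.to_numeric(
    raw["price"].str.replace(",", ".", regex=False), errors="coerce"
)

print(raw)
```

Multiply this across every dataset and every project, and it is easy to see how preparation work crowds out analysis.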
Describe the strategy that went into finding the solution: We trial new big-data tools and software on a regular basis. As a result, we ran proof-of-concept (PoC) projects with various data wrangling tool providers, including Trifacta. During our PoC, we duplicated work that had previously been done on a particularly dirty dataset of historical fixed-income reference data; we were able to redo months of work in a matter of weeks.
List the key components in the solution: We run Trifacta’s data wrangling software within our cloud-based research setup on AWS, on top of a Cloudera-based cluster. Trifacta is a platform for exploring and preparing data for analysis, and it works with both cloud and on-premises data platforms.
Trifacta is designed to allow analysts to explore, transform and enrich raw, diverse data into clean and structured formats for analysis through self-service data preparation.
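The core idea behind self-service data preparation tools of this kind is that cleaning steps are recorded once as a repeatable "recipe" and then replayed against each new delivery of the data. The rough sketch below expresses that pattern in plain Python as an analogy; it is not Trifacta’s actual recipe language or API:

```python
import pandas as pd

# A "recipe" here is just an ordered list of named transformation
# steps; self-service data-prep tools record steps like these
# interactively and replay them on each new delivery of the data.
RECIPE = [
    ("normalize identifiers",
     lambda df: df.assign(isin=df["isin"].str.strip().str.upper())),
    ("drop rows with no identifier",
     lambda df: df.dropna(subset=["isin"])),
    ("deduplicate on identifier and date",
     lambda df: df.drop_duplicates(subset=["isin", "trade_date"])),
]

def apply_recipe(df):
    """Replay every recorded step against a fresh dataset."""
    for name, step in RECIPE:
        df = step(df)
        print(f"applied: {name} -> {len(df)} rows")
    return df

# The cleaning logic is written once and reused per delivery,
# instead of being rewritten at the start of every project.
new_delivery = pd.DataFrame({
    "isin": ["DE0005810055 ", "de0005810055", None],
    "trade_date": ["2018-03-01", "2018-03-01", "2018-03-02"],
})
clean = apply_recipe(new_delivery)
```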
Describe how the deployment went, perhaps how long it took, and if it came off as planned: For the PoC, the software was installed in a temporary environment in less than a day. The permanent installation was set up in parallel with the Cloudera-based R&D cluster, which also went smoothly within a few days.
Describe the result, new efficiencies gained, and what was learned from the project: We have fully integrated Trifacta into our data science process and technology stack. New data that we acquire or access is ingested through Trifacta, so data scientists start from a clean dataset with minimal effort spent on cleaning and preparation. Data scientists can also use Trifacta to combine or further modify datasets more efficiently.
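One way to read "ingested through Trifacta so data scientists start from a clean dataset" is as a quality gate at the front of the pipeline. A hedged sketch of such a gate follows; the schema and thresholds are illustrative assumptions, not the team’s actual rules:

```python
import pandas as pd

REQUIRED_COLUMNS = {"isin", "trade_date", "price"}  # assumed schema
MAX_NULL_FRACTION = 0.05                            # assumed threshold

def ingest(df):
    """Admit a dataset into the research environment only if it
    passes basic schema and completeness checks."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"rejected: missing columns {sorted(missing)}")
    worst = df[sorted(REQUIRED_COLUMNS)].isna().mean().max()
    if worst > MAX_NULL_FRACTION:
        raise ValueError(
            f"rejected: {worst:.1%} nulls exceeds "
            f"{MAX_NULL_FRACTION:.0%} threshold"
        )
    return df  # downstream analysis starts from a vetted dataset
```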
Describe ROI, carbon footprint savings, and staff time savings, if any: We are at an early stage of using the software, so hard figures are not yet available. We strongly believe that gains in data preparation and ingestion will substantially increase data scientist productivity and reduce the time required to ingest new datasets.
If you have an idea for an IT Science case study, email cpreimesberger@eweek.com.