Step No. 2: Cleanse the Data

By Jim Falgout  |  Posted 2011-02-16

Step No. 2: Cleanse the data

Now that the operations team knows what data is available and what state it is in, they need to cleanse the data in preparation for modeling. During data discovery, they noticed that a large percentage of call records are missing values for important fields. The records are needed, so the team develops a plan for calculating the missing data. Again, they need a tool that can apply these transformations to the call records and fill in the missing values.
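As a rough illustration of what "calculating the missing data" might look like, here is a minimal Python sketch. The field names (caller, start, end, duration) and the two fill rules are assumptions made for this example, not details from the team's actual plan: a missing duration is first derived from the call's timestamps, and otherwise falls back to the median of that caller's known durations.

```python
from statistics import median

# Hypothetical call records; None marks a missing value.
records = [
    {"caller": "A", "start": 0, "end": 120, "duration": 120},
    {"caller": "A", "start": 300, "end": 360, "duration": None},
    {"caller": "A", "start": 500, "end": None, "duration": None},
    {"caller": "B", "start": 10, "end": None, "duration": 45},
]

def fill_missing_durations(rows):
    """Return copies of rows with 'duration' filled in where possible."""
    filled = [dict(r) for r in rows]  # leave the input untouched

    # Pass 1: derive the duration from timestamps when both are present.
    for r in filled:
        if r["duration"] is None and r["start"] is not None and r["end"] is not None:
            r["duration"] = r["end"] - r["start"]

    # Pass 2: fall back to the median of the caller's known durations.
    known = {}
    for r in filled:
        if r["duration"] is not None:
            known.setdefault(r["caller"], []).append(r["duration"])
    for r in filled:
        if r["duration"] is None and r["caller"] in known:
            r["duration"] = median(known[r["caller"]])
    return filled

cleaned = fill_missing_durations(records)
```

Records that cannot be filled by either rule keep their missing value, so a later step can decide whether to drop them.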

The operations team also notices that call records of a specific type are redundant for their calculations. They use their cleansing tool to drop those records to prevent downstream problems. They also exclude records from known high-usage customers.
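The two exclusion rules can be sketched as a simple filter. This is only an illustration: the record type "signaling", the customer ID "C-1001", and the field names are hypothetical placeholders, since the article does not say which record type or which customers are involved.

```python
# Hypothetical values; the real record type and customer list are not given.
REDUNDANT_TYPES = {"signaling"}
HIGH_USAGE_CUSTOMERS = {"C-1001"}

def keep_record(record):
    """Return True if the record should survive cleansing."""
    if record["type"] in REDUNDANT_TYPES:
        return False  # redundant for the calculations
    if record["customer"] in HIGH_USAGE_CUSTOMERS:
        return False  # known high-usage customer
    return True

call_records = [
    {"customer": "C-1001", "type": "voice"},
    {"customer": "C-2002", "type": "signaling"},
    {"customer": "C-2002", "type": "voice"},
]

kept = [r for r in call_records if keep_record(r)]
```

Keeping the rules in one predicate makes it easy to add or remove an exclusion as the team iterates.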

Step No. 3: Iteratively develop a working model

Now the operations team is ready to apply a data mining technique to their data. They decide to cluster the data using a set of attributes they deemed important during the discovery phase. A visual, workflow-oriented tool is extremely useful here because this phase of the project is highly iterative. They also need a tool that can handle large amounts of data with ease. Because they are looking for fraud, sampling is not a viable option: a sample could easily miss the rare records they are trying to find.

The operations team employs a modeling tool that can ingest their cleansed data and build a clustering model. They apply the model to a set of test data and visualize the results. The clusters look interesting, but no telling patterns appear. They adjust the algorithm and try different attributes, applying normalization where it fits. After several iterations, they settle on a model that appears to work. The visualization shows outliers that, on closer inspection, appear to be fraud.
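To make the normalize, cluster, and inspect loop concrete, here is a small self-contained Python sketch. It is not the team's modeling tool: it assumes k-means as the clustering algorithm, min-max scaling as the normalization, two made-up attributes (calls per day and average minutes per call), and a hand-picked distance threshold for flagging outliers. All of these are illustrative choices.

```python
import math

def min_max_normalize(points):
    """Scale every attribute to the range [0, 1]."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]

def k_means(points, k, iterations=20):
    """Plain k-means with a deterministic start (evenly spaced input points)."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def find_outliers(points, centroids, threshold):
    """Return indexes of points farther than threshold from every centroid."""
    return [
        i for i, p in enumerate(points)
        if min(math.dist(p, c) for c in centroids) > threshold
    ]

# Made-up test data: (calls per day, average minutes per call).
usage = [
    (5, 2), (6, 3), (5, 3), (7, 2),          # light users
    (40, 10), (42, 11), (41, 9), (39, 10),   # heavy users
    (100, 60),                               # unusual record
]

scaled = min_max_normalize(usage)
centers = k_means(scaled, k=2)
suspects = find_outliers(scaled, centers, threshold=0.5)
```

In this toy data set only the last record sits far from both cluster centers. Changing k, the attributes, or the threshold and rerunning mirrors the iteration described above; a flagged record is a candidate for inspection, not proof of fraud.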

Jim Falgout has 20+ years of large-scale software development experience and is active in the Java development community. As Chief Technologist for Pervasive DataRush, Jim is responsible for setting innovative design principles that guide Pervasive's engineering teams as they develop new releases and products for partners and customers. Prior to Pervasive, Jim held senior positions with NexQL, Voyence, Net Perceptions/KD1, Convex Computer, Sequel Systems and E-Systems. An officer with Toastmasters, Jim is an experienced public speaker. He has delivered presentations to user groups, client conferences and IEEE working groups. He is a popular analyst and go-to thought leader. Jim has a Bachelor's degree (cum laude) in Computer Science from Nicholls State University. He can be reached at
