Step No. 2: Cleanse the Data
Now that the operations team has a good idea of what data is available to them and in what state, they'll need to cleanse the data in preparation for modeling. During the discovery phase, they noticed that a large percentage of call records are missing values for important fields. Because those records are still needed, they develop a plan for imputing the missing values. Again, a tool is required to apply these transformations to the call records and fill in the gaps.
The operations team also notices that call records of a specific type are redundant for their calculations. They use their cleansing tool to drop those records and prevent downstream problems. They'll also want to exclude records from known, high-usage customers.
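The cleansing plan described above can be sketched in a few lines of Python. The record fields, the redundant record type, and the high-usage customer list are all hypothetical stand-ins, and mean imputation is just one reasonable choice for filling in missing values:

```python
# Hypothetical call records; field names and values are illustrative only.
records = [
    {"id": 1, "type": "voice", "customer": "A", "duration": 120},
    {"id": 2, "type": "voice", "customer": "B", "duration": None},   # missing field
    {"id": 3, "type": "test",  "customer": "A", "duration": 5},      # redundant type
    {"id": 4, "type": "voice", "customer": "HEAVY1", "duration": 300},
]

REDUNDANT_TYPES = {"test"}   # assumed: record types the team drops
HIGH_USAGE = {"HEAVY1"}      # assumed: known high-usage customers to exclude

# 1. Drop redundant record types and known high-usage customers.
kept = [r for r in records
        if r["type"] not in REDUNDANT_TYPES and r["customer"] not in HIGH_USAGE]

# 2. Impute missing durations with the mean of the observed durations.
observed = [r["duration"] for r in kept if r["duration"] is not None]
mean_duration = sum(observed) / len(observed)
for r in kept:
    if r["duration"] is None:
        r["duration"] = mean_duration
```

In practice this kind of rule-based filtering and imputation is exactly what a dedicated cleansing tool automates at scale; the sketch just makes the individual decisions explicit.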
Step No. 3: Iteratively develop a working model
Now the operations team is ready to apply a data mining technique to their data. They decide to cluster the data using a set of attributes they deemed important during the discovery phase. A visual, workflow-oriented tool is extremely useful here, as this phase of the project is highly iterative. They also need a tool that can handle large amounts of data with ease. Since they are looking for fraud, sampling is not a viable option: fraudulent records are rare, and a sample could easily discard the very outliers they are hunting for.
The operations team feeds the cleansed data into a modeling tool and builds a clustering model. They apply the model to a set of test data and visualize the results. The clusters look interesting, but no telling patterns emerge. They adjust the algorithm and try different attributes, applying normalization where appropriate. After some iteration, they settle on a model that appears to be working. The visualization shows outliers that, on closer inspection, appear to be fraud.
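The cluster-then-inspect-outliers loop above can be illustrated with a minimal sketch, assuming k-means as the clustering algorithm, min-max normalization of the attributes, and a made-up cutoff for what counts as a suspiciously small cluster. The attribute names and data values are invented for the example:

```python
import math
import random

def normalize(points):
    """Min-max scale each attribute to [0, 1] so no attribute dominates."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    return [tuple((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1) for d in range(dims))
            for p in points]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns k clusters as lists of point indices."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(i)
        centroids = [
            tuple(sum(points[i][d] for i in cl) / len(cl)
                  for d in range(len(points[0])))
            if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return clusters

# Hypothetical records as (calls_per_day, avg_duration_sec);
# the last record behaves anomalously.
raw = [(10, 60), (12, 55), (11, 65), (9, 58), (200, 5)]
clusters = kmeans(normalize(raw), k=2)

# Members of tiny clusters are candidate fraud; 25% is an assumed cutoff.
suspects = [raw[i] for cl in clusters if len(cl) < 0.25 * len(raw) for i in cl]
```

Real fraud cases hinge on this last step: the model only surfaces candidates, and a human still inspects each suspect record before calling it fraud, which is why the iterative visualize-adjust loop matters.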