Enterprises are filled with custom applications such as claims processing or fraud detection that routinely analyze large and growing data sets. So, how do these enterprises benefit from recent advances in data-intensive analytics? The answer is embedded analytics.
Embedded analytics lets you build into your existing software applications the same analytics capabilities that software giants such as SAS and SPSS deliver in their stand-alone packages. For example, if your in-house system currently processes health insurance claims, you can improve fraud detection by applying models derived from historical data to flag suspicious claims.
Fortunately, there are best practices for adding analytics to your software, and powerful tools are available that make it relatively easy and efficient at both design time and runtime. I will illustrate this with a use case that conveys the high-level steps needed to enhance an existing application with high-value analytics.
A telecom use case
The use case involves a midsized telecommunications company. The company's operations team continuously receives log files from the switches in its networks. Those logs contain call detail records (CDRs) that capture precise details about each call a switch services; in fact, a single call may produce many CDRs. The operations team also receives logs from many of its operational servers. In all, the volume, velocity and variety of the data can overwhelm an already overworked operations team.
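To make the data concrete, here is a minimal Python sketch of what ingesting such a log might look like. The field layout, delimiter and file handling are assumptions for illustration only; real CDR formats vary by switch vendor, and the article does not specify a schema.

```python
import csv
from dataclasses import dataclass

# Hypothetical CDR layout, for illustration only; real switch logs differ by vendor.
@dataclass
class CDR:
    call_id: str
    caller: str
    callee: str
    region: str
    start_time: str
    duration_sec: int
    record_type: str

def parse_cdr_log(path):
    """Yield CDR records from a pipe-delimited switch log (assumed format)."""
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="|"):
            if len(row) != 7:
                continue  # skip malformed lines rather than failing the whole batch
            yield CDR(row[0], row[1], row[2], row[3], row[4], int(row[5]), row[6])
```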
The operations team has done a great job of extracting the information needed from the CDRs for required company operations such as billing. However, they suspect they may be losing money to fraud. They already have processes in place to scrub the logs for those routine needs, but they would like to extend them to add some level of fraud detection.
Here are five steps the operations team follows to move the fraud detection project from conceptual to production:
Step No. 1: Understand the data
Even though the operations team has been looking at CDR data for years, this is their first foray into deeper analytics. They begin by taking a more in-depth look at the data and determining which attributes may be useful. They suspect the fraud is being committed with specific phone numbers distributed across regions, so they start there.
A data profiling tool is extremely useful at this point. A good profiling tool lets the operations team begin to discover trends in the data. Until now, the team has used only a limited set of the fields in the CDRs, so they need to verify the quality of the values in the fields targeted for the fraud detection project.
Along with a good profiling tool, the operations team also needs a data analytics tool. Besides profiling the data for discovery, they'll also want to experiment with different aggregations of the data. For example, aggregating the data by region and phone number may show that certain phone numbers are used much more than others. That in itself does not mean their use is fraudulent, but it may help support the team's suspicions.
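As a rough sketch of this kind of profiling and aggregation, the snippet below uses pandas to check field quality and group hypothetical CDR data by region and phone number. The library, file name and column names are assumptions; the article does not name the profiling or analytics tools the team uses.

```python
import pandas as pd

# Assumed columns: caller, region, duration_sec; real CDR field names will differ.
cdrs = pd.read_csv("cdrs.csv", dtype={"caller": str})

# Quick profile: null counts and basic statistics for the fields of interest.
print(cdrs[["caller", "region", "duration_sec"]].isna().sum())
print(cdrs["duration_sec"].describe())

# Aggregate by region and phone number to surface unusually heavy callers.
usage = (
    cdrs.groupby(["region", "caller"])
        .agg(calls=("caller", "size"),
             total_minutes=("duration_sec", lambda s: s.sum() / 60))
        .sort_values("calls", ascending=False)
)
print(usage.head(20))
```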
Step No. 2: Cleanse the data
Now that the operations team has a good idea of what data is available to them and in what state, they need to cleanse it in preparation for modeling. During the discovery phase, they noticed that a large percentage of call records are missing values for important fields. The records are still needed, so they develop a plan for imputing the missing values. Again, a tool is needed to implement the transformations that fill in those values on the call records.
The operations team also notices that call records of a specific type are redundant for their calculations. They use their cleansing tool to drop those records and prevent downstream problems. They also want to exclude records from known, high-usage customers.
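A minimal pandas sketch of these cleansing steps follows. The imputation rule, record type code and list of high-usage customers are placeholders invented for illustration; the article specifies none of them.

```python
import pandas as pd

cdrs = pd.read_csv("cdrs.csv", dtype={"caller": str})

# Fill missing call durations with the median for that region rather than
# dropping the records, since the records themselves are still needed.
cdrs["duration_sec"] = cdrs.groupby("region")["duration_sec"].transform(
    lambda s: s.fillna(s.median())
)

# Drop the record type that is redundant for this analysis (hypothetical code).
cdrs = cdrs[cdrs["record_type"] != "INTERIM"]

# Exclude known, high-usage customers from the fraud analysis (placeholder list).
known_heavy_users = {"5125550100", "5125550101"}
cdrs = cdrs[~cdrs["caller"].isin(known_heavy_users)]

cdrs.to_csv("cdrs_clean.csv", index=False)
```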
Step No. 3: Iteratively develop a working model
Now the operations team is ready to apply a data mining technique to their data. They decide to cluster the data using a set of attributes they deemed important during the discovery phase. A visual, workflow-oriented tool is extremely useful here, as this phase of the project is very iterative. They also need a tool that can handle large amounts of data with ease. Since they are looking for fraud, sampling is not a viable option: fraudulent records are rare, and a sample could easily discard the very outliers they are trying to find.
The operations team employs a modeling tool that can ingest the cleansed data, and they use it to build a clustering model. They apply the model to a set of test data and visualize the results. The clusters look interesting, but no telling patterns appear. They adjust the algorithm and try different attributes, applying normalization where it helps. After some iteration, they settle on a model that appears to work. The visualization shows outliers that, on closer inspection, appear to be fraud.
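A sketch of this modeling loop is shown below, using scikit-learn's k-means with standardization and flagging callers that sit far from every cluster centroid. The choice of k-means, the feature set and the number of clusters are assumptions for illustration; the article does not say which clustering algorithm or attributes the team settled on.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cdrs = pd.read_csv("cdrs_clean.csv", dtype={"caller": str})

# Per-caller features; the attribute choices here are illustrative only.
features = cdrs.groupby("caller").agg(
    calls=("caller", "size"),
    total_minutes=("duration_sec", lambda s: s.sum() / 60),
    regions=("region", "nunique"),
)

# Normalize so no single attribute dominates the distance calculation.
scaler = StandardScaler()
X = scaler.fit_transform(features)

# Cluster, then measure each caller's distance to its nearest centroid;
# callers far from every centroid are candidate outliers worth inspecting.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
features["outlier_score"] = np.min(kmeans.transform(X), axis=1)
print(features.sort_values("outlier_score", ascending=False).head(10))
```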
Step No. 4: Embed the model into the production application
Now that they have a model that appears to work, it's time to integrate the application of the model with their production software. The operations team uses the same tool from the cleansing step to run the cleansing and model-application functions as a single workflow embedded within their application. The tool can ingest large amounts of data and apply the model quickly and efficiently.
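The sketch below shows one way such an embedded cleanse-and-score function might look, assuming the scaler and model from the modeling phase were persisted with joblib. The function name, file names and column expectations are hypothetical; the article does not describe the team's actual integration code.

```python
import joblib
import pandas as pd

# Load the scaler and clustering model assumed to have been saved with
# joblib.dump() at the end of the modeling phase.
scaler = joblib.load("scaler.joblib")
kmeans = joblib.load("kmeans.joblib")

def score_batch(caller_features: pd.DataFrame) -> pd.DataFrame:
    """Cleanse and score one batch of per-caller features in a single pass,
    mirroring the embedded cleanse-then-apply workflow described above."""
    # The batch is assumed to carry the same feature columns used in training.
    filled = caller_features.fillna(caller_features.median(numeric_only=True))
    scores = kmeans.transform(scaler.transform(filled)).min(axis=1)
    result = filled.copy()
    result["outlier_score"] = scores
    return result
```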
Step No. 5: Update and refresh
Over time, the billing team provides feedback to the operations team. They'd like a prioritized list of cases to investigate, based on a scoring system, and more historical information to determine how long the fraud has been occurring. The operations team once again leverages the same set of tools that can take them from data discovery to production deployment.
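A short sketch of that kind of prioritization is given below: rank scored callers and report how long each has been active. The scored file, its columns and the cutoff of 50 cases are invented for illustration.

```python
import pandas as pd

# Hypothetical scored output from the embedded workflow, one row per caller,
# with an outlier_score and the date the caller was first flagged.
scored = pd.read_csv("scored_callers.csv", parse_dates=["first_seen"])

# Rank cases for the billing team: highest score first, plus a rough measure
# of how long the suspicious activity has been going on.
cases = scored.sort_values("outlier_score", ascending=False).head(50).copy()
cases["days_active"] = (pd.Timestamp.today() - cases["first_seen"]).dt.days
print(cases[["caller", "outlier_score", "days_active"]])
```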
Summary
The motivators for embedding analytics are rapidly becoming clear across the industry: streamlining business processes, enabling operational decisions to be made quickly once critical information arrives, and substantially improving business performance. With the tools now available, this has never been easier to do than it is today.
Adopting a practical, best-practices approach of understanding the data, cleansing the data, iteratively developing a working model, embedding the model into the production application, and updating and refreshing information back to operations teams will help drive success through the benefits of analytics.
Jim Falgout has 20+ years of large-scale software development experience and is active in the Java development community. As Chief Technologist for Pervasive DataRush, Jim’s responsible for setting innovative design principles that guide Pervasive’s engineering teams as they develop new releases and products for partners and customers.
Prior to Pervasive, Jim held senior positions with NexQL, Voyence, Net Perceptions/KD1, Convex Computer, Sequel Systems and E-Systems. An officer with Toastmasters, Jim is an experienced public speaker. He has delivered presentations to user groups, client conferences and IEEE working groups. He is a popular analyst and go-to thought leader. Jim has a Bachelor's degree (cum laude) in Computer Science from Nicholls State University. He can be reached at jim.falgout@pervasive.com.