How AI Data Actually Moves from Collection to Algorithm

eWEEK DATA POINTS: Though excitement about AI and ML is legitimately growing, we hear little about how the data actually goes from collection to algorithm. By examining the process behind building hypothetical machine learning models, we can look at what important processes are often glossed over in articles extolling the virtues of AI.


It seems as if we hear more each day about the high potential of artificial intelligence (AI) and the techniques, such as machine learning (ML), used to achieve it. As AI grows in prominence, stories of current use cases and potential future ones will only become more common.


In this eWEEK Data Points article, Kiran Vajapey, human-computer interaction developer at Figure Eight, offers five key insights about this data journey and how it works. Figure Eight Inc. has developed a human-in-the-loop AI software platform that trains, tests and tunes machine learning models for data science and machine learning teams. It supports text, image, audio and video data types.

Data Point No. 1: Annotation

If, for example, we take Google image searches of “city streets” and feed those into our autonomous car algorithm, the results it produces probably won’t be actionable. Instead, we’d need to have human annotators use tools to create bounding boxes or label the data before sending it through the model. Humans will need to put boxes around and label every curb, fire hydrant, telephone pole, and human being, among other items, in each photo presented to the model.
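As a sketch of what those human annotations look like once captured, a bounding-box label is often stored as a simple structured record. The field names below are illustrative (loosely modeled on common formats such as COCO), not a specific tool's schema:

```python
# One annotated street image: each box is [x_min, y_min, width, height]
# in pixels, paired with a class label assigned by a human annotator.
# Field names here are illustrative, loosely modeled on the COCO format.
annotation = {
    "image": "city_street_0001.jpg",
    "objects": [
        {"label": "fire_hydrant",   "bbox": [412, 300, 35, 60]},
        {"label": "pedestrian",     "bbox": [150, 210, 60, 180]},
        {"label": "telephone_pole", "bbox": [700, 40, 20, 400]},
    ],
}

# A simple sanity check an annotation tool might run: every box must
# have positive width and height and a non-empty label.
def is_valid(obj):
    x, y, w, h = obj["bbox"]
    return w > 0 and h > 0 and bool(obj["label"])

assert all(is_valid(o) for o in annotation["objects"])
```

Checks like `is_valid` are one small way annotation tooling guards data quality before anything reaches the model.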

To build an autonomous car model, an organization will likely want to go further than bounding boxes and labeled items in a photo. In this case, organizations can turn to what’s known as semantic segmentation, whereby every single pixel in an image receives a label. When the model’s results are doing something as important as directing a self-driving car, it’s crucial that the AI is as knowledgeable as possible about its surroundings.
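Semantic segmentation output is naturally represented as a label mask with the same shape as the image, holding one class id per pixel. A minimal numpy sketch (the class ids and the tiny 4x6 "image" are made up for illustration):

```python
import numpy as np

# Illustrative class ids: 0 = road, 1 = curb, 2 = pedestrian.
CLASSES = {0: "road", 1: "curb", 2: "pedestrian"}

# A tiny 4x6 "image": the mask assigns every pixel exactly one class id,
# which is what distinguishes segmentation from box-level annotation.
mask = np.zeros((4, 6), dtype=np.uint8)   # everything starts as road
mask[0, :] = 1                            # top row labeled as curb
mask[2:4, 4:6] = 2                        # a pedestrian region

# Every pixel is covered -- there are no unlabeled gaps.
assert mask.size == 24
assert set(np.unique(mask)) <= set(CLASSES)
```

Because every pixel carries a label, there is no "background" the model is allowed to ignore, which is the point when the output steers a car.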

The annotation process is especially crucial for ensuring data quality and accuracy. To get there, make sure the tools you use to annotate data adequately incorporate human intelligence into the process. Even before labeling data, organizations will want to consider how they collect that data in the first place.

Data Point No. 2: Data Augmentation

If the perfect data set for your algorithm doesn’t exist, you can typically perform data augmentation to enhance the dataset you do have. Consider a model for a speech-recognition system (such as Alexa or Siri). If you collect crisp sound bites from a recording studio, the algorithm may run into problems in the real world. Because the model is trained to recognize the clean sounds of a sterile environment, it may struggle when presented with voice controls littered with ambient noises or static. Luckily, to make the data more realistic, you can simulate noise in the background of the clean data via augmentation methods.
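As a sketch of that idea, mixing noise into a clean recording can be as simple as adding scaled random noise at a chosen signal-to-noise ratio. This is pure numpy and uses white noise for illustration; a real augmentation pipeline would typically mix in recorded ambient noise instead:

```python
import numpy as np

def add_noise(clean, snr_db, rng=None):
    """Mix white noise into a clean signal at the given SNR (in dB)."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

# A clean one-second 440 Hz tone at 16 kHz, then an augmented copy
# with background noise at 10 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clean, snr_db=10)

assert noisy.shape == clean.shape
```

Running the same clip through several SNR levels yields multiple "noisy rooms" worth of training data from a single studio recording.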

Data Point No. 3: Transfer Learning

If you are attempting to build an ML algorithm for a commercial application, there is a good chance that the exact data set for your use case doesn’t exist. Consider a model to detect cancer in x-ray images. There likely isn’t a lot of publicly available data—x-ray images from cancer patients—for your use case. Transfer learning allows you to leverage existing models. In this instance, you may be able to use an available model that has learned rules about pixel-level edge detection and general image component identification from a previous data set.

Rather than pre-training your model with millions of images, you can instead remove layers of this existing model until you have an appropriate starting point. Then, you can feed your specific data set into an algorithm that is already trained to identify certain pixels in images. As you work through your specific data set, you can retrain the model to better understand the nuances of x-ray images. In the process of retraining the existing algorithm with your data, you’ll develop a neural net suited to your use case.
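The mechanics can be illustrated with a toy numpy sketch; this is not a real pretrained network, and every name in it is invented for illustration. A frozen "pretrained" feature map stands in for the borrowed lower layers, and only a small new head is trained on the scarce task-specific data. We deliberately give the frozen extractor features that happen to be useful, because that is the premise of transfer learning:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a pretrained model's lower layers: a frozen feature map.
# Columns 0 and 1 pick out input directions that matter for the new
# task -- the transfer-learning premise that borrowed features are
# already relevant. In practice this would be, say, the convolutional
# base of an existing image model with its top layers removed.
W_pretrained = np.zeros((10, 4))
W_pretrained[0, 0] = 1.0
W_pretrained[1, 1] = 1.0
W_pretrained[:, 2:] = rng.normal(size=(10, 2))

def features(x):
    """Frozen 'pretrained' layers: never updated during retraining."""
    return np.tanh(x @ W_pretrained)

# Small task-specific data set (stand-in for the scarce x-ray images).
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Retrain only the new head: logistic regression on the frozen features.
w, b = np.zeros(4), 0.0
F = features(X)
for _ in range(500):
    p = 1 / (1 + np.exp(-(F @ w + b)))
    w -= 0.5 * (F.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean(((F @ w + b) > 0) == y)
```

The frozen weights never change during the loop; all of the learning budget goes into the small head, which is why this approach needs far fewer task-specific examples than training from scratch.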

Data Point No. 4: Iterations

While this may sound counter-intuitive, it's easy for a team to collect too much data. When training a model, the soundest approach is usually to work iteratively. If you happen to have 1,000 images of x-ray data, use those images first. Once you train the model, you'll gain a better understanding of whether that model works. Let's say your goal is 85 percent accuracy. If those 1,000 images get you to 85 percent accuracy, you don't need to collect more. If they only yield a model with 67 percent accuracy, you will have to invest in finding more images for your data set.
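That stop-when-good-enough loop can be sketched in a few lines. Here `train_and_evaluate` and `collect_more` are hypothetical placeholders for whatever training pipeline and data source you actually have:

```python
# Iterative collection sketch: train on what you have, measure accuracy,
# and only pay for more data if the target has not been reached.
# `train_and_evaluate` and `collect_more` are hypothetical stand-ins.
def iterate_until_target(data, train_and_evaluate, collect_more,
                         target=0.85, max_rounds=5):
    for round_num in range(1, max_rounds + 1):
        accuracy = train_and_evaluate(data)
        if accuracy >= target:
            return accuracy, round_num   # good enough -- stop collecting
        data = data + collect_more()     # otherwise invest in more data
    return accuracy, max_rounds

# Toy run: accuracy improves as the data set grows past 1,000 items.
fake_eval = lambda data: min(0.99, 0.5 + len(data) / 4000)
fake_collect = lambda: [0] * 500
acc, rounds = iterate_until_target([0] * 1000, fake_eval, fake_collect)
```

In the toy run, 1,000 items miss the 85 percent target, one extra collection round clears it, and the loop stops before paying for data it doesn't need.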

Even when you do have access to a larger data set to begin with, working iteratively is likely the most efficient option for creating a model. Consider data that needs labels and bounding boxes. You can use existing labeled data to train a model that labels additional pieces of data on its own. As you run labeled data through the model, it will build up your neural net and eventually improve the confidence of your algorithm.

The model may label one image with 20 percent confidence and another with 80 percent confidence. By sending the images that fall below a certain confidence threshold to humans for labeling, you incorporate human intelligence into the process. This yields a ground truth for exactly the data the model is uncertain about. Once humans annotate those select data points, you can retrain the model with the correctly labeled data.
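The routing step can be sketched as a simple split on the model's confidence scores (the file names, labels and scores below are invented for illustration):

```python
# Split model predictions by confidence: high-confidence labels are
# kept, low-confidence items are routed to human annotators for ground
# truth. Each prediction pairs an item with a label and a confidence.
def route_by_confidence(predictions, threshold=0.5):
    auto_labeled, needs_human = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item, label))
        else:
            needs_human.append(item)     # send to human annotators
    return auto_labeled, needs_human

predictions = [
    ("img_001.jpg", "pedestrian", 0.80),  # confident -- keep the label
    ("img_002.jpg", "curb",       0.20),  # uncertain -- ask a human
    ("img_003.jpg", "hydrant",    0.55),
]
kept, to_human = route_by_confidence(predictions, threshold=0.5)
```

Only the uncertain items cost human effort, which is what keeps this human-in-the-loop approach affordable as the data set grows.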

Data Point No. 5: Using These Tools Improves Algorithms Without Exploding Costs

The main data challenge companies run into is that they aren't sure of the best way to use their data. We once worked with a company that tried to predict stock prices. When trying to predict Apple's stock price, for example, we gathered all kinds of sentiment data about Apple. Eventually, we learned that we needed to incorporate data points that categorized entities other than Apple for a more accurate prediction. We realized that collecting different types of data yielded a more stable, long-term projection algorithm.

Companies must first set a goal to understand what they're trying to build with their data. Had we set that goal for ourselves ahead of time, we might have created a more accurate model from the get-go. By creating a goal, you will have a frame of reference as you develop strategies and build your AI initiatives.

The specifics of your data and the problem you're trying to solve will change over time. But if you have a state you'd like to achieve, you can develop your tools and algorithms to reach that specific point. By leaning on these four tools while building your models, your projects are more likely to end up efficient, accurate and cost-effective.


Chris J. Preimesberger

Chris J. Preimesberger is Editor-in-Chief of eWEEK and responsible for all the publication's coverage. In his 15 years and more than 4,000 articles at eWEEK, he has distinguished himself in reporting...