Handling Truth by Embedding ML into Databases

eWEEK DATA POINTS RESOURCE PAGE: When machine learning is embedded into the database, organizations can improve their ability to curate data through the automation of quality checks, harmonization, mastering and enrichment, among many other—often complex—tasks.


Machine learning helps organizations handle the truth by enabling them to create an accurate data model from among huge stores of data and then build processes that are constantly improving through input from the community.

This is easier said than done, however, because today’s ML tools ecosystem is incredibly complex, and organizations often don’t have people with the skills to navigate it. In addition, organizations have a difficult time trusting the so-called black box output of many ML models. 

This eWEEK Data Points article, using industry information from MarkLogic Senior Product Manager Anthony Roach, offers eight things you need to know about the current state of ML and how embedding ML into the database platform enables organizations to innovate quickly and intelligently, based on trusted data and trusted analysis of that data.

Data Point No. 1: Data analysis is difficult, to put it mildly

For several years, all we heard about was big data—how amassing and analyzing huge amounts of data would lead to amazing insights that would enable the business to make smarter decisions. What companies found, however, was that amassing the data was easy; intelligently storing and analyzing it—with any level of trustworthiness—was really hard. 

Data Point No. 2: Machine learning can help

As noted above, organizations have found it relatively easy to collect data, and lots of it. The trick is to find patterns in that data, which is what machine learning is designed to do. Data is often so voluminous and complex that people struggle to detect the underlying patterns and relationships they could convert into rules for a system. Modern machine learning tools are built for exactly this kind of pattern detection. And, since models change their behavior with experience, these tools can improve over time, putting the “learning” in machine learning.

Data Point No. 3: Machine learning is only as good as the data

With all that said, it’s not just the amount of data that’s important; it’s also the quality of data and its fitness for its intended purpose. Good data is critical because machine learning is especially sensitive to data quality, or the lack thereof. Think about it: You’re using the same data both to train the model and then to execute it, so any problems with data quality will get amplified. And, as we know, the features that describe even a single entity may be scattered across multiple systems in your organization. If an ML tool cannot detect a pattern because some of the data at the root of the pattern is hidden away in a siloed system, then the value of the ML output is significantly reduced.
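To make the silo problem concrete, here is a minimal Python sketch. The two systems, the customer IDs and the feature names are all hypothetical, invented purely for illustration. It combines the features held in separate systems and reports which entities would still reach a model with an incomplete picture.

```python
# Hypothetical records for the same customers, held in two separate systems.
crm = {"c1": {"age": 34}, "c2": {"age": 51}, "c3": {"age": 28}}
billing = {"c1": {"late_payments": 0}, "c3": {"late_payments": 4}}


def merge_silos(*silos):
    """Combine per-entity feature dicts from several systems into one view."""
    merged = {}
    for silo in silos:
        for entity_id, features in silo.items():
            # setdefault creates a fresh dict, so the source silos are not mutated.
            merged.setdefault(entity_id, {}).update(features)
    return merged


def incomplete_entities(merged, required):
    """Return the ids of entities missing at least one required feature."""
    return sorted(
        entity_id
        for entity_id, features in merged.items()
        if not required.issubset(features)
    )


merged = merge_silos(crm, billing)
missing = incomplete_entities(merged, {"age", "late_payments"})
print(missing)
```

In this toy data, customer c2 has no billing record, so any pattern that depends on payment history cannot be learned for that customer until the second system is brought into the same view.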

Data Point No. 4: Return on machine learning investment can be low

In fact, while there is a lot of hype about machine learning right now, the truth is that investment in AI and machine learning often yields very low ROI. There are a number of reasons for this, but one of the biggest is that organizations often don’t trust the “black box” outputs of machine learning models. Of course, when you don’t trust the output (even if it’s accurate), you’re not going to base a major decision on it. To make matters worse, the machine learning tools ecosystem is incredibly complex, which makes it difficult for companies to put machine learning into the all-important contexts of security and governance. It’s also tough to find people with the right skill sets to build and maintain the systems.

Data Point No. 5: To be most effective, bring your algorithms to the data, not the other way around

You’ve often heard that security should be built in and not bolted on. Well, the same can be said for machine learning. The efficacy of ML increases when it is close to the data itself, preferably in a data hub model, where data can be secured, governed and curated. With this trusted approach to getting at data’s truth, organizations can solve many of the challenges they face with governance and trust, while more efficiently and confidently unlocking the benefits that machine learning promises.

Data Point No. 6: Embedded machine learning can improve how the database operates

Embedded machine learning improves the quality of data and the truthfulness of data models, but it also helps the database itself operate more effectively. This application of ML technology is still evolving, but by monitoring workload patterns and access plans, performance can be improved through automatic re-tuning of the system. The database can also be run more efficiently by using machine learning to develop and leverage models of infrastructure workload patterns to enable, for example, automatic adjustment of the rules that govern data and index rebalancing.
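As a rough illustration of workload monitoring, the Python sketch below tracks an exponential moving average of query latency and flags when it drifts far enough from a baseline that re-tuning might be worth considering. This is a simple statistical stand-in, not a trained model and not how any particular database does it; the class name, the parameters and the latency figures are all assumptions made for the example.

```python
class WorkloadMonitor:
    """Tracks an exponential moving average of query latency (hypothetical)."""

    def __init__(self, baseline_ms, alpha=0.5, tolerance=1.5):
        self.baseline_ms = baseline_ms  # latency considered normal
        self.alpha = alpha              # weight given to the newest observation
        self.tolerance = tolerance      # drift multiple that triggers a signal
        self.ema_ms = baseline_ms

    def observe(self, latency_ms):
        """Fold in one observation; return True if re-tuning is suggested."""
        self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms
        return self.ema_ms > self.tolerance * self.baseline_ms


monitor = WorkloadMonitor(baseline_ms=10.0)
# Made-up latencies: two normal queries followed by a sustained slowdown.
signals = [monitor.observe(ms) for ms in [10.0, 12.0, 30.0, 40.0]]
print(signals)
```

A production system would presumably learn the baseline and the tolerance from observed workload history rather than hard-coding them, which is where the machine learning comes in.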

Data Point No. 7: Embedded machine learning improves organizations’ data curation abilities

When machine learning is embedded into the database, organizations can improve their ability to curate data through the automation of quality checks, harmonization, mastering and enrichment, among many other, often complex, tasks. Organizations may even be able to augment existing rules-based mastering processes to improve accuracy and manage exceptions. Further, machine learning can be used during the modeling phase to identify whether particular data may include, for example, personally identifiable information, and to improve the matching algorithms so that they are more accurate and require less human intervention. Machine learning can also assist with the classification of attributes and suggest mapping and modeling rules. Importantly, these models are continuously retrained, so they become smarter over time.
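The Python sketch below shows the kind of rules-based baseline that machine learning would augment: pattern-based flagging of possible personally identifiable information, and a similarity threshold for deciding whether two records refer to the same entity. It uses only the standard library rather than a trained model. The patterns, the function names and the 0.85 threshold are assumptions chosen for illustration, and real PII detection needs far broader coverage.

```python
import re
from difflib import SequenceMatcher

# Simplified, assumed patterns; not an exhaustive definition of PII.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(value):
    """Return the sorted names of the PII patterns found in a string value."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(value)
    )


def likely_same_entity(a, b, threshold=0.85):
    """Rule-style fuzzy match on the similarity ratio of normalized names."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


print(detect_pii("contact: jane@example.com"))
print(likely_same_entity("Acme Corp", "ACME Corp "))
```

Fixed rules like these are easy to audit but brittle. A learned matcher could instead adjust the threshold and the features it weighs as reviewers confirm or reject proposed matches, which is one way to read the article's point about reducing human intervention.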

Data Point No. 8: Embedded machine learning can improve the efficiency of data scientists

Data scientists spend an inordinate amount of time wrangling data. According to an article in The New York Times, data scientists spend 80% of their time assembling large training sets of data. When machine learning is embedded into the database platform, much of that work is done for them, freeing data scientists to do the work of training and executing models.

Machine learning and artificial intelligence are no longer considered science fiction. The technologies are now being used every day, in a wide variety of use cases. However, that does not mean that machine learning is being fully exploited, or that it is accessible to organizations of all types and sizes. When machine learning is embedded at the database level, much of its complexity is mitigated, enabling companies to get to the truth of their data and fully reap the rewards of ML.

Chris Preimesberger

Chris J. Preimesberger is Editor-in-Chief of eWEEK and responsible for all the publication's coverage. In his 15 years and more than 4,000 articles at eWEEK, he has distinguished himself in reporting...