The CIO’s Guide to Building a Rockstar Data Science and AI Team

Jun 15, 2021

Just about everyone agrees that data scientists and AI developers are the new superstars of the tech industry. But ask a group of CIOs to define the precise area of expertise for data science-related job titles, and discord becomes the word of the day.

As businesses seek actionable insights by hiring teams that include data analysts, data engineers, data scientists, machine learning engineers and deep learning engineers, a key to success is understanding what each role can — and can’t — do for the business.

Read on to learn what your data science and AI experts can be expected to contribute as companies grapple with ever-increasing amounts of data that must be mined to create new paths to innovation.

The Ideal vs. The Real World
The Data Analyst
The Data Engineer
The Data Scientist
The Machine Learning Engineer
The Deep Learning Engineer
How to Match Skills with Strategy

The Ideal vs. The Real World

In a perfect world, every company employee and executive works under a well-defined set of duties and responsibilities.

Data science isn’t that world. Companies often will structure their data science organization based on project need: Is the main problem maintaining good data hygiene? Or is there a need to work with data in a relational model? Perhaps the team requires someone to be an expert in deep learning, and to understand infrastructure as well as data?

Depending on a company’s size and budget, any one job title might be expected to own one or more of these problem-solving skills. Of course, roles and responsibilities will change with time, just as they’ve done as the era of big data evolves into the age of AI.

That said, it’s good for a CIO, and the data science team she or he manages, to remove as much ambiguity as possible about the responsibilities of the most common roles: data analyst, data engineer, data scientist, machine learning engineer and deep learning engineer.

Teams that have the best understanding of how each role fits into the company’s goals are best positioned to deliver a successful outcome. No matter the role, accelerated computing infrastructure is also key to powering success throughout the pipeline as data moves from analytics to advanced AI.

The Data Analyst

It’s important to recognize the work of a data analyst, as these experts have been helping companies extract information from their data long before the emergence of the modern data science and AI pipeline.

Data analysts use standard business intelligence tools like Microsoft Power BI, Tableau, Qlik and Yellowfin, along with Spark, SQL and other data analytics applications. Broad-scale data analytics can involve the integration of many different data sources, which increases the complexity of the work of both data engineers and data scientists. It is another example of how the work of these various specialists tends to overlap and complement each other.

Data analysts still play an important role in the business, as their work helps the business assess its success. A data engineer might also support a data analyst who needs to evaluate data from different sources.

Data scientists take things a step further so that companies can start to capitalize on new opportunities with recommender systems, conversational AI, and computer vision, to name a few examples.

The Data Engineer

A data engineer makes sense of messy data, and there’s usually a lot of it. People in this role tend to be junior teammates who make data as nice and neat as possible for data scientists to use. This role involves a lot of data prep and data hygiene work, including lots of ETL (extract, transform, load) to ingest and clean data.
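To make the ETL idea concrete, here is a minimal sketch using only Python’s standard library. The sample data, the `orders` table and the cleaning rules are all invented for illustration; a real pipeline would pull from production sources and would likely run on tools such as Spark.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, inconsistent case, a missing amount.
RAW_CSV = """customer,amount
 Alice ,10.50
bob,
Carol,7
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, convert amounts, drop incomplete records."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        amount = row["amount"].strip()
        if not name or not amount:
            continue  # assumption: incomplete rows are discarded, not repaired
        cleaned.append((name, float(amount)))
    return cleaned

def load(rows, connection):
    """Load: write the cleaned rows into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute(
    "SELECT customer, amount FROM orders ORDER BY customer"
).fetchall()
print(rows)
```

Whether an incomplete record should be dropped, flagged or repaired is a business decision; the sketch simply drops it to keep the three stages easy to see.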

The data engineer must be good with data jigsaw puzzles. Formats change, standards change, even the fields a team is using on a webpage can change frequently. Datasets can have transmission errors, such as when data from one field is incorrectly entered into another.

When datasets need to be joined together, data engineers need to fix the data hygiene problems that occur when labeling is inconsistent. For example, if the day of the week is included in the source data, the data engineer needs to make sure that the same format is used to indicate the day, as Monday could also be written as Mon., or even represented by a number that could be one or zero depending on how the days of the week are counted.
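The day-of-the-week cleanup described above might look something like the following sketch. The function name `normalize_day` and the `first_day_number` parameter are hypothetical, and the code assumes that numeric encodings start the week on Monday, which would need to be confirmed for each data source.

```python
DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

def normalize_day(value, first_day_number=0):
    """Map 'Mon.', 'monday', 0 or 1 (and so on) to a full day name.

    first_day_number says whether the source counts Monday as 0 or as 1.
    Assumption: numeric sources start the week on Monday.
    """
    text = str(value).strip().rstrip(".").lower()
    if text.isdigit():
        index = int(text) - first_day_number
        if 0 <= index < len(DAY_NAMES):
            return DAY_NAMES[index]
        raise ValueError(f"day number out of range: {value!r}")
    # Accept any abbreviation of three or more letters, such as 'mon' or 'thurs'.
    if len(text) >= 3:
        for name in DAY_NAMES:
            if name.lower().startswith(text):
                return name
    raise ValueError(f"unrecognized day value: {value!r}")

print(normalize_day("Mon."))                 # abbreviated with a period
print(normalize_day(0))                      # zero-based source
print(normalize_day(1, first_day_number=1))  # one-based source
```

Raising an error on unrecognized values, rather than guessing, keeps bad records visible so they can be fixed at the source.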

Expect your data engineers to be able to work freely with scripting languages like Python, and in SQL and Spark. They’ll need programming language skills to find problems and clean them up. Given that they’ll be working with raw data, their work is important to ensuring your pipeline is robust.

If enterprises are pulling data from their data lake for AI training, the rule-based cleanup described above can be done by a data engineer. More extensive feature engineering is the work of a data scientist. Depending on their experience and the project, some data engineers may support data scientists with initial data visualization graphs and charts.

Depending on how strict your company has been with data management, or if you work with data from a variety of partners, you might need a number of data engineers on the team. At many companies, the work of a data engineer often ends up being done by a data scientist, who preps her or his own data before putting it to work.

The Data Scientist

Data scientists experiment with data to find the secrets hidden inside. It’s a broad field of expertise that can include the work of data analytics and data processing, but the core work of a data scientist is done by applying predictive techniques to data using statistical machine learning or deep learning.

For years, the IT industry has talked about big data and data lakes. Data scientists are people who finally turn these oceans of raw data into information. These experts use a broad range of tools to conduct analytics, experiment, build and test models to find patterns. To be great at their work, data scientists also need to understand the needs of the business they’re supporting.

These experts use many tools and libraries, including NumPy, scikit-learn, RAPIDS, CUDA, SciPy, Matplotlib, pandas, Plotly, NetworkX, XGBoost, domain-specific libraries and many more. They need to have expertise in statistical machine learning, random forests, gradient boosting, packages, feature engineering, training, model evaluation and refinement, data normalization and cross-validation. The depth and breadth of these skills make it readily apparent why these experts are so highly valued at today’s data-driven companies.

Data scientists often solve mysteries to get to the deeper truth. Their work involves finding the simplest explanations for complex phenomena and building models that are simple enough to be flexible yet faithful enough to provide useful insight. They must also avoid some perils of model training, including overfitting their data sets (that is, producing models that do not effectively generalize from example data) and accidentally encoding hidden biases into their models.
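One common way to spot overfitting is cross-validation: score the model on data it did not see during training and compare that with its score on the training data. The sketch below is a toy illustration in plain Python, not a recommended workflow. The dataset, the deliberately noisy labels and the one-nearest-neighbor "model" are all invented for the example; in practice a data scientist would more likely reach for a library such as scikit-learn.

```python
def predict_1nn(train, x):
    """A memorizing model: return the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the given label."""
    hits = sum(1 for x, y in test if predict_1nn(train, x) == y)
    return hits / len(test)

def k_fold_accuracy(data, k=5):
    """Average held-out accuracy over k folds (fold = index modulo k)."""
    scores = []
    for fold in range(k):
        test = [p for i, p in enumerate(data) if i % k == fold]
        train = [p for i, p in enumerate(data) if i % k != fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k

# Toy data: the label is 1 when x >= 10, except for three deliberately
# mislabeled points that stand in for noise in real data.
noisy = {3, 8, 15}
data = [(x, int(x >= 10) ^ int(x in noisy)) for x in range(20)]

train_score = accuracy(data, data)  # scored on the data it memorized
cv_score = k_fold_accuracy(data)    # scored on held-out data
print(f"training accuracy: {train_score:.2f}")
print(f"cross-validated accuracy: {cv_score:.2f}")
```

The model looks perfect on its own training data because it has memorized the noise; the lower cross-validated score is the more honest estimate of how it would generalize.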

The Machine Learning Engineer

A machine learning engineer is the jack of all trades. This expert architects the entire process of machine and deep learning. They take AI models developed by data scientists and deep learning engineers and move them into production.

These unicorns are among the most sought-after and highly paid in the industry — and companies work hard to make sure they don’t get poached. One way to keep them happy is to provide the right accelerated computing resources to help fuel their best work. A machine learning engineer has to understand the end-to-end pipeline, and they want to ensure that pipeline is optimized to deliver great results, fast.

The job isn’t always intuitive: machine learning engineers must know the apps, understand the downstream data architecture, and key in on system issues that may arise as projects scale. A person in this role must understand all the applications used in the AI pipeline, and usually needs to be skilled in infrastructure optimization, cloud computing, containers, databases and more.

To stay current, AI models need to be reevaluated regularly to catch what’s called model drift, where predictions lose accuracy as new data diverges from the data the model was trained on. For this reason, machine learning engineers need to work closely with their data science and deep learning colleagues, who will need to reassess models to maintain their accuracy.
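A very simple drift check compares a model’s accuracy on recent, newly labeled data with the accuracy it had when it was deployed. The sketch below assumes such labels are available; the function names and the five-point tolerance are illustrative choices, not an industry standard, and production monitoring usually also tracks shifts in the input data itself.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def has_drifted(baseline_accuracy, recent_predictions, recent_labels, tolerance=0.05):
    """Flag drift when recent accuracy falls more than `tolerance` below baseline.

    Returns a (drifted, recent_accuracy) pair so the caller can log the number.
    """
    recent_accuracy = accuracy(recent_predictions, recent_labels)
    drifted = (baseline_accuracy - recent_accuracy) > tolerance
    return drifted, recent_accuracy

# Hypothetical numbers: the model scored 0.90 at deployment time.
drifted, recent = has_drifted(0.90, [1, 0, 1, 1], [1, 0, 0, 0])
print(drifted, recent)
```

A flagged result is a prompt for the data science team to reassess or retrain the model, not an automatic fix.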

The Deep Learning Engineer

A critical specialization within machine learning engineering is the deep learning engineer. This person is a data scientist who is an expert in deep learning techniques. In deep learning, AI models learn and improve their own results through neural networks, layered structures loosely modeled on how the human brain processes information.

These computer scientists specialize in advanced AI workloads. Their work is part science and part art to develop what happens in the black box of deep learning models. They do less feature engineering and far more math and experimentation. The push for explainable AI (XAI), with its demands for model interpretability, can be especially challenging in this domain.

Deep learning engineers need to process large datasets to train their models before those models can be used for inference, where a trained model applies what it has learned to evaluate new information. They use libraries like PyTorch, TensorFlow and MXNet, need to be able to build neural networks, and need strong skills in statistics, calculus and linear algebra.
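The split between training and inference can be shown with a drastically simplified stand-in: a single artificial neuron trained by gradient descent in plain Python. This is not a deep network, and real work would use a library such as PyTorch or TensorFlow on far larger datasets; the data, learning rate and epoch count here are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=2000, learning_rate=0.5):
    """Training: adjust weight and bias to reduce error on labeled examples."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            error = sigmoid(weight * x + bias) - y  # prediction minus label
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(samples)
        bias -= learning_rate * grad_b / len(samples)
    return weight, bias

def infer(model, x):
    """Inference: apply the learned parameters to a new, unseen input."""
    weight, bias = model
    return sigmoid(weight * x + bias)

# Toy labeled data: negative inputs belong to class 0, positive to class 1.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
model = train(samples)

print(infer(model, 3.0))   # an input the model never saw during training
print(infer(model, -3.0))
```

The calculus shows up in the gradient computation and the linear algebra in the weighted sum; a deep network repeats the same ideas across many layers and millions of parameters.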

How to Match Skills with Strategy

Given all of the broad expertise in these key roles, it’s clear that enterprises need a strategy to help them grow their team’s success in data science and AI. Many new applications need to be supported, with the right resources in place to help this work get done as quickly as possible to solve business challenges.

Those new to data science and AI often choose to get started with accelerated computing in the cloud, and then move to a hybrid solution to balance the need for speed with operational costs. In-house teams tend to look like an inverted pyramid, with more analysts and data engineers funneling data into actionable tasks for data scientists, up to the machine learning and deep learning engineers.

Your IT paradigm will depend on your industry and its governance, but a great rule of thumb is to ensure your vendors and the skills of your team are well aligned. With a better understanding of the roles of a modern data team, and the resources they need to be successful, you’ll be well on your way to building an organization that can transform data into business value.

ABOUT THE AUTHOR

By Scott McClellan, Head of Data Science, NVIDIA