
How to Build a Machine Learning Pipeline to Ensure Efficient Project Delivery

Eugene Dorfman
29 Oct 2021
8 min

A year ago we lost a potential client. A company declined our machine learning (ML) project proposal which would automate sifting through abusive content on display ads. Neither a decade of experience in delivering solutions for the AdTech industry nor a team of certified ML engineers helped my team close the deal on that occasion. We realized that our proposal must be lacking something but couldn’t figure out exactly what was missing.

Months later, we discovered our weak spot — a focus on deploying a ready-to-use trained machine learning model instead of building an infrastructure for its training, evaluation, tuning, deployment, and continuous improvement. This realization made us review our entire approach to managing ML projects and come up with a solution — a machine learning pipeline.

In this article, I’ll share what I wish we had known about the ML process back then, outline the machine learning pipeline steps, and explain why delivering a pipeline, rather than a standalone model, is the most viable way to bring business value to users in any domain — from AdTech to healthcare.


    What tasks does ML solve?

Machine learning is a branch of computer science that uses algorithms to build and train models that perform routine tasks by detecting patterns in data and learning from experience. The essence of ML is that it relies on large sets of data samples to train a computer to perform tasks that humans handle well but traditional, explicitly programmed software handles poorly.

The tasks for ML projects fall into two big categories: those based on classification algorithms and those based on regression algorithms. The first group predicts or classifies discrete values such as Safe or Unsafe, Suspicious or Unsuspicious, Spam or Not Spam. The second predicts continuous values such as price, age, or salary.

    These tasks can be grouped into smaller subcategories that lean on either of the above algorithms: object detection, image classification, anomaly detection, ranking and clustering. 


    Spotting abusive content is a classic example of a classification task, while a task to predict a stock price at a given time, for example, falls into the category of tasks related to regression algorithms. 
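To make the distinction concrete, here is a minimal sketch of both task types; scikit-learn and the synthetic datasets are our choice for illustration, not part of any specific project setup:

```python
# A minimal sketch contrasting the two task families. scikit-learn and the
# synthetic datasets are illustrative assumptions, not project specifics.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (e.g., Spam / Not Spam).
X_cls, y_cls = make_classification(n_samples=500, n_features=10, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print("Predicted class:", clf.predict(X_cls[:1]))   # e.g., [0] or [1]

# Regression: predict a continuous value (e.g., a price).
X_reg, y_reg = make_regression(n_samples=500, n_features=10, random_state=42)
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted value:", reg.predict(X_reg[:1]))   # a real number
```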

In healthcare, algorithms can be trained to classify X-rays, MRI scans, or other medical images to detect potentially malignant lesions and tumors in human organs, eventually helping with early disease diagnosis. The applications of machine learning in this domain go far beyond image classification and include automatic health report generation, smart records generation, drug discovery, patient condition tracking, and more.

    Regardless of what type of task you deal with, you’ll need to deliver a trained machine learning model to solve it. But a model alone is not enough. And here is why.

    What is wrong with the common approach to managing ML projects? 

    A common approach to managing ML projects leans on deploying ML models manually and usually follows these steps:

    • Define the problem and set the goal
    • Obtain the required data for training
    • Train, evaluate and fine-tune the models
    • Present the trained model and integrate it into the existing workflow

Below is the breakdown of the manual process of working with ML projects, as outlined by Google Cloud.

[Figure: breakdown of the manual ML workflow]

    This workflow suggests that a trained model should be the final result or a delivery artifact of a successfully completed ML project. However, this approach is flawed for several reasons.

The world is constantly evolving, and the data ML engineers process to train models changes with it. A trained model delivered as an artifact can only meet the client’s immediate needs based on the data it was given; it proves ineffective in the long run because it doesn’t automatically adjust to changes in the underlying data.

As the incoming data drifts away from the datasets the model was trained and evaluated on, the model’s accuracy degrades.
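One practical consequence is that a deployed model should be re-scored on fresh labeled data so degradation is caught early. Here is a minimal sketch of such a check, assuming a scikit-learn-style model; the threshold value and function name are illustrative:

```python
# A minimal drift check: re-score the deployed model on freshly labeled
# holdout data and flag it for retraining when accuracy falls too low.
# The threshold and function name are hypothetical, for illustration only.
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # hypothetical acceptable floor

def needs_retraining(model, X_fresh, y_fresh) -> bool:
    """Return True when accuracy on new data drops below the threshold."""
    current_accuracy = accuracy_score(y_fresh, model.predict(X_fresh))
    return current_accuracy < ACCURACY_THRESHOLD
```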

This is especially important for projects in healthcare. The higher the accuracy a model shows in classifying medical images, for example, the better its chances of rivaling medical professionals in making accurate diagnoses and reducing the risk of human error.

    Machine learning model accuracy is defined as the percentage of correct predictions for the test data and is calculated by dividing the number of correct predictions by the total number of predictions. It is usually the determining factor when evaluating the success of a machine learning project — the higher the accuracy, the better the machine learning model performs. 

    However, it’s worth noting that this is true only for balanced validation datasets — those containing an equal number of examples for each category of tested data. 

When dealing with unbalanced validation datasets, metrics such as precision, recall, and F-score come into play. Precision measures how many of the selected items are relevant, recall shows how many of the relevant items were selected, and the F-score combines the two as their harmonic mean.
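Here is a small, self-contained illustration of why accuracy alone can mislead on unbalanced data; the labels below are made up, and scikit-learn is assumed purely for convenience:

```python
# Hypothetical labels for an unbalanced binary task (1 = abusive, 0 = safe).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # only 2 of 10 samples are positive
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # model predictions

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.8 — looks good...
print("Precision:", precision_score(y_true, y_pred))  # 0.5 — half of flagged items are truly abusive
print("Recall:   ", recall_score(y_true, y_pred))     # 0.5 — half of abusive items were caught
print("F1 score: ", f1_score(y_true, y_pred))         # 0.5 — harmonic mean of precision and recall
```

Note how an accuracy of 0.8 hides the fact that only half of the truly abusive items were caught.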

    How to achieve better ML model performance

Optimizing these metrics on test data requires room for numerous experiments. Experimenting with different model architectures, preprocessing code, and the hyperparameters that define the success of the learning process essentially means re-training the model multiple times. Here is how it works:

    • First, we choose a set of hyperparameters for the experiment
    • Then, we train the model defined by these hyperparameters
• Afterward, we assess the result — if the target metrics are not optimal, we tune the model either by modifying its architecture (adding or removing layers, etc.) or by changing the hyperparameters.

    Through this process, the training dataset changes and the architecture of the model changes. On top of that, an enormous number of experiments need to be carried out to achieve an acceptable level of accuracy. Carrying out this task manually is onerous, which is why it makes sense to deliver an automated infrastructure for the process — a machine learning pipeline.
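As a rough sketch of the experiment loop described above (the synthetic dataset, MLP model, and hyperparameter grid here are illustrative assumptions, not a prescription):

```python
# A sketch of the experiment loop: try several hyperparameter combinations,
# train a model for each, and keep the best-scoring configuration.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_score, best_params = 0.0, None
for lr, layers in product([1e-2, 1e-3], [(32,), (32, 32)]):
    model = MLPClassifier(hidden_layer_sizes=layers, learning_rate_init=lr,
                          max_iter=300, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # validation accuracy for this run
    if score > best_score:              # keep the best configuration so far
        best_score, best_params = score, {"lr": lr, "layers": layers}

print(f"Best configuration: {best_params} (validation accuracy {best_score:.3f})")
```

In practice, each of these runs needs to be tracked, which is exactly what the pipeline tooling covered below automates.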

    What is a machine learning pipeline?

    A machine learning pipeline is a concept for delivering ML projects based on building a robust infrastructure that systematically trains and evaluates models, tracks experiments and deploys well-performing models.

    It relies on the idea of MLOps — a set of practices that ensure the implementation and automation of continuous integration, delivery and training for ML systems. 

[Figure: the MLOps cycle combining ML, Dev, and Ops]

    MLOps combines data extraction and analysis (ML), modeling and testing (Dev) and continuous service delivery (Ops).

However, while MLOps works for both small projects and big data, the pipeline suggested here is best suited to datasets and models that fit on a conventional hard drive (usually no more than a few hundred GB).

    Machine learning pipeline steps

    A machine learning pipeline consists of the following steps:

    1. Dataset retrieval – downloading the needed datasets from storage and bringing them together.
    2. Dataset preprocessing – aimed at transforming raw data into a usable format for further analysis.
3. Dataset splitting – splitting the data into training and validation subsets. We use the validation subset to evaluate how well the model generalizes after being trained on a training subset. This allows us to judge whether the model improves after each training cycle.
    4. Preparing for experiments – fine-tuning of the models’ architecture and modifying hyperparameters. 
    5. Running tests or training loops.
    6. Tracking experiment results – using the metrics (accuracy, precision, recall, F-scores for classification models and loss metrics, including MAE, RMSE and NRMSE for regression models).
    7. Choosing the best model – based on the metrics.
    8. Performance evaluation of the chosen model.
    9. Deployment – using the top-performing model.

    This workflow allows for continuous fine-tuning of existing models alongside constant performance evaluations. The biggest advantage of this process is that it can be automated with the help of available tools.
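To give a feel for how the steps fit together, here is a bare-bones Python skeleton of the pipeline; every function name and the "target" column are illustrative assumptions rather than a fixed API:

```python
# A bare-bones skeleton of the nine pipeline steps above. Function names and
# the "target" column are illustrative assumptions, not a fixed API.
import pandas as pd
from sklearn.model_selection import train_test_split

def retrieve_dataset(url: str) -> pd.DataFrame:
    """Step 1: download the dataset from storage."""
    return pd.read_csv(url)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: transform raw data into a usable format (minimal example)."""
    return df.dropna()

def run_pipeline(url: str, candidate_models: dict):
    df = preprocess(retrieve_dataset(url))
    X, y = df.drop(columns=["target"]), df["target"]
    # Step 3: split into training and validation subsets.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # Steps 4-6: run the experiments and track their results.
    results = {}
    for name, model in candidate_models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_val, y_val)

    # Steps 7-8: choose the best model and report its performance.
    best = max(results, key=results.get)
    print(f"Best model: {best} ({results[best]:.3f})")
    # Step 9: hand the top performer off for deployment.
    return candidate_models[best]
```

A production pipeline would also persist each artifact and log every run, which is where the tools in the next section come in.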

For med-tech startups, this means they can get an ML model for their particular project that is constantly updated as more patient data comes in, resulting in higher accuracy.


      Infrastructure and tools

      To build an infrastructure for a machine learning pipeline and automate the process, the following components are needed (we prefer the Google Cloud Platform, but similar infrastructures are provided by other cloud vendors like AWS or Microsoft Azure):

• Google Compute Engine – in particular, cloud graphics processing units (GPUs). The best-performing chip for now is the NVIDIA A100, but the NVIDIA T4 is a viable alternative due to its cost-effectiveness.
      • TensorBoard – which provides tools to visualize data and the results of experiments needed for training machine learning models.
• MLflow — a platform for managing the full machine learning cycle, including experiment and evaluation tracking, a results dashboard, and model deployment.

In addition to these core components, you might want to experiment with other helpful tools: DVC (for versioning datasets and iterations of trained models), Jenkins (for automation and control), and Docker containers for deployment.
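For illustration, here is what tracking a single experiment with MLflow might look like; the experiment name, parameters, and metric values below are placeholders:

```python
# Tracking one experiment with MLflow. The experiment name, parameters,
# and metric values are placeholders; mlflow must be installed first.
import mlflow

mlflow.set_experiment("abusive-content-classifier")  # hypothetical name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)  # record the hyperparameters...
    mlflow.log_param("hidden_layers", 2)
    mlflow.log_metric("val_accuracy", 0.94)  # ...and the resulting metrics
    # mlflow.sklearn.log_model(model, "model")  # optionally store the model itself
```

Each run then appears in the MLflow dashboard, making it easy to compare experiments and pick the best model for deployment.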

      The benefits of working with an ML pipeline delivery

For clients looking for ML-driven solutions for their business, one of the key benefits of the approach outlined above is that they receive a fully-fledged ML system instead of a trained ML model that needs constant revisiting. With a deployed ML pipeline, the client won’t have to return to the team of ML engineers each time the model needs a substantial update, although some level of in-house machine learning expertise will still be required for maintenance.

The automation of the process is another significant benefit: it speeds up the deployment of an ML model and, crucially, alleviates the burden of manual tasks.

These machine learning pipeline steps take project management in machine learning to a new level, resulting in benefits for both clients and developers.

Postindustria offers a full range of services for the delivery of solutions based on machine learning. Leave us your contact information and we’ll contact you to discuss your project.
