
How to Build a Machine Learning Pipeline to Ensure Efficient Project Delivery

Eugene Dorfman
29 Oct 2021
8 min

A year ago we lost a potential client. A company declined our machine learning (ML) project proposal which would automate sifting through abusive content on display ads. Neither a decade of experience in delivering solutions for the AdTech industry nor a team of certified ML engineers helped my team close the deal on that occasion. We realized that our proposal must be lacking something but couldn’t figure out exactly what was missing.

Months later, we discovered our weak spot — a focus on deploying a ready-to-use trained machine learning model instead of building an infrastructure for its training, evaluation, tuning, deployment, and continuous improvement. This realization made us review our entire approach to managing ML projects and come up with a solution — a machine learning pipeline.

In this article, I’ll share what I wish we had known about the ML process back then, outline the machine learning pipeline steps, and explain why delivering a pipeline, rather than a standalone model, is the most viable way to bring business value to users in any domain — from AdTech to healthcare.


    What tasks does ML solve?

Machine learning is a branch of computer science that uses algorithms to build and train models that perform routine tasks by detecting patterns in data and learning from experience. The essence of ML is that it relies on large sets of data samples to train a computer to perform tasks that humans handle well but traditional, explicitly programmed software handles poorly.

The tasks for ML projects fall into two big categories: those based on classification algorithms and those based on regression algorithms. The first group predicts or classifies discrete values such as Safe or Unsafe, Suspicious or Unsuspicious, Spam or Not Spam. The second predicts continuous values such as price, age, or salary.

    These tasks can be grouped into smaller subcategories that lean on either of the above algorithms: object detection, image classification, anomaly detection, ranking and clustering. 


    Spotting abusive content is a classic example of a classification task, while a task to predict a stock price at a given time, for example, falls into the category of tasks related to regression algorithms. 
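To make the distinction concrete, here is a minimal sketch of both task types; scikit-learn and the synthetic datasets are our choice for illustration, not part of any specific project setup:

```python
# A minimal sketch contrasting the two task families. scikit-learn and the
# synthetic datasets are illustrative assumptions, not project specifics.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (e.g., Spam / Not Spam).
X_cls, y_cls = make_classification(n_samples=500, n_features=10, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print("Predicted class:", clf.predict(X_cls[:1]))   # e.g., [0] or [1]

# Regression: predict a continuous value (e.g., a price).
X_reg, y_reg = make_regression(n_samples=500, n_features=10, random_state=42)
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted value:", reg.predict(X_reg[:1]))   # a real number
```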

In healthcare, algorithms can be trained to classify X-rays, MRI scans, or other medical images to detect potentially malignant lesions and tumors in human organs, eventually helping with early disease diagnosis. The applications of machine learning in this domain go far beyond image classification and include automatic health report generation, smart records generation, drug discovery, patient condition tracking, and more.

    Regardless of what type of task you deal with, you’ll need to deliver a trained machine learning model to solve it. But a model alone is not enough. And here is why.

    What is wrong with the common approach to managing ML projects? 

    A common approach to managing ML projects leans on deploying ML models manually and usually follows these steps:

    • Define the problem and set the goal
    • Obtain the required data for training
    • Train, evaluate and fine-tune the models
    • Present the trained model and integrate it into the existing workflow

Below is the breakdown of the manual process of working with ML projects, as outlined by Google Cloud.

[Figure: breakdown of the manual ML workflow]

    This workflow suggests that a trained model should be the final result or a delivery artifact of a successfully completed ML project. However, this approach is flawed for several reasons.

The world is constantly evolving, and the data ML engineers process to train models changes with it. A trained model delivered as an artifact can only meet the client’s immediate needs based on the data it was given; it proves ineffective in the long run because it doesn’t automatically adjust to changes in the underlying data.

As the incoming data drifts away from the datasets the model was trained and evaluated on, the model’s accuracy degrades.
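One practical consequence is that a deployed model should be re-scored on fresh labeled data so degradation is caught early. Here is a minimal sketch of such a check, assuming a scikit-learn-style model; the threshold value and function name are illustrative:

```python
# A minimal drift check: re-score the deployed model on freshly labeled
# holdout data and flag it for retraining when accuracy falls too low.
# The threshold and function name are hypothetical, for illustration only.
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # hypothetical acceptable floor

def needs_retraining(model, X_fresh, y_fresh) -> bool:
    """Return True when accuracy on new data drops below the threshold."""
    current_accuracy = accuracy_score(y_fresh, model.predict(X_fresh))
    return current_accuracy < ACCURACY_THRESHOLD
```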

This is especially important for projects in healthcare. The higher the accuracy a model shows in classifying medical images, for example, the better its chances of rivaling medical professionals in making accurate diagnoses and reducing the risk of human error.

    Machine learning model accuracy is defined as the percentage of correct predictions for the test data and is calculated by dividing the number of correct predictions by the total number of predictions. It is usually the determining factor when evaluating the success of a machine learning project — the higher the accuracy, the better the machine learning model performs. 

    However, it’s worth noting that this is true only for balanced validation datasets — those containing an equal number of examples for each category of tested data. 

When dealing with unbalanced validation datasets, metrics such as precision, recall, and F-score come into play. Precision measures how many of the selected items are relevant, recall shows how many of the relevant items were selected, and the F-score combines the two as their harmonic mean.
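Here is a small, self-contained illustration of why accuracy alone can mislead on unbalanced data; the labels below are made up, and scikit-learn is assumed purely for convenience:

```python
# Hypothetical labels for an unbalanced binary task (1 = abusive, 0 = safe).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # only 2 of 10 samples are positive
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # model predictions

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.8 — looks good...
print("Precision:", precision_score(y_true, y_pred))  # 0.5 — half of flagged items are truly abusive
print("Recall:   ", recall_score(y_true, y_pred))     # 0.5 — half of abusive items were caught
print("F1 score: ", f1_score(y_true, y_pred))         # 0.5 — harmonic mean of precision and recall
```

Note how an accuracy of 0.8 hides the fact that only half of the truly abusive items were caught.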

    How to achieve better ML model performance

Optimizing these metrics on test data requires room for numerous experiments. Experimenting with different model architectures, preprocessing code, and the hyperparameters that define the success of the learning process essentially means re-training the model multiple times. Here is how it works:

    • First, we choose a set of hyperparameters for the experiment
    • Then, we train the model defined by these hyperparameters
• Afterward, we assess the result — if the target metrics are not optimal, we tune the model either by modifying its architecture (adding or removing layers, etc.) or by changing the hyperparameters.

    Through this process, the training dataset changes and the architecture of the model changes. On top of that, an enormous number of experiments need to be carried out to achieve an acceptable level of accuracy. Carrying out this task manually is onerous, which is why it makes sense to deliver an automated infrastructure for the process — a machine learning pipeline.
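As a rough sketch of the experiment loop described above (the synthetic dataset, MLP model, and hyperparameter grid here are illustrative assumptions, not a prescription):

```python
# A sketch of the experiment loop: try several hyperparameter combinations,
# train a model for each, and keep the best-scoring configuration.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_score, best_params = 0.0, None
for lr, layers in product([1e-2, 1e-3], [(32,), (32, 32)]):
    model = MLPClassifier(hidden_layer_sizes=layers, learning_rate_init=lr,
                          max_iter=300, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # validation accuracy for this run
    if score > best_score:              # keep the best configuration so far
        best_score, best_params = score, {"lr": lr, "layers": layers}

print(f"Best configuration: {best_params} (validation accuracy {best_score:.3f})")
```

In practice, each of these runs needs to be tracked, which is exactly what the pipeline tooling covered below automates.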

    What is a machine learning pipeline?

    A machine learning pipeline is a concept for delivering ML projects based on building a robust infrastructure that systematically trains and evaluates models, tracks experiments and deploys well-performing models.

    It relies on the idea of MLOps — a set of practices that ensure the implementation and automation of continuous integration, delivery and training for ML systems. 

[Figure: the MLOps cycle combining ML, Dev, and Ops]

    MLOps combines data extraction and analysis (ML), modeling and testing (Dev) and continuous service delivery (Ops).

However, while MLOps works for both small projects and big data, the pipeline suggested here is best suited to datasets and models that fit on a conventional hard drive (usually no more than a few hundred GB).

    Machine learning pipeline steps

    A machine learning pipeline consists of the following steps:

    1. Dataset retrieval – downloading the needed datasets from storage and bringing them together.
    2. Dataset preprocessing – aimed at transforming raw data into a usable format for further analysis.
3. Dataset splitting – splitting the data into training and validation subsets. We use the validation subset to evaluate how well the model generalizes after being trained on a training subset. This allows us to judge whether the model improves after each training cycle.
    4. Preparing for experiments – fine-tuning of the models’ architecture and modifying hyperparameters. 
    5. Running tests or training loops.
    6. Tracking experiment results – using the metrics (accuracy, precision, recall, F-scores for classification models and loss metrics, including MAE, RMSE and NRMSE for regression models).
    7. Choosing the best model – based on the metrics.
    8. Performance evaluation of the chosen model.
    9. Deployment – using the top-performing model.

    This workflow allows for continuous fine-tuning of existing models alongside constant performance evaluations. The biggest advantage of this process is that it can be automated with the help of available tools.
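To give a feel for how the steps fit together, here is a bare-bones Python skeleton of the pipeline; every function name and the "target" column are illustrative assumptions rather than a fixed API:

```python
# A bare-bones skeleton of the nine pipeline steps above. Function names and
# the "target" column are illustrative assumptions, not a fixed API.
import pandas as pd
from sklearn.model_selection import train_test_split

def retrieve_dataset(url: str) -> pd.DataFrame:
    """Step 1: download the dataset from storage."""
    return pd.read_csv(url)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: transform raw data into a usable format (minimal example)."""
    return df.dropna()

def run_pipeline(url: str, candidate_models: dict):
    df = preprocess(retrieve_dataset(url))
    X, y = df.drop(columns=["target"]), df["target"]
    # Step 3: split into training and validation subsets.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # Steps 4-6: run the experiments and track their results.
    results = {}
    for name, model in candidate_models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_val, y_val)

    # Steps 7-8: choose the best model and report its performance.
    best = max(results, key=results.get)
    print(f"Best model: {best} ({results[best]:.3f})")
    # Step 9: hand the top performer off for deployment.
    return candidate_models[best]
```

A production pipeline would also persist each artifact and log every run, which is where the tools in the next section come in.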

For med-tech startups, this means they can get an ML model for their particular project that is constantly updated as more patient data comes in, resulting in higher accuracy.


      Infrastructure and tools

      To build an infrastructure for a machine learning pipeline and automate the process, the following components are needed (we prefer the Google Cloud Platform, but similar infrastructures are provided by other cloud vendors like AWS or Microsoft Azure):

• Google Compute Engine – in particular, cloud graphics processing units (GPUs). The best-performing chip for now is the NVIDIA A100, but the NVIDIA T4 is a viable alternative due to its cost-effectiveness.
      • TensorBoard – which provides tools to visualize data and the results of experiments needed for training machine learning models.
• MLflow — a platform for managing the full machine learning cycle, including experiment and evaluation tracking, a results dashboard, and model deployment.

In addition to these core components, you might want to experiment with other helpful tools: DVC (for versioning datasets and iterations of trained models), Jenkins (for automation and control), and Docker containers for deployment.
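For illustration, here is what tracking a single experiment with MLflow might look like; the experiment name, parameters, and metric values below are placeholders:

```python
# Tracking one experiment with MLflow. The experiment name, parameters,
# and metric values are placeholders; mlflow must be installed first.
import mlflow

mlflow.set_experiment("abusive-content-classifier")  # hypothetical name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)  # record the hyperparameters...
    mlflow.log_param("hidden_layers", 2)
    mlflow.log_metric("val_accuracy", 0.94)  # ...and the resulting metrics
    # mlflow.sklearn.log_model(model, "model")  # optionally store the model itself
```

Each run then appears in the MLflow dashboard, making it easy to compare experiments and pick the best model for deployment.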

      The benefits of working with an ML pipeline delivery

For clients looking for ML-driven solutions for their business, one of the key benefits of the approach outlined above is that they receive a fully-fledged ML system instead of a trained ML model that needs constant revisiting. With a deployed ML pipeline, the client won’t have to return to the team of ML engineers each time the model needs a substantial update, although some level of in-house machine learning expertise will still be required for maintenance.

The automation of the process is another significant benefit: it speeds up the deployment of an ML model and, crucially, alleviates the burden of manual tasks.

These machine learning pipeline steps take project management in machine learning to a new level, resulting in benefits for both clients and developers.

Postindustria offers a full range of services for the delivery of solutions based on machine learning. Leave us your contact information and we’ll contact you to discuss your project.
