Unpacking the Complexity of Machine Learning Deployments
Nazeer Shaik
September 17, 2019

Deploying and maintaining Machine Learning models at scale is one of the most pressing challenges faced by organizations today. The Machine Learning workflow, which includes training, building and deploying models, can be a long process with many roadblocks along the way. Many data science projects don’t make it to production because of challenges that slow down or halt the entire process. To overcome the challenges of model deployment, we need to identify the problems and learn what causes them. Some of the top challenges organizations face when trying to take a Machine Learning model into production are:

Machine Learning Demands Heterogeneity

End-to-end ML applications often consist of components written in different programming languages. Depending on the use case, a data scientist might choose Python, R, Scala, or another language to build one model, then another language for a second model. Within a given language, there are numerous frameworks and toolkits available. TensorFlow, PyTorch, and scikit-learn all work with Python, but each is tuned for specific types of operations, and each outputs a slightly different type of model. For data pre-processing, JVM-based systems such as Apache Spark are often preferred because of static typing in the language and better support for parallel execution. Such heterogeneous codebases are often hard to keep consistent.

The variety of frameworks and toolkits enables data scientists to choose the language and tool that best suit the problem at hand. However, each new tool and language they choose must be enabled and handled by IT teams. Containerization technologies, such as Docker, can solve incompatibility and portability challenges introduced by the multitude of tools. However, automatic dependency checking, error checking, testing, and build tools will not be able to tackle problems across the language barrier.

Reproducibility is also a challenge in these scenarios. Data scientists may build many versions of a model, each using different programming languages, different libraries, or different versions of the same library. It is difficult to track these dependencies manually. To solve these challenges, an ML lifecycle tool is required that can automatically track and log these dependencies during the training phase as configuration as code, and later bundle them with the trained model in a ready-to-deploy artifact.
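As a rough illustration of what such tracking might record, the sketch below captures the interpreter version and the versions of a few packages, and bundles them with a checksum of the serialized model. The function names, the manifest layout and the placeholder model bytes are assumptions made for this example; a real lifecycle tool would capture far more.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Record interpreter and package versions for a training run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


def build_manifest(model_bytes, packages):
    """Bundle a model checksum with the environment that produced it."""
    return {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "environment": capture_environment(packages),
    }


# Placeholder bytes stand in for a real serialized model (assumption).
manifest = build_manifest(b"serialized-model-placeholder", ["pip", "numpy"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing a manifest like this next to the model artifact makes it possible to tell, at deployment time, which library versions a given model version expects.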

ML Deployments Are Not Standalone

Machine learning model deployments are not self-contained solutions. They are usually either embedded or integrated into business applications. Wrapping a model in a REST API is the simplest language-agnostic way to integrate it with business applications.

This approach aligns well with microservices architecture and makes it possible to update or scale the machine learning model component individually. Creating REST APIs is easy, as most frameworks provide the capability out of the box, but sometimes models need to be deployed as gRPC APIs for efficient network usage and better performance, especially when the inputs (images, videos, text) are large.
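To make the REST option concrete, here is a minimal sketch that uses only the Python standard library. The `predict` function is a hypothetical stand-in for a trained model, and the `/predict` path and JSON payload shape are assumptions chosen for the example; production services would normally rely on a serving framework instead.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            score = predict(payload["features"])
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON body with a 'features' list")
            return
        body = json.dumps({"score": score}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any client that speaks HTTP and JSON can call the model the same way.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request(
    "POST",
    "/predict",
    body=json.dumps({"features": [2.0, 4.0]}),
    headers={"Content-Type": "application/json"},
)
result = json.loads(conn.getresponse().read())
conn.close()
server.shutdown()
server.server_close()
print(result)
```

The caller only needs to know the endpoint and the payload format, which is what makes this style of integration language-agnostic.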

As edge devices (mobiles, IoT, etc) are becoming more and more powerful in terms of computing and storage, there is an increasing trend of deploying and running models directly on these devices. However, models in this case still need to be optimized for CPU and memory usage. In most cases, models are embedded in the applications running on these devices, so the challenges of runtime compatibility and portability arise. The challenge here is to take a model trained in any programming language/library and optimize and export it to match the runtime language and version of the edge devices.

A widely used approach other than REST/gRPC APIs for integrating disparate components is the use of messaging systems. An ability to package and deploy a model that can integrate with messaging systems can avoid a lot of boilerplate code and ease deployments.

In some cases, project constraints and compliance requirements dictate deployment needs. For example, a model trained using a Python library may have to be converted and deployed to an Azure confidential computing environment that supports a C++ runtime.

Intricacies of ML Model Definition

What is a Machine Learning model? Is it only the model parameters obtained after training (e.g., the weights of a logistic regression model)? Does it need to also include feature transformations which are important for the model to work correctly? Many libraries combine feature transformations and the actual ML model in a single abstraction often referred to as ML pipelines. From a systems perspective, the model can be considered a ‘black-box’ with defined inputs and outputs or it could be considered as a combination of operations with known semantics. A model could be a combination of models (e.g., ensembles where models from different languages or libraries are combined, or where an output of one model is the input to another model).
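The difference between "parameters only" and "pipeline" can be shown with a small pure-Python sketch. The class names and the hand-picked weights below are assumptions for illustration and do not mirror any particular library.

```python
import math


class StandardScaler:
    """Feature transformation learned from training data."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [sum(c) / len(c) for c in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [
            math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(columns, self.means)
        ]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


class LogisticModel:
    """Holds only the trained parameters: weights and bias."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, rows):
        return [
            1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(self.weights, row)) + self.bias)))
            for row in rows
        ]


class Pipeline:
    """Deployable unit: transformation and parameters travel together."""

    def __init__(self, transformer, model):
        self.transformer = transformer
        self.model = model

    def predict_proba(self, raw_rows):
        return self.model.predict_proba(self.transformer.transform(raw_rows))


scaler = StandardScaler().fit([[1.0, 10.0], [3.0, 30.0]])
model = LogisticModel(weights=[1.0, 1.0], bias=0.0)  # illustrative values
pipeline = Pipeline(scaler, model)

raw_input = [[2.0, 20.0]]
print(pipeline.predict_proba(raw_input))  # scaled input, as seen in training
print(model.predict_proba(raw_input))     # same weights fed unscaled input
```

Fed the same raw input, the bare parameters and the pipeline give very different answers, which is why the transformation usually has to be deployed as part of the model.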

Service-Oriented Architecture and microservices have moved applications from monolithic code to more composable and manageable components. Machine learning is even more composable, as its building blocks are more granular and disparate. In real-world applications, ML models are deployed and managed as a single unit or as multiple components, each managed and updated individually.

For example, a Public Relations firm looking to identify news and reports critical of one of its customers might use the following pipeline:

  • Extract text from a stream of scanned documents with an Optical Character Recognition (OCR) model
  • Identify the language of that text with a language-identification model
  • Translate non-English text to English
  • Prepare the text for sentiment analysis
Figure: An ML pipeline that extracts text from scanned documents and analyzes the sentiment of OCR’ed text.

In each case, a model might be developed using a different set of languages and tools.

Testing & Validation Struggles

Models evolve continuously as data changes, methods improve or software dependencies change. Every time such a change occurs, model performance must be re-validated. These validations introduce several challenges and pitfalls:

  • The models must be evaluated using the same test and validation datasets to be able to compare the performance of different ML models.
  • The same code for evaluating metrics must be used throughout time and across different models to be able to guarantee comparability.
  • Updates to test/validation datasets or code require the different ML models (including old and new) to be re-evaluated in order to be comparable. This introduces unique challenges in CI/CD pipelines complicating the process of automatically training and deploying newer versions of ML models in production.
  • The improvement in a new model may come at a cost e.g., longer prediction times. To identify such impacts, benchmark tests and load/stress tests must be part of the validation and decision-making process.
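The points above can be sketched as a small evaluation harness: every candidate is scored on the same dataset with the same metric function, and prediction time is recorded alongside accuracy. The threshold "models" and the tiny dataset are assumptions made purely for illustration.

```python
import time


def accuracy(predictions, labels):
    """Single shared metric implementation used for every model."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate(models, inputs, labels):
    """Score each model on the same data and time its predictions."""
    report = {}
    for name, predict in models.items():
        start = time.perf_counter()
        predictions = [predict(x) for x in inputs]
        elapsed = time.perf_counter() - start
        report[name] = {
            "accuracy": accuracy(predictions, labels),
            "seconds_per_prediction": elapsed / len(inputs),
        }
    return report


# Fixed validation set shared by all candidates (toy values).
inputs = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]

models = {
    "current": lambda x: int(x > 0.3),
    "candidate": lambda x: int(x > 0.5),
}

report = evaluate(models, inputs, labels)
for name, metrics in report.items():
    print(name, metrics)
```

If the validation data or the metric code changes, rerunning `evaluate` over old and new models together keeps the numbers comparable.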

Apart from the validation of models in offline tests, assessing the performance of models in production is crucial. This is discussed in the deployment strategy and monitoring sections.

Convoluted Release Strategies

Releasing ML models is no less complex than releasing any other software application today. On top of that, ML models need to be updated more frequently than regular software applications.

Release strategy and deployment infrastructure for ML must consider the heterogeneity factor and the fact that a Model may include multiple components each built using a different programming language or ML framework. Each of these components may need to be updated or rolled back independently or as a single unit. Also, to see the best ROI from ML models, it is important to be able to deploy models as fast as possible and repeatedly.

Launching a new model in “shadow mode” (i.e., capture the inputs and predictions in production without actually serving those predictions) helps catch operational problems, smoke test the model and analyze results without any impact to end-users.
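A minimal sketch of shadow mode follows, assuming both models are plain Python callables. The class name, the log format and the toy fraud-style thresholds are hypothetical. The key property is that the caller only ever receives the live model's answer, even when the shadow model fails.

```python
class ShadowDeployment:
    """Serve the live model; run the shadow model only for logging."""

    def __init__(self, live_model, shadow_model):
        self.live_model = live_model
        self.shadow_model = shadow_model
        self.log = []

    def predict(self, features):
        served = self.live_model(features)
        try:
            shadow, error = self.shadow_model(features), None
        except Exception as exc:  # a shadow failure must never reach users
            shadow, error = None, repr(exc)
        self.log.append(
            {"input": features, "served": served, "shadow": shadow, "error": error}
        )
        return served


def live_model(amount):
    return amount >= 10


def shadow_model(amount):
    if amount < 0:
        raise ValueError("negative amount")  # simulated operational bug
    return amount >= 8


deployment = ShadowDeployment(live_model, shadow_model)
served = [deployment.predict(amount) for amount in [5, 9, 12, -1]]

disagreements = [
    entry for entry in deployment.log
    if entry["error"] is None and entry["shadow"] != entry["served"]
]
errors = [entry for entry in deployment.log if entry["error"] is not None]
print(served, len(disagreements), len(errors))
```

Reviewing the disagreements and errors in the log gives an early view of how the new model would behave before any user sees its predictions.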

A/B or bandit test modes are required to compare model performance in a production environment and analyze impacts on user experience and ROI. However, this can be quite challenging in cases where the feedback loops are long and indirect. The capability to accept inputs and random inspection from humans should also be considered. This is particularly important in scenarios where little labeled data is available and where the cost of errors is high.
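One simple bandit strategy is epsilon-greedy routing, sketched below. It is only one of several possible approaches, and the sketch assumes a reward (for example, a click) is available immediately after each prediction, which is exactly what long feedback loops make difficult. The model names and reward rates are invented for the simulation.

```python
import random


class EpsilonGreedyRouter:
    """Route most traffic to the best-performing model, explore the rest."""

    def __init__(self, model_names, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {name: 0 for name in model_names}
        self.totals = {name: 0.0 for name in model_names}

    def mean_reward(self, name):
        count = self.counts[name]
        return self.totals[name] / count if count else 0.0

    def choose(self):
        # Give every model at least one request before comparing them.
        untried = [name for name, count in self.counts.items() if count == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))  # explore
        return max(self.counts, key=self.mean_reward)  # exploit

    def record(self, name, reward):
        self.counts[name] += 1
        self.totals[name] += reward


# Simulated traffic: the reward rates are made up for this example.
true_reward_rate = {"model_a": 0.2, "model_b": 0.6}
router = EpsilonGreedyRouter(true_reward_rate, epsilon=0.1, seed=7)
simulation = random.Random(42)

for _ in range(2000):
    chosen = router.choose()
    reward = 1.0 if simulation.random() < true_reward_rate[chosen] else 0.0
    router.record(chosen, reward)

print(router.counts)
```

A plain A/B test corresponds to always choosing at random; the bandit variant shifts traffic toward the better model as evidence accumulates.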

Figure: Example Runtime Model Graph

CI/CD: A Gaggle of Disparate Pipelines

It has become very common for software developers to use Continuous Integration (CI) and Continuous Deployment (CD) tools. CI/CD tools are helping development teams to push rapid and accurate updates to production. Other benefits of CI/CD tools are reliability, reproducibility, velocity, security and version control.

Most CI/CD tools support the well-known software development workflows which include build, test and deploy steps. Machine learning workflows exhibit unique characteristics that are not observed in traditional development workflows.

Figure: Machine Learning Workflow

The most significant difference between traditional applications and ML models is that the primary input is not just code. Rather, there are two equally important input components: code and data. One must apply versioning to both these inputs to achieve reproducibility and auditability. One must also monitor both the data and the code for changes and automatically trigger the workflow when either changes. This may not be trivial, especially given the intricacies of data management. As discussed under Testing & Validation Struggles, a change in test/validation data requires retraining and/or re-validating the models. When building ML workflows, one must consider this need.
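One way to picture this is to fingerprint code and data together and let the differences decide which stages run. In the sketch below, the rule that maps changes to stages, the stage names and the in-memory byte strings standing in for files are all assumptions for illustration.

```python
import hashlib


def sha256(content):
    return hashlib.sha256(content).hexdigest()


def fingerprint(code, training_data, validation_data):
    """Version all three inputs of the workflow, not just the code."""
    return {
        "code": sha256(code),
        "training_data": sha256(training_data),
        "validation_data": sha256(validation_data),
    }


def stages_to_run(previous, current):
    """Decide which workflow stages a change should trigger (assumed policy)."""
    if previous is None:
        return ["train", "validate"]
    if (previous["code"] != current["code"]
            or previous["training_data"] != current["training_data"]):
        return ["train", "validate"]
    if previous["validation_data"] != current["validation_data"]:
        return ["validate"]  # re-validate existing models on the new data
    return []


code = b"def train(): ..."
train_csv = b"x,y\n1,0\n"
valid_csv = b"x,y\n2,1\n"

baseline = fingerprint(code, train_csv, valid_csv)
unchanged = fingerprint(code, train_csv, valid_csv)
new_validation = fingerprint(code, train_csv, valid_csv + b"3,1\n")
new_code = fingerprint(b"def train(): pass", train_csv, valid_csv)

print(stages_to_run(baseline, unchanged))
print(stages_to_run(baseline, new_validation))
print(stages_to_run(baseline, new_code))
```

Recording the fingerprint with each trained model also provides the audit trail: it states exactly which code and data produced that model.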

Hardware and software dependencies for traditional application development are usually homogeneous. In the case of machine learning, each stage of the workflow may demand specific hardware and software components. Training stages are typically long and compute-intensive. Some workloads may demand the availability of hardware accelerators such as GPUs. CI/CD tools used for ML workflows should be capable of provisioning such dependencies on demand.

ML without Monitoring is a Nightmare

Monitoring tools are constantly evolving to support today’s cloud-native distributed containerized applications. Monitoring is now being replaced with Observability and includes logs, traces, and metrics. The tools, however, still need to evolve further to support machine learning.

Increased Scope

The scope of monitoring has been increasing to support more stakeholders. Most monitoring tools today not only help DevOps teams proactively monitor systems but also help developers debug and understand issues.

In the case of machine learning, the scope of monitoring further increases to include Data Scientists and Business Owners. Monitoring tools need to help these new stakeholders know how well a model is performing in production and to understand its output. The metrics to monitor, the information to log, the compliance needs, and the audit requirements for machine learning are very different from regular applications. More on this in the sections below.

Increased Complexity

As discussed in the sections above, ML is heterogeneous. Standardizing and baking monitoring into the deployment pipeline is easy for a single programming language or framework. Doing this to support different programming languages and frameworks is time-consuming, error-prone and complex. Given that ML models can be composed of other models and components, it is very important to be able to trace these components individually and to narrow down issues to one of them. Microservices architecture with containers, service meshes and immutable infrastructure are great techniques for handling and standardizing deployments of complex, heterogeneous ML models. However, these tools are not easy to configure and maintain. Specialized teams are required to put these into practice.


Hardware Accelerators

ML frameworks such as TensorFlow, PyTorch, Theano and others support the use of Graphics Processing Units (GPUs) to improve the speed of computation. Google provides Tensor Processing Units (TPUs), which provide acceleration over and above GPUs. Intel recently released Neural Network Processors (NNP) for deep learning training and inference at scale.

These hardware accelerators are expensive, and their usage should be monitored and optimized for cost-effectiveness. Most monitoring tools don’t provide this monitoring capability out of the box and require setup of additional tools/plugins.

ML Specific Metrics

ML performance is far more multidimensional than that of regular applications, combining several different kinds of metrics:

  • Accuracy: how well is the model making predictions, determined from feedback and actuals received
  • Data Drift: drift computation of training & actual feedback disparity (output), drift computation of training/runtime data disparity (input) and drift computation of correlation across features
  • Bias: computation of Input vs. Output (train vs. actual)
  • Anomaly: detection and logging all inputs with anomalies
  • Explainability: of top features per prediction
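As one example of how input drift might be computed, the sketch below uses the Population Stability Index (PSI), a common drift measure that is swapped in here for illustration; the list above does not prescribe a specific formula. The bin edges, the sample values and the small floor used to avoid taking the logarithm of zero are assumptions.

```python
import math


def bin_proportions(values, edges, floor=1e-6):
    """Share of values per bin; edges split the range into len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        index = sum(value > edge for edge in edges)
        counts[index] += 1
    # The floor keeps empty bins from producing log(0) or division by zero.
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(training, runtime, edges):
    """PSI = sum over bins of (runtime - training) * ln(runtime / training)."""
    expected = bin_proportions(training, edges)
    actual = bin_proportions(runtime, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


edges = [0.25, 0.5, 0.75]
training_feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
runtime_same = list(training_feature)
runtime_shifted = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.85, 0.65]

print(population_stability_index(training_feature, runtime_same, edges))
print(population_stability_index(training_feature, runtime_shifted, edges))
```

A value of zero means the binned distributions match. Larger values indicate that the runtime inputs have moved away from the training data, and the threshold that should raise an alert is a judgment call for each use case.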


Machine learning is still in its early stages. Both software and hardware components are constantly evolving to meet the current demands of ML. They are evolving rapidly, but the tools required to operationalize them are yet to catch up.

Docker/Kubernetes and microservices architecture can be employed to solve the heterogeneity and infrastructure challenges. Comet and MLflow are trying to address experiment versioning and reproducibility problems. Cloud platform services such as SageMaker, Azure ML and Google AI cater to scalable deployments. Existing tools can partly solve some problems individually. Bringing all these tools together to operationalize ML is the biggest challenge today. Often this is not practical, especially in enterprises and regulated industries.

As existing tools grow and as new tools are introduced to solve these challenges, one should consider the fact that the data science community is largely composed of individuals from academic and research backgrounds. The tools that operationalize ML must be simple enough to be usable by such users and at the same time be able to address the intricate challenges of ML.