What is Model Serving?

Developing a model is one thing, but serving a model in production is a completely different task. When a data scientist has a model ready, the next step is to deploy it in a way that lets it serve the application.

In general, there are two types of model serving: batch and online. Batch serving means that you feed the model a large amount of data, typically as a scheduled job, and write the output to a table. Online serving means that you deploy the model behind an endpoint so applications can send a request to the model and get a response at low latency.
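To make the distinction concrete, here is a minimal sketch of both modes in plain Python. The `predict` function is a hypothetical stand-in for a trained model, and the record layout (`id`, `features`) is an assumption made for illustration only.

```python
import json
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], float]


def predict(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: sums the features."""
    return float(sum(features))


def batch_score(rows: List[Dict], model: Model) -> List[Dict]:
    """Batch serving: score a whole dataset and return an output table."""
    return [{"id": row["id"], "score": model(row["features"])} for row in rows]


def handle_request(body: str, model: Model) -> str:
    """Online serving: score a single JSON request and return a JSON response."""
    payload = json.loads(body)
    return json.dumps({"score": model(payload["features"])})
```

In a real system, `batch_score` would typically be triggered by a scheduler and write to a table, while `handle_request` would sit behind an HTTP endpoint.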

At its core, model serving means hosting machine-learning models (in the cloud or on premises) and making their functions available via API so that applications can incorporate AI into their systems. Model serving is crucial, as a business cannot offer AI products to a large user base without making its product accessible. Deploying a machine-learning model in production also involves resource management and model monitoring, including operational statistics and model drift.

Any machine-learning application ends with a deployed model. Some require simple deployments, while others involve more complex pipelines. Companies like Amazon, Microsoft, Google, and IBM provide tools to make it easier to deploy machine-learning models as web services. Moreover, advanced tools can automate the tedious workflows involved in building your machine-learning model services.

In this post, you’ll learn what production-grade model serving actually is. We’ll also discuss model serving use cases, tools, and model serving with Iguazio.

What Is Production-Grade Model Serving?

A monolithic system may embed a machine-learning model and not expose it outside the system. This type of architecture requires every application that uses the same machine-learning model to own a copy. If there are many such applications, it quickly becomes a nightmare for MLOps. A better approach is to make the machine-learning model accessible to multiple applications via API. This deployment type has various names, including model serving, ML model serving, or machine learning serving, but they all mean the same thing.
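As a rough illustration of the API approach, the sketch below exposes a single model behind an HTTP endpoint using only the Python standard library, so that any number of client applications can share one copy. The `/predict` path, the JSON layout, and the `predict` function are all assumptions made for this example, not part of any particular serving product.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: averages the features."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one endpoint is served in this sketch.
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port: int = 0) -> HTTPServer:
    """Bind the shared model endpoint; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server(8080).serve_forever()` would start the endpoint locally. A production deployment would add HTTPS, authentication, and load balancing in front of it.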

Model serving, at a minimum, makes machine-learning models available via API. A production-grade API has the following extra functions:

  • Access points (endpoints): An endpoint is a URL that allows applications to communicate with the target service via the HTTPS protocol.
  • Traffic management: Requests arriving at an endpoint are routed according to their destination service. Traffic management may also use load balancing to process requests concurrently.
  • Pre- and post-processing of requests: A service may need to transform request messages into the format suitable for the target model and convert response messages into the format required by client applications. Often, serverless functions can handle such transformations.
  • Model drift monitoring: We must monitor how each machine-learning model performs and detect when its performance deteriorates and it requires retraining.
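Two of the functions above, request pre- and post-processing and performance monitoring, can be sketched in a few lines of Python. Everything here is illustrative: the feature names, the 0.5 decision threshold, and the accuracy-drop rule are assumptions, and real drift monitoring usually relies on richer statistics.

```python
from typing import Callable, Dict, List, Sequence, Tuple

FEATURE_ORDER = ("age", "income")  # hypothetical feature names


def preprocess(request: Dict[str, float]) -> List[float]:
    """Turn a client request into the ordered feature vector the model expects."""
    return [float(request[name]) for name in FEATURE_ORDER]


def postprocess(score: float, threshold: float = 0.5) -> Dict[str, object]:
    """Turn a raw model score into the response format the client expects."""
    label = "positive" if score >= threshold else "negative"
    return {"score": score, "label": label}


def serve(request: Dict[str, float], model: Callable[[List[float]], float]) -> Dict[str, object]:
    """Run one request through pre-processing, the model, and post-processing."""
    return postprocess(model(preprocess(request)))


def needs_retraining(
    outcomes: Sequence[Tuple[str, str]],
    baseline_accuracy: float,
    max_drop: float = 0.05,
) -> bool:
    """Flag the model when recent accuracy falls more than max_drop below the baseline.

    outcomes holds (predicted_label, actual_label) pairs collected in production.
    """
    if not outcomes:
        return False
    correct = sum(1 for predicted, actual in outcomes if predicted == actual)
    return baseline_accuracy - correct / len(outcomes) > max_drop
```

Keeping the transformations separate from the model makes it possible to change the request format without retraining, and the monitoring check can run on a schedule against logged predictions.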

Model Serving Use Cases

There are many use cases for model serving. One example is data science for healthcare, which involves machine-learning deployment at scale in a complex clinical environment. These systems may monitor patients’ vital signs in real time and access medical histories and doctors’ diagnoses to make predictions that assist in making decisions and taking actions. As many applications will be working towards common goals, it makes sense to deploy real-time machine-learning models using model serving.

Another significant use case is data science for financial services. This spans various tasks, including risk monitoring, such as real-time fraud detection, and personalization services based on customer preferences. AI-based trading algorithms are often time critical and must operate in a secure environment. Before deploying them to the real world, data scientists and system engineers should use a machine-learning test server equipped with the same API as the production systems, since even a small delay or mistake can have a massive impact on the bottom line.

Model Serving Tools

Managing model serving for non-trivial AI products is not a simple task and can have a significant monetary impact on business operations. There are various ML serving tools for deploying machine-learning models in secure environments at scale, including:

  • TensorFlow Serving covers a wide range of production requirements, from creating an endpoint to performing real-time model serving at scale.
  • Amazon’s ML API provides purpose-built AI services for handling common business problems, such as intelligent chatbots for contact centers and personalized customer recommendations.
  • Amazon SageMaker makes it easy to set up endpoints using a Jupyter notebook-style environment called Amazon SageMaker Studio.
  • Azure Machine Learning supports enterprises building and deploying machine-learning products at scale using the Azure platform.
  • Google’s Prediction API eases machine-learning model implementation, as it automatically learns from a user’s training data and generates an AI model.
  • IBM Watson offers pre-built applications for natural language understanding and speech recognition, to name a few.

Model Serving with Iguazio

Iguazio’s MLOps platform can ease the management of model serving pipelines. The platform ticks all the boxes when it comes to the major features required for model serving, complete with open-source tools like Iguazio’s MLRun and Nuclio.

MLRun is an MLOps orchestration tool for deploying machine-learning models in multiple containers. It can automate model training and validation and supports a real-time deployment pipeline. Its feature store can handle pre- and post-processing of requests. You can cut time to production by automating MLOps.

MLRun leverages Nuclio, a serverless function platform, for deploying machine-learning models, making it easy to handle on-demand resource utilization and auto-scaling. It also integrates well with streaming engines (e.g., Kafka, Kinesis), so the function that runs the model can run it on live events. Moreover, it supports multiple worker processes ingesting data concurrently in real time.
MLRun provides a simple way to convert a machine-learning model into a deployable function, using an easy-to-use Python SDK that can be run from your Jupyter notebook.

Model serving is a generic solution that works for any vertical that requires online serving, and Iguazio’s MLOps platform can help simplify building such services.