MLOps Challenges, Solutions and Future Trends
Yaron Haviv | February 19, 2020
A summary of my MLOps NYC talk: the major AI/ML and data challenges, and how emerging open-source technologies will solve them
AI and ML practices are no longer the luxury of research institutes or technology giants; they are becoming an integral part of every modern business application. According to analysts, most organizations fail to deliver AI-based applications successfully and get stuck in the process of turning data-science models, which were tested on sample or historical data, into interactive applications that work with real-world, large-scale data.
A new engineering practice called MLOps has emerged to address these challenges. As the name indicates, it combines AI/ML practices with DevOps practices, and its goal is to enable continuous integration and delivery (CI/CD) of data- and ML-intensive applications.
Recently Iguazio, the provider of an end-to-end data science platform and developer of open-source MLOps technologies, held the MLOps NYC event together with the major cloud providers, leading enterprises, and technology giants. The agenda was to discuss different approaches and best practices, and to begin building collaboration and standardization around MLOps.
My session summarized the current challenges of MLOps and the trends we will be seeing in the near future (see the 12-minute video).
Challenges
According to various surveys, data-science teams don't do data science; they spend most of their time on data wrangling, data preparation, managing software packages and frameworks, configuring infrastructure, and integrating various components. These tasks can be generalized as feature management tasks and MLOps tasks (i.e., DevOps for ML).
The MLOps Challenge
Data-science originated in research organizations and was later used to produce reports and detect anomalies within mountains of data. The emerging trend to incorporate data-science in every business application, in order to intelligently react to events and data as they occur, is creating fundamental changes in machine learning practices.
Unlike research or postmortem data analysis, business applications must handle real-time, constantly changing data; they must be up 24/7, meet strict response-time requirements, support large numbers of users, and so on. Producing an ML model, once the end goal, is today just the first step in a long process of bringing data science to production (as illustrated in the following diagram). Many organizations underestimate the effort it takes to incorporate machine learning into production applications. As a result, entire projects are abandoned halfway (75% of data science projects never make it to production), or they consume far more resources and time than first anticipated.
The Data Challenge
Data scientists start with sample data; they work in Jupyter notebooks or use AutoML tools to identify patterns and train models. At a certain point they need to train the models on larger datasets, and this is when things start to become difficult. They might find that most tools, which work off CSV files and load data into memory, simply can't operate at scale, and that they need to re-architect everything to fit distributed platforms and structured databases.
Much time is spent creating features from raw data, and in many cases the same feature-extraction work is duplicated across multiple projects or by different teams. The overhead is further amplified every time a dataset changes, the derived data and models change, or experiments need to be repeated to reach the required accuracy.
When the data-science team tries to deploy models into production, they find that real-world data is different, and that they can't apply the same data-preparation methodology to data that is constantly changing. Latency or compute constraints require a fundamentally different data pipeline, one that depends on stream processing and fast key/value and time-series databases to deliver real-time feature vectors.
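To make the gap concrete: a feature like "average recent purchase amount", computed once over a static dataset during research, must be maintained incrementally in production. Below is a minimal sketch in plain Python (the class and field names are hypothetical, not from any specific platform) of a per-key rolling aggregator of the kind a stream processor backed by a key/value store would provide:

```python
from collections import defaultdict, deque

class RollingFeature:
    """Toy per-key rolling aggregate over the last `window` events,
    standing in for a stream processor plus a key/value store."""
    def __init__(self, window=100):
        self.events = defaultdict(lambda: deque(maxlen=window))

    def update(self, key, value):
        # Called on every incoming event from the stream
        self.events[key].append(value)

    def feature_vector(self, key):
        # Low-latency feature lookup at inference time
        vals = self.events[key]
        avg = sum(vals) / len(vals) if vals else 0.0
        return {"count": len(vals), "avg": avg}

agg = RollingFeature(window=3)
for amount in (10.0, 20.0, 30.0, 40.0):
    agg.update("user-1", amount)
vector = agg.feature_vector("user-1")  # only the last 3 events remain
```

The same `update` logic that a batch job runs once over a file must here run continuously per event, which is why the research-time data-preparation code rarely survives the move to production unchanged.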
At MLOps NYC, Uber, Twitter, and Netflix shared their experience building online and offline feature stores, which are a fundamental component of their data-science platforms.
Solutions and Future Trends
In my session I outlined the industry's vision for overcoming the challenges described above. The way to solve them is through:
- Use of scalable, production-ready data-science platforms and practices from day one.
- Adoption of automation and higher-level abstractions where possible.
- Design for collaboration and re-use.
Serverless ML Functions
The way we eliminate ML pipeline complexity is by adopting the concept of serverless "ML functions". Serverless technologies allow you to write code and a specification that automagically translate into auto-scaling production workloads. Until recently, serverless was limited to stateless, event-driven workloads, but with the new open-source technologies we demonstrated (MLRun + Nuclio + Kubeflow), serverless functions can take on the larger challenges of real-time, extreme-scale data analytics and machine learning.
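Conceptually, a serverless ML function boils down to a small handler plus a spec, while the platform takes care of packaging and scaling. The sketch below follows the Nuclio-style Python handler signature (`handler(context, event)`); the model weights, feature names, and the `_Event` helper are hypothetical stand-ins for local testing:

```python
import json

# Stand-in for a trained model, loaded once when the function initializes
MODEL_WEIGHTS = {"clicks": 0.7, "purchases": 1.3}

class _Event:
    """Minimal stand-in for the platform's event object."""
    def __init__(self, body):
        self.body = body

def handler(context, event):
    # Nuclio-style entry point: parse the request body, score, respond
    features = json.loads(event.body)
    score = sum(MODEL_WEIGHTS.get(k, 0.0) * v for k, v in features.items())
    return json.dumps({"score": score})

# Local invocation; in production the platform wires up HTTP/stream triggers
response = handler(None, _Event(json.dumps({"clicks": 2, "purchases": 1})))
```

The handler contains only model logic; triggers, replication, and resource limits live in the function's spec, which is what lets the same code run unchanged from a laptop to an auto-scaled cluster.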
The steps of packaging, scaling, tuning, instrumentation, and continuous delivery are fully automated, addressing the two main challenges of every organization:
- Significantly reduce time to market
- Minimize the amount of resources and skill level needed to complete the project
ML functions can easily be chained into ML pipelines (using Kubeflow). They can generate data and features that are consumed by subsequent stages. The following diagram demonstrates the pipeline used to create a real-time recommendation engine application:
Transitioning to microservices and a functional programming model enables collaboration and code re-use. Users can gradually extend and tune functions without breaking their pipelines, while consuming only the right amount of CPU, GPU, and memory resources. Kubernetes and Kubeflow play a central role in this architecture, scheduling the right resources, scaling out the workloads, and managing pipelines.
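The chaining model can be illustrated without a cluster: each stage is a function whose output feeds the next, which is roughly what a Kubeflow pipeline definition expresses declaratively, with each call becoming a containerized, independently scaled step. All names below are hypothetical:

```python
def ingest():
    # Stage 1: pull raw events (stubbed with static data here)
    return [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": 7}]

def featurize(rows):
    # Stage 2: derive features that later stages (or other teams) can reuse
    return {r["user"]: {"clicks": r["clicks"], "active": r["clicks"] > 5}
            for r in rows}

def train(features):
    # Stage 3: stand-in "model" = average clicks of active users
    active = [f["clicks"] for f in features.values() if f["active"]]
    return sum(active) / len(active) if active else 0.0

def pipeline():
    # In Kubeflow, each call here would be a separate containerized step
    return train(featurize(ingest()))

model = pipeline()
```

Because each stage has a clean input/output contract, a team can swap in a better `featurize` step without touching `ingest` or `train`, which is the collaboration and re-use property described above.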
Orit Nissan-Messing, Iguazio’s Chief Architect & Co-Founder, presented a session describing the Nuclio ML Functions and MLRun architecture with a live demo at KubeCon + CloudNativeCon 2019 in San Diego (watch her video to learn more about this challenge and ways to overcome it).
Built-in Feature Stores
The second challenge is the complexity of building, managing, and consuming offline and online features. Digital giants like Uber, Netflix, and others have all built feature stores internally to overcome it. Most organizations can't afford to build a feature store from scratch, or lack the in-house skills to do so, and need it to be an integral part of the data platform they use.
We can build feature stores using ML functions connected to a shared online + offline data repository, wrapped with metadata management and automation, as shown in the following diagram.
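A minimal sketch of the idea in plain Python (all names hypothetical): ingestion writes each feature update to both an online key/value view, for low-latency serving, and an append-only offline log, for generating training sets:

```python
class FeatureStore:
    """Toy dual store: `online` serves the latest feature vector per key,
    `offline` keeps the full history for training-set generation."""
    def __init__(self):
        self.online = {}    # key -> latest feature dict (serving path)
        self.offline = []   # append-only (key, features) log (training path)

    def ingest(self, key, features):
        # One write feeds both the online and offline views
        self.online[key] = features
        self.offline.append((key, dict(features)))

    def get_online(self, key):
        # Real-time lookup at inference time
        return self.online.get(key)

    def training_set(self):
        # Batch view for offline model training
        return list(self.offline)

store = FeatureStore()
store.ingest("u1", {"clicks": 3})
store.ingest("u1", {"clicks": 5})
```

Keeping ingestion as the single write path is what keeps the online and offline views consistent, so the features a model was trained on match the features it is served at inference time.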
Summary
Organizations that plan to incorporate ML and AI into their applications must start with the end in mind and build for production by adopting MLOps, i.e., continuous integration and deployment (CI/CD) and DevOps practices for their data-science activities. This way they gain agility and can serve real-world online applications.
Without proper abstractions and automation, MLOps and DataOps can be a resource drain and lead to significant delays. These challenges will drive a rise in data-science-optimized serverless and SaaS offerings, ML function marketplaces, and managed feature stores.
It is important to bet on the right technology: one which is open and builds on Kubernetes and its vast ecosystem, rather than point solutions or cloud-specific ones.
For more info, check out the 12-minute video: