Building ML Pipelines Over Federated Data & Compute Environments
Yaron Haviv | September 15, 2020
A Forbes survey shows that data scientists spend 19% of their time collecting data sets and 60% of their time cleaning and organizing data. All told, data scientists spend around 80% of their time on preparing and managing data for analysis.
One of the greatest obstacles to bringing data science initiatives to life is the lack of robust data management tools. Such tools would enable data scientists to quickly ingest, analyze and work with an ever-growing deluge of data to develop and support complex models that offer actionable insights and responses in real time.
Contemporary MLOps tools are solving many of the problems that have plagued data scientists for decades, by enabling the entire ML lifecycle and creating and executing ML pipelines within a single environment. However, there's still the problem of how to deliver access to live data, training data and continuously up-to-date data within that same environment.
This makes it difficult, and sometimes impossible, to construct an ML pipeline from heterogeneous data sources and compute environments and build production-ready AI applications that work in hybrid environments.
What is needed is a holistic data science platform: one that simplifies data management, performs training, inference, validation and refinement of ML models, and lowers the cost of collecting, ingesting, classifying and preparing data.
Building ML pipelines that meet data scientists’ every need
Given relevant training data for the use case, data scientists can easily build, train and deploy ML models that perform well on an offline holdout dataset. The real challenge is building a seamless, integrated ML pipeline that performs consistently in production and automatically trains, tests, validates and deploys new models as data profiles evolve.
For one, data must flow efficiently from collection and ingestion through preparation and on to training and validation. This is a huge problem because in most organizations data is scattered across departments and domiciled in multiple databases, cloud infrastructures and silos.
Furthermore, there are unresolved issues around the privacy-related aspects of machine learning and large-scale data processing, where the anonymity and privacy of sensitive user data must be preserved, especially in highly regulated industries such as finance and healthcare and in proprietary use cases. Data scientists need to keep the sovereignty, sensitivity, security and privacy of the data underlying their models in mind when building ML pipelines.
As such, there must be a way to enable full traceability for all models deployed in production applications (how the models were trained, what data was used to train them, how the models arrive at predictions, etc.) to ensure compliance with data privacy and policy regulations in these industries.
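In practice, traceability of this kind comes down to recording, for every trained model, exactly which data, code and parameters produced it. Below is a minimal, platform-agnostic sketch of such a lineage record; the function and field names are illustrative assumptions, not any vendor's API.

```python
# A minimal sketch (not Iguazio's API) of capturing model lineage at training time:
# record which data, code version, and parameters produced each model artifact.
import hashlib
import json
import time


def lineage_record(model_path: str, dataset_path: str, code_version: str, params: dict) -> dict:
    """Build a traceability record linking a trained model to its inputs."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "model": model_path,
        "dataset": dataset_path,
        "dataset_sha256": dataset_hash,   # proves exactly which data was used
        "code_version": code_version,     # e.g. a git commit SHA
        "params": params,
        "trained_at": time.time(),
    }


# Example: persist the record next to the model so auditors can trace it later.
record = lineage_record("models/churn-v3.pkl", "data/train.parquet", "a1b2c3d", {"lr": 0.01})
with open("models/churn-v3.lineage.json", "w") as f:
    json.dump(record, f, indent=2)
```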
Bringing compute to data or data to compute — which method is best?
The data used to train models is all over the place. It's usually distributed across data lakes, warehouses and structured databases, and it isn't feasible to move everything into ML engineers' preferred workspace or development environment.
Most enterprises also use different tools across departments, from Azure and MS Office to Amazon S3, and as a result have data scattered across various infrastructures on-premises, at the edge and in the cloud. The result is data silos domiciled across multiple cloud providers and a variety of on-prem and edge locations.
Running a query across all these domains can be quite expensive, and it isn't practical to push on-premises and edge data to the cloud. This raises the conundrum of whether ML engineers should bring the data to the compute or bring the compute to the data.
Handling data replication, synchronization and compute activities across cloud environments
While both options are expensive, it's far more costly to push all the raw data from these disparate locations into the cloud and run analysis there than to perform the same computation and analysis where the data lives. The transfer and computation costs of the former are prohibitive, especially when running AI experiments at scale.
This gives rise to the need to replicate and synchronize data across cloud environments.
How then do we reduce the total cost of collating training and production data from heterogeneous sources for our ML workloads?
Solving some of these challenges requires data scientists to consolidate the online/offline feature processing and feature serving infrastructure, and then go beyond that to create a federated feature store that handles the data stored across the various locations.
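To make the consolidation idea concrete, here is a minimal sketch of putting online and offline feature access behind one interface. The class and method names below are illustrative assumptions, not Iguazio's actual feature store API.

```python
# Hypothetical sketch: one feature interface, two storage paths (offline for training,
# online for low-latency serving), so the same feature definitions serve both.
from typing import Dict, List


class FederatedFeatureStore:
    """Routes feature reads to the right backend: an offline store for training
    (e.g. Parquet in a data lake) and an online store for real-time serving
    (e.g. a key/value store at each location)."""

    def __init__(self, offline_backend, online_backend):
        self.offline = offline_backend
        self.online = online_backend

    def training_features(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        # Batch path: used when building training sets.
        return self.offline.get_features(entity_id, names)

    def serving_features(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        # Real-time path: used by models at inference time.
        return self.online.get_features(entity_id, names)


class InMemoryBackend:
    """Stand-in backend for illustration; real deployments would plug in a data lake
    reader or an online key/value store per location."""

    def __init__(self, table: Dict[str, Dict[str, float]]):
        self.table = table

    def get_features(self, entity_id, names):
        row = self.table.get(entity_id, {})
        return {n: row.get(n) for n in names}


# Same feature names, two storage paths: one definition serves training and inference.
store = FederatedFeatureStore(
    offline_backend=InMemoryBackend({"user-42": {"spend_30d": 120.5, "visits_7d": 3}}),
    online_backend=InMemoryBackend({"user-42": {"spend_30d": 118.0, "visits_7d": 4}}),
)
print(store.training_features("user-42", ["spend_30d", "visits_7d"]))
```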
In production, this may actually require the replication of the code across multiple data locations. For instance, running the same real-time processing pipeline for multiple edge locations will require a clone of the same computation stack and the same data stack across all locations. Mobilizing, synchronizing and aggregating data and metadata across these locations requires access to an infrastructure that can seamlessly move data and computation across environments.
The solution must consist of the following elements:
- Portable (mobile) serverless functions with pluggable data interfaces, so that exactly the same code runs in many different environments with minimal deployment and management overhead (sketched below).
- Data synchronization and movement capabilities, applied where they make sense.
- A universal cluster orchestration framework such as Kubernetes, which can run on any cloud and mediate between application requirements and cloud-specific APIs and services.
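The first element boils down to the "same code, pluggable data interface" pattern: the processing logic never changes, and per-location configuration decides where the data comes from. The function and environment variable names in this sketch are assumptions for illustration, not a specific Iguazio or Nuclio API.

```python
# Sketch of a portable function body: identical on-prem, at the edge, and in the cloud.
# Only the injected configuration (here an environment variable) differs per location.
import os

import pandas as pd


def read_events(source_uri: str) -> pd.DataFrame:
    """Pluggable data interface: the same call works against local files, NFS mounts,
    or object storage, depending on the URI the environment injects."""
    # pandas resolves plain file paths, and (with the right extras installed) s3:// URIs.
    return pd.read_csv(source_uri)


def handler() -> pd.DataFrame:
    """The portable processing step; deploy it unchanged to every environment."""
    source = os.environ.get("DATA_SOURCE_URI", "/data/events.csv")
    events = read_events(source)
    # Example transformation: aggregate spend per user; the logic never changes per site.
    return events.groupby("user_id", as_index=False)["amount"].sum()
```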
The solution: streamline the entire process of collecting and preparing data and of training and deploying models, and clone this stack into all the locations where data needs to be processed. Then run a federated orchestration and data synchronization process to unify data across the different locations.
Such a stack will efficiently facilitate cross-site collaboration and enable data engineers to take advantage of high-performance training infrastructure, whether in on-premises data centers or in the cloud. By its very nature, Kubernetes is federated and can be leveraged as the baseline architecture for data engineering, machine learning, MLOps and data management activities.
Pairing NetApp’s cloud data services with Iguazio’s automated ML pipeline platform
To enable ML engineers to build ML pipelines (from heterogeneous data sources and compute environments), NetApp partnered with Iguazio to integrate its data fabric into Iguazio's Data Science Platform. This solution also enables data scientists to programmatically handle all aspects of data replication and synchronization across cloud environments.
By filling out a few parameters and running a simple command, ML engineers can access a web-based Jupyter workspace within which they can leverage a simple workflow and API (Python function) to automate data movement along with online and offline data processing at scale.
Programmatically perform data management tasks across cloud environments
Together with Iguazio, NetApp developed a Kubeflow Pipeline operation that can run within ML pipelines to trigger the replication of data across platforms via NetApp’s data synchronization tools. By pairing NetApp’s innovative data fabrics with Iguazio’s robust MLOps control framework, ML engineers can seamlessly move workloads across environments when needed in a programmatic and repeatable manner to perform critical data management tasks.
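As a rough illustration of how such a step fits into a pipeline, the sketch below uses the standard Kubeflow Pipelines SDK to run a replication step before training. The container images and their arguments are hypothetical placeholders, not the actual NetApp/Iguazio operation.

```python
# A hedged sketch of a Kubeflow pipeline that replicates data before training on it.
import kfp
from kfp import dsl


@dsl.pipeline(
    name="train-with-data-sync",
    description="Replicate source data, then train on the synced copy.",
)
def train_pipeline(src_path: str, dst_path: str):
    # Step 1: trigger replication of the dataset to where training will run.
    sync = dsl.ContainerOp(
        name="replicate-data",
        image="example.com/data-sync:latest",      # hypothetical image
        command=["python", "sync.py"],              # hypothetical entrypoint
        arguments=["--source", src_path, "--target", dst_path],
    )

    # Step 2: train only after the data has landed at the target location.
    train = dsl.ContainerOp(
        name="train-model",
        image="example.com/trainer:latest",         # hypothetical image
        command=["python", "train.py"],
        arguments=["--data", dst_path],
    )
    train.after(sync)


if __name__ == "__main__":
    # Compile to a pipeline package that can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(train_pipeline, "train_with_data_sync.yaml")
```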
This capability is powered by NetApp's Kubernetes storage driver, Trident, which enables data scientists to natively present data to Kubernetes so it can be consumed by any process or service running within Iguazio. Trident acts as an interface between NetApp's data fabric (NFS [Network File System] and Cloud Volumes Storage) and whatever Jupyter instance, service or process is being orchestrated on top of Kubernetes.
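Because Trident fulfils standard Kubernetes storage requests, a data scientist or an automated pipeline step can, for example, request a NetApp-backed volume with the regular Kubernetes Python client. The storage class name and namespace below are assumptions that depend on how Trident is configured in a given cluster.

```python
# Sketch: request a Trident-provisioned volume so notebooks and jobs can mount the data.
from kubernetes import client, config


def request_trident_volume(name: str, size: str = "100Gi",
                           namespace: str = "default",
                           storage_class: str = "ontap-nas"):  # assumed class name
    """Create a PersistentVolumeClaim that Trident fulfils from NetApp storage."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],  # NFS-style shared access across pods
            storage_class_name=storage_class,
            resources=client.V1ResourceRequirements(requests={"storage": size}),
        ),
    )
    return client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace=namespace, body=pvc
    )


# Example: any Jupyter pod or pipeline step that mounts this claim sees the same data.
# request_trident_volume("shared-training-data")
```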
With this, ML engineers can programmatically drive data management tasks from within Iguazio itself while enjoying the enterprise-class storage, high performance and data protection capabilities of NetApp’s hybrid cloud data services.
Watch this webinar with NetApp and Iguazio to learn more about Building ML Pipelines over Federated Data & Compute Environments.