#MLOPSLIVE WEBINAR SERIES
Session #11
Handling Large Datasets in Data Preparation & ML Training Using MLOps
In this technical training session, we’ll explore how to use Dask, Kubernetes, and MLRun to scale data preparation and model training for maximum performance.
Dask is an open-source library for parallel computing written in Python. It can be used in conjunction with MLRun, an open-source MLOps orchestration tool, over Kubernetes to handle large-scale datasets.
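As a quick illustration (not taken from the session itself), the sketch below shows how Dask mirrors the familiar pandas API while distributing work across partitions; the dataset path and column names are placeholders.

```python
# Minimal Dask sketch: pandas-style code that runs in parallel.
# The dataset path and column names are illustrative placeholders.
import dask.dataframe as dd
from dask.distributed import Client

# With no address given, Client() starts a local cluster; in production this
# would point at a distributed scheduler (e.g. one running on Kubernetes).
client = Client()

# Lazily read a partitioned dataset that may be larger than memory.
df = dd.read_parquet("s3://example-bucket/events/*.parquet")

# Familiar pandas-style transformations, executed per partition in parallel.
daily_totals = df.groupby("event_date")["amount"].sum()

# compute() triggers the distributed execution and returns a pandas object.
print(daily_totals.compute())
```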
In this session, we will demonstrate how to use these tools to scale your data preparation and ML training with ease.
Watch this session to explore:
- An overview of the tools available for large-scale data processing in Python (PySpark, Dask, Vaex, and more), and how they are used with existing ML frameworks
- Understanding Dask and how to run the same native Python code at scale, without needing to learn other technologies such as Spark
- How to run Dask in a distributed and elastic way over Kubernetes to improve resource utilization (see the sketch after this list)
- How to deploy Dask-based data engineering and ML pipelines with MLRun and Kubeflow, in one click
- Further optimizations for handling large-scale data effectively and efficiently
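To ground the Kubernetes point above, here is a minimal sketch of spinning up an elastic Dask cluster with MLRun's Dask runtime. The function name, image, and resource values are illustrative assumptions, and attribute names may vary slightly between MLRun versions.

```python
# Sketch: an elastic Dask cluster on Kubernetes via MLRun's Dask runtime
# (kind="dask"). Names, image, and resource values are illustrative only;
# check the MLRun docs for the exact options in your version.
import mlrun

dask_cluster = mlrun.new_function("demo-dask", kind="dask", image="mlrun/ml-base")

# Run the scheduler and workers as pods in the cluster rather than locally.
dask_cluster.spec.remote = True

# Elastic scaling: the cluster can adapt between these worker counts.
dask_cluster.spec.min_replicas = 1
dask_cluster.spec.max_replicas = 8

# Per-worker resource requests so Kubernetes can schedule efficiently.
dask_cluster.with_requests(mem="2G", cpu="2")

# Accessing .client deploys the cluster (if needed) and returns a
# dask.distributed Client connected to it.
client = dask_cluster.client
```

From there, the same pandas-style Dask code shown earlier can run against `client`, and the function can be wired into an MLRun or Kubeflow pipeline as a single step.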