Real-Time Streaming for Data Science
Adi Hirschtein | April 8, 2022
This tutorial demonstrates how to make streaming data available in a data science environment, so you can work with real-time, fresh datasets:
First, we collect data from an existing Kafka stream into an Iguazio time series table. Next, we visualize the stream with a Grafana dashboard. Finally, we access the data in a Jupyter notebook using Python code.
We use a Nuclio serverless function to “listen” to a Kafka stream and ingest its events into our time series table. Iguazio provides a ready-made Kafka-to-time-series template to get you started.
We visualize the data with Grafana and work with the time series data using Python code in Jupyter. Data scientists can easily access both historical and real-time data in a full Python environment for exploration and training with Iguazio.
If you'd prefer to read along, here's a transcript of how we did it:
Here’s how to make streaming data available for data scientists in just a few minutes.
We're going to collect data into an Iguazio time series table from an existing Kafka stream, then create a quick Grafana dashboard and access the data using Python code from a Jupyter notebook. The idea is to give data scientists real-time, fresh data in their data science environment, where they can immediately start working with streaming datasets.
Create the Time Series Table
- Go to the services view and then to the shell service. I'm using the TSDB CLI.
- Specify the container name and table name with the full path and then the sample rate.
- Now if I go back to the data services view, under the users container I can see my new table, so it's ready for data ingestion.
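As a sketch, the CLI invocation for the two steps above looks something like the following. The container, table name, and flag names are placeholders modeled on the open-source v3io-tsdb CLI (`tsdbctl`) and may differ by platform version; confirm with `tsdbctl create --help` on your system.

```shell
# Create a time series table (hypothetical names; confirm flags with --help):
#   -c  container name
#   -t  table name (full path within the container)
#   -r  expected ingestion (sample) rate, e.g. one sample per second
tsdbctl create -c users -t mytsdb -r 1/s
```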
Creating Services
- Go to the services view again and create two services. The first is TSDB functions: this service creates Nuclio functions to ingest data into our time series table. The second is Grafana, for visualization.
- Click on new service, select the TSDB functions (no need to change anything in the resources setting).
- Select the container name and enter the full path of the TSDB table that we created earlier.
- Click on create service. The other service is Grafana.
- Enter the user that Grafana will use to access the data, click on ‘create service’, and then click on ‘apply changes’.
Now that the TSDB function service is ready, go to the functions view. Under projects there is a new TSDB functions project containing two functions created by the TSDB function service: one is a query function, and the other is an ingest function. Copy the name of the ingest function, because we're going to use it later.
Ingesting Events to the TSDB Table
- Create a new project, and call it streaming.
- Create a new Nuclio function that listens to a Kafka stream and writes it to our TSDB table, and name it kafka-tsdb.
- You can choose an existing template by searching for Kafka templates. There are two: one is kafka-to-tsdb and the other is kafka-to-kv. Click on kafka-to-tsdb, enter the Kafka URL including the port, enter the topic name, and paste the name of the ingest function that you copied earlier. There’s also an option to choose between the latest and earliest offset for the Kafka stream; keep the latest.
- Click apply, and now deploy.
This will create a function that listens to a Kafka stream and ingests events into our TSDB table.
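Under the hood, a Nuclio function of this kind is a handler that is invoked once per Kafka record. The following is a simplified sketch, not the actual kafka-to-tsdb template code: the payload fields (`metric`, `labels`, `timestamp`, `value`) are assumptions about the message format, and the forwarding to the ingest function is only described in a comment.

```python
import json

def handler(context, event):
    """Nuclio handler, triggered per Kafka record (simplified sketch).

    The real kafka-to-tsdb template would forward the parsed sample to
    the TSDB ingest function; here we just parse and return it.
    """
    record = json.loads(event.body)          # Kafka message value as JSON (assumed format)
    sample = {
        "metric": record["metric"],          # metric name, e.g. "cpu"
        "labels": record.get("labels", {}),  # optional label set
        "timestamp": record["timestamp"],    # epoch milliseconds
        "value": float(record["value"]),     # the sample value
    }
    # In the real template, this sample would be sent to the ingest
    # function whose name we copied earlier.
    return sample
```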
Create a Chart and View Events in Real-Time
The Platform also has a Grafana service.
- To create the dashboard, click on the plus button and select the chart type. Select Iguazio as the data source. In this example there are two backends, time series and table; we'll use time series.
- Add the query.
- Specify the backend type, container name, fields and metrics. Once that’s done, you can immediately see that the data is coming in.
In this case, we are looking at a specific metric and we can also see its labels. You can create more charts and slice and dice reports using all the Grafana features.
Working with TSDB Data from Jupyter Using Python Code
- In the services screen in the main dashboard, open up Jupyter.
- In this example, I'm using a library called frames. I just set up the frames client and run this simple code that fetches the data from our TSDB table. Once it's read into a DataFrame, you can use pandas and work with the data using your preferred tools.
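The frames read itself needs a live cluster, so it is shown only as a commented sketch below (the address, container, and table name are placeholders). To illustrate the pandas side, the runnable part substitutes a synthetic DataFrame shaped like a TSDB query result: one metric sampled every second.

```python
import numpy as np
import pandas as pd

# On the platform, the read would look roughly like this (sketch only;
# address, container, and table are placeholders):
#   import v3io_frames as v3f
#   client = v3f.Client("framesd:8081", container="users")
#   df = client.read(backend="tsdb", table="mytsdb", start="now-1h")

# Synthetic stand-in for the TSDB result: 2 minutes of per-second samples.
idx = pd.date_range("2022-04-08 12:00:00", periods=120, freq="s")
df = pd.DataFrame({"cpu": np.random.default_rng(0).random(120)}, index=idx)

# Typical exploration once the data is a DataFrame:
per_minute = df["cpu"].resample("1min").mean()   # downsample to 1-minute averages
rolling = df["cpu"].rolling(30).mean()           # 30-sample moving average

print(per_minute)
```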
- Iguazio has a full-fledged data science environment with all the popular Python libraries, so you can explore the data or use it for training, and you can work with real-time datasets as well.