Feature Store
What is a feature set?
A feature set is a set of features that are logically grouped together. Feature sets take data from offline or online sources, build a list of features through a set of transformations, and store the resulting features along with the associated metadata and statistics. A feature set can be viewed as a database table with multiple material implementations for batch and real-time access, along with the data pipeline definitions used to produce the features.
What is a feature vector?
A feature vector is a group of features from different feature sets. Feature vectors are used as input for models: you define the feature vector once, and then create and track the datasets generated from it, or use the online manifestation of the vector for real-time prediction.
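As a rough illustration of the idea, the sketch below resolves a feature vector against two feature sets held in memory. It is not the feature store API: the `FEATURE_SETS` data, the `"<feature_set>.<feature>"` reference notation, and the `resolve_vector` helper are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-ins for two feature sets, keyed by entity ID.
FEATURE_SETS: Dict[str, Dict[str, Dict[str, Any]]] = {
    "transactions": {
        "u1": {"amount_avg_1h": 42.5, "count_1h": 3},
        "u2": {"amount_avg_1h": 10.0, "count_1h": 1},
    },
    "profile": {
        "u1": {"age": 31, "country": "DE"},
        "u2": {"age": 54, "country": "US"},
    },
}

# The vector is defined once, as a list of "<feature_set>.<feature>" references
# (an assumed notation for this sketch).
FRAUD_VECTOR: List[str] = [
    "transactions.amount_avg_1h",
    "transactions.count_1h",
    "profile.age",
]


def resolve_vector(vector: List[str], entity_id: str) -> Dict[str, Any]:
    """Look up each referenced feature for one entity and return a flat row."""
    row: Dict[str, Any] = {}
    for ref in vector:
        set_name, feature_name = ref.split(".", 1)
        row[feature_name] = FEATURE_SETS[set_name][entity_id][feature_name]
    return row


if __name__ == "__main__":
    print(resolve_vector(FRAUD_VECTOR, "u1"))
```

The same vector definition can be resolved for every entity to build a training dataset, or for a single entity at request time.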
What engines can your feature store support?
The feature store supports the following engines:
- Storey: the default engine. It is a stream processing engine optimized for complex, online processing.
- Spark: good for batch transformation at large scale.
- Pandas: good for batch transformation at small scale.
What type of transformations are supported?
The feature store comes with a built-in library of transformation steps that can be used with the Storey engine. It includes common transformations such as sliding window aggregations, one-hot encoding, map, filter, and more. In addition, users can create their own custom transformations for any engine.
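To make two of these transformations concrete, here is a minimal pure-Python sketch of a sliding window sum and a one-hot encoder. It only illustrates the techniques and is not the built-in library: the function names, the `(timestamp, value)` event shape, and the `is_<category>` output naming are assumptions.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple


def sliding_window_sum(
    events: Iterable[Tuple[float, float]], window_seconds: float
) -> List[float]:
    """For each (timestamp, value) event, given in time order, return the sum
    of values with timestamps in the trailing window (ts - window, ts]."""
    window: Deque[Tuple[float, float]] = deque()
    running = 0.0
    sums: List[float] = []
    for ts, value in events:
        window.append((ts, value))
        running += value
        # Evict events that have fallen out of the window for the current event.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            running -= old_value
        sums.append(running)
    return sums


def one_hot(value: str, categories: List[str]) -> Dict[str, int]:
    """Encode a categorical value as one 0/1 column per known category."""
    return {f"is_{category}": int(value == category) for category in categories}


if __name__ == "__main__":
    events = [(0, 1.0), (30, 2.0), (60, 4.0), (100, 8.0)]
    print(sliding_window_sum(events, window_seconds=60))
    print(one_hot("card", ["cash", "card"]))
```

Keeping a running total and evicting expired events avoids rescanning the whole window on every event, which is the property that matters for online processing.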
Does your feature store allow for the ingestion of real-time data?
Yes. You can specify a real-time ingestion endpoint such as HTTP, Kafka, Kinesis, etc. Data sent to these endpoints, for example IoT data, is ingested into the feature store.
Where in the ML workflow would I use a feature store?
Traditionally, in the data ingestion and feature retrieval phase. However, Iguazio’s feature store is designed to be used end-to-end including ingestion, transformation, training, serving, and monitoring.
How is a feature store different from other types of storage?
There are many ways that features for MLOps are built and consumed. Most need specialized toolchains that go beyond what other types of storage can provide. In almost all cases, features are queried in real time, which requires a highly specialized low-latency storage layer. In addition, training data may require point-in-time joins, a query pattern that is largely specific to ML.
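A point-in-time join attaches to each training label the feature value that was known at the label's timestamp, and never a later one. The sketch below shows the idea for a single entity using only the standard library. The function names and the `(timestamp, value)` layout are assumptions for illustration, not the feature store's implementation.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_lookup(
    history: List[Tuple[int, float]], as_of: int
) -> Optional[float]:
    """Return the most recent feature value with timestamp <= as_of.

    `history` must be sorted by timestamp. Returns None if no value
    existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(
    labels: List[Tuple[int, int]], history: List[Tuple[int, float]]
) -> List[Tuple[int, Optional[float], int]]:
    """Join each (timestamp, label) with the feature value known at that time."""
    return [(ts, point_in_time_lookup(history, ts), label) for ts, label in labels]


if __name__ == "__main__":
    feature_history = [(10, 1.0), (20, 2.0), (30, 3.0)]
    labels = [(5, 0), (20, 1), (29, 0)]
    for row in build_training_rows(labels, feature_history):
        print(row)
```

Joining on the latest value instead would leak future information into the training set, which is why this pattern needs dedicated support.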
When do I need a feature store?
Typically, feature stores are used when rolling out real-time predictions and pipelines.
Who does a feature store benefit?
Feature stores are especially beneficial to:
- Data scientists: share their work and leverage the features others have already built
- Data engineers: create tools to build and support data pipelines
- ML engineers: easily put models into production
Why use a feature store?
Feature stores are best used when building and maintaining your data pipelines. This helps to optimize engineering time so that models reach production faster. It also avoids duplicated effort, because the same features do not have to be built multiple times.
What storage are you using to store the features?
There are two types of feature store targets:
- Offline feature set: Stores data in an object store as Parquet files. Supported stores are S3, Azure storage, Google storage, and the Iguazio v3io object store. The offline feature set is mainly used for training and analytics.
- Online feature set: Stores the data as a key-value table in the Iguazio key-value database. (currently on our feature roadmap). This is used for real-time serving.
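The split between the two targets can be sketched as a writer that sends every ingested row to both places. This is a conceptual model only: the `DualTargetWriter` class is hypothetical, the offline target is modeled as an append-only list rather than Parquet files, and the assumption that the online table keeps only the latest row per key is made for this example.

```python
from typing import Any, Dict, List


class DualTargetWriter:
    """Hypothetical writer that fans each ingested row out to two targets."""

    def __init__(self) -> None:
        # Offline target: full history, suited to training and analytics.
        self.offline: List[Dict[str, Any]] = []
        # Online target: one row per entity key, suited to low-latency lookups.
        self.online: Dict[str, Dict[str, Any]] = {}

    def write(self, key: str, features: Dict[str, Any]) -> None:
        self.offline.append({"key": key, **features})
        # Assumption: the online table is overwritten with the newest values.
        self.online[key] = dict(features)


if __name__ == "__main__":
    writer = DualTargetWriter()
    writer.write("u1", {"count_1h": 1})
    writer.write("u1", {"count_1h": 2})
    print(len(writer.offline), writer.online["u1"])
```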
What data types are supported?
A number of data types are supported. These include:
- Boolean
- Category
- Float
- Int
- String
- Timestamp
- Array
Do you support backfilling?
Backfilling historical data is supported. However, updating historical data in a feature set requires manual operations. For assistance, contact Iguazio’s support.
What kind of data sources are supported out of the box?
Files (e.g. CSV, Parquet) stored in object stores (S3, Azure, and GCP), DataFrame, Google BigQuery, Snowflake, and Kafka. Additional databases will be added in future releases. For more information, see MLRun Sources and Targets.
Does the Feature Store integrate with AWS, Azure, or Google?
Yes. The Iguazio Feature Store can be deployed on AWS EKS, Azure AKS, and Google GKE because it leverages a Kubernetes environment for running computations. S3, Azure storage, or Google storage can be used as an object store for offline features. There is also the option to run training processes in SageMaker or Azure ML, which can be integrated with Iguazio's Feature Store.
Can the feature store trigger ETL jobs?
Yes. Part of the feature set definition is a transformation graph that can serve as an ETL process for that feature set.
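One way to picture a transformation graph acting as an ETL process is as an ordered chain of steps applied to each incoming event. The sketch below is a simplified linear model of that idea. The helpers `map_step`, `filter_step`, and `run_graph` are hypothetical names for this example and are not the feature store's API.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Event = Dict[str, Any]
# A step returns the transformed event, or None to drop it.
Step = Callable[[Event], Optional[Event]]


def map_step(fn: Callable[[Event], Event]) -> Step:
    """Wrap a function that transforms every event."""
    return lambda event: fn(event)


def filter_step(predicate: Callable[[Event], bool]) -> Step:
    """Wrap a predicate: events that fail it are dropped."""
    return lambda event: event if predicate(event) else None


def run_graph(steps: List[Step], events: Iterable[Event]) -> Iterator[Event]:
    """Pass each event through the steps in order, skipping dropped events."""
    for event in events:
        current: Optional[Event] = dict(event)
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is not None:
            yield current


if __name__ == "__main__":
    graph = [
        filter_step(lambda e: e["amount"] > 0),
        map_step(lambda e: {**e, "amount_cents": e["amount"] * 100}),
    ]
    print(list(run_graph(graph, [{"amount": 5}, {"amount": -1}])))
```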
Can you define feature transformations with SQL / PySpark?
Yes, when using Spark as the engine, you can use PySpark for feature engineering.
What are the latency constraints for the online feature store?
Iguazio's online Feature Store relies on our fast key-value database, which can sustain tens of thousands of reads per second with a latency of less than 1 ms.
Is it possible to have multiple instances of the feature store (Dev, Test, Prod)?
Yes, it is possible to have multiple instances of the feature store.
Does the feature store have a feature set search function?
Yes, you can search features using the feature store dashboard.
Does the feature store support versioning?
Yes, versioning is supported by the feature store.
Can you trace data lineage?
Yes, lineage can be traced in the feature store.
Does the feature store have user access control?
Yes. Features are part of a project, and an admin can assign members different roles per project. In addition, the Iguazio platform supports role-based access control (RBAC), so users can create data access policies.
Is data encrypted at rest?
Yes, via standard Linux block-device encryption.
Is data encrypted in transit?
Yes, data is encrypted in transit via HTTPS. For more information, see HTTP Secure Data Transmission.
Where does the feature store keep the data?
The data of the feature store is stored in the customer account (the customer's VPC).