What Are Feature Stores and Why Are They Critical for Scaling Data Science?
Adi Hirschtein | February 10, 2022
The field of MLOps has grown up around a simple reality: while machine learning is, in theory, remarkably good at making accurate predictions and solving complex problems, actually operationalizing it is still a major blocker for most companies. Most of the complexity arises from the data: work is typically done in silos, the path to production is resource-intensive, production-ready features that are consistent with those used in the data science research phase are hard to access at scale, and model and feature monitoring is disjointed or nonexistent once the AI service is live.
ML teams need a way to continuously deploy AI applications in a way that creates real, ongoing business value for the organization.
Features are the fuel driving AI for the organization, and feature stores are the architectural answer that can simplify processes, increase model accuracy and accelerate the path to production.
A feature store provides a single pane of glass for sharing all available features across the organization. When data scientists start a new project, they can go to this catalog and easily find the features they are looking for. But a feature store is not only a data layer; it is also a data transformation service that lets users manipulate raw data and store the results as features ready to be used by any machine learning model. These features can then accelerate machine learning use cases by reducing duplicate work.
Some of the largest tech companies that deal extensively with AI have built their own feature stores (Uber, Twitter, Google, Netflix, Facebook, Airbnb, etc.), and the commercial landscape for feature stores has exploded in the last few years. This is a good indication to the rest of the industry of how important it is to use a feature store as a part of an efficient ML pipeline.
Calculating and cataloging features for a feature store
Offline features can be created and calculated over an extended period of time. Calculating online features, however, is much more challenging: it requires fast computation as well as fast access to the data, which can be stored in memory or in a very fast key-value database. The process itself can be performed on various services in the cloud or on a platform such as the Iguazio MLOps Platform that has all of these components as a part of its core offering.
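To make the online case concrete, here is a minimal sketch of a feature that is updated incrementally as events arrive and then read back with a single lookup. It is an illustration only: the class name, the chosen feature (event count and sum over a time window), and the window length are assumptions, and a plain Python dict stands in for the fast key-value database.

```python
from collections import defaultdict, deque


class SlidingWindowFeature:
    """Keeps a per-key count and sum of event values over a time window.

    Hypothetical example: a dict stands in for an in-memory key-value store.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> deque of (timestamp, value)
        self.store = {}  # key -> latest feature values (the "online store")

    def update(self, key, timestamp, value):
        """Ingest one event and refresh the stored features for its key."""
        events = self._events[key]
        events.append((timestamp, value))
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0][0] <= cutoff:
            events.popleft()
        # Recomputing the sum is fine for a sketch; a real system would
        # typically maintain it incrementally.
        self.store[key] = {
            "count": len(events),
            "sum": sum(v for _, v in events),
        }

    def get(self, key):
        """Serving-time lookup: a single dictionary read."""
        return self.store.get(key, {"count": 0, "sum": 0.0})


feature = SlidingWindowFeature(window_seconds=60)
feature.update("user_1", timestamp=0, value=10.0)
feature.update("user_1", timestamp=30, value=5.0)
feature.update("user_1", timestamp=70, value=2.0)  # the event at t=0 expires
print(feature.get("user_1"))
```

The point of the design is that the expensive work happens at write time, so the read at prediction time stays cheap.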
But first — let’s talk about access. Easy access.
Offline features are built mostly on frameworks such as Spark or SQL, and then stored in a database or as parquet files. Online features, on the other hand, may require data access through APIs for streaming engines such as Kafka or Kinesis, or for low-latency key-value databases such as Redis or Cassandra.
Working with a feature store abstracts away the complex data access layer, so a data scientist looking for a feature can retrieve the data through a simple API instead of writing data engineering code.
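As an illustration of that abstraction, the sketch below hides two different storage layers behind one small client. Everything in it is hypothetical: FeatureStoreClient, get_offline_features, and get_online_features are made-up names rather than the API of any real product, and Python lists and dicts stand in for parquet files and a key-value database.

```python
class FeatureStoreClient:
    """Hypothetical client hiding offline and online storage behind one API."""

    def __init__(self):
        # Stand-in for historical data in a database or parquet files.
        self._offline_rows = []
        # Stand-in for a key-value database holding the latest values.
        self._online_kv = {}

    def write(self, entity_id, features):
        """Store one feature row in both the offline and the online layer."""
        self._offline_rows.append({"entity_id": entity_id, **features})
        self._online_kv[entity_id] = dict(features)

    def get_offline_features(self, feature_names):
        """Return the full history for training, limited to chosen features."""
        return [
            {"entity_id": row["entity_id"], **{n: row[n] for n in feature_names}}
            for row in self._offline_rows
        ]

    def get_online_features(self, entity_id, feature_names):
        """Return the latest values for one entity, for real-time serving."""
        latest = self._online_kv[entity_id]
        return {n: latest[n] for n in feature_names}


client = FeatureStoreClient()
client.write("user_1", {"avg_purchase": 20.0, "num_orders": 3})
client.write("user_1", {"avg_purchase": 25.0, "num_orders": 4})
client.write("user_2", {"avg_purchase": 8.0, "num_orders": 1})

# Training code asks for history; serving code asks for the latest values.
training_rows = client.get_offline_features(["avg_purchase"])
serving_vector = client.get_online_features("user_1", ["avg_purchase", "num_orders"])
```

The caller never needs to know which backend answered the request, only which features it wants.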
The Benefits of Having a Feature Store:
Faster development
The feature store concept is built to abstract the data engineering layers that consume so much of data scientists’ time, and to give them easy access for reading and writing the best features for their models.
Smooth model deployment in production
The feature store keeps the feature set consistent between the training and serving layers, which makes deployment smoother and ensures that the trained model reflects the way things would work in production.
Increased model accuracy
The feature store catalogs additional metadata for each feature. This can help data scientists tremendously when selecting features for a new model, because it lets them focus on the features that have had the greatest impact on similar existing models.
Better collaboration
Feature stores enable everyone in the company to share their work and avoid duplication.
The ability to track lineage and address regulatory compliance
In a feature store, we can save the data lineage of a feature. This tracking information captures how the feature was generated and provides both the insight and the reports needed for regulatory compliance.
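The metadata and lineage benefits listed above can be pictured with a small catalog sketch. It is a simplified illustration under stated assumptions: FeatureRecord and FeatureCatalog are hypothetical names, the metadata fields are one plausible choice, and the sample features, owners, and models are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureRecord:
    """Hypothetical catalog entry for one feature."""
    name: str
    owner: str
    description: str
    source: str          # where the raw data came from
    transformation: str  # how the feature was derived
    used_by_models: list = field(default_factory=list)


class FeatureCatalog:
    """Hypothetical in-memory catalog of feature metadata."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.name] = record

    def search(self, keyword):
        """Find features whose name or description mentions the keyword."""
        keyword = keyword.lower()
        return sorted(
            r.name for r in self._records.values()
            if keyword in r.name.lower() or keyword in r.description.lower()
        )

    def lineage(self, name):
        """Report how a feature was produced and which models consume it."""
        r = self._records[name]
        return {
            "feature": r.name,
            "source": r.source,
            "transformation": r.transformation,
            "used_by_models": list(r.used_by_models),
        }


catalog = FeatureCatalog()
catalog.register(FeatureRecord(
    name="avg_purchase_30d",
    owner="payments-team",
    description="Average purchase amount over the last 30 days",
    source="transactions table",
    transformation="mean(amount) per user over a 30-day window",
    used_by_models=["fraud_model"],
))
catalog.register(FeatureRecord(
    name="days_since_signup",
    owner="growth-team",
    description="Days elapsed since the user created an account",
    source="users table",
    transformation="today - signup_date",
))

print(catalog.search("purchase"))
print(catalog.lineage("avg_purchase_30d"))
```

A lineage report like this one is the kind of record an auditor could use to trace a model input back to its raw source.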
Feature stores and MLOps
Feature stores enable data scientists to reuse features instead of rebuilding them for every new model, saving valuable time and effort. Feature stores also automate this feature engineering process, which is an important part of the MLOps concept.
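One way to picture this kind of reuse is a shared registry of feature definitions that every model and pipeline draws from. The sketch below is a hedged illustration, not a description of any specific feature store: the function names, the two sample features, and the convention that day_of_week runs from 0 (Monday) to 6 (Sunday) are all assumptions.

```python
import math


def log_amount(raw):
    """Single feature definition: log-scaled transaction amount."""
    return math.log1p(raw["amount"])


def is_weekend(raw):
    """Single feature definition: 1 on Saturday (5) or Sunday (6), else 0."""
    return 1 if raw["day_of_week"] in (5, 6) else 0


# Registry of feature definitions shared by every model and pipeline.
FEATURES = {"log_amount": log_amount, "is_weekend": is_weekend}


def build_vector(raw, feature_names):
    """Apply the registered definitions to one raw record."""
    return [FEATURES[name](raw) for name in feature_names]


def build_training_set(raw_records, feature_names):
    """Batch path: used offline to build training data."""
    return [build_vector(r, feature_names) for r in raw_records]


def build_serving_vector(raw_record, feature_names):
    """Online path: used at prediction time for a single request."""
    return build_vector(raw_record, feature_names)


history = [
    {"amount": 0.0, "day_of_week": 5},
    {"amount": 99.0, "day_of_week": 2},
]

# Two hypothetical models reuse the same definitions without rewriting them.
fraud_model_features = ["log_amount", "is_weekend"]
churn_model_features = ["is_weekend"]

training = build_training_set(history, fraud_model_features)
live = build_serving_vector(history[1], fraud_model_features)
churn_training = build_training_set(history, churn_model_features)
```

Because the batch and online paths call the same definitions, the values a model sees in production match the ones it was trained on.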
Given the growing number of AI projects and the complexity associated with bringing ML projects to production, the industry needs a way to standardize and automate the core of feature engineering. The full article, with more information about feature stores, is available on Towards Data Science.