Webinar

MLOps Live #34 - Agentic AI Frameworks: Bridging Foundation Models and Business Impact - January 28th

How to Implement Gen AI in Highly Regulated Environments: Financial Services and Telecommunications and More

Asaf Somekh  and  Yaron Haviv | August 9, 2024

If 2023 was the year of gen experimentation, 2024 is the year of gen AI implementation. As companies embark on their implementation journey, they need to deal with a host of challenges, like performance, GPU efficiency and LLM risks. These challenges are exacerbated in highly-regulated industries, such as financial services and telecommunication, adding further implementation complexities.

Below, we discuss these challenges and present some best practices and solutions to take into consideration. We also show a banking chatbot demo that includes fine-tuning a model and adding guardrails.

This blog post is based on the webinar “Implementing Gen AI in Highly Regulated Environments” with Asaf Somekh, co-founder and CEO, and Yaron Haviv, co-founder and CTO, of Iguazio (acquired by McKinsey). You can watch the entire session here.

Challenges Implementing and Scaling Gen AI

Implementing and scaling gen AI comes with a host of challenges:

  • Performance vs. cost - The massive scale of LLMs make these models relatively inefficient and compute-heavy. Costs can especially spike in highly-regulated industries, where the models need to be run within cloud accounts. In addition, data science teams need flexibility to switch between different LLMs to fit their use case, further enhancing costs. The data science team needs to find a balance between performance accuracy and cost efficiency. At a business level, the organization needs to ensure the cost associated with the development and deployment is lower than the value the application will bring to the business.
  • GPU efficiency - GPUs are scarce and costly. The team needs to make the most out of their existing Al resources with strategies that would enable GPU sharing, running multiple jobs on a single GPU and automatically scaling up as the workload demands.
  • Model risk approval processes - LLMs introduce risks like data privacy, compliance, toxicity and hallucinations. Existing model risk approval processes do not fit the high pace of innovation relating to Gen Al. A new process is needed to provide the right guardrails in handling gen Al within the organization.
  • Strict regulatory and security requirements - Regulatory requirements mandate data locality and the usage of on-premises infrastructure. Organizations need to build architectures and processes that adhere to them.
  • A future-proof architecture - Gen AI is evolving at an extremely rapid pace. It is important to adopt an open and flexible architecture that can be easily adapted to evolving requirements and tooling.

On-Prem Environments Additional Challenges

Working in on-prem environments adds further complexity. Enterprises need to ask themselves the following questions:

  1. How to experiment on-prem/cloud while balancing speed and risks like data privacy? A possible solution is building a dedicated environment that complies with policies and allows emulating the use case.
  2. How to carry out fine-tuning with sensitive data on-prem? Enterprises are doing a great job with public services and data, but we still need to understand how to fine-tune internal data.
  3. How to create pipelines that:

GPU Usage in Regulated Industries

Globally, many enterprises have rushed to acquire GPU capacity for their on-prem data centers. This is due to:

  • Regulatory requirements or self-governed ones
  • Lack of trust in API-based LLM services
  • The need to use sensitive and proprietary data for building applications that bring value
  • High costs related to gen Al in the cloud: API calls, GPU hours from cloud vendors for fine-tuning and inferencing, CAPEX and OPEX priorities, and more.

On-prem GPU deployments considerations:

  • Global shortage, long lead times
  • Many clients are buying older generation GPUs due to lack of availability
  • Most banking client deploy GPUs within servers (2X/4X) rather than buying fully blown DGX Systems
  • GPUs are grouped as a central resource across BUS and Applications

Efficient GPU usage is key for delivering ROI - However most enterprises underutilize GPUs. GPU provisioning allows significantly improving GPU usage.

4 Operating Model Archetypes

Companies are now choosing between four operating model archetypes:

1. Highly centralized

  • Allows fastest skill and capability building for gen Al central team
  • But a central team is potentially "siloed" from the decision making and distant from BUs/functions, which can be a barrier to influencing decisions

2. Centrally-led, BU executed

  • More integrated strategy between BU and central team, reducing friction and easing support for use cases
  • But this results in slower execution

3. BU-led, centrally supported

  • Ease in getting buy in from BUs/functions, since the plan comes from bottom-up approach
  • But this challenging for cross BU-use cases implementation and varying levels of functional development across gen Al groups

4. Highly decentralized

  • Ease in getting buy-in from BU/function, with specialized resources driving relevant insights quickly and with better integration/ embedding within the BU/function
  • But the lack of knowledge and best practices result in difficulty in going deep enough on a single use case to achieve significant breakthrough

According to McKinsey, the more centralized the model is, the higher the success rate. Asaf recommends the central-led, BU-executed model. This provides the domain expertise while leveraging development capabilities. From a risk perspective, a highly-governed process is also recommended.

Gen AI Factories - Continuous Delivery, Automated Deployment, Monitoring and Feedback

An AI architecture consisting of four pipelines and guardrails can assist overcome the risks of gen AI implementation. Such an architecture helps overcome all risks, including those faced by highly-regulated industries.

The four pipelines include:

1. Data Management - Ensuring data quality through data ingestion, transformation, cleansing, versioning, tagging, labeling, indexing, and more.

2. Development - High quality model training, fine-tuning or prompt tuning, automated auditing and lineage.

3. Deployment - Bringing business value to live applications by implementing on-premises, in the cloud or through a hybrid option, including complete CI/CD pipelines.

4. LiveOps - Improving performance, reducing risks and ensuring continuous operations by monitoring data and models for feedback, drift detection and triggering retraining.

Gen AI Factories - Continuous Delivery, Automated Deployment, Monitoring and Feedback

In addition, a governance layer includes guardrails that help address issues like hallucinations, privacy violations, toxicity, etc. Guardrails need to be applied throughout the four pipelines. For example:

  • Data processing - Curating the data to ensure it is clean and accurate, before indexing in a database or before it is used for tuning.
  • Testing - Verifying the model is accurate and behaves appropriately in the application pipeline.
  • LLM Gateway - Monitoring LLM usage by use cases and users and logging/
  • Processing user requests - Checking requests and prompts before sending them to the model.
  • Validating model responses - Verifying compliance and quality of results in real-time. If needed, blocking them or providing an alternative response.
  • Monitoring - Collecting information from production and monitoring the application. Using the same data for model improvement.
Placing data and model controls in GenAI apps

Demo: Adding Guardrails to a Banking Chatbot with Continuous Tuning

Now, let’s look at a demo of implementing gen AI in highly-regulated environments. In this example, we show how to add guardrails and evaluate a banking chatbot using model monitoring and continuous fine tuning techniques. This allows increasing the robustness of the application and reducing risks.

The Architecture

  • The client talks to a chatbot application using LangChain.
  • The application uses an LLM server that hosts an open-source fine-tuned model.
  • All queries and replies are logged in a Kafka stream.
  • The LLM evaluation and monitoring framework in MLRun uses a plugin that sees all the stored data.
  • Data is stored in object storage and labeled to fine-tune the model.
  • An ML pipeline takes the data from monitoring, performs additional processing, fine-tuning and testing to evaluate and verify the model
  • The newly-tuned model is stored in the model registry in MLRun
  • Once a rolling update is performed the LLM server uses the improved model with better guardrails, less risk and better accuracy - immediately impacting application performance.
Continuous evaluation and tuning of LLMs with MLrun

LLM as a Judge

LLM as a Judge

We used the LLM as a Judge technique to evaluate the results of our model. We chose OpenAI’s GPT to evaluate our open-source model, which is tuned for a specific banking application.

First, we generated an MLRun project, enabled monitoring and created a function that evaluates the results of our chat application. To do so, we use a data set we created that shows what is a good banking question vs. a no not good banking question. This is used as a reference to test the evaluation function.

Our prompt evaluates the questions and answering and tells us if the answer is fit for a banking application. For example, asking for something in the supermarket or who is the president is not for a banking application. This needs to be guardrailed.

Our prompt

Model Monitoring

For monitoring, we built and leveraged MLRun’s ability to build evaluation or monitoring plugins by building our own monitoring function for drift, accuracy, hallucinations, etc.

Model Monitoring

MLRun generates a serverless function that will be used for monitoring the model. Then we deployed to the LLM server with Hugging Face (in this case, the gemma-2b model) by auto-scaling. Now, we have a running endpoint of the model.

For monitoring, we asked the model questions. The results can be seen in this Grafana dashboard. Since there are no guardrails yet, the results were poor, with high failure rates:

Based on the dataset, we will fine-tune the model with an MLRun training function. The newly generated model

model we have a newly generated uh model that we're going to now deploy. With MLRun, we performed a rolling upgrade for the LLM server with the new model version. Now, our application should immediately work.

The Results

Asking the same questions as before, model monitoring shows good results to the questions. Adding a guardrail improved the application and adding more guardrails can even further continuously improve them. 

To see the demo live with the chatbot in action, click here.