Choosing the Right-Sized LLM for Quality and Flexibility: Optimizing Your AI Toolkit
Guy Lecker | December 16, 2024
LLMs are the foundation of gen AI applications. To effectively operationalize and de-risk LLMs and ensure they bring business value, organizations need to consider not just the model itself, but the supporting infrastructure, including GPUs and operational frameworks. By optimizing the model and infrastructure for your use case, you can ensure you are using an LLM that is the right fit for your needs. Unlike a one-size-fits-all approach (e.g., using an out-of-the-box public model), right-sizing your LLM and infrastructure lets you tailor the model to specific use cases, ensures better performance and saves costs.
In this blog post, we’ll take you through the phases of choosing and optimizing your LLM: model selection, GPU selection, GPU utilization and MLOps practices. By the end, you’ll have the tools to make the right choices for building your gen AI application foundation.
Step 1: Model Selection
Choosing the Model
The first step, even before choosing the model, is separating your use case into tasks. These tasks should be as small and focused as possible.
Now it’s time to choose the right model for the tasks at hand. While it’s tempting to jump straight into complex solutions that provide a wide range of capabilities, we recommend a different approach: start with smaller, simpler models, so you can assign a small model to each small task.
Each task should be handled by the smallest model possible. You can join tasks if your chosen model can handle a number of them at the same time, but large models should only be assigned to complex tasks that small models can’t handle.
Of course, this means employing multiple models, each specializing in the tasks you divided the application into. The AI landscape is moving towards a multi-agent architecture with LLM agents, meaning each model works on a clear and simple task. This is similar to the microservices architecture.
So even if you have a complex application at hand, divide it so that small models are assigned to small tasks. For example, use separate tasks and models for sentiment analysis, knowledge retrieval, response generation, and so on.
Smaller models are easier to manage, require less computational power and allow faster iterations. In addition, these models have fewer parameters, which means lower latency.
This approach gives you the best of both worlds in terms of latency and quality, because each task is its own inference call, ensuring better specialization, improved performance and easier debugging.
If the smaller model does not provide acceptable accuracy for your use case, you can use prompt engineering, RAG, fine-tuning or other customization techniques to improve its outputs and make sure it fits the task at hand. Otherwise, test progressively larger models until you find one that fits.
Model Optimization
Once the model is selected, you can optimize it to ensure it accurately addresses your technological requirements. Some of the most prominent methods include:
1. Flash Attention - A fast and memory-efficient implementation of the attention mechanism for GPUs. Instead of materializing the full attention weights and intermediate results, Flash Attention reorders computation and optimizes memory reads/writes to avoid unnecessary memory overhead. This yields significant improvements in both the latency and memory usage of the model.
Pro tip: Use Flash Attention 2 whenever the GPU and CUDA version at hand support it (fall back to version 1 on older hardware, or move to version 3 once it completes beta).
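For example, with Hugging Face Transformers you can request the Flash Attention 2 kernels at load time. This is a minimal sketch, assuming the flash-attn package is installed and your GPU and CUDA version support it; the model ID is a placeholder.

```python
# Minimal sketch: loading a model with Flash Attention 2 via Hugging Face Transformers.
# Assumes flash-attn is installed and the GPU/CUDA version supports it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # hypothetical model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # use "sdpa" or "eager" on unsupported hardware
).to("cuda")
```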
2. Quantization - Quantization reduces the precision of the weights and activations in a model (e.g., from 32-bit floating point to 8-bit integers), yielding lower memory usage and latency. Common precision formats include float16, bfloat16, 8-bit (int8) and even 4-bit (int4). Integer operations are much faster than floating-point operations.
There are three main quantization methods: bitsandbytes, GPTQ and AWQ. GPTQ and AWQ require quantizing the model in advance, while bitsandbytes quantizes at model load time, which is simpler but slower and less accurate.
Pro tip: Use the lowest precision that does not harm the model's task and is supported by the GPU. Remember you can mix and match precisions; you don't have to quantize all layers to the same precision.
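As an illustration, here is a minimal sketch of on-the-fly 4-bit quantization with bitsandbytes through Transformers; the model ID is a placeholder. With GPTQ or AWQ you would instead load a checkpoint that was quantized ahead of time.

```python
# Minimal sketch: 4-bit quantization at load time with bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 usually preserves quality better
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep computation in bf16 (mixed precision)
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",  # hypothetical model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```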
3. Device Mapping - Device mapping involves distributing model layers, operations, or data across multiple devices (GPUs, CPUs, or accelerators) for parallel processing, using a layer map to load each on its dedicated device.
Pro tip: It’s preferable to fit the model into a single GPU. If the model is too big, use a device map to load it across multiple GPUs (a combined sketch follows the CPU offloading item below).
4. CPU offloading - CPU offloading transfers some operations (e.g., embedding layers, gradient computation, or optimizer states) from GPU to CPU to free up GPU memory for other tasks. This is useful for large models that can’t fit entirely in GPU memory, while allowing the use of cheaper or smaller GPUs.
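To illustrate both device mapping and CPU offloading, the sketch below relies on Accelerate's automatic device map through Transformers; the model ID and memory budgets are placeholder values.

```python
# Minimal sketch: device mapping across GPUs with CPU (and disk) offloading via Accelerate.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",      # hypothetical model ID
    torch_dtype=torch.bfloat16,
    device_map="auto",          # place layers on GPUs first, then spill to CPU
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "60GiB"},  # per-device budgets (example values)
    offload_folder="offload",   # anything that still doesn't fit is offloaded to disk
)
```

Keep in mind that layers placed on the CPU run much slower than those on the GPU, so treat offloading as a fallback rather than a default.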
Pro tip: Combining these optimizations can yield significant performance improvements. For instance:
- Use Flash Attention to speed up training and inference.
- Apply Quantization for better memory utilization and increased inference speed.
- If you lack GPUs:
- Employ Device Mapping to leverage multiple GPUs effectively.
- Integrate CPU Offloading for memory-intensive components.
Step 2: GPU Selection
Defining Your Requirements
Selecting the GPU will be based on multiple factors, covering technological and business aspects. Take into consideration the following:
- Budget - Determine how much you're willing to invest. GPUs range from lower prices suitable for basic tasks to high-end units designed for intensive workloads.
- Expected Workload - Identify the primary applications you'll run and their required computational power. Different tasks have varying GPU demands.
- Performance Metrics - Set acceptable performance metrics that will allow you to answer business needs. These include:
- Acceptable Latency - The time it takes for the GPU to process and output data. Lower latency is required for real-time applications like customer-facing chatbots.
- Acceptable Throughput - The volume of data the GPU can handle over a period of time. Higher throughput benefits large-scale data processing, which is required for applications such as call centers, or for serving high request volumes in real time (via continuous batching). A quick way to measure both metrics is shown in the sketch below.
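As a minimal sketch (the model ID is a placeholder), you can time a single generation call to estimate latency and tokens-per-second throughput and compare them against your targets.

```python
# Minimal sketch: measuring latency and throughput of one generation call.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # hypothetical model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("Summarize the roofline model in one line.", return_tensors="pt").to("cuda")

start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Latency: {elapsed:.2f} s | Throughput: {new_tokens / elapsed:.1f} tokens/s")
```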
Calculating Arithmetic Intensity
Arithmetic Intensity (AI) evaluates whether a GPU is suitable for a given computational workload. It is the ratio of the number of arithmetic operations (e.g., FLOPs) to the amount of data movement (in bytes) between memory and the compute units.
It is also recommended to look at the Compute Capability (CC) of the GPU. This is a number NVIDIA assigns to each GPU to give a general view of its feature set. Some CUDA versions require a minimum CC, and certain frameworks and algorithms require specific CUDA versions.
Here's a basic example of how to calculate Arithmetic Intensity and use it to verify GPU suitability, based on FLOPs (this could also be TOPs, for example) and without considering Compute Capability.
The formula:
\[
AI = \frac{\text{FLOPs}}{\text{Bytes Transferred to/from Memory}}
\]
- FLOPs - The number of floating-point operations required by your algorithm.
- Bytes Transferred - The total memory bandwidth utilized during the computation.
This is just the foundation. Read more about calculating Arithmetic Intensity here.
Once you know your algorithm's Arithmetic Intensity, compare it to the chosen GPU's capabilities:
1. Check the GPU's memory bandwidth (GB/s). Determine if the GPU's memory bandwidth can handle your workload.
2. Review the GPU's peak FLOP/s performance.
3. Use the roofline model to determine if your workload is memory-bound or compute-bound:
\[
\text{Threshold AI} = \frac{\text{Peak FLOP/s}}{\text{Memory Bandwidth}}
\]
Compare your calculated AI to this value:
- If \( AI < \text{Threshold AI} \), your workload is memory-bound.
- If \( AI > \text{Threshold AI} \), your workload is compute-bound.
4. Based on the calculated Arithmetic Intensity:
- If the GPU's compute and memory capabilities exceed the demands of your algorithm, the GPU is suitable.
- For workloads with low AI (memory-bound), ensure the GPU has sufficient memory bandwidth.
- For workloads with high AI (compute-bound), ensure the GPU has high FLOP/s capacity.
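Here is a minimal worked sketch of the check, assuming an fp16 matrix multiplication and illustrative GPU numbers (replace the peak FLOP/s and memory bandwidth placeholders with your GPU's datasheet values):

```python
# Worked sketch: Arithmetic Intensity and the roofline threshold for an fp16 matmul (M,K) x (K,N).
M, K, N = 4096, 4096, 4096
bytes_per_element = 2                       # fp16/bf16

flops = 2 * M * K * N                       # one multiply + one add per output element per K step
bytes_moved = bytes_per_element * (M * K + K * N + M * N)  # read A and B, write C

ai = flops / bytes_moved                    # FLOPs per byte for this workload

peak_flops = 300e12                         # example: ~300 TFLOP/s fp16 (placeholder value)
memory_bandwidth = 2.0e12                   # example: ~2 TB/s (placeholder value)
threshold_ai = peak_flops / memory_bandwidth

print(f"Workload AI:  {ai:.1f} FLOPs/byte")
print(f"Threshold AI: {threshold_ai:.1f} FLOPs/byte")
print("compute-bound" if ai > threshold_ai else "memory-bound")
```

Note that token-by-token LLM decoding at small batch sizes behaves more like a matrix-vector multiplication, so it typically lands on the memory-bound side - which is one reason the batching techniques below help so much.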
Step 3: GPU Utilization Techniques
You’ve found the right model(s) for your task(s) and ensured your chosen GPU can meet your computational and memory needs during operations. There are now additional steps you can take to enhance GPU efficiency. This will improve performance and cut costs even further.
Batching
Batching is the process of grouping multiple inputs into a single batch for simultaneous processing on the GPU. By increasing batch size, the GPU can process data more efficiently, utilizing its parallel processing capabilities. This fills the GPU with enough work to maximize utilization.
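A minimal sketch of static batching with Transformers (the model ID is a placeholder; left padding is needed for decoder-only generation):

```python
# Minimal sketch: batching several prompts into one generate() call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # hypothetical model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs define no pad token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompts = [
    "Summarize this ticket: ...",
    "Classify the sentiment: ...",
    "Translate to French: ...",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)  # whole batch processed together
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Inference servers go a step further with continuous batching, merging requests that arrive at different times into the same GPU passes.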
Caching
Caching involves storing frequently accessed data or computation results in GPU memory to reduce redundant computations and data transfer overheads.
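A minimal, framework-agnostic sketch of the idea, where `expensive_gpu_call` is a hypothetical stand-in for any deterministic computation you'd rather not repeat (such as embedding a prompt or retrieving a document):

```python
# Minimal sketch: caching results of a repeated, deterministic computation.
from functools import lru_cache

@lru_cache(maxsize=4096)
def expensive_gpu_call(prompt: str) -> str:
    # placeholder for real GPU work; only cache misses reach this body
    return prompt.upper()

expensive_gpu_call("hello")  # computed
expensive_gpu_call("hello")  # served from the in-memory cache
```

In LLM serving specifically, the most important cache is the KV cache, which keeps past attention keys and values in GPU memory so that each new token doesn't recompute attention over the entire prompt.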
Multi-instance GPUs (MIG)
MIG, supported by modern GPUs like NVIDIA's A100, enables partitioning a single physical GPU into multiple isolated instances. Each instance can handle different workloads independently. MIG increases resource utilization by allowing multiple smaller workloads to share a single high-capacity GPU without contention.
Distributed Inference
Distributed inference splits inference workloads across multiple GPUs or even machines. For workloads exceeding the capacity of a single GPU (e.g., massive models or extremely high request rates), distributing tasks ensures load balancing and faster processing.
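A minimal sketch using vLLM's tensor parallelism, assuming vLLM is installed and two GPUs are visible; the model ID is a placeholder:

```python
# Minimal sketch: tensor-parallel inference across two GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model",  # hypothetical model ID
    tensor_parallel_size=2,       # shard the model weights across 2 GPUs
)

outputs = llm.generate(
    ["Explain arithmetic intensity in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```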
Step 4: MLOps Practices
Finally, it’s time to optimize and enhance performance at the infrastructure-level. This includes two main aspects: GPU provisioning and overall AI orchestration.
GPU Provisioning
GPU Provisioning refers to the automated, efficient and dynamic management of GPU resources to meet varying computational demands without needing additional GPUs. It enables the allocation, scaling, and utilization of GPUs based on workload requirements, ensuring maximum efficiency.
This includes:
- Dynamic Scaling: Automatically increasing or decreasing GPU allocation, even scaling down to zero, depending on application needs.
- Distributed Processing: Allocating multiple GPUs for intensive tasks like LLM inference or fine-tuning, significantly reducing processing time.
Automatic Distribution Using MLRun
MLRun is an open-source AI orchestration platform that automates data preparation, model tuning, customization, validation and optimization of ML models, LLMs and live AI applications over elastic resources.
Here's a breakdown of how automatic distribution works with MLRun, without users having to change their original code (a minimal usage sketch follows the list):
- Parallel Processing of Jobs - MLRun can automatically parallelize tasks by dividing a large dataset into smaller chunks and distributing them across different compute resources.
- Scaling - MLRun can automatically scale compute resources on demand. It can spin up multiple containers or pods to distribute tasks. This scaling can happen automatically based on the resource availability and demand, so as the workload increases, the system can spin up more resources (pods) to handle it.
- Distributed Training, Data Parallelism & Model Parallelism - MLRun integrates with existing frameworks like Hugging Face Accelerate, PyTorch, DeepSpeed and Open MPI to achieve distributed training, data parallelism and model parallelism. This enables MLRun users to utilize these frameworks easily and with minimal effort.
- Automated Pipelines - Each step in the pipeline can be run in parallel on different machines or in different containers, ensuring the pipeline is executed efficiently.
- Elastic Resource Management - With MLRun, you can define resource requirements and allow the system to adjust based on current cluster usage. This is done through auto-scaling, where it adjusts resources according to workload, either by adding more resources during peak times or scaling down during idle periods.
- Containerization - MLRun handles container orchestration and lifecycle management, ensuring that each job is run in an isolated and controlled environment, allowing for better resource utilization and distributed processing.
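As a minimal usage sketch (file name, handler, project name and resource values are placeholders, and exact arguments may vary across MLRun versions), wrapping an existing script as an MLRun job and requesting a GPU looks roughly like this:

```python
# Minimal sketch: running existing code as an MLRun job with a GPU request.
import mlrun

project = mlrun.get_or_create_project("genai-demo", context="./")  # hypothetical project name

fn = project.set_function(
    "infer.py",        # hypothetical script containing an `infer` handler
    name="llm-infer",
    kind="job",        # distributed kinds (e.g., "mpijob") are also available
    image="mlrun/mlrun",
)
fn.with_limits(gpus=1)  # request one GPU; the cluster schedules and scales accordingly

run = fn.run(handler="infer", params={"batch_size": 8})
```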
Conclusion
Optimizing your generative AI toolkit begins with selecting the right-sized LLM and aligning it with your business needs. Complementing this with tailored GPU selection, utilization strategies, and robust MLOps practices sets the foundation for scalable and efficient AI solutions.
To get started, focus on the essentials:
- Understand Your Use Case: Clearly define the problem you’re solving. Whether it’s a chatbot, content generation, process automation, or data analysis, understanding your objectives will guide model selection and infrastructure requirements.
- Start Small and Scale: Begin with a pilot project using smaller models and simple optimizations. Evaluate performance and refine your approach before expanding to larger-scale implementations.
- (But) Begin with the End in Mind - Implement the right infrastructure and processes to ensure your gen AI application can be operationalized and de-risked at scale. This includes implementing MLOps practices from data through development to deployment and LiveOps, including governance and guardrails - also known as the “production-first” mindset at Iguazio.
- Optimize Infrastructure: Invest in the right GPUs, implement batching and caching, and explore tools like MLRun to streamline operations. These steps ensure your system can handle current workloads and adapt to future demands at the lowest costs.
- Leverage Expert Tools and Practices: Use techniques like flash attention, quantization, and device mapping to enhance model performance and reduce costs.
To learn more about AI operationalization, click here.