Deploying RAG Pipelines for Production at Scale
- Course Code GK847005
- Duration 1 day
Course Delivery
This course is available in the following formats:
- Company Event
Course Overview
Gain hands-on experience deploying, monitoring, and scaling RAG pipelines with the NIM Operator, and learn best practices for infrastructure optimization, performance monitoring, and handling high traffic volumes.
This course begins by building a simple RAG pipeline using the NVIDIA API catalog. Participants will deploy and test individual components in a local environment using Docker Compose. Once familiar with the basics, the focus will shift to deploying NIMs, such as LLM, NeMo Retriever Text Embedding, and NeMo Retriever Text Reranking, in a Kubernetes cluster using the NIM Operator. This will include managing the deployment, monitoring, and scalability of NVIDIA NIM microservices. Building on these deployments, the workshop will cover constructing a production-grade RAG pipeline using the deployed NIMs and explore NVIDIA's blueprint for PDF ingestion, learning how to integrate it into the RAG pipeline.
To ensure operational efficiency, the workshop will introduce Prometheus and Grafana for monitoring pipeline performance, cluster health, and resource utilization. Scalability will be addressed through the use of the Kubernetes Horizontal Pod Autoscaler (HPA) for dynamically scaling NIMs based on custom metrics in conjunction with the NIM Operator. Custom dashboards will be created to visualize key metrics and interpret performance insights.
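As a rough illustration of the scaling behavior described above, the sketch below reproduces the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the ceiling of the current count multiplied by the ratio of the observed metric to its target, with no change when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the example metric values are assumptions made for illustration and are not taken from the course material.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    All parameter names and defaults here are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: average in-flight requests per pod.
scale_up = desired_replicas(2, current_metric=30, target_metric=10)
scale_down = desired_replicas(4, current_metric=5, target_metric=10)
steady = desired_replicas(3, current_metric=10.5, target_metric=10)
```

In a real cluster the controller also applies stabilization windows and scaling policies, so observed behavior can differ from this simplified calculation.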
Company Events
These events can be delivered exclusively for your company, at our locations or yours, and are designed specifically for your delegates and your needs. Company Events can be delivered as tailored or standard courses.
Course Objectives
To teach and demonstrate how to deploy enterprise-scale LLM-based agentic and RAG applications, this course covers the following topics and technologies:
- The current landscape of enterprise generative AI applications
- NVIDIA NIM microservices
- Components and architecture of enterprise-grade RAG applications
- At-scale inference considerations and optimizations
- Kubernetes, Helm, and the NVIDIA RAG operator to deploy, manage, and scale RAG application services
- Prometheus and Grafana for cluster-wide application behavior and performance visibility
- Techniques for deploying and scaling multimodal RAG applications
Course Content
Module 1: Production Deployment of Generative AI Applications
- Survey the current landscape of generative AI applications across a wide variety of capabilities.
- Review the many constituent parts of enterprise-grade generative AI applications, including RAG pipelines.
- Understand the challenges that enterprises face when transitioning from naive to enterprise-grade generative AI applications.
- Learn about the capabilities of NVIDIA NIM microservices for deploying LLMs and other generative AI application components.
- Discuss techniques and technologies for improving the performance of LLM-based application inference at scale.
Module 2: Core Concepts of Enterprise-scale Generative AI DevOps
- Survey the current landscape of available DevOps tools for enterprise-scale containerized application deployment.
- Learn about the value of the Kubernetes container orchestration platform.
Module 3: Deploying Simple RAG Applications
- Learn how to access remote LLMs and RAG services over API calls.
- Review the core LangChain programming patterns used in engineering RAG applications.
- Build a simple RAG application using API-hosted services, and deploy it with Docker Compose.
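The retrieve-then-prompt pattern behind a simple RAG application can be sketched in a few lines. The example below is a toy, standard-library-only illustration: it swaps in a bag-of-words counter where a real pipeline would call a text embedding service, and it stops at assembling the prompt rather than calling an LLM. It does not use LangChain or any NVIDIA API, and every function name and sample document is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_docs):
    """Assemble the augmented prompt that would be sent to an LLM."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


# Invented sample documents, for illustration only.
docs = [
    "Docker Compose starts multiple containers from one YAML file.",
    "Grafana renders dashboards from Prometheus metrics.",
    "Kubernetes schedules containers across cluster nodes.",
]
question = "What does Docker Compose start?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
```

In a production pipeline the `embed` step would be replaced by calls to an embedding model and the ranking by a vector store query, optionally followed by a reranker.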
Module 4: Kubernetes Core Concepts
- Learn the core concepts and techniques required for working with Kubernetes clusters.
- Familiarize yourself with the interactive multi-node Kubernetes cluster programming environment provided to you in this workshop.
- Interactively use `kubectl` to deploy, manage, and monitor container-based applications across a cluster.
Module 5: Deploying Self-hosted RAG Applications
- Deploy and coordinate a variety of containerized microservices in service of a cluster-based RAG application.
- Learn about and use the NIM Operator to manage and scale a variety of RAG microservices.
- Configure storage on your cluster for various model caches.
- Deploy LLM, text embedding, and text retriever services onto your cluster.
Module 6: Monitoring GPU Utilization
- Use NVIDIA Data Center GPU Manager (DCGM) to monitor and manage GPU resources across a Kubernetes cluster.
- Deploy Prometheus and Grafana onto the cluster to better monitor and visualize cluster resources.
- Use DCGM Exporter to export GPU utilization metrics for Prometheus which can be visualized with Grafana.
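To make the metrics flow above more concrete, the sketch below parses a few lines of the Prometheus text exposition format, the plain-text format that Prometheus scrapes from exporters, and pulls out per-GPU utilization. `DCGM_FI_DEV_GPU_UTIL` is the GPU utilization metric name used by DCGM Exporter; the sample values, the label values, and the helper names are invented for illustration. The parser is deliberately minimal and does not handle escaped quotes or timestamps.

```python
import re

# Invented sample scrape output; values and label values are illustrative.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
'''

# metric_name{label="value",...} value
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore lines this minimal parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_utilization(samples):
    """Map GPU index label to its utilization percentage."""
    return {
        labels.get("gpu"): value
        for name, labels, value in samples
        if name == "DCGM_FI_DEV_GPU_UTIL"
    }


util = gpu_utilization(parse_metrics(SAMPLE))
```

In practice Prometheus performs this parsing itself and Grafana queries the stored series; the sketch only shows what the exported data looks like.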
Module 7: Autoscaling NIM Microservices
- Use Prometheus service monitors to extract custom metrics from the NIM microservices running on the cluster.
- Create HorizontalPodAutoscalers to automatically scale the cluster's services based on custom metrics.
- Test and observe automatic horizontal autoscaling by performing load tests using Locust.
Module 8: Building Multimodal RAG Pipelines
- Learn how to isolate different modalities from multimodal PDF documents like text, figures, and tables.
- Practice various strategies for chunking extracted text.
- Perform table extraction from PDF documents.
- Perform image/table extraction from PDF documents.
- Use the NV-YOLOX model to identify PDF page elements.
- Perform end-to-end multimodal data extraction on the ChipNeMo technical paper.
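One common chunking strategy for extracted text is a fixed-size window with overlap, so that content spanning a boundary appears in two adjacent chunks. The sketch below shows a word-based version; it is one possible strategy rather than the specific method taught in the module, and the function name and default sizes are assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words. Names and defaults are
    illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


example = chunk_words("a b c d e f g", chunk_size=3, overlap=1)
```

Larger overlaps improve the chance that a retrieved chunk contains complete context, at the cost of storing and embedding more text.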
Module 9: Using Generative AI to Represent Extracted Modalities
- Create detailed, contextual descriptions for extracted elements such as text, figures, charts, and tables.
- Perform image transformation using VLMs.
- Use a state-of-the-art context-aware chart element detection (CHART) model to detect classes for basic chart elements, including plot elements.
- Combine the use of LLMs and VLMs in an end-to-end example on the ChipNeMo technical paper.
Module 10: Multimodal Embedding, Storing, and Retrieving
- Convert all extracted modalities into a common format for use in a universal RAG pipeline.
- Construct an end-to-end multimodal RAG pipeline.
Module 11: Assessment