LLM Engineering: Transformers & RAG
Module 8 of 12
8. LLMOps & Production
1. Serving with vLLM
Naive one-request-at-a-time generation wastes most of the GPU and fragments KV-cache memory. vLLM uses PagedAttention to manage the KV cache in fixed-size blocks and continuous batching to keep the GPU saturated, giving far higher throughput than naive serving; a minimal sketch follows.
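A minimal offline-batching sketch with vLLM's Python API. The model name is an assumption; any Hugging Face-compatible causal LM checkpoint works.

```python
from vllm import LLM, SamplingParams

# Loads weights and pre-allocates the paged KV-cache blocks on the GPU.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # model name is an assumption

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]

# vLLM schedules decoding steps across all in-flight requests and
# recycles freed KV-cache blocks as sequences finish.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For online serving, the same engine can be exposed over an OpenAI-compatible HTTP endpoint (`vllm serve <model>`), so existing client code can point at it with only a base-URL change.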
2. Quantization (4-bit)
Compressing model weights from 16- or 32-bit floats down to 4-bit integers so a large model fits on smaller GPUs, usually with only a small loss in quality; common schemes include NF4 (bitsandbytes), GPTQ, and AWQ.
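A minimal sketch of loading a model in 4-bit NF4 with bitsandbytes via `transformers`. The model name is an assumption; swap in any causal LM checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit integers
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Rough sizing: at 4 bits an 8B-parameter model needs on the order of 4-5 GB for weights, versus roughly 16 GB at FP16, which is what lets it run on a single consumer GPU.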