LLM Engineering: Transformers & RAG

8. LLMOps & Production

1. Serving with vLLM

A naive request-at-a-time serving loop leaves the GPU idle between requests and cannot sustain high-throughput inference. vLLM addresses this with PagedAttention, which stores each sequence's KV cache in fixed-size blocks (analogous to virtual-memory pages) to eliminate fragmentation, and with continuous batching, which folds newly arriving requests into the running batch. Together these let a single GPU serve many concurrent requests at far higher throughput than sequential decoding.
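A minimal sketch of offline batch inference with vLLM; the model name and sampling values are illustrative, and any Hugging Face causal LM that vLLM supports would work in their place:

from vllm import LLM, SamplingParams

# Illustrative model choice; swap in any supported checkpoint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching raise GPU utilization?",
]

# vLLM schedules all prompts together, continuously batching them on the GPU.
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)

For online serving, the same engine is exposed through an OpenAI-compatible HTTP server (vllm serve <model>), so existing OpenAI client code can be pointed at it unchanged.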

2. Quantization (4-bit)

Quantization compresses a model's weights from 16- or 32-bit floating point down to 4-bit integers, shrinking the memory footprint by roughly 4-8x so that a model which would otherwise exceed a smaller GPU's memory now fits, typically at a modest cost in output quality.
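One common route to 4-bit weights is NF4 quantization with bitsandbytes, applied at load time through Hugging Face Transformers. A sketch, assuming transformers, accelerate, and bitsandbytes are installed; the model name is an illustrative placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 ("NormalFloat4") stores weights in 4 bits; matmuls still compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate; places layers on available devices
)

inputs = tokenizer("4-bit quantization lets large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True))

Note that weights are dequantized on the fly during the forward pass, so 4-bit loading trades a little extra compute for a large memory saving; for serving, pre-quantized formats such as AWQ or GPTQ (both supported by vLLM) are common alternatives.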
