LLM Engineering: Transformers & RAG
Module 8 of 12
8. LLMOps & Production
1. Serving with vLLM
Naive one-request-at-a-time generation wastes most of the GPU and fragments KV-cache memory. vLLM uses PagedAttention to manage the KV cache in fixed-size blocks and continuous batching to keep the GPU saturated, giving far higher throughput than naive serving; a minimal sketch follows.
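A minimal offline-batching sketch with vLLM's Python API. The model name is an assumption; any Hugging Face-compatible causal LM checkpoint works.

```python
from vllm import LLM, SamplingParams

# Loads weights and pre-allocates the paged KV-cache blocks on the GPU.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # model name is an assumption

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]

# vLLM schedules decoding steps across all in-flight requests and
# recycles freed KV-cache blocks as sequences finish.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For online serving, the same engine can be exposed over an OpenAI-compatible HTTP endpoint (`vllm serve <model>`), so existing client code can point at it with only a base-URL change.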
2. Quantization (4-bit)
Compressing model weights from 16- or 32-bit floats down to 4-bit integers so a large model fits on smaller GPUs, usually with only a small loss in quality; common schemes include NF4 (bitsandbytes), GPTQ, and AWQ.
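A minimal sketch of loading a model in 4-bit NF4 with bitsandbytes via `transformers`. The model name is an assumption; swap in any causal LM checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit integers
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Rough sizing: at 4 bits an 8B-parameter model needs on the order of 4-5 GB for weights, versus roughly 16 GB at FP16, which is what lets it run on a single consumer GPU.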