LLM Engineering: Transformers & RAG
Module 11 of 12
11. Multimodal Models (LLaVA)
1. Projectors
How do you plug a vision encoder (ViT) into a language model (Llama)? LLaVA uses a linear projection layer to map the ViT's image patch features into the LLM's word-embedding space. To the LLM, an image is just a sequence of weird foreign words it learns to understand.
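A minimal sketch of that projection, using numpy in place of a deep-learning framework. The dimensions are illustrative (ViT-L/14 patch features are 1024-dim; Llama's embedding space is 4096-dim in the 7B model), and the weights here are random, standing in for the trained projector:

```python
import numpy as np

# Illustrative dimensions: ViT patch features -> LLM embedding space.
VIT_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 576

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(VIT_DIM, LLM_DIM))  # projection weights (trained in real LLaVA)
b = np.zeros(LLM_DIM)                                # projection bias

def project(image_features: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the LLM's word-embedding space."""
    return image_features @ W + b

patches = rng.normal(size=(NUM_PATCHES, VIT_DIM))  # mock ViT encoder output
image_tokens = project(patches)
print(image_tokens.shape)  # (576, 4096)
```

Each projected row now has the same shape as a word embedding, so the image tokens can be concatenated with the text-token embeddings and fed to the LLM as one sequence.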