LLM Engineering: Transformers & RAG
Module 11 of 12
11. Multimodal Models (LLaVA)
1. Projectors
How do you plug a vision encoder (ViT) into a language model (Llama)? LLaVA uses a linear projection layer to map the ViT's image patch features into the LLM's word-embedding space. To the LLM, an image is just a sequence of weird foreign words it learns to understand.
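A minimal sketch of that projection, using numpy in place of a deep-learning framework. The dimensions are illustrative (ViT-L/14 patch features are 1024-dim; Llama's embedding space is 4096-dim in the 7B model), and the weights here are random, standing in for the trained projector:

```python
import numpy as np

# Illustrative dimensions: ViT patch features -> LLM embedding space.
VIT_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 576

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(VIT_DIM, LLM_DIM))  # projection weights (trained in real LLaVA)
b = np.zeros(LLM_DIM)                                # projection bias

def project(image_features: np.ndarray) -> np.ndarray:
    """Map ViT patch features into the LLM's word-embedding space."""
    return image_features @ W + b

patches = rng.normal(size=(NUM_PATCHES, VIT_DIM))  # mock ViT encoder output
image_tokens = project(patches)
print(image_tokens.shape)  # (576, 4096)
```

Each projected row now has the same shape as a word embedding, so the image tokens can be concatenated with the text-token embeddings and fed to the LLM as one sequence.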