
11. Multimodal Models (LLaVA)

1. Projectors

How do you plug a vision encoder (a ViT) into a language model (e.g., Llama)? LLaVA inserts a linear projection layer that maps the ViT's patch embeddings into the LLM's word-embedding space. The projected image tokens are then concatenated with the text token embeddings and fed through the LLM as one sequence. To the LLM, an image is just a run of foreign "words" it learns to understand.
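A minimal sketch of such a projector in PyTorch, assuming CLIP ViT-L/14-style 1024-dim patch features and Llama-2-7B-style 4096-dim token embeddings (the dimensions and class name are illustrative, not LLaVA's actual code):

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps ViT patch embeddings into the LLM's word-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Original LLaVA uses a single linear layer; LLaVA-1.5 swaps
        # in a two-layer MLP with GELU for a small quality gain.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim), ready to be
        # concatenated with text token embeddings before the LLM.
        return self.proj(patch_features)

# Example: a 336x336 image at patch size 14 gives 24x24 = 576 patches
# (plus one CLS token in CLIP's output, 577 tokens total).
image_tokens = torch.randn(1, 577, 1024)
projector = VisionProjector()
print(projector(image_tokens).shape)  # torch.Size([1, 577, 4096])

During LLaVA's first training stage, only this projector is trained (the ViT and LLM stay frozen), so the linear layer learns the alignment between the two embedding spaces cheaply.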

