LLM Engineering: Transformers & RAG
Module 10 of 12
10. RLHF (Training ChatGPT)
1. SFT (Supervised Fine-Tuning)
Data: (Question, Good Answer) pairs. Training: standard next-token prediction on the Good Answer. Result: A model that mimics the answers in the dataset.
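To make the SFT step concrete, here is a minimal sketch of the loss it typically minimizes: the average negative log-likelihood of the Good Answer tokens, with the Question tokens masked out. The function names (`softmax`, `sft_loss`), the toy 3-token vocabulary, and the masking convention are assumptions for illustration, not the course's own code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(logits_per_step, target_ids, loss_mask):
    """Mean negative log-likelihood of target tokens where loss_mask is 1.

    logits_per_step: one list of vocabulary scores per position
    target_ids:      the correct next token at each position
    loss_mask:       1 for answer tokens (trained on), 0 for question tokens
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if not keep:
            continue  # question tokens do not contribute to the loss
        total += -math.log(softmax(logits)[target])
        count += 1
    return total / max(count, 1)

# Toy example (assumed data): position 0 is the question, 1 and 2 the answer.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]
loss = sft_loss(logits, targets, mask)
print(f"SFT loss on answer tokens: {loss:.4f}")
```

The loss falls as the model puts more probability on each answer token, which is why the result is a model that imitates the dataset.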
2. Reward Model
Data: (Question, Answer A, Answer B), where a human labels the better answer (say, A). Result: A model that scores an answer by how likely a human is to prefer it.
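A common way to train on such comparisons is a pairwise (Bradley-Terry style) loss: the Reward Model outputs one scalar per answer, and the loss is small when the human-preferred answer scores higher. The sketch below assumes that formulation; the function name `pairwise_preference_loss` and the example scores are hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably.

    reward_chosen:   Reward Model score for the answer the human picked
    reward_rejected: Reward Model score for the other answer
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))

# Assumed scores: the human picked Answer A.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"model agrees with human:    {agrees:.4f}")
print(f"model disagrees with human: {disagrees:.4f}")
```

Only the difference between the two scores matters here, so the Reward Model learns a relative ranking rather than an absolute quality scale.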
3. PPO
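Data: Questions only; the LLM generates its own answers and the Reward Model scores them. Result: The LLM is optimized with PPO (Proximal Policy Optimization) to maximize the Reward Model's score.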
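As a rough sketch of what "optimize with PPO" means, the code below shows the two pieces most RLHF setups combine: PPO's clipped surrogate objective, and a reward that subtracts a KL-style penalty so the LLM does not drift far from the SFT model. The KL penalty is common practice but is an assumption here, since the text above does not mention it. All names and the default values `clip_eps=0.2` and `kl_coef=0.1` are illustrative.

```python
import math

def shaped_reward(rm_score, logp_policy, logp_reference, kl_coef=0.1):
    """Reward Model score minus a penalty for diverging from the SFT model.

    logp_policy / logp_reference: log-probability of the generated answer
    under the current LLM and under the frozen SFT model (assumed inputs).
    """
    return rm_score - kl_coef * (logp_policy - logp_reference)

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (to be maximized).

    The probability ratio is clipped to [1 - clip_eps, 1 + clip_eps] so a
    single update cannot move the policy too far from the one that
    generated the sample.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

# Assumed numbers: the new policy makes a good answer twice as likely.
gain = ppo_clipped_objective(logp_new=math.log(2.0), logp_old=0.0, advantage=1.0)
reward = shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_reference=-1.5)
print(f"clipped objective: {gain:.2f}")
print(f"shaped reward:     {reward:.2f}")
```

In the example the ratio is 2 but the objective is capped at 1 + `clip_eps`, so pushing the probability even higher earns no extra credit in this update. A full implementation would also need advantage estimation and a value function, which this sketch leaves out.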