10. RLHF (Training ChatGPT)

1. SFT (Supervised Fine-Tuning)

Data: (Question, Good Answer) pairs. The model is trained to predict the good answer given the question. Result: A model that mimics the dataset.
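
A minimal sketch of the SFT objective, assuming the usual next-token setup: the loss is the average negative log-likelihood the model assigns to the tokens of the good answer. The function name `sft_loss`, the choice to mask out question tokens, and the toy probabilities are illustrative assumptions, not something these notes specify.

```python
import math

def sft_loss(token_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only.

    token_probs: probability the model gave to each correct next token.
    is_answer_token: True where the token belongs to the good answer
    (assumption: question tokens are excluded from the loss).
    """
    nll = [-math.log(p) for p, keep in zip(token_probs, is_answer_token) if keep]
    if not nll:
        raise ValueError("no answer tokens to train on")
    return sum(nll) / len(nll)

# Toy example: two question tokens (ignored), two answer tokens (trained on).
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
loss = sft_loss(probs, mask)
```

Lower loss means the model puts more probability on the dataset's answer, which is what "mimics the dataset" means here.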

2. RM (Reward Model)

Data: (Question, Answer A, Answer B), where a human picks the better answer (say, A). Result: A model that predicts "Human Preference" by scoring answers.
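
One common way to train on such comparisons is a pairwise (Bradley-Terry style) loss: the model outputs a scalar score per answer, and the loss is small when the human-chosen answer scores higher than the rejected one. This is a sketch under that assumption; the notes do not say which loss is used, and `preference_loss` is a hypothetical name.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Written in a numerically stable form so large margins do not overflow.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The human picked A. If the model already scores A above B, the loss is small.
loss_agrees = preference_loss(score_chosen=2.0, score_rejected=0.0)
# If the model scores B above A, the loss is large and pushes the scores apart.
loss_disagrees = preference_loss(score_chosen=0.0, score_rejected=2.0)
```

Only the difference between the two scores matters, so the model learns a ranking rather than an absolute quality scale.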

3. RL (Reinforcement Learning)

Optimize the LLM from step 1 to maximize the Reward Model's score on its own answers, typically with PPO.
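
In practice the quantity being maximized is often not the raw score. A common choice, assumed here and not stated in these notes, is to subtract a KL-style penalty that keeps the LLM close to the SFT model so it cannot drift into nonsense that happens to score well. The sketch below computes that shaped reward for a single answer; `shaped_reward` and `beta` are illustrative names.

```python
def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward Model score minus a penalty for drifting from the reference model.

    logp_policy / logp_reference: per-token log-probabilities of the sampled
    answer under the LLM being trained and under the frozen SFT model.
    beta: penalty strength (assumption: 0.1 is only a placeholder value).
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("log-probability lists must have the same length")
    # Sum of log-ratios on the sampled tokens: a simple per-sample KL estimate.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate

# Same score, but the second answer is far more likely under the new policy
# than under the SFT model, so it is penalized.
close_to_sft = shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
drifted = shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])
```

An RL algorithm such as PPO would then update the LLM to make answers with a high shaped reward more likely.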