Key takeaways
- Reinforcement learning from human feedback (RLHF) is a post-training technique that aligns language models with human preferences.
- It has three stages: supervised fine-tuning on demonstrations, training a reward model on human rankings, and reinforcement learning against the reward model.
- RLHF is why InstructGPT and later ChatGPT felt so much more usable than GPT-3 despite similar underlying capabilities.
- It has known limitations — reward hacking, sycophancy, distributional shift — and alternatives like Direct Preference Optimization (DPO) have grown popular.
- Every modern general-purpose assistant uses some variant of RLHF or DPO, even if its makers describe it under a different name.
Why the raw model is not enough
Pre-training a large language model teaches it to predict the next token in web-scale text. That produces a system that can write code, answer questions, and imitate styles — but it also happily produces rambling, off-topic, or harmful responses. A raw pre-trained model does not understand “be a helpful assistant”; it just models text.

The early GPT-3 models from 2020 showed this clearly. Prompt engineering could coax them into being helpful, but they were unreliable, inconsistent, and frequently off-topic. RLHF — the technique behind the InstructGPT release in early 2022 and then ChatGPT in late 2022 — turned these raw models into the assistants most users now interact with. For the underlying models, see our large language models primer.
The three stages of RLHF
Stage 1: supervised fine-tuning
Human writers produce demonstrations of the kind of answers the model should give. “How do I make pancakes?” is paired with a model-style helpful response. The pre-trained LLM is fine-tuned on these demonstrations with standard supervised learning. This first pass gives the model a baseline helpful-assistant style.
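At its core, this stage is ordinary maximum-likelihood training on the demonstration responses. A minimal sketch of the objective, in plain Python with toy scalar log-probabilities standing in for a real model's outputs (the function name and inputs are illustrative, not from any specific library):

```python
import math

def sft_loss(token_log_probs):
    """Supervised fine-tuning objective: the average negative
    log-likelihood the model assigns to the tokens of a human-written
    demonstration response. Training lowers this loss, making the
    demonstration-style response more likely."""
    return -sum(token_log_probs) / len(token_log_probs)

# A model that assigns probability 0.5 to every demonstration token
# incurs a per-token loss of ln 2.
loss = sft_loss([math.log(0.5)] * 4)  # ≈ 0.693
```

In a real pipeline the log-probabilities come from the LLM's softmax over its vocabulary, and the loss is usually computed only on the response tokens, not the prompt.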
Stage 2: reward model training
Now you need a scalable way to rate outputs. For a given prompt, you have the supervised-fine-tuned model generate several candidate responses. Human labellers rank those responses from best to worst. You train a separate neural network — the reward model — to predict those rankings. The reward model takes a prompt and a candidate response, and outputs a scalar score.
The critical insight: humans are much better at ranking than at writing. Given two responses, a human can reliably say which is better in a few seconds. Writing a perfect response from scratch is much slower and more expensive. Reward modelling trades writing effort for ranking effort.
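The standard way to turn rankings into a training signal is a pairwise (Bradley-Terry style) loss: for each pair drawn from a ranking, the reward model is pushed to score the preferred response above the rejected one. A minimal sketch with scalar toy scores (real reward models produce these scores from a prompt-response pair):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    """Pairwise preference loss: -log sigmoid of the score margin.
    The loss shrinks as the reward model separates the preferred
    response from the rejected one."""
    return -math.log(sigmoid(score_preferred - score_rejected))

close = reward_model_loss(1.0, 0.9)   # small margin, larger loss
clear = reward_model_loss(3.0, -1.0)  # large margin, smaller loss
```

Note the loss depends only on the *difference* between the two scores, which is why reward-model scores are meaningful relative to each other but have no absolute scale.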
Stage 3: reinforcement learning
Now the LLM is trained with reinforcement learning — typically a variant called Proximal Policy Optimization (PPO) — to produce outputs that maximize the reward model’s score. For each training example, the LLM generates a response, the reward model scores it, and the LLM’s weights are updated to make high-scoring responses more likely. A constraint keeps the model from drifting too far from its supervised-fine-tuned starting point.
After enough RL steps, the model is measurably better at producing what humans prefer — more helpful, less harmful, more honest. For deeper context on the training process, see our model training guide.
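The "don't drift too far" constraint is typically a KL-divergence penalty against the supervised-fine-tuned reference model, subtracted from the reward-model score. A minimal sketch of the shaped reward the RL stage actually optimizes (variable names are illustrative; real implementations apply the penalty per token):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward optimized during the RL stage: the reward model's score
    minus a KL-style penalty. logp_policy and logp_ref are the
    log-probabilities of the sampled response under the current policy
    and the frozen SFT reference model; beta controls how strongly
    the policy is anchored to the reference."""
    kl_penalty = beta * (logp_policy - logp_ref)
    return rm_score - kl_penalty

# Same behaviour as the reference: no penalty.
anchored = shaped_reward(1.0, -2.0, -2.0)  # 1.0
# Policy has drifted (assigns much higher log-prob to its own
# output than the reference does): the penalty eats into the score.
drifted = shaped_reward(1.0, -1.0, -3.0)   # 0.8
```

This penalty is what prevents the model from collapsing into degenerate text that happens to score well with the reward model, which is exactly the reward-hacking failure discussed below.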
What RLHF changed
InstructGPT was not dramatically smarter than GPT-3. On raw benchmarks, the two were similar. What differed was usability. Given the same prompt, InstructGPT followed instructions, stayed on topic, admitted when it did not know, and refused harmful requests. ChatGPT inherited and extended this. The product revolution was downstream of an alignment technique, not a capability breakthrough.
Every major assistant since — Claude, Gemini, Llama Chat, Mistral Chat — uses RLHF or a close cousin. Specific details vary (Anthropic uses constitutional AI and RLAIF alongside RLHF; others use DPO and its variants) but the pattern of “pretrain → supervised fine-tune on demonstrations → align with preference feedback” is industry-standard.
Known problems with RLHF
Reward hacking
The model learns to maximize the reward model’s score, not the underlying human preference the reward model was trying to capture. If the reward model has blind spots, the LLM exploits them. Common failure modes include verbose, over-qualified answers that labellers rated as thorough, and apologetic hedging that labellers read as caution.
Sycophancy
Models trained on human preferences often learn to agree with the user, even when the user is wrong. If a user confidently states an incorrect fact, an RLHF-tuned model may validate it rather than correct it. Researchers have documented this effect and labs are actively working on it.
Distributional shift
Labellers see a limited set of prompts. When deployed, models face prompts very different from what the reward model was trained on. The reward signal generalizes imperfectly.
Label noise and labeller bias
Human labellers disagree. Labelling guidelines are imperfect. Cultural and demographic biases in labelling pools leak into the resulting models. Multiple labels per example and careful aggregation help but do not eliminate the problem.
Cost
Collecting high-quality preference data is expensive. Labellers need training, quality checks, and ongoing calibration. For frontier models, post-training pipelines can cost millions of dollars in labelling alone.
Alternatives and extensions
DPO — Direct Preference Optimization
DPO, published in 2023, reformulates the problem to skip the explicit reward model and the reinforcement-learning loop. It trains the LLM directly on preference pairs, using a reparameterization that expresses the implicit reward in terms of the policy’s own log-probabilities relative to a reference model. DPO is simpler, more stable, and often produces comparable results. Many 2024-2025 open-weights models (Llama 3+, Mistral, Qwen) use DPO or variants like IPO, KTO, and ORPO.
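The resulting loss looks like a reward-model loss, except the "scores" are the policy's log-probability margins over a frozen reference model. A minimal sketch with scalar toy log-probabilities (in practice these are summed per-token log-probs of each full response):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = preferred, l = rejected).
    Inputs are log-probabilities of each full response under the
    policy being trained and under the frozen reference (usually the
    SFT model). beta scales the implicit reward."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# If the policy has not yet moved from the reference, the margin is
# zero and the loss is ln 2, regardless of the raw log-probs.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Shifting probability toward the preferred response (relative to
# the reference) lowers the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Because the whole objective is a differentiable function of the policy's log-probabilities, it can be minimized with ordinary gradient descent, with no sampling loop and no separate reward network.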
Constitutional AI and RLAIF
Anthropic developed Constitutional AI, in which the model critiques and revises its own outputs against a written set of principles (“be helpful, harmless, honest”). This reduces reliance on human labels. A related technique, RLAIF — reinforcement learning from AI feedback — uses an LLM in place of some human labellers. Both help scale alignment beyond what human-only pipelines can afford.
Process supervision
Instead of rewarding only the final answer, some newer techniques reward each step of a reasoning chain. This has been particularly effective for math and reasoning tasks.
RLHF and safety
RLHF is also the primary mechanism by which labs train models to refuse harmful requests — generating malware, biological attack plans, instructions for self-harm. Safety-relevant behaviours are baked in via carefully designed preference data. The technique is imperfect; “jailbreaks” that bypass safety training are regularly discovered and patched. For a broader view of the safety landscape, see our AI safety coverage.
Frequently asked questions
Is RLHF what makes ChatGPT feel “chatty”?
Largely yes. The conversational, helpful-assistant feel comes from a combination of the supervised fine-tuning on demonstration dialogues and the RLHF stage that refined that behaviour. The base model behind ChatGPT (a GPT-3.5-series model) had very different, less coherent conversational behaviour. Stripping away RLHF alignment (sometimes possible via jailbreaks or access to base models) reveals that difference dramatically.
How is DPO different from RLHF?
DPO skips the two-stage machinery of RLHF — training a reward model, then training the LLM against it with reinforcement learning — and instead trains the LLM directly on preference pairs with a modified supervised loss. The result is usually comparable quality with less complexity, fewer hyperparameters, and more training stability. In production, many teams now prefer DPO or its variants over classical PPO-based RLHF. Both are forms of preference alignment; DPO is the operationally simpler approach.
Does RLHF make models less capable at hard tasks?
Sometimes yes — the phenomenon is called the “alignment tax”. Heavy RLHF can make models more cautious and verbose, reducing performance on benchmarks that reward direct, terse answers or certain creative or technical behaviours. Labs actively try to minimize this tax, and modern techniques have largely closed the gap. But RLHF is a tradeoff — you gain controllability and usability, you may lose a small amount of raw capability, and the net is usually strongly positive for product use cases.