Demystifying DPO: Aligning LLMs Without a Reward Model
Large Language Models (LLMs) are trained to predict the next token — not to behave helpfully or safely. To bridge that gap, the field developed Reinforcement Learning from Human Feedback (RLHF). But RLHF is notoriously complex: it requires training a separate reward model, running a PPO optimization loop, and juggling multiple models simultaneously. Direct Preference Optimization (DPO) cuts through this complexity with a beautiful mathematical insight.
The Problem: Why LLMs Need Alignment
A base LLM, after pretraining, knows how to complete text — but it doesn't know what humans prefer. Ask it a question and it might give a factually correct but unhelpful answer, or hallucinate confidently. RLHF teaches it better by collecting human preference data (person A prefers response Y over response Z) and training the model to produce preferred outputs.
RLHF: Powerful But Complex
The classic RLHF pipeline has three stages:
- Supervised Fine-Tuning (SFT): Train a reference model π_ref on high-quality demonstrations
- Reward Modeling: Train a reward model r_φ(x, y) on human preference pairs (y_w ≻ y_l | x)
- RL Optimization: Use PPO to maximize reward while staying close to the reference policy
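To make stage 2 concrete: the reward model is typically fit with a Bradley-Terry pairwise loss that pushes the score of the preferred response above the dispreferred one. A minimal sketch (the function name and the toy scores here are illustrative, not from a specific library):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_w, reward_l):
    """Bradley-Terry pairwise loss: maximize
    P(y_w ≻ y_l | x) = σ(r_φ(x, y_w) − r_φ(x, y_l))."""
    return -F.logsigmoid(reward_w - reward_l).mean()

# Toy scalar scores r_φ(x, y_w) and r_φ(x, y_l) for a batch of 3 pairs
reward_w = torch.tensor([1.5, 0.2, 2.0])
reward_l = torch.tensor([0.5, 0.4, -1.0])
loss = reward_model_loss(reward_w, reward_l)
```

Note that the second pair is mis-ranked (the dispreferred score is higher), so it contributes the largest loss term.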
The optimization objective in standard RLHF is:

max_π  E_{x ∼ D, y ∼ π(·|x)} [ r_φ(x, y) ] − β · D_KL( π(y | x) ‖ π_ref(y | x) )

The KL divergence term prevents the policy from drifting too far from the reference, which would destabilize training. The parameter β controls the strength of this trade-off.
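For intuition, the quantity being maximized can be estimated per sampled completion from a reward score and the two log-probabilities, using log π(y|x) − log π_ref(y|x) as a single-sample estimate of the KL term. A toy sketch with made-up numbers:

```python
import torch

beta = 0.1
# Hypothetical per-completion values for a batch of 2 sampled responses
reward = torch.tensor([0.8, -0.3])          # r_φ(x, y)
logp_policy = torch.tensor([-12.0, -15.0])  # log π(y | x)
logp_ref = torch.tensor([-11.0, -15.5])     # log π_ref(y | x)

# Per-sample estimate of r_φ(x, y) − β · KL(π ‖ π_ref), averaged over the batch
kl_estimate = logp_policy - logp_ref
objective = (reward - beta * kl_estimate).mean()
```

A completion that the policy likes much more than the reference does (first sample) pays a KL penalty even when its reward is high.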
The DPO Key Insight
Rafailov et al. (2023) showed that the KL-constrained RL problem above has a closed-form optimal solution:

π*(y | x) = (1 / Z(x)) · π_ref(y | x) · exp( r(x, y) / β )

where Z(x) = Σ_y π_ref(y | x) · exp( r(x, y) / β ) is the partition function. Rearranging, the reward can be expressed in terms of the optimal policy and the reference policy:

r(x, y) = β · log( π*(y | x) / π_ref(y | x) ) + β · log Z(x)
The crucial observation: the Bradley-Terry model scores a preference by a reward difference, P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) ), and both rewards for the same prompt contain the identical β · log Z(x) term, so Z(x) cancels! The preference loss can therefore be written entirely in terms of policy probabilities, with no separate reward model.
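A quick numerical check of this cancellation, with hypothetical log-probabilities (any value of log Z(x) gives the same preference probability, since both responses share the same prompt):

```python
import math

beta = 0.1
# Hypothetical log-probabilities for two responses to the same prompt x
log_pi_w, log_ref_w = -10.0, -11.0
log_pi_l, log_ref_l = -14.0, -12.0
log_Z = 3.7  # arbitrary: same prompt ⇒ same partition function for both

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Reward written via the optimal-policy identity:
# r(x, y) = β·(log π(y|x) − log π_ref(y|x)) + β·log Z(x)
r_w = beta * (log_pi_w - log_ref_w) + beta * log_Z
r_l = beta * (log_pi_l - log_ref_l) + beta * log_Z

# Bradley-Terry probability: the β·log Z(x) terms cancel in the difference
p_with_Z = sigmoid(r_w - r_l)
p_without_Z = sigmoid(beta * ((log_pi_w - log_ref_w) - (log_pi_l - log_ref_l)))
```

Changing `log_Z` to any other number leaves both probabilities identical.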
The DPO Loss Function
The final DPO objective is elegantly simple:

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ∼ D} [ log σ( β · log( π_θ(y_w | x) / π_ref(y_w | x) ) − β · log( π_θ(y_l | x) / π_ref(y_l | x) ) ) ]

Intuitively, we increase the log probability of the preferred response y_w and decrease the log probability of the dispreferred response y_l, relative to what the reference model would predict. β scales how strongly we enforce the preference signal.
A Simple Example
Suppose a user asks: "What is 2 + 2?"
- Preferred (y_w): "2 + 2 = 4"
- Dispreferred (y_l): "2 + 2 = 5, trust me!"
The DPO loss encourages the model to assign higher relative probability to the correct, concise answer than to the confidently wrong one. We compute the implicit reward margin

β · [ log( π_θ(y_w | x) / π_ref(y_w | x) ) − log( π_θ(y_l | x) / π_ref(y_l | x) ) ]

and minimize −log σ of it. If the model currently assigns equal probability to both, the gradient pushes it to prefer y_w over y_l. After training, the implicit reward for the correct answer should be substantially higher.
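A quick check with hypothetical log-probabilities: when the policy starts as an exact copy of the reference, every log ratio is zero and the per-example loss is log 2 ≈ 0.693; as the policy shifts mass toward y_w, the loss falls.

```python
import math

def dpo_loss_single(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: −log σ(implicit reward margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before training: policy == reference, so the margin is 0 and loss = log 2
loss_start = dpo_loss_single(-5.0, -7.0, -5.0, -7.0)

# After some training (made-up numbers): the policy has raised y_w and
# lowered y_l relative to the reference, so the loss is smaller
loss_later = dpo_loss_single(-3.0, -9.0, -5.0, -7.0)
```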
DPO vs RLHF: A Comparison
- ✅ No reward model training required
- ✅ Simpler implementation — standard cross-entropy style loss
- ✅ More stable — no PPO rollouts and no learned reward model to over-optimize
- ✅ Single training run instead of 3-stage pipeline
- ⚠️ Requires offline preference data (can't explore new completions)
- ⚠️ Sensitive to reference model quality
Code Sketch

In PyTorch, assuming per-example sequence log-probabilities as tensors:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l, beta=0.1):
    """
    policy_logps_w: log π_θ(y_w | x)
    policy_logps_l: log π_θ(y_l | x)
    ref_logps_w:    log π_ref(y_w | x)
    ref_logps_l:    log π_ref(y_l | x)
    """
    log_ratio_w = policy_logps_w - ref_logps_w  # implicit reward for preferred
    log_ratio_l = policy_logps_l - ref_logps_l  # implicit reward for dispreferred
    reward_diff = beta * (log_ratio_w - log_ratio_l)
    loss = -F.logsigmoid(reward_diff).mean()
    return loss
```

Conclusion
DPO reframes the alignment problem: instead of fitting a reward function explicitly, it treats the language model itself as an implicit reward model. The key mathematical insight — that the optimal RLHF policy has a closed form that makes the partition function cancel — enables end-to-end preference learning with a single model in a single training pass. It's one of the cleanest ideas in recent alignment research.