In early 2025, DeepSeek released a bold new language model: DeepSeek-R1-Zero, trained entirely via reinforcement learning—without supervised fine-tuning. This means the model wasn’t taught with examples; it had to learn to reason by trial, error, and reward. The result? A model that shows flashes of genuine reasoning, and a new blueprint for how language models might evolve.
This post breaks down the DeepSeek-R1 research, demystifies how Group Relative Policy Optimization (GRPO) works, and explores what it means to train an AI to reason—without ever showing it the answer.
What Makes DeepSeek-R1-Zero Different
Most large language models today rely on supervised learning: they’re trained on datasets where inputs are matched with correct outputs. DeepSeek-R1-Zero skips this step entirely. Instead, it starts from a base model (DeepSeek-V3-Base), lets it generate answers to math and logic problems, and scores the results based on correctness. Over time, it improves without ever seeing a labeled example.
The results are remarkable: DeepSeek-R1-Zero reaches 71.0% pass@1 accuracy on AIME 2024 (a high school math competition), and 86.7% with majority voting over 64 samples, matching models trained with far more human supervision.
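For context, majority voting just means sampling many answers and keeping the most common one. Here is a minimal sketch in Python, where `generate` is a hypothetical callable that samples one completion for a prompt and returns its extracted final answer:

```python
from collections import Counter

def majority_vote(generate, prompt, n_samples=64):
    """Sample several answers and return the most common one.

    `generate` is a hypothetical callable standing in for a real model call;
    it should return the final answer string for one sampled completion.
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # winning answer and its vote share
```

The idea is that a model whose individual samples are only sometimes right can still be very reliable once its answers are aggregated.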
Why Skipping Supervised Fine-Tuning Matters
Supervised Fine-Tuning (SFT) is the conventional way to teach language models. You give the model an input, show it the right output, and adjust its weights to imitate the correct answer. This works—but it limits what the model can discover on its own.
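For contrast with what follows, here is roughly what that SFT objective looks like in code. This is a minimal sketch using a small placeholder model from Hugging Face Transformers, not DeepSeek's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small placeholder for illustration; it is not what DeepSeek used.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, target: str) -> float:
    """One supervised step: push the model to imitate the labeled answer."""
    batch = tokenizer(prompt + target, return_tensors="pt")
    # Using input_ids as labels gives standard next-token cross-entropy;
    # a real pipeline would usually mask the prompt tokens out of the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Every gradient step pulls the model toward a human-provided target; the model never has to decide for itself what a good answer looks like.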
By skipping SFT, DeepSeek-R1-Zero becomes a kind of cognitive experiment. Can a model learn to reason, not by mimicking, but by exploring—guided only by reward? The answer, surprisingly, is yes.
How GRPO Teaches Models to Reason
At the heart of DeepSeek-R1-Zero is a reinforcement learning method called Group Relative Policy Optimization (GRPO). It works like this:
- For a given question, the model generates a batch of possible answers.
- Each answer is scored using simple rules: did it get the math right? Did it use the correct format?
- Rather than needing a complex value model, GRPO just compares the answers to each other and updates the model to prefer the better ones.
This relative scoring makes training stable and efficient, even at scale. And because the rewards are simple rule-based checks rather than a learned reward model, there is far less room for “reward hacking,” where a model learns to game the reward signal instead of genuinely solving the problem.
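To make the idea concrete, here is a minimal sketch of a rule-based reward and the group-relative advantage at the heart of GRPO. The tag format, point values, and example completions are illustrative assumptions, not DeepSeek's exact implementation:

```python
import re
import torch

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy reward: credit for the expected format plus credit for a correct final answer.

    The <think>/<answer> tags and the point values are illustrative assumptions,
    not DeepSeek's published reward rules.
    """
    reward = 0.0
    if "<think>" in completion and "<answer>" in completion:
        reward += 0.1  # format reward
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward
    return reward

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO's core idea: score each sample relative to its own group,
    A_i = (r_i - mean(r)) / std(r), so no learned value model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One question, a group of four sampled completions, reference answer "42".
completions = [
    "<think>21 * 2 = 42</think><answer>42</answer>",
    "<think>rough guess</think><answer>41</answer>",
    "no tags, just 42",
    "<think>21 + 21 = 42</think><answer>42</answer>",
]
rewards = torch.tensor([rule_based_reward(c, "42") for c in completions])
advantages = group_relative_advantages(rewards)
# The policy update then raises the log-probability of above-average samples,
# roughly loss = -(advantages * sequence_log_probs).mean(); the full GRPO
# objective adds PPO-style ratio clipping and a KL penalty to a reference model.
```

Because each sample is judged only against its siblings from the same question, the model gets a useful learning signal even when absolute reward values vary wildly across problems.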
When Reasoning Emerges on Its Own
One of the most fascinating aspects of DeepSeek-R1-Zero is its self-evolution. As training progresses, the model begins generating longer and more structured chains of thought—without being explicitly told to do so. It even begins to “reflect” and revise its answers midstream.
Researchers observed moments where the model pauses, reconsiders, and tries again. These “aha moments” aren’t scripted—they emerge naturally from the RL incentives. It’s a striking example of intelligence arising from constraints, not instruction.
What DeepSeek-R1 Adds on Top
While R1-Zero is impressive, it’s not always user-friendly: its answers can be messy, mix languages mid-response, or be hard to follow. DeepSeek-R1 builds on it by first fine-tuning on a few thousand curated long chain-of-thought examples (the “cold start”), then running further rounds of RL and supervised fine-tuning.
The result is a model that maintains high reasoning performance while producing clearer, more readable outputs. It reaches 97.3% pass@1 on MATH-500, and outperforms OpenAI’s o1-mini across multiple benchmarks.
Making Smaller Models Smarter
DeepSeek-R1’s final act is distillation: using its outputs to train smaller, more efficient models. By fine-tuning models from the Qwen2.5 and Llama 3 families (from 1.5B up to 70B parameters) on roughly 800,000 examples generated by R1, the team produces compact models that outperform much larger ones on math and logic benchmarks.
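In code, the core of this distillation step is simply collecting the teacher's reasoning traces as supervised targets. A minimal sketch, where `teacher_generate` is a hypothetical wrapper around DeepSeek-R1 (or any strong reasoner):

```python
import json

def build_distillation_set(prompts, teacher_generate, out_path="distill.jsonl"):
    """Write (prompt, teacher completion) pairs to a JSONL file.

    `teacher_generate` is a hypothetical callable wrapping the teacher model;
    its full chain of thought plus final answer becomes the training target.
    """
    with open(out_path, "w") as f:
        for prompt in prompts:
            completion = teacher_generate(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

# A student checkpoint (e.g. a 7B Qwen or 8B Llama model) is then fine-tuned on
# these pairs with ordinary supervised cross-entropy, as in the SFT sketch above.
```

Notably, the student never runs RL itself; it inherits the teacher's reasoning style purely through imitation.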
Distillation turns one powerful model into many smaller ones—paving the way for reasoning-capable LLMs that are faster, cheaper, and open source.
Final Thoughts
DeepSeek-R1-Zero is more than a model—it's a proof of concept. It shows that reasoning doesn’t have to be taught; it can emerge. And with the right reward signals, models can discover powerful strategies on their own.
For researchers, this is a call to rethink the training pipeline. For developers, it's a preview of what comes next: models that don’t just answer, but think.