DeepSeek-R1-Zero vs. DeepSeek-R1
DeepSeek-R1 and DeepSeek-R1-Zero are both reasoning models released by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. (DeepSeek).
---
DeepSeek-R1
Training Method
- Cold-start data ingestion: The base model is first fine-tuned on thousands of high-quality, long-CoT cold-start examples, which addresses the poor readability and language-mixing issues observed in DeepSeek-R1-Zero.
- Two-stage reinforcement learning: A reasoning-oriented RL stage is followed by a second RL stage that aligns the model with human preferences, improving multitasking versatility (a sketch of the underlying group-relative update appears after this list).
- Enhanced supervised fine-tuning: When the reasoning-oriented RL approaches convergence, rejection sampling and multi-domain datasets are used to strengthen non-reasoning skills (e.g., writing, Q&A, role-playing).
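The RL stages use Group Relative Policy Optimization (GRPO), described in the DeepSeek-R1 report: several answers are sampled per prompt, and each answer's reward is normalized against the group's mean and standard deviation, so no separate value (critic) model is needed. Below is a minimal sketch of that advantage computation; the function and variable names are illustrative, not taken from DeepSeek's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sampled answer's reward against
    the mean and standard deviation of its own group (all answers sampled
    for the same prompt), removing the need for a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four answers sampled for one prompt, scored by rule-based rewards
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # correct answers receive positive advantages
```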
Key Features
- High-performance reasoning: Excels in tasks such as math, coding, and natural language reasoning. Benchmark scores:
  - AIME 2024: 79.8% (Pass@1), slightly outperforming OpenAI-o1-1217.
  - MATH-500: 97.3% (Pass@1), comparable to OpenAI-o1-1217.
- Demonstrates expert-level ability in code competitions and engineering-related tasks.
- Support for model distillation: Users can leverage DeepSeek-R1 outputs to train smaller models (e.g., the 32B and 70B models distilled onto Qwen and Llama bases), achieving performance comparable to OpenAI o1-mini; see the loading sketch after this list.
- Open-source and flexible use: Licensed under MIT, supporting commercial applications and model modifications, making it ideal for research and enterprise AI enhancements.
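The distilled checkpoints are published on Hugging Face (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B up through the 32B and 70B variants) and load like any other causal LM. A minimal sketch using the transformers library follows; the prompt and sampling settings are illustrative choices, not official recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A smaller distilled variant keeps memory needs modest; the 32B/70B checkpoints
# load the same way given enough GPU memory (device_map="auto" needs accelerate).
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the remainder when 7^100 is divided by 5?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The model emits its chain of thought before the final answer, so allow a
# generous token budget; the temperature value here is an illustrative choice.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```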
---
DeepSeek-R1-Zero
Training Method
DeepSeek-R1-Zero is the first reasoning model trained purely through reinforcement learning, skipping the supervised fine-tuning stage entirely. It relies on two rule-based reward mechanisms (a minimal sketch of both follows this list):
1. Accuracy Reward: Evaluates only the final answer's accuracy, such as the final result for math problems or test case outputs for programming questions.
2. Format Reward: Requires the model to write its thought process separately on "scratch paper" (inside the designated <think> tags) and keep the reasoning out of the user-facing answer.
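Both rewards are rule-based rather than produced by a learned reward model. The paper's training template wraps reasoning in <think> ... </think> and the final answer in <answer> ... </answer>; the sketch below shows what such checks could look like. The function names are illustrative, and the exact-match accuracy check stands in for the paper's math verification or test-case execution.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 1.0 if reasoning sits inside <think>...</think> followed by a
    separate final answer inside <answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Score only the final answer: extract the <answer> block and compare it
    to the reference (for code problems, running a test suite would replace
    this string comparison)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

completion = (
    "<think>7^100 mod 5: 7 = 2 mod 5, and 2^100 = (2^4)^25 = 1 mod 5.</think>"
    "<answer>1</answer>"
)
print(format_reward(completion), accuracy_reward(completion, "1"))  # 1.0 1.0
```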
Key Features
- Strong reasoning skills: Achieved a Pass@1 score of 71.0% in the AIME2024 math competition, approaching OpenAI-o1-0912's level.
- Self-evolutionary ability: Develops complex behaviors, such as reflection, re-evaluating reasoning steps, and exploring alternative solutions during training.
- Open-source and community support: The model weights are open-source under the MIT license, enabling users to train additional models using distillation techniques.
---
Comparison: DeepSeek-R1 vs. DeepSeek-R1-Zero
- R1-Zero: Best suited for research scenarios exploring pure reinforcement learning training potential but has limited practical applications.
- R1: Designed for high-precision reasoning applications, such as programming aids, scientific problem-solving, and educational tools.
---
Knowledge Distillation
Supervised fine-tuning (SFT) is applied directly to long CoT data generated by DeepSeek-R1 (a minimal training sketch follows the list below), with the following results:
- Qwen-32B-base distilled from R1 significantly outperforms Qwen-32B trained with large-scale RL alone.
- R1-distilled Qwen-14B surpasses the Qwen team's QwQ-32B model.
- Qwen-32B trained with RL alone still underperforms DeepSeek-V3-base.
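In other words, the distillation here is ordinary supervised fine-tuning on the teacher's long-CoT outputs rather than logit matching. A minimal sketch using TRL's SFTTrainer is shown below; the dataset file, student base model, and hyperparameters are illustrative assumptions, not DeepSeek's actual recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file of DeepSeek-R1-generated traces, one record per line:
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files="r1_long_cot_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-14B",  # student base model (illustrative choice)
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="r1-distill-qwen-14b-sft",
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # long CoT traces force small batches
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```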
Conclusion
- For small models, large-scale reinforcement learning may be less effective than distillation-based training.
- Distillation enables cost-effective and efficient training of reasoning models, but pushing capabilities further still depends on stronger base models and larger-scale reinforcement learning.
Source: Alang's Essay Station