DeepSeek-R1-Zero vs. DeepSeek-R1
DeepSeek-R1 and DeepSeek-R1-Zero are both reasoning models released by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. (DeepSeek).
---
DeepSeek-R1
Training Method
- Cold-start data ingestion: The base model is first fine-tuned on thousands of high-quality, long-CoT cold-start examples, which addresses the poor readability and language-mixing issues observed in DeepSeek-R1-Zero.
- Two-stage reinforcement learning: A reasoning-oriented RL stage is followed by a second RL stage that aligns the model with human preferences, improving multitasking versatility (a sketch of the underlying group-relative update appears after this list).
- Enhanced supervised fine-tuning: When the reasoning-oriented RL approaches convergence, rejection sampling and multi-domain datasets are used to strengthen non-reasoning skills (e.g., writing, Q&A, role-playing).
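The RL stages use Group Relative Policy Optimization (GRPO), described in the DeepSeek-R1 report: several answers are sampled per prompt, and each answer's reward is normalized against the group's mean and standard deviation, so no separate value (critic) model is needed. Below is a minimal sketch of that advantage computation; the function and variable names are illustrative, not taken from DeepSeek's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sampled answer's reward against
    the mean and standard deviation of its own group (all answers sampled
    for the same prompt), removing the need for a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four answers sampled for one prompt, scored by rule-based rewards
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # correct answers receive positive advantages
```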
Key Features
- High-performance reasoning: Excels in tasks such as math, coding, and natural language reasoning. Benchmark scores:
  - AIME 2024: 79.8% (Pass@1), slightly outperforming OpenAI-o1-1217.
  - MATH-500: 97.3% (Pass@1), comparable to OpenAI-o1-1217.
- Demonstrates expert-level ability in code competitions and engineering-related tasks.
- Support for model distillation: Users can leverage DeepSeek-R1 outputs to train smaller models (e.g., the 32B and 70B models distilled onto Qwen and Llama bases), achieving performance comparable to OpenAI o1-mini; see the loading sketch after this list.
- Open-source and flexible use: Licensed under MIT, supporting commercial applications and model modifications, making it ideal for research and enterprise AI enhancements.
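The distilled checkpoints are published on Hugging Face (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-7B up through the 32B and 70B variants) and load like any other causal LM. A minimal sketch using the transformers library follows; the prompt and sampling settings are illustrative choices, not official recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A smaller distilled variant keeps memory needs modest; the 32B/70B checkpoints
# load the same way given enough GPU memory (device_map="auto" needs accelerate).
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the remainder when 7^100 is divided by 5?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The model emits its chain of thought before the final answer, so allow a
# generous token budget; the temperature value here is an illustrative choice.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```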
---
DeepSeek-R1-Zero
Training Method
DeepSeek-R1-Zero is the first reasoning model trained purely through reinforcement learning, skipping the supervised fine-tuning stage entirely. It relies on two rule-based reward mechanisms (a minimal sketch of both follows this list):
1. Accuracy Reward: Evaluates only the final answer's accuracy, such as the final result for math problems or test case outputs for programming questions.
2. Format Reward: Requires the model to write its thought process separately on "scratch paper" (inside the designated <think> tags) and keep the reasoning out of the user-facing answer.
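Both rewards are rule-based rather than produced by a learned reward model. The paper's training template wraps reasoning in <think> ... </think> and the final answer in <answer> ... </answer>; the sketch below shows what such checks could look like. The function names are illustrative, and the exact-match accuracy check stands in for the paper's math verification or test-case execution.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 1.0 if reasoning sits inside <think>...</think> followed by a
    separate final answer inside <answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Score only the final answer: extract the <answer> block and compare it
    to the reference (for code problems, running a test suite would replace
    this string comparison)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

completion = (
    "<think>7^100 mod 5: 7 = 2 mod 5, and 2^100 = (2^4)^25 = 1 mod 5.</think>"
    "<answer>1</answer>"
)
print(format_reward(completion), accuracy_reward(completion, "1"))  # 1.0 1.0
```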
Key Features
- Strong reasoning skills: Achieved a Pass@1 score of 71.0% in the AIME2024 math competition, approaching OpenAI-o1-0912's level.
- Self-evolutionary ability: Develops complex behaviors, such as reflection, re-evaluating reasoning steps, and exploring alternative solutions during training.
- Open-source and community support: The model weights are open-source under the MIT license, enabling users to train additional models using distillation techniques.
---
Comparison: DeepSeek-R1 vs. DeepSeek-R1-Zero
- R1-Zero: Best suited for research scenarios exploring pure reinforcement learning training potential but has limited practical applications.
- R1: Designed for high-precision reasoning applications, such as programming aids, scientific problem-solving, and educational tools.
---
Knowledge Distillation
Supervised fine-tuning (SFT) is applied directly to long CoT data generated by DeepSeek-R1 (a minimal training sketch follows the list below), with the following results:
- Qwen-32B-base distilled from R1 significantly outperforms Qwen-32B trained with large-scale RL alone.
- R1-distilled Qwen-14B surpasses the Qwen team's QwQ-32B model.
- Qwen-32B trained with RL alone still underperforms DeepSeek-V3-base.
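In other words, the distillation here is ordinary supervised fine-tuning on the teacher's long-CoT outputs rather than logit matching. A minimal sketch using TRL's SFTTrainer is shown below; the dataset file, student base model, and hyperparameters are illustrative assumptions, not DeepSeek's actual recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file of DeepSeek-R1-generated traces, one record per line:
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files="r1_long_cot_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-14B",  # student base model (illustrative choice)
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="r1-distill-qwen-14b-sft",
        num_train_epochs=2,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # long CoT traces force small batches
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```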
Conclusion
- For small models, large-scale reinforcement learning may be less effective than distillation-based training.
- Distillation enables cost-effective and efficient training of reasoning models, but pushing capabilities further still depends on stronger base models and larger-scale reinforcement learning.
Source: Alang's Essay Station