Understanding DeepSeek R1


DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented several models, but the main ones among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I will not discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by massive RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a thinking tag, before answering with a final summary.
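For illustration, downstream code typically splits such an output into the reasoning trace and the final answer. A minimal sketch, assuming a <think>...</think> tag convention (the exact tag name is an assumption here):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a reasoning-model output into (chain_of_thought, final_answer),
    assuming the reasoning is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not match:
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()

thought, answer = split_reasoning(
    "<think>2 + 2 = 4, and 4 * 3 = 12.</think> The result is 12."
)
print(answer)  # -> The result is 12.
```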

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but often produces hard-to-read outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is intriguing how some languages may express certain concepts better, which leads the model to choose the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they produced such strong reasoning models, and what you can expect from each stage. This includes the issues that the resulting models from each stage have, and how they resolved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages

1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to make sure the RL process has a decent starting point. This gives a good model to begin RL from.

2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a method where you use a teacher model to improve a student model by having the teacher generate training data for the student. The teacher is typically a larger model than the student.
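As a rough sketch of what that looks like in practice (the functions below are hypothetical stand-ins, not DeepSeek's code), the teacher's reasoning traces simply become supervised training pairs for the student:

```python
def teacher_generate(prompt: str) -> str:
    # Hypothetical stand-in: in practice this would sample a full reasoning
    # trace plus answer from the teacher model (e.g., DeepSeek-R1).
    return f"<think>working through: {prompt}</think> final answer"

def build_distillation_dataset(prompts: list[str]) -> list[dict]:
    """Turn teacher outputs into (prompt, target) pairs for student SFT."""
    return [{"prompt": p, "target": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset(["What is 2 + 2?", "Solve x^2 = 9 for x."])
# A smaller student model (e.g., a Qwen or Llama variant) is then fine-tuned on this dataset.
```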

Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach especially interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on costly external models or human-graded examples as in standard RLHF, the RL used for R1 relies on simple criteria: it may give a higher reward if the answer is correct, if it follows the expected thinking/answer formatting, and if the language of the answer matches that of the prompt. Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
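To make that concrete, here is a toy sketch of such a rule-based reward; the specific checks, weights, and the <think>/<answer> tag convention below are my own illustrative assumptions, not the exact rules from the paper:

```python
import re

def rule_based_reward(response: str, reference_answer: str, prompt_lang: str = "en") -> float:
    """Toy rule-based reward combining correctness, formatting, and language consistency.
    All weights and checks here are illustrative assumptions."""
    reward = 0.0

    # Formatting: reasoning wrapped in <think> tags and a final <answer> tag present.
    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if re.search(r"<think>.*?</think>", response, re.DOTALL) and answer_match:
        reward += 0.5

    # Correctness: exact match against a verifiable reference (e.g., a math result).
    if answer_match and answer_match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    # Language consistency: crude proxy, an English prompt should get an ASCII-only response.
    if prompt_lang == "en" and response.isascii():
        reward += 0.2

    return reward

print(rule_based_reward("<think>3 * 4 = 12</think><answer>12</answer>", "12"))  # -> 1.7
```

Because the checks are cheap and deterministic, no separate learned reward model has to be trained or kept in memory during RL.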

GRPO was presented in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.