Notes from Deepseek-R1
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Models:
- R1-Zero (only RL, no SFT)
- R1 (2-stage SFT + RL)
Core Contributions / Insights:
- Reasoning can arise purely from RL, without any SFT (R1-Zero).
- Training a bigger model for reasoning and then distilling it into a smaller model works better than teaching the smaller model to reason directly. This also means DSR1 can be used to generate reasoning training data for tuning smaller models.
Tunix Hack Note: Since Gemma also ships as a base model with no SFT, we can potentially do the same thing and build a “Gemma-3-r-zero” and a “Gemma-3-r-1”.
The RL Approach:
- This paper is landmark work for popularizing “GRPO” (Group Relative Policy Optimization), which was originally introduced in the DeepSeekMath paper.
- Reward Modelling:
- Accuracy rewards: is the final generated answer correct?
- Format rewards: enforce the expected output structure via tags:
<reasoning_start>, <reasoning_end>, <answer_start>, <answer_end>
Note: Nowhere in the process is the reasoning itself specifically rewarded; we seem to assume that getting the right answer equates to reasoning correctly, which may or may not be true in all cases.
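To make the two rule-based rewards concrete, here is a minimal sketch (my own construction, not DeepSeek’s code — tag names follow the notes above, and the exact-match answer check is a simplification of the paper’s verifiers):

```python
import re

# Expect: <reasoning_start>...<reasoning_end> followed by <answer_start>...<answer_end>
TAG_PATTERN = re.compile(
    r"<reasoning_start>.*<reasoning_end>\s*<answer_start>(.*)<answer_end>",
    re.DOTALL,
)

def rule_based_reward(output: str, gold_answer: str) -> float:
    """Format reward (tags present) + accuracy reward (extracted answer matches gold)."""
    m = TAG_PATTERN.search(output)
    format_reward = 1.0 if m else 0.0
    answer = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0
    return format_reward + accuracy_reward
```

Note the key property: the reasoning span itself is never scored, only its presence (format) and the final answer (accuracy).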
Paper Idea / Tunix Hack Note: We could possibly train the model to reason better with a schedule-style, token-count-based reward function that starts off rewarding medium-to-high reasoning token output and later pulls it back toward a mid token count. Since reasoning seems to be learnt well when modeled as fully open-ended (as in R1/R1-Zero, where it is not even part of the reward), something like this could provide external pressure to improve both reasoning context size and quality over time.
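A rough sketch of this (untested) idea — all targets and widths below are placeholder assumptions, not values from any paper: the reward peaks at a scheduled target length that drifts from a high early budget toward a mid-range one as training progresses.

```python
import math

def length_reward(num_reasoning_tokens: int, step: int, total_steps: int,
                  early_target: int = 4000, late_target: int = 1500,
                  width: float = 2000.0) -> float:
    """Gaussian-shaped bonus peaking at a target length that anneals over training."""
    frac = min(step / total_steps, 1.0)
    target = early_target + frac * (late_target - early_target)
    return math.exp(-((num_reasoning_tokens - target) / width) ** 2)
```

Early in training (step 0) long traces near 4000 tokens score highest; by the end the peak has moved to ~1500 tokens, gently pressuring the model to compress its reasoning.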
Similar research is already being explored here: https://chatgpt.com/share/692d4d85-80ac-800d-895b-db88874ac4b7
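As a concrete illustration of the group-relative idea behind GRPO, here is a minimal sketch (my own construction, not DeepSeek’s code): for each prompt a group of G completions is sampled and scored, and each reward is normalized against its group — no separate value/critic model is needed.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Advantage for each of G completions of one prompt: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all rewards tie: no learning signal
    return [(r - mu) / sigma for r in rewards]
```

Completions that beat their group’s average get positive advantage (pushed up by the policy-gradient update), the rest get negative advantage — the group baseline plays the role a learned critic would in PPO.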
- R1-zero prompt template:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively,
i.e., <think> reasoning process here </think>
<answer> answer here </answer>.
User: prompt.
Assistant:
R1-Zero Aha Moment
Aha Moment of DeepSeek-R1-Zero A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
Main Problems with R1-zero
Poor readability and language mixing. How were these solved? By training DeepSeek-R1.
DeepSeek-R1: Reinforcement Learning with Cold Start
Tunix Hack Note: DSR1 has a large ~800k-sample training set for reasoning which I could use — is this available / open source?
Other potential training data: MMLU, DROP, GPQA Diamond, and SimpleQA
Quite simply, R1 is built to solve the problems with R1-Zero by addressing its shortcomings and iteratively training for reasoning. The training process looks something like this:
- We start with the DeepSeek-V3 base model.
1. SFT~1:
- Bad readability → SFT before RL on long CoT data to avoid a cold start and introduce better structure for reasoning. Prompt template:
|special_token|<reasoning_process>|special_token|<summary>
- Language mixing → a reward function penalizing diversity of languages in the output tokens, a.k.a. the “language-consistency reward”
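The paper describes the language-consistency reward as the proportion of target-language words in the CoT; it gives no formula, so the sketch below is my construction. Target language = English is crudely approximated by ASCII-alphabetic words — a real system would use a proper language-ID model.

```python
def language_consistency_reward(text: str) -> float:
    """Fraction of whitespace-separated words that are pure ASCII (proxy for English)."""
    words = text.split()
    if not words:
        return 0.0
    ascii_words = sum(1 for w in words if all(ord(c) < 128 for c in w))
    return ascii_words / len(words)
```

A fully English CoT scores 1.0; mixing in other scripts drops the reward proportionally, which is the pressure that suppresses language mixing.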
2. RLFT w/ GRPO:
- Same as R1-Zero: RL on verifiable domains like math, coding, logic, etc.
- Performance is better than R1-Zero thanks to the improved cold start done in SFT-1.
3. Rejection Sampling and Curating Reasoning Data:
- They generate some 600k samples of reasoning-based CoT via rejection sampling from the model checkpointed at the previous step (i.e., after SFT-1 and RLFT w/ GRPO). The model is also sampled for non-verifiable domains here.
- They also throw in ~200k “non-reasoning” samples (plain old text data) reused from the SFT data of DeepSeek-V3.
4. SFT~2:
- SFT-2 is now performed on the new 800k samples (600k reasoning + 200k non-reasoning), fine-tuning the DeepSeek-V3 base model.
5. RL “for all scenarios” / Alignment:
- Here they try to align the model toward two concepts: “helpfulness” and “harmlessness”.
- This is done using two levers: a combination of reward signals and the prompt distribution.
- “Helpfulness” is gauged only on the “summary” (see prompt template from SFT-1).
- “Harmlessness” is gauged on the entire output (reasoning + summary).
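The different scoping of the two signals can be sketched as follows; `split_output` and the scorer arguments are hypothetical placeholders (the paper does not publish this code), but the split mirrors the description above: helpfulness sees only the summary, harmlessness sees everything.

```python
def split_output(output: str, sep: str = "|special_token|"):
    """Split '|special_token|<reasoning>|special_token|<summary>' into its two parts."""
    parts = [p for p in output.split(sep) if p]
    reasoning, summary = parts[0], parts[-1]
    return reasoning, summary

def alignment_reward(output: str, helpfulness_scorer, harmlessness_scorer) -> float:
    """Helpfulness judged on the summary only; harmlessness on the full output."""
    _, summary = split_output(output)
    return helpfulness_scorer(summary) + harmlessness_scorer(output)
```

Keeping helpfulness scoring off the reasoning trace avoids penalizing long or messy exploration, while harmlessness still inspects the full trace for unsafe content.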
Other Insights
- Distillation of a larger model that has been taught to reason into a smaller model seems to work better than trying to teach the smaller model to reason directly. (This feels kind of bad and is a good indicator of why we need alternative approaches to reach AGI; scaling alone won't get there.)
First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
Not so successful attempt: Process Reward Model (PRM) - why?
- Explicitly generating scores for reasoning steps is hard.
- Scoring/estimating whether an intermediate trajectory is on the right track toward a correct answer is hard.
- Models start reward hacking when PRM is used.
In conclusion, PRM might be good for reranking and guided search but not as the base reward mechanism for the RLFT.
Not so successful attempt: Monte Carlo Tree Search (MCTS)
- The main problem: the token search space is too large, so we have to restrict it for the search to be practical; but once we limit it, we have a high chance of ending up in local optima.
- The value function can only grade the final answer, which makes the reward sparse. Even in cases where the reasoning is right but the final answer is slightly off, the entire output is penalized. The “value model” thus becomes the main bottleneck: the branching/search completely relies on the baseline set by the value model.