Notes from RLPR
Title: RLPR: EXTRAPOLATING RLVR TO GENERAL DOMAINS WITHOUT VERIFIERS
Paper: https://arxiv.org/pdf/2506.18254
My ChatGPT: https://chatgpt.com/share/69349442-938c-800d-ac5f-2eaed704433d
This promises an approach to RL in LLMs which does not require a “verifier model” (reward model) in order to compute a “score”. RLVR mainly sticks to math and code because they're easy to verify; anywhere we cannot have a strict verifier, we cannot really apply RLFT well. Hence, RLPR = Reinforcement Learning based on Reference Probability Reward.
The basic insight is that LLM’s intrinsic probability of generating a correct answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer).
The contribution of this work can be summarized as fourfold:
(1) We present RLPR, a simple and scalable framework that extends RLVR to general domains without using external verifiers.
(2) We propose a novel probability reward that eliminates the need for external verifiers and achieves better reward quality than naive likelihood as a reward.
(3) We introduce a novel standard deviation filtering strategy that effectively stabilizes training by removing samples with low reward standard deviation.
(4) We conduct comprehensive experiments to demonstrate the effectiveness of the proposed framework on various base models from Qwen, Llama and Gemma.
RLVR (When we have verifiers)
RLPR: uses the model's per-token decoding probabilities of the reference answer as the reward signal.
The sequence-level reward is the mean of the per-token probabilities of the reference answer y*:

f_seq(p) = (1 / |y*|) · Σ_{i=1..|y*|} p_i

where p_i is the model's probability for the i-th token of y*, conditioned on the prompt and the generated reasoning.
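A minimal sketch of this sequence-mean reward; the function name and the plain-list input are illustrative, not the paper's code (in practice p_i would come from the model's softmax over the reference-answer tokens):

```python
def probability_reward(token_probs):
    """f_seq: mean of the model's per-token probabilities for the
    reference answer y*, given the prompt and the generated reasoning z.
    token_probs: list of floats in [0, 1], one per token of y*."""
    return sum(token_probs) / len(token_probs)
```

Averaging (rather than multiplying) the token probabilities keeps the reward from collapsing to ~0 on long answers where a single low-probability token would dominate a product.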
Research Idea: We could use an encoder to convert the reasoning (z) and output (y) into a vector, and also convert the reference answer (y*) to a vector using the same encoder. Since they now belong to the same latent space, semantically (or at least in latent representation) they should be similar. The distance/similarity between these vectors (assuming they capture the core concept and semantic content) could then be optimized directly, and we might not need the compute-intensive multiple roll-outs required in GRPO.
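A minimal sketch of this idea, assuming a hypothetical `encode` function (any sentence encoder mapping text to a vector); the reward is cosine similarity between the model's reasoning+output and the reference answer:

```python
import numpy as np

def embedding_reward(encode, reasoning_and_output, reference):
    """Hypothetical reward: cosine similarity between the encoding of the
    model's reasoning+output and the encoding of the reference answer y*.
    `encode` is an assumed text-to-vector function, not from the paper."""
    a = np.asarray(encode(reasoning_and_output), dtype=float)
    b = np.asarray(encode(reference), dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

This is just the idea from the note made concrete; whether such a similarity is dense and un-hackable enough to replace group-based baselines like GROP's roll-outs is an open question.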
Reward Debiasing
U_r = U_z + U_others

- U_r: the total contribution to the probability reward r
- U_z: the effect of the reasoning content
- U_others: the effect of other related factors, including the question and the reference answer
We only care about U_z, so we want to limit the effect of U_others as much as possible. The model does not expose U_z or U_others separately; we only observe U_r. So we can de-bias the reward by setting up a baseline that eliminates the effect of U_others and isolates U_z, i.e. total probability reward minus the reward attributable to (system prompt + question) alone. The suggested way to get this baseline is to compute the probability of the expected answer y* when the input does not contain the reasoning (i.e. system prompt + question only, NO z); this gives us r', our baseline. Now we can do:
reward = clip( r - r' , 0 , 1 )

the clip limits the effect of debiasing, but the core idea is captured by r - r'
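A sketch of the debiased reward; `r` is the probability reward with the reasoning z in context and `r_prime` is the baseline with z removed:

```python
def debiased_reward(r, r_prime):
    """Debiased probability reward.
    r: reward of y* with reasoning z in the context.
    r_prime: baseline reward of y* with z removed (system prompt +
    question only). Clipping keeps the result in [0, 1]."""
    return max(0.0, min(1.0, r - r_prime))
```

Note that if the answer was already likely without any reasoning (r' high), the reasoning gets little credit, which is exactly the intent of isolating U_z.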
How we arrive at the final gradient estimator of the obj fn is well described in the GPT chat linked above.
STANDARD DEVIATION FILTERING
They throw away prompts where all sampled responses get almost the same reward, and keep prompts where the rewards for different responses disagree. Disagreement = useful learning signal.
Mathematically: for each prompt, sample several responses, compute the standard deviation of their rewards, and keep the prompt only if that std exceeds a threshold.
Intuitively:
Focus RL updates on prompts where the model sometimes does well and sometimes badly; ignore those where it’s consistently great or consistently hopeless.
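The filter can be sketched as below; the threshold value is illustrative (the paper adapts its threshold during training, whereas this sketch just fixes one):

```python
import statistics

def keep_prompt(rewards, threshold=0.05):
    """Keep a prompt for the RL update only if its sampled rollouts'
    rewards disagree enough. rewards: one reward per sampled response.
    Near-zero std means 'consistently great or consistently hopeless',
    i.e. no useful learning signal."""
    return statistics.pstdev(rewards) > threshold
```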
Main intuition for when to use RLPR:
The base model is sufficiently knowledgeable and has good intuition. Use RLPR to build up and reinforce the existing “intuition” of the model and bring out more of it as part of RLFT.
Results look reasonably good, and the ablation study provides a degree of confidence in how well PR works under variations like the base prompt, and in the effectiveness of the improvements introduced (debiasing and std-dev filtering). It also shows RLPR can work in verifiable domains like MATH and still further juice out performance during fine-tuning.