Notes from Rubric as Reward

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

My ChatGPT session: https://chatgpt.com/share/6931a51a-c7f0-800d-a5be-3a2c93f0fbdf

The main goal is to extend RLVR beyond verifiable domains like math/coding, where correctness can be evaluated exactly. A common existing workaround is preference pairs. Rubrics have recently started appearing in evals, but using them on-policy as a reward signal is unexplored, and that is what this paper is about.

In RaR (Rubrics as Reward) we will use structured criteria (rubric) as the core reward mechanism.

Treat rubrics as checklist-style supervision that produces reward signals for on-policy RL

By doing this they hope to find a middle ground between the binary correctness signals of verifiable domains and the nuance that can be learned from pairwise-preference data.

Our key contributions are as follows:

(i) We introduce Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics for multi-criteria supervision in reasoning and real-world domains.

(ii) We synthesize instance-specific rubrics for medicine and science and release the corresponding training sets, RaR-Medicine and RaR-Science.

(iii) RaR-trained models consistently outperform strong baselines and yield a stable, generalizable training signal, with gains on both rubric-scored and verifiable multiple-choice evaluation settings.

(iv) Our results demonstrate that rubric-based rewards provide stable supervision across judge sizes, helping smaller models align effectively with human preferences and maintaining robust evaluation performance from small to large judges.

Each prompt x is associated with a set of k rubric items {(w_j, c_j)}_{j=1}^{k}, where w_j ∈ ℝ denotes the weight of criterion j, and c_j : (x, ŷ) ↦ {0, 1} is a binary correctness function that indicates whether the response ŷ satisfies that criterion given the prompt.

So how are the rewards calculated?

  1. Explicit Aggregation:
     - w_j is a hard-coded weight determined beforehand.
     - An LLM-as-judge scores the output on each criterion, i.e. c_j(x, ŷ).
     - The weighted sum, normalized across all criteria, is the reward score.
  2. Implicit Aggregation:
     - No manual weights (w_j) are needed for the criteria (c_j); the reward is directly output by the "reward model" f_φ, which is itself an LLM-based judge.

Defining the score in this way makes RLVR a special case of RaR (the explicit aggregation case with k = 1 and w_1 = 1), i.e. a single criterion holds all the importance, which is how we eval in the RLVR w/ LLM-as-judge setting.
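The explicit aggregation rule above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `explicit_reward` and `judge_satisfies` are hypothetical names, and `judge_satisfies` stands in for the LLM-as-judge call that returns c_j(x, ŷ) ∈ {0, 1}.

```python
def explicit_reward(prompt, response, rubric, judge_satisfies):
    """Explicit aggregation sketch.

    rubric: list of (weight, criterion_text) pairs, i.e. (w_j, c_j).
    judge_satisfies(prompt, response, criterion) -> 0 or 1,
        a stand-in for the LLM-as-judge correctness check.
    Returns the weighted fraction of satisfied criteria in [0, 1].
    """
    total_weight = sum(w for w, _ in rubric)
    score = sum(w * judge_satisfies(prompt, response, c) for w, c in rubric)
    return score / total_weight

# RLVR as a special case: a single criterion (k = 1, w_1 = 1) means the
# reward collapses to one binary correctness check.
```

With the categorical weights used later for RaR-Explicit ({"Essential": 1.0, "Important": 0.7, ...}), the `rubric` list would simply carry those mapped weights.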

So how are the rubrics generated?

They are picked based on following desiderata:

  1. Grounded in Expert guidance
  2. Comprehensive Coverage
  3. Criterion Importance
  4. Self-Contained Evaluation

Training data is generated from LLMs, and the instance-specific rubrics are generated by LLMs conditioned on a known golden reference answer. The paper uses o3-mini and GPT-4o for this. This is how the RaR-Medicine and RaR-Science datasets are generated.

Model Training

Model: Qwen2.5-7B

Batch Size: 96

Learning Rate: 5e-6 (10% linear warmup)

GPUs : 8 x H100

Fine-tuning proceeds as follows: each prompt is sampled for 16 different completions at a fixed context length and temperature 1.0; gpt-4o-mini is then used as the reward model to assign a reward to each of the 16 outputs (either explicit or implicit aggregation can be used here); finally, GRPO uses the group of rewards, together with the reference model, to compute advantages and update the policy weights.
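The GRPO step above normalizes each completion's reward against its own group of samples. A minimal sketch of that group-relative advantage (function name and epsilon choice are mine, not the paper's):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages, as used in GRPO.

    rewards: the scalar rewards for one prompt's group of sampled
    completions (e.g. the 16 rubric rewards). Each completion's
    advantage is its reward standardized against the group mean/std.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions scoring above the group mean get positive advantages (their tokens are reinforced), those below get negative ones; no learned value network is needed.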

They train 3 different variants based on the reward aggregation strategy:

1. RaR-Predefined: fixed rubrics for all prompts; uses explicit aggregation.

2. RaR-Explicit: instance-specific rubrics with categorical labels that are manually weighted, i.e. {"Essential": 1.0, "Important": 0.7, "Optional": 0.3, "Pitfall": 0.9}; uses explicit aggregation.

3. RaR-Implicit: instance-specific rubrics; the LLM judge reads the whole context and assigns a Likert rating, which is used as the reward.
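For contrast with the explicit case, implicit aggregation (RaR-Implicit) can be sketched as follows. The names are illustrative, `judge_likert` stands in for the LLM-judge call, and the 1–10 scale is an assumption for the sketch, not a detail confirmed by these notes:

```python
def implicit_reward(prompt, response, rubric_text, judge_likert,
                    scale=(1, 10)):
    """Implicit aggregation sketch (RaR-Implicit).

    judge_likert(prompt, response, rubric_text) -> integer Likert
    rating on `scale`, produced by an LLM judge that reads the whole
    context (no manual per-criterion weights w_j).
    Returns the rating normalized to [0, 1] for use as the RL reward.
    """
    lo, hi = scale
    rating = judge_likert(prompt, response, rubric_text)
    return (rating - lo) / (hi - lo)
```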

Evals involve three parts:

  1. Rubric-based evals: since this is what we train on, this shows how well the model uses rubrics. Uses a human-expert-annotated eval set.
  2. MCQ evals: checks whether rubric-guided reasoning transfers outside the training distribution.
  3. LLM-judge alignment evals.

Results:

1. RaR-Predefined variant, which applies a fixed list of generic rubrics to every prompt (no instance-specific synthesis), underperforms because generic criteria miss prompt-specific requirements and common failure modes, producing misaligned reward signals.

2. RaR-Explicit: fixed weights offer more control and are easy to interpret, but can be brittle and hard to tune.

3. RaR-Implicit shows small but consistent gains over the Reference-Likert baseline.

  1. Expert guidance is crucial for synthetic rubric generation: this is expected, unsurprisingly.

  2. Including pitfall-criteria weights in the RaR-Explicit rubrics makes minimal difference in the ablation setting.

  3. Rubric-generator LLM quality matters, tested by generating rubrics without reference answers to isolate LLM quality. Larger, more capable models do better, but they still underperform compared to rubrics generated with references.

  4. Improved alignment to preferences across all sizes of LLM judges; the gains are larger with smaller judges, and the differences shrink among larger models.


Implementation Notes

How do we use the dataset?

 
## Dataset Structure
 
### [Data Fields](https://huggingface.co/datasets/anisha2102/RaR-Science#data-fields)
 
Each example contains:
 
- `question`: the open-ended medical question
 
- `reference_answer`: high-quality expert reference response
 
- `question_source`: source of the original question
 
- `rubric_list`: list of rubric criteria used to evaluate the model response
 
- `rubric`: dictionary mapping each rubric criterion to a score
 
- `rubric_count`: number of rubric criteria used
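A hedged sketch of how one record with these fields might be consumed. The example values below are made up for illustration (real records come from the Hugging Face datasets, e.g. `datasets.load_dataset("anisha2102/RaR-Science")`), and `judge_input` is a hypothetical helper, not an API from the release:

```python
# Illustrative record mirroring the documented fields; values invented.
example = {
    "question": "Why does aspirin reduce fever?",
    "reference_answer": "It inhibits COX enzymes, lowering prostaglandin synthesis.",
    "question_source": "synthetic",
    "rubric_list": [
        "Mentions COX inhibition",
        "Explains reduced prostaglandin synthesis",
    ],
    "rubric": {
        "Mentions COX inhibition": 1.0,
        "Explains reduced prostaglandin synthesis": 0.7,
    },
    "rubric_count": 2,
}

def judge_input(example, response):
    """Assemble the (prompt, response, rubric_list) triple that gets
    fed to the LLM judge in the reward function."""
    criteria = "\n".join(f"- {c}" for c in example["rubric_list"])
    return (f"Question:\n{example['question']}\n\n"
            f"Response:\n{response}\n\n"
            f"Rubric criteria:\n{criteria}")
```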
 

GRPO is used for RLFT.

In the reward function we use LLM-as-judge; the input to the judge prompt is (prompt, response, rubric_list). This is the only place the rubrics are used, i.e. to evaluate the model output and assign a reward. The rubrics are not involved in the model's gradients directly, so training will not inherently make the model adopt a rubric-based thinking process.