
Verify The Unverifiable

A common problem in building verifiable datasets for RL is finding verifiable answers for seemingly non-verifiable domains.

I spent an inordinate amount of time trying to come up with clever verification strategies (p/q truth tables, Lean verifiers, etc.). However, the solution is actually quite stupid: just turn it into a multiple-choice problem.

That’s it.

That’s the blog.

Examples

We take an open-ended question along with the answer that we think is best (note: this is our interpretation of the law, which is part of what makes law extremely hard). The question is "What is the legal significance of 'Nichols v. Union Underwear Co.'?", to which there are a million almost-correct answers. Our ground truth is: "A product is considered unreasonably dangerous and therefore defective if its risk of harm is so great that a reasonably careful manufacturer, fully aware of the danger, would have chosen not to release it to the market."

To validate this, one could probably just use an LLM-as-a-judge and ask it to analyze the answer using some kind of prompt like:

Given the following question and the corresponding ground-truth answer, analyze the following answer and return True if the answer is in line with the ground truth.

QUESTION: {question}
GROUND TRUTH: {ground_truth}
ANSWER: {answer}
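
A minimal sketch of how that judge prompt might be filled in and its reply parsed. The names `build_judge_prompt` and `parse_judge_verdict` are hypothetical, and the call to the judge model itself is left out, since which model or API to use is not specified here.

```python
import re

# Template mirroring the judge prompt above.
JUDGE_TEMPLATE = (
    "Given the following question and the corresponding ground-truth answer, "
    "analyze the following answer and return True if the answer is in line "
    "with the ground truth.\n\n"
    "QUESTION: {question}\n"
    "GROUND TRUTH: {ground_truth}\n"
    "ANSWER: {answer}"
)


def build_judge_prompt(question: str, ground_truth: str, answer: str) -> str:
    """Fill the judge template with one example (hypothetical helper)."""
    return JUDGE_TEMPLATE.format(
        question=question, ground_truth=ground_truth, answer=answer
    )


def parse_judge_verdict(reply: str) -> bool:
    """Map a free-text judge reply to a bool.

    Assumption: the first standalone 'true' or 'false' in the reply is the
    verdict; a reply containing neither counts as False.
    """
    match = re.search(r"\b(true|false)\b", reply, re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "true"
```

Even in this toy form, the parsing step has to guess what the judge meant, which is exactly the fuzziness a multiple-choice format avoids.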

This is fine, but to use R1/o1-style RL (GRPO/PPO) we can simplify things further by turning our original question into a simple multiple-choice question, which makes it trivially verifiable.

create multiple choice question

import random


def create_multiple_choice(dataset, question='question'):
    # deduplicate so the same answer text can never appear twice among the options
    all_answers = list(dict.fromkeys(dataset['ground_truth']))

    def transform_example(example):
        # very simple algo:
        # 1. get 3 random wrong answers (excluding the correct one)
        # 2. create multiple choice options
        # 3. shuffle options and keep track of correct answer
        # 4. format the question with options

        wrong_answers = random.sample(
            [ans for ans in all_answers if ans != example['ground_truth']], 3
        )

        options = [example['ground_truth']] + wrong_answers
        random.shuffle(options)
        # find where the correct answer landed and convert 0->A, 1->B, etc
        correct_letter = chr(65 + options.index(example['ground_truth']))

        formatted_question = (
            f"{example[question]} "
            f"A: {options[0]} "
            f"B: {options[1]} "
            f"C: {options[2]} "
            f"D: {options[3]}"
        )

        return {
            'question': formatted_question,
            'ground_truth': correct_letter,
            'id': example['id']
        }

    return dataset.map(transform_example)
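
To see what this transformation produces without loading a real dataset, here is a dependency-free sketch over a plain list of dicts. The toy rows, the `make_mc_example` name, and the seeded `random.Random` are all assumptions for illustration, not part of the original pipeline.

```python
import random

# Toy rows standing in for the real dataset (made up for illustration).
rows = [
    {"id": 0, "question": "What is 2 + 2?", "ground_truth": "4"},
    {"id": 1, "question": "What is the capital of France?", "ground_truth": "Paris"},
    {"id": 2, "question": "What color is a clear daytime sky?", "ground_truth": "Blue"},
    {"id": 3, "question": "How many days are in a week?", "ground_truth": "7"},
    {"id": 4, "question": "What is frozen water called?", "ground_truth": "Ice"},
]


def make_mc_example(row, all_answers, rng, num_wrong=3):
    """Turn one open-ended row into a lettered multiple-choice row."""
    # Distinct wrong answers only, so the correct option appears exactly once.
    pool = sorted({a for a in all_answers if a != row["ground_truth"]})
    options = [row["ground_truth"]] + rng.sample(pool, num_wrong)
    rng.shuffle(options)
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lettered = " ".join(f"{l}: {o}" for l, o in zip(letters, options))
    return {
        "id": row["id"],
        "question": f"{row['question']} {lettered}",
        # Letter of the slot where the correct answer ended up after shuffling.
        "ground_truth": letters[options.index(row["ground_truth"])],
    }


rng = random.Random(0)  # seeded so the shuffle is reproducible
all_answers = [r["ground_truth"] for r in rows]
mc_rows = [make_mc_example(r, all_answers, rng) for r in rows]

for mc in mc_rows:
    print(mc["ground_truth"], "|", mc["question"])
```

Seeding the generator is a design choice worth considering for real runs too: it keeps the answer key stable across regenerations of the dataset.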

create a simple correctness verifier

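import re


def mc_accuracy_reward(completions, prompt, ground_truth, **kwargs):
    # each completion is a list of chat messages; score the first message's text
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, ground_truth):
        # pull the single letter out of <answer> ... </answer>
        answer_match = re.search(r'<answer>\s*([A-Z])\s*</answer>', content, re.IGNORECASE)
        if answer_match:
            answer_text = answer_match.group(1)
            # exact, case-insensitive match against the correct letter
            reward = float(sol.strip().upper() == answer_text.upper())
        else:
            # no parsable answer tag means no reward
            reward = 0.0
        rewards.append(reward)
    return rewards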

actual example from training

Given how easy it is to do this, we basically just hack on some verifiers in open-r1’s GRPO implementation and let it rip.

prompt:

"What is the legal significance of 'Nichols v. Union Underwear Co.'? A: If a defendant submits an affidavit containing allegations that, if true, would invalidate the purported service-of-process and nullify the court s personal jurisdiction over the defendant, then the trial court must hold an evidentiary hearing to determine whether service-of-process was properly completed. B: Fully distributed cost is not a proper test for determining whether an entity engaged in predatory pricing. C: A product is considered unreasonably dangerous and therefore defective if its risk of harm is so great that a reasonably careful manufacturer, fully aware of the danger, would have chosen not to release it to the market. D: A transferor of a negotiable instrument containing an alteration can be held liable for the amount due on the instrument for breaching a transfer warranty."

response:

<think>
Okay, so I need to figure out the legal significance of "Nichols v. Union Underwear Co." from the given options. Let me break this down step by step. 

First, the question is asking about the legal significance of this case. The options are A, B, C, and D, each making a different legal point. I remember that "Nichols v. Union Underwear Co." is a case in distributive torts, specifically related to deceptive practices by companies that manufacture latex products. 

So, the first thought is that the correct answer should be about whether reallocating distribution is allowed for profit without creating abuse of_secretary. That makes sense because companies might have made a profit in the past by distributing products through third parties. If another company wants to reallocate distribution, they have to do it in a way that doesn't harm their profits. If it's done truthfully, it's okay; but not if it's unfair.

Looking at the options, option A mentions service-of-process and the trial court's duty to hold an evidentiary hearing to determine if service was completed properly. That seems related to whether an affidavit is too vague to affect the validity of service-of-process, which is also part of the same legal issue.

Option B talks about using a fully distributed cost as a standard to determine if an entity engaged in predatory pricing. That rings a bell because in product liability cases, certain costs are considered fair use for determining proper pricing. The court assesses whether the costs are reasonable to protect the buyer and make the practice respectful of price.

Option C is about product safety; specifically, if a product has a very high risk of harm, it's considered defective even if an ordinary company would not have placed it on the market. That seems foundational to product liability laws, ensuring that companies take serious enough risks to put products on the market.

Option D is about breach of a transfer warranty, which relates to selling or transferring goods without proper guarantees. The court can hold that the company was negligent in such a transaction, which affects the amount due.

Putting it together, "Nichols v. Union Underwear Co." likely addressed one of these points. The governing law was likely the Uniform Commercial Code, and the case involved product liability, so both options A and D are about product liability, but one of them is more about the law relating to distribution while reallocated.

The court probably struck down a drafting company's ad, assuming that reallocating distribution for profit was still allowed without creating abuse of_secretary or similar language. So, option A seems relevant. Alternatively, it may have led to finding out that the company would not have placed the product on the market based on its calculations, relying on the lower bound of fair use for product liability.

Option B seems more about the cost as a standard, which is part of product liability, but I'm not entirely sure about the specific rule. Option C is about product safety, which is foundational and likely a part of the legal framework, but perhaps not addressed in the specific case.

Therefore, considering the context and the case, the most probable answer is A: a trial court must hold an evidentiary hearing to determine if service-of-process was completed properly.
</think>

<answer> A </answer>

validation

Example answer: A
Correct Answer: C
Example Correct?: False
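
The three lines above can be produced by a small scoring helper. This is a sketch rather than the evaluation script actually used: `extract_choice` and `score_responses` are hypothetical names, and it assumes every response wraps its final letter in `<answer>` tags as in the example response.

```python
import re

# Assumes four options (A-D), as in the prompt format shown earlier.
ANSWER_RE = re.compile(r"<answer>\s*([A-D])\s*</answer>", re.IGNORECASE)


def extract_choice(response):
    """Return the letter inside <answer>...</answer>, or None if absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None


def score_responses(responses, correct_letters):
    """Compare each extracted letter to its key; return (flags, accuracy)."""
    flags = [
        extract_choice(resp) == key.upper()
        for resp, key in zip(responses, correct_letters)
    ]
    accuracy = sum(flags) / len(flags) if flags else 0.0
    return flags, accuracy


# Abbreviated stand-in for the response shown above.
responses = ["<think>...</think>\n\n<answer> A </answer>"]
flags, accuracy = score_responses(responses, ["C"])
print(f"Example answer: {extract_choice(responses[0])}")
print("Correct Answer: C")
print(f"Example Correct?: {flags[0]}")
```

A response with no parsable `<answer>` tag is counted as wrong, which matches how the reward function treats it.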

oo lala

Here’s a training plot from a 1.5B Qwen model being RL’d on the very difficult FVLegalBench. At baseline (i.e., no RL/finetuning), the model gets around 1% accuracy on test; after RL, it gets about 30% accuracy on test. Pretty wild considering that’s right around where o1/o3 get…

Verify Verify Verify