
Verifiers and Legal Benchmarks

Another hastily written blog

Verifiers

Part of the law is verifying truth. In textbook form this often means converting reasoning into verifiable symbolic logic, e.g. P/Q truth tables. This is easy in mathematics, because there are discrete steps to get from one statement to the next and each substep is verifiable in and of itself. But in argumentation the steps aren’t so clean-cut as the state progresses; e.g. ‘the argument’ can often only be verified at the end.
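As a concrete illustration of the truth-table idea, here's a minimal sketch (mine, not from any of the code referenced in this post): an argument is valid iff every truth assignment that satisfies the premises also satisfies the conclusion.

```python
from itertools import product

def is_valid(premises, conclusion, variables):
    """Brute-force truth-table check: do the premises entail the conclusion?"""
    for values in product([True, False], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample row
    return True

# Modus ponens: P, P -> Q  |=  Q
premises = [lambda e: e["P"], lambda e: (not e["P"]) or e["Q"]]
conclusion = lambda e: e["Q"]
print(is_valid(premises, conclusion, ["P", "Q"]))  # True
```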

So what we’re attempting to do is take an LLM and turn its logical arguments into a series of verifiable statements.

This is somewhat in line with DeepSeek’s Lean prover, VerifierQ, and Ilya’s old work. There are tons of papers on verifying math. So we’re not trying to do anything new here; we’re just trying to apply it to the legal domain.

It’s funny: a proof can be logically false (the P/Q table will be invalid) but stand true on flawed law (which is its own can of worms). So what tricks can we use to decompose and validate legal reasoning?

There are some cool ways to do this: text -> break into logical structures -> verify logical structures -> convert to Lean -> verify proof -> proceed. Most models, even the best ones, aren’t very good at one-shotting this, so you have to experiment a bit to get the right structures. But spending time here is a critical step in verifying reasoning/eval sets built on complex reasoning tasks.
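Here's a rough sketch of what that pipeline might look like. None of these helper names come from the actual verifier; in a real implementation each stage would be an LLM call or an external checker (e.g. the Lean toolchain).

```python
from dataclasses import dataclass

@dataclass
class LogicalStructure:
    premises: list[str]
    conclusion: str

def extract_structures(text: str) -> list[LogicalStructure]:
    """LLM pass: break free-form argument text into premises/conclusions."""
    raise NotImplementedError  # prompt an LLM, parse its structured output

def verify_structure(s: LogicalStructure) -> bool:
    """Cheap propositional/consistency checks before paying for a proof."""
    raise NotImplementedError

def to_lean(s: LogicalStructure) -> str:
    """LLM pass: render the structure as a Lean theorem statement + proof."""
    raise NotImplementedError

def check_lean(lean_src: str) -> bool:
    """Shell out to the Lean compiler and see whether the proof elaborates."""
    raise NotImplementedError

def verify_argument(text: str) -> bool:
    """text -> structures -> verify -> Lean -> verify proof -> proceed."""
    return all(
        verify_structure(s) and check_lean(to_lean(s))
        for s in extract_structures(text)
    )
```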

I have a very complex private legal benchmark that requires a few different structures to validate accurately, but the gist is that ESSAY/MULTIPLE_CHOICE tasks require a verification structure, while QA/GENERAL/CASELAW can be validated with simpler regex-style matching. The dataset is split into training, test, and validation sets with an 80/10/10 split. The evals below are results on the test set.
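To make that split concrete, a toy dispatcher might look like the following. The task names mirror the categories above, but the function bodies are stand-ins; the real validators aren't shown here.

```python
import re

def regex_validate(prediction: str, gold: str) -> bool:
    """QA/GENERAL/CASELAW-style check: normalize and look for the gold answer."""
    pattern = re.escape(gold.strip().lower())
    return re.search(pattern, prediction.lower()) is not None

def structured_validate(prediction: str, gold: str) -> bool:
    """ESSAY/MULTIPLE_CHOICE-style check: would run the full verification
    pipeline sketched earlier (structure extraction + proof checking)."""
    raise NotImplementedError

VALIDATORS = {
    "ESSAY": structured_validate,
    "MULTIPLE_CHOICE": structured_validate,
    "QA": regex_validate,
    "GENERAL": regex_validate,
    "CASELAW": regex_validate,
}

def score(task_type: str, prediction: str, gold: str) -> bool:
    return VALIDATORS[task_type](prediction, gold)
```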

I’ll release the verifier code eventually since it could be a pretty cool repo, but unfortunately, I won’t be releasing the benchmark publicly because I don’t want the internet-eating shoggoths to steal this eval.

We used a heavily modified lm-evaluation-harness to perform these evals.

Results on the legal benchmark for o1-preview:

ESSAY: given a fact pattern, answer questions about that fact pattern
Accuracy: 70.00%; Correct: 28/40

QA: simpler, mono-focused, essay-ish questions that blend internal recall and legal theory
Accuracy: 79.48%; Correct: 550/692

GENERAL: simple flashcard-style questions
Accuracy: 90.57%; Correct: 701/774

CASELAW: case-law recall, asking what specific cases are about
Accuracy: 28.38%; Correct: 3138/11057

MULTIPLE_CHOICE (non-CoT): a more detailed QA dataset with an option for CoT answering
Accuracy: 83.99%; Correct: 467/556

It's interesting to note that caselaw performance is extremely weak in the current o1-preview.

TODO

There’s been a bunch of hullabaloo about entropix’s utility as a test-time sampling method; that benchmark is currently running for llama-1b vs. llama-1b with entropix.