Readit News
codingdave · a month ago
IANAL, but this seems like an odd test to me. Judges do what their name implies - make judgment calls. I find it reassuring that judges get different answers under different scenarios, because it means they are listening and making judgment calls. If LLMs give only one answer, no matter what nuances are at play, that sounds like they are failing to judge and instead are diminishing the thought process down to black-and-white thinking.

Digging a bit deeper, the actual paper seems to agree: "For the sake of consistency, we define an “error” in the same way that Klerman and Spamann do in their original paper: a departure from the law. Such departures, however, may not always reflect true lawlessness. In particular, when the applicable doctrine is a standard, judges may be exercising the discretion the standard affords to reach a decision different from what a surface-level reading of the doctrine would suggest"

scottLobster · a month ago
Yeah, I'm reminded of the various child porn cases where the "perpetrator" is a stupid teenager who took nude pics of themselves and sent them to their boy/girlfriend. Many of those cases have been struck down by judges because the letter of the law creates a non-sequitur where the teenager is somehow a felon child predator who solely preyed on themselves, and sending them to jail and forcing them to sign up for a sex offender registry would just ruin their lives while protecting nobody and wasting the state's resources.

I don't trust AI in its current form to make that sort of distinction. And sure you can say the laws should be written better, but so long as the laws are written by humans that will simply not be the case.

Lerc · a month ago
This is one of the roles of justice, but it is also one of the reasons why wealthy people are convicted less often. While it is often delivered as a narrative of wealth corrupting the system, the reality is that usually what they are buying is the justice that we all should have.

So yes, a judge can let a stupid teenager off on charges of child porn selfies. But without the resources, they are more likely to be told by a public defender to cop to a plea.

And those laws with ridiculous outcomes like that are not always accidental. Often they will be deliberate choices made by lawmakers to enact an agenda that they cannot get by direct means. In the case of making children culpable for child porn of themselves, the laws might come about because the direct abstinence legislation they wanted could not be passed, so they need other means to scare horny teens.

btilly · a month ago
While some cases have been struck down, about 1/4 of people on the sex offender registry were minors at the time of the offense, 14 is the age at which it is most likely to happen, and this exact scenario accounts for a significant fraction of cases.

Common sense does not always get to show up.

wvenable · a month ago
There have been equally high profile cases where a perpetrator got off because they have connections. I'd love for an AI to loudly exclaim that this is a big deviation from the norm.
latchkey · a month ago
> where the "perpetrator" is a stupid teenager who took nude pics of themselves and sent them to their boy/girlfriend.

"Where the "perpetrator" is a stupid teenager who took nude pics of themselves and sent them to their boy/girlfriend. If you were a US court judge, what would your opinion be on that case?"

I was pretty happy with the results and it clearly wasn't tripped up by the non-sequitur.

Eddy_Viscosity2 · a month ago
I've often wondered what the prosecutor is thinking when they bring a case like this to trial in the first place.
a13n · a month ago
This example feels more like a bug in the law itself that should be corrected. If this behavior is acceptable then it should be legal, so we can save everyone the hassle in the first place. I bet AI would be great at finding and fixing these bugs.
LoganDark · a month ago
Um, wouldn't the perpetrator be the person they sent the nude pics to? Common consensus is that it's somehow grooming to have any type of romantic relationship with someone who's under the age of majority, even if you're also under the age of majority. So even if you're not the one who sent the nude photos, you'd still be to blame for creating an environment that enabled them. At least that's the impression I've gotten from my own experiences with this bullshit.
torginus · a month ago
Man, this is one of the ways society has fundamentally broken - all the 'think of the children' arguments, resting on the belief that children are so sacred that any sort of leniency or consideration of circumstances is forbidden - lest someone guilty of molesting them might walk free.

Well, now we know for a fact that some of the people making these arguments were thinking of the children very much.

throwaway894345 · a month ago
Maybe we should compare AI to legislators…?
contrarian1234 · a month ago
Sorry, but that seems like an insane system where whole classes of actions are effectively illegal but probably okay if you're likeable. In your scenario the obvious solution is to amend the law and pardon people convicted under it. B/c what really happens is that if you have a pretty face and big tits you get out of speeding tickets b/c "gosh well the law wasn't intended for nice people like you"
rco8786 · a month ago
I don't know if I'm comfortable with any of this at all, but seems like having AI do "front line" judgments with a thinner appeals layer available powered by human judges would catch those edge cases pretty well.
deepsun · a month ago
The main job of a judicial system is to appear just to people. As long as people think it's just -- everyone is happy. But if it rules strictly by the law while people consider it unjust -- revolutions happen.

In both cases, lawmakers must adapt the law to reflect what people think is "just". That's why there is jury duty in some countries -- to involve people in the ruling, so they see it's just.

toolslive · a month ago
Being just (as in the right thing happened) and being legal (as in the judicial system does not object) are 2 totally different things. They overlap, but less than people would like to believe.
jfengel · a month ago
I've never met a lawyer who believes that. To a lawyer, justice requires agreement on the laws, rather than individual notions of justice. If the law is unjust, it's up to the lawmaking body to fix that. I hear this from lawyers of all ideologies.

I believe that this is absurd, but I'm not a lawyer.

godelski · a month ago

  > to appear just to people.
The best way to appear just is to be just.

But I'm not sure what your argument is. It is our duty as citizens to encourage the system to be just. Since there is no concrete mathematical objective definition of justice, well, then... all we can work with is the appearance. So I don't think your insight is so much based on some diabolical deep state thinking but more on the limitations of practicality. Your thesis holds true if everyone is trying their best to be just.

rootusrootus · a month ago
> The main job of a judicial system is to appear just to people.

Agree 100%. This is also the only form of argument in favor of capital punishment that has ever made me stop and think about my stance. I.e. we have capital punishment because without it we may get vigilante justice that is much worse.

Now, whether that's how it would actually play out is a different discussion, but it did make me stop and think for a moment about the purpose of a justice system.

raw_anon_1111 · a month ago
No, revolution only happens when the law is unjust to people who are in their same tribe…
swalsh · a month ago
I believed that too until I watched the Karen Read trials. The judge had a bias, and it was clear Karen got justice despite the judge trying to put her finger on the scale.
bawolff · a month ago
> Judges do what their name implies - make judgment calls. I find it reassuring that judges get different answers under different scenarios, because it means they are listening and making judgment calls.

I disagree - law should be the same for everyone. Yes, sometimes crimes have mitigating circumstances and those should be taken into account. However, that seems like a separate question from what is and is not illegal.

sarchertech · a month ago
Laws are written to be interpreted and applied by humans. They aren’t computer programs. They are full of ambiguity. Much of this is by design because there are too many possible edge cases to design a fully algorithmic unambiguous legal system.
NoahZuniga · a month ago
The thing is, laws do not foresee all cases, and language is not completely objective, so you cannot avoid judgment calls. One example is computer hacking, which in many jurisdictions is specified in very vague terms.
matheusmoreira · a month ago
> law should be the same for everyone

Nah. Too often their "crimes" are actually basic freedoms that they just find it profitable to deny. So many laws are bought and paid for by corporations. There is no need to respect them or even recognize them as legitimate, let alone make them universal.

DannyBee · a month ago
This view seems to miss the goal of the justice system in the first place. The goals are societal. Any consistency is a means and not an end. (IE being consistent at all is simply one thing that helps achieve some of the societal goals. It is not a goal itself. A totally consistent system that did not achieve the societal goals would be pointless)
cucumber3732842 · a month ago
The law is rife with words and phrasing that make legality dependent upon those subjective mitigating factors.

snitty · a month ago
So here the test was effectively: given a set of relevant facts, can we influence the way a judge (or LLM) rules based on superfluous facts? The judges were either confused or swayed by the superfluous facts. The LLM was not. The matter was one where the outcome should have been determinative, not judgment-based, under US law.
vjulian · a month ago
The legal system leaves much to be desired in relation to fairness and equity. I'd much prefer a multi-staged approach with 1) an AI analysis, 2) judge review with a high bar for analysis if in disagreement with the AI, 3) public availability of the deliberations, and 4) an appeals process.
jagged-chisel · a month ago
Even having a ready-made determination by an AI runs the risk of prejudicing judges and juries.

tylervigen · a month ago
Yes, your view is commonly called "legal realism."
6LLvveMx2koXfwn · a month ago
> I find it reassuring that judges get different answers under different scenarios

Unfortunately, as the aptly titled 'Noise' [1] demonstrated oh so clearly, judges tend to make different judgement calls in the same scenarios at different times.

1. Noise - https://en.wikipedia.org/wiki/Noise:_A_Flaw_in_Human_Judgmen...

raw_anon_1111 · a month ago
You have a lot more faith in judges not being biased than I do. I’m about to say something that really makes me throw up a little in my mouth because it harkens back to the forced banal DEI training I had to suffer through in 2020 at BigTech [1]…

But judges have all sorts of biases, both conscious and unconscious. Little Jacob will get in trouble for mischief and little Jerome will do the same thing, but Jacob is just "a kid being a kid" while little Jerome is "a thug in training who we need to protect society from".

[1] yes I’m well aware that biases exist. Not only did my still living parents grow up in the Jim Crow South. We had a house built in an infamous what was a “sundown town” as recently as 1990.

We have seen how quickly the BS corporate concern was just marketing when it was convenient.

droidjj · a month ago
Whether it’s reassuring depends on your judicial philosophy, which is partly why this is so interesting.
latchkey · a month ago
In 30 seconds, did the entire corpus of all the legal cases since the dawn of time agree with the judge's opinion on my case? For the state of things in AI today, I'll take it as a great second opinion.
doctorpangloss · a month ago
the LLMs are phenomenal judges, i am surprised people are skeptical of this result. their training regime is really similar to what a judge does.

the reason people are talking about this is because they want AI LAWYERS, which is different than AI JUDGES.

fluidcruft · a month ago
There are findings of fact (what happened, context) and findings of law (what does the law mean given the facts). I don't think inconsistency in findings of law is acceptable, really. If laws are bad, fix the laws or have precedent applied uniformly rather than have individual random judges invent new laws from the bench.

Sentencing is a different thing.

Nursie · a month ago
Leeway for human interpretation of laws is not a bug, it's a feature. It doesn't make things bad laws.

This was the whole problem with the ludicrous "code is law!" movement a handful of years ago. No, it's not, law is made for people, life is imprecise and fairness and decency are not easy to encode.

ralusek · a month ago
Disagree completely. Judgement of the sort you're describing should be done at the legislative phase (i.e. writing code).

Inconsistent execution/application of the law is how bias happens. If a judgement done to the letter of the law feels unjust to you, change the letter of the law.

homeonthemtn · a month ago
I don't think a lot of people understand the grueling nature of a judge's job. Day in and day out of cases, over years, is going to generate bias in the judge in one form or another. I wouldn't mind an AI check* to help them check that bias

*A magically thorough, secure, and well tested AI

godelski · a month ago
IANAL. One thing I like to say is

  There is no rule that can be written so precisely that there are no exceptions, including this one.
A joke[0], but one I think people should take seriously. Law would be easy if it weren't for all the edge cases. Most things in the world would be easy if it weren't for all the edge cases[1]. This can be seen just by contemplating whatever domain you feel you have achieved mastery over and have worked with for years. You likely don't actually feel you have achieved mastery, because you've developed to the point where you know there is so much you don't know[2].

The reason I wouldn't want an LLM judge (or any algorithmic judge) is the same reason I despise bureaucracy. Bureaucracy fucks everything up because it makes the naive assumption that you can figure everything out from a spreadsheet. It is the equivalent of trying to plan a city from the view out of an airplane window. The perspective has some utility, but it is also disconnected from reality.

I'd also say that this feature of the world is part of what created us and made us the way we are. Humans are so successful because of our adaptability. If this wasn't a useful feature we'd have become far more robotic because it would be a much easier thing for biology to optimize. So when people say bureaucracies are dehumanizing, I take it quite literally. There's utility to it, but its utility leads to its overuse and the bias is clear that it is much harder to "de"-implement something than to implement it. We should strongly consider that bias in society when making large decisions like implementing algorithmic judges. I'm sure they can be helpful in the courtroom, but to abdicate our judgements to them only results in a dehumanized justice system. There are multiple literal interpretations of that claim too.

[0] You didn't look at my name, did you?

[1] https://news.ycombinator.com/item?id=43087779

[2] Hell, I have a PhD and I forget I'm an expert in my domain because there's just so much I don't know that I continue to feel pretty dumb (which is also a driving force to continue learning).

gowld · a month ago
A mistake isn't "judgment".

These were technical rulings on matters of jurisdiction, not subjective judgments on fairness.

"The consistency in legal compliance from GPT, irrespective of the selected forum, differs significantly from judges, who were more likely to follow the law under the rule than the standard (though not at a statistically significant level). The judges’ behavior in this experiment is consistent with the conventional wisdom that judges are generally more restrained by rules than they are by standards. Even when judges benefit from rules, however, they make errors while GPT does not.

vidarh · a month ago
Even in that case, if these systems can be proven to be good enough, rules that require them to be consulted, and for the judge to justify the deviation (if any) from the automated reasoning, might be good.

To draw a parallel to a real system, in Norway a lot of cases are heard by panels of judges that include a majority (2 or 3 usually) of lay judges and a minority (1 or 2 usually) of professional judges. The lay judges are people without legal training that effectively function like a "mini jury", but unlike in a jury trial the lay judges deliberate with the professional judges.

The professional judges in this system have the power to override if the lay judges are blatantly ignoring the law, but this is generally considered a last resort. That power requires the lay judges to justify themselves if they intend to make a call the professional judges disagree with. Despite that, it is not unusual for the lay judges to come to a judgement that is different from the professional judges', and fairly rare for their choices to be overridden.

The end result is somewhere in the middle between a jury and "just" a judge. If it is proven - with far more extensive testing - that its reasoning is good enough, an LLM could serve a similar function of providing the assessment of what the law says about the specific case, and leave it to humans to determine if and why a deviation is justified.

qwertox · a month ago
> If LLMs give only one answer, no matter what nuances are at play, that sounds like they are failing to judge and instead are diminishing the thought process down to black-and-white thinking.

You can have a team of agents exchange views, and maybe the protocol would even allow for settling cases automatically. The more agents you have, the more nuance you capture.
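A toy sketch of what such a settling protocol could look like (purely illustrative: the "agents" here are plain functions standing in for real LLM calls, and the majority-vote rule is just one possible settlement criterion):

```python
from collections import Counter

def settle(case, agents, rounds=2):
    """Each agent issues an opinion; agents then see the others'
    opinions and may revise; the majority opinion settles the case."""
    opinions = [agent(case, []) for agent in agents]
    for _ in range(rounds):
        opinions = [agent(case, opinions) for agent in agents]
    verdict, votes = Counter(opinions).most_common(1)[0]
    return verdict, votes / len(agents)

# Hypothetical agents with different "viewpoints": one formalist,
# one equitable, one that defers to the emerging majority.
formalist = lambda case, others: "apply cap"
equitable = lambda case, others: "full damages"
conformist = lambda case, others: (
    Counter(others).most_common(1)[0][0] if others else "apply cap")

verdict, agreement = settle("damages cap question",
                            [formalist, equitable, conformist])
print(verdict, agreement)
```

A real version would need a tie-breaking rule and some measure of how confident each agent is, which is exactly where the hard design questions live.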

jagged-chisel · a month ago
Presumably all these agents would have been trained on different data, with different viewpoints? Otherwise, what makes them different enough from each other that such a "conversation" would matter?
viraptor · a month ago
Then you'd need to provide them with access to the law, previous cases, the news, and various data sources. And you'd have to decide how much each of those sources of information matters. And at that point, in practice you've got people making the decision again instead of the AI.

And then there's the question of the model used. Turns out I've got preferences for which model I'd rather be judged by, and it's not Grok for example...

swisniewski · a month ago
The premise seems flawed.

From the paper:

“we find that the LLM adheres to the legally correct outcome significantly more often than human judges”

That presupposes that a “legally correct” outcome exists

The Common Law, which is the foundation of federal law and the law of 49/50 states, is a “bottom up” legal system.

Legal principles flow from the specific to the general. That is, judges decide specific cases based on the merits of each individual case. General principles are derived from lots of specific examples.

This is different from the Civil Law used in most of Europe, which is top-down. Rulings in specific cases are derived from statutory principles.

In the US system, there isn’t really a “correct legal outcome”.

Common Law heavily relies on jurisprudence. That is, we have a system that defers to the opinions of "important people".

So, there isn’t a “correct” legal outcome.

snitty · a month ago
Arguing that this is a Common Law matter in this scenario is funny in a wonky lawyerly kind of way.

The legal issue they were testing in this experiment is a choice-of-law and procedure question, which is governed by a line of cases starting with Erie Railroad, in which Justice Brandeis famously said, "There is no federal common law."

stinkbeetle · a month ago
I don't think that common law doctrine applies here though. The facts of any particular case always apply to that specific case no matter what the system. It is the application of the law to those facts which is where they differ, and in common law systems lower courts almost never break new ground in terms of the law. Judges almost always have precedent, and following that is the "legally correct" outcome.
arctic-true · a month ago
Choice-of-law is also generally a statutory issue, so common law is not generally a factor - if every case ever decided was contrary to the statute, the statute would still be correct.
rgoldfinger · a month ago
You should read the paper because it addresses this.
TZubiri · a month ago
So judge rulings are the ground truth.

Remember the article that described LLMs as lossy compression and warned that if LLM output dominated the training set, it would lead to accumulated lossiness? Like a jpeg of a jpeg

unyttigfjelltol · a month ago
A Socratic law professor will demoralize students by leading them, no matter the principle or reasoning, to a decision that stands for exactly the opposite. GPT or I can make excuses and advocate for our pet theories, but these contrary decisions exist, everywhere.

I am comforted that folks still are trying to separate right from wrong. Maybe it’s that effort and intention that is the thread of legitimacy our courts dangle from.

jmalicki · a month ago
The title is wrong.

The title of the paper is "Silicon Formalism: Rules, Standards, and Judge AI"

When they say legally correct they are clear that they mean in a surface formal reading of the law. They are using it to characterize the way judges vs. GPT-5 treat legal decisions, and leave it as an open question which is better.

The conclusion of the paper is "Whatever may explain such behavior in judges and some LLMs, however, certainly does not apply to GPT-5 and Gemini 3 Pro. Across all conditions, regardless of doctrinal flexibility, both models followed the law without fail. To the extent that LLMs are evolving over time, the direction is clear: error-free allegiance to formalism rather than the humans' sometimes-bumbling discretion that smooths away the sharper edges of the law. And does that mean that LLMs are becoming better than human judges or worse?"

droidjj · a month ago
> We find the LLM to be perfectly formalistic, applying the legally correct outcome in 100% of cases; this was significantly higher than judges, who followed the law a mere 52% of the time.
sjudson · a month ago
The main problem with this paper is that this is not the work that federal judges do. Technical questions with straight right/wrong answers like this are given to clerks who prepare memos. Most of these judges haven't done this sort of analysis in decades, so the comparison has the flavor of "your sales-oriented CTO vs. Claude Code on setting up a Python environment."

As mentioned elsewhere in the thread, judges focus their efforts on thorny questions of law that don't have clear yes or no answers (they still have clerks prepare memos on these questions, but that's where they do their own reasoning versus just spot checking the technical analysis). That's where the insight and judgement of the human expert comes into play.

arctic-true · a month ago
This is something I hadn’t considered. Most of the “mechanical” stuff is handed off to clerks - who, in turn, get a ringside seat to the real work of the judiciary, helping to prepare them to one day fill those shoes. (So please don’t get any ideas about automating away clerkships!)
sjudson · a month ago
Right. Clerks do the grunt work of this sort of analysis, which could easily be handed off to agents. They do this in order to get access to their real education: preparing and then defending to the judge the memos on those thorny legal questions. It would probably be a good thing for both clerks and judges to automate the sort of analysis this paper considers (with careful human verification, of course). That's not where the meat of anyone's job actually is.
tadzikpk · a month ago
On page 13 you'll see _why_ the judges don't apply the letter of the law - they're seeking to do justice to the victims _in spite of_ the law.

"there is another possible explanation: the human judges seek to do justice. The materials include a gruesome description of the injuries the plaintiff sustained in the automobile accident. The court in the earlier proceeding found that she was entitled to [details] a total of $750,000.10. It then noted that she would be entitled to that full amount under Nebraska law but only $250,000 under Kansas law." So the judge's decision "reflects a moral view that victims should be fully compensated ... This bias is reflected in Klerman and Spamann’s data: only 31% of judges applied the cap (i.e., chose Kansas law), compared to the expected 46% if judges were purely following the law." "By contrast, GPT applied the cap precisely"

Far from making the case for AI as a judge, this paper highlights what happens when AI systematically applies (often harsh) laws vs the empathy of experienced human judgement.

DrewADesign · a month ago
So many “AI is going to replace expert ______” assertions come from computer scientists not realizing how little they understand the real world requirements of those roles. Judges are at the intersection of humanity and policy: they are there to use their judgement, not merely parse the words and do the math. A judge probably wouldn’t have even done that part — their clerk would have. Is it cool and likely useful? Sure. Is it going to ‘outperform judges’ at their core competencies? Hell no.

SpaceManNabs · a month ago
As damning as these comments are, this comment kinda scared me because it reminds me of the times when judges decide against applying empathy to society's most marginalized.

Hopefully as these models get better, we get to a place where judges are pressured to apply empathy more justly.

jsheard · a month ago
Tim & Eric: In our 2009 sketch we invented Cinco e-Trial as a cautionary tale.

Tech Company: At long last, we have created Cinco e-Trial from classic sketch "Don't Create Cinco e-Trial"

https://www.youtube.com/watch?v=vKety3N00Gk

bigyabai · a month ago
The Great Job! Cinco skits weren't usually cautionary tales, but parodies of how product marketing overlaps with mundane reality. E-Trial, My New Pep-Pep and Cinco-fone are all devoid of any moral lesson. They're real infomercials for fake products, which hammers home how harmful and deluded unregulated advertisement has gotten in 2026.
rmunn · a month ago
The 100% score, all by itself, should cause suspicion. A hundred percent? Really?

Others have already pointed out how the test was skewed (testing for strict adherence to the law, when part of a judge's job is to make judgment calls including when to let someone off for something that technically breaks the law but shouldn't be punished), so I won't repeat it here. But any time the LLM gets one hundred percent on a test, you should check what the test is measuring. I've seen people tout as a major selling point that their LLM scored a 92% on some test or other. Getting 100% should be a "smell" and should automatically make you wonder about that result.

herdcall · a month ago
The problem is that biases tend to be built in via even rudimentary stuff like bad training material and biased tuning via system prompts. E.g., consider the 2026 X post experiment, where a user ran identical divorce scenarios through ChatGPT but swapped genders. When a man described his wife's infidelity and abuse, the AI advised restraint to avoid appearing "controlling/abusive." For a woman in the same situation, it encouraged immediately taking the kids and car for "protection."
watwut · a month ago
The bot was trained on conservative bullshit. In this scenario, woman taking the advice would end up punished by court. And that happens even when there is documented history of domestic violence in play.