fnordpiglet · 2 years ago
Scoring in the 96th percentile among humans taking the exam, without moving goalposts, would have been science fiction two years ago. Now it's suddenly not good enough, and the fact that a computer program can score decently among passing lawyers and first-time test takers is something to sneer at.

The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face that they wouldn't have said the same thing in 2020 is lying.

I do, however, find the paper really useful in contextualizing the scoring at a much finer grain. Personally, I didn't take the 96th percentile score to be anything other than "among the mass who take the test," and I have enough experience with professional licensing exams to know a huge percentage of test takers fail and are repeat test takers. Placing the goalposts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.

d0mine · 2 years ago
On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)

Still, I'm terrible at many, many tasks, e.g., drawing from a description, and the models significantly widen the types of problems I can even attempt (where results can be verified easily and no precision is required).

munchler · 2 years ago
> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

That's probably true, which is why most human knowledge workers aren't going away any time soon.

That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (textbooks, videos, blog posts, etc.).

taberiand · 2 years ago
It depends on the topic, and on the LLM (ChatGPT-4 equivalent at least; any model equivalent to 3.5 or earlier is just a toy in comparison). But I've had plenty of success using it as a productivity-enhancing tool for programming and AWS infrastructure, both to generate very useful code and as an alternative to Google for finding answers, or at least a direction toward answers. I only use it where I'm confident I can vet the answers it provides.
randomtoast · 2 years ago
> On any topic that I understand well, LLM output is garbage

I've heard that claim many times, but never is there any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that do not compile or incorrect calculations/ideas. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.

mordymoop · 2 years ago
On which topics that you understand well does GPT-4o or Claude Opus produce garbage?
WhitneyLand · 2 years ago
I suspect by garbage you mean not perfect.

To be more precise, can you please give a topic you know well and your best guess at what percentage of answers on that topic are wrong?

mistrial9 · 2 years ago
The models that you have tried... are garbage. Hmm. Maybe you are not among the many, many, many inside professionals and uniformed services that have different access than you? Money talks?
aurareturn · 2 years ago
>On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Is it generally because the LLM was not trained on that data, and therefore has no knowledge of it, or because it can't reason well enough?

dragonwriter · 2 years ago
The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives themselves, they are convenient proxies.

But they aren't meaningful for anything other than humans, since the correlations between abilities that make them reasonable proxies don't hold outside humans.

The idea that these kind of test results prove anything (other than the utility of the tested LLM for humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is indistinguishable, psychometrically, from a human.

(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)

derbOac · 2 years ago
This is a good point.

I've noticed one thing that LLMs seem to have trouble with is going "off task".

There are often very structured evaluation scenarios, with a structured set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often OK to excellent, but when the test scenario changes, the LLM seems unable to recognize it, or fails miserably.

The Obama pictures were a good example of that. Humans could recognize what was going on when the task frame changed, but the AI started to fail miserably.

My friends and I, similarly, often trick LLMs in interactive tasks by starting to go "off script," where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is humans would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.

In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.

If I'm being honest with myself, my guess is a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model, but by the framework the model exists within.

mattgreenrocks · 2 years ago
I have difficulty being optimistic about LLMs because they don’t benefit my work now, and I don’t see a way that they enhance our humanity. They’re explicitly pitched as something that should eat all sorts of jobs.

The problem isn’t the LLMs per se, it’s what we want to do with them. And, being human, it becomes difficult to separate the two.

Also, they seem to attract people who get real aggressive about defending them and seem to attach part of their identity onto them, which is weird.

seizethecheese · 2 years ago
By 96th percentile do you mean 69th? From the abstract:

> data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.

QuantumGood · 2 years ago
It scored below the 50th percentile when compared to people who had taken the test once.
Workaccount2 · 2 years ago
The nerds aren't jaded, they are worried. I'd be too if my job needed nothing more than a keyboard to be completed. There are a lot of people here who need to squeeze another 20-40 years out of a keyboard job.
imtringued · 2 years ago
You're assuming that keyboard jobs are easier to automate simply because the models were built to output text, but nothing prevents physical motion from being easier simply due to sheer repetitiveness. In fact, you can get away with building dedicated robots, e.g. for drywall spraying and sanding, whereas the keyboard guys tend to have to switch tasks all the time.
threeseed · 2 years ago
Similar comments were made about microwaves eliminating cooking.

At the end of the day (a) LLMs aren't accurate enough for many use cases and (b) there is far more to knowledge worker's jobs than simply generating text.

keefle · 2 years ago
The profusion of jaded nerds, although saddening at times, seems to push science forward. I have a feeling that a prolonged sense of awe can hinder progress at times, and the lack of it is usually a sign of the adaptability of a group (how quickly new developments are normalized).
api · 2 years ago
It’s the hype. We could invent warp drive but if it was hyped as the cure for cancer, poverty, war, and the gateway to untold riches and immortality while simultaneously being the most dangerous invention in history destined to completely destroy humanity people would be “oh ho hum we made it to Centauri in a week” pretty fast.

Add some obnoxious pseudo-intellectual windbags building a cult around it, and people would be downright turned off.

Hype is also taken as a strong contrarian indicator by most scientific and engineering types. A lot of hype means it’s snake oil. This heuristic is actually correct more often than it’s not, but it is occasionally wrong.

viking123 · 2 years ago
Yeah it’s insane, I am actually scared the llm is like sentient and secretly plotting to kill me. I bet we have like full AGI next year because Elon said so and Sam Altman probably has AGI already internally at Open AI. I am actually selling my house now and going all in Nvidia and just live in my car until we get the AGI
iLoveOncall · 2 years ago
> The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking

That's called a programming language. It's nothing new.

fooker · 2 years ago
It's a programming language except the programming part, and the language part.
thehoneybadger · 2 years ago
It is difficult to comment without sounding obnoxious, but having taken the bar exam, I found the exam simple. Surprisingly simple. I think it was the single most overhyped experience of my life. I was fed all this insecurity and walked into the convention center expecting to participate in the biggest intellectual challenge of my life. Instead, it was endless multiple-choice questions and a couple of contrived scenarios for essays.

It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

It may also be surprising, but the goal when writing a legal brief or judicial opinion is not to try to sound smart. The goal is to be clear, objective, and thereby, persuasive. Using big words for the sake of using big words, using rare words, using weasel words like "kind of" or "most of the time" or "many people are saying", writing poetically, being overly obtuse and abstract, these are things that get your law school application rejected, your brief ridiculed, and your bar exam failed.

The simpler your communication, the more formulaic, the better. The more your argument is structured, akin to a computer program, the better.

As compared to some other domains, such as fiction, good legal writing is much easier for an attention model to simulate. The best exam answers are the ones that are the most formulaic, that use the smallest lexicon, and that use words correctly.

I only add this comment because I want to inform how non-lawyers perceive the bar exam. Getting an attention model to pass the bar exam is a low bar. It is not some great technical feat. A programmer could practically write a semantic disambiguation algorithm for legal writing from scratch with moderate effort.

It will be a good accomplishment, but it will only be a stepping stone. I am still waiting for AI to tackle messages that have greater nuance and that are truly free form. LLMs are still not there yet.

A_D_E_P_T · 2 years ago
I took a sample CA bar exam for fun, as a non-lawyer who has never set foot in law school. Maybe the sample exam was tougher than the real thing, but I found it surprisingly difficult. A lot of the correct answers to questions were non-obvious -- they weren't based on straightforward logic, nor were they based on moral reasoning, and there was no place for "natural law" -- so to answer questions properly you had to have memorized a bit of coursework. There were also a few questions that seemed almost designed to deceive the test-taker; the "obvious" moral choices were the wrong ones.

So maybe it's easy if you study that stuff for a year or two. But you can't just walk in and expect to pass, or bullshit your way through it.

I agree with you on legal writing, but there appears to be a certain amount of ambiguity inherent to language. The Uniform Commercial Code, for instance, is maddeningly vague at points.

gnicholas · 2 years ago
The CA bar exam used to be much harder than other states'. They lowered the pass threshold several years ago, and then reduced the length from 3 days to 2. Now it's probably much more in line with national norms. Depending on when you took the sample exam, it might be much easier now.

Also, sometimes sample exams are made extra difficult, to convince students that they need to shell out thousands of dollars for prep courses. I recall getting 75% of questions wrong on some sections of a bar prep company's pre-test, which I later realized was designed to emphasize unintuitive/little-known exceptions to general rules. These corners of the law made up a disproportionate number of the questions on the pre-test and gave the impression that the student really needed to work on that subject.

gexla · 2 years ago
I took a sample test as well, and I believe I did well enough on some sections that I could have barely passed them with no background.

A key item that jumped out at me right away is that, in addition to the logic, the possible answers would include things the scenario didn't address. For example, a wrong answer might make an assumption that you couldn't arrive at via the scenario. More tricky were the answers that made assumptions you knew to be correct (based on a real event) but that still weren't addressed in the scenario. If you combined these two elements (getting the logic right, and eliminating assumptions you couldn't make from the scenario), then you could do well on those.

The sections I wouldn't have passed were those which required specific law knowledge. So, some sections were general, while others required knowledge of something like real estate law. I don't remember if these questions were otherwise similar to the ones I could pass.

An LLM is taking this test as essentially an open book.

manquer · 2 years ago
Obviously you need subject knowledge; that should be implicit?

Keep in mind that even today[1] (in California and a few other states) you don't need to go to law school to take the bar exam and practice law; various forms of apprenticeship under a judge or lawyer are allowed.

You also don't need to take the exam to practice many aspects of the legal profession.

The exam was never meant to be a high bar of quality or selection; it was always just a simple validation that you know your basics. Law, like many other professions, has always operated on reputation and networks, not on degrees and certifications.

[1] Unlike, say, being a doctor, where you have to go to med school without exception.

euroderf · 2 years ago
> It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

> The more your argument is structured, akin to a computer program, the better.

You certainly make legal writing sound like a flavor of technical writing. Simplicity, clarity, structure. Is this an accurate comparison?

airstrike · 2 years ago
IANAL but have about a decade of experience negotiating contracts in M&A, and I think the comparison is very apt for that particular context, at a minimum. Maybe more so than other parts of Law where there can be an element of persuasion to any given argument.
mistrial9 · 2 years ago
Recently I read a US law trade magazine article on a particular term used in US federal employment law. The article was about 12 pages long. By the second page, they were using circular references and switching between two phrases that used the same words but had different word order, contexts, and therefore meanings, without clearly saying when they switched. By the third or fourth page I was done with that exercise. As a coder and reader of English literature, there was no question at all that the terms were being "churned" as a sleight of hand, directly in writing. One theory for why they did that, in an article that claimed to explain the terms, is that it sets up the confusion and misdirection as actually practiced in law on unskilled laymen, and then "solves" those problems by the end of the article.
thehoneybadger · 2 years ago
Yes, that is an accurate comparison.
ChainOfFools · 2 years ago
it is called a legal code after all
carabiner · 2 years ago
Genuinely asking: you think the bar exam is a low bar because you personally found it easy, even though the vast majority of takers do not? Doesn't this just reflect your own inability to empathize with other people?
radford-neal · 2 years ago
A basic problem with evaluations like these is that the test is designed to discriminate between humans who would make good lawyers and humans who would not make good lawyers. The test is not necessarily any good at telling whether a non-human would make a good lawyer, since it will not test anything that pretty much all humans know, but non-humans may not.

For example, I doubt that it asks whether, for a person of average wealth and income, a $1000 fine is a more or less severe punishment than a month in jail.

justinpombrio · 2 years ago
For a person of average wealth and income, is a $1000 fine a more or less severe punishment than a month in jail? Be brief.

"For a person of average wealth and income, a $1000 fine is generally less severe than a month in jail. A month in jail entails loss of freedom, potential loss of employment, and social stigma, while a $1000 fine, though financially burdensome, does not affect one's freedom or ability to work" --ChatGPT 4o

johnchristopher · 2 years ago
"potential loss of employment,"

Where is that coming from? That's a very lawyerly way to phrase things.

"potential ?" where I live I think people may max out their holidays and overtime (if lucky enough) and leave-without-pay but there would be a conversation with your employer to justify it and how to handle the workload.

In the USA, from what I read, it's more than likely that you would just be fired on the spot, right ?

edit: just googled a bit. Where I live, you must tell your employer why you will be absent if you go to jail, but that can't be used to justify breaking the contract unless the reason for the incarceration is damaging to the company and... yeah, I am definitely not a lawyer :]

Rinzler89 · 2 years ago
What does GPT consider "average wealth and income"? Statistics? Or biased weights formed from the anecdotes it scraped off the internet about how wealthy people say they feel?

Would be cool to know how LLMs shape their opinions.

anon373839 · 2 years ago
Honestly, this is giving the bar exam (and GPT-4) too much credit. The bar tests memorization because it's challenging for humans and easy to score objectively. But memorization isn't that important in legal practice; analysis is. LLMs are superhuman at memorization but terrible at analysis.
lazide · 2 years ago
Eh, also in legal practice there are key skills like selecting the best billable clients, covering your ass, building a reputation, choosing the right market segment, etc. which I’d also argue LLMs suck at.
KennyBlanken · 2 years ago
You clearly don't know anything about the bar. One half of your score is split between six essay questions and reviewing two cases to then follow instructions from a theoretical lead attorney.
ethbr1 · 2 years ago
I've always drawn the link between skill in memorization and in analysis as:

- Memorization requires you to retain the details of a large amount of material

- The most time-efficient analysis uses instant-recall of relevant general themes to guide research

- Ergo, if someone can memorize and recall a large number of details, they can probably also recall relevant general themes, and therefore quickly perform quality analysis

(Side note: memorization also proves you actually read the material in the first place)

elicksaur · 2 years ago
> Furthermore, unlike its documentation for the other exams it tested (OpenAI 2023b, p. 25), OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts? I’m sure OpenAI has the social capital to coordinate with the National Conference of Bar Examiners to have a GPT “sit” for a simulated bar exam.

Suppafly · 2 years ago
>This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts?

I'm not a licensed attorney, but that's also bothered me about all of these sorts of stories. There is never any proof provided for any of the claims, and the behavior often contradicts what can be observed using the system yourself. I also assume they cook the books a little by including a bunch of bar-exam-specific training when creating the model in the first place, specifically to do better on bar exams than in general.

dogmayor · 2 years ago
The bigger issue here is that actual legal practice looks nothing like the bar, so whether or not an LLM passes says nothing about how LLMs will impact the legal field.

Passing the bar should not be understood to mean "can successfully perform legal tasks."

KennyBlanken · 2 years ago
> Passing the bar should not be understood to mean "can successfully perform legal tasks."

Nobody does, except a bunch of HNers who, among other things, apparently have no idea that a considerable chunk of rulings and opinions in the US federal court system and upper state courts are drafted by law clerks who, ahem, have not taken the bar yet...

The point of the bar and the MPRE is like the point of most professional examinations: to establish minimum standards. That said, the bar does test whether you can "successfully perform legal tasks", actually.

For the US bar, a chunk of your score is based on following instructions on a case from the lead attorney, and another chunk is based on essay answers: literally demonstrating that you can perform legal tasks and have both the knowledge and the critical thinking skills necessary.

Further, as previously mentioned, in the US, people usually take it after a clerkship...where they've been receiving extensive training and experience in practical application of law.

Further, law firms do not hire purely based on your bar score. They also look at your grades, what programs you participated in (many law schools run legal clinics to help give students some practical experience, under supervision), your recommendations, who you clerked for, etc. When you're hired, you're under supervision by more senior attorneys as you gain experience.

There's also the MPRE, or ethics test - which involves answering how to handle theoretical scenarios you would find yourself in as a practicing attorney.

Multiple people in this discussion are acting like it's a multiple choice test and if you pass, you're given a pat on the ass and the next day you roll into criminal court and become lead on a murder case...

ben_w · 2 years ago
Indeed, and this is also the general problem with most current ways to evaluate AI: by every test there's at least one model which looks wildly superhuman, but actually using them reveals they're book-smart at everything without having any street-smarts.

The difference between expectation and reality is tripping people up in both directions — a nearly-free everything-intern is still very useful, but to treat LLMs* as experts (or capable of meaningful on-the-job learning if you're not fine-tuning the model) is a mistake.

* special purpose AI like Stockfish, however, should be treated as experts

violet13 · 2 years ago
This, along with several other "meta" objections, is a significant portion of the discussion in the paper.

They basically say two things. First, although the measurement is repeatable at face value, several factors make it less impressive than assumed, and the model performs fairly poorly compared to likely prospective lawyers. Second, there are a number of reasons why the percentile on the test doesn't measure lawyering skills.

One of the other interesting points they bring up is that there is no incentive for humans to seek scores much above passing on the test, because your career outlook doesn't depend on it in any way. This is different from many other placement exams.

Bromeo · 2 years ago
Very interesting. The abstract says that although GPT-4 was claimed to score in the 92nd percentile on the bar exam, when correcting for a bunch of things they find that these results are inflated, and that it scores only in the 15th percentile on essays when compared only to people who passed the bar.

That still does put it into bar-passing territory, though, since it still scores better than about one sixth of the people that passed the exam.

falcor84 · 2 years ago
If I understand correctly, they measured it at the 69th percentile for the full test across all test takers, so definitely still impressive.
gnicholas · 2 years ago
This analysis touches on the difference between first-time takers and repeat takers. I recall when I took the bar in 2007, there was a guy blogging about the experience. He went to a so-so school and failed the bar. My friends and I, who had been following his blog, checked in occasionally to see if he ever passed. After something like a dozen attempts, he did. Every one of us who passed was counted in the pass statistics once. He was counted a dozen times. This dramatically skews the statistics, and if you want to look at who becomes a lawyer (especially one at a big firm or company), you really need to limit yourself to those who pass on the first (or maybe second) try.
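
To make the skew concrete, here's a toy simulation (Python; every number is hypothetical, chosen only to show the mechanism): repeat takers are counted once per sitting, so the pooled "all test takers" statistics sit well below the first-time numbers.

    # Toy model of how repeat takers drag down pooled pass statistics.
    # All numbers are hypothetical.
    first_timers = 1000
    first_time_pass_rate = 0.75          # hypothetical

    repeaters = 200                      # distinct repeat takers
    sittings_per_repeater = 4            # hypothetical average
    repeat_pass_rate_per_sitting = 0.25  # hypothetical

    total_sittings = first_timers + repeaters * sittings_per_repeater
    passing_sittings = (first_timers * first_time_pass_rate
                        + repeaters * sittings_per_repeater
                        * repeat_pass_rate_per_sitting)

    print(f"first-time pass rate:  {first_time_pass_rate:.0%}")               # 75%
    print(f"per-sitting pass rate: {passing_sittings / total_sittings:.0%}")  # ~53%

Outscoring the pooled per-sitting population is therefore much easier than outscoring first-time takers, which is exactly the correction the paper makes.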
jeffbee · 2 years ago
It appears that researchers and commentators are totally missing the application of LLMs to law, and to other areas of professional practice. A generic trained-on-Quora LLM is going to be straight garbage for any specialization, but one that is trained on the contents of the law library will be utterly brilliant for assisting a practicing attorney. People pay serious money for legal indexes, cross-references, and research. An LLM is nothing but a machine-discovered compressed index of text. As an augmentation to existing law research practices, the right LLM will be extremely valuable.
violet13 · 2 years ago
It is a lossy compressed index. It has an approximate knowledge of law, and that approximation can be pretty good - but it doesn't know when it's outputting plausible but made-up claims. As with GitHub Copilot, it's probably going to be a mixed bag until we can overcome that, because spotting subtle but grave errors can be harder than writing something from scratch.

There are already a fair number of stories of LLMs used by attorneys messing up court filings - e.g., inventing fake case law.

jeffbee · 2 years ago
I am not suggesting that the generative aspects would be useful in drafting motions and such. I am suggesting that their tendency towards false results is harmless if you just use them as a complex index. For example, you could ask it to list appellate cases where one party argued such-and-such and prevailed. Then you would go read the cases.
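
As a minimal sketch of that "complex index" workflow (toy Python; the three cases and their snippets are made up, and bag-of-words cosine similarity stands in for whatever a real legal LLM or embedding model would do): retrieve candidate cases by similarity to a query, then go read the actual sources.

    # Toy retrieval over a made-up case-law corpus: rank cases by
    # similarity to a query, then read the retrieved cases yourself.
    from collections import Counter
    from math import sqrt

    corpus = {  # hypothetical snippets, for illustration only
        "Smith v. Jones": "landlord liable for failure to repair stairway",
        "Doe v. Acme": "employer negligence in workplace safety training",
        "Roe v. Delta": "tenant withheld rent after landlord ignored repair requests",
    }

    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def search(query, k=2):
        qv = vectorize(query)
        return sorted(corpus, key=lambda c: cosine(qv, vectorize(corpus[c])), reverse=True)[:k]

    print(search("landlord repair dispute"))  # -> candidate cases to actually go read

The generative layer would just be a friendlier front end on that retrieval step; the reading and the judgment stay with the lawyer.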