dahart · 7 years ago
> Utah has been using AI as the primary scorer on its standardized tests for several years. “It was a major cost to our state to hand score, in addition to very time consuming,” said Cydnee Carter, the state’s assessment development coordinator. The automated process also allowed the state to give immediate feedback to students and teachers, she said.

Yes, education takes time and costs money. Yes, not educating is both cheaper and faster. Note how the rationalizing ignores the needs of the students and the quality of the education.

I live in Utah and my children have been subjected to this automated essay scoring here. One night I came home from work and my son and wife were both in tears, frustrated with each other and frustrated with the essay scoring which refused to give a high enough score to meet what the teacher said was required, no matter how good the essay was. My wife wrote versions herself from scratch and couldn’t get the required score. When I got involved, I did the same with the same results.

Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not. Random unrelated sentences pasted in the middle would increase the score. We found a letter of petition online for banning automated scoring for the purposes of grades or student evaluation of any kind. It was very long, so it got a perfect score. I encouraged my son to submit it, and he did. Later I visited his teacher to explain and to urge her to not use automated scoring. She listened and then told me about how much time it saves and how fast students get feedback. :/

piokoch · 7 years ago
Frankly, I can't believe what I am reading. The idea that some "AI" grades essays automatically is idiotic and has nothing to do with education. Where is the place for discussion? Where is the place for the confrontation of ideas? Where is the place for developing a writing style? How is this AI supposed to grade things like repetition (which can be either a good rhetorical tool or a mistake, depending on context), etc.?

Who the hell came up with such an idea? I would even hesitate to use "AI" for automatic spell checking, since it is enough to give some character an unusual name and it will be marked as an error.

My guess is that sooner or later people will learn how to game that AI. I wouldn't be surprised if there were software that generates essays the Utah "AI" likes.

dagw · 7 years ago
> My guess is that sooner or later people will learn how to game that AI.

Already been done. http://lesperelman.com/writing-assessment-robo-grading/babel...

Here's a sample essay that is complete nonsense and got a perfect score on the GRE.

http://lesperelman.com/wp-content/uploads/2015/12/6-6_ScoreI...

maze-le · 7 years ago
>> Who the hell came up with such an idea?

I'd guess this is a product of dwindling state finances and contempt for any form of real education. AIs are orders of magnitude cheaper than real teachers. They also don't form unions and wouldn't voice any opposition to changes in the curriculum.

They are also pretty useless, as you have pointed out. The consequences of this policy will be postponed until the students reach a certain age -- that'll be like 10-15 years in the future.

Balgair · 7 years ago
> My guess is that sooner or later people will learn how to game that AI.

To be fair, the GP here is specifically describing that he gamed the AI via a copy-paste of a critique of the AI. His kid submitted it of their own accord, and it was graded without comment. Then, when the GP went in to comment on the gaming of the AI, the teacher not only did not care that the AI was gamed, but expressed gratitude for the AI saving hours of work, still ignoring that the AI fundamentally made things worse, all at the expense of the entire point of being a teacher in the first place.

The issue, for the teacher, is that in 'the system' in which they collect a pay-check, the AI works flawlessly. The point, for the teacher, is not to educate children. It is to have assignments that children pass with some sort of distribution that can be sent in and calculated by some person in a beige suit, wide tie, and hair troubles. The difference is subtle at first, but when you get further along to the point where the GP is sitting, then the difference is comical.

The AI allows the teacher to increase their efficiency in processing assignments, ones that never really mattered to the teacher in the first place. In valley-speak: the incentives are not aligned.

de_watcher · 7 years ago
I can't believe it either; it's completely ridiculous. They're basically claiming that they've developed a general AI. It's like some part of the population is living in a different fantasy world and makes policy decisions accordingly.
amiga_500 · 7 years ago
Is the USA tenable going forward? Your cost-of-everything, value-of-nothing culture appears to be very destructive.
lr4444lr · 7 years ago
I agree with you wholeheartedly, but I think there's a stronger argument to be made here: the algorithms being used "work" only on correlations, and only as long as students are ignorant of the scoring metric. If the students under test knew even sketchily how the system worked, e.g., points deducted if your average sentence length is > 7 words, points added if your word-length stddev is greater than 2, and they could meaningfully push their scores up by focusing on these proxies that don't _actually_ measure what a human would say is quality work - or could even get gibberish[0] rated highly - then the whole thing is a fraud. No one will stand for a grading system that only works by virtue of obscurity. (And these proxies are trivially computable; see the sketch below.)

[0] https://www.nytimes.com/2012/04/23/education/robo-readers-us...
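
To make it concrete, here's a toy sketch of those two hypothetical proxies in Python (purely illustrative; no vendor publishes their actual features):

    import re
    from statistics import mean, stdev

    def proxies(text):
        # The two hypothetical proxies named above: average words per
        # sentence, and the spread of word lengths. Neither measures meaning.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        return (mean(len(s.split()) for s in sentences),
                stdev(len(w) for w in words))

    print(proxies("Prose quality is hard. Counting words is easy."))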

Nasrudith · 7 years ago
It is classic beancounter thinking in the worst way: the worst stereotype of an MBA trying to minimize cost beyond all reason, cutting corners even when it saws at the branch they sit on.

It is frankly a sign of a diseased culture to use it in any capacity except an exercise to improve AI.

wisty · 7 years ago
Teachers often score by similarly sensible criteria.
ALittleLight · 7 years ago
When I was a child I was obsessed with the "grade level" function in Microsoft Word. It was a preference you could enable on spell check to tell you the "grade level" of your writing.

Every essay I wrote, I'd always force myself to reach the max "12.0" grade level. While writing I'd struggle over word choice, sentence structure, rearranging paragraphs, working on my tone etc, all in pursuit of the 12th grade way to phrase things. All my revisions were subject to the approval of the Grade Level checker.

Whenever I could, I would check the grade levels of my friends' writing - usually by showing them a "neat feature" they could enable. Then I'd smugly applaud myself for being the better writer whenever their grade level was below 12.0.

The Grade Level feature fascinated me, and to try and master it, I found a book about Microsoft Word and looked through it in a bookstore. I was absolutely gobsmacked at how simple the formula was. I had childishly been expecting something sophisticated, like perhaps Utah educators imagine they have. I genuinely expected the method to be complex beyond my understanding.

Instead, Word used a variant of Flesch-Kincaid. There was a direct relationship between sentence length and grade score, and between polysyllabic words and grade score. Meaning, the longer your sentences and words, the higher your grade score.

As soon as I got home from the bookstore I loaded a draft of something I had written. It was "pre-12.0" writing from me. I simply deleted all the periods but one and checked again. 12.0.
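
For the curious, the formula really is that simple. A quick Python sketch (my syllable counting is crude, and Word's exact variant may use slightly different coefficients, but the shape is right):

    import re

    def syllables(word):
        # Crude syllable estimate: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fk_grade(text):
        # Flesch-Kincaid Grade Level:
        #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        return (0.39 * len(words) / max(1, len(sentences))
                + 11.8 * sum(syllables(w) for w in words) / max(1, len(words))
                - 15.59)

    text = "I went to the store. I bought some milk. Then I walked home."
    print(round(fk_grade(text), 1))                   # short sentences: low grade
    print(round(fk_grade(text.replace(".", "")), 1))  # no periods: one long "sentence", higher grade

Deleting periods inflates the words-per-sentence term, which is exactly why my one-period draft jumped to 12.0.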

Automatic grading is a wonderful lure. It's nice to imagine that there's some objective measure of writing quality that's easy to tap into. At the moment, I think we're far from that ability.

Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

Revising is a really vital part of writing. Getting more chances to do revision, plus having to write something good enough to show your peers, plus having the risk of any paper count for your grade should compensate for incomplete teacher grading.

nerdponx · 7 years ago
The fact that you were literally still a child when this happened, but automated grading is being foisted on us by grown adults who are ostensibly professionals, says a lot about the situation.
rwbcxrz · 7 years ago
> Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

That's how it's done in creative writing courses. I've always found it infinitely more helpful than only having feedback from the instructor, even if the instructor's feedback was generally more helpful/useful than peer feedback.

pzs · 7 years ago
Arguably, Hemingway's texts are well written. One of the sources of the power of his prose is the use of simple words and basic sentence structures. I bet Word would classify them as below 12th grade.

The point I am trying to make, in agreement with the parent, is that there are qualities that are very hard to score with algorithms. The difficulty of solving this problem equals, if not exceeds, that of automated translation, which still only works properly for specialized and limited domains, e.g. weather forecasts.

regrub · 7 years ago
All that grade-level gaming paid off, I reckon! This was a funny, informative personal account of it :)
jhanschoo · 7 years ago
It's interesting that the tool (and system) is designed to aid people trying for the opposite result, i.e. for publicists and other authors striving to word their message to be as widely understood as possible.
mnky9800n · 7 years ago
You just ruined my childhood. Thanks.
bryanrasmussen · 7 years ago
I went to high school in Utah, long before this automated scoring. It sounds awful, but considering the quality of the education I received there, perhaps it's not that bad after all.

My best Utah education anecdote: on the first day of British literature class, the teacher came in and asked, "Does anyone here know what A.D. means?" Someone said "after death"; she said no. I figured this was my time to shine, so I raised my eager hand and said "Anno Domini, in the year of the Lord." She said no.

Then she announced: "A.D. means after the Deluge, and B.C. means before Christ".

She also totally lied to me one time about whether she would be considering a particular textbook question as applying to Rosencrantz or Guildenstern.

Anyway, I think that was one of the many classes I got an F in after I stopped going and would walk past it every day on my way to play chess with my German teacher.

C1sc0cat · 7 years ago
How is this relevant in a Lit class? Presumably you hid the fact that you were a Catholic / Anglican from her.
clay_the_ripper · 7 years ago
Wow, this is pretty shocking. I can understand using automated systems for something like math problems, as there's (usually) one right answer. But essays? This should be banned.
Baeocystin · 7 years ago
Wait 'til you see a kid in tears because the math answer they submitted was supposed to equal zero, but the algorithms behind the scenes are so bad that the float math failed the equality check.

Note: This is not hyperbole, I have seen this exact scenario more than once.
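
A minimal illustration of that failure mode (my guess at what's going on behind the scenes, not any vendor's actual code):

    # The student's answer is mathematically zero, but floating-point
    # arithmetic leaves a tiny residue, and a naive grader compares exactly:
    answer = 0.1 + 0.2 - 0.3
    print(answer)               # 5.551115123125783e-17, not 0.0
    print(answer == 0)          # False -> marked "incorrect"
    print(abs(answer) < 1e-9)   # True  -> what a sane checker would test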

There may be a place for a well-designed one, but if it exists, I've never seen it.

Aromasin · 7 years ago
Having been forced to use online math software for all my homework while at school, I vehemently disagree. It was so poor that it became a meme within my year group.

It would mark you as incorrect for using too many decimal places, even though it wouldn't tell you how many significant figures were required. I remember it often marking my answer as incorrect even though it was identical to the answer it gave. Sometimes you'd have to show your working, but it couldn't handle brackets. Once I put the answer as "1+x=y" but they wanted "y-1=x", and it was marked incorrect.
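
That last one is especially galling, because checking that kind of equivalence is a solved problem. A sketch with sympy, assuming the answers get parsed into equations at all:

    import sympy as sp

    x, y = sp.symbols("x y")
    student = sp.Eq(1 + x, y)
    expected = sp.Eq(y - 1, x)

    # Two linear equations are equivalent if solving both for the same
    # variable yields the same solution set.
    print(sp.solve(student, y) == sp.solve(expected, y))  # True: [x + 1] == [x + 1]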

I'm sure academic software design is leaps and bounds above what it was in the early 2000s, but to have pupils' futures hinge on what generally seems to be poorly tested code is dangerous.

wickedsickeune · 7 years ago
I have often solved hard math problems with very unconventional solutions (e.g. geometric proofs for algebraic problems). Trust me, a piece of software is decades away from being fit to accurately determine the future of children and massively impact their self-esteem and trust in society.
UweSchmidt · 7 years ago
It is important for the teacher to see where part of the class took a wrong turn, where the students' understanding ended. It is important to distinguish between careless errors, misremembering a formula, and a lack of understanding.
yoz-y · 7 years ago
> Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not.

Frankly, this is no different from my experience in school decades ago. The teachers always said that length does not matter and that we should not pad our papers. However, students who wrote more pages got better scores every single time.

bradstewart · 7 years ago
Is it possible that the students who wrote shorter papers were in fact presenting incomplete arguments and/or thoughts? Writing clearly and concisely is extremely difficult.
MertsA · 7 years ago
Historically, on the written portion of the SAT, length has been substantially correlated with the final score.
raxxorrax · 7 years ago
You have automated systems that rate essays without any human actually reading them?

Kids, forget everything you know, because crime does indeed pay off. The best grades will be reserved for those who try to cheat this system, however it is implemented. Botting your essays is the way to go in the 21st century.

dagw · 7 years ago
Given that the stack ranking at your future job will also be done by an "AI" (probably developed by the same company that graded your tests), this is a very useful skill to have.
_0ffh · 7 years ago
>Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score

Apart from the fact that your story is straight up frightening, isn't this part completely backwards, too? I mean, clearly using more words to convey the same message is /less/ efficient, not more so?

dahart · 7 years ago
Yes. Exactly. Before I figured out how to game the program, my son and wife were editing shorter. That’s what the instructions said to do. And, that’s also a major strategy for decent writing: brainstorm a lot, then edit down to the good parts. What this means is the software’s scoring is an anti-incentive to good writing. Used as a teaching aid, it’s actually doing pure damage, not good. Not only can it not score reliably, nor provide meaningful feedback, it’s actually actively teaching a very wrong way to write. But it is cheaper than humans, and it does give immediate feedback, so there’s that.
bigred100 · 7 years ago
This is a problem I have with a lot of human behavior. Instead of admitting they don't have the resources to do something, or aren't willing to prioritize it, people come up with a bad version that's not worth doing. Lots of things are worth doing poorly, but for many of them, I believe you just have to admit they are not worth doing at all unless a certain level of performance is met.

What’s even cheaper than AI? Tell the students to write some pages, have the teacher glance at the number of pages written, give full credit if the mark was met, and throw the papers out without reading them. It sounds like it would be similarly effective and less aggravating. Unfortunately, this would require humility on the part of the educators.

analyst74 · 7 years ago
Looking at it from a positive angle: students today are learning the useful life skill of gaming computer systems, which they will have to deal with when they grow up.

edit: ...just like how previous generations had to learn how to game social systems.

alexanderdmitri · 7 years ago
Except the algorithm being gamed can change suddenly, drastically and without the gamer's knowledge.

When such changes occur, the gamer will be docked until they can reverse-engineer the new algorithm. There's also the risk that all their previous inputs "gaming" the system might be re-scored, with terrible results, effectively rewriting their historical performance disastrously.

As always, those with the social standing and power to have insider knowledge or guidance will be in the best position to profit off such systems.

YeGoblynQueenne · 7 years ago
Ho ho. Wait. You mean you were able to submit multiple versions of the essay? So that anyone can basically game the test, by submitting multiple essays until they get the best score they can wring out of it?

That is just mad.

c3534l · 7 years ago
It'd be easier, and equally fair, to just grade students' essays by rolling a pair of dice.
ejk314 · 7 years ago
It's arguably more fair. At least purely random scoring doesn't incentivize cheating.
nraynaud · 7 years ago
wait, you get the score in real time? Like some kind of objective function you can train a machine to maximise on?
christophilus · 7 years ago
Heh heh. I like the way you think. Hackers of Utah unite!
nyxtom · 7 years ago
This needs to just be outright banned
imtringued · 7 years ago
I can't even comprehend how someone can use automation for a task like this... It completely goes against human nature. In a world where all jobs have been automated, teachers would be the last ones to go before humanity is completely obsolete.
daveFNbuck · 7 years ago
Do you just get to keep submitting the essay to see what score it will get before you turn it in? That sounds like a bigger problem than any of the particulars about how the grading is done.
dahart · 7 years ago
In this case, there was a limit to the number of times the essay could be submitted, and there was a required score that needed to be obtained within that limit, otherwise the grade would go down. The limit was something like 20 tries, and when I got there they’d already used maybe 14 of them.

I could perhaps see value in having unlimited tries, as a teaching aid, if the result wasn't being used for grading. That would at least leave room for curiosity and exploration. And, more importantly, I could see value if the software wasn't essentially a scam that fundamentally is not able to do what is advertised. If the software really could grade essays reliably, and provide meaningful suggestions for improvement, then maybe it could be used to help educate students, in conjunction with the teacher's guidance. But the software does not grade reliably, it absolutely does not offer meaningful constructive feedback, and the teachers were using it to avoid reading essays, not to supplement their own expertise.

One of the several amusing ironies here is how the software company has convinced the state and teachers to willingly replace themselves with bots, despite obvious evidence that the humans can do the job better.

pault · 7 years ago
The instant feedback mechanism is just begging for someone to turn it into a GAN by writing the other half. I would absolutely love to hear that some particularly clever high school student was able to train an ML algorithm to consistently fool the grading algorithm, thus instantly rendering all of their efforts worthless and dragging the administrators through the mud at the same time.
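
You wouldn't even need a full GAN; with the score as a black-box objective, naive hill-climbing would probably get you most of the way. A toy sketch, where score() is a stand-in for the grader and assumes the length bias described upthread:

    import random

    def score(essay):
        # Stand-in for the grader's instant feedback; here it just
        # rewards length, mimicking the bias the GP discovered.
        return len(essay.split())

    FILLER = ["moreover", "therefore", "paradigm", "synergy", "notwithstanding"]

    def hill_climb(essay, tries=50):
        best, best_score = essay, score(essay)
        for _ in range(tries):
            # Mutate: splice a random filler word into a random position.
            words = best.split()
            words.insert(random.randrange(len(words) + 1), random.choice(FILLER))
            candidate = " ".join(words)
            if score(candidate) > best_score:  # keep any change the scorer rewards
                best, best_score = candidate, score(candidate)
        return best, best_score

    print(hill_climb("A modest essay about waffles."))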
seanmcdirmid · 7 years ago
That really sounds like Utah: lots of students (due to LDS influence) and a conservative government (ditto), so the pupil/teacher ratio is insane. I'd guess the teacher really doesn't have any other choice.
VikingCoder · 7 years ago
My mother worked grading standardized tests. It was a hellish job for many reasons (limited breaks, etc.).

One question she had to grade was essentially, "What's something you want your teacher to know about you?"

It was an essay answer, and she was supposed to grade it on grammar, etc. Just the mechanical aspects of writing. (The real question explained the details more, but that was the core of the question.)

She saw answers that would make you weep.

"My daddy touches me."

"I haven't eaten today. I don't know when I'm going to eat again."

Stuff like that.

And my mother was going to be the only human who ever saw their responses. Their teacher had no chance to see their responses, just my mom.

So she goes to her supervisor and asks, "What can we do to help these kids?"

The supervisor said there was nothing they could do. Just grade the answers.

dredmorbius · 7 years ago
The US has federal mandatory child abuse reporting laws that cover teachers and school staff and personnel, as well as additional state requirements, which vary but include, in 11 states, faculty, staff, and volunteers at public or private higher education institutions. Computer and IT professionals are also covered in some cases.

> Faculty, administrators, athletics staff, or other employees and volunteers at institutions of higher learning, including public and private colleges and universities and vocational and technical schools (11 States).

https://www.childwelfare.gov/topics/systemwide/laws-policies...

https://www.childwelfare.gov/pubPDFs/manda.pdf

This includes penalties for failure to report in multiple states:

https://www.childwelfare.gov/topics/systemwide/laws-policies...

DoctorOetker · 7 years ago
so ... nobody wonders about the obvious ramification: any ML scoring system ... must detect child abuse signals!
harry8 · 7 years ago
Some of these will be 100% true as well. But don't make the mistake of thinking there are no kids who go for shock value, or who are wantonly manipulative when they know it can't come back to them.

So how many are true and how many false? I have no clue. Literally none. And no, it doesn't make me feel any better about the screams of existential agony even if the percentage were low. It could be high, too.

dmoy · 7 years ago
For the not eating, it's pretty easy to get data. Something like 1 in 5 children in the US live in food-insecure households, and maybe 1 in 20 of those are severely insecure (roughly 1 in 100 children overall), so not eating before the school-provided lunch is common enough that if you're grading tons of papers you'll run into kids like that.
manfredo · 7 years ago
When I was a high school student, we had some state-administered test in health class that tasked us with analyzing advertisements for liquor and tobacco and seeing if we could recognize harmful behavior that the ads might be promoting. This test had no impact on our class grade...

...which meant we wrote whatever the hell we wanted. I was assigned a Captain Morgan (rum) ad. I wrote that the ad was glorifying maritime piracy and was likely responsible for pirate activity in Somalia.

inimino · 7 years ago
Of course some kids are manipulative, going for shock value, continuing an "in-joke", or just plain trolling. But would a teacher just look the other way, or would they talk to the kid? What would you want for your kids? This is why teachers assigning homework like "what do you want your teacher to know about you" and then not even seeing it is dehumanizing.
mxcrossb · 7 years ago
I don’t know about calling it manipulative. I remember taking the ACT, and struggling to plan out one of my essays. It was something like “tell us about a book that inspired you”. So I changed details about the plot so it all fit nicely and was easy to write. I can see something similar here, where someone takes on a persona when writing in order to effectively communicate.
7952 · 7 years ago
False accusations can actually be the result of prior abuse. They may substitute one person for another. Or do things as a result of mental illness caused by abuse. Kids think differently to adults and may behave inexplicably. And unfortunately that means that an abused child is a terrible witness.
fucking_tragedy · 7 years ago
> But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.

In the US, school funding is based upon standardized test results, and bad results can shut a poorly performing school down.

It's drilled into every kid's head that these tests are very important and super strict, and that if they accidentally mess up, it can ruin their academics, because retesting and regrading are expensive.

empath75 · 7 years ago
‘Poor sentence structure and grammar, 1 point out of five. Sorry your daddy touches you.’
shaggyfrog · 7 years ago
Punch up, not down.
rhexs · 7 years ago
Or...report it to the police? I’d gladly risk my job to do the right thing in that instance.
VikingCoder · 7 years ago
What my mom saw had an ID number on it. No other demographics. And she was grading from multiple states.

So do what? Contact her local police?

With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

And people don't work at a job like that because they want to. They work there because they need the money.

Everything she took in and out of there was monitored, too. So it's not like she could go to the Xerox machine and walk out of there with a copy.

It's beyond dehumanizing. For everyone. The kid, the people who work there.

sjg007 · 7 years ago
They should be mandatory reporters, at least in the USA.
aidenn0 · 7 years ago
Hand-graded standardized tests are usually anonymized.
killjoywashere · 7 years ago
Tell your mom to take those numbers and the company's name to the police. They can walk back the identification problem.

drngdds · 7 years ago
This is my first time learning that AI-graded essays are a thing. Am I the only one who thinks that's insane? I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.
_delirium · 7 years ago
I work in AI, and was very surprised when I heard about this (a few years ago). I don't think anyone who works in the area thinks the tech is ready for this kind of deployment. There is research on the subject [1], and NLP systems can do better than baseline methods, but the error rates are still pretty high.

A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories, or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than research state-of-the-art.

[1] e.g. https://www.aclweb.org/anthology/P15-1053

harry8 · 7 years ago
You've noticed, though, that the AI con is on. This damages your work, as people get burned, and it will bring about the second "AI winter".

People making big decisions with a lot of money around computing know nothing about it and are marks for con artists. Think big consulting firms selling to senior public servants in Washington. "For a successful technology, reality must take precedence over public relations." But reality just gets in the way when conning a mark for a successful snake oil sale, right?

Call it out, publicly, and cite your credentials. Encourage colleagues, your competition, and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.

mlthoughts2018 · 7 years ago
Hmmm. I also work in AI, in fact professionally in information retrieval and NLP, and I disagree strongly with what you say. Basic topic summarization and keyword / named entity extraction on unstructured sources of text work reasonably well. It's easy to adapt BERT and GPT to smaller problems, and language classification is borderline totally solved by extremely easy-to-train neural network models.

I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but you make it sound like virtually nothing can be done in a production-grade manner that solves real-world unconstrained NLP problems. This is manifestly false.

chrisdsaldivar · 7 years ago
We had this in my school for 8th and 9th grade, so 2008-2010. We had to type the essays in class and submit them by the end of the hour. I would only get maybe three paragraphs in before time was up, because I was trying to build a strong argument for the prompts. Despite that, I would usually get 3-4/6, and my teacher said she would read the essays and regrade them, but she never actually did. My friend literally copy-pasted the Pledge of Allegiance 20-30 times and scored a perfect 6/6. Later we found out that if you repeated the words in the writing prompt you would get a guaranteed 5/6, and with a high enough word count you'd get 6/6. The essays were all bullshit and just a way for the teachers to get an extra free period once a week.
xkcd-sucks · 7 years ago
I totally agree that "AI" grading is total bullshit. But I also have plenty of experience teaching/TAing large courses, and after reading too many essays they all become semantically saturated meaninglessness. One cannot help but skim them and grade according to a few quick heuristics. At that point one tries to be self-consistent and defensible in one's grading, but careful consideration is right out. I suspect state graders are dealing with way more than 100 essays per person and are probably on a tight schedule too. It's quite possible that an ML model is better than an exhausted human grader, as their cognitive strategies are mostly identical.
kwhitefoot · 7 years ago
The solution isn't to do a better job at grading 'meaninglessness' but to stop requiring the production of it in the first place.

One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.

If I were a conspiracy theorist I'd attribute this to wanting a dumbed-down population. Unfortunately, I think it is probably the other way round: the population is already dumbed down, and a belief in AI unicorns is the result.

As Euclid reportedly said to Ptolemy, 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator, and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.

nefitty · 7 years ago
I remember that when I was in middle school 16 years ago, my English classes would have us submit some of our work to a web app, which would then grade the submission. I remember this distinctly because I asked my teacher to intervene on at least two occasions. The app failed to recognize the words "squirrelly" (as in "That guy in the corner has been acting squirrelly.") and "defragment". My teacher decided to override the app's recommended grades because she, as a human, understood the intent behind my use of those weird words.

To emphasize, this was 16 years ago.

dcolkitt · 7 years ago
> I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.

The reason this isn't the case is that there are very simple metrics that tend to correlate highly with essay quality. It doesn't mean the grading bot is actually evaluating essay quality; it's just looking for properties that are statistically associated with good essays. Remember, at the end of the day, as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers, and vice versa: there may be great spellers who can't write for shit. But by and large, the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject-verb disagreement, or inconsistent tense, are also correlated indicators.

A very simple metric is essay length, especially if it's a timed exam. Good writers tend to have verbal fluidity, with words easily flowing onto paper. They don't struggle converting thoughts to sentences, so they tend to end up with the most words written down within a fixed time period. By and large, the longer a timed essay is, the more likely its actual quality is high.

Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to that of a typical human grader. In some cases the bot will have a closer ranking to a random human grader, than two random human graders will have to each other.
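
A caricature of the approach (purely illustrative; real systems use more features and fitted weights, none of which are public):

    import re

    def features(essay):
        # Surface features that merely correlate with quality.
        words = re.findall(r"[A-Za-z']+", essay)
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        return {
            "n_words": len(words),  # length proxy
            "avg_word_len": sum(map(len, words)) / max(1, len(words)),
            "n_sentences": len(sentences),
        }

    # Made-up weights, standing in for what a regression against
    # human-assigned scores might learn.
    WEIGHTS = {"n_words": 0.01, "avg_word_len": 0.5, "n_sentences": 0.05}

    def grade(essay):
        f = features(essay)
        return sum(WEIGHTS[k] * f[k] for k in WEIGHTS)

    print(grade("Short and clear."))
    print(grade("Prodigious verbiage accumulates supplementary remuneration. " * 20))

The second, longer and more polysyllabic, "essay" outscores the first despite being gibberish.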

The biggest flaw here is Goodwin's law. When the test takers become aware of the kludges that the bots use, they can exploit them. For example, just dump a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much, because the kids who do the most test prep and learn all the tips and tricks are usually high achievers who do well on essays anyway.

bo1024 · 7 years ago
Strongly (but respectfully) disagree with a lot of this!

This is related to current fairness-in-AI discussions. In many cases the basic problem is ML systems leverage correlations for making causal decisions. Here, there is a huge ethical difference between scoring a person based on "is this a good essay" and "do the features of this essay correlate with features of good essays". Just like there is a huge fairness and discrimination difference between "is this person qualified for a loan" and "do the features of this person correlate with features of people who qualify for loans" (algorithmic redlining). Your last sentence has a big discrimination/fairness issue also, since you are testing even more for parental income and parental free time.

6gvONxR4sf7o · 7 years ago
I can't disagree strongly enough.

>Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

This isn't true at all. Imagine you got a B or C on an essay that a human would have given an A to because you wrote it concisely and in plain language, or because you used language that's statistically correlated with being black. Does the fact that this is rare console you? "Sorry, but it's usually very close to the human grader's ranking." Close enough isn't good enough when you get the short end of the stick. "Sorry, you aren't going to get to go to the college you wanted because you use language statistically correlated with poor writing." Or just because you're different, so the statistical correlation doesn't apply to you, you filthy outlier. Just because it's a rare event doesn't make it okay.

In adulthood, this is like hiring or firing based on features statistically correlated with good work. Remember when Amazon rolled out the resume scorer? [0] Sure, it was biased against women, but it was close enough to human scores, so who cares about the internal logic?

>Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing.

At the end of the day, our goal here is to measure good writing. If the bots aren't measuring anything intrinsic to good writing, we shouldn't use them.

https://www.reuters.com/article/us-amazon-com-jobs-automatio...

ptx · 7 years ago
I think you meant Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
mannykannot · 7 years ago
Your last paragraph, and particularly the last sentence, epitomizes what is wrong with your whole thesis: the ultimate goal of the testing (and education itself, for that matter) is not to find people who can "do well on essays"; it is to develop analytical thinking.
rvense · 7 years ago
It is absolutely insane. By no definition does the system understand what is written.

You could ask a student to write an essay taking a firm opinion on some subject, and they could change their standpoint every paragraph, and there's no way these systems would know.

If I were a student, I would be extremely offended at people wasting my time like this.

chucksmash · 7 years ago
I'm surprised people are surprised by it. I guess it just hasn't gotten talked about a lot? When I took the GRE in 2011, the rule was that my essay would be graded by one human and one automated grader, and a second human would become involved if the computer and the human differed by one point or more, iirc.

Maybe nobody really makes a big deal about it because it is pretty much irrelevant anyway. Applicants provide a letter of intent that the grad department people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section, nobody cared.

Spivak · 7 years ago
In a forum of CS people, I'm surprised this is one of the top opinions. Our field is full of super surprising results like this -- that you don't have to actually understand the text beyond basic grammar structures to reasonably accurately predict the score a human would give it.

Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?

cmroanirgo · 7 years ago
I came first in English at my school, many moons ago. Leading up to the finals, I regularly finished ahead of the hard-core English essay people, generally to my amusement. My exam essay responses were generally half the length (sometimes even less) of the prodigious writers'. Although I have an OK vocabulary, I always made sure I made the right choice of word to hit a specific meaning, rather than choosing words with a high syllable count.

I'd find it highly interesting to see what kind of result I'd get using an automated system.

Why?

Because I once asked a teacher (also an examiner) why I got better grades than the others, and the answer surprised me: my answers were generally unique and /refreshingly different/, to the point, not too long, and easy to read.

I suspect that under this new system I'd be an average student. It'd also be interesting to find out, several years down the road, whether the automated system could be gamed at all -- I suspect it could, and that teachers would help students 'maximise' their scores as a result.

mherdeg · 7 years ago
When I hear a result like "software which understands basic grammar structures can predict what grade a human would give an essay" I think my views are roughly:

* 5% - cool, we could make a company that grades essays

* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry

* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions

danenania · 7 years ago
"...that you don't have to actually understand the text at beyond basic grammar structures to reasonably accurately predict the score a human would give it"

That only really shows that the humans they're training on are terrible at grading essays.

munchbunny · 7 years ago
This problem is a first-class demonstration of the difference between "can we?" and "should we?"

The fact that it's being implemented in society is insane because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article I already guessed that it systematically discriminated against certain demographics. Which was in fact what the article claimed.

It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.

jammygit · 7 years ago
Teaching human-to-human communication by removing human input and having computers decide on quality... call me a skeptic. I feel bad for the students. Essay grading was bad enough before this.

Narrowly, for grammar, however - is even that a good thing? It probably helps scale grammar help to more students, but if these tools became ubiquitous in grading and editing, then unique voices would just disappear, and a lot of potentially "great writers" might choose different careers because the machines don't like them.

shkkmo · 7 years ago
Adding further bias against the underprivileged is not "cool". Implementing this while avoiding publicity, and without providing a means to publicly audit the results, is doubly not cool.

It is fine to play with "cool" techniques when you are doing consequence-free stuff like playing Mario. When you are creating systems that have significant, long-term effects on people's lives, a different standard applies.

peteradio · 7 years ago
Based on the title alone, how would you feel if you were given bad marks due to a flawed black box?
strken · 7 years ago
This is sort of like discovering the Excel spreadsheet at the heart of a system responsible for handling hundreds of millions of dollars of transactions for your bank.

Yeah, it's cool, but what about your savings account?

RcouF1uZ4gsC · 7 years ago
Unlike a multiple-choice test, where the primary audience is automated graders, the primary audience for an essay is other humans. If even Google and Facebook, with their billions of dollars and billions of posts' worth of data, still cannot always understand the intent and purpose of written content, what hope do these algorithms have?

If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests. Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.

anigbrowl · 7 years ago
If it is cost-prohibitive, then maybe we should adjust the economic model, not abandon the measurement.
rocqua · 7 years ago
Sure, have fewer essay test questions, and start grading them for content, not form.

If you want to grade on form, to test the ability to write correct rather than coherent sentences, make those separate questions, and mark them as such.

Spivak · 7 years ago
I mean, that's what the automated grading systems are trying to do, but it seems like people don't like them very much.
saagarjha · 7 years ago
> If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests.

Apparently it is. But everyone still wants writing to be assessed…

risubramanian · 7 years ago
The SAT dropped the writing section a few years ago, and many schools don't care about the GRE writing score.
iliketosleep · 7 years ago
> Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.

I agree, this is traditionally the purpose of an essay. But to play devil's advocate, consider the rising number of people who are writing SEO or ASO content which is actually targeted at machines.

jakear · 7 years ago
“In most machine scoring states, any of the randomly selected essays with wide discrepancies between human and machine scores are referred to another human for review”.

And “between 5 to 20 percent” of essays are randomly selected for human review.

So the takeaway is that if the machine scored you dramatically lower (as it apparently tends to do if you're black or female), and you're among the 80-95% not selected for human review, your educational future is systematically fucked and you have no knowledge of why or how to change it.

Absolutely reprehensible. Anyone involved in the creation or adoption of these systems should be ashamed.

kazinator · 7 years ago
The thing is, you could be similarly screwed by a biased human whose grading is not checked by a less biased human.

At least the machines offer the following hope: even if unbiased humans are rare among paper-grading teachers, those humans can be used to train the machines, so then bias-free or lower-bias grading becomes more ubiquitous.

Basically, the system has the potential for systematically identifying and reducing systematic bias. A computer program can be retrained much more readily than nation-wide army of humans. Humans can be given a lecture on bias, and then they will just return to their ways.

gibolt · 7 years ago
AI has a lot more potential for bias than humans. It depends on the input data, which is likely heavily biased, judging by results on other data sets, like face detection. It will only amplify any small bias present in the data.

IanSanders · 7 years ago
>Anyone involved in the creation or adoption of these systems should be ashamed

That's the problem - there is seemingly no shame these days. The people involved "saved time and money", got paid, and that's it. "If I didn't do it, someone else would" and all of that.

sjg007 · 7 years ago
Weapons of Math Destruction talks about this.

rynomad · 7 years ago
Personal anecdote;

I remember taking a standardized test, can't remember if it was SAT or CSAT (Colorado pre-SAT test). This was at a time when I'm confident that humans were the graders.

I started with an intro that would be appropriate for a standard 5-paragraph essay, i.e., the thing you write when you don't know what you're talking about and you're just following a format.

In the third paragraph I took a leaf from Family Guy and just interjected, "WAFFLES, NICE CRISPY WAFFLES, WITH LOTS OF SYRUP." For the next page and a half, I berated the very foundation of the essay prompt, insulting it the way only an angst-ridden early teen can.

... I got a 98% on the essay.

Fast forward several years. I write an essay for an introductory college course final. My paper is returned to me with a coffee stain and a "94% - good work!" note scribbled on the top. That note was scribbled by a TA who would turn out to be my girlfriend for two years. One night in bed, she tilts her laptop toward me, showing an article that I had used as the central theme of the above essay: "Can you believe this?"

"Are you joking? Of course I can believe this, it was the subject of the essay you gave me an A on 2 years ago"

She admitted she didn't read past the first paragraph of anything she graded, and just based grades on intuition about how articulate the essays were at the outset.

...

The point I'm making:

Does AI suck at judging the amount of informative content in a student essay? YES

Do humans suck at judging the amount of informative content in a student essay? ALSO YES

dlkf · 7 years ago
This is a great example of why it's grossly irresponsible for members of the ML community to talk about how AGI is just around the corner. In addition to the fact that we have no idea whether this is true, it primes a naive public to believe that technologies like this are worth the tradeoff.

"People worry that computers will get too smart and take over the world, but the real problem is that they're too stupid and they've already taken over the world."

empath75 · 7 years ago
I imagine that any student who experimented with the form of the essay, or wrote an exceptionally well-argued piece in simple language, would not have their test graded appropriately either.

Any essay writing test which could be adequately graded by a machine is not testing anything of value.

Edit: I’ll further add that as soon as people’s careers depend on a metric, the metric becomes useless as a metric, because it will be gamed and manipulated by everyone involved. Almost nobody involved is incentivized to accurately measure student’s writing ability.

kazinator · 7 years ago
I think machines could be valuable in giving feedback on writing, like grammarly.com does.

A lot of what students write is actually garbage from that point of view. Even if they happen to have a good basic idea about what they want to say, the point of essay writing is to master the mechanics of expression so that you get the idea across effectively.

Whether the student has a brilliant idea isn't even so important, and it wouldn't even be fair; imagine if high school computer science expected students to turn in a best-selling app for a term project. Not everyone can come up with something brilliant to say; and even relatively mundane lines of reasoning can be given a good treatment in writing to develop the skill.

I remember that when I had essays graded in school, a lot of the comments were low-grade fluff like "run-on sentence", "wrong word", "faulty parallelism", "missing colon before 'for example'", and other such points having nothing to do with whether the content was original, well-considered, and well-argued. That sort of thing might as well be done by machine, at least as a preprocessing step to improve a student's rough draft.

HarryHirsch · 7 years ago
> Almost nobody involved is incentivized to accurately measure students' writing ability

It's the same reason you see keyword posters in math education. "Together" means "plus", that kind of thing. It's completely worthless, except for one-step problems, and even then it doesn't always work. What is happening is collusion between teachers and testmakers. You can't teach understanding, but you can teach test-passing techniques because the way the test is set permits this.

You see the same thing here: in English, you can get away with not teaching quality writing if you teach techniques to score well.

Spivak · 7 years ago
I feel like the mistake is assuming that essay writing is about the content. It's just a thing to give the student something barely non-trivial to write about.

When your essays are graded, they're marked down for mechanical and wording problems. There's really no point in trying to grade 'good ideas' in a piece on a subject you had maybe 10 minutes to skim.

topologistics · 7 years ago
If I have 3 left shoes colored blue, green, and red, and you have 2 right shoes colored black and white, how many pairs can we make if our lefts and rights are put together?

Hint: together does not mean plus.

rocqua · 7 years ago
There is value in the ability to produce correct English 'off the cuff'. You could argue essays are the best way to get students to produce off-the-cuff written text. Hence, it makes some sense to ask students for essays and then judge those essays only on form.

However, it is rather important that students know their essays are not being judged as essays, but only on form. Otherwise you teach students that form trumps content in essays.

When judging an essay as an essay, correct English barely matters. What matters is how convincing you are and how interesting a read the essay is. This is a great skill to have, and testing it also makes sense. Really, though, we should separate these two forms of testing.