> Utah has been using AI as the primary scorer on its standardized tests for several years. “It was a major cost to our state to hand score, in addition to very time consuming,” said Cydnee Carter, the state’s assessment development coordinator. The automated process also allowed the state to give immediate feedback to students and teachers, she said.
Yes, education takes time and costs money. Yes, not educating is both cheaper and faster. Note how the rationalizing ignores the needs of the students and the quality of the education.
I live in Utah and my children have been subjected to this automated essay scoring here. One night I came home from work and my son and wife were both in tears, frustrated with each other and frustrated with the essay scoring which refused to give a high enough score to meet what the teacher said was required, no matter how good the essay was. My wife wrote versions herself from scratch and couldn’t get the required score. When I got involved, I did the same with the same results.
Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not. Random unrelated sentences pasted in the middle would increase the score. We found a letter of petition online for banning automated scoring for the purposes of grades or student evaluation of any kind. It was very long, so it got a perfect score. I encouraged my son to submit it, and he did. Later I visited his teacher to explain and to urge her to not use automated scoring. She listened and then told me about how much time it saves and how fast students get feedback. :/
Frankly, I can't believe what I am reading. The idea that some "AI" grades essays automatically is idiotic and has nothing to do with education. Where is the place for discussion? Where is the place for the confrontation of ideas? Where is the place for developing a writing style? How is this AI supposed to grade things like repetition (which can be either a good rhetorical tool or a mistake, depending on context)?
Who the hell came up with such an idea? I would hesitate to use "AI" even for automatic spell checking, since it is enough to give some character an unusual name for it to be marked as an error.
My guess is that sooner or later people will learn how to game that AI. I wouldn't be surprised if there were software that generates essays the Utah "AI" likes.
I'd guess this is a product of dwindling state finances and contempt for any form of real education. AIs are orders of magnitude cheaper than real teachers. They also don't form unions and wouldn't voice any opposition to changes in the curriculum.
They are also pretty useless, as you have pointed out. The consequences of this policy will be postponed until the students reach a certain age -- that'll be like 10-15 years in the future.
> My guess is that soon or later people will learn how to game that AI.
To be fair, the GP here specifically describes how he gamed the AI by pasting in a critique of the AI; his kid submitted it of his own accord, and it was graded without comment. Then, when the GP went in to point out the gaming, the teacher not only didn't care that the AI had been gamed, but expressed gratitude for the hours of work the AI saves, still ignoring that it fundamentally made things worse, all at the expense of the entire point of being a teacher in the first place.
The issue, for the teacher, is that in 'the system' in which they collect a paycheck, the AI works flawlessly. The point, for the teacher, is not to educate children. It is to have assignments that children pass with some sort of distribution that can be sent in and calculated by some person in a beige suit, wide tie, and hair troubles. The difference is subtle at first, but by the point where the GP is sitting, the difference is comical.
The AI allows the teacher to increase their efficiency in processing assignments, ones that never really mattered to the teacher in the first place. In valley-speak: the incentives are not aligned.
I can't believe it either; it's completely ridiculous. They're basically claiming to have developed a general AI. It's like some part of the population is living in a different fantasy world and making policy decisions accordingly.
I agree with you wholeheartedly, but I think there's a stronger argument to be made here: the algorithms being used "work" only through correlation, and only while students are ignorant of the scoring metric. If the students under test knew even sketchily how the system worked (e.g., points deducted if your average sentence length exceeds 7 words, points added if your word-length stddev is greater than 2), and could meaningfully push their scores up by focusing on proxies that don't _actually_ measure what a human would call quality work, or could even get gibberish[0] rated highly, then the whole thing is a fraud. No one will stand for a grading system that only works by virtue of obscurity.
It is classic bean-counter thinking at its worst: the stereotype of an MBA minimizing cost beyond all reason and cutting corners, even when it saws off the branch they sit on.
It is frankly a sign of a diseased culture to use it in any capacity except an exercise to improve AI.
When I was a child I was obsessed with the "grade level" function in Microsoft Word. It was a preference you could enable on spell check to tell you the "grade level" of your writing.
Every essay I wrote, I'd always force myself to reach the max "12.0" grade level. While writing I'd struggle over word choice, sentence structure, rearranging paragraphs, working on my tone etc, all in pursuit of the 12th grade way to phrase things. All my revisions were subject to the approval of the Grade Level checker.
Whenever I could, I would check the grade levels of my friends' writing, usually by showing them a "neat feature" they could enable. Then I'd smugly applaud myself for being the better writer whenever their grade level was below 12.0.
The Grade Level feature fascinated me, and to try to master it, I found a book about Microsoft Word and looked through it in a bookstore. I was absolutely gobsmacked at how simple the formula was. I had childishly been expecting something sophisticated, perhaps like what Utah educators imagine they have. I genuinely expected the method to be complex beyond my understanding.
Instead, Word used a variant of Flesch-Kincaid. There was a direct relationship between sentence length and grade score, and polysyllabic words and grade score. Meaning, the longer your sentences and words, the higher your grade score.
As soon as I got home from the bookstore I loaded a draft of something I had written. It was "pre-12.0" writing from me. I simply deleted all the periods but one and checked again. 12.0.
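The formula really is that simple to sketch. Here is a minimal, toy version of the Flesch-Kincaid grade-level calculation (with a crude vowel-group heuristic standing in for real syllable counting, so the numbers are only illustrative). It reproduces exactly the trick above: delete the periods, and the grade jumps.

```python
import re

def count_syllables(word):
    # Crude heuristic: count vowel groups; real syllabification is messier.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # Flesch-Kincaid grade level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

draft = "I like dogs. Dogs are fun. We play a lot."
merged = draft.replace(".", "") + "."   # same words, all but one period deleted

print(round(fk_grade(draft), 1))   # short sentences: low grade
print(round(fk_grade(merged), 1))  # one long sentence: noticeably higher grade
```

Nothing in the formula looks at meaning at all; only sentence length and word length move the score.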
Automatic grading is a wonderful lure. It's nice to imagine that there's some objective writing quality easy to tap into. At the moment, I think we're far from that ability.
Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.
Revising is a really vital part of writing. Getting more chances to do revision, plus having to write something good enough to show your peers, plus having the risk of any paper count for your grade should compensate for incomplete teacher grading.
The fact that you were literally still a child when this happened, but automated grading is being foisted on us by grown adults who are ostensibly professionals, says a lot about the situation.
> Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.
That's how it's done in creative writing courses. I've always found it infinitely more helpful than only having feedback from the instructor, even if the instructor's feedback was generally more helpful/useful than peer feedback.
Arguably, Hemingway's texts are well written. One of the sources of power of his prose is the use of simple words, and basic sentence structures. I bet Word would classify that as below 12th grade.
The point I am trying to make in agreement with the parent is: there are qualities that are very hard to score with algorithms. The difficulty of solving this problem equals if not exceeds that of automated translation, which still only works properly for specialized and limited domains, e.g. weather forecasts.
It's interesting that the tool (and system) is designed to aid people trying for the opposite result, i.e. for publicists and other authors striving to word their message to be as widely understood as possible.
I went to high school in Utah, long before this automated scoring. It sounds awful but considering the quality of the education I received there perhaps not that bad after all.
My best Utah education anecdote: on the first day of British literature class, the teacher came in and asked, "Does anyone here know what A.D. means?" Someone said "After Death"; she said no. I figured this was my time to shine, so I raised my eager hand and said "Anno Domini, in the year of the Lord." She said no.
Then she announced: "A.D means after the Deluge, and B.C means before Christ".
She also totally lied to me one time about whether she would be considering a particular textbook question as applying to Rosencrantz or Guildenstern.
Anyway, I think that was one of the many classes I got an F in after I stopped going; I would walk past it every day on my way to play chess with my German teacher.
Wow this is pretty shocking. I can understand using automated systems for something like math problems, it makes sense. There’s (usually) one right answer. But essays? This should be banned.
Wait 'til you see a kid in tears because the math answer they submitted was supposed to equal zero, but the algorithms behind the scenes are so bad that the float math failed the equality check.
Note: This is not hyperbole, I have seen this exact scenario more than once.
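That failure mode is easy to reproduce: two expressions that are algebraically zero can differ by a rounding error, so a naive equality check fails where a tolerance check would not. A minimal sketch:

```python
import math

answer = 0.1 + 0.2 - 0.3          # algebraically zero
print(answer == 0.0)              # False: IEEE-754 rounding leaves a residue ~5.5e-17
print(math.isclose(answer, 0.0, abs_tol=1e-9))  # True: compare with a tolerance
```

Any grading backend doing exact float comparison instead of a tolerance check will mark correct answers wrong in exactly the way described.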
There may be a place for a well-designed one, but if it exists, I've never seen it.
Having been forced to use an online math software for all my homework while at school, I vehemently disagree. It was so poor that it became a meme within my year group.
It would mark you as incorrect for using too many decimal places, even though it wouldn't tell you how many significant figures were required. I often remember it marking my answer as incorrect even though it was identical to the answer it gave. Sometimes you'd have to show your working, but it couldn't handle brackets. Once I put the answer as "1+x=y" when they wanted "y-1=x", and it was marked incorrect.
I'm sure academic software design is leaps and bounds above what it was in the early 2000s, but having pupils' futures hinge on what generally seems to be poorly tested code is dangerous.
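The "1+x=y" vs. "y-1=x" failure is avoidable even without a full computer-algebra system. As a hedged toy sketch (my own check, not how any real product works): treat each equation as a zero-form, lhs minus rhs, and test whether the two forms are a constant multiple of each other by sampling random points. That covers simple rearrangements like the one above, though not every algebraic transformation.

```python
import random

def same_equation(f, g, trials=100):
    # f and g map (x, y) to lhs - rhs of an equation. If the ratio f/g is
    # constant across random samples, the two equations describe the same
    # solution set (sufficient for linear rearrangements).
    ratios = []
    for _ in range(trials):
        x, y = random.uniform(-10, 10), random.uniform(-10, 10)
        fv, gv = f(x, y), g(x, y)
        if abs(gv) > 1e-9:
            ratios.append(fv / gv)
    return bool(ratios) and max(ratios) - min(ratios) < 1e-6

submitted = lambda x, y: (1 + x) - y   # student wrote 1 + x = y
expected = lambda x, y: (y - 1) - x    # answer key wanted y - 1 = x
wrong = lambda x, y: (2 + x) - y       # genuinely different equation

print(same_equation(submitted, expected))  # True: same line, just rearranged
print(same_equation(submitted, wrong))     # False
```

The point isn't that this is production-grade; it's that even a few lines of sampling beats string-matching the answer key.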
I have often solved hard math problems with very unconventional approaches (e.g., a geometrical proof for an algebraic problem). Trust me, software is decades away from being able to accurately determine the future of children and massively impact their self-esteem and trust in society.
It is important for the teacher to see where part of the class took the wrong turn, where the students' understanding ended. It is important to distinguish between careless errors, wrong memorizing of a formula and lack of understanding.
> Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not.
Frankly, this does not change anything from my experience in school decades ago. The teachers always said that the length does not matter and we should not pad the papers. However students who wrote more pages got better scores every single time.
Is it possible that the students who wrote shorter papers were in fact presenting incomplete arguments and/or thoughts? Writing clearly and concisely is extremely difficult.
You have automated systems that rate essays without any human actually reading them?
Kids, forget everything you know because crime does indeed pay off. Best grades will be reserved for those that try to cheat this system however it is implemented. Botting your essays is the way to go in the 21st century.
Given that the stack ranking at your future job will also be done by an "AI" (probably developed by the same company that graded your tests), this is a very useful skill to have.
>Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score
Apart from the fact that your story is straight up frightening, isn't this part completely backwards, too? I mean, clearly using more words to convey the same message is /less/ efficient, not more so?
Yes. Exactly. Before I figured out how to game the program, my son and wife were editing shorter. That’s what the instructions said to do. And, that’s also a major strategy for decent writing: brainstorm a lot, then edit down to the good parts. What this means is the software’s scoring is an anti-incentive to good writing. Used as a teaching aid, it’s actually doing pure damage, not good. Not only can it not score reliably, nor provide meaningful feedback, it’s actually actively teaching a very wrong way to write. But it is cheaper than humans, and it does give immediate feedback, so there’s that.
This is a problem I have with a lot of human behavior. Instead of admitting you don’t have the resources to do something or aren’t willing to prioritize it, people come up with a bad version that’s not worth doing. Lots of things are worth doing poorly, but many of them I believe you just need to admit are not worth it unless a certain level of performance is met.
What’s even cheaper than AI? Tell the students to write some pages, have the teacher glance at the number of pages written, give full credit if the mark was met, and throw the papers out without reading them. It sounds like it would be similarly effective and less aggravating. Unfortunately, this would require humility on the part of the educators.
Think from a positive angle, students today are learning useful life skills to game computer systems, which they will have to deal with when they grow up.
edit: ...just like how previous generations have to learn how to game social systems.
Except the algorithm being gamed can change suddenly, drastically and without the gamer's knowledge.
When such changes occur, the gamer will be docked until they can reverse-engineer the new algorithm. There's also the risk that all their previous inputs "gaming" the system might be reconsumed to terrible effect, effectively rewriting their historical performance disastrously.
As always, those with the social standing and power to have insider knowledge or guidance will be in the best position to profit off such systems.
Ho ho. Wait. You mean you were able to submit multiple versions of the essay? So that anyone can basically game the test, by submitting multiple essays until they get the best score they can wring out of it?
I can't even comprehend how someone can use automation for a task like this... It completely goes against human nature. In a world where all jobs have been automated teachers would be the last ones to go before humanity is completely obsolete.
Do you just get to keep submitting the essay to see what score it will get before you turn it in? That sounds like a bigger problem than any of the particulars about how the grading is done.
In this case, there was a limit to the number of times the essay could be submitted, and there was a required score that needed to be obtained within that limit, otherwise the grade would go down. The limit was something like 20 tries, and when I got there they’d already used maybe 14 of them.
I could perhaps see value in having unlimited tries, as a teaching aid, if the result wasn't being used for grading. That would at least leave room for curiosity and exploration. And, more importantly, I could see value if the software wasn't essentially a scam that fundamentally is not able to do what is advertised. If the software really could grade essays reliably, and provide meaningful suggestions for improvement, then maybe it could be used to help educate students, in conjunction with the teacher's guidance. But the software does not grade reliably, it absolutely does not offer meaningful constructive feedback, and the teachers were using it to avoid reading essays, not to supplement their own expertise.
One of the several amusing ironies here is how the software company has convinced the state and teachers to willingly replace themselves with bots, despite obvious evidence that the humans can do the job better.
The instant feedback mechanism is just begging for someone to turn it into a GAN by writing the other half. I would absolutely love to hear that some particularly clever high school student was able to train an ML algorithm to consistently fool the grading algorithm, thus instantly rendering all of their efforts worthless and dragging the administrators through the mud at the same time.
That really sounds like Utah, they have lots of students (due to LDS influences) with a conservative government (ditto), so the pupil/teacher ratio is insane. I can guess the teacher really doesn’t have any other choice.
My mother worked grading standardized tests. It was a hellish job for many reasons (limited breaks, etc.)
One question she had to grade was essentially, "What's something you want your teacher to know about you?"
It was an essay answer, and she was supposed to grade it on grammar, etc. Just the mechanical aspects of writing. (The real question explained the details more, but that was the core of the question.)
She saw answers that would make you weep.
"My daddy touches me."
"I haven't eaten today. I don't know when I'm going to eat again."
Stuff like that.
And my mother was going to be the only human who ever saw their responses. Their teacher had no chance to see their responses, just my mom.
So she goes to her supervisor and asks, "What can we do to help these kids?"
The supervisor said there was nothing you can do. Just grade the answers.
The US has federal child abuse mandatory reporting requirement laws which include teachers and school staff and personnel, as well as additional state requirements which vary but include, for 11 states, faculty, staff, and volunteers at public or private higher education institutions. Computer and IT professionals are also covered in cases.
> Faculty, administrators, athletics staff, or other employees and volunteers at institutions of higher learning, including public and private colleges and universities and vocational and technical schools (11 States).
Some of these will be 100% true as well. But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.
So how many are true and how many false? I have no clue. Literally none. And no it doesn't make me feel any better about the screams of existential agony even if that were a low percentage. Could be high too.
For the not eating, it's pretty easy to get data. It's like 1 in 5 children live in food-insecure households in the US and maybe 1 in 20 of those very insecure, so not eating before school provided lunch is common enough that if you're grading tons of papers you'll run into kids like that.
When I was a high school student, we had some state administered test in health class that tasked us with analyzing advertisements for liquor and tobacco and seeing if we could recognize harmful behavior that the ads might be promoting. This test had no impact on our class grade...
..which means students wrote whatever the hell we wanted. I was assigned a Captain Morgan (rum) ad. I wrote that the ad was glorifying maritime piracy and was likely responsible for pirate activity in Somalia.
Of course some kids are manipulative, going for shock value, continuing an "in-joke", or just plain trolling. But would a teacher just look the other way, or would they talk to the kid? What would you want for your kids? This is why teachers assigning homework like "what do you want your teacher to know about you" and then not even seeing it is dehumanizing.
I don’t know about calling it manipulative. I remember taking the ACT, and struggling to plan out one of my essays. It was something like “tell us about a book that inspired you”. So I changed details about the plot so it all fit nicely and was easy to write. I can see something similar here, where someone takes on a persona when writing in order to effectively communicate.
False accusations can actually be the result of prior abuse. They may substitute one person for another. Or do things as a result of mental illness caused by abuse. Kids think differently to adults and may behave inexplicably. And unfortunately that means that an abused child is a terrible witness.
> But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.
In the US, school funding is based upon standardized test results, and bad results can shut a poorly performing school down.
It's drilled into every kid's head that these tests are very important, super strict and if they accidentally mess up, it can ruin their academics, because retesting and regrading are expensive.
This is my first time learning that AI-graded essays are a thing. Am I the only one who thinks that's insane? I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.
I work in AI, and was very surprised when I heard about this (a few years ago). I don't think anyone who works in the area thinks the tech is ready for this kind of deployment. There is research on the subject [1], and NLP systems can do better than baseline methods, but the error rates are still pretty high.
A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories, or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than research state-of-the-art.
You've noticed, though, that the AI con is on. This damages your work as people get burned, and it will bring about the second "AI winter".
People making big decisions with a lot of money around computing know nothing about it and are marks for con artists. Think of big consulting firms selling to senior public servants in Washington. "For a successful technology, reality must take precedence over public relations." But reality just gets in the way when conning a mark for a successful snake-oil sale, right?
Call it out, publicly, and cite your credentials. Encourage colleagues, your competition, and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.
Hmmm. I also work in AI, in fact professionally in information retrieval and NLP. I disagree strongly with what you say. Basic topic summarization and keyword / named entity extraction on unstructured sources of text works reasonably well. It’s easy to modify BERT and GPT on smaller problems, language classification is borderline totally solved by extremely easy to train neural network models.
I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but you make it sound like virtually nothing can be done in a production-grade manner that solves real-world unconstrained NLP problems. This is manifestly false.
We had this in my school for 8th and 9th grade so 2008-2010. We had to type the essays in class and submit by the end of the hour. I would only get maybe 3 paragraphs in before time was up because I was trying to build a strong argument for the prompts. Despite that I would usually get 3-4/6 and my teacher said she would read the essays and regrade but she never actually did. My friend literally copy and pasted the pledge of allegiance 20-30 times and scored a perfect 6/6. Later we found out if you repeated the words in the writing prompt you would get a guaranteed 5/6 and with a high enough word count you’d get 6/6. The essays were all bullshit and just a way for the teachers to get an extra free period once a week.
I totally agree that "AI" grading is totally bullshit. But I also have plenty of experience teaching/TAing large courses, and after reading too many essays they all become semantically saturated meaninglessness. One cannot help but skim them and grade according to a few quick heuristics. At that point one tries to be self-consistent and defensible in one's grading, but careful consideration is right out. I suspect state graders are dealing with way more than 100 essays per person, probably on a tight schedule too. It's quite possible that an ML model is better than an exhausted human grader, as their cognitive strategies are mostly identical.
The solution isn't to do a better job at grading 'meaninglessness' but to stop requiring the production of it in the first place.
One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.
If I were a conspiracy theorist I'd attribute this to wanting a dumbed down population. Unfortunately I think it is probably the other way round, the population is already dumbed down and a belief in AI unicorns is the result.
As Aristotle said to Alexander: 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.
I remember when I was in middle school 16 years ago, my English classes would have us submit some of our work to a web app. It would then grade the submission. I remember this distinctly because I asked my teacher to intervene on at least two occasions. The app failed to recognize the words "squirrelly" (as in "That guy in the corner has been acting squirrelly.") and "defragment". My teacher decided to subvert the app's recommended grades because she, as a human, understood the intent of my use of those weird words.
> I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.
So the reason this isn't the case, is because there are very simple metrics that tend to highly correlate with essay quality. It doesn't mean the grading-bot is actually evaluating essay quality. It's just looking for properties that are statistically associated with good essays. Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.
A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers, and vice versa: there may be great spellers who can't write for shit. But by and large, the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject-verb disagreement, or inconsistent tense, are also correlated indicators.
A very simple metric is essay length, especially on a timed exam. Good writers tend to have verbal fluidity, with words easily flowing to paper. They don't struggle converting thoughts to sentences, so they tend to end up with the most words written within a fixed time period. By and large, the longer a timed essay is, the more likely its actual quality is high.
Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to that of a typical human grader. In some cases the bot will have a closer ranking to a random human grader, than two random human graders will have to each other.
The biggest flaw here is Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Once test takers become aware of the kludges the bots use, they can exploit them, for example by dumping a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much, because the kids who do the most test prep and learn all the tips and tricks are usually high achievers who do well on essays anyway.
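To illustrate how shallow these proxies can be, here is a toy scorer (entirely hypothetical, not any vendor's actual model) that rates essays on just two of the features described above: length and the fraction of recognizably spelled words. Padded filler beats a short, clean answer:

```python
# Tiny stand-in dictionary; a real system would use a full spellchecker.
KNOWN_WORDS = {"the", "a", "and", "is", "it", "that", "cat", "good", "very"}

def proxy_score(essay, scale=6):
    # Hypothetical proxy scorer: length and spelling, nothing about meaning.
    words = [w.strip(".,!?").lower() for w in essay.split()]
    if not words:
        return 0.0
    length_score = min(len(words) / 200, 1.0)          # longer looks better
    spelling_score = sum(w in KNOWN_WORDS for w in words) / len(words)
    return round(scale * (length_score + spelling_score) / 2, 2)

concise = "The cat is good."
padded = "it is very good that the cat is very very good and " * 30

print(proxy_score(concise))  # correct but short: penalized by the length proxy
print(proxy_score(padded))   # repetitive filler: maxes out both proxies
```

Neither feature touches meaning, yet both correlate with human scores in aggregate, which is exactly why the gaming works.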
Strongly (but respectfully) disagree with a lot of this!
This is related to current fairness-in-AI discussions. In many cases the basic problem is ML systems leverage correlations for making causal decisions. Here, there is a huge ethical difference between scoring a person based on "is this a good essay" and "do the features of this essay correlate with features of good essays". Just like there is a huge fairness and discrimination difference between "is this person qualified for a loan" and "do the features of this person correlate with features of people who qualify for loans" (algorithmic redlining). Your last sentence has a big discrimination/fairness issue also, since you are testing even more for parental income and parental free time.
>Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.
This isn't true at all. Imagine you got a B or C on an essay that a human would have given an A to because you wrote it concisely and in plain language, or because you used language that's statistically correlated with being black. Does the fact that this is rare console you? "Sorry, but it's usually very close to the human grader's ranking." Close enough isn't good enough when you get the short end of the stick. "Sorry, you aren't going to get to go to the college you wanted because you use language statistically correlated with poor writing." Or just because you're different, so the statistical correlation doesn't apply to you, you filthy outlier. Just because it's a rare event doesn't make it okay.
In adulthood, this is like hiring or firing for work statistically correlated with good work. Remember when amazon rolled out the resume scorer? [0] Sure it was biased towards women, but it was close enough to human scores, so who cares about the internal logic?
>Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing.
At the end of the day, our goal here is to measure good writing. If the bots aren't measuring anything intrinsic to good writing, we shouldn't use them.
Your last paragraph, and particularly the last sentence, epitomizes what is wrong with your whole thesis: the ultimate goal of the testing (and education itself, for that matter) is not to find people who can "do well on essays"; it is to develop analytical thinking.
It is absolutely insane. By no definition does the system understand what is written.
You could ask a student to write an essay taking a firm opinion on some subject, and they could change standpoint every paragraph and there's no way these systems would know.
If I was a student I would be extremely offended at people wasting my time like this.
I'm surprised people are surprised by it. I guess it just hasn't gotten talked about a lot? When I took the GRE in 2011, the rule was that my essay would be graded by one human and one automated grader, and a second human would become involved if the computer and the human differed by one point or more, IIRC.
Maybe nobody really makes a big deal about it because it is pretty much irrelevant anyway. Applicants provide a letter of intent that the grad dept people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section nobody cared.
In a forum of CS people I'm surprised this is one of the top opinions. Our field is full of super surprising results like this -- that you don't have to actually understand the text beyond basic grammar structures to reasonably accurately predict the score a human would give it.
Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?
I came first in English for my school, many moons ago. Leading up to the finals, I regularly finished ahead of the hard-core English essay people, generally to my amusement. My exam essay responses were generally half the length of (sometimes even shorter than) those of the prodigious writers. Although I've an OK vocabulary, I always made sure I made the right choice of word to hit a specific meaning, rather than choosing words with a high syllable count.
I'd find it highly interesting to see what kind of result I'd get using an automated system.
Why?
Because, I once asked a teacher (also an examiner) why I got good grades above the others, and the answer surprised me: my answers were generally unique /refreshingly different, to the point/ not too long and easy to read.
I suspect with this new system, I'd be an average student. It'd also be interesting to find out, several years down the road, if the automated system could be gamed at all -- I suspect it could, and teachers would help students 'maximise' their scores as a result of that.
When I hear a result like "software which understands basic grammar structures can predict what grade a human would give an essay" I think my views are roughly:
* 5% - cool, we could make a company that grades essays
* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry
* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions
"...that you don't have to actually understand the text beyond basic grammar structures to reasonably accurately predict the score a human would give it"
That only really shows that the humans they're training on are terrible at grading essays.
This problem is a first class demonstration of the difference between "can we?" and "should we?"
The fact that it's being implemented in society is insane because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article, I already guessed that it systematically discriminated against certain demographics, which is in fact what the article claimed.
It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.
Teaching human-human communication by removing human inputs and having computers decide about quality... call me a skeptic. I feel bad for the students. Essay grading was bad enough before this.
Even narrowly, for grammar: is that a good thing? It probably helps scale grammar help to more students, but if those tools became ubiquitous in grading and editing, then unique voices would just disappear, and a lot of potentially “great writers” might choose different careers because the machines don't like them.
Adding further bias against the underprivileged is not "cool". Implementing this while avoiding publicity or providing a means to publicly audit the results is doubly not cool.
It is fine to play with "cool" techniques when you are doing consequence-free stuff like playing Mario. When you are creating systems that have significant and long-term effects on people's lives, a different standard applies.
This is sort of like discovering the Excel spreadsheet at the heart of a system responsible for handling hundreds of millions of dollars of transactions for your bank.
Yeah, it's cool, but what about your savings account?
Unlike a multiple choice test where the primary audience is automated graders, the primary audience for an essay is other humans. If even Google and Facebook with their billions of dollars and billions of posts worth of data, still cannot always understand the intent and purpose of written content, what hope do these algorithms have?
If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests. Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.
> Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.
I agree, this is traditionally the purpose of an essay. But to play devil's advocate, consider the rising number of people who are writing SEO or ASO content which is actually targeted at machines.
“In most machine scoring states, any of the randomly selected essays with wide discrepancies between human and machine scores are referred to another human for review”.
And “between 5 to 20 percent” of essays are randomly selected for human review.
So the takeaway is that if the machine scored you dramatically lower (as it typically does for black and female students) and you're among the 80-95% not selected for human review, your educational future is systematically fucked and you have no knowledge of why or how to change it.
Absolutely reprehensible. Anyone involved in the creation or adoption of these systems should be ashamed.
The thing is, you could be similarly screwed by a biased human whose grading is not checked by a less biased human.
At least the machines offer the following hope: even if unbiased humans are rare among paper-grading teachers, those humans can be used to train the machines, so then bias-free or lower-bias grading becomes more ubiquitous.
Basically, the system has the potential for systematically identifying and reducing systematic bias. A computer program can be retrained much more readily than a nationwide army of humans. Humans can be given a lecture on bias, and then they will just return to their ways.
AI has a lot more potential for bias than humans. It depends on the training data, which is likely heavily biased, judging by results in other domains like face detection. It will only amplify any small bias present in the data.
>Anyone involved in the creation or adoption of these systems should be ashamed
That's the problem - there is seemingly no shame these days. People involved "saved time and money", got paid and that's it. "If I didn't do it someone else would" and all of that.
I remember taking a standardized test, can't remember if it was SAT or CSAT (Colorado pre-SAT test). This was at a time when I'm confident that humans were the graders.
I started with an intro that would be appropriate for a standard 5 paragraph essay; i.e. the thing you write when you don't know what you're talking about and you're just following a format.
In the third paragraph I took a leaf from Family Guy and just interjected "WAFFLES, NICE CRISPY WAFFLES, WITH LOTS OF SYRUP." For the next page and a half, I berated the very foundation of the essay prompt, insulting it the way only an angst-ridden early teen can.
... I got a 98% on the essay.
Fast forward several years. I write an essay for an introductory college course final. My paper is returned to me with a coffee stain and a "94% - good work!" note scribbled on the top. That note was scribbled by a TA who would turn out to be my girlfriend for 2 years. One night in bed, she tilts her laptop to me, showing an article that I used as the central theme of the above essay; "can you believe this?"
"Are you joking? Of course I can believe this, it was the subject of the essay you gave me an A on 2 years ago"
She admits she doesn't read past the first paragraph of anything she grades, and just bases grades on intuition about how articulate the essays are at the outset.
...
The point I'm making:
Does AI suck at judging the amount of informative content in a student essay? YES
Do humans suck at judging the amount of informative content in a student essay? ALSO YES
This is a great example of why it's grossly irresponsible for members of the ML community to talk about how AGI is just around the corner. In addition to the fact that we have no idea whether this is true, it primes a naive public for believing that technologies like this are worth the tradeoff.
"People worry that computers will get too smart and take over the world, but the real problem is that they're too stupid and they've already taken over the world."
I imagine that any student that experimented with the form of the essay or wrote an exceptionally well argued piece in simple language would not have their test graded appropriately either.
Any essay writing test which could be adequately graded by a machine is not testing anything of value.
Edit: I’ll further add that as soon as people’s careers depend on a metric, the metric becomes useless as a metric, because it will be gamed and manipulated by everyone involved. Almost nobody involved is incentivized to accurately measure students’ writing ability.
I think machines could be valuable in giving feedback on writing, like Grammarly does.
A lot of what students write is actually garbage from that point of view. Even if they happen to have a good basic idea about what they want to say, the point of essay writing is to master the mechanics of expression so that you get the idea across effectively.
Whether the student has a brilliant idea isn't even so important, and it wouldn't even be fair; imagine if high school computer science expected students to turn in a best-selling app for a term project. Not everyone can come up with something brilliant to say; and even relatively mundane lines of reasoning can be given a good treatment in writing to develop the skill.
I remember when I had essays graded in school, a lot of the comments were low-grade fluff like "run on sentence", "wrong word", "faulty parallelism", "missing colon before 'for example'" and such points having nothing to do with the content being original, well-considered and well-argued. That sort of thing might as well be done by machine, at least as a preprocessing step to improve a student's rough draft.
Almost nobody involved is incentivized to accurately measure students’ writing ability
It's the same reason you see keyword posters in math education. "Together" means "plus", that kind of thing. It's completely worthless, except for one-step problems, and even then it doesn't always work. What is happening is collusion between teachers and testmakers. You can't teach understanding, but you can teach test-passing techniques because the way the test is set permits this.
You see the same thing here, in English you can get away with not teaching quality writing if you teach techniques to score well.
I feel like the mistake is assuming that essay writing is about the content. It's just a thing to give the student something barely non-trivial to write about.
When your essays are graded, they're marked down for mechanical and wording problems. There's really no point in trying to grade 'good ideas' on a subject piece you had maybe 10 minutes to skim.
If I have 3 left shoes colored blue, green, and red, and you have 2 right shoes colored black and white, how many pairs can we make if our lefts and rights are put together?
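For what it's worth, the "together means plus" keyword trick fails on exactly this kind of riddle. A quick Python sketch (purely illustrative) comparing the keyword answer with actual enumeration:

```python
from itertools import product

lefts = ["blue", "green", "red"]   # 3 left shoes
rights = ["black", "white"]        # 2 right shoes

keyword_answer = len(lefts) + len(rights)          # "together" -> plus -> 5
combinations = len(list(product(lefts, rights)))   # distinct left/right pairings -> 6
simultaneous = min(len(lefts), len(rights))        # pairs wearable at once -> 2

print(keyword_answer, combinations, simultaneous)  # 5 6 2
```

Whichever reading of the riddle you take, the keyword answer matches neither.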
There is value in the ability to produce correct English 'off the cuff'. You could argue essays are the best way to get students to produce off the cuff written text. Hence, it makes some sense to ask students for essays, and then judge those essays only for form.
However, it is rather important that students know their essays are not judged as essays, but only judged on form. Otherwise you teach students that form trumps content in essays.
When judging an essay as an essay correct English barely matters. What matters is how convincing you are, and how interesting of a read the essay is. This is a great skill to have, and testing it also makes sense. Really though, we should separate these two forms of testing.
Who the hell came up with such an idea? I would even hesitate to use "AI" for automatic spell checking, as it is enough to give some character an unusual name and it will be marked as an error.
My guess is that sooner or later people will learn how to game that AI. I wouldn't be surprised if there were some software that generates essays the Utah "AI" likes.
Already been done. http://lesperelman.com/writing-assessment-robo-grading/babel...
Here's a sample essay that is complete nonsense and got a perfect score on the GRE.
http://lesperelman.com/wp-content/uploads/2015/12/6-6_ScoreI...
I'd guess this is a product of dwindling state finances and contempt for any form of real education. AIs are orders of magnitude cheaper than real teachers. They also don't form unions and wouldn't voice any opposition against changes in the curriculum.
They are also pretty useless, as you have pointed out. The consequences of this policy will be postponed until the students reach a certain age -- that'll be like 10-15 years in the future.
To be fair, the GP here is specifically describing that he gamed the AI via a copy-paste of a critique of the AI; his kid submitted it of his own accord; it was graded without comment; and then, when the GP went in to comment on the gaming of the AI, the teacher not only did not care that the AI was gamed, but expressed gratitude for the AI saving hours of work, still ignoring that the AI fundamentally made things worse, all at the expense of the entire point of being a teacher in the first place.
The issue, for the teacher, is that in 'the system' in which they collect a pay-check, the AI works flawlessly. The point, for the teacher, is not to educate children. It is to have assignments that children pass with some sort of distribution that can be sent in and calculated by some person in a beige suit, wide tie, and hair troubles. The difference is subtle at first, but when you get further along to the point where the GP is sitting, then the difference is comical.
The AI allows the teacher to increase their efficiency in processing assignments, ones that never really mattered to the teacher in the first place. In valley-speak: the incentives are not aligned.
[0] https://www.nytimes.com/2012/04/23/education/robo-readers-us...
It is frankly a sign of a diseased culture to use it in any capacity except an exercise to improve AI.
Every essay I wrote, I'd always force myself to reach the max "12.0" grade level. While writing I'd struggle over word choice, sentence structure, rearranging paragraphs, working on my tone etc, all in pursuit of the 12th grade way to phrase things. All my revisions were subject to the approval of the Grade Level checker.
Whenever I could, I would check the grade levels of my friends' writing - usually by showing them a "neat feature" they could enable. Then I'd smugly applaud myself for being the better writer whenever their grade level was below 12.0.
The Grade Level feature fascinated me, and to try to master it, I found a book about Microsoft Word and looked through it in a bookstore. I was absolutely gobsmacked at how simple the formula was. I had childishly been expecting something sophisticated, like perhaps Utah educators imagine they have. I genuinely expected the method to be complex beyond my understanding.
Instead, Word used a variant of Flesch-Kincaid. There was a direct relationship between sentence length and grade score, and polysyllabic words and grade score. Meaning, the longer your sentences and words, the higher your grade score.
As soon as I got home from the bookstore I loaded a draft of something I had written. It was "pre-12.0" writing from me. I simply deleted all the periods but one and checked again. 12.0.
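For the curious, the standard Flesch-Kincaid grade formula is 0.39*(words per sentence) + 11.8*(syllables per word) - 15.59, so merging sentences inflates the score exactly as described. Here's a rough sketch (the syllable counter is a crude vowel-group heuristic, not whatever Word actually shipped):

```python
import re

def fk_grade(text):
    # Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word):
        # crude heuristic: count vowel groups, minimum one per word
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syl / len(words) - 15.59

draft = "I wrote this. It has short sentences. The score stays low."
merged = draft.replace(".", "") + "."   # delete all the periods but one
print(fk_grade(draft) < fk_grade(merged))  # True: same words, higher grade
```

Only the words-per-sentence term changes, which is why the one-period trick reliably raises the grade.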
Automatic grading is a wonderful lure. It's nice to imagine that there's some objective writing quality easy to tap into. At the moment, I think we're far from that ability.
Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.
Revising is a really vital part of writing. Getting more chances to do revision, plus having to write something good enough to show your peers, plus having the risk of any paper count for your grade should compensate for incomplete teacher grading.
That's how it's done in creative writing courses. I've always found it infinitely more helpful than only having feedback from the instructor, even if the instructor's feedback was generally more helpful/useful than peer feedback.
The point I am trying to make in agreement with the parent is: there are qualities that are very hard to score with algorithms. The difficulty of solving this problem equals if not exceeds that of automated translation, which still only works properly for specialized and limited domains, e.g. weather forecasts.
My best Utah education anecdote - on the first day of British literature class the teacher came in and asked, "Does anyone here know what A.D. means?" Someone said "After Death" - she said no. I figured this was my time to shine, so I raised my eager hand and said "Anno Domini, in the year of the Lord" - she said no.
Then she announced: "A.D. means After the Deluge, and B.C. means Before Christ".
She also totally lied to me one time about whether she would be considering a particular textbook question as applying to Rosencrantz or Guildenstern.
Anyway, I think that was one of the many classes I got an F in after I stopped going and would walk past it every day on my way to play chess with my German teacher.
Note: This is not hyperbole, I have seen this exact scenario more than once.
There may be a place for a well-designed one, but if it exists, I've never seen it.
It would mark you as incorrect for using too many decimal places, even though it wouldn't tell you how many significant figures were required. I often remember it marking my answer as incorrect, even though it was identical to the answer they gave. Sometimes you'd have to show your working, but it couldn't handle brackets. Once I put the answer as "1+x=y" but they wanted the answer "y-1=x", and they marked it as incorrect.
I'm sure academic software design is leaps and bounds above what it was in the early 2000s, but to have pupils' futures hinge on what generally seems to be poorly tested code is dangerous.
Frankly, this does not change anything from my experience in school decades ago. The teachers always said that the length does not matter and we should not pad the papers. However students who wrote more pages got better scores every single time.
Kids, forget everything you know because crime does indeed pay off. Best grades will be reserved for those that try to cheat this system however it is implemented. Botting your essays is the way to go in the 21st century.
Apart from the fact that your story is straight up frightening, isn't this part completely backwards, too? I mean, clearly using more words to convey the same message is /less/ efficient, not more so?
What’s even cheaper than AI? Tell the students to write some pages, have the teacher glance at the number of pages written, give full credit if the mark was met, and throw the papers out without reading them. It sounds like it would be similarly effective and less aggravating. Unfortunately, this would require humility on the part of the educators.
edit: ...just like how previous generations have to learn how to game social systems.
When such changes occur, the gamer will be docked until they can reverse-engineer the new algorithm. There's also the risk that all their previous inputs "gaming" the system might be reconsumed to terrible results as well, effectively rewriting their historical performance disastrously.
As always, those with the social standing and power to have insider knowledge or guidance will be in the best position to profit off such systems.
That is just mad.
I could perhaps see value in having unlimited tries, as a teaching aid, if the result wasn’t being used for grading. That would at least leave room for curiosity and exploration. And, more importantly, I could see value if the software wasn’t essentially a scam that fundamentally is not able to do what is advertised. If the software really could grade essays reliably, and provide meaningful suggestions for improvement, then maybe it could be used to help educate students, in conjunction with the teacher’s guidance. But the software does not grade reliably, and it absolutely does not offer meaningful constructive feedback, and the teachers were using it to avoid reading essays, not to supplement their own expertise.
One of the several amusing ironies here is how the software company has convinced the state and teachers to willingly replace themselves with bots, despite obvious evidence that the humans can do the job better.
One question she had to grade was essentially, "What's something you want your teacher to know about you?"
It was an essay answer, and she was supposed to grade it on grammar, etc. Just the mechanical aspects of writing. (The real question explained the details more, but that was the core of the question.)
She saw answers that would make you weep.
"My daddy touches me."
"I haven't eaten today. I don't know when I'm going to eat again."
Stuff like that.
And my mother was going to be the only human who ever saw their responses. Their teacher had no chance to see their responses, just my mom.
So she goes to her supervisor and asks, "What can we do to help these kids?"
The supervisor said there was nothing you can do. Just grade the answers.
Faculty, administrators, athletics staff, or other employees and volunteers at institutions of higher learning, including public and private colleges and universities and vocational and technical schools (11 States).
https://www.childwelfare.gov/topics/systemwide/laws-policies...
https://www.childwelfare.gov/pubPDFs/manda.pdf
This includes penalties for failure to report in multiple states:
https://www.childwelfare.gov/topics/systemwide/laws-policies...
So how many are true and how many false? I have no clue. Literally none. And no it doesn't make me feel any better about the screams of existential agony even if that were a low percentage. Could be high too.
...which means students wrote whatever the hell we wanted. I was assigned a Captain Morgan (rum) ad. I wrote that the ad was glorifying maritime piracy and was likely responsible for pirate activity in Somalia.
In the US, school funding is based upon standardized test results, and bad results can shut a poorly performing school down.
It's drilled into every kid's head that these tests are very important, super strict and if they accidentally mess up, it can ruin their academics, because retesting and regrading are expensive.
So do what? Contact her local police?
With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?
And people don't work at a job like that because they want to. They work there because they need the money.
Everything she took in and out of there was monitored, too. So it's not like she can go to the Xerox, and walk out of there with a copy.
It's beyond dehumanizing. For everyone. The kid, the people who work there.
A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories, or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than research state-of-the-art.
[1] e.g. https://www.aclweb.org/anthology/P15-1053
People making big decisions with a lot of money around computing know nothing about it and are marks for con artists. Think big consulting firms selling to senior public servants in Washington. "For a successful technology, reality must take precedence over public relations." But reality just gets in the way when conning a mark for a successful snake oil sale, right?
Call it out, publicly; cite your credentials. Encourage colleagues, your competition, and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.
I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but you make it sound like virtually nothing can be done in a production-grade manner that solves real-world unconstrained NLP problems. This is manifestly false.
One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.
If I were a conspiracy theorist I'd attribute this to wanting a dumbed down population. Unfortunately I think it is probably the other way round, the population is already dumbed down and a belief in AI unicorns is the result.
As Euclid said to Ptolemy, 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator, and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.
To emphasize, this was 16 years ago.
So the reason this isn't the case, is because there are very simple metrics that tend to highly correlate with essay quality. It doesn't mean the grading-bot is actually evaluating essay quality. It's just looking for properties that are statistically associated with good essays. Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.
A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers. And vice versa, there may be great spellers who can't write for shit. But by and large, the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject-verb disagreement, or inconsistent tense, are also correlated indicators.
A very simple metric is essay length, especially if it's a timed exam. Good writers tend to have verbal fluidity, with words easily flowing to paper. They don't struggle converting thoughts to sentences. So they tend to end up with the most words written down within a fixed time period. By and large, the longer a timed essay is, the more likely its actual quality is high.
Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to that of a typical human grader. In some cases the bot will have a closer ranking to a random human grader, than two random human graders will have to each other.
The biggest flaw here is Goodhart's law. When the test takers become aware of the kludges that the bots use, they can exploit them. For example, just dump a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much, because the kids who do the most test-prep and learn all the tips and tricks are usually high-achievers who do well on essays anyway.
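To make the parent's point concrete, here is a toy scorer in that spirit (purely illustrative; no vendor uses this exact model): it rewards length and penalizes words missing from a lexicon, so padding with correctly spelled filler raises the score.

```python
def surface_score(essay, lexicon):
    """Toy grader that scores by correlates of quality, not by meaning:
    more words is better; words outside the lexicon ('misspellings') are worse."""
    words = [w.strip(".,!?").lower() for w in essay.split()]
    misspelled = sum(1 for w in words if w not in lexicon)
    raw = 1 + len(words) / 50 - 0.5 * misspelled
    return max(0.0, min(6.0, raw))   # clamp to a 0-6 rubric scale

lexicon = {"this", "essay", "argues", "a", "clear", "point",
           "waffles", "nice", "crispy", "with", "lots", "of", "syrup"}
short = "this essay argues a clear point"
padded = short + " waffles nice crispy waffles with lots of syrup" * 10
print(surface_score(short, lexicon) < surface_score(padded, lexicon))  # True
```

Nothing in the score reflects whether the essay says anything at all, which is exactly the exploit described above.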
https://www.reuters.com/article/us-amazon-com-jobs-automatio...
You could ask a student to write an essay taking a firm opinion on some subject, and they could change standpoint every paragraph and there's no way these systems would know.
If I were a student, I would be extremely offended at people wasting my time like this.
Maybe nobody really makes a big deal about it because it is pretty much irrelevant anyway. Applicants provide a letter of intent that the grad dept people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section nobody cared.
Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?
I'd find it highly interesting to see what kind of result I'd get using an automated system.
Why?
Because I once asked a teacher (also an examiner) why I got better grades than the others, and the answer surprised me: my answers were generally unique and refreshingly different, to the point, not too long, and easy to read.
I suspect with this new system, I'd be an average student. It'd also be interesting to find out, several years down the road, if the automated system could be gamed at all -- I suspect it could, and teachers would help students 'maximise' their scores as a result of that.
* 5% - cool, we could make a company that grades essays
* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry
* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions
That only really shows that the humans they're training on are terrible at grading essays.
The fact that it's being implemented in society is insane, because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article, I already guessed that it systematically discriminates against certain demographics, which is in fact what the article claims.
It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.
Even narrowly for grammar, though, is that a good thing? It probably helps scale grammar feedback to more students, but if those tools became ubiquitous in grading and editing, unique voices would simply disappear, and a lot of potentially “great writers” might choose different careers because the machines don't like them.
It is fine to play with "cool" techniques when you are doing consequence-free stuff like playing Mario. When you are creating systems that have significant, long-term effects on people's lives, a different standard applies.
Yeah, it's cool, but what about your savings account?
If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests. Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.
If you want to grade on form to test the ability to write correct rather than coherent sentences, make those separate questions, and mark them so.
Apparently it is. But everyone still wants writing to be assessed…
I agree, this is traditionally the purpose of an essay. But to play devil's advocate, consider the rising number of people who are writing SEO or ASO content which is actually targeted at machines.
And “between 5 to 20 percent” of essays are randomly selected for human review.
So the takeaway is: if you're in the 80-95% of essays that never get human review, and the machine scored you dramatically lower (as it typically does for black or female students), your educational future is systematically fucked and you have no knowledge of why or how to change it.
Absolutely reprehensible. Anyone involved in the creation or adoption of these systems should be ashamed.
At least the machines offer the following hope: even if unbiased humans are rare among paper-grading teachers, those humans can be used to train the machines, so then bias-free or lower-bias grading becomes more ubiquitous.
Basically, the system has the potential to identify and reduce systematic bias. A computer program can be retrained much more readily than a nationwide army of humans. Humans can be given a lecture on bias, and then they will just return to their ways.
That's the problem - there is seemingly no shame these days. People involved "saved time and money", got paid and that's it. "If I didn't do it someone else would" and all of that.
I remember taking a standardized test, can't remember if it was SAT or CSAT (Colorado pre-SAT test). This was at a time when I'm confident that humans were the graders.
I started with an intro that would be appropriate for a standard 5 paragraph essay; i.e. the thing you write when you don't know what you're talking about and you're just following a format.
In the third paragraph I took a leaf from Family Guy and just interjected "WAFFLES, NICE CRISPY WAFFLES, WITH LOTS OF SYRUP." For the next page and a half, I berated the very foundation of the essay prompt, insulting it the way only an angst-ridden early teen can.
... I got a 98% on the essay.
Fast forward several years. I write an essay for an introductory college course final. My paper is returned to me with a coffee stain and a "94% - good work!" note scribbled on the top. That note was scribbled by a TA who would turn out to be my girlfriend for 2 years. One night in bed, she tilts her laptop to me, showing an article that I used as the central theme of the above essay: "can you believe this?"
"Are you joking? Of course I can believe this, it was the subject of the essay you gave me an A on 2 years ago"
She admits she doesn't read past the first paragraph of anything she grades, and just grades on intuition, based on how articulate the essays are at the outset.
...
The point I'm making:
Does AI suck at judging the amount of informative content in a student essay? YES
Do humans suck at judging the amount of informative content in a student essay? ALSO YES
"People worry that computers will get too smart and take over the world, but the real problem is that they're too stupid and they've already taken over the world."
Any essay writing test which could be adequately graded by a machine is not testing anything of value.
Edit: I’ll further add that as soon as people’s careers depend on a metric, the metric becomes useless as a metric, because it will be gamed and manipulated by everyone involved. Almost nobody involved is incentivized to accurately measure students’ writing ability.
A lot of what students write is actually garbage from that point of view. Even if they happen to have a good basic idea about what they want to say, the point of essay writing is to master the mechanics of expression so that you get the idea across effectively.
Whether the student has a brilliant idea isn't even so important, and it wouldn't even be fair; imagine if high school computer science expected students to turn in a best-selling app for a term project. Not everyone can come up with something brilliant to say; and even relatively mundane lines of reasoning can be given a good treatment in writing to develop the skill.
I remember when I had essays graded in school, a lot of the comments were low-grade fluff like "run on sentence", "wrong word", "faulty parallelism", "missing colon before 'for example'" and such points having nothing to do with the content being original, well-considered and well-argued. That sort of thing might as well be done by machine, at least as a preprocessing step to improve a student's rough draft.
It's the same reason you see keyword posters in math education. "Together" means "plus", that kind of thing. It's completely worthless, except for one-step problems, and even then it doesn't always work. What is happening is collusion between teachers and testmakers. You can't teach understanding, but you can teach test-passing techniques because the way the test is set permits this.
You see the same thing here, in English you can get away with not teaching quality writing if you teach techniques to score well.
When your essays are graded, they're marked down for mechanical and wording problems. There's really no point in trying to grade 'good ideas' on a subject piece the grader had maybe 10 minutes to skim.
Hint: together does not mean plus.
However, it is rather important that students know their essays are not being judged as essays, but only on the mechanics. Otherwise you teach students that form trumps content in essays.
When judging an essay as an essay, correct English barely matters. What matters is how convincing you are, and how interesting a read the essay is. This is a great skill to have, and testing it also makes sense. Really, though, we should separate these two forms of testing.