The piece makes a basic measurement mistake: it assumes that all variability is meaningful variability.
There are ways of making the argument they're trying to make, but they're not doing that.
Also, sometimes a single overall score is useful. A better analogy than the cockpit analogy they use is clothing sizing. Yes, tailored shirts, based on detailed measurements of all your body parts, fit awesome, but for many people, small, medium, large, x-large, and so forth suffice.
I think there's a lesson here about reinventing the wheel.
I appreciate the goals of the company and wish them the best, but they need a psychometrician or assessment psychologist on board.
We aren't trying to make a rigorous statement here -- we're trying to draw attention to the fact that the most common metrics do not give much insight into what a student has actually shown mastery of. This is especially important when you consider that the weightings of particular questions are often fairly arbitrary.
I certainly agree that all variability is not meaningful variability, but I'd push back a bit and say that there's meaningful variability in what's shown here. We'll go into more depth and hopefully have something interesting to report.
I've also seen a fair number of comments stating that this is not a surprising result. I'd agree (if you've thought about it), but if you look at what's happening in practice, it's clear that many people either would be surprised by this or are at least unable to act on it. We're hoping to help with the latter.
This kind of data is commonly modeled using item response theory (IRT). I suspect that even in data generated by a unidimensional IRT model (which they are arguing against), you might get the results they report, depending on the level of measurement error in the model.
Measurement error is the key here, but it is not considered in the article. That, plus setting an unjustified margin of 20% around the average, is very strange. An analogous situation would be criticizing a simple regression by looking at how many points fall X units above/below the fitted line, without explaining your choice of X.
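To make the point concrete, here's a quick simulation sketch of my own (not from the article), assuming a standard unidimensional 2PL model with arbitrarily chosen parameter ranges, and applying the ±20% band to total scores purely for illustration. Even though the data come from a single latent trait, two students with the same total almost never share the same response pattern, and the share of students flagged by the band is driven by the measurement error and the band itself, not by any multidimensional structure:

    # Minimal sketch: simulate a unidimensional 2PL IRT model and check
    # (a) how often students with the same total score answered different items,
    # (b) how many students fall outside an arbitrary +/-20% band around the mean.
    import numpy as np

    rng = np.random.default_rng(0)
    n_students, n_items = 1000, 20

    theta = rng.normal(0, 1, n_students)        # single latent ability
    a = rng.uniform(0.5, 2.0, n_items)          # discrimination (assumed range)
    b = rng.normal(0, 1, n_items)               # difficulty (assumed range)

    # P(correct) under the 2PL model, then Bernoulli draws = measurement error
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
    responses = rng.random((n_students, n_items)) < p
    scores = responses.sum(axis=1)

    # (a) students with identical totals rarely share identical response patterns
    diff_pattern, pairs = 0, 0
    for s in np.unique(scores):
        idx = np.flatnonzero(scores == s)
        for i in range(len(idx)):
            for j in range(i + 1, len(idx)):
                pairs += 1
                if not np.array_equal(responses[idx[i]], responses[idx[j]]):
                    diff_pattern += 1
    print(f"same total, different pattern: {diff_pattern}/{pairs} pairs")

    # (b) the fraction flagged by a +/-20% band depends on the band and the noise,
    # even though these data are strictly unidimensional
    pct = scores / n_items
    print("outside +/-20% of mean:", np.mean(np.abs(pct - pct.mean()) > 0.20))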
The main point of this post is to highlight that the most common metric of student performance may not be that useful. Most of the time, students will get their score, the average score, and sometimes a standard deviation as well. As jimhefferon mentioned in a response to a different comment, the conventional wisdom is that two students with the same grade know roughly the same stuff, and that doesn't seem to be true.
We're hoping to build some tools here to help instructors give students a better experience by helping them cater to the different groups that are present.
disclaimer: I'm one of the founders of Gradescope.
https://en.m.wikipedia.org/wiki/University_of_California_fin...
If I recall correctly, computers have been grading GMAT essays for at least 10 years, but they had to keep a human in the loop, because a human grader is the ultimate measure of whether a computer is "correct" in grading an essay.
The automated essay grading stuff typically looks at writing style more than content, but it's true that it's a problem that tons of people have worked on and there's been some cool progress there as well. We're not really working on bringing AI to essay grading ourselves though.
But code review is more than just reviewing diffs. I need to test the code by actually building and running it. How does that critical step fit into this workflow? If the async runner stops after it finishes writing code, do I then need to download the PR to my machine, install dependencies, etc. to test it? That's a major flow blocker for me and defeats the entire purpose of such a tool.
I was planning to build always-on devcontainers on a bare-metal server. So after Claude Code does its thing, I have a live, running version of my app to test alongside the diffs. Sort of like Netlify/Vercel branch deploys, but with a full-stack container.
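Roughly the shape of what I mean, as a sketch only (placeholder names, ports, and paths; it assumes Docker and a Dockerfile in the repo, and uses plain docker rather than the devcontainer CLI to keep it short):

    # Sketch: build and run one preview container per branch, so each
    # Claude Code PR gets a live environment next to its diff.
    import subprocess

    def deploy_branch_preview(repo_path: str, branch: str, base_port: int = 8000) -> str:
        tag = f"preview-{branch.replace('/', '-')}"
        port = base_port + (hash(branch) % 1000)  # naive, non-stable port pick for the sketch

        # check out the branch and build an image from the repo's Dockerfile
        subprocess.run(["git", "-C", repo_path, "checkout", branch], check=True)
        subprocess.run(["docker", "build", "-t", tag, repo_path], check=True)

        # replace any previous preview container for this branch, then start a new one
        subprocess.run(["docker", "rm", "-f", tag], check=False)
        subprocess.run(
            ["docker", "run", "-d", "--name", tag, "-p", f"{port}:8000", tag],
            check=True,
        )
        return f"http://localhost:{port}"  # would be a real hostname in practice

    # e.g. deploy_branch_preview("/srv/myapp", "claude/fix-login")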
Claude Code also works far better in an agentic loop when it can self-heal by running tests, executing one-off terminal commands, tailing logs, and querying the database. I need to do this anyway. For me, a mobile async coding workflow needs to have a container running with a mobile-friendly SSH terminal, database viewer, logs viewer, lightweight editor with live preview, and a test runner. Diffs just don't cut it for me.
I do believe that before 2025 is over we will achieve the dream of doing real software engineering on mobile. I was planning to build it myself anyway.