I know there are companies that are highly productive with AI, including ours. However, AI skeptics ask for real studies, and all the ones available now show no real gains.
Many won't care unless you show them an actual study.
So my question is, are there any actual studies about the companies that actually make it work with AI?
The gains are a ~17% increase in individual effectiveness, but at the cost of ~9% extra instability.
In my experience using AI-assisted coding for a bit longer than 2 years, the benefit is close to what DORA reported (maybe a bit higher, around 25%). Nothing close to an average of 2x, 5x, 10x. There's a 10x in some very specific tasks, but also a negative factor in others, as seemingly trivial but high-impact bugs get to production that would normally have been caught very early in development or in code review.
Obviously it depends on what one does. Using AI to build a UI to share cat pictures has a different risk appetite than building a payments backend.
That 17% increase is in self-reported effectiveness. The software delivery throughput only went up 3%, at a cost of that 9% extra instability. So you can build 3% faster with 9% more bugs, if I'm reading those numbers right.
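To make that reading concrete, here is a rough back-of-envelope sketch in Python. The +3% and +9% are the figures quoted above; everything else is an assumption for illustration: that "instability" means a relative increase in the share of changes that fail, and that the baseline failure rate is some number you plug in yourself (the 10% default is made up, not taken from any report).

```python
# Back-of-envelope only. The +3% and +9% figures are the ones quoted in the
# comment above; BASELINE_FAILURE_RATE is a made-up assumption, not a number
# from any report.
THROUGHPUT_GAIN = 0.03        # relative increase in changes shipped
INSTABILITY_GAIN = 0.09       # assumed to mean: relative increase in failure rate
BASELINE_FAILURE_RATE = 0.10  # hypothetical: 10% of changes failed before AI


def net_good_change_gain(baseline_failure_rate: float = BASELINE_FAILURE_RATE) -> float:
    """Relative change in the number of changes that ship without failing."""
    good_before = 1.0 - baseline_failure_rate
    failure_rate_after = baseline_failure_rate * (1.0 + INSTABILITY_GAIN)
    good_after = (1.0 + THROUGHPUT_GAIN) * (1.0 - failure_rate_after)
    return good_after / good_before - 1.0


if __name__ == "__main__":
    for rate in (0.05, 0.10, 0.25, 0.50):
        print(f"baseline failure rate {rate:.0%}: "
              f"net gain in good changes {net_good_change_gain(rate):+.1%}")
```

Under these assumptions the net gain in non-failing changes stays positive only while the baseline failure rate is below roughly 24%, and it is smaller than the headline 3% everywhere. The sketch also ignores the cost of cleaning up the failed changes, so treat it as an upper bound on the optimistic reading.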
The question that people are actually interested in, "After adopting this specific AI tool, will there be a noticeable impact on measures we care about?" is not addressed by this model at all, since they do not compare individual respondents' answers over time, nor is there any attempt to establish causality.
Three months ago, with Opus 4.5, I would have said that the productivity improvement was ~10% for my whole team.
I now have to contradict myself: juniors, and even experienced new hires with little domain knowledge, don't improve as fast as they used to. After 8 months, I still have to write new tasks/issues the way I would for someone we just hired. I still catch the same issues we caught in reviews three months ago.
Basically, experience doesn't improve productivity as fast as it used to. On easy stuff it doesn't matter: for things like frontend changes the productivity gains are extremely high, probably 10x, and on specific subjects like red teaming, where a quantity of small tools is better than an integrated solution, I think it can be better than that.
But I'm in a netsec tooling team, we do hard automation work to solve hard engineering issues, and that is starting to be a problem if juniors don't level up fast.
There are genuinely weeks where I go 5x though, and others where I go 0.5x.
You need broad economic measurements, not individual or company-specific ones. And that takes a long time, plus there's a lot of noise in the data right now (war, for example).
We only avoid doing it at scale because it's expensive. In particular if we want the measurement to generalise out of sample.
(In particular in this case, where once we're done, proponents will claim our data is too old to be a useful guide to tomorrow.)
The problem with this is that AI will create worse code that is going to cause more problems in the future, but the measurements won’t take that into account.
If only we could measure teams against themselves, against others, and against some kind of baseline. But we don't, AFAIK.
Unironically, AI evaluating the impact of those lines might be getting close to a metric that would measure output better than having everyone print out their last 6 months of work for the new boss to look at.
I don't know, man, it could just be in my head. I'd better defer judgement, put aside all my own opinions about what happened, and let some researchers with god knows what axe to grind make that decision for me.
Which is the issue with almost all studies and statistics: what they mean depends entirely on what you're measuring.
I can program very, very fast if I only consider the happy path, hard-code everything, and don't bother with things like writing tests, defining types, or worrying about performance under expected scale. It's all much faster right up until the point it isn't, and then it's much slower. AI isn't quite so obviously bad, but it can still turn short-term gains into long-term problems, and the long term is what studies tend to focus on, as the short term doesn't usually require a study to observe.
I think AI is similar to outsourcing staff to cheaper countries, replacing ingredients with cheaper alternatives, and other MBA-style ideas. It's almost always instantly beneficial, but the long-term issues are harder to predict and can have far more varied outcomes, depending on weird specifics of the business.
In practice, arriving at this ideal scenario can be very challenging. Actually feasible experiments will be necessarily narrow, with the expectation that their results can be (roughly) extrapolated outside of their specific experimental setup.
Another valid approach would be to carry out qualitative research, for example a case study. This typically requires the study of one (or a few) developers and their specific contexts in great detail. The idea is that a deep understanding of how one person navigates their work and their tools would provide us with insights that might be related to our specific situation.
Personally, in this particular area, I tend to prefer detailed qualitative accounts of how other developers are working on projects and with tools similar to mine.
But in any case, both approaches are valid and complementary.
Those that can “see” the potential push through the adaptation period, even when it is longer than expected.
Depending on how forward looking a group is, the adaptation costs are a problem, a dilemma, or a completely obvious win.
Yet, external measurements don't distinguish between accumulating, accelerating, flat or fading intermediate value.
--
Avoidance of necessary adaptation, even with no immediate impact, becomes the dual: technical, strategic, or capability debt.
Does that hidden anti-productivity ever get accounted for? Or do maladaptive firms just take their anti-productivity into a hole with them as they fade and die?
A company can operate with high margins while its sales fall off a cliff. Is that just "decreasing quantities" of uniformly "high productivity"?
There is a mountain of things that we reasonably know to be true but haven't done studies on. Is it beneficial for programming languages to support comments? Are regexes error-prone? Does static typing improve productivity on large projects? Is distributed version control better than centralised (lock-based)? Etc.
Also you can't just say "AI improves productivity". What kind of AI? What are you using it for? If you're making static landing pages... yeah obviously it's going to help. Writing device drivers in Ada? Not so much.