This is a great write-up and I love all the different ways they collected and analyzed data.
That said, it would have been much easier and more accurate to simply put each laptop side by side and run timed compilations of the exact same scenarios: a full build, an incremental build of a recent change set, an incremental build that touches a module which must be rebuilt, and a couple more.
Or write a script that steps through the last 100 git commits, applies them incrementally, and does a timed incremental build to get a representation of incremental build times for actual code. It could be done in a day.
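For what it's worth, that replay harness can be tiny. Here's a minimal Python sketch under stated assumptions: the build command (go build ./...) and the 100-commit window are placeholders rather than anything from the article, and it simply checks out each commit oldest-first and times the incremental build that follows. Run it on each laptop from the same starting commit and you get directly comparable per-commit timings.

    """Replay the last 100 commits and time each incremental build.

    Minimal sketch: assumes a clean working tree and that BUILD_CMD is your
    real (incremental) build command.
    """
    import subprocess
    import time

    REPO = "."                              # path to the repo under test (assumption)
    BUILD_CMD = ["go", "build", "./..."]    # placeholder build command

    def run(args):
        subprocess.run(args, cwd=REPO, check=True, capture_output=True)

    # The last 100 commits on the current branch, oldest first.
    commits = subprocess.run(
        ["git", "rev-list", "--reverse", "-n", "100", "HEAD"],
        cwd=REPO, check=True, capture_output=True, text=True,
    ).stdout.split()

    run(["git", "checkout", commits[0]])
    run(BUILD_CMD)                          # warm the build cache at the starting commit

    timings = []
    for sha in commits[1:]:
        run(["git", "checkout", sha])       # step forward one change set
        start = time.monotonic()
        run(BUILD_CMD)                      # timed incremental build
        timings.append((sha, time.monotonic() - start))

    for sha, secs in timings:
        print(f"{sha[:10]}  {secs:6.1f}s")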
Collecting company-wide stats leaves the door open to significant biases. The first that comes to mind is that newer employees will have M3 laptops while the oldest employees will be on M1 laptops. While not a strict ordering, newer employees (with their new M3 laptops) are more likely to be working on smaller changes while the more tenured employees might be deeper in the code or working in more complicated areas, doing things that require longer build times.
This is just one example of how the sampling isn’t truly as random and representative as it may seem.
So: cool analysis, and it's fun to see the way they've used various tools to analyze the data. But due to inherent biases in the sample set (older employees have older laptops, notably), I think anyone looking to answer these questions should start with the simpler method of benchmarking recent commits on each laptop before spending a lot of time architecting company-wide data collection.
I totally agree with your suggestion, and we (I am the author of this post) did spot-check the performance for a few common tasks first.
We ended up collecting all this data partly to compare machine-to-machine, but also because we want historical data on developer build times and a continual measure of how the builds are performing so we can catch regressions. We quite frequently tweak the architecture of our codebase to make builds more performant when we see the build times go up.

Glad you enjoyed the post, though!
I think there's something to be said for the fact that the engineering organization grew through this exercise: experimenting with using telemetry data in new ways that, when presented to other devs in the org, likely helped them all level up and think differently about solving problems.
Sometimes these wandering paths to the solution have multiple knock-on effects in individual contributor growth that are hard to measure but are (subjectively, in my experience) valuable in moving the overall ability of the org forward.
I didn't see any analysis of network builds as an alternative to M3s. For my project, ~40 million lines, past a certain minimum threshold it doesn't matter how fast my machine is; it can't compete with the network build our infra team provides.

So sure, an M3 might make my build 30% faster than my M1 build, but the network build is 15x faster. Is it possible that, instead of giving the developers M3s, they should have invested in some kind of network build?
Network full builds might be faster, but would incremental builds be? Would developers still be able to use their favourite IDE and OS? Would developers be able to work without waiting in a queue? Would developers be able to work offline?
If you have a massive, monolithic, single-executable-producing codebase that can't be built on a developer machine, then you need network builds. But if you aren't Google, building on laptops gives developers better experience, even if it's slower.
That is a very large company if you have a singular 40-million-line codebase, maybe around 1000 engineers or more? Network builds also take significant investment, usually meaning adoption of something like Bazel plus a dedicated devex team to pull it off. Setting up build metrics to inform a hardware decision (along with the other benefits that come from it) is at most a one-month project for one engineer.
It's like telling an indie hacker to adopt a complicated kubernetes setup for his app.
> This is a great write-up and I love all the different ways they collected and analyzed data.
> [..] due to inherent biases in the sample set [..]
But that is an analysis-methods issue. This serves as a reminder that one cannot depend on AI assistants when one is not knowledgeable enough on the topic oneself. At least for the time being.
For one, as you point out, they conducted a t-test on data that are not independently sampled: multiple data points came from the same people, and there are very valid reasons to believe that different people work on tasks that are more or less compute-demanding, which confounds the data. This violates one of the fundamental assumptions of the t-test, and the code interpreter did not point it out. Instead, they could have modeled their data with what is called a "linear mixed effects model", where things like the person a laptop belongs to, and possibly other factors like seniority, go into the model as "random effects".
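To make that concrete, here is a hedged sketch of such a model in Python with statsmodels; the column names (build_seconds, chip, ram_gb, developer) are stand-ins for whatever the telemetry export actually contains, not the post's real schema.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical per-build export: one row per build event.
    df = pd.read_csv("build_telemetry.csv")

    # Fixed effects for chip generation and RAM; a random intercept per developer,
    # so repeated builds from the same person aren't treated as independent samples.
    model = smf.mixedlm(
        "build_seconds ~ C(chip) + ram_gb",
        data=df,
        groups=df["developer"],
    )
    result = model.fit()
    print(result.summary())

The chip coefficients then estimate the per-chip effect after accounting for who the builds came from.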
Nevertheless it is all quite interesting data. What I found most interesting is the RAM-related part: caching can be very powerful, and more RAM brings more benefit than people usually realise. Any laptop (or at least any MacBook) with more RAM than it strictly needs will, most of the time, have the extra RAM filled by cache.
I agree, it seems like they were trying to come up with the most expensive way to answer the question possible for some reason. And why was the finding in the end to upgrade M1 users to more expensive M3s when M2s were deemed sufficient?
If employees are purposefully isolated from the company's expenses, they'll waste money left and right.
Also, they don't care, since any incremental savings aren't shared with the employees. Misaligned incentives. In that mentality, it's best to take what you can while you can.
I would think you would want to capture what and how was built, something like:
* Repo started at this commit
* With this diff applied
* Build was run with this command
Capture that for a week. Now you have a cross section of real workloads, but you can repeat the builds on each hardware tier (and even new hardware down the road)
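A capture wrapper for that can be very small. A rough Python sketch, with the log path and invocation made up for illustration: each developer runs their usual build through it for a week, and the resulting JSONL file is what gets replayed on each hardware tier.

    import json
    import subprocess
    import sys
    import time
    from pathlib import Path

    LOG = Path.home() / ".build-capture.jsonl"    # hypothetical capture log

    def git(*args: str) -> str:
        return subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True).stdout

    def main() -> None:
        build_cmd = sys.argv[1:]                  # e.g. capture.py go build ./...
        record = {
            "commit": git("rev-parse", "HEAD").strip(),
            "diff": git("diff", "HEAD"),          # uncommitted changes on top of the commit
            "command": build_cmd,
        }
        start = time.monotonic()
        subprocess.run(build_cmd, check=True)     # run the real build, streaming its output
        record["duration_seconds"] = round(time.monotonic() - start, 2)
        with LOG.open("a") as f:
            f.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        main()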
The dev telemetry sounds well intentioned… but in 5-10 years will some new manager come in and use it as a productivity metric or work habit tracking technique, officially or unofficially?
As a scientist, I'm interested in how computer programmers work with data.
* They drew beautiful graphs!
* They used chatgpt to automate their analysis super-fast!
* ChatGPT punched out a reasonably sensible t test!
But:
* They had variation across memory and chip type, but they never thought of using a linear regression.
* They drew histograms, which are hard to compare. They could have supplemented them with simple means and error bars. (Or used cumulative distribution functions, where you can see if they overlap or one is shifted.)
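On the means-and-error-bars point: a small matplotlib sketch, with hypothetical column names, that plots one mean per chip with a rough normal-approximation 95% interval instead of overlaid histograms.

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("build_telemetry.csv")       # hypothetical per-build export

    summary = df.groupby("chip")["build_seconds"].agg(["mean", "sem"])
    ci95 = 1.96 * summary["sem"]                  # rough 95% confidence interval

    plt.errorbar(summary.index, summary["mean"], yerr=ci95, fmt="o", capsize=4)
    plt.xlabel("chip")
    plt.ylabel("mean build time (s)")
    plt.title("Mean build time per chip, 95% CI")
    plt.savefig("build_time_means.png")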
I'm glad you noted programmers; as a computer science researcher, my reaction was the same as yours. I don't think I ever used a CDF for data analysis until grad school (even with having had stats as a dual bio/cs undergrad).
It's because that's usually the data scientist's job, and most eng infra teams don't have a data scientist and don't really need one most of the time.
Most of the time they deal with data the way their tools generally present data, which correlate closely to most analytics, perf analysis and observability software suites.
Expecting the average software eng to know what a CDF is is like expecting them to know 3D graphics basics like quaternions and writing shaders.
A standard CS program will cover statistics (incl. calculus-based stats e.g. MLEs), and graphics is a very common and popular elective (e.g. covering OpenGL). I learned all of this stuff (sans shaders) in undergrad, and I went to a shitty state college. So from my perspective an entry level programmer should at least have a passing familiarity with these topics.
Does your experience truly say that the average SWE is so ignorant? If so, why do you think that is?
>Expecting the average software eng to know what a CDF is the same as expecting them to know 3d graphics basics like quaternions and writing shaders.
I did write shaders and used quaternions back in the day. I also worked on microcontrollers, did some system programming, developed mobile and desktop apps. Now I am working on a rather large microservice based app.
> ChatGPT punched out a reasonably sensible t test!
I think the distribution is decidedly non-normal here and the difference in the medians may well have also been of substantial interest -- I'd go for a Wilcoxon test here to first order... Or even some type of quantile regression. Honestly the famous Jonckheere–Terpstra test for ordered medians would be _perfect_ for this bit of pseudoanalysis -- have the hypothesis that M3 > M2 > M1 and you're good to go, right?!

(Disclaimers apply!)
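For the simple two-sample case, a hedged sketch using scipy's Wilcoxon rank-sum (Mann-Whitney U) test, again with made-up column names. Jonckheere–Terpstra isn't in scipy, so this only shows the plain two-sample version, and it still treats every build as an independent sample, so the repeated-measures caveat raised elsewhere in the thread applies here too.

    import pandas as pd
    from scipy.stats import mannwhitneyu

    df = pd.read_csv("build_telemetry.csv")       # hypothetical per-build export

    m1 = df.loc[df["chip"] == "M1", "build_seconds"]
    m3 = df.loc[df["chip"] == "M3", "build_seconds"]

    # One-sided: are M3 build times stochastically smaller (i.e. faster) than M1's?
    stat, p = mannwhitneyu(m3, m1, alternative="less")
    print(f"U = {stat:.0f}, p = {p:.4g}")
    print(f"median M1 = {m1.median():.1f}s, median M3 = {m3.median():.1f}s")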
Note that in some places they used boxplots, which offer clearer comparisons. It would have been more effective to present all the data using boxplots.
> They drew histograms, which are hard to compare.
Like you, I'd suggest empirical CDF plots for comparisons like these. Each distribution results in a curve, and the curves can be plotted together on the same graph for easy comparison. As an example, see the final plot on this page: https://ggplot2.tidyverse.org/reference/stat_ecdf.html

Most people hate nuance when reading a data report.
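The ECDF comparison suggested above is also a few lines in Python; a minimal matplotlib sketch with hypothetical column names, overlaying one empirical CDF per chip so shifts between the distributions are visible directly:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    df = pd.read_csv("build_telemetry.csv")       # hypothetical per-build export

    for chip, group in df.groupby("chip"):
        x = np.sort(group["build_seconds"].to_numpy())
        y = np.arange(1, len(x) + 1) / len(x)     # empirical CDF: fraction of builds <= x
        plt.step(x, y, where="post", label=chip)

    plt.xlabel("build time (s)")
    plt.ylabel("fraction of builds")
    plt.legend()
    plt.savefig("build_time_ecdf.png")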
I think you might want to add the caveat "young computer programmers." Some of us grew up in a time where we had to learn basic statistics and visualization to understand profiling at the "bare metal" level and carried that on throughout our careers.
> cumulative distribution functions, where you can see if they overlap or one is shifted
Why would this be preferred over a PDF? I've rarely seen CDF plots since high school, so I would have to convert the CDF into a PDF in my head to check whether the two distributions overlap or are shifted. CDFs are not a native representation for most people.
I can give a real example. At work we were testing pulse shaping amplifiers for Geiger Muller tubes. They take a pulse in, shape it to get a pulse with a height proportional to the charge collected, and output a histogram of the frequency of pulse heights, with each bin representing how many pulses have a given amount of charge.
Ideally, if all components are the same, there is no jitter, and if you feed in a test signal from a generator with exactly the same area per pulse, you should see a histogram where every count is in a single bin.
In real life, components have tolerances, and readouts have jitter, so the counts spread out and you might see, with the same input, one device with, say, 100 counts in bin 60, while a comparably performing device might have 33 each in bins 58, 59, and 60.
This can be hard to compare visually as a PDF, but if you compare CDF's, you see S-curves with rising edges that only differ slightly in slope and position, making the test more intuitive.
If one line is to the right of the other everywhere, then the distribution is bigger everywhere. (“First order stochastic dominance” if you want to sound fancy.) I agree that CDFs are hard to interpret, but that is partly due to unfamiliarity.
A word of warning from personal experience:

I am part of a medium-sized software company (2k employees). A few years ago, we wanted to improve dev productivity. Instead of going with new laptops, we decided to explore offloading the dev stack to AWS boxes.

This turned out to be a multi-year project with a whole team of devs (~4) working on it full-time.

In hindsight, the tradeoff wasn't worth it. It's still way too difficult to replicate a fully-local dev experience with one that's running in the cloud.

So yeah, upgrade your laptops instead.
Code sits on our laptops but live syncs to the remote services without requiring a Docker build or K8s deploy. It really does feel like local.

In particular it lets us do away with the commit-push-pray cycle, because we can run integ tests and beyond as we code, as opposed to waiting for CI.

We use Garden (https://docs.garden.io) for this. (And yes, I am affiliated :)).

But whether you use Garden or not, leveraging the power of the cloud for "inner loop" dev can be pretty amazing with the right tooling.

I wrote a bit more about our experience here: https://thenewstack.io/one-year-of-remote-kubernetes-develop...

Kind of interesting to think that CI is significantly slower in practice and both systems need to be maintained. Is it just the overhead of pushing through git, or are there other reasons as well?
I tried Garden briefly but didn't like it for some reason. DevSpace was simpler to set up and works quite reliably. The sync feature where they automatically inject something into the pod works really well.
This might have to do with scale. At my employer (~7k employees) we started down this path a few years ago as well, and while it has taken longer for remote to become better than local, it now definitively is, and it has unlocked all kinds of other stuff that wasn't possible with the local-only version. One example: working across multiple branches by switching machines rather than switching files locally means way lower latency when switching between tasks.
One thing I've never understood (and admittedly have not thoroughly researched) is how a remote workspace jibes with front-end development. My local tooling is all terminal-based, but after ssh'ing into the remote box to conduct some "local" development, how do I see those changes in a browser? Is the local server just exposed on an ip:port?
Are you using a public cloud to host the dev boxen? Is compilation actually faster than locally, assuming that your PCs have been replaced with lower-specced versions since they don't do any heavy lifting anymore?
I work for a not-really-tech company (and I'm not a full-time dev either), so I've been issued a crappy "ultra-portable" laptop with an ultra-low-voltage CPU. I've looked into offloading my dev work to an AWS instance, but was quite surprised that it wasn't any faster than doing things locally for things like Rust compiles.
Haha, as I read more words of your comment, I got more sure that we worked at the same place. Totally agree, remote devboxes are really great these days!
However, I also feel like our setup was well suited to remote-first dev anyway (eg. syncing of auto-generated files being a pain for local dev).

When you need to have 200+ parts running to do anything, it can be hard to work in a single piece that touches a couple others.

With servers that have upwards of 128+ cores and 256+ threads, my opinion is swinging back in favor of monoliths for most software.
My company piles so much ill-considered Linux antivirus and other crap in cloud developer boxes that even on a huge instance type, the builds are ten or more times slower than a laptop, and hundreds of times slower than a real dev box with a Threadripper or similar. It's just a pure waste of money and everyone's time.
It turns out that hooking every system call with vendor crapware is bad for a unix-style toolchain that execs a million subprocesses.

At large tech companies like Google, Meta, etc. the dev environment is entirely in the cloud for the vast majority of SWEs. This is a much nicer dev experience than anything local.
My dev box died (that I used for remote work), and instead of buying something new immediately, I moved my setup to a Hetzner cloud vps. Took around 2 days. Stuff like setting up termux on my tablet and the cli environment on the vps was 90 percent of that. The plus side was that I then spent the remaining summer working outside in the terrace and in the park. Was awesome. I was able to do it because practically all of my tools are command line based (vim, etc).
How much does this cost you? I've been dealing with a huge workstation-server thing for years in order to get this flexibility and while the performance/cost is amazing, reliability and maintenance has been a pain. I've been thinking about buying some cloud compute but an equivalent workstation ends up being crazy expensive (>$100/mo).
I'd be careful with Hetzner. I was doing nothing malicious and signed up. I had to submit a passport, which was a valid US one. It got my account cancelled. I asked why and they said they couldn't say for security reasons. They seem like an awesome service and I don't want to knock them; I simply asked if I could resubmit something to remediate and they said no. I don't blame them, just be careful. I'm guessing my passport and face might have triggered some validation issues? I dunno.
It of course strongly depends on what your stack is. My current job provides a full remote dev server for our backend and it's the best experience I've seen in a long time. In particular, having a common DB is surprisingly uneventful (nobody's dropping tables here and there) while helping a lot.
We have interns coming in and fully ready within an hour or two of setup. Same way changing local machines is a breeze with very little downtime.
Isn't the point of a dev environment precisely that the intern can drop tables? Idk, I've never had a shared database not turn to mush over a long enough period, and think investing the effort to build data scripts to rebuild dev dbs from scratch has always been the right call.
> We have interns coming in and fully ready within an hour or two of setup. Same way changing local machines is a breeze with very little downtime.
This sounds like the result of a company investing in tooling, rather than something specific to a remote dev env. Our local dev env takes 3 commands and less than 3 hours to go from a new laptop to a fully working dev env.
If I understand correctly, they're not talking about remote desktops. Rather, the editor is local and responds normally, while the heavy lifting of compilation is done remotely. I've dabbled in this myself, and it's nice enough.
Thanks for the insight. It maybe depends on each team too.
While my team (platform & infra) much prefers the remote devbox, the development teams do not.

It could be specific to my org, because we have way too many restrictions on the local dev machine (eg: no Linux on laptops, but it's OK on servers, and my team much prefers Linux over a crippled Windows laptop).
I suspect things like GitHub's Codespaces offering will be more and more popular as time goes on for this kind of thing. Did you guys try out some of the AWS Cloud9 or other 'canned' dev env offerings?
My experience with GitHub Codespaces is mostly limited to when I forgot my laptop and had to work from my iPad. It was a horrible experience, mostly because Codespaces didn’t support touch or Safari very well and I also couldn’t use IntelliJ which I’m more familiar with.
Can’t really say anything for performance, but I don’t think it’ll beat my laptop unless maven can magically take decent advantage of 32 cores (which I unfortunately know it can’t).
One thing that jumps out at me is the assumption that compile time implies wasted time. The linked Martin Fowler article provides justification for this, saying that longer feedback loops provide an opportunity to get distracted or leave a flow state while ex. checking email or getting coffee. The thing is, you don't have to go work on a completely unrelated task. The code is still in front of you and you can still be thinking about it, realizing there's yet another corner case you need to write a test for. Maybe you're not getting instant gratification, but surely a 2-minute compile time doesn't imply 2 whole minutes of wasted time.
- M2 Pro is nice, but the improvement over the 10 core (8 perf cores) M1 Pro is not that large (136 vs 120 s in the Xcode benchmark: https://github.com/devMEremenko/XcodeBenchmark)

- M3 Pro is nerfed (only 6 perf cores) to better distinguish and sell M3 Max, basically on par with M2 Pro
So, in the end, I got a slightly used 10 core M1 Pro and am very happy, having spent less than half of what the base M3 Pro would cost, and got 85% of its power (and also, considering that you generally need to have at least 33 to 50 % faster CPU to even notice the difference :)).
The M3 Pro being nerfed has been parroted on the Internet since the announcement. Practically it’s a great choice. It’s much more efficient than the M2 Pro at slightly better performance. That’s what I am looking for in a laptop. I don’t really have a usecase for the memory bandwidth…
The M3 Pro and Max get virtually identical results in battery tests, e.g. https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-bat.... The Pro may be a perfectly fine machine, but Apple didn't remove cores to increase battery life; they did it to lower costs and upsell the Max.
AI is the main use case for memory bandwidth that I know of. Local LLMs are memory-bandwidth limited when running inference, so once you fall into that trap you end up wanting the 400 GB/s max memory bandwidth of the M1/M2/M3 Max, paired with lots and lots of RAM. Apple ties memory size and bandwidth upgrades to core counts a lot more in the M3 generation, which makes the M3 line-up far more expensive than the M2 line-up to reach comparable LLM performance. Them touting AI as a use case for the M3 line-up in the keynote was decidedly odd, as this generation is a step back when it comes to price vs performance.
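The bandwidth limit is easy to see with back-of-the-envelope arithmetic: each generated token has to stream roughly the whole set of weights through memory once, so bandwidth divided by model size gives an upper bound on tokens per second. The numbers below are illustrative assumptions, not benchmarks.

    params_billion = 70        # e.g. a 70B-parameter model (assumption)
    bytes_per_param = 0.5      # ~4-bit quantisation
    bandwidth_gb_s = 400       # M-series Max memory bandwidth

    model_gb = params_billion * bytes_per_param
    ceiling_tokens_per_s = bandwidth_gb_s / model_gb   # ignores compute, caches, batching

    print(f"model ~{model_gb:.0f} GB, bandwidth-bound ceiling ~{ceiling_tokens_per_s:.1f} tokens/s")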
I picked up an M3Pro/11/14/36GB/1TB to 'test' over the long holiday return period to see if I need an M3 Max. For my workflow (similar to blog post) - I don't! I'm very happy with this machine.
Die shots show the CPU cores take up so little space compared to GPUs on both the Pro and Max... I wonder why.
My experience was similar: In real world compile times, the M1 Pro still hangs quite closely to the current laptop M2 and M3 models. Nothing as significant as the differences in this article.
It could depend on the language or project, but in head-to-head benchmarks of identical compile commands I didn't see any differences this big.
That's interesting you saw less of an improvement in the M2 than we saw in this article.
I guess not that surprising given the different compilation toolchains though, especially as even with the Go toolchain you can see how specific specs lend themselves to different parts of the build process (such as the additional memory helping linker performance).
You're not the only one to comment that the M3 is weirdly capped for performance. Hopefully not something they'll continue into the M4+ models.
Yep, there appears to be no reason for getting M3 Pro instead of M2 Pro, but my guess is that after this (unfortunate) adjustment, they got the separation they wanted (a clear hierarchy of Max > Pro > base chip for both CPU and GPU power), and can then improve all three chips by a similar amount in the future generations.
I also made this calculation recently and ended up getting an M1 Pro with maxed out memory and disk. It was a solid deal and it is an amazing computer.
I love my M1 MacBook Air for iOS development. One thing I'd like to have from the Pro line is the screen, and specifically the PPI part. While 120Hz is a nice thing to have, it won't happen on Air laptops.
I am an ex-core contributor to Chromium and Node.js and a current core contributor to gRPC Core/C++.
I am never bothered with build times. There is "interactive build" (incremental builds I use to rerun related unit tests as I work on code) and non-interactive build (one I launch and go get coffee/read email). I have never seen hardware refresh toggle non-interactive into interactive.
My personal hardware (that I use now and then to do some quick fix/code review) is a 5+ year old Intel i7 with 16 GB of memory (had to add 16 GB when I realized linking Node.js in WSL requires more memory).
My work laptop is Intel MacBook Pro with a touch bar. I do not think it has any impact on my productivity. What matters is the screen size and quality (e.g. resolution, contrast and sharpness) and storage speed. Build system (e.g. speed of incremental builds and support for distributed builds) has more impact than any CPU advances. I use Bazel for my personal projects.
Somehow programmers have come to accept that a minuscule change in a single function, one that only results in a few bytes changing in the binary, takes forever to compile and link. Compilation and linking should be basically instantaneous. So fast that you don't even realize there is a compilation step at all.
Sure, release builds with whole program optimization and other fancy compiler techniques can take longer. That's fine. But the regular compile/debug/test loop can still be instant. For legacy reasons compilation in systems languages is unbelievably slow but it doesn't have to be this way.
This is the reason why I often use the tcc compiler for my edit/compile/hot-reload cycle; it is about 8x faster than gcc with -O0 and 20x faster than gcc with -O2.

With tcc, the initial compilation of hostapd takes about 0.7 seconds and incremental builds are roughly 50 milliseconds.

The only problem is that tcc's diagnostics aren't the best, and sometimes there are mild compatibility issues (usually it is enough to tweak CFLAGS or add some macro definition).
I mean yeah I've come to accept it because I don't know any different. If you can share some examples of large-scale projects that you can compile to test locally near-instantly - or how we might change existing projects/languages to allow for this - then you will have my attention instead of skepticism.
I am firmly in test-driven development camp. My test cases build and run interactively. I rarely need to do a full build. CI will make sure I didn’t break anything unexpected.
I too come from Blaze and tried to use Bazel for my personal project, which involves a dockerized backend + frontend. The build rules got weird and niche real quick, and I was spending lots of time working with the BUILD files, making me question the value against plain old Makefiles. This was 3 years ago; maybe the public ecosystem is better now.
Aren’t M series screen and storage speed significantly superior to your Intel MBP? I transitioned from an Intel MBP to M1 for work and the screen was significantly superior (not sure about storage speed, our builds are all on a remote dev machine that is stacked).
When I worked at Chromium there were two major mitigations:
1. Debug compilation was split into shared libraries so only a couple of them have to be rebuilt in your regular dev workflow.
2. They had some magical distributed build that "just worked" for me. I never had to dive into the details.
I was working on DevTools so in many cases my changes would touch both browser and renderer. Unit testing was helpful.
Bazel is significantly faster on the M1 compared to the i7, even when it doesn't try to recompile the protobuf compiler code, which it still attempts to do regularly.
To people who are thinking about using AI for data analyses like the one described in the article:
- I think it is much easier to just load the data into R, Stata etc and interrogate the data that way. The commands to do that will be shorter and more precise and most importantly more reproducible.
- the most difficult task in data analysis is understanding the data and the mechanisms that have generated it. For that you will need a causal model of the problem domain. Not sure that AI is capable of building useful causal models unless they were somehow first trained using other data from the domain.
- it is impossible to reasonably interpret the data without reference to that model. I wonder if current AI models are capable of doing that, e.g., can they detect confounding or oversized influence of outliers or interesting effect modifiers.
Perhaps someone who knows more than I do about the state of current technology can provide a better assessment of where we are in this effort.

Except when I did it, it was Python and pandas. You can ask it to show you the code it used to do its analysis.

So you can load the data into R/Python and google "how do I do xyzzzy" and write the code yourself, or use ChatGPT.
so ChatGPT can build a causal model for a problem domain? How does it communicate that (using a DAG?)? It would be important for the data users to understand that model.
> All developers work with a fully fledged incident.io environment locally on their laptops: it allows for a <30s feedback loop between changing code and running it, which is a key factor in how productively you can work with our codebase.
This to me is the biggest accomplishment.
I've never worked at a company (besides brief time helping out with some startups) where I have been able to run a dev/local instance of the whole company on a single machine.
There's always this thing, or that, or the other that is not accessible. There's always a gotcha.

People who haven't lived in that world just cannot understand how much better it is, and will come up with all kinds of cope.
I'd never been unable to run the damn app locally until my latest job. Drives me bonkers. I don't understand how people aren't more upset at this atrocious devex. Damn college kids don't know what they're missing.
I can’t imagine not having this. We use k3s to run everything locally and it works great. But we (un)fortunately added snowflake in the last year — it solves some very real problems for us, but it’s also a pain to iterate on that stuff.
We used to have that, but it's hard to support as you scale. The level of effort is somewhat quadratic to company size: linear in the number of services you support and in the number of engineers you have to support. Also divergent use cases come up that don't quite fit, and suddenly the infra team is the bottleneck to feature delivery, and people just start doing their own thing. Once that Pandora's Box is opened, it's essentially impossible to claw your way back.
I've heard of largeish companies that still manage to do this well, but I'd love to learn how.
That said, yeah I agree this is the biggest accomplishment. Getting dev cycles down from hours or days to minutes is more important than getting them down from minutes to 25% fewer minutes.

https://news.ycombinator.com/item?id=38816135
Like, if you have Logic Apps and Azure data pipelines, how do you create, and more importantly keep current, the local development equivalents of those?

I'm not saying that if you are YouTube all the videos on YouTube must fit on a developer's local machine, but it would be nice if you could run the whole instance locally or, if not, at least be able to reproduce the whole setup in a different environment without six months' worth of back-and-forth emails.
I’m currently doing my best to make this possible with an app I’m building. I had to convince the CEO the M2 Max would come in handy for this (we run object detection models and stable diffusion). So far it’s working out!
Lots of stuff in this, from profiling Go compilations to building a hot-reloader to using AI to analyse the build dataset.

We concluded that it was worth upgrading the M1s to an M3 Pro (the Max didn't make much of a difference in our tests), but the M2s are pretty close to the M3s, so not (for us) worth upgrading.
Happy to answer any questions if people have them.
Thanks for the detailed analysis.
I’m wondering if you factored in the cost of engineering time invested in this analysis, and how that affects the payback time (if at all).
Thanks! Author here: this probably took about 2.5 days to put together, all in.
The first day was spent hacking together a new hot-reloader, but this also fixed a lot of issues we'd had with the previous loader, such as restarting into stale code, which was really harming people's productivity. That was well worth even several days of effort really!
The second day I was just messing around with OpenAI to figure out how I’d do this analysis. We’re right now building an AI assistant for our actual product so you can ask it “how many times did I get paged last year? How many were out-of-hours? Is my incident workload increasing?” Etc and I wanted an excuse to learn the tech so I could better understand that feature. So for me, well worth investing a day to learn.
Then the article itself took about 4hrs to write up. That’s worth it for us given exposure for our brand and the way it benefits us for hiring/etc.
We trust the team to make good use of their time, and allowing people to do this type of work when they think it's valuable is just an example of that. Assuming I have a £1k/day rate (I do not), we're still only in for £2.5k, so less than a single MacBook to turn this around.
I'm curious how you came to the conclusion the Max SKUs aren't much faster, the distributions in the charts make them look faster but the text below just says they look the same.
Or maybe the tasks finished so fast that it didn't make a difference in real-world usage?

Logistical question: did management move some deliverables out of the way to give you room to do this? Or was it extracurricular?