I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
Well, it is a specialized problem. If you've never worked on anything similar before, it is going to take time. You don't even need to interview at selective billion-dollar companies like Anthropic to encounter these types of problems - after college I interviewed at various electronics/hardware companies where you'd get asked to optimize low-level code, which would have looked quite foreign if you had never actually worked on such problems before.
If you ask an EE to debug React state management code without prior exposure, they won't do too well either. But on the other hand, they can easily pick up most of it after a week-long crash course, while training a performance engineer who can optimize code for a specific architecture would take months.
>Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems
I'll take any interviews at this point in time.
But yes, every domain has its jargon. I work tangentially to this and quickly understood this as a GPGPU problem. A relatively elementary one if you studied this space, though a time limit of 2 hours seems overly restrictive if you aren't actively studying this stuff.
After a quick look, this can be seen as a low-level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that, you unfortunately have to give them such a convoluted task and emulate the relevant parts of the hardware. (In reality this is probably more like a TPU since it has scalar pipelines, but the optimization methods are not that different.)
The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.
However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.
If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language, written in Python. Thus you will be optimizing the program built in-memory by KernelBuilder.build_kernel, which is described only as:
Like reference_kernel2 but building actual instructions.
Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs", but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.
What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off; since then the concept has been largely dead, though I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA, so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which ones can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple pieces of data simultaneously.
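To make the VLIW idea concrete, here is a tiny sketch of the mental model in throwaway Python. This is not the simulator's real instruction encoding (the names and fields are made up); it just shows what "the programmer packs the parallelism ahead of time" means:

    # Hypothetical illustration only -- not the simulator's real instruction
    # encoding. In a VLIW machine the programmer (or compiler) packs independent
    # operations into one bundle; everything in the bundle issues in the same cycle.
    bundle = (
        ("alu",  "xor", "s0", "s1", "s2"),   # scalar ALU op
        ("valu", "add", "v0", "v1", "v2"),   # vector ALU op, all lanes at once
        ("load", "v3", "mem", 64),           # a load running alongside the ALUs
    )

    program = [bundle]  # a program is just a list of such bundles

    # The catch: ops inside a bundle must not depend on each other's results.
    # The hardware does no reordering, so a poor schedule simply wastes slots.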
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
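And a minimal sketch of what "broadcasting a scalar" amounts to, treating scratch space as a plain Python list of 32-bit words (again just an illustration, not the simulator's API):

    # Broadcasting a scalar: replicate one value across all lanes / scratch slots
    # so a later vector op can consume it. Plain-Python sketch, 8 lanes of 32 bits.
    LANES = 8

    def broadcast(value, scratch, base):
        for lane in range(LANES):
            scratch[base + lane] = value & 0xFFFFFFFF

    scratch = [0] * 64
    broadcast(0xFF, scratch, base=0)
    # scratch[0:8] is now [0xFF] * 8; viewed as one wide register of subwords,
    # that is the repeated 0xFF pattern described above.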
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW-like, so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating, because it says they only check the final state of the memory², so it seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions needed to compute the final state, but again, it's not clear what they're actually measuring, because if you know the final state you can cheat, and there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.
¹https://github.com/anthropics/original_performance_takehome/...
²https://github.com/anthropics/original_performance_takehome/...
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator
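To illustrate the loophole described above (only the final memory being checked): a degenerate "kernel" could precompute the answers in ordinary Python and emit nothing but constant stores. Purely a hypothetical sketch with made-up names; whether the frozen test actually allows this is exactly the open question:

    # Hypothetical sketch of the "just store the final state" cheat.
    # toy_reference stands in for the Python reference kernel, and "store_const"
    # is an invented mnemonic, not the real ISA.
    def toy_reference(n):
        return [(i * 2654435761) & 0xFFFFFFFF for i in range(n)]

    expected = toy_reference(16)
    cheat_program = [("store_const", addr, val) for addr, val in enumerate(expected)]
    # If the checker only diffs final memory against the expected values, this
    # "kernel" passes while doing none of the actual work.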
It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.
Yours is a good mentality to have because it creates the emotional drive to learn more, so don't lose that. That being said, this isn't really that complicated. It's just a matter of taking enough time to look at the code and understand how it's structured. I feel like the thing that differentiates developer skill is pretty much being able to do that: building the model of the program in your head.
For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".
For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development.
Disagree. Nobody has a monopoly on what metric makes someone good. I don't understand all this leetcode optimization. Actually I do understand it, but it's a game that will attract game optimizers.
Yes, this applies to a simulated, imaginary CPU with an artificial problem. Except that the job being tested here is exactly the core of what a performance engineer will do at Anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
Also, leetcode does not really provide insight into one's ability to design business solutions, whether that's system design, some small feature implementation, or communication skills within a team.
It's just optimizers jerking each other off over some cryptic problems that 99.999999999% of developers will never see in real life.
Maybe it would've been useful like 30 years ago, but all commonly used languages have these fancy algorithms baked into their stdlib, so why would I ever have to implement them myself?
I suspect this was released by Anthropic as a DDoS attack on other AI companies. I prompted 'how do we solve this challenge?' into Gemini CLI in a cloned repo and it's been running non-stop for 20 minutes :)
Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
I saw this too. Sometimes it "thinks" inside of the actual output, and it's much more likely to end up in the "I am ready to answer" loop while it is already doing that.
/model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact you should be under 4,900 doing that with very little effort (I tried doing this by hand yesterday).
The performance killer is the "random" access reads of the tree node data, which the scalar implementation hides, together with the lack of load bandwidth; to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.
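Rough toy model of why those reads hurt once you vectorize: each lane wants a different, data-dependent node, so the loads become per-lane gathers you have to schedule explicitly (plain Python stand-in, not the simulator's ISA):

    # Toy model of the data-dependent tree reads. Each lane's next node index
    # depends on its own hash value, so a vectorized kernel needs one load per
    # lane unless identical indices are deduplicated.
    LANES = 8

    def next_indices(cur, hashes):
        # go left or right depending on hash parity -- stand-in for the real rule
        return [2 * i + 1 + (h & 1) for i, h in zip(cur, hashes)]

    def gather(tree, indices):
        # on a machine like this, LANES separate scalar loads: the serialized
        # load traffic is the bandwidth wall described above
        return [tree[i] for i in indices]

    tree = list(range(1024))
    lanes = [0] * LANES
    values = gather(tree, next_indices(lanes, [3, 8, 5, 1, 9, 2, 7, 4]))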
Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.
Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.
I tried GLM-4.7 running locally on a beefy GPU server, in about 3 minutes it got to 25846 cycles, but then struggled in circles for about 90 minutes without making any meaningful progress, making the same mistakes repeatedly and misdiagnosing the cause most of the time. It seems to understand what needs to happen to reach the goal, but keeps failing on the implementation side. It seemed to understand that to beat the target an entirely new approach would be required (it kept leaning towards a wavefront design), but wasn't seeing the solution due to the very limited ISA.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.
You would hope that if you manage to beat their engineers' best optimisations at launch, then you would leapfrog a certain number of the initial stages.
Then again, this may just be a way to get free ideas at optimising their product from outside the box.
It would take something like one week full time to work on this. It's not something you can do if you have a full-time job and apply to several other companies. I find it unreasonable to ask a candidate to spend that much time for an uncertain result.
It's true that being ready for leetcode takes practice, but at least it's standard so you can re-use the skills to other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
As long as there are qualified candidates willing to do unreasonable tasks for the chance to work at a company, there's not much incentive for the company to change their system. Those people will also probably work unreasonably hard and make unreasonable sacrifices for the company.
> It's not something you can do if you have a full-time job
> I find it unreasonable to ask a candidate to spend that much time
And the same for some reason does not apply to leetcode-style interviews?
> It would take something like one week full time to work on this
I am not sure if this is satire or what? You need months of continuous preparation to be ready for a leetcode-style interview.
> Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
No, it is not. This is specifically the type of job you would be doing tomorrow on the Anthropic team if hired. And they are specifically hiring people who are already good enough at that very task. The same cannot be said for leetcode; it's not even remotely comparable.
This is a really fun problem! I suggest anyone who likes optimization in a very broad sense to try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week-worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!
I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimizing algorithms down to the machine-code level (and additionally, cheating like hell to make you believe we ran the algorithm for real :-)).
But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes and it sounds like random forest prediction. Does anyone know what the code does?
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take-home assignment though I would have failed, as I would have probably taken 2 hours just to sketch out ideas and notes on my tablet while reading the code, before even changing anything.
Unless I misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform the best returned solution. The best candidate could've taken anywhere from 6 hours to 2 days to achieve this result.
Their Readme.md is weirdly obsessed with "2 hours":
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. It could just be LLM-generated with a talking point of "2 hours"; models can fall in love with that kind of stuff. "After many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
No, the 2 hours is their time limit for candidates. The thing is that you are allowed to use any non-human help for their take-homes (open book), so if AI can solve it in under 2 hours, it's not very good at assessing the human.
I'm at 1137 after one hour with Opus now...
Pipelined vectorized hash, speculation, static code for each stage, epilogues and prologues for each stage-to-stage...
I think I'm going to get sub-900, since I just realized I can compute in parallel whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4, with less delay...
I think I can hit #1 (current #1 is 1000). Sub-900 is not possible though.
Let me put down my thought process: you have to start by designing a 6-slot x 8-lane vector pipeline doing 48 hashes in parallel, which needs at least 10 steps (if you convert three stages to multiply-adds and do parallel XORs for the other three). The problem with 10-cycle hashing is that you need to cram 96 scalar XORs alongside your vector pipeline, so that will use all 12 ALUs for 8 of those cycles, leaving you only 24 more scalar ops per hash cycle, which isn't enough for the 48 tree value XORs.
So you must use at least 11 steps per hash, with the 96 XORs (including the tree value XOR) done in the scalar ALUs using 8 steps, giving 3 x 12 ALU ops per hash cycle. You need 12 more ops per hash to do odd/even, so you must use 12 stages and just do all of the hash ops in the VALU: 4 cycles of 12 ALUs doing modulo, 8 cycles x 12 ALUs free.
With 12 steps and 48 hashes in parallel, your absolute minimum would be 4096/48 x 12 = 1,024 cycles. Stage 10 can be optimized (you don't need the odd/even modulo cycle), and using some of those extra scalar cycles to pre-XOR the constant can save you ~10 cycles. 1,024 is gonna be real hard, but I can imagine shenanigans to get it down to 1,014; sub-1,000 is possible by throwing more XORs at the scalar ALUs.
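For anyone following along, the back-of-envelope math above written out as plain arithmetic (using the numbers quoted in this comment, nothing measured from the simulator):

    # Arithmetic behind the estimate above, using the numbers quoted in the comment.
    hashes_total   = 4096            # hashes the kernel must produce
    parallel_hash  = 6 * 8           # 6 vector slots x 8 lanes = 48 hashes in flight
    steps_per_hash = 12              # pipeline stages settled on above

    lower_bound = hashes_total / parallel_hash * steps_per_hash
    print(lower_bound)               # 4096 / 48 * 12 = 1024.0 cycles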
Other tricks: take advantage of index collisions, optimize rounds 0 and 11, do speculative pre-loading, and use the early branch prediction (which I am now doing by looking at the bits output at stage 3).
It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.
Our ability to process and master new topics is part of the job.
I'm sure you've done that countless times.
It's not about you being average, just a different knowledge set.
But this is good. Staying humble makes you hungrier for learning.
Always room to learn in software :)
the hot take is, there are other games.
Seems like a capacity issue, because it works a lot better late at night.
I don't see the same with the claude models in antigravity.
That would be impressive.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
https://github.com/voratiq/voratiq
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
BROADCAST LOAD SCHEDULE
======================================================================
Round | Unique | Load Strategy
------|--------|------------------------------------------
Total loads with grouping: 839
Total loads naive: 4096
Load reduction: 4.9x
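For context on those numbers, the idea behind "load grouping" is just deduplicating node loads per round and broadcasting the shared ones. A toy sketch of the counting (random indices as a stand-in, so it won't reproduce the 839 figure):

    # Toy sketch of the load-grouping idea: per round, many lanes land on the
    # same tree node, so load each unique node once and broadcast it instead of
    # issuing one load per lane. Random indices here, so the exact numbers differ.
    from collections import Counter
    import random

    random.seed(0)
    LANES, ROUNDS = 48, 16

    naive = grouped = 0
    for rnd in range(ROUNDS):
        indices = [random.randrange(2 ** min(rnd, 10)) for _ in range(LANES)]
        naive += len(indices)                # one load per lane
        grouped += len(Counter(indices))     # one load per unique node

    print(f"naive={naive} grouped={grouped} reduction={naive / grouped:.1f}x")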