Failing a take-home is an entirely different thing. It's a huge loss of time and mental energy.
I've only done three of those in my career, and only because the projects sounded interesting. One of the three resulted in a job offer which, I can now confidently say in hindsight, was the worst job of my career (...so far!).
I'm now leaning towards just filtering out companies that do take-homes, because it signals to me that they don't care about their candidates' time, and how a company treats its candidates is usually a good indicator of how it treats its employees.
With a take-home I can demonstrate how I would perform at work. I can sit on it, think things over in my head, come up with a plan of attack, and execute it. I can demonstrate how I think about problems, and my own value, more clearly. To me, using a take-home as a test indicates that a company cares a bit more about its hiring pipeline and is being careful not to put candidates under arbitrary pressure.
So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
Claude is not cheap, so why is it far and away the most popular if it's not in the top 10 for performance?
Qwen3 235B ranks highest among open models on these benchmarks, but I have never met anyone who prefers its output to DeepSeek R1's. It's extremely wordy and often gets caught in thought loops.
My interpretation is that the models at the top of ArtificialAnalysis are the ones focusing the most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to focus on building their own evaluation systems.
But Grok is not very good compared to the Anthropic, OpenAI, or Google models, despite ranking so highly in benchmarks.
```
def my_func(a: int, b: int):
    print(a + b)

# Both of these calls are valid
my_func(1, 2)
my_func(a=1, b=2)
```
```
from collections import namedtuple

Point = namedtuple('Point', 'x y')
pt1 = Point(1.0, 5.0)
```
You can then access the x and y coordinates either by index (pt1[0], pt1[1]) or by field name (pt1.x, pt1.y). This can be a really handy way to help people understand your code, since what you're accessing becomes a lot more explicit.[0]
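For instance, putting the pieces together (output values shown in comments):

```
from collections import namedtuple

Point = namedtuple('Point', 'x y')
pt1 = Point(1.0, 5.0)

# Positional access still works like a plain tuple...
print(pt1[0], pt1[1])  # 1.0 5.0

# ...but named access makes the intent explicit at the call site
print(pt1.x, pt1.y)    # 1.0 5.0
```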
[0] https://stackoverflow.com/questions/2970608/what-are-named-t...
Messing with optimizers is one of the ways to enter hyperparameter hell: it's like legacy code but on steroids, because changing it breaks your training code only stochastically. Much better to stop worrying and love AdamW.
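For reference, a minimal PyTorch sketch of the "just use AdamW" default; the toy model, batch, and learning rate are placeholders, not a recommendation:

```
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for your real model

# AdamW with stock hyperparameters; in practice the learning rate
# is usually the only knob worth touching.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One standard training step
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```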
For example, I've encountered the belief that merely recording something at ultra-high temporal resolution gives you "millions of datapoints". This then (seemingly) carries through into all the downstream statistics and hypothesis testing.
In reality, the dependence on the exact setup, the day it was performed, the person doing it, etc. means the n for the day is probably closer to 1. So to ensure replicability you'd have to at least repeat it on separate days, with separately prepared samples. Otherwise, how can you rule out that your ultra-finicky sample just happened to vibe with that day's temperature and humidity?
But statistics courses don't teach you what exactly "n" means, probably because a hundred years ago it was much more literal: n = 100 meant you counted 100 mice, 100 peas, or 100 surveys.
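A quick simulation of that pseudoreplication trap (a sketch with made-up numbers: each "day" contributes a random offset, with fast within-day noise on top):

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two conditions with NO true difference, each measured on one day.
# Each day gets its own random offset (temperature, humidity, the
# person running it...), then a million high-rate samples on top.
day_effect_a = rng.normal(0, 1)
day_effect_b = rng.normal(0, 1)
a = day_effect_a + rng.normal(0, 0.1, size=1_000_000)
b = day_effect_b + rng.normal(0, 0.1, size=1_000_000)

# Treating every sample as independent yields an absurdly small p-value
# for what is really just day-to-day variation: the effective n here is
# 1 per condition, not a million.
t, p = stats.ttest_ind(a, b)
print(t, p)
```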