ImageXav (u/ImageXav)

ImageXav commented on Everything is correlated (2014–23) gwern.net/everything... · Posted by u/gmays

There’s a common misconception that high throughput methods = large n.

For example, I’ve encountered the belief that just by recording something at ultra high temporal resolution gives you “millions of datapoints”. This then has all sorts of effects on the breakdown of statistics and hypothesis testing (seemingly).

In reality, the replicability of the entire setup, the day it was performed, the person doing it, etc. means the n for the day is probably closer to 1. So to ensure replicability you’d have to at least do it on separate days, with separately prepared samples. Otherwise, how can you eliminate the chance that your ultra finicky sample just happened to vibe with that day’s temperature and humidity?

But they don’t teach you in statistics what exactly “n” means, probably because a hundred years ago it was much more literal in nature. 100 samples is because I counted 100 mice, 100 peas, or 100 surveys.

ImageXav · 6 days ago

This is an interesting point. I've been trying to think about something similar recently but don't have much of an idea how to proceed. I'm gathering periodic time series data and am wondering how to factor in the frequency of my sampling for the statistical tests. I'm not sure how to assess the difference between 50Hz and 100Hz on the outcome, given that my periods are significantly longer. Would you have an idea of how to proceed? The person I'm working with currently just bins everything in hour long buckets and uses the mean for comparison between time series but this seems flawed to me.

ImageXav commented on An engineer's perspective on hiring jyn.dev/an-engineers-pers... · Posted by u/pabs3

shahbaby · 19 days ago

As much as I dislike leetcode style interviews, if I fail one of those, I learn what I can and move on.

Failing a take-home is an entirely different thing. It's a huge loss in time and mental energy.

I've only done 3 of those in my career and only because the projects sounded interesting. 1 of those 3 resulted in a job offer which I can now confidently say in hindsight was the worst job in my career (...so far!).

I'm now leaning towards just filtering out companies that do take-homes because it signals to me that they don't care about their candidate's time and how a company treats its candidates is usually a good indicator of how they treat their employees.

ImageXav · 18 days ago

I've had the complete opposite experience, and feel the complete opposite way. What is there to learn from failing a leetcode? It feels like luck of the draw - I didn't study that specific problem type and so failed. Also, there is an up front cost of several months to cover and study a wide array of leetcode problems.

With a take home I can demonstrate how I would perform at work. I can sit on it, think things over in my head, come up with an attack plan and execute it. I can demonstrate how I think about problems and my own value more clearly. Using a take home as a test is indicative to me that a company cares a bit more about its hiring pipeline and is being careful not to put candidates under arbitrary pressures.

ImageXav commented on Benchmarking GPT-5 on 400 real-world code reviews qodo.ai/blog/benchmarking... · Posted by u/marsh_mellow

comex · 20 days ago

> Each model’s responses are ranked by a high-performing judge model — typically OpenAI’s o3 — which compares outputs for quality, relevance, and clarity. These rankings are then aggregated to produce a performance score.

So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.

ImageXav · 20 days ago

Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.

ImageXav commented on Qwen3-4B-Thinking-2507 huggingface.co/Qwen/Qwen3... · Posted by u/IdealeZahlen

cowpig · 22 days ago

Compare these rankings to actual usage: https://openrouter.ai/rankings

Claude is not cheap, why is it far and away the most popular if it's not top 10 in performance?

Qwen3 235b ranks highest on these benchmarks among open models, but I have never met someone who prefers its output over Deepseek R1. It's extremely wordy and often gets caught in thought loops.

My interpretation is that the models at the top of ArtificialAnalysis are focusing the most on public benchmarks in their training. Note I am not saying XAI is necessarily nefariously doing this, could just be that they decided it's better bang for the buck to rely on public benchmarks than to try to focus on building their own evaluation systems.

But Grok is not very good compared to the anthropic, openai, or google models despite ranking so highly in benchmarks.

ImageXav · 22 days ago

Thanks for sharing that. Interesting that the leaderboard is dominated by Anthropic, Google and DeepSeek. Openai doesn't even register.

ImageXav commented on Voxtral – Frontier open source speech understanding models mistral.ai/news/voxtral... · Posted by u/meetpateltech

lostmsu · a month ago

My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly perfect.

ImageXav · a month ago

How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.

ImageXav commented on Occurences of swearing in the Linux kernel source code over time vidarholen.net/contents/w... · Posted by u/microsoftedging

holowoodman · 2 months ago

Theory: the shift towards lesser swearwords is a sign of corporatization, making the linux source a soulless bland hellscape of confirmity.

ImageXav · 2 months ago

I feel as though it also represents the fact that contributors are less invested in the project. There was a small study done a few years back hypothesizing that the number of swear words related somewhat to code quality [0] due to emotional involvement of the codebase authors. I can imagine this to be somewhat true. I would love to see this study redone now that LLMs are widespread on pre chatgpt repos (as I suspect that repos created using LLMs are going to be very sanitised).

[0] https://cme.h-its.org/exelixis/pubs/JanThesis.pdf

ImageXav commented on N-Params vs. Single Param carlos-menezes.com/single... · Posted by u/carlos-menezes

ashtuchkin · 4 months ago

python does this pretty elegantly:

    def my_func(a: int, b: int):
        print(a+b)

    # both of these can be used
    my_func(1, 2)
    my_func(a=1, b=2)

ImageXav · 4 months ago

Even better, python has named tuples [0]. So if you have a tuple that you are sure will always have the same inputs you can declare it:

``` Point = namedtuple('Point', 'x y') pt1 = Point(1.0, 5.0) ```

And then call the X or Y coordinates either by index: pt1[0], pt1[1], or coordinate name: pt1.x, pt1.y.

This can be a really handy way to help people understand your code as what you are calling becomes a lot more explicit.

[0] https://stackoverflow.com/questions/2970608/what-are-named-t...

ImageXav commented on AI systems with 'unacceptable risk' are now banned in the EU techcrunch.com/2025/02/02... · Posted by u/geox

johndhi · 7 months ago

What? Why? Shouldn't those same use cases all be banned regardless of what tech is used to build them?

ImageXav · 7 months ago

Not necessarily. Interpretability of a system used to make decisions is more important in some contexts than others. For example, a black box AI used to make judiciary decisions would completely remove transparency from a system that requires careful oversight. It seems to me that the intent of the legislation is to avoid such cases from popping up, so that people can contest decisions made that would have a material impact on them, and that organisations can provide traceable reasoning.

ImageXav commented on An overview of gradient descent optimization algorithms (2016) ruder.io/optimizing-gradi... · Posted by u/skidrow

janalsncm · 7 months ago

Article is from 2016. It only mentions AdamW at the very end in passing. These days I rarely see much besides AdamW in production.

Messing with optimizers is one of the ways to enter hyperparameter hell: it’s like legacy code but on steroids because changing it only breaks your training code stochastically. Much better to stop worrying and love AdamW.

ImageXav · 7 months ago

Something that stuck out to me in the updated blog [0] is that Demon Adam performed much better than even AdamW, with very interesting learning curves. I'm wondering now why it didn't become the standard. Anyone here have insights into this?

[0] https://johnchenresearch.github.io/demon/