kouteiheika (u/kouteiheika)

kouteiheika commented on Code formatting comes to uv experimentally pydevtools.com/blog/uv-fo... · Posted by u/tanelpoder

benreesman · 2 days ago

I agree with you that binary distribution is a perfectly reasonable adjunct to source distribution and sometimes even the more sensible one (toolchain size, etc).

In this instance the build is way nastier than building the NVIDIA toolchain (which Nix can do with a single line of configuration in most cases), and the binary artifacts are almost as broken as the source artifact because of NVIDIA tensor core generation shenanigans.

The real answer here is to fucking fork flash-attn and fix it. And it's on my list, but I'm working my way down the major C++ packages that all that stuff links to first. `libmodern-cpp` should be ready for GitHub in two or three months. `hypermodern-ai` is still mostly a domain name and some scripts, but they're the scripts I use in production, so it's coming.

kouteiheika · 2 days ago

I thought about fixing Flash Attention too so that I don't have to recompile it every time I update Python or pytorch (it's the only special snowflake dependency that I need to manually handle), but at the end of the day it's not that much of a pain to justify the time investment.

If I'm going to invest time here then I'd rather just write my own attention kernels and also do other things which Flash Attention currently doesn't do (8-bit and 4-bit attention variants similar to Sage Attention, and focus on supporting/optimizing primarily for GeForce and RTX Pro GPUs instead of datacenter GPUs which are unobtanium for normal people).

kouteiheika commented on Code formatting comes to uv experimentally pydevtools.com/blog/uv-fo... · Posted by u/tanelpoder

twothreeone · 2 days ago

While shipping binary kernels may be a workaround for some users, it goes against what many people would consider "good etiquette" for various valid reasons, such as hackability, security, or providing free (as in liberty) software.

kouteiheika · 2 days ago

It's not a workaround; it's the most sane way of shipping such software. As long as the builds are reproducible there's nothing wrong with shipping binaries by default, especially when those binaries require non-trivial dependencies (the whole CUDA toolchain) to build.

There's a reason why even among the most diehard Linux users very few run Gentoo and compile their whole system from scratch.

kouteiheika commented on Code formatting comes to uv experimentally pydevtools.com/blog/uv-fo... · Posted by u/tanelpoder

benreesman · 2 days ago

It's really difficult to do Python projects in a sound, reproducible, reasonably portable way. uv sync is in general able to build you only a package set that it can promise to build again.

But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip. They are trying to thread a delicate an economically pivotal needle: the Python community prizes privatizing the profits and socializing the costs of "works on my box".

The cost of making the software deployable, secure, repeatable, reliable didn't go away! It just became someone else's problem at a later time in a different place with far fewer degrees of freedom.

Doing this in a way that satisfies serious operations people without alienating the "works on my box...sometimes" crowd is The Lord's Work.

kouteiheika · 2 days ago

> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.

This is a self-inflicted wound, since flash attention insist on building a native C++ extension which is completely unnecessary in this case.

What you can do is the following:

1) Compile your CUDA kernels offline. 2) Include those compiled kernels in a package you push to pypi. 3) Call into the kernels with pure Python, without going through a C++ extension.

I do this for the CUDA kernels I maintain and it works great.

Flash attention currently publishes 48 (!) different packages[1], for different combinations of pytorch and C++ ABI. With this approach it would have to only publish one, and it would work for every combination of Python and pytorch.

[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...

kouteiheika commented on PYX: The next step in Python packaging astral.sh/blog/introducin... · Posted by u/the_mitsuhiko

russellbeattie · 10 days ago

What are the reasons that Python can't implement the same sort of module/packaging system as NodeJS? That seems to work well enough.

Executing a Python script in the same directory as some sort of project.json file that contains all the complicated dependency details would be a pretty good solution to me. But I'm probably missing a whole bunch of details. (Feel free to educate me).

In general I really dislike the current system of having to use new environment variables in a new session in order to isolate Py scripts. It has always seemed like a hack with lots of footguns. Especially if you forget which console is open.

kouteiheika · 10 days ago

It can. That's what uv is. Put an '#!/usr/bin/env -S uv run python' shebang in your script, add a `pyproject.toml` with all of your deps, and you're done.

kouteiheika commented on Debian GNU/Hurd 2025 released lists.debian.org/debian-h... · Posted by u/jrepinc

ants_everywhere · 12 days ago

> They should retry rewriting it in rust (or zig)

It's an antipattern to chase whatever language is being hyped the most at the moment. And it's probably bad from a community POV to deliberately attract developers who are chasing hype.

kouteiheika · 11 days ago

> > They should retry rewriting it in rust (or zig)

> It's an antipattern to chase whatever language is being hyped the most at the moment.

Hype? Come on, Rust's 1.0 release was already over a decade ago. At this point it's pretty boring. How many more years will it take before people start taking it seriously and finally accept that those who prefer Rust over C do so because it's a much better language than C and not just because it's hyped?

kouteiheika commented on Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally? · Posted by u/superasn

kazinator · 15 days ago

One trick is that multiple prompts are combined into a batch that is subject to the token prediction simultaneously.

I've had Google Gemini start talking to be in the context of someone else's chat, unrelated to mine.

ME: Does POSIX have any API for creating a filesystem object (such as a file) with a predetermined inode number supplied by the caller?

AI: <reformats my question into LaTeX syntax, out of the blue>

  \documentclass{article}
  \usepackage{amsmath}
  \usepackage{amsfonts}
  \usepackage{amssymb}
  \begin{document}
  \section*{Does POSIX have any API for creating a filesystem object with a predetermined inode number supplied by the caller?}
  \textbf{No, POSIX does not provide a standard API for creating a filesystem object (like a file or directory) with a caller-specified inode number.}

  \subsection*{Inode Allocation in POSIX Filesystems}
  [... SNIP]

ME: Why are you spewing LaTeX syntax at me?

AI: You are absolutely right! My apologies. That was a mistake on my part. I got carried away with the instruction to use LaTeX for mathematical and scientific notations and incorrectly applied it to the entire response. [... SNIP]

There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.

kouteiheika · 15 days ago

> There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.

Nope. That's not how it works. Attention doesn't work across multiple independent prompts queued in the same batch. It's not physically possible for the tokens of another chat to leak.

What most likely happened is that the model glitched out to the instructions in its (hidden) system prompt, which most likely does include instructions about using LaTeX for mathematical and scientific notation.

kouteiheika commented on Slopsquatting en.wikipedia.org/wiki/Slo... · Posted by u/gregnavis

kouteiheika · 18 days ago

> LLMs hallucinated a package named "huggingface-cli" [...] it is not the name of the package [...] software is correctly installed with [...] huggingface_hub

It would be a good idea to disallow registering packages which only differ by '-'/'_'. Rust's crates.io does this, so if you register `foo-bar` you cannot register `foo_bar` anymore.

kouteiheika commented on Genie 3: A new frontier for world models deepmind.google/discover/... · Posted by u/bradleyg223

kouteiheika · 19 days ago

> To that end, we're exploring how we can make Genie 3 available to additional testers in the future.

No need to explore; I can tell you how. Release the weights to the general public so that everyone can play with it and non-Google researchers can build their work upon it.

Of course this isn't going to happen because "safety". Even telling us how many parameters this model has is "unsafe".

kouteiheika commented on LLM leaderboard – Comparing models from OpenAI, Google, DeepSeek and others artificialanalysis.ai/lea... · Posted by u/bookofjoe

LeoPanthera · 23 days ago

Is there an index for judging how much a model distorts the truth in order to comply with a political agenda?

kouteiheika · 23 days ago

It's not perfect, but, yes: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

kouteiheika commented on A GPU Calculator That Helps Calculate What GPU to Use calculator.inference.ai/... · Posted by u/chlobunnee

fooker · a month ago

No, if you try to train an LLM like you're suggesting:

- you'll get something similar to gpt2.

- To approach the scale of modern LLMs, you'll need about 10x more than all the GPUs in the world.

It's a neat abstraction to consider these the same, but do you think Meta is paying 100M for writing a 15 line script?

kouteiheika · a month ago

I still don't understand what exactly you are disagreeing with.

Meta is paying the big bucks because to train a big LLM in a reasonable time you need *scale*. But the process itself is the same as full fine-tuning, just scaled up across many GPUs. If I would be patient enough to wait a few years/decades for my single GPU to chug through 15 trillion tokens then I could too train a Llama from scratch (assuming I feed it the same training data).