chromaton (u/chromaton)

chromaton commented on The programmers who live in Flatland blog.redplanetlabs.com/20... · Posted by u/winkywooster

chromaton · 12 days ago

Lisp has been around for 65 years (not 50, as the author believes), and is one of the very first high-level programming languages. If it were as great as its advocates say, surely it would have taken over the world by now. But it hasn't, and advocates like PG and this article's author don't understand why, nor do they take any lessons from that.

chromaton commented on GPT-5: "How many times does the letter b appear in blueberry?" bsky.app/profile/kjhealy.... · Posted by u/minimaxir

rainsford · 4 months ago

I can't take credit for coming up with this, but LLMs have basically inverted the common Sci-Fi trope of the super intelligent robot that struggles to communicate with humans. It turns out we've created something that sounds credible and smart and mostly human well before we made something with actual artificial intelligence.

I don't know exactly what to make of that inversion, but it's definitely interesting. Maybe it's just evidence that fooling people into thinking you're smart is much easier than actually being smart, which certainly would fit with a lot of events involving actual humans.

chromaton · 4 months ago

Moravec strikes again.

chromaton commented on GPT-5: Overdue, overhyped and underwhelming. And that's not the worst of it garymarcus.substack.com/p... · Posted by u/kgwgk

chromaton · 4 months ago

For my benchmarking suite, it turns out that it's about 1/5 the price of Claude Sonnet 4.1, with roughly comparable results.

chromaton commented on How I code with AI on a budget/free wuu73.org/blog/aiguide1.h... · Posted by u/indigodaddy

chromaton · 4 months ago

If you're looking for free API access, Google offers access to Gemini for free, including for gemini-2.5-pro with thinking turned on. The limit is... quite high, as I'm running some benchmarking and haven't hit the limit yet.

Open weight models like DeepSeek R1 and GPT-OSS are also made available with free API access from various inference providers and hardware manufacturers.

chromaton commented on Open models by OpenAI openai.com/open-models/... · Posted by u/lackoftactics

chromaton · 4 months ago

This has been available (20b version, I'm guessing) for the past couple of days as "Horizon Alpha" on OpenRouter. My benchmarking runs with TianshuBench for coding and fluid intelligence were rate limited, but the initial results are worse than those of DeepSeek R1 and Kimi K2.

chromaton commented on François Chollet: The Arc Prize and How We Get to AGI [video] youtube.com/watch?v=5QcCe... · Posted by u/sandslash

chromaton · 5 months ago

Current AI systems don't have a great ability to take instructions or information about the state of the world and produce new output based upon that. Benchmarks that emphasize this ability help greatly in progress toward AGI.

chromaton commented on Introducing TiānshūBench (天书Bench) jeepytea.github.io/genera... · Posted by u/chromaton

JSR_FDED · 7 months ago

Would it be useful to generate Procedural, OOP and Functional variations of the problems?

chromaton · 7 months ago

Yes, it would be fantastic to have more languages to test against. I picked the base language I did (Mamba) because it was easy to modify and integrate into Python.

chromaton commented on Introducing TiānshūBench (天书Bench) jeepytea.github.io/genera... · Posted by u/chromaton

chiwilliams · 7 months ago

Cool project! I have a couple of questions that would be nice to see answered in the writeup:

* How did you generate your example problems? Did you take an existing benchmark? Or did you have LLMs generate the problems?
* Have you given any thought to adding a second "base programming language" to alter? I'm not sure that there's enough variation as it stands. (Another thought would be to generate 4 or 5 different new languages, each quite different, and then run the benchmark on each of those languages. I'm not sure how much it matters that the language is randomly generated each time.)

But overall, a clever idea!

chromaton · 7 months ago

Generating the problems: I just thought up a few simple things that the computer might be able to do. In the future, I hope to expand to more complex problems, based upon common business situations: reading CSVs, parsing data, etc. I'll probably add new tests once I get multi-shot and reliability working correctly.

New base programming languages would be great, but what would be even better is some sort of meta-language where many features can be turned on or off, rather than just scrambling the keywords like I do now.

I did some vibe testing with a current frontier model, and it gets quite confused and keeps insisting that there's a control structure that definitely doesn't exist in the TiānshūBench language with seed=1.
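For anyone curious what "scrambling the keywords" with a seed could look like, here is a rough Python sketch. It is not the actual TiānshūBench code: the keyword list, the function names (make_keyword_map, scramble) and the six-letter nonsense words are all assumptions made up for illustration. The only idea taken from the description above is that a seed deterministically picks the replacement keywords.

```python
import random
import re

# Hypothetical keyword list for illustration; the real base language
# (Mamba) defines its own set.
KEYWORDS = ["if", "else", "while", "for", "return", "function"]


def make_keyword_map(seed, keywords=KEYWORDS, length=6):
    """Map each keyword to a random nonsense word, reproducibly per seed."""
    rng = random.Random(seed)  # private RNG so the global state is untouched
    letters = "abcdefghijklmnopqrstuvwxyz"
    mapping = {}
    used = set()
    for kw in keywords:
        # Redraw until the word is unique and does not collide with a keyword.
        while True:
            word = "".join(rng.choice(letters) for _ in range(length))
            if word not in used and word not in keywords:
                break
        used.add(word)
        mapping[kw] = word
    return mapping


def scramble(source, mapping):
    """Replace whole-word keywords in source text using the mapping.

    Simplification: this also rewrites keywords inside string literals
    and comments, which a real implementation would handle in the lexer.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)


if __name__ == "__main__":
    mapping = make_keyword_map(seed=1)
    print(scramble("if x: return iffy", mapping))
```

The same seed always yields the same mapping, so a run with seed=1 can be reproduced exactly, while a different seed gives the model a language it cannot have memorized.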