Readit News
cowsaymoo · a year ago
Now this is truly the programming language that we should be using to benchmark LLM code gen in a private hold-out set. There are no substantial datasets on the internet or GitHub, and no documentation except the one provided. And that's all the model should need.

I asked GPT-4 to write a mat mul function, but that was too ambitious and it spit out outrageous nonsense.

To be fairer, I gave it in-context access to the documentation in the prompt, along with the Fibonacci example function; aka everything humans have access to. I then asked it to do the simpler task of converting a base 10 integer to binary. It was unable to write something error-free even after 4 rounds of supplying it the error messages.

I repeated this 5 times in case it generates something grammatical in the Top-K@5.

I suspected there was some confusion it couldn't surmount about string manipulation. So I changed the question to something challenging, yet something that only used function calls, conditional logic, basic math ops, and numbers. First, I asked for an nth root approximator using Newton's method. Didn't work. Asked for just the square root. Didn't work. Finally, I asked for a function that prints a student's grade given their integer percentage. Not even.

GPT-4 also persistently hallucinated the keyword BREAKING NEWS, which I think sounds like a pretty good keyword if Tabloid were to ever get error handling.

The spooky part is that almost all the solutions at face value would get partial credit. They had the right abstract approach, being familiar with reams of example approaches in natural language or programming languages. However, in each case, GPT-4, 4o, and Claude all failed to produce something without syntax errors.

I suspect this is the case because transformers do subgraph matching, and while on one end there are rich internal connections for all the problems I requested, on the other end there is nothing similar enough for it to even get a foothold, hence the biggest struggle being syntax. If the only barrier to executing Tabloid code (or other unseen languages) is more basic syntax training, then it excitingly suggests it just needs to learn the abstract concepts from leetcode scrapes once for every syntax it knows. Prior research has shown that grammar is easy for language models. When GPT-2 was made large enough, it went from babbling to grammatical sentences very early in its training, and at that moment its loss plummeted.

All tests conducted in temporary data mode so that this eval stays dark.

silentdanni · a year ago
Claude managed to write code successfully.

```

DISCOVER HOW TO square_root WITH x, iterations
RUMOR HAS IT
    EXPERTS CLAIM guess TO BE x DIVIDED BY 2

    DISCOVER HOW TO improve_guess WITH current_guess
    RUMOR HAS IT
        SHOCKING DEVELOPMENT
            (current_guess PLUS (x DIVIDED BY current_guess)) DIVIDED BY 2
    END OF STORY

    DISCOVER HOW TO iterate WITH current_guess, remaining_iterations
    RUMOR HAS IT
        WHAT IF remaining_iterations SMALLER THAN 1
            SHOCKING DEVELOPMENT current_guess
        LIES! RUMOR HAS IT
            EXPERTS CLAIM new_guess TO BE improve_guess OF current_guess
            SHOCKING DEVELOPMENT
                iterate OF new_guess, remaining_iterations MINUS 1
        END OF STORY
    END OF STORY
    
    SHOCKING DEVELOPMENT iterate OF guess, iterations
END OF STORY

EXPERTS CLAIM number TO BE 16
EXPERTS CLAIM num_iterations TO BE 5

YOU WON'T WANT TO MISS 'The square root of'
YOU WON'T WANT TO MISS number
YOU WON'T WANT TO MISS 'is approximately'
YOU WON'T WANT TO MISS square_root OF number, num_iterations

PLEASE LIKE AND SUBSCRIBE

```
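
For anyone who doesn't read Tabloid, here is a rough Python sketch of what the program above appears to compute: a fixed number of Newton (Babylonian) steps toward a square root. It assumes SHOCKING DEVELOPMENT means return and WHAT IF / LIES! means if / else; the Python names simply mirror the Tabloid ones and are not part of any library.

```python
def square_root(x, iterations):
    """Approximate sqrt(x) with a fixed number of Newton steps.

    Assumes x > 0; like the Tabloid version, there is no guard
    against dividing by a zero guess.
    """
    guess = x / 2  # same starting guess as the Tabloid program

    def improve_guess(current_guess):
        # One Newton step for f(g) = g*g - x
        return (current_guess + x / current_guess) / 2

    def iterate(current_guess, remaining_iterations):
        # Recursion stands in for a loop, as in the Tabloid code
        if remaining_iterations < 1:
            return current_guess
        new_guess = improve_guess(current_guess)
        return iterate(new_guess, remaining_iterations - 1)

    return iterate(guess, iterations)


number = 16
num_iterations = 5
print("The square root of", number, "is approximately",
      square_root(number, num_iterations))
```

With 16 and 5 iterations the guess goes 8, 5, 4.1, and then settles very close to 4.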

CapeTheory · a year ago
This is consistent with my own experience that Claude is just downright better than ChatGPT.
cowsaymoo · a year ago
Ah bravo! What was the prompt and Claude model?
carterdmorgan · a year ago
Great idea here. I wonder if there's potentially more demand for new programming languages now purely as benchmarks for LLMs, like you said?
cowsaymoo · a year ago
Maybe they will take on that role too one day
rob74 · a year ago
I really think the "please like and subscribe" that ends the program should also be printed out (with a link to the project's GitHub page to make it more... actionable).
abtinf · a year ago
I would change BEATS/SMALLER THAN to “DESTROYS” and “HUMILIATED BY”
boredemployee · a year ago
hahaha laughed hard at this one.
6510 · a year ago
and functions: WHY YOU SHOULD foo WITH bar
velcrovan · a year ago
I wrote the Racket implementation, in case you want to be able to compile your Tabloid programs: https://github.com/otherjoel/tabloid
ChrisArchitect · a year ago
Some more discussion from 2020 with author input:

https://news.ycombinator.com/item?id=24578749

georgf · a year ago
Reminds me of ArnoldC[1] from a few years ago.

[1] https://lhartikk.github.io/ArnoldC/

red-iron-pine · a year ago
For-Loops should be something like

[n] GOOD REASONS WHY [i =< n]

[thing] HATES THIS THING <----- exception handling

ITS TIME WE TALK ABOUT [x] <----- while-loop

Cthulhu_ · a year ago
I couldn't believe and was SHOCKED to find out that this was a computer language! Please like and subscribe to learn more.