DonHopkins · 6 months ago
I hope the term "deyodawgification" enters the pantheon of elegant code purification terms of art, alongside the classics like "degotofying", "deCOMtamination", and "outparamdelling".

https://knowyourmeme.com/memes/xzibit-yo-dawg

https://wiki.mozilla.org/Gecko:DeCOMtamination_Algorithm

https://bugzilla.mozilla.org/show_bug.cgi?id=455943

https://news.ycombinator.com/item?id=22708241

Who knows (or can invent) any other good ones:

Deadlocksmithing, JSONcology, YAMLectomy, XMLimination, DeDTDification, DeXMLEntitification, ExCDATAration, AntiPOJOification, SOAPerfluousness Scrubbing, WSDLectomy, Delogorrhea, DeK8ification, HelmholtzianRechartification, DevOpsDevangelicalism, FactoryFactoryDefactoringDefactoring, Antiquasimonolithicmicrofragmentedmacrodecontainerizationarianism...

immibis · 6 months ago
"Soft lock picking" was the title of a YouTube series about strange ways out of impossible save files (soft locks, as opposed to hard locks where the game freezes) in video games.
ggambetta · 6 months ago
DTDetox
preinheimer · 6 months ago
I think this was a really brave call for help from the writer. They needed help and they asked for it, from strangers!
spyc · 6 months ago
Thank you!
kibwen · 6 months ago
I'm interested in examining the idea of a programming language that eschews the notion of a callstack and returns to having a single fixed activation record per function. This obviously places a lot of limits on language design, but in return you get a statically-guaranteed upper bound on memory usage, plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).
openasocket · 6 months ago
There's prior art in this, very very old prior art. If you check out the first volume of The Art of Computer Programming, it uses a fictitious architecture called MIX. I understand that while it was fictitious, it was made to be similar enough to contemporary architectures. In it there are no special stack manipulation instructions, because there is no first-class notion of a stack! Functions used global variables for their scratch space. To call a function you would just jump to that address. Of course that function needed to know where to jump back to when complete. To do that, before jumping, the caller would WRITE THE RETURN ADDRESS TO THE JUMP INSTRUCTION AT THE END OF THE FUNCTION.

This seems kind of insane in the modern day (function calls requiring self-modifying code!), but it meant you could implement functions without needing even a single extra word of storage space, and all you really gave up was recursion. I believe the original Fortran and some other older languages were originally implemented this way.

There's definitely an advantage to this idea, but also some downsides. While you have a guaranteed upper bound on memory usage, it's completely static and actually going to be worse than doing the equivalent on the stack. Suppose you have 100 functions that each need 100 bytes of scratch space. Statically allocating everything like you describe means you need 10KB of memory reserved. But suppose there is one main function, and the rest are helper functions, and the main function calls each helper function one at a time and the helper functions don't call anything. With the static approach, you still need 10KB. With the stack-based approach, you only need 200 bytes (the stack space for the main function, and the stack space for the helper function it is currently calling). Another advantage of the stack-based approach is due to caching. In the static approach, whether a function's scratch space is in the cache depends on how often that function is called. But in the stack-based approach, the top of the stack is almost always in the L1 cache, giving a speed boost to functions even when they are not called frequently.

That said, I do think it is an interesting idea that is worth exploring.

NobodyNada · 6 months ago
> Suppose you have 100 functions that each need 100 bytes of scratch space. Statically allocating everything like you describe means you need 10KB of memory reserved. But suppose there is one main function, and the rest are helper functions, and the main function calls each helper function one at a time and the helper functions don't call anything. With the static approach, you still need 10KB. With the stack-based approach, you only need 200 bytes (the stack space for the main function, and the stack space for the helper function it is currently calling).

This is not necessarily the case: if you statically analyze the call graph, you may be able to prove that two functions will never be live at the same time, and thus reuse memory regions for their local variables: https://github.com/llvm-mos/llvm-mos/blob/main/llvm/lib/Targ...

Of course, recursion, function pointers, and dynamic linking will defeat (or at least complicate) this scheme.

fpoling · 6 months ago
Original Fortran did not modify the code. Rather, each function had a global variable storing the return address, which the caller had to set.
cyco130 · 6 months ago
As others have pointed out, that's pretty much how things originally started. A more recent example I remember is the Action! programming language for Atari 8-bit computers[1], an Algol descendant for the 6502 with static activation frames.

But I don't understand the appeal. Many algorithms and data structures are most naturally expressed with recursion. Not allowing it forces the programmer to create and maintain their own manual stack data structures, which brings us back to square one in terms of a statically-guaranteed upper bound on memory usage, since they can grow indefinitely. At best, stack overflow errors will be replaced by array out-of-bounds errors, which is the exact same thing anyway. In fact, it will be even worse: a manual stack implementation will come with its own complexity and sources of bugs.

[1] https://en.wikipedia.org/wiki/Action!_(programming_language)

kazinator · 6 months ago
> I'm interested in examining the idea of a programming language that eschews the notion of a callstack

Chicken Scheme fits the bill. While it doesn't eschew the notion of a call stack, it turns it on its ear.

In Chicken Scheme

1. All objects are allocated on the stack. Strings, conses, symbols, lexical environments, ...

2. Functions do not return. Everything is compiled to C which uses continuation passing style. What looks like returning in the Scheme source code is a continuation call: the parent function passes a continuation, and so the function return invokes that and runs that lambda.

3. Because functions do not return and everything is allocated from the stack, the stack grows voraciously. When it hits a certain limit, a kind of ephemeral garbage collection takes place: all objects that have been allocated on the stack which are still live are transferred to the heap. Then ... the stack pointer is rewound.

Essentially, the C stack has become a bump allocator in a kind of generational GC.

adrian_b · 6 months ago
That programming language exists.

The old FORTRAN and COBOL language versions were like this.

In ancient FORTRAN programs, the programmer typically computed a maximum possible size for the data and allocated statically in the main program one or more work arrays corresponding to that maximum size, which were either placed in a COMMON block of global variables or they were passed as arguments to the invoked functions and subroutines.

In the functions or subroutines, the work arrays were typically partitioned by allocating in them various local variables.

Overall, this approach was more reliable, because there were no risks of exceeding the memory limits, but it was much more tedious for the programmer, because the worst cases for memory usage had to be accurately predicted.

guenthert · 6 months ago
> Overall, this approach was more reliable, because there were no risks of exceeding the memory limits, but it was much more tedious for the programmer, because the worst cases for memory usage had to be accurately predicted.

Well, most importantly, it fixed the program's limits in place. Got a bigger machine and wanted to tackle greater data-sets? You were out of luck if you didn't have the source code. Old Unix programs were often like that. It was often the GNU counterparts that did away with such arbitrary limits, allowing systems to grow and the software to stay relevant.

amavect · 6 months ago
>plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).

The C programming language supports this with the static keyword. Further calls may overwrite the pointed data. I have played with allocating fixed-size data in static locals, but I always found that allocating in the caller does the same job better. For example, compare strerror() with strerror_s(). (A sensible strerror implementation should return a pointer to static immutable data, but the Standard doesn't specify it.)

A procedural language can achieve a statically bounded call stack by restricting self recursion, mutual recursion, and function pointers. I struggle to imagine how a language without a call stack could perform differently or better.

Someone · 6 months ago
> I'm interested in examining the idea of a programming language that eschews the notion of a callstack

Technically, I don’t know any language whose spec mentions the notion of a call stack. For example, it’s perfectly OK for a conforming C compiler to use a linked list of dynamically allocated activation records (from a standards viewpoint; users of such an implementation may have different opinions).

A conforming C compiler also could choose to only use a call stack for functions that take part in recursive calls or that may get called by multiple threads.

> plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).

If you add multi-threading (something that is almost a must have for languages on powerful CPUs nowadays), I don’t think that’s easy to do.

layer8 · 6 months ago
I don’t think that “call stack” implies “contiguous memory”, or “what the operating system might think of a process call stack”, so a linked list would still qualify as a call stack. While the C standard doesn’t use the word “stack”, it explicitly requires support for recursive function calls, and the related semantics it specifies with regard to automatic storage duration effectively describe a stack.
fc417fc802 · 6 months ago
> single fixed activation record per function

What about statically determining a fixed number of activation records at compile time? Similar in spirit to security focused languages that require loops to have a statically determined finite bound in order to successfully compile.

As to lifetime after returning, do you really hate continuations that much?

kibwen · 6 months ago
> What about statically determining a fixed number of activation records at compile time?

Sure, it may be useful to represent a function's local storage as a first-class concept and then allow users to instantiate copies of the function at will, provided they're willing to allocate the extra storage themselves. That would let users either precisely limit the number of instances of a function or use dynamic allocation to manually reimplement a call stack if they so choose.

> As to lifetime after returning, do you really hate continuations that much?

This is a language that forbids recursion, the functional programmers have already run screaming for the exits. :P

hakfoo · 6 months ago
Does it require a new language, or could it be enforced by simple discipline or linting?

I'd suspect it makes sense in an embedded bare-metal environment where you know all the data structures you might ever have in circulation at once. This sometimes ties into things like explicitly unchanging data as "this can be allocated to ROM" to tightly manage the limited RAM.

Deleted Comment

cypherpunk666 · 6 months ago
related thoughts perhaps? http://lambda-the-ultimate.org/node/5555

(be warned that the old site links are slow, one has to wait for the unarchiving to run, or something.)

bsder · 6 months ago
> I'm interested in examining the idea of a programming language that eschews the notion of a callstack and returns to having a single fixed activation record per function.

Welcome to the old 8-bit C compilers.

Everything was statically allocated so no recursion. The downside is that you can only have a single thread of execution.

It would be interesting to see what that would look like if extended to multiple threads (they would have to be defined/allocated at compile time).

baggy_trough · 6 months ago
I wonder what would be wrong with changing the recursive function to take a depth and then bailing out if the depth is too big.
TonyTrapp · 6 months ago
Two things come to mind:

- Maximum stack depth varies greatly between targets. If you wanted to support a recursion depth of, say, 50, that may already be too much for some of the hardware this library needs to be able to run on, so the original problem still exists.

- This would add an arbitrary limit to the parsing capabilities of the library. Stack space is very limited to begin with compared to heap space, so any solution that remains recursive would probably require a limit far lower than a heap-based solution could offer.
echoangle · 6 months ago
Or just take the limit as an argument so developers can adjust it based on the platform. Python also has a recursion limit but you can change it if you need to.
fpoling · 6 months ago
Depth is easy to miss when the parser calls a lot of functions that call each other and that have vastly different stack usage.

A more reliable check is to compare the address of a stack variable recorded at the API boundary with the address of a current stack variable. SpiderMonkey, the JS engine in Firefox, used something like that in the past to stop runaway recursion, or at least to switch to slower algorithms with bounded stack usage.

spyc · 6 months ago
That idea works in general but causes false positives: no artificial limit you pick is "right", and the false positives can be avoided by getting rid of the recursion altogether.

PS: It's not one single function, and it's indirect rather than direct recursion.

iforgotpassword · 6 months ago
Sure if it's indirect I agree it will get messy fast with a dozen functions suddenly needing to handle an additional parameter, but unrelated to that... I'd really like to know who needs recursion for this that's deeper than 3 or 4 levels. What's the use case? Such xml surely would be unreadable and unwritable to humans, but if it's used as some form of exchange format between systems, what would that be? How would it end up with such deeply nested entities? It sounds like something you deliberately implement that way to show how "smart" you are, but not "hey that seems the reasonable thing to do here".

This makes me wonder: do any of the popular XML libs have a sort of safe mode, where custom entities and similar features are disabled, those schema URLs ignored, and namespaces just flattened (and whatever else I forgot or don't even know about)? You know, for when I know I only need to parse simple XML files that should contain a couple of plain tags and attributes, and want to reduce attack surface.

maxbond · 6 months ago
You can think of this as having two base cases, your normal success case and an error case. You could use metaprogramming to do this semi-automatically, like a decorator that took a maximum depth and checked it before calling your inner function.
mrkeen · 6 months ago
It's not ambitious enough. This 'solution' would be someone else's problem to be avoided at all cost.
BradSwain · 6 months ago
This is a neat bug!

A colleague and I spent some time last year looking for DoS vulnerabilities caused by recursing on user input [1].

TL;DR: With CodeQL and some manual review, we found several issues resulting in two assigned CVEs, a rustsec advisory, and a handful of fixes implemented in various projects.

We mostly looked at Java projects. It is interesting to see a C vulnerability from around the same time.

It would be cool to see a larger study on how common this issue is across different programming languages.

[1]: https://resources.trailofbits.com/input-driven-recursion-whi...

spyc · 6 months ago
Thanks for sharing that research!
bahorn · 6 months ago
Stack clashing is pretty neat, and something you should really pay attention to in embedded spaces (it's often exploitable in UEFI land, as most EDK2 builds lack guard pages).

I got to write some exploits for some recently, very fun bug class to exploit.

jasonthorsness · 6 months ago
There's a useful clang-tidy check to warn on this, for when you want to ensure there is no recursion lurking anywhere in a large codebase sensitive to stack overflow issues:

https://clang.llvm.org/extra/clang-tidy/checks/misc/no-recur...

pcwalton · 6 months ago
That warning doesn't ensure that there's no recursion, as the caveats point out. Indeed, it's trivial to show that ensuring there's no recursion is impossible as long as you have function pointers. (This is also why languages that claim to be working on a solution to prevent recursion statically are going to fail.)
LegionMammal978 · 6 months ago
In general, analysis becomes impossible when you have function pointers produced and used all over the place. But that doesn't have to be the case if the program is written deliberately to avoid the impossible cases. E.g., if you can separate the set of functions into "functions that (indirectly) are exported or have their address taken" and "functions that (indirectly) invoke a function pointer", then you know that there can't be any recursion through function pointers. Basically just a form of escape analysis.

And if you're writing your own language, you can just have explicit 'colored' functions (and corresponding pointers), and enforce that the call graph between different colors has no cycles, and no function invokes a pointer of its own color. Generics would get a bit messy, but it's still far from impossible.

Deleted Comment

DonHopkins · 6 months ago
There's a wonderful DDJ interview with James Clark (author of expat and developer of many other open source SGML and XML standards and tools like RELAX NG, and even horrible ones like XSLT ;) called "A Triumph of Simplicity: James Clark on Markup Languages and XML", in which he explains how a standard has failed if everyone just uses the reference implementation, because the point of a standard is to be crisp and simple enough that many different implementations can interoperate perfectly.

A Triumph of Simplicity: James Clark on Markup Languages and XML:

https://www.drdobbs.com/a-triumph-of-simplicity-james-clark-...

I wrote more about his work in this discussion thread about Ted Nelson on What Modern Programmers Can Learn from the Past, and reading documents from 20 years ago:

https://news.ycombinator.com/item?id=16226209

>Reading documents from 20 years ago is a mixed bag. Links usually fail horribly, which was something Xanadu was trying to solve, but I'm not convinced they could have solved it so well that 20-year-old links would still actually work in practice. [...]

>In the ideal world we would all be using s-expressions and Lisp, but now XML and JSON fill the need of language-independent data formats.

>Not trying to defend XSLT (which I find to be a mixed bag), but you're aware that its precursor was DSSSL (Scheme), with pretty much a one-to-one correspondence of language constructs and symbol names, aren't you?

>The mighty programmer James Clark wrote the de-facto reference SGML parser and DSSSL implementation, was technical lead of the XML working group, and also helped design and implement XSLT and XPath (not to mention expat, Trex / RELAX NG, etc)! It was totally flexible and incredibly powerful, but massively complicated, and you had to know scheme, which blew a lot of people's minds. But the major factor that killed SGML and DSSSL was the emergence of HTML, XML and XSLT, which were orders of magnitude simpler.

James Clark:

http://www.jclark.com/

https://en.wikipedia.org/wiki/James_Clark_(programmer)