For context: the author (see his other posts) is exploring the possibilities of writing C with no C runtime to avoid having to deal with it on Windows. He began to kind of treat it as a new language, with the string type, arenas and such, which help avoid memory bugs (and from my experience, are very useful).
This is a pretty cool hack. Makes me want to write a regex library again.
TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's and were never updated for the new C99 language features and more recent common OS features (like non-blocking IO) - and that's coming from a C die hard ;)
> TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's (...)
This. A big reason Rust got some traction from the outset was that it presented itself as an alternative to C for systems programming, one that offered a modern set of libraries designed with the benefit of decades of usability research.
I don't think everything has to be "modernized" and "updated." When I look at software from the 80s that is still with us, I think: "This is robust, keeps working, and has withstood the test of time" not "This must be changed." I still use C and the standard C library because I know how it worked in the past, I know it works today, and I know it will work for decades to come.
(minus the known foot-guns like strcpy() that we learned long ago were not great)
Thanks for that explanation! I have occasionally fantasized about a similar project - what could C be like, if one abandoned its ancient stdlib and replaced it with something suited to current purposes? - so I'm looking forward now to reading more of this author's writing.
Thank you for the context. I wouldn't have read the article without it. I mean, it's a pretty good idea for "no runtime," but when I saw the article title, I thought at first "Why????" Honestly, I'm glad I read it.
This comprehensive article goes over the problems of memory allocation, how programmers and educators have been trained to think about the problem wrongly, and how the concept of arenas solves it.
As someone who spends most of his time in garbage collected languages, this was wildly fascinating to me.
So bad is the performance of gcc std::regex that I reimplemented part of it using regex(3). Of course, I didn’t discover the problem until I’d committed to the interface, so I put mine in namespace dts, just in case one day the supplied implementation becomes useful.
As it stands, std::regex should come with a warning label. It’s fine for occasional use. As part of a parser, it’s not. Slow is better than broken, until slow is broken.
To be fair, the GNU implementation of std::regex has to conform to the API defined by ISO/IEC 14882 (The C++ Programming Language). If you don't have to provide that API purely in a header file, it gets pretty easy to write something bespoke that is faster, or smaller, or conforms to some special esoteric requirement, or does something completely different than what the C++ standard library specification requires.
The purpose of the C++ standard library is to provide well-tested, well-documented general functionality. If you have specific requirements and have an implementation or API that meets your requirements better than what the C++ standard library supplies, that's great. You're encouraged to use that instead.
If you have an implementation of std::regex that meets all the documented requirements and is provably faster under all or most circumstances than my implementation is, then submit it upstream. It's Free software and it wouldn't be the first time improved implementations of library code have been suggested and accepted by that project. Funny how no one has done that for std::regex in over a decade though, despite the complaints.
Around 30 years ago, STL introduced an allocator template parameter everywhere to let you control allocation. Here in 2024 we read about making use of the, erm, strange semantics of dynamic linking to force standard C++ code to allocate your way
TFA links to explanations of what arenas are and where they come from, notes that some bits included here would not really be part of this library but are assumed to be part of the project using these techniques, and explains the general point of the exercise, including that this isn't even strictly a suggestion for a library but a "potpourri of techniques".
They are fully aware of -lre and assume that everyone else is too. This isn't about just achieving regex somehow. It's about avoiding the crt and gc and c++ in general while using an environment that normally includes all that by default.
You don't redefine new just to get regex. Obviously there must be some larger point and this regex is just some zoomed-in detail example of existing and operating within that larger point.
This is fun and impressive, but I feel the author kind of misses out on explaining in the intro why it would be wrong to just ... use C's regex library [1]?
I guess the entire post could be seen as an exercise in wrapping C++ to C with nice memory-handling properties and so on, but it would also be fine to be open and upfront about that, in my opinion.
> The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc.
I do something quite different. I design the API so any data returned by the library function is allocated by the caller. This means the caller has full control over what style of memory management works best.
For example, you can then choose to use stack allocation, RAII, malloc/free, the GC, static allocation, etc.
Isn't giving the caller control over the memory exactly what this API does? The caller just passes in a block of memory that will be used for all of the internal allocations as well as the strings returned by the API.
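To make the "caller just passes in a block of memory" idea concrete, here is a minimal sketch of a bump arena over a caller-supplied buffer. The names `Arena`, `arena_from`, and `arena_alloc` are hypothetical and are not taken from the article; the article's actual arena type may well look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bump arena over memory the caller already owns. */
typedef struct {
    char *beg;  /* next free byte */
    char *end;  /* one past the last usable byte */
} Arena;

/* Wrap a caller-provided buffer (stack, static, malloc'd, ...). */
static Arena arena_from(void *buf, size_t len)
{
    Arena a;
    a.beg = buf;
    a.end = a.beg + len;
    return a;
}

/* Return zeroed memory aligned to `align` (a power of two),
   or NULL when the arena does not have enough room left. */
static void *arena_alloc(Arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t pad = (size_t)(-(uintptr_t)a->beg & (align - 1));
    size_t avail = (size_t)(a->end - a->beg);
    if (pad > avail || size > avail - pad) {
        return NULL;
    }
    char *p = a->beg + pad;
    a->beg = p + size;  /* bump the pointer; there is no per-object free */
    return memset(p, 0, size);
}
```

Under this sketch the caller decides where the buffer lives, and "freeing" everything the library allocated amounts to discarding or reusing the buffer.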
What is grep going to do while it waits for data?
https://gitlab.gnome.org/GNOME/glib/
https://apr.apache.org/
As proven by early editions of Petzold's famous book.
Whether you want to use that is another question.
Problematic macro in the header, custom string type compatible with nothing else in C, and I have no idea where the arena type comes from.
Having it magically deallocate memory is nice, but will confuse C programmers reading the caller.
Honestly, adding -lre to the linker is just much easier, and that library comes with docs too.
1: https://www.man7.org/linux/man-pages/man3/regex.3.html
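For readers who haven't used the POSIX interface linked above, a minimal sketch of it follows. The helper name `first_match` is hypothetical; `regcomp`, `regexec`, and `regfree` are the standard calls from `<regex.h>`. Note that `regcomp` manages its own memory internally (released by `regfree`), which is the kind of hidden allocation the arena approach in the article is trying to avoid.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on a match and stores its byte offsets,
   0 when nothing matches, -1 if the pattern fails to compile. */
static int first_match(const char *pattern, const char *text,
                       size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];

    assert(pattern && text && start && end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 1, m, 0);
    regfree(&re);  /* release whatever regcomp allocated */
    if (rc != 0) {
        return 0;  /* REG_NOMATCH, or an error treated the same way here */
    }
    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}
```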
my_audio_sdk_init(&arena, sizeof(arena)); // char arena[65536]; // or something like this
I do something quite different. I design the API so any data returned by the library function is allocated by the caller. This means the caller has full control over what style of memory management works best.
For example, you can then choose to use stack allocation, RAII, malloc/free, the GC, static allocation, etc.
For a primitive example, snprintf.
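A rough sketch of that snprintf-style contract, where the caller owns the output buffer and the function only reports how much space the full result needs. The wrapper name `format_point` is made up for illustration; only `snprintf` itself is a real library call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical formatter in the snprintf style: the caller owns `dst`.
   Returns the length the full string needs (excluding the terminator),
   or a negative value on error. Output is truncated if `cap` is too small. */
static int format_point(char *dst, size_t cap, int x, int y)
{
    assert(dst != NULL || cap == 0);
    return snprintf(dst, cap, "(%d, %d)", x, y);
}
```

Calling it with a NULL buffer and a capacity of 0 only measures, so the caller can then pick stack, static, arena, or heap storage of the right size and call it again.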