willvarfar · a year ago
I enjoy messing around parsing things etc. Although I started with handmade unschooled attempts many decades ago, I later went through the classic yak/bison phase etc before firmly getting back into the hand-rolling custom side of things where I'm much happier.

My main motivation is speed, e.g. I have enjoyed handcrafting wickedly fast custom JSON and SQL parsers and bits of code that sit on top of them.

My general approach now is to use a tokenizer that generates an int that represents each token, where the bits in the int tell me the location of the token in the source buffer and its type etc.

In languages with a lot of per-object overhead like Java this is literally a long; but in the C/C++/rust camp it can look and feel like an object or struct or whatever because, underneath, it ends up still being an int that gets passed in registers and on the stack etc.

Sometimes the parsing is one-pass and the tokens don't need to be stored or anything; it's usually the memory allocations that kill parsing performance, and a once-through JSON decoder can completely eliminate bottlenecks on hot paths in data processing etc.

Other times I run through once and store these tokens in an array, particularly if I'm going to be going back over them etc. It's actually easy to make a function that, given an 'int' token, finds the next one, so if you are going through the data several times you don't need any allocations. But other times you want to go backwards, or you are really going to be going through the data a lot, so it makes sense to store the offsets of everything.
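A minimal sketch of the idea (field widths and the toy "next token" rule are my own invention, not from any of willvarfar's parsers): the token kind, source offset, and length are packed into one u64, so a token travels in a register and re-scanning needs no allocations.

```rust
// Hypothetical layout: [ kind: 8 bits | len: 24 bits | offset: 32 bits ].
// Being Copy, a Tok is passed in registers; nothing lives on the heap.

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Tok(u64);

impl Tok {
    pub fn new(kind: u8, offset: u32, len: u32) -> Tok {
        debug_assert!(len < (1 << 24));
        Tok(((kind as u64) << 56) | ((len as u64) << 32) | offset as u64)
    }
    pub fn kind(self) -> u8 { (self.0 >> 56) as u8 }
    pub fn len(self) -> u32 { ((self.0 >> 32) & 0xff_ffff) as u32 }
    pub fn offset(self) -> u32 { self.0 as u32 }
    // Slice the token's text straight out of the read-only source buffer.
    pub fn text(self, src: &str) -> &str {
        let start = self.offset() as usize;
        &src[start..start + self.len() as usize]
    }
}

// Given a token, scan the source for the one after it, so a second pass
// over the input needs no stored token array and no allocations.
// Toy rule: a token is any run of non-whitespace bytes, kind 0 throughout.
pub fn next_token(src: &str, t: Tok) -> Option<Tok> {
    let bytes = src.as_bytes();
    let mut i = (t.offset() + t.len()) as usize;
    while i < bytes.len() && bytes[i].is_ascii_whitespace() { i += 1; }
    if i >= bytes.len() { return None; }
    let start = i;
    while i < bytes.len() && !bytes[i].is_ascii_whitespace() { i += 1; }
    Some(Tok::new(0, start as u32, (i - start) as u32))
}

fn main() {
    let src = "select *  from t";
    let first = Tok::new(0, 0, 6);
    assert_eq!(first.text(src), "select");
    let second = next_token(src, first).unwrap();
    assert_eq!(second.text(src), "*");
    let third = next_token(src, second).unwrap();
    assert_eq!(third.text(src), "from");
    println!("ok");
}
```

A real tokenizer would classify kinds in the state machine instead of the toy whitespace rule, but the packing and the allocation-free second pass are the point.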

Sometimes future steps will be reordering things and transforming the AST etc; in those cases, I generally have a writeable arena where I append the text of new tokens, and a bit in the token ints discriminates between the read-only source buffer and this transformed buffer. This is all particularly cool when it comes to generating sensible error messages with context, a finesse most handmade-parser authors later rue having overlooked :)

I would be interested to know just how unmainstream this kind of approach is? Please weigh in, would love to learn new tricks :)

mightee_pirate · a year ago
I am very interested in parsers and in optimizing them. Currently I have an HTML parser written in Go. I have used Go's html parser but it is waaay slower than mine. Here's what I do:

- stream the content (either from a file or from the network) and send chunks to the tokenizer

- tokenizer is dead simple with a state machine

- when an HTML tag is found, emit the token with positions (the problem is that someone needs to hold the whole incoming buffer; currently it's copying the incoming bytes, but I plan to hold the bytes until a token is found: after emitting the token with the buffer (copying), the inner buffer will be freed)

- parser does its thing and creates a DOM

As you can see, I am not well versed in optimizing to the last bit, but I am very interested in diving deep into the optimizations.

Currently I am looking at zero-copy networking. There is supposedly a library for Go (fasthttp) that would not copy bytes from the networking interface, but I haven't tried it yet.

What kind of algorithms would you recommend for tokenizing & parsing HTML/XML very efficiently?

willvarfar · a year ago
There are so many 'it depends' here, based on your use-cases and perhaps details specific to go optimisation that I am unaware of.

The first step, tokenization, might well be very Go-specific around zero-copy networking and things, sorry. The general idea, though, would be to use a nice little finite-state machine to eat characters and recognise when the tags open and close, the name of the tag, the attributes and values, the comments and the body etc. And if you can avoid allocating any kind of object on the heap at this stage, it'll almost always be a big win.
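As a rough illustration of the shape (deliberately not a real HTML tokenizer: no attributes, entities, or error recovery), a byte-eating state machine can emit tag and text spans as (offset, len) pairs into the caller's buffer, allocating nothing per token:

```rust
// Toy finite-state machine: walks bytes once and reports each tag name and
// text run as a span into the source, so the caller decides whether to copy.

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Piece { TagOpen, TagClose, Text }

pub fn tokenize(src: &str, mut emit: impl FnMut(Piece, usize, usize)) {
    enum State { Text, TagName, InTag }
    let b = src.as_bytes();
    let (mut state, mut start, mut closing) = (State::Text, 0usize, false);
    for i in 0..b.len() {
        match state {
            State::Text => if b[i] == b'<' {
                if i > start { emit(Piece::Text, start, i - start); }
                state = State::TagName;
                closing = false;
                start = i + 1;
            },
            State::TagName => match b[i] {
                b'/' if i == start => { closing = true; start = i + 1; }
                b'>' | b' ' => {
                    let kind = if closing { Piece::TagClose } else { Piece::TagOpen };
                    emit(kind, start, i - start);
                    if b[i] == b'>' { state = State::Text; start = i + 1; }
                    else { state = State::InTag; } // skip attributes
                }
                _ => {}
            },
            State::InTag => if b[i] == b'>' { state = State::Text; start = i + 1; },
        }
    }
    // flush trailing text, if any
    if let State::Text = state {
        if start < b.len() { emit(Piece::Text, start, b.len() - start); }
    }
}

fn main() {
    let src = "<p>hi</p>";
    let mut out = Vec::new();
    tokenize(src, |k, off, len| out.push((k, &src[off..off + len])));
    assert_eq!(out, vec![(Piece::TagOpen, "p"), (Piece::Text, "hi"),
                         (Piece::TagClose, "p")]);
    println!("ok");
}
```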

But you need to then create a DOM, and you're going to need memory allocation for that.

But the arena approach in the article is good for this: use a big byte array as an arena, and write a cleaned-up, normalized copy of the parsed HTML into it, with length-prefixes and distance-to-next-tag etc baked in. Typically a DOM has nodes with parent, previous-sibling and next-sibling pointers. You'll probably still want these, but they can be offsets written into the byte buffer in between fragments of the parsed HTML, rather than maintained as a separate structure.
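A small sketch of that layout (the node format here is invented for illustration): nodes are appended to one byte buffer, and the parent/sibling "pointers" are u32 offsets into that same buffer, so the whole tree is one relocatable blob.

```rust
// Each node: [parent: u32][next_sibling: u32][name_len: u32][name bytes],
// all little-endian, packed end to end in a single Vec<u8> arena.

const NONE: u32 = u32::MAX;

fn put_u32(arena: &mut Vec<u8>, v: u32) { arena.extend_from_slice(&v.to_le_bytes()); }

fn get_u32(arena: &[u8], at: usize) -> u32 {
    let b = [arena[at], arena[at + 1], arena[at + 2], arena[at + 3]];
    u32::from_le_bytes(b)
}

// Append a node and return its offset, which acts as its "pointer".
fn push_node(arena: &mut Vec<u8>, parent: u32, name: &str) -> u32 {
    let off = arena.len() as u32;
    put_u32(arena, parent);
    put_u32(arena, NONE); // next sibling, patched in later
    put_u32(arena, name.len() as u32);
    arena.extend_from_slice(name.as_bytes());
    off
}

fn name_of(arena: &[u8], node: u32) -> &str {
    let n = node as usize;
    let len = get_u32(arena, n + 8) as usize;
    std::str::from_utf8(&arena[n + 12..n + 12 + len]).unwrap()
}

fn set_next_sibling(arena: &mut Vec<u8>, node: u32, sib: u32) {
    let n = node as usize + 4;
    arena[n..n + 4].copy_from_slice(&sib.to_le_bytes());
}

fn main() {
    let mut arena = Vec::new();
    let html = push_node(&mut arena, NONE, "html");
    let head = push_node(&mut arena, html, "head");
    let body = push_node(&mut arena, html, "body");
    set_next_sibling(&mut arena, head, body);
    // follow an offset "pointer" instead of a machine pointer
    let sib = get_u32(&arena, head as usize + 4);
    assert_eq!(name_of(&arena, sib), "body");
    assert_eq!(get_u32(&arena, body as usize), html); // parent link
    println!("ok");
}
```

Because everything is offsets, the buffer can be dumped to disk and mapped back in at any address with no pointer fixups.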

Then, if the user of your DOM can modify it, you can have a node navigation API that seamlessly stores rewritten DOM parts in a new byte arena or appended on the end of the original arena.

neonsunset · a year ago
It may seem unexpected given all the hype around Go, but it is a surprisingly poor choice for this. If you want a more convenient language than C++ or Rust but retain the ability to reach optimal performance, C# could serve you much better.

Go underperforms at trivial XML parsing: https://news.ycombinator.com/item?id=40283721 (and the parent comment)

If you push it, C# can extract optimal HW utilization when parsing text, beating C++: https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-... (Go was not on the list because it was that much slower)

kragen · a year ago
you may want to add blank lines between your bullets
PythagoRascal · a year ago
I would be very interested in a more detailed write-up, if you have the time (or have one already).
kragen · a year ago
this sounds awesome, can i see?
kragen · a year ago
probably worth noting that the default peg backend for the c combinator parser 'hammer' uses an arena allocator for speed and because it's a good match for packrat memoization. there are hammer bindings for several languages including python and java, not sure if there's a rust one yet. hammer also has some other backends including lalr, and in some cases you can write a grammar such that you can parse it with either packrat or lalr

hammer's arena allocator is not very fast, though. it's faster than any malloc i've seen, but a good arena allocator is faster than calling an empty function, so you have to inline the fast path. the open-source prototype arena allocator i wrote in http://canonical.org/~kragen/sw/dev3/kmregion.h (and .c) doesn't quite reach that level, but it takes about 1.9 nanoseconds per allocation (520 million allocations per second on one core), while glibc's malloc takes about 19 nanoseconds per allocation. more recent versions of glibc (on a faster cpu) have reduced that to 14 nanoseconds in that particular microbenchmark. details are at the link. your mileage may vary
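a generic sketch of why the inlined fast path can be so cheap (this is the standard bump-allocator shape, not kmregion's actual code): allocation is an align, a bump, and one bounds check.

```rust
// Minimal bump arena: the hot path is a couple of arithmetic ops plus one
// comparison, which is why it must be inlined to beat an empty function call.

pub struct Arena {
    buf: Vec<u8>,
    next: usize,
}

impl Arena {
    pub fn with_capacity(cap: usize) -> Arena {
        Arena { buf: vec![0u8; cap], next: 0 }
    }

    // Fast path: round up for alignment, bump, bounds-check. A real arena
    // would fall back to grabbing a fresh block on None instead of failing.
    #[inline]
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() { return None; }
        self.next = end;
        Some(&mut self.buf[start..end])
    }

    // Freeing is O(1): forget everything at once.
    pub fn reset(&mut self) { self.next = 0; }
}

fn main() {
    let mut a = Arena::with_capacity(1024);
    let first = a.alloc(10, 8).unwrap();
    first.fill(0xAB);
    a.alloc(3, 8).unwrap();
    assert_eq!(a.next, 16 + 3); // 10 rounded up to 16 for alignment, plus 3
    a.reset();
    assert_eq!(a.next, 0);
    println!("ok");
}
```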

chris wellons wrote an excellent article last year about his approach to arena allocation in c in https://nullprogram.com/blog/2023/09/27/

for those who aren't aware, garbage collectors for languages like js, c#, ocaml, or java are usually generational copying collectors for efficiency, which means they use what is effectively an arena allocator for most new allocations. often they reserve a cpu register or two for this, which my c implementation above can't

for c, the apache portable runtime 'apr' has a generic arena allocator in it called pools, which also supports destructors (the drop trait, in rust) and nested pools

the arena allocator in gcc is called 'obstacks' and got moved into libiberty so you can use it in other c programs. or hypothetically in rust

thequux · a year ago
Hammer's arena allocator is, in fact, hot garbage. The entire reason it's there in the first place is that I didn't want to get into the weeds integrating a GC when building a proof of concept that PEGs could work at runtime in C with replaceable backends, so I didn't bother tuning it much beyond "4k sounds like a good number".

To me, the valuable part of hammer was the idea that you could use a friendly PEG-like API and end up with an LALR or GLR parser, and that, I believe, was a resounding success.

On the other hand, the generated AST turned out to be even less pleasant to work with than I imagined, and the only part of the implementation that's worth salvaging is the bit reader and writer routines.

kragen · a year ago
haha! well, you're being harsher on your own work than i think it deserves

hammer works pretty well, i think, even the ast generation part. judicious application of semantic actions can give you whatever ast you want, if you're willing to stick to the peg backend anyway

i'm curious what you think about the prototype arena allocator i linked above; i wrote it originally as a candidate replacement for hammer's

fanf2 · a year ago
obstacks are in glibc so they are very widely available https://sourceware.org/glibc/manual/latest/html_node/Obstack...
kragen · a year ago
oh thanks, i didn't realize
asplake · a year ago
> To release memory we simply release the blob itself, so it's like a batched deallocation, sort of. The consequence here is that all of our types must be:

> 1. trivially copyable (: Copy if we speak in Rust)

> 2. trivially destructible (i.e. they can't have impl Drop)

2 (Drop) seems obvious enough, but why 1 (Copy)? There’s a bit of Rust knowledge implicit there that I’m unaware of, and I might not be alone in that. Could someone explain?

GrantMoyer · a year ago
I think it's a mistake. Allocator Implementation[1] shows values being moved into the arena, which can be done without Copy or even Clone. Maybe the author got confused, since moving an object in Rust just copies the bytes and invalidates the moved-from binding.

[1]: https://iliabylich.github.io/arena-based-parsers/allocator_i...

LegionMammal978 · a year ago
I'm pretty sure being trivially destructible is sufficient for this use case. The only reason you might want a Copy bound when you're not actually copying values is if you're trying to disallow interior mutability, but that doesn't seem to be the case here. Perhaps the author was misinformed.

Or maybe they're just using Copy as a type-level bound that guarantees a trivial destructor, even if it's more restrictive than strictly necessary.
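To make that concrete, here's a hypothetical allocator signature (names invented for illustration) using Copy purely as a stable-Rust stand-in for "has no destructor", since Copy and Drop are mutually exclusive:

```rust
use std::mem::needs_drop;

struct Arena { buf: Vec<u8> }

impl Arena {
    // The Copy bound is stricter than strictly necessary, but it is
    // expressible on stable Rust, unlike a `T: !Drop` negative bound.
    fn alloc<T: Copy>(&mut self, value: T) -> usize {
        // Copy types can never need dropping, so a batched arena free
        // (just releasing the blob) is safe for anything stored here.
        debug_assert!(!needs_drop::<T>());
        let off = self.buf.len();
        let bytes = unsafe {
            std::slice::from_raw_parts(&value as *const T as *const u8,
                                       std::mem::size_of::<T>())
        };
        self.buf.extend_from_slice(bytes);
        off
    }
}

fn main() {
    let mut a = Arena { buf: Vec::new() };
    let off = a.alloc(0xDEAD_BEEFu32);
    assert_eq!(off, 0);
    assert_eq!(a.buf.len(), 4);
    // a.alloc(String::from("no")); // rejected at compile time: String is not Copy
    println!("ok");
}
```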

MindSpunk · a year ago
You can't have a negative trait bound to say "I take anything that _doesn't_ implement Drop". It's not a stabilized feature in Rust because it would allow adding a trait to a struct to become a breaking API change where it currently is not. Copy guarantees !Drop so it's currently the only way to do it without opting into nightly features.

You can remove the requirement for trivially destructible types by maintaining a singly linked list of (ptr to alloc'd object, destructorFn) pairs with the links allocated from the arena itself. The arena can just walk the list and call all the drop functions when it gets reset. You can even specialize for types with trivial destructors so they don't append to the list and so you only pay the extra cost if you allocate Drop types (using std::mem::needs_drop).
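A sketch of that destructor-list pattern (simplified: the list here is a plain Vec rather than links threaded through the arena itself, and the arena's storage is stubbed out):

```rust
use std::mem::needs_drop;

// The arena records a (pointer, drop-fn) pair for each allocation of a type
// that actually needs dropping; std::mem::needs_drop makes the check free at
// runtime for trivially destructible types.

struct DropList {
    pending: Vec<(*mut u8, unsafe fn(*mut u8))>,
}

// Type-erased destructor thunk.
unsafe fn drop_in_place_erased<T>(p: *mut u8) {
    unsafe { std::ptr::drop_in_place(p as *mut T) }
}

impl DropList {
    fn new() -> Self { DropList { pending: Vec::new() } }

    // Called after the arena hands out storage for a T.
    fn register<T>(&mut self, p: *mut T) {
        if needs_drop::<T>() { // compile-time constant; no-op for u64 etc.
            let f: unsafe fn(*mut u8) = drop_in_place_erased::<T>;
            self.pending.push((p as *mut u8, f));
        }
    }

    // Walked once when the arena is reset or freed, newest first.
    unsafe fn run(&mut self) {
        for (p, f) in self.pending.drain(..).rev() {
            unsafe { f(p) };
        }
    }
}

fn main() {
    assert!(needs_drop::<String>());  // owns a heap buffer: must register
    assert!(!needs_drop::<u64>());    // trivially destructible: skipped

    let mut list = DropList::new();
    // pretend this slot was handed out by the arena
    let mut slot = std::mem::MaybeUninit::<String>::uninit();
    let p = slot.write(String::from("hello")) as *mut String;
    list.register::<String>(p);
    let mut n = 7u64;
    list.register::<u64>(&mut n);     // does not grow the list
    assert_eq!(list.pending.len(), 1);
    unsafe { list.run(); }            // drops the String exactly once
    println!("ok");
}
```

In a real arena the pairs would be bump-allocated from the arena too, so registering a Drop type costs one extra allocation from the same cheap path.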

mmastrac · a year ago
Trivially copyable and destructible means that they are plain-old data, with no allocations, associated resources or any other side-effects.
LegionMammal978 · a year ago
Trivial destructibility alone is generally sufficient for that in Rust. Recall that all values (outside a Pin) are trivially moveable, so pretty much the only reason you'd want trivial copying is if you actually need more than one valid copy at once. (One exception is using Copy to express a poor man's Freeze bound, which is quite limited, and dubious in any case.)
Cloudef · a year ago
Arena allocation is trivial in Zig.
himujjal · a year ago
This. Writing a parser in Zig is so simple. Just allocate once, and then start writing your parser.

One allocator for parser, one for scanner. One for type allocation. Keep them all for semantic analysis. Write the output renderer (binary/language). Deallocate.

In this whole process, it makes it so easy to not think about memory anymore. Just enjoy writing your program.

JonChesterfield · a year ago
The scanner can probably be zero allocation though - zig should be able to express an iterator over the input byte array that returns a token on dereference without needing more state than the position in the array
lylejantzi3rd · a year ago
Is there a detailed blog post or a book or something you can share that shows how easy it is to write a parser in zig?
olliej · a year ago
Parsing is fun, but assuming a good baseline allocator, arena allocation doesn't get you a whole lot these days (maybe it's better in a GC environment, where teardown doesn't require a tree traversal?), especially if you're able to lazily build the AST (this is what I did in JSC years ago).

I still feel that there's a lot of practical parsing stuff that isn't considered in academia, which leads to a near-constant preference for LALR and LR parser generators, despite them being honestly worse than recursive descent in almost every important metric if you want to do anything at all other than producing a completely generic AST for known-good data.

5kg · a year ago
Somewhat related talk: Modernizing Compiler Design for Carbon Toolchain - CppNow 2023 (https://www.youtube.com/watch?v=ZI198eFghJk)
nuudlman · a year ago
This talk was enormously educational when I first dove into LLVM and modern compilers last January—Chandler is a great speaker.
JonChesterfield · a year ago
This is a good idea. The tree constructed by a parser is relatively likely to want similar lifetimes for most of the nodes.

If the nodes are mutable you can link them together as you go, meaning some amount of overhead in the structure for links that aren't in use, but no garbage left behind.

Storing pointers as distance from the start of a contiguous arena means a structure you can mmap later without relocating.

All good stuff.

a_t48 · a year ago
protobuf for C++ has this feature - https://protobuf.dev/reference/cpp/arenas/ (not so much parsing but (de)serializing)