Instead of an interpreter, if one has less of a "must be a full-featured programming language" mentality and a fast compiler like Go or Nim [1] (or is willing to wait for slow optimizing compiles when the result will run against big data sets), then an end-to-end simpler design for "one-liners" (or similarly simple programs) is the whole-program generator. Maybe those are big IFs, but maybe not.
To back up my simplicity claim, consider rp [2] -- like 60 non-comment/import/signature lines of code for the generator. Generated programs are even smaller. But, you can deploy gcc or clang or whatever against them and make fast libraries in the host language.
Why, if you are willing to write those little generation command options in C99, then you can compile the harness with tcc [3] in about 1 millisecond, which is faster than most interpreter start-up times (bytecode or otherwise), and still link against gcc -O3 (or whatever) helper libraries.
Anyway, I only write this because in my experience few people realize how much development cost they buy into when they insist on a full-featured programming language, not to criticize Ben's work. You also make users learn the quirks of a new language instead of the quirks of a "harness", which may be fewer or easier. (EDIT: Awk is fairly well established, of course.)

[1] https://forum.nim-lang.org/

[2] https://github.com/c-blake/cligen/blob/master/examples/rp.ni...

[3] https://repo.or.cz/w/tinycc.git
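To make "whole program generator" concrete, here is a toy sketch in Go (hypothetical and much cruder than rp, which is Nim): splice the user's one-liner into a fixed per-line harness, write it out, and hand the result to the compiler.

    // gen.go: a toy whole-program generator (invented, not rp).
    // It wraps a user-supplied Go statement in a per-line harness,
    // writes main.go, and shells out to `go build`.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    const harness = `package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            fields := strings.Fields(sc.Text())
            _ = fields // keep "fields" used even if the one-liner ignores it
            %s
        }
        _ = fmt.Sprint // keep "fmt" used likewise
    }
    `

    func main() {
        src := fmt.Sprintf(harness, os.Args[1]) // user's one-liner goes in %s
        if err := os.WriteFile("main.go", []byte(src), 0o644); err != nil {
            panic(err)
        }
        out, err := exec.Command("go", "build", "-o", "oneliner", "main.go").CombinedOutput()
        if err != nil {
            panic(string(out))
        }
    }

Run it as `go run gen.go 'fmt.Println(len(fields))'` and the resulting ./oneliner binary filters stdin with no interpreter in sight.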
Unfortunately, yes. My hope was that it would be compact/small enough to be "quasi-self-documenting" to the likes of "compiler writer types". Probably not to ordinary users. (The extra asterisks are to get nice ANSI SGR escape highlighting and/or rST markup out of the auto-documentation system.) {EDIT2: Also, I would be happy to add some rp.README if you want to contribute one. You seem like a great explainer.}
FWIW, I think of this kind of generation as part of the Go mentality more generally, but I am not a Go user nor in that community. So maybe that is speaking out of turn.
I also have a C version of this that I call `crp` I could provide if anyone wants (and yes, short for "crap"). C is a kind of higher-ceremony language for such things { EDIT1: but even more established than awk... :-) }
I should have said: those help strings at the very bottom of the file are also part of the "documentation", to the extent the variable names in the proc signature are not self-explanatory.
The problem with that approach is that it tends to create the false belief, among developers inexperienced with compiler development, that C is a must-have for writing compilers.
Then one needs to go around explaining that, no, any programming language could have been chosen, and it was just a matter of convenience that C was picked.
This is fair, but an argument in favor of the approach is that it is so simple it really could be done with any backend language {EDIT: by a 1st-year CS student in an afternoon, um, maybe.. ;-)}: Go, Pascal, Ada, even slow-running ones like Python/Perl/etc. or slow-compiling ones like Rust. So, were the design more common, people might not be so prone to make that assumption. OTOH, getting people to not make wrong assumptions is an eternal challenge. :-)
My own view is that awk was built more or less for these one-liner/simple purposes, but by people like Aho for whom full-featured languages are barely more thinking than my 60 lines (in either writing the parser or using it). :-)
The binary search in a case statement is annoying for stack machines like this. How would implementing your own jump table as a hashmap to functions perform here vs a case statement? I'm guessing the function calls make it a lot slower as you mention in the comparison with the tree walking.
Yeah, I discuss something similar in the article, except I use an array of functions (search for "array of functions"), which is significantly faster than hashing and a hash table lookup. It was very slightly faster than switch, but I decided not to use it:
> This only gave me a 1-2% speed increase on GoAWK’s microbenchmarks (see results and code). In the end I decided I’d rather stick with the simpler switch code and find other ways to improve the speed. And when the Go compiler supports jump tables for switch, I’ll get a 10% improvement by doing nothing!
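For the curious, the array-of-functions version looks roughly like this (a simplified, self-contained sketch with invented names, not GoAWK's actual code):

    // Sketch of array-of-functions dispatch for a tiny stack VM.
    package main

    import "fmt"

    type vm struct {
        stack []float64
        code  []byte
        pc    int
    }

    type opFunc func(*vm)

    const (
        opNum byte = iota // push literal (operand: next code byte)
        opAdd
        opPrint
        opHalt
    )

    var dispatch = [...]opFunc{
        opNum: func(v *vm) {
            v.stack = append(v.stack, float64(v.code[v.pc]))
            v.pc++
        },
        opAdd: func(v *vm) {
            n := len(v.stack)
            v.stack[n-2] += v.stack[n-1]
            v.stack = v.stack[:n-1]
        },
        opPrint: func(v *vm) {
            fmt.Println(v.stack[len(v.stack)-1])
        },
        opHalt: nil, // handled in the loop
    }

    func (v *vm) run() {
        for {
            op := v.code[v.pc]
            v.pc++
            if op == opHalt {
                return
            }
            dispatch[op](v) // one indexed load + indirect call per instruction
        }
    }

    func main() {
        // Equivalent in spirit to: BEGIN { print 2+3 }
        v := &vm{code: []byte{opNum, 2, opNum, 3, opAdd, opPrint, opHalt}}
        v.run() // prints 5
    }

The point is that dispatch[op](v) is an indexed load plus an indirect call, with no binary search over opcode values.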
I wrote a niche grep library (in C++) that uses the bytecode approach. I'm not a Go programmer, and I'm gobsmacked at how switch works there: the jump-table switch in C/C++ is a huge part of the benefit of a bytecode VM. Also, my guess is that you may see a disproportionate performance increase (>18%) between the AST interpreter and the bytecode backend with larger AWK programs, due to the latter's better cache characteristics.
YMMV with computed goto. It was an important technique, but on modern processors with their crazy branch predictors it can yield disappointing results today. It didn't help my VM.
I would -not- write your own regex library. Doesn’t RE2 have a Go port? Or maybe something else? It’s a world of pain.
Given AWK’s usage as a string processor, I’d sell out hard for string processing. Maybe an opcode specifically for “hardcoded string op”, with a secondary operand denoting the string op, which you dispatch through an array? Then just load that up with beefy functions and the VM is used to glue together the string operations.
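Concretely, something like this hypothetical Go sketch (names invented):

    // One CallStr opcode; the secondary operand picks the operation
    // from a table of "beefy" string functions.
    package main

    import (
        "fmt"
        "strings"
    )

    // Secondary operands for the hypothetical CallStr opcode.
    const (
        strUpper = iota
        strLower
        strTrim
    )

    var strOps = [...]func(string) string{
        strUpper: strings.ToUpper,
        strLower: strings.ToLower,
        strTrim:  strings.TrimSpace,
    }

    func main() {
        // In the VM loop this would be: case opCallStr: push(strOps[operand](pop()))
        s := "  hello  "
        fmt.Printf("%q\n", strOps[strTrim](s))  // "hello"
        fmt.Printf("%q\n", strOps[strUpper](s)) // "  HELLO  "
    }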
Also, not sure how possible this is in Go, but if it allows for unions, packed structs, and bitfields, you can get some benefit from having 32 bit instructions where the opcodes are a byte and then have the other three bytes available for operands, and additionally allowing for some instructions to be multiword (so a jump could have a second word following that’s the target address, for a full 32 bits). Condensing your byte code without incurring other costs (bit access, compression) is key to improving your L1 cache rate. A cache line can then fit 16 instructions.
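Go has no unions, packed structs, or bitfields, but the same layout falls out of a uint32 plus shifts and masks; a hypothetical sketch:

    // 32-bit instruction word: 1 opcode byte + three 8-bit operands,
    // with multiword instructions for full 32-bit values.
    package main

    import "fmt"

    type instr uint32

    func pack(op, a, b, c byte) instr {
        return instr(op) | instr(a)<<8 | instr(b)<<16 | instr(c)<<24
    }

    func (i instr) op() byte { return byte(i) }
    func (i instr) a() byte  { return byte(i >> 8) }
    func (i instr) b() byte  { return byte(i >> 16) }
    func (i instr) c() byte  { return byte(i >> 24) }

    const opJump = 7 // invented opcode number

    func main() {
        code := []instr{
            pack(opJump, 0, 0, 0),
            42, // multiword: the next word is the full 32-bit jump target
        }
        if code[0].op() == opJump {
            fmt.Println("jump to", uint32(code[1]))
        }
    }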
Preallocating as much memory as possible is helpful, too. Even if there are necessary conditionals for growing arrays and whatnot, they’re likely to be false and thus handled by branch prediction with no cost.
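A trivial Go illustration of the idea (the capacity is a made-up number, not GoAWK's):

    package main

    import "fmt"

    func main() {
        // Preallocate the stack: the grow branch inside append is then
        // almost always false and essentially free after branch prediction.
        stack := make([]float64, 0, 4096) // 4096 is a guess
        for i := 0; i < 10; i++ {
            stack = append(stack, float64(i)) // never reallocates here
        }
        fmt.Println(len(stack), cap(stack))
    }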
I was looking at this earlier and noticed the tree walking interpreter. Nice work with the upgrade!
For folks not familiar: basically every major interpreter today compiles to bytecode and runs on a bytecode VM. People will squabble about the term "compiler", saying it only applies to generating binary files directly, but I still think it's fun that almost every major interpreter is a compiler in the general sense.
The only major interpreters that do tree walking AFAIK are shells like bash.
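A toy illustration of what "compiles to bytecode" means (hypothetical Go, not modeled on any particular interpreter): flatten the AST for 2+3 into postfix stack-machine code.

    package main

    import "fmt"

    const (
        opNum byte = iota // operand: literal in the next byte
        opAdd
    )

    type node struct {
        val         byte  // used when this is a leaf
        left, right *node // nil for leaves
    }

    func compile(n *node, code []byte) []byte {
        if n.left == nil { // leaf: emit a push
            return append(code, opNum, n.val)
        }
        code = compile(n.left, code)
        code = compile(n.right, code)
        return append(code, opAdd) // postorder traversal = postfix = stack code
    }

    func main() {
        ast := &node{left: &node{val: 2}, right: &node{val: 3}} // 2 + 3
        fmt.Println(compile(ast, nil))                          // [0 2 0 3 1]
    }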
Awesome blog post Ben! Question: how did you come up with your list of opcodes?
I ask because, while some opcodes are obviously needed, others are not so obvious, and coming up with a balanced instruction set is a somewhat difficult design problem.
Good question. A bit of experience, some guesswork, and a lot of testing and benchmarking along the way. I separated out global and local opcodes originally because I wanted individual variable accesses to be fast (why do at runtime what you can do at compile time?). As I mention in the article, originally I had all the builtin functions as separate opcodes, but that was slower due to the sheer number of opcodes and Go's current binary-tree approach to "switch". I also used Go's profiling tools a bunch to see where the hotspots were. I'm sure there's significant room for improvement, but it's hard, because sometimes when you improve one benchmark, something else suffers.
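To make the builtin trade-off concrete, here's a simplified sketch (invented names and opcode numbers, not GoAWK's actual instruction set) of the two designs:

    package main

    import (
        "fmt"
        "math"
    )

    // Design A: every builtin gets its own opcode, so the main
    // interpreter switch grows by dozens of cases.
    const (
        opSqrtA byte = iota
        opSinA
        opLengthA
        // ...and so on, one switch case each
    )

    // Design B: a single CallBuiltin opcode; the next code byte
    // picks the builtin from a table, keeping the switch small.
    const opCallBuiltin byte = 100 // invented opcode number

    const (
        builtinSqrt byte = iota
        builtinSin
    )

    var builtins = [...]func(float64) float64{
        builtinSqrt: math.Sqrt,
        builtinSin:  math.Sin,
    }

    func main() {
        code := []byte{opCallBuiltin, builtinSqrt}
        switch code[0] {
        case opCallBuiltin: // one case, table lookup inside
            fmt.Println(builtins[code[1]](2.0)) // 1.4142135623730951
        }
    }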
Is it possible to write an interpreter for this bytecode language using RPython, so as to get a JIT speedup?
I think if you write an interpreter in RPython, and annotate it with some directives, you can get a JIT compiler for free. But it wouldn't be "GoAWK" anymore!

https://rpython.readthedocs.io/en/latest/examples.html
Looks like much of the effort is working around basic Go language limitations. Coding the virtual machine interpreter in Go seems like a mistake.
Compiling straight to Go, and then compiling that the normal way, seems like it would do better, and be simpler. Assuming you really wanted to keep it all in Go.
Go isn't the most efficient language for these sorts of things, but it's also not bad (and Go itself is not really the bottleneck for real-world AWK scripts). That said, GoAWK is useful mainly when you want to use AWK scripts in a Go program (the Benthos stream processor does this: https://www.benthos.dev/docs/components/processors/awk/).
I also did a project recently, creatively called AWKGo, that compiles (a subset of) AWK scripts to Go source code: https://benhoyt.com/writings/awkgo/
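For a flavor of the output, an AWK program like { sum += $1 } END { print sum } might come out roughly like this (a hand-written illustration in the same spirit, not actual AWKGo output):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    func main() {
        var sum float64
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            fields := strings.Fields(sc.Text())
            if len(fields) > 0 {
                f, _ := strconv.ParseFloat(fields[0], 64) // AWK-style lax numeric coercion
                sum += f
            }
        }
        fmt.Println(sum) // END block
    }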