Brian Kernighan sent Gawk maintainer Arnold Robbins an email linking to this blog post with the comment "Hindsight has a lot of benefits, it would appear."
Peter Weinberger (quoted with permission) responded:
> That's interesting, Here's some thoughts/recollections. (remember that human memory is fallible.)
> 1. Using whitespace for string concatenation, in retrospect, was probably not the ideal choice (but '+' would not have worked).
> 2. Syntax choices were in part driven by the desire for our local C programmers to find it familiar.
> 3. As creatures of a specific time and place awk shared with C the (then endearing, now irritating) property of being underspecified.
> I think that collectively we understood YACC reasonably well. We tortured the grammar until the parser came close to doing what we wanted, and then we stopped. The tools then were more primitive, but they did fit in 64K of memory.
Al Aho also replied (quoted with permission):
> Peter's observation about torturing the grammar is apt! As awk grew in its early years, the grammar evolved with it and I remember agonizing to make changes to the grammar to keep it under control (understanding and minimizing the number of yacc-generated parsing-action conflicts) as awk evolved. I found yacc's ability to point out parsing-action conflicts very helpful during awk's development. Good grammar design was very much an art in those days (maybe even today).
It's fun to hear the perspectives of the original AWK creators. I've had some correspondence with Kernighan and Weinberger before, but I think that's the first time I've been on an email thread with all three of A, W, and K.
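Weinberger's first point is easy to see in code: awk's string concatenation is bare juxtaposition, with no operator at all. A minimal sketch (toy programs, nothing assumed beyond a standard awk):

    # concatenation is just juxtaposition: this prints "foobar"
    awk 'BEGIN { s = "foo" "bar"; print s }'

    # the downside for the grammar: on paper,  $1 -$2  could be "field 1 minus
    # field 2" or "field 1 concatenated with negative field 2"; awk's grammar
    # quietly resolves it as subtraction, which is the kind of wrinkle an
    # explicit concatenation operator would avoid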
"The tools were more primitive, but they did fit in 64k of memory."
I will take "primitive" over present-day bloat and complexity every time, quirks and all.
That programs fitting in 64K of memory have remained in continuous use and the subject of imitation for so long must be a point of pride for the authors. From what I have seen, contemporary software authors are unlikely to ever achieve such longevity.
I think it casts a pretty harsh light on criticisms of awk.
Ultimately awk is one of the all time great languages. Small. Good at what it does.
There’s something satisfying about using it which languages like Python just don’t give you. It’s a little bit of Unix wizardry.
Awk is something that I think every programmer and especially every sysadmin should learn. I like the comparison at the end and have never heard of nnawk or bbawk before.
I recently made a dashboard to compare the output of four versions of awk side by side, since not all awk scripts will run the same on each version: https://megamansec.github.io/awk-compare/ I'll have to add those :)
It runs an action for each line in the input (optionally filtered by regex). You get automatic variables $1, $2, ... for the words in the line, split by spaces.
The syntax is almost like a simple subset of JavaScript. Built-in functions are similar to the C standard library.
If you have text input that is separated into columns by a delimiter, and you want to do simple operations on that (filter, map, aggregate), it can be done quickly with awk. That's all you need to know about awk.
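A few throwaway one-liners showing that filter/map/aggregate pattern (the file name and column layout are invented for the example):

    # filter: print only lines whose 3rd field is greater than 100
    awk '$3 > 100' data.txt

    # map: pick and reorder columns
    awk '{ print $2, $1 }' data.txt

    # aggregate: sum the 3rd field, grouped by the 1st
    awk '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }' data.txt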
> Awk is something that I think every programmer and especially every sysadmin should learn
I'd argue that it should be every programmer who doesn't already know a scripting language like Ruby or Python. If you already know a scripting language, chances are the time saved between writing an Awk one-liner and quickly banging out a script in your preferred language is negligible. And if you ever need to revisit it and maybe expand it, it'll be much easier to do in your preferred scripting language than in Awk, especially the more complex it gets.
I'm speaking from experience on this last point... At my work I wrote a very simple file transformer (move this column to here, only show lines where this other column is greater than X, etc etc) in Awk many years ago. It was quick and easy and did what it needed to. It was a little too big to be reasonable as a one-liner, though not by very much at all. But as we changed and expanded what it needed to do, it ended up getting to be a few thousand lines of Awk, and that was a nightmare. One day I got so fed up with it that I rewrote it all in Ruby in my free time and that's how it's been ever since, and it's soooo much better that way. Could have saved myself a lot of trouble if it were that way from the beginning, but I had no idea at that time it would grow beyond the practically-a-one-liner size, so I thought Awk would be a great choice.
> every programmer and especially every sysadmin should learn
There are lots of things "every <tech position> should learn", usually by people who already did so. I still have a bunch of AI/ML items on that list too.
What's the advantage of learning AWK over Perl?
Getting awk in your head (fully) takes about an afternoon: reading the (small and exhaustive) man page, going through a few examples, trying to build a few toys with it. Perl requires much, much more effort.
Great gain/investment ratio.
I reach for awk when my bash scripts get a bit messy; perl is/was for when I want to build a small application (or nowadays python).
But both perl and python require cpan/pip to get the most out of them, whereas with awk I just need awk.
- Awk is on more systems than Perl
- Awk has more implementations than Perl
I put off learning awk for literal decades because I knew perl, but then I picked it up and wish I had done so earlier. I still prefer perl for a lot of use cases, but in one-liners, awk's syntax makes working with specific fields a lot more convenient than perl's autosplit mode. `$1` instead of `$F[0]`, basically.
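For the field-convenience point, a tiny example (the log file name and the pattern are made up):

    # print the first whitespace-separated field of matching lines -- the $1
    # mentioned above; with perl -lane the same field would be $F[0]
    awk '/error/ { print $1 }' app.log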
I think this is a good illustration of why parser-generator middleware like yacc is fundamentally misguided: it creates totally unnecessary gaps between design intent and the action of the parser. In a hand-rolled recursive descent parser, or even a set of PEG productions, ambiguities and complex lookahead or backtracking leap out at the programmer immediately.
Hard disagree. Yacc has unnecessary footguns, in particular the fallout from using LALR(1), but more modern parser generators like bison provide LR(1) and IELR(1). Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities. A good LR(1) parser generator enables a level of grammar consistency that is very difficult to achieve otherwise.
> Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities.
Could you give a concrete, real-life example of this? I have written many recursive-descent parsers and never ran into this problem (Apache Jackrabbit Oak SQL and XPath parser, H2 database engine, PointBase Micro database engine, HypersonicSQL, NewSQL, Regex parsers, GraphQL parsers, and currently the Bau programming language).
I have often heard that Bison / Yacc / ANTLR etc. are "superior", but mostly from people who didn't actually have to write and maintain production-quality parsers. I do have experience with the above parser generators, e.g. for university projects, and Apache Jackrabbit (2.x). I remember that in each case, the parser generators had some "limitations" that caused problems down the line. Then I had to spend more time trying to work around the parser generator limitations than actually doing productive work.
This may sound harsh, but well, that's my experience... I would love to hear from people who had a different experience on non-trivial projects...
Same. LR(k) and LL(k) are readable and completely unambiguous, in contrast to PEG, where ambiguity is resolved ad hoc: PEG doesn't have a single definition, so implementations may differ, and the original PEG uses the order of the rules and backtracking to resolve ambiguity, which may lead to different resolutions in different contexts. Ambiguity does not leap out to the programmer.
OTOH, an LL(1) grammar can be used to generate a top-down/recursive descent parser, and will always be correct.
GCC used to have Bison grammars but it switched to recursive descent about 20 years ago. The C++ grammar was especially horrible.
A large portion of this consistency is not making executive decisions about parsing ambiguities. The difference between "the language is implicitly defined by what the parser does" and "the grammar for the language has been refined one failed test at a time" is large and practically important.
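For what it's worth, awk itself has a well-known case of this kind of quiet resolution (shown here as a sketch; corner cases differ slightly between implementations):

    # inside a print statement, '>' is grabbed as output redirection, not
    # comparison: this writes "1" to a file literally named "2"
    awk 'BEGIN { print 1 > 2 }'

    # to print the result of the comparison you have to force the issue,
    # e.g. with parentheses
    awk 'BEGIN { print (1 > 2) }'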
I think it would be interesting and fitting to hear about and link to the reflections of the original awk authors (Aho, Kernighan, Weinberger et al.), considering they were also experts in yacc and other compiler-compiler tools from the 1977–1985 era and authors of the dragon book. After all, awk syntax was the starting point for JavaScript, including warts such as regexp literals, optional semicolons, for (e in a), delete a[e], introducing the function keyword to a C-like language, etc. I recall at least Kernighan talked about optional semicolons as something he'd reconsider given the chance.
And GNU is notorious for their use of yacc. Even gnulib functions like parse_datetime (primarily used to power the date command) rely on a yacc generated parser.
If you think AWK is hard to parse, then try C++. The latter is so hard to parse, and compiles so slowly as a result, that it most probably inspired the famous programmer skit in one of the most popular XKCDs of all time [1].
Then along come modern languages with fast compilation like Go and D. The latter is such a breath of fresh air: even though it's a complex language like C++ and Rust, it manages to compile very fast. Heck, it even has the RDMD facility, which gives you a compiled REPL-like workflow where you interact with a prompt much as you would in interpreted languages like Python and Matlab.
According to its author, the main reason D compiles very fast (as long as you avoid CTFE) is that its design decisions avoid the constructs that complicate the symbol table in C++, notably the popular << and >> overloading for both I/O and shifting. But the fact that Rust came along much later than C++ and D and is still slow to compile is bewildering, to say the least.
[1] Compiling: https://xkcd.com/303/
Except in some rare edge cases, it’s mostly the latter, indirectly: in the average crate the vast majority of the time is spent in LLVM optimization passes and linking. Sometimes IR generation gets a pretty high score, but that’s somewhat inconsistent.
`cargo check` that does all the parsing, type system checks, and lifetime analysis is pretty fast compared to builds.
Rust compilation spends most of its time in LLVM, due to the verbosity of the IR it outputs, and in linking, due to the absurd amount of debug info and objects to link.
When `cargo check` isn't fast, it's usually due to build scripts and procedural macros: they are slow because they are compiled binaries, so LLVM, linking, and running a ton of unoptimized code end up blocking type checking.
> But the fact that Rust came along much later than C++ and D and is still slow to compile is bewildering, to say the least.
The reasons why Rust (rustc) is slow to compile are well-known. Not bewildering.
Rust isn't particularly slow to compile as long as you keep opt-level at 1 and the number of external libraries minimal. But even then it isn't as slow as C++ (but I write shit C++ code; I've heard that modern C++ is way better, I learned with C++98 and never really improved my style despite using C++11).
If you are parsing awk, you must treat any run of whitespace that contains a newline as a visible token, which you have to reference in various places in the grammar. Your implementation will likely benefit from a switch, in the lexical analyzer, which sometimes turns off the visible newline.
IIRC, awk does this in a context-sensitive manner, by looking at the previous token.
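A rough illustration of where that newline token bites (toy programs only):

    # a newline normally terminates a statement, so this body is two statements
    awk 'BEGIN {
        x = 1
        print x
    }'

    # but after certain tokens (a comma, &&, ||, an opening brace, else, ...)
    # the lexer swallows the newline, so this is still a single condition
    awk 'BEGIN {
        if (1 &&
            2) print "ok"
    }'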