I was privileged to be one of the technical reviewers for this book. There's a fair bit of the original content (which is still great), but Kernighan's done a great job with some good restructuring and some significant updates, too. The early chapters are very hands-on, with something of a focus on "exploratory data processing", particularly with CSV files. Big data with AWK, you could say.
Gawk and awk will soon have a new "--csv" option that enables proper CSV input mode (parsing files with quoted and multiline fields per the CSV RFC). I'm really glad Arnold Robbins added a robust "--csv" implementation to Gawk, too, because that's really the most-heavily used version of AWK nowadays. I've already got CSV support in my own GoAWK implementation [1], and I'll be adding "--csv" to make it compatible.
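For example, once the flag lands, a quoted field with an embedded comma should come through as a single field (the sample file here is made up):

    $ cat people.csv
    name,address
    "Doe, Jane","123 Main St"
    $ awk --csv 'NR > 1 { print $1 }' people.csv
    Doe, Jane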
I'm really glad this new updated version is coming out!
[1]: https://github.com/benhoyt/goawk
It's a crying shame we never settled on a control-character-separated text format. There are ASCII control characters for record and field (unit) separators. A bit of user-space support for that would have been great.
As I recall, you can tell Awk to use the control characters as record and field separators. Not helpful if you're getting your data from others, but if you're working by yourself, you have the option. I've come to use control characters as a default because it makes life so much easier.
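For example, a minimal sketch using the ASCII record separator (0x1E) and unit separator (0x1F) via octal escapes:

    # RS = ASCII record separator (octal 036), FS = ASCII unit separator (octal 037)
    printf 'one\037two\036three\037four' |
        awk 'BEGIN { RS = "\036"; FS = "\037" } { print $2 }'
    # prints "two" and "four"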
It is a shame. I have been using tab-separated sheets recently as it allows me to simply not care about almost any possible character in my strings...apart from tabs of course. But those are far less common than commas, and putting strings in quotes 100% of the time looks messy to me.
To be really useful as a format, it would just need text editors to:
- display something distinct for the field separator (some editors do this)
- treat the record separator character like a carriage return (not aware of any editors that do this)
Tab-delimited "csv" formats are quite common (e.g. the CONLL format family for many natural language processing tasks) and also supported by common tools such as MS Excel for decades already.
Awk is really great. For those who know nvm [1]: I used awk to make `nvm ls-remote` run more than 10 times faster [2] by replacing the related shell script with around 60 lines of awk script [3], and I was quite happy with the improvement.
It's not really a one-liner, nor anything big, but you can take it as an example showing that awk really isn't just for one-liners.
Meanwhile, having `--csv` support is really nice. I'd also like to see things like a built-in `length` function become standard.
[1]: https://github.com/nvm-sh/nvm/
[2]: https://github.com/nvm-sh/nvm/pull/2827/
[3]: https://github.com/nvm-sh/nvm/blob/9a769630d7/nvm.sh#L1703-L...
But length() is standard POSIX, no? Even length(array) has been approved by POSIX [1] but not yet included in the spec (they're very slow to update the spec for some reason). Both forms have been supported in onetrueawk, Gawk, mawk, and Busybox awk for a long time.
[1] https://www.austingroupbugs.net/view.php?id=1566
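Both forms in one quick sketch:

    awk 'BEGIN {
        s = "hello"
        print length(s)         # string length: 5
        a["x"] = 1; a["y"] = 2
        print length(a)         # element count: 2 (the array form mentioned above)
    }'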
Our data product is delivered in CSV format. Even though I create user documentation mainly using csvkit, grep and sed, I would love to convert all those solutions to AWK. Sometimes AWK is more readable than sed, and csvkit requires installation.
It would be nice to have an awk cookbook for CSV. In terms of CSV manipulation and querying there is only a limited number of operations, and I think there is potential to standardize those operations using AWK.
It's nice that everyone is supporting this. I've written a portable awk module that takes control of the parsing, and it is SLOW (and a little buggy). I'm a little bummed that nobody will use it, but this is truly a step in the right direction.
I guess for the people that are still using nawk, you can set up an AWK envvar so you can { awk -f $AWKU/ucsv.awk -f <(echo '{print NR, $1}') }
https://github.com/Nomarian/Awk-Batteries/blob/master/Units/...
Would you say the first few chapters are enough to get 75-80% of the usefulness for mere mortals like me who will never try to master the full language? Or is the material fairly sprinkled throughout the whole tome?
Yes, definitely. The first three chapters would be more than enough for that: 1) An Awk Tutorial, 2) Awk in Action, and 3) Exploratory Data Analysis. For most people who just want to use AWK for one-liners on the command line, you can stop there. The rest of the chapters are about writing larger (still small! but not one-liner) programs in AWK to create reports, little languages, and experiment with algorithms.
Fantastic news. I’ve tried lots of new CLI tools but they always seem to fall between too little functionality (e.g. xsv) and too much (VisiData). AWK is just right.
Awk is awesome! Glad that they are looking to modernize the book. It wasn't really necessary; all the code examples in the original edition of the book still run just fine, although some are somewhat dated, like printing ASCII bar graphs. They also had examples of writing VMs, parsers, and interpreters in the book, which run on modern implementations.[0]
The language has some quirks. There are no local variables, so it's common practice to declare them as extra function parameters that callers never pass. And the traversal order of associative arrays is implementation-dependent. I'm not sure what the situation is regarding locale and UTF-8 support.
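The extra-parameter convention looks roughly like this (names made up):

    # i and total act as locals: extra parameters the caller never passes.
    function sum(arr, n,    i, total) {
        for (i = 1; i <= n; i++)
            total += arr[i]
        return total
    }
    { vals[NR] = $1 }
    END { print sum(vals, NR) }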
EDIT: Looks like Brian Kernighan added Unicode support last year.[1]
[0] https://github.com/siraben/awk-vm/blob/master/vm.awk
[1] https://github.com/onetrueawk/awk/commit/9ebe940cf3c652b0e37...
What would you suggest as an alternative to printing ASCII bar graphs? I do that all the time. Takes 20 seconds and often makes distributions, modalities, and patterns over time obvious right away.
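Something like this quick sketch, counting values in the first column (adjust to taste):

    { count[$1]++ }
    END {
        for (k in count) {
            printf "%-15s ", k
            for (i = 0; i < count[k]; i++) printf "#"
            printf "\n"
        }
    }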
`sparklines` [1] is good for an overall low-res view. `termgraph` [2] is sometimes better for a higher-res, more capable view (but can be finicky about the data).
[1] https://github.com/deeplook/sparklines
[2] https://github.com/mkaz/termgraph
Is there a particular benefit in writing a VM in AWK, placed in a big BEGIN block? Very similar code can be written in Perl or Python. Isn't the strength of AWK in its line-matching capability, being able to pattern-match a line against a block of code?
> Is there a particular benefit in writing a VM in AWK
Not really. Later on the book just ran out of line-matching examples to go through and started doing regular programming instead :P. When I actually write AWK code I rely on line-matching and using a variable to handle state.
awk can be mastered by just reading the man page. The book doesn't take long to read either. Once you understand the simple principles, you can write an infinite number of scripts for all kinds of tasks.
See, when I'm writing a shell script interactively and work myself into a corner, I reach for awk, struggle with it for a bit, and then either:
1) succeed, and regret the messiness of the solution
or
2) fail, and find a non-awk way to handle it.
I really tried to like awk, but its portability hasn't been enough of a feature to raise it above other scripting languages for me. Especially if I'm going to end up in an editor.
"Dark corners are basically fractal - no matter how much you illuminate, there is always a smaller but darker one." - - Brian Kernighan (quoted in the GNU Awk book)
Awk has always been a language that I loved, but I have struggled to use it beyond quick jobs for parsing text files. I understand it is meant to be used for exactly that, but the fact that it is simple, fast and lightweight sometimes makes me want to do something more with it. When I start trying to do something besides parsing text, though, I find that it starts becoming awkward (pun intended?).
> but the fact that it is simple, fast and lightweight
I see awk as a DSL to be honest. Yes, it can be used as a general purpose language, but that quickly becomes, as you say, awkward :D
Like many DSLs, it is simple, fast and lightweight as long as it is used for its intended purpose. Once you start using it for something else, these advantages evaporate pretty quickly, because then you have to essentially work around the DSL design to get it to do what you want.
I find it pretty nice for writing simple preprocessors. For example I have one which takes anything between two marker lines and pipes it through a command (one invocation per block). Awk has an amazing pipe operator which lets you do something like this:
    ... {
        print $0 | "command"
    }
"command" is executed once, and the pipe is kept open until closed explicitly by close("command"), at which point the next invocation will execute it again. The command string itself acts as a key for the pipe file descriptor.
And of course, no mention of awk is complete without the "uniq" implementation, which beats the coreutils uniq in every way possible (by supporting arbitrary expressions as keys and not requiring sorted input):
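Presumably something like the classic idiom:

    awk '!seen[$0]++' file        # keep the first occurrence of each line; input need not be sorted
    awk '!seen[$2, $5]++' file    # or dedupe on any expression you like as the key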
I had no idea about this "keep the pipe open" behaviour. I thought it would spawn the binary on every print statement and thus didn't consider it in the past. But now...
This is exactly why I moved from AWK to Perl for these quick jobs a couple of years ago. If you stick to an AWK-like subset, Perl is also simple, fast and lightweight. If you want to grow your scripts (and you have a lot of discipline) Perl – in contrast to AWK – gives you enough noose to hang^W^W^W^Wthe tools you need.
I have found a handful of unconventional applications for awk -- I once needed a tiny pcm pulsewave generator, and awk was surprisingly decent for the job [1].
Aside from that I've mostly been using it for quick statistics [2], but it quickly moves into perl territory...
[1] https://github.com/9001/asm/blob/hovudstraum/etc/bin/beeps#L...
[2] https://ocv.me/doc/unix/oneliners/#965bfcb8
It's a language for creating quick alternative views from line- and column-oriented text streams. That means, take the output of another tool and represent it in a different way.
Ok, dumb question: is the link supposed to point to the actual book (i.e., is the book free and/or open source), or is this just a page of miscellaneous interesting links about the book (which we can pay for later, when it's published)?
I was expecting the book, but the page itself says "This page is a placeholder for material related to the second edition of The AWK Programming Language."
It's fine if this is a placeholder page (and an awesome excuse to read and talk about AWK here on HN :) ), but I want to be sure that I'm not missing the book itself.
What I understand from the page is that the second edition of the book will live on that page when it is released (which is why it says it is a "placeholder").
I think the page description is quite clear: it contains material related to the book, not the book itself. So I would guess all downloadable code and perhaps supplementary material.
One of my first big projects at my first job fresh out of college was using sed & awk to semi-automate the transformation of semi-unstructured data into a database.
IIRC I couldn't completely automate it because the data contained author names from global naming conventions (parsing names correctly is deceptively complex). They had somewhat arbitrary numbers of initials, ranging from 0 to 3.
Again, IIRC, I could easily accommodate 0 or 1 initial (followed by \.) but trying for more would make the regex I was using too greedy and pull in part of the article abstract. These were scientific books and journals.
So I scripted a sed & awk program to detect the possibility of > 1 initials, and when that occurred, I'd pipe the record into nano for a quick review where I manually inserted the correct \. characters for the initials.
It was decades of back-catalogue publications for digitization so I sat there for days, listening to music on an original 1st gen iPod, waiting for my duct-taped kludge of a program to pipe one of thousands of records into a nano session every few minutes. This was on an Apple G4 workstation running OS X, where I earned my real bash scripting chops. It was an awful hack by today's standards, but at the time, accomplishing what was expected to be a 1-year long project in ~1 month, it was seen as nearly miraculous.
I know lots of people like awk, but I pretend it doesn't exist. Why? Here's my comment on this from 6 years ago[0],
>I used awk until I learned Python (long ago). For me, awk was yet another example of the "worse is better" approach to things so common in unix. For example, if you make a syntax error, you might get a message like "glob: exec error," rather than an informative message. "Worse is better" is probably a good strategy in business and for getting things done, but still, mediocrity and the sense of entitlement that so often goes with carelessness, sickens me.
[0] https://news.ycombinator.com/item?id=13457265
You are missing out. As a former data engineer/current SRE, I spend my entire day with VSCode/Python/Notebooks/CoPilot banging out python code - but whenever I need to do a complex analysis of a semistructured text file in < 60 seconds, awk is my twitch-reflex tool. It can trivially do state transitions based on patterns in the file, as well as populate hashes from one file and use them in the analysis of the next file, in just a few characters.
Awk's claim to fame in my world is that its cognitive activation energy, for anyone who has taken the 3-4 hours to learn the language from start to finish (and that's the awesome thing about the language - it really is about 3 hours of concentrated attention), is essentially nil. You see a bunch of ugly, not really structured, 500 MB text files that you can't pull into pandas or easily parse into python dicts? No problem - awk will tear through them for you and get the information you want in < 60 seconds, including the time you took to write your (almost always single) line of code.
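The "hashes from one file, used on the next" trick mentioned above boils down to the two-file idiom (field choices here are arbitrary):

    # First file: remember a key from each line. Second file: use the lookup.
    NR == FNR { allow[$1] = 1; next }
    $2 in allow { print }

Run as: awk -f filter.awk keys.txt data.txt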
Point taken. I have a Python program that is an elemental version of awk, and I use that for the odd task. I can modify it if needed and I have the entire Python library to help me. Is the text Unicode? HTML? These little details matter.
I'm not complaining that someone banged out awk (speaking figuratively) on a Friday afternoon to do something and not have to stay after work. Excellent! My complaint is that the failure to address technical debt has negatively affected the productivity of millions, if not tens of millions, of people, often working under pressure, for DECADES.
I will bet you $1000 that time spent learning Awk will lead to better results much faster than time spent polluting your privileged user directories with Python's excuse for "dependency management"
For many python users, it’s the only language they know. Often, they see programming in python as part of their “identity” - so they’re overly invested in it, to the detriment of other wonderful languages, like awk.
I used to code perl myself, back in the day - but I came to appreciate the simplicity of awk, and now it’s one of my favourites. I no longer code perl, as a consequence, as I believe awk to be far more elegant! I wouldn’t have done so, if I was overly invested in being a “perl programmer”.
Specifically, Awk is a good solution to a problem that should never have existed in the first place. Why am I having to write these bespoke parsers for the random mess of output formats that you get from the UNIX command line?
Well, the fact is that I have to write such parsers. That's very sad, but has no chance of being fixed. So it's good to know Awk.
I think Erik Naggum had this exact criticism of Perl.
* https://miller.readthedocs.io/en/6.8.0/file-formats/#csvtsva...
I have programs that handle it.
* https://jdebp.uk/Softwares/nosh/guide/commands/console-flat-...
Awesome!!!! Super excited to see this!
Busybox has their own independent AWK implementation.
https://busybox.net/ https://frippery.org/busybox/
Also see the first edition of the AWK manual online here:
https://archive.org/details/pdfy-MgN0H1joIoDVoIC7
> Hey you should read the AWK book, it even says how to write a VM!
> Why would I ever want to use AWK for that?
> Well, the input is a text file with one space-delimited instruction per line.
> Hmm... You have a point.
I know exactly enough to be dangerous and have meant to deep dive for almost a decade.
Long live the Unix Hater's Handbook! (Unix is fine, and so are the criticisms therein. Some of those criticisms have been eclipsed by ongoing development.) https://en.wikipedia.org/wiki/The_UNIX-HATERS_Handbook
That's Awk's sweet spot.