I was privileged to be one of the technical reviewers for this book. There's a fair bit of the original content (which is still great), but Kernighan's done a great job with some good restructuring and some significant updates, too. The early chapters are very hands-on, with something of a focus on "exploratory data processing", particularly with CSV files. Big data with AWK, you could say.
Gawk and awk will soon have a new "--csv" option that enables proper CSV input mode (parsing files with quoted and multiline fields per the CSV RFC). I'm really glad Arnold Robbins added a robust "--csv" implementation to Gawk, too, because that's really the most-heavily used version of AWK nowadays. I've already got CSV support in my own GoAWK implementation [1], and I'll be adding "--csv" to make it compatible.
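For example, once the flag lands, a quoted field with an embedded comma should come through as a single field (the sample file here is made up):

    $ cat people.csv
    name,address
    "Doe, Jane","123 Main St"
    $ awk --csv 'NR > 1 { print $1 }' people.csv
    Doe, Jane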
I'm really glad this new updated version is coming out!
[1]: https://github.com/benhoyt/goawk
It's a crying shame we never settled on a control-character-separated text format. There are ASCII control characters for record and field (unit) separators. A bit of user-space support for that would have been great.
As I recall, you can tell Awk to use the control characters as record and field separators. Not helpful if you're getting your data from others, but if you're working by yourself, you have the option. I've come to use control characters as a default because it makes life so much easier.
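For example, a minimal sketch using the ASCII record separator (0x1E) and unit separator (0x1F) via octal escapes:

    # RS = ASCII record separator (octal 036), FS = ASCII unit separator (octal 037)
    printf 'one\037two\036three\037four' |
        awk 'BEGIN { RS = "\036"; FS = "\037" } { print $2 }'
    # prints "two" and "four"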
It is a shame. I have been using tab-separated sheets recently as it allows me to simply not care about almost any possible character in my strings...apart from tabs of course. But those are far less common than commas, and putting strings in quotes 100% of the time looks messy to me.
To be really useful as a format, it would just need text editors to:
- display something distinct for the field separator (some editors do this)
- treat the record separator character like a carriage return (not aware of any editors that do this)
Tab-delimited "csv" formats are quite common (e.g. the CONLL format family for many natural language processing tasks) and also supported by common tools such as MS Excel for decades already.
Awk is really great. For those who know nvm [1]: I used awk to make `nvm ls-remote` run more than 10 times faster [2] by replacing the related shell script with around 60 lines of awk script [3], and I was quite happy with the improvement.
It's not really a one-liner, nor anything big, but you can take it as an example showing that awk really isn't just for one-liners.
Meanwhile, having `--csv` support is really nice. I'd also like to see things like a built-in `length` function become standard.
[1]: https://github.com/nvm-sh/nvm/
[2]: https://github.com/nvm-sh/nvm/pull/2827/
[3]: https://github.com/nvm-sh/nvm/blob/9a769630d7/nvm.sh#L1703-L...
But length() is standard POSIX, no? Even length(array) has been approved by POSIX [1] but not yet included in the spec (they're very slow to update the spec for some reason). Both forms have been supported in onetrueawk, Gawk, mawk, and Busybox awk for a long time.
[1] https://www.austingroupbugs.net/view.php?id=1566
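Both forms in one quick sketch:

    awk 'BEGIN {
        s = "hello"
        print length(s)         # string length: 5
        a["x"] = 1; a["y"] = 2
        print length(a)         # element count: 2 (the array form mentioned above)
    }'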
Our data product is delivered in CSV format. Even though I create user documentation mainly using csvkit, grep and sed, I would love to convert all those solutions to AWK. Sometimes AWK is more readable than sed, and csvkit requires installation.
It would be nice to have an awk cookbook for CSV. In terms of CSV manipulation and querying there is only a limited number of operations, and I think there is potential to standardize those operations using AWK.
It's nice that everyone is supporting this. I've written a portable awk module that takes control of the parsing, and it is SLOW (and a little buggy). I'm a little bummed that nobody will use it, but this is truly a step in the right direction.
I guess for the people that are still using nawk, you can set up an AWK envvar so you can { awk -f $AWKU/ucsv.awk -f <(echo '{print NR, $1}') }
https://github.com/Nomarian/Awk-Batteries/blob/master/Units/...
Would you say the first few chapters are enough to get 75-80% of the usefulness for mere mortals like me who will never try to master the full language? Or is the material fairly sprinkled throughout the whole tome?
Yes, definitely. The first three chapters would be more than enough for that: 1) An Awk Tutorial, 2) Awk in Action, and 3) Exploratory Data Analysis. For most people who just want to use AWK for one-liners on the command line, you can stop there. The rest of the chapters are about writing larger (still small! but not one-liner) programs in AWK to create reports, little languages, and experiment with algorithms.
Fantastic news. I’ve tried lots of new CLI tools but they always seem to fall between too little functionality (e.g. xsv) and too much (VisiData). AWK is just right.
Awk is awesome! Glad that they are looking to modernize the book. It wasn't really necessary; all the code examples in the original edition of the book still run just fine, although some are somewhat dated, like printing ASCII bar graphs. They also had examples of writing VMs, parsers, and interpreters in the book, which run on modern implementations.[0]
The language has some quirks. There are no local variables, so it's common practice to declare them as extra function parameters that callers never pass. And the traversal order of associative arrays is implementation-dependent. I'm not sure what the situation is regarding locale and UTF-8 support.
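The extra-parameter convention looks roughly like this (names made up):

    # i and total act as locals: extra parameters the caller never passes.
    function sum(arr, n,    i, total) {
        for (i = 1; i <= n; i++)
            total += arr[i]
        return total
    }
    { vals[NR] = $1 }
    END { print sum(vals, NR) }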
EDIT: Looks like Brian Kernighan added Unicode support last year.[1]
[0] https://github.com/siraben/awk-vm/blob/master/vm.awk
[1] https://github.com/onetrueawk/awk/commit/9ebe940cf3c652b0e37...
What would you suggest as an alternative to printing ASCII bar graphs? I do that all the time. Takes 20 seconds and often makes distributions, modalities, and patterns over time obvious right away.
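Something like this quick sketch, counting values in the first column (adjust to taste):

    { count[$1]++ }
    END {
        for (k in count) {
            printf "%-15s ", k
            for (i = 0; i < count[k]; i++) printf "#"
            printf "\n"
        }
    }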
`sparklines` [1] is good for an overall low-res view. `termgraph` [2] is sometimes better for a higher-res, more capable view (but can be finicky about the data).
[1] https://github.com/deeplook/sparklines
[2] https://github.com/mkaz/termgraph
Is there a particular benefit in writing a VM in AWK, placed in a big BEGIN block? Very similar code can be written in Perl or Python. Isn't the strength of AWK in its line-matching capability, being able to pattern-match a line against a block of code?
> Is there a particular benefit in writing a VM in AWK
Not really. Later on the book just ran out of line-matching examples to go through and started doing regular programming instead :P. When I actually write AWK code I rely on line-matching and using a variable to handle state.
awk can be mastered by just reading the man page. The book doesn't take long to read either. Once you understand the simple principles, you can write an infinite number of scripts for all kinds of tasks.
See, when I'm writing a shell script interactively and work myself into a corner, I reach for awk, struggle with it for a bit, and then either:
1) succeed, and regret the messiness of the solution
or
2) fail, and find a non-awk way to handle it.
I really tried to like awk, but its portability hasn't been enough of a feature to raise it above other scripting languages for me. Especially if I'm going to end up in an editor.
"Dark corners are basically fractal - no matter how much you illuminate, there is always a smaller but darker one." - - Brian Kernighan (quoted in the GNU Awk book)
Awk has always been a language that I loved, but I have struggled to use it beyond quick jobs for parsing text files. I understand it is meant to be used for exactly that, but the fact that it is simple, fast and lightweight sometimes makes me want to do something more with it. When I start trying to do something besides parsing text, though, I find that it starts becoming awkward (pun intended?).
> but the fact that it is simple, fast and lightweight
I see awk as a DSL to be honest. Yes, it can be used as a general purpose language, but that quickly becomes, as you say, awkward :D
Like many DSLs, it is simple, fast and lightweight as long as it is used for its intended purpose. Once you start using it for something else, these advantages evaporate pretty quickly, because then you have to essentially work around the DSL design to get it to do what you want.
I find it pretty nice for writing simple preprocessors. For example I have one which takes anything between two marker lines and pipes it through a command (one invocation per block). Awk has an amazing pipe operator which lets you do something like this:
    ... {
        print $0 | "command"
    }
"command" is executed once, and the pipe is kept open until closed explicitly by close("command"), at which point the next invocation will execute it again. The command string itself acts as a key for the pipe file descriptor.
And of course, no mention of awk is complete without the "uniq" implementation, which beats the coreutils uniq in every way possible (by supporting arbitrary expressions as keys and not requiring sorted input):
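Presumably something like the classic idiom:

    awk '!seen[$0]++' file        # keep the first occurrence of each line; input need not be sorted
    awk '!seen[$2, $5]++' file    # or dedupe on any expression you like as the key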
I had no idea about this "keep the pipe open" behaviour. I thought it would spawn the binary on every print statement and thus didn't consider it in the past. But now...
This is exactly why I moved from AWK to Perl for these quick jobs a couple of years ago. If you stick to an AWK-like subset, Perl is also simple, fast and lightweight. If you want to grow your scripts (and you have a lot of discipline) Perl – in contrast to AWK – gives you enough noose to hang^W^W^W^Wthe tools you need.
I have found a handful of unconventional applications for awk -- I once needed a tiny pcm pulsewave generator, and awk was surprisingly decent for the job [1].
Aside from that I've mostly been using it for quick statistics [2], but it quickly moves into perl territory...
[1] https://github.com/9001/asm/blob/hovudstraum/etc/bin/beeps#L...
[2] https://ocv.me/doc/unix/oneliners/#965bfcb8
It's a language for creating quick alternative views from line- and column-oriented text streams. That means, take the output of another tool and represent it in a different way.
Ok, dumb question: is the link supposed to point to the actual book (i.e., is the book free and/or open source), or is this just a page of miscellaneous interesting links about the book (which we can pay for later, when it's published)?
I was expecting the book, but the page itself says "This page is a placeholder for material related to the second edition of The AWK Programming Language."
It's fine if this is a placeholder page (and an awesome excuse to read and talk about AWK here on HN :) ), but I want to be sure that I'm not missing the book itself.
What I understand from the page is that the second edition of the book will live on that page when it is released (which is why it says it is a "placeholder").
I think the page description is quite clear: it contains material related to the book, not the book itself. So I would guess all downloadable code and perhaps supplementary material.
One of my first big projects at my first job fresh out of college was using sed & awk to semi-automate the transformation of semi-unstructured data into a database.
IIRC I couldn't completely automate it because the data contained author names from global naming conventions (parsing names correctly is deceptively complex). They had somewhat arbitrary numbers of initials, ranging from 0 to 3.
Again, IIRC, I could easily accommodate 0 or 1 initial (followed by \.) but trying for more would make the regex I was using too greedy and pull in part of the article abstract. These were scientific books and journals.
So I scripted a sed & awk program to detect the possibility of > 1 initials, and when that occurred, I'd pipe the record into nano for a quick review where I manually inserted the correct \. characters for the initials.
It was decades of back-catalogue publications for digitization so I sat there for days, listening to music on an original 1st gen iPod, waiting for my duct-taped kludge of a program to pipe one of thousands of records into a nano session every few minutes. This was on an Apple G4 workstation running OS X, where I earned my real bash scripting chops. It was an awful hack by today's standards, but at the time, accomplishing what was expected to be a 1-year long project in ~1 month, it was seen as nearly miraculous.
I know lots of people like awk, but I pretend it doesn't exist. Why? Here's my comment on this from 6 years ago[0],
>I used awk until I learned Python (long ago). For me, awk was yet another example of the "worse is better" approach to things so common in unix. For example, if you make a syntax error, you might get a message like "glob: exec error," rather than an informative message. "Worse is better" is probably a good strategy in business and for getting things done, but still, mediocrity and the sense of entitlement that so often goes with carelessness, sickens me.
[0] https://news.ycombinator.com/item?id=13457265
You are missing out. As a former data engineer/current SRE, I spend my entire day with VSCode/Python/Notebooks/CoPilot banging out python code - but whenever I need to do a complex analysis of a semistructured text file in < 60 seconds, awk is my twitch-reflex tool. It can trivially do state transitions based on patterns in the file, as well as populate hashes from one file and use them in the analysis of the next file, in just a few characters.
Awk's claim to fame in my world is that its cognitive activation energy, for anyone who has taken the 3-4 hours to learn the language from start to finish (and that's the awesome thing about the language - it really is about 3 hours of concentrated attention), is essentially nil. You see a bunch of ugly, not really structured, 500 MB text files that you can't pull into pandas or easily parse into python dicts? No problem - awk will tear through them for you and get the information you want in < 60 seconds, including the time you took to write your (almost always single) line of code.
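The "hashes from one file, used on the next" trick mentioned above boils down to the two-file idiom (field choices here are arbitrary):

    # First file: remember a key from each line. Second file: use the lookup.
    NR == FNR { allow[$1] = 1; next }
    $2 in allow { print }

Run as: awk -f filter.awk keys.txt data.txt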
Point taken. I have a Python program that is an elemental version of awk, and I use that for the odd task. I can modify it if needed and I have the entire Python library to help me. Is the text Unicode? HTML? These little details matter.
I'm not complaining that someone banged out awk (speaking figuratively) on a Friday afternoon to do something and not have to stay after work. Excellent! My complaint is that the failure to address technical debt has negatively affected the productivity of millions, if not tens of millions, of people, often working under pressure, for DECADES.
I will bet you $1000 that time spent learning Awk will lead to better results much faster than time spent polluting your privileged user directories with Python's excuse for "dependency management"
For many python users, it’s the only language they know. Often, they see programming in python as part of their “identity” - so they’re overly invested in it, to the detriment of other wonderful languages, like awk.
I used to code perl myself, back in the day - but I came to appreciate the simplicity of awk, and now it’s one of my favourites. I no longer code perl, as a consequence, as I believe awk to be far more elegant! I wouldn’t have done so, if I was overly invested in being a “perl programmer”.
Specifically, Awk is a good solution to a problem that should never have existed in the first place. Why am I having to write these bespoke parsers for the random mess of output formats that you get from the UNIX command line?
Well, the fact is that I have to write such parsers. That's very sad, but has no chance of being fixed. So it's good to know Awk.
I think Erik Naggum had this exact criticism of Perl.
* https://miller.readthedocs.io/en/6.8.0/file-formats/#csvtsva...
I have programs that handle it.
* https://jdebp.uk/Softwares/nosh/guide/commands/console-flat-...
Awesome!!!! Super excited to see this!
Busybox has their own independent AWK implementation.
https://busybox.net/ https://frippery.org/busybox/
Also see the first edition of the AWK manual online here:
https://archive.org/details/pdfy-MgN0H1joIoDVoIC7
> Hey you should read the AWK book, it even says how to write a VM!
> Why would I ever want to use AWK for that?
> Well, the input is a text file with one space-delimited instruction per line.
> Hmm... You have a point.
I know exactly enough to be dangerous and have meant to deep dive for almost a decade.
Long live the Unix Hater's Handbook! (Unix is fine, and so are the criticisms therein. Some of those criticisms have been eclipsed by ongoing development.) https://en.wikipedia.org/wiki/The_UNIX-HATERS_Handbook
That's Awk's sweet spot.