Another nice thing about /usr/bin/time is the --verbose flag which gives:
Command being timed: "ls"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1912
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 112
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is very likely because without the full path your shell uses its `time` builtin rather than the binary.
The shell's builtin keyword for `time` is more limited than the full `time` binary. The same is true of a number of other common unix commands, e.g. `echo`. The manpage for your shell should describe its builtin functions.
man bash:
If the time reserved word precedes a pipeline, the elapsed as well
as user and system time consumed by its execution are reported when
the pipeline terminates.
man time:
Some shells may provide a builtin time command which is similar
or identical to this utility. Consult the builtin(1) manual page.
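For reference, a few ways to make sure the external binary is used rather than the shell keyword (the flags and the $TIME format-string behaviour below assume GNU time on Linux; BSD time differs):

/usr/bin/time --verbose ls      # full path bypasses the keyword entirely
\time -v ls                     # quoting any character stops keyword recognition
command time -v ls              # `command` forces an ordinary command lookup
TIME='%e real, %U user, %S sys' time ls
# the leading assignment defeats the keyword too, and GNU time reads its output format from $TIME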
I got excited when I saw the 'f' and 'count' commands, but they're just scripts he has on his system. Like doing grep 'plover' blah.log | cut -d ' ' -f 11 | sort | uniq -c | sort -n. Personally I'd prefer to use the ubiquitous commands that work everywhere rather than rely on having custom scripts on my system, but they are nice.
Most people who use Unix directly build up some stuff in ~/bin (often a misnomer because it's shell scripts and not binaries, although mine is less of a misnomer than most because so much is in C rather than shell). The trick is to build them out of the standard portable components that exist everywhere. (This means, among other things, no #!/bin/bash.)
That's the whole point of shell scripting, to take a series of minimal programs and tie them together into something that does a more complex task. There's no reason to distrust a shell script simply because it is a script any more than there is to trust a binary simply because it's a binary.
Sure, but relying on custom shell scripts as unix primitives can be problematic if you find yourself frequently managing/troubleshooting systems that you don't own, and you don't want to (or aren't allowed to) put those handy scripts in place. Then when you're on any given system, you forget whether you can use "f", or if you have to fall back on awk.
I think it's less about not trusting custom scripts than it is about ensuring that your unix muscle memory doesn't atrophy.
That's the theory, but frankly the syntax is so cumbersome and irregular, and requires so much googling for "easy" things like conditionals, substrings, etc., that I now use a real programming language if a script needs to be anything more than a list of commands without any logic (besides variable substitution).
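For example, the kind of thing meant here, using bash-specific parameter expansion (a sketch; the exact idioms vary between shells):

file="plover.log"
# substring test plus suffix stripping -- the sort of syntax that tends to need a lookup
if [[ "${file: -4}" == ".log" ]]; then
    echo "${file%.log}"    # prints "plover"
fi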
> [...] but they're just scripts he has on his system. Personally I'd prefer to use the ubiquitous commands that work everywhere than rely on having custom scripts on my system [...]
It's okay for one to have their own tools.
$ f() { printf "\$%s" "$1"; }
$ echo a b c | awk "{ print $(f 2) }"
b
(Note the double quotes around the awk program, so the shell expands $(f 2) to $2 before awk sees it.)
His system is not very different from mine or yours. He just chose to combine the tools in a specific way.
"What if Unix had less compositionality but I could use it with less memorized trivia? Would that be an improvement? I don't know."
The answer is "no" here, because the alternative doesn't exist. Could it be created? Maybe in theory, but I suspect that the amount of stuff that you'd need to memorize (or learn to look up) to use it effectively would be about the same for any system that allowed a similar variety of work to be accomplished. If you are willing to trade off functionality for simplicity, then sure, it can be done. You can get it today by just not using all these tools at all, I suppose.
There would be less trivia to memorize if the command behaviors and options were more consistent. You may not be able to achieve that at the edges, where new commands and options are added, but you can always go back and clean things up.
For example, the cut(1) command is intended to do precisely what his f script does. But it's inconvenient because unlike many other commands it (1) doesn't obey $IFS and (2) the -d delimiter option only takes a single character. This could and should be remediated with a new, simple option.
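A quick illustration of the difference, plus a minimal awk-based stand-in for f (the author's actual script isn't quoted in the thread, so this is only a guess at its shape):

# cut treats every single space as a field boundary, so runs of spaces yield empty fields
printf 'a  b\n' | cut -d ' ' -f 2     # prints an empty line
printf 'a  b\n' | awk '{ print $2 }'  # prints "b"; awk collapses whitespace runs

# f N -- print the Nth whitespace-delimited field of each line on stdin
f() { awk -v n="${1:?usage: f N}" '{ print $n }'; }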
I think the only thing preventing that change is that there's not enough interest in moving POSIX forward faster; certainly not like JavaScript.
Another problem is GNU tools. They have many great features but OMG are they a nightmare of inconsistency. BSD extensions tend to be much better thought through, perhaps because GNU tools tend to be led by a single developer while BSD tools tend to be more team oriented.
So the way forward isn't to replace the organic evolution, it's to layer on processes that refine the proven extensions. And we already have some of those processes in place; we just need to imbue them with more authority, and that starts by not rolling our eyes at standardization and portability.
Authority is the problem...not standardization and portability. Everyone is willing and able to tell you the best way to do your work if you use their tools. Straitjacketing implementation in the name of order is a surefire way to dissuade people from using your tools.
'sort | uniq -c | sort -n' is an interesting pipeline. It will always work and does a great job with large cardinality data on low memory systems.
However, if you have the RAM, or know the data set has a low cardinality (like http status codes or filenames instead of ip addresses), then something that works in memory will be much more efficient.
I threw 144,000,000 'hello' and 'world' into a file:
justin@box:~$ ls -lh words
-rw-r--r-- 1 justin justin 824M Jan 7 15:21 words
justin@box:~$ wc -l words
144000000 words
justin@box:~$ time (sort <words|uniq -c)
72000000 hello
72000000 world
real 0m22.831s
user 0m32.999s
sys 0m4.675s
Compared to doing it in memory with awk:
justin@box:~$ time awk '{words[$1]++} END {for (w in words) printf("%s %d\n", w, words[w])}' < words
hello 72000000
world 72000000
real 0m10.639s
user 0m9.736s
sys 0m0.876s

So, half the time and 1/3 the CPU.
This is because in the first example you are invoking two programs: the first sorts the content of the file, the second counts runs of equal lines.
The awk example instead builds a hash table of all the words, increments the count for each key, and then prints the table.
There is no sorting, and the printing may be buffered.
Not exactly. sort (at least GNU sort) will end up doing external merge sort on temporary files if you give it more data than you have memory. Which, if you give it 100GB of 5 different strings, ends up being a huge waste.
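For what it's worth, GNU sort lets you tune that behaviour (these flags are specific to GNU coreutils; BSD sort differs):

# bigger in-memory buffer, explicit temp directory, and more merge threads
sort -S 2G -T /tmp --parallel=4 words | uniq -c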
Even working in memory, there are different efficiencies for different methods.
Awk includes an asort() function which can sort an array, such that it would be possible to create a similar process entirely within awk to the sort | uniq -c pipeline:
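For instance, a sketch (asort() is a gawk extension, not POSIX awk, and the commenter's actual code may differ):

gawk '
    { lines[NR] = $0 }
    END {
        n = asort(lines)                  # sort all saved lines in memory
        count = 1
        for (i = 2; i <= n; i++) {
            if (lines[i] == lines[i-1]) count++
            else { printf "%7d %s\n", count, lines[i-1]; count = 1 }
        }
        if (n) printf "%7d %s\n", count, lines[n]
    }' data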
As compared with a hash-based counter, on a 2,000 value test dataset with 10 unique values:

sort | uniq -c takes 0.019s (8 runs averaged)
awk hash takes 0.023s (8 runs averaged)
awk-implemented sort + unique takes 0.033s (8 runs averaged)

In this case, sort | uniq is the fastest option. But the all-in-memory sort + separate tabulation of unique values in awk is notably slower (running in 143% of the time) than the also all-in-memory hash accumulator.
As I bump up the dataset size (20,000 records) that discrepancy increases: roughly 0.052s sort|uniq, 0.065s hash, and 0.217s awk sort-unique.
TL;DR: test your assumptions, especially regarding performance.
Note: Data were generated with a simple bash loop:
for i in {1..2000}; do echo $((RANDOM%10)); done > data
Then there is the use of perl and its system command in "count". As with seq, why is perl needed? No explanation. Why not just put the entire pipeline into a perl system command?
> The appearance of the TIME=… assignment at the start of the shell command disabled the shell's special builtin treatment of the keyword time, so it really did use /usr/bin/time. This computer stuff is amazingly complicated. I don't know how anyone gets anything done.
This is a good point about the second half of the article (compositionality), but the author started the article by saying this was a command he "sometimes runs", presumably indicating he has it saved somewhere.
This is slower than not running seq and just using builtins.
https://wiki.ubuntu.com/DashAsBinSh
If you don't want the inefficiencies of seq, bash has its own loop constructs (brace expansion and a C-style for loop; see the sketch below), which are a lot more idiomatic than constructing a for loop out of a while loop.
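A sketch of the comparison, assuming the script in question drove a while loop with seq (the loop body just reuses the data generator from above):

# seq feeding a while loop: an extra process plus a pipe
seq 1 2000 | while read -r i; do echo $((RANDOM % 10)); done > data

# bash's own constructs: no external command needed
for ((i = 1; i <= 2000; i++)); do echo $((RANDOM % 10)); done > data
for i in {1..2000}; do echo $((RANDOM % 10)); done > data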
Indeed.
1) start timer
2) start deciding which commands to pipeline together
3) run the commands
4) stop timer
a lot of times the decision is the long pole.
in this author's case it included:
5) try a couple more variants of steps 2 and 3
6) write a blog post
:)