Another nice thing about /usr/bin/time is the --verbose flag which gives:
Command being timed: "ls"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1912
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 112
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is very likely because without the full path your shell uses its `time` builtin rather than the binary.
The shell's builtin keyword for `time` is more limited than the full `time` binary. The same is true of a number of other common unix commands, e.g. `echo`. The manpage for your shell should describe its builtin functions.
man bash:
If the time reserved word precedes a pipeline, the elapsed as well
as user and system time consumed by its execution are reported when
the pipeline terminates.
man time:
Some shells may provide a builtin time command which is similar
or identical to this utility. Consult the builtin(1) manual page.
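For reference, a few ways to make sure the external binary is used rather than the shell keyword (the flags and the $TIME format-string behaviour below assume GNU time on Linux; BSD time differs):

/usr/bin/time --verbose ls      # full path bypasses the keyword entirely
\time -v ls                     # quoting any character stops keyword recognition
command time -v ls              # `command` forces an ordinary command lookup
TIME='%e real, %U user, %S sys' time ls
# the leading assignment defeats the keyword too, and GNU time reads its output format from $TIME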
I got excited when I saw the 'f' and 'count' commands, but they're just scripts he has on his system. Like doing grep 'plover' blah.log | cut -d ' ' -f 11 | sort | uniq -c | sort -n. Personally I'd prefer to use the ubiquitous commands that work everywhere rather than rely on having custom scripts on my system, but they are nice.
Most people who use Unix directly build up some stuff in ~/bin (often a misnomer because it's shell scripts and not binaries, although mine is less of a misnomer than most because so much is in C rather than shell). The trick is to build them out of the standard portable components that exist everywhere. (This means, among other things, no #!/bin/bash.)
That's the whole point of shell scripting, to take a series of minimal programs and tie them together into something that does a more complex task. There's no reason to distrust a shell script simply because it is a script any more than there is to trust a binary simply because it's a binary.
Sure, but relying on custom shell scripts as unix primitives can be problematic if you find yourself frequently managing/troubleshooting systems that you don't own, and you don't want to (or aren't allowed to) put those handy scripts in place. Then when you're on any given system, you forget whether you can use "f", or if you have to fall back on awk.
I think it's less about not trusting custom scripts than it is about ensuring that your unix muscle memory doesn't atrophy.
That's the theory, but frankly the syntax is so cumbersome and irregular, and requires so much googling for "easy" things like conditionals, substrings, etc., that I now use a real programming language if a script needs to be anything more than a list of commands without any logic (besides variable substitution).
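For example, the kind of thing meant here, using bash-specific parameter expansion (a sketch; the exact idioms vary between shells):

file="plover.log"
# substring test plus suffix stripping -- the sort of syntax that tends to need a lookup
if [[ "${file: -4}" == ".log" ]]; then
    echo "${file%.log}"    # prints "plover"
fi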
> [...] but they're just scripts he has on his system. Personally I'd prefer to use the ubiquitous commands that work everywhere than rely on having custom scripts on my system [...]
It's okay for one to have their own tools.
$ f() { printf "\$%s" "$1"; }
$ echo a b c | awk "{ print $(f 2) }"
b
(Note the double quotes around the awk program, so the shell expands $(f 2) to $2 before awk sees it.)
His system is not very different from mine or yours. He just chose to combine the tools in a specific way.
"What if Unix had less compositionality but I could use it with less memorized trivia? Would that be an improvement? I don't know."
The answer is "no" here, because the alternative doesn't exist. Could it be created? Maybe in theory, but I suspect that the amount of stuff that you'd need to memorize (or learn to look up) to use it effectively would be about the same for any system that allowed a similar variety of work to be accomplished. If you are willing to trade off functionality for simplicity, then sure, it can be done. You can get it today by just not using all these tools at all, I suppose.
There would be less trivia to memorize if the command behaviors and options were more consistent. You may not be able to achieve that at the edges, where new commands and options are added, but you can always go back and clean things up.
For example, the cut(1) command is intended to do precisely what his f script does. But it's inconvenient because unlike many other commands it (1) doesn't obey $IFS and (2) the -d delimiter option only takes a single character. This could and should be remediated with a new, simple option.
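A quick illustration of the difference, plus a minimal awk-based stand-in for f (the author's actual script isn't quoted in the thread, so this is only a guess at its shape):

# cut treats every single space as a field boundary, so runs of spaces yield empty fields
printf 'a  b\n' | cut -d ' ' -f 2     # prints an empty line
printf 'a  b\n' | awk '{ print $2 }'  # prints "b"; awk collapses whitespace runs

# f N -- print the Nth whitespace-delimited field of each line on stdin
f() { awk -v n="${1:?usage: f N}" '{ print $n }'; }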
I think the only thing preventing that change is that there's not enough interest in moving POSIX forward faster; certainly not like JavaScript.
Another problem is GNU tools. They have many great features but OMG are they a nightmare of inconsistency. BSD extensions tend to be much better thought through, perhaps because GNU tools tend to be led by a single developer while BSD tools tend to be more team oriented.
So the way forward isn't to replace the organic evolution, it's to layer on processes that refine the proven extensions. And we already have some of those processes in place; we just need to imbue them with more authority, and that starts by not rolling our eyes at standardization and portability.
Authority is the problem...not standardization and portability. Everyone is willing and able to tell you the best way to do your work if you use their tools. Straitjacketing implementation in the name of order is a surefire way to dissuade people from using your tools.
'sort | uniq -c | sort -n' is an interesting pipeline. It will always work and does a great job with large cardinality data on low memory systems.
However, if you have the RAM, or know the data set has a low cardinality (like http status codes or filenames instead of ip addresses), then something that works in memory will be much more efficient.
I threw 144,000,000 'hello' and 'world' into a file:
justin@box:~$ ls -lh words
-rw-r--r-- 1 justin justin 824M Jan 7 15:21 words
justin@box:~$ wc -l words
144000000 words
justin@box:~$ time (sort <words|uniq -c)
72000000 hello
72000000 world
real 0m22.831s
user 0m32.999s
sys 0m4.675s
Compared to doing it in memory with awk:
justin@box:~$ time awk '{words[$1]++} END {for (w in words) printf("%s %d\n", w, words[w])}' < words
hello 72000000
world 72000000
real 0m10.639s
user 0m9.736s
sys 0m0.876s

So, half the time and 1/3 the CPU.
This is because in the first example you are invoking two programs: the first sorts the content of the file, the second counts runs of equal lines.
The awk example instead builds a hash table of all the words, increments the count for each key, and then prints the table.
There is no sorting, and the printing may be buffered.
Not exactly. sort (at least GNU sort) will end up doing external merge sort on temporary files if you give it more data than you have memory. Which, if you give it 100GB of 5 different strings, ends up being a huge waste.
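For what it's worth, GNU sort lets you tune that behaviour (these flags are specific to GNU coreutils; BSD sort differs):

# bigger in-memory buffer, explicit temp directory, and more merge threads
sort -S 2G -T /tmp --parallel=4 words | uniq -c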
Even working in memory, there are different efficiencies for different methods.
Awk includes an asort() function which can sort an array, such that it would be possible to create a similar process entirely within awk to the sort | uniq -c pipeline:
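For instance, a sketch (asort() is a gawk extension, not POSIX awk, and the commenter's actual code may differ):

gawk '
    { lines[NR] = $0 }
    END {
        n = asort(lines)                  # sort all saved lines in memory
        count = 1
        for (i = 2; i <= n; i++) {
            if (lines[i] == lines[i-1]) count++
            else { printf "%7d %s\n", count, lines[i-1]; count = 1 }
        }
        if (n) printf "%7d %s\n", count, lines[n]
    }' data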
As compared with a hash-based counter, on a 2,000 value test dataset with 10 unique values:

sort | uniq -c takes 0.019s (8 runs averaged)
awk hash takes 0.023s (8 runs averaged)
awk-implemented sort + unique takes 0.033s (8 runs averaged)

In this case, sort | uniq is the fastest option. But the all-in-memory sort + separate tabulation of unique values in awk is notably slower (running in 143% of the time) than the also all-in-memory hash accumulator.
As I bump up the dataset size (20,000 records) that discrepancy increases: roughly 0.052s sort|uniq, 0.065s hash, and 0.217s awk sort-unique.
TL;DR: test your assumptions, especially regarding performance.
Note: Data were generated with a simple bash loop:
for i in {1..2000}; do echo $((RANDOM%10)); done > data
Then there is the use of perl and its system command in "count". As with seq, why is perl needed? No explanation. Why not just put the entire pipeline into a perl system command?
> The appearance of the TIME=… assignment at the start of the shell command disabled the shell's special builtin treatment of the keyword time, so it really did use /usr/bin/time. This computer stuff is amazingly complicated. I don't know how anyone gets anything done.
This is a good point about the second half of the article (compositionality), but the author started the article by saying this was a command he "sometimes runs", presumably indicating he has it saved somewhere.
This is slower than not running seq and just using builtins.
https://wiki.ubuntu.com/DashAsBinSh
If you don't want the inefficiencies of seq, bash has its own loop constructs (brace expansion and a C-style for loop; see the sketch below), which are a lot more idiomatic than constructing a for loop out of a while loop.
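A sketch of the comparison, assuming the script in question drove a while loop with seq (the loop body just reuses the data generator from above):

# seq feeding a while loop: an extra process plus a pipe
seq 1 2000 | while read -r i; do echo $((RANDOM % 10)); done > data

# bash's own constructs: no external command needed
for ((i = 1; i <= 2000; i++)); do echo $((RANDOM % 10)); done > data
for i in {1..2000}; do echo $((RANDOM % 10)); done > data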
Indeed.
1) start timer
2) start deciding which commands to pipeline together
3) run the commands
4) stop timer
a lot of times the decision is the long pole.
in this author's case it included:
5) try a couple more variants of steps 2 and 3
6) write a blog post
:)