GNU sort traps with field separators

GNU sort is an excellent utility that is a mainstay of the linux command line. It has all kinds of tricks up its sleeves, including support for uniquifying records, stable sorts, files larger than memory, parallelisation, controlled memory usage, etc. Go read the man page for all the gory details.

It also supports sorting with field separators, but unfortunately this support has some nasty traps for the unwary. Hence this post.

First, GNU sort cannot do general sorts of CSV-style datasets, because it doesn't understand CSV-features like quoting rules, quote-escaping, separator-escaping, etc. If you have very simple CSV files that don't do any escaping and you can avoid quotes altogether (or always use them), you might be able to use GNU sort - but it can get difficult fast.

Here I'm only interested in very simple delimited files - no quotes or escaping at all, Even here, though, there are some nasty traps to watch out for.

Here's a super-simple example file with just two lines and three fields, called dsort.csv:

$ cat dsort.csv
a,b,c
a,b+c,c

If we do a vanilla sort on this file, we get the following (I'm also running it through md5sum to highlight when the output changes):

$ sort dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

The longer line sorts before the shorter line because the '+' sign collates before the second comma in the short line - this is sorting on the whole line, not on the individual fields.

Okay, so if I want do an individual field sort, I can just use the -t option, right? You would think so, but unfortunately:

$ sort -t, dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

Huh? Why doesn't that work the short line first, like we'd expect? Maybe it's not sorting on all the fields or something? Do I need to explicitly include all fields? Let's see:

$ sort -t, -k1,3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

Huh? What the heck is going on here?

It turns out this unintuitive behaviour is because of the way sort interprets the the -k option - -kM,N (where M != N) doesn't mean 'sort by field M, then field M+1,... then by field N', it means instead 'join all fields from M to N (with the field separator?), and sort by that'. Ugh!

So I just need to specify the fields individually? Unfortunately, even that's not enough:

$ sort -t, -k1 -k2 -k3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

This is because the first option here - -k1 is interpreted as -k1,3 (since the last field is '3'), because the default 'end-field' is the last. Double-ugh!

So the takeaway is: if you want an individual-field sort you have to specify every field individually, AND you have to use -kN,N syntax, like so:

$ sort -t, -k1,1 -k2,2 -k3,3 dsort.csv | tee /dev/stderr | md5sum
a,b,c
a,b+c,c
493ce7ca60040fa184f1bf7db7758516  -

Yay, finally what we're after!

Also, unfortunately, there doesn't seem to be a generic way of specifying 'all fields' or 'up to the last field' or 'M-N' fields - you have to specify them all individually. It's verbose and ugly, but it works.

And for some good news, you can use sort suffixes on those individual options (like n for numerics, r for reverse sorts, etc.) just fine.

Happy sorting!

blog comments powered by Disqus