Think of a phone directory listing with a lot of Browns followed by some Brownes, Brownings and Brownleys. Is there a simple way to identify the places in the sorted list where Brown changes to Browne, Browne to Browning and Browning to Brownley?
If you're a programmer, you've probably just thought of an 'if, then' test for items in the list. If you're an AWK-lover (like me!), you may have started thinking about the getline option.
There's another and more interesting way to do this job on the GNU/Linux command line, as explained below.
Regular readers of Free Software Magazine may remember that I tackled this 'find the changeover' problem in an article last year on splitting spreadsheets.
In that article, I showed how to use a spreadsheet function to add a 'changeover' flag to the row just before a change happened, in a special flagging column. The spreadsheet function was of the 'if X, then A; otherwise B' kind. The spreadsheet with its flags was then saved as a text file, and command-line trickery was used to split the sections between flags into new spreadsheets. If our phone directory listings were in a spreadsheet, you would finish up with a Browns sheet, a Brownes sheet, and so on. (For a better command-line trick to do this splitting, see below.)
But what I really wanted was a way to do that 'changeover' flagging on the command line, without needing a spreadsheet, and now I have it.
A look at the numbers
Here's a sorted, comma-separated directory listing of Brown-type names. I've numbered the lines for clarity. Notice that there are 9 Browns, 3 Brownes, 7 Brownings and 2 Brownleys.
Did you check the numbers by counting? You don't have to. We can use the command line to do the counting. First, we save the list as a file, 'names' (without those added line numbers). Next, we use AWK to print just the first field (last name), after telling AWK that the field separator is a comma. We pipe the resulting list to the uniq command with the -c option, and the result is a count of the unique last names:
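As a sketch of that counting step (the entries below are invented for illustration; only the last-name counts — 9, 3, 7 and 2 — match the article's listing):

```shell
# Build an illustrative 'names' file: 9 Browns, 3 Brownes,
# 7 Brownings and 2 Brownleys (the other fields are made up)
{ printf 'Brown,n%s,555,SPR\n'    1 2 3 4 5 6 7 8 9
  printf 'Browne,n%s,555,SPR\n'   1 2 3
  printf 'Browning,n%s,555,SPR\n' 1 2 3 4 5 6 7
  printf 'Brownley,n%s,555,SPR\n' 1 2; } > names

# Tell AWK the field separator is a comma, print field 1 (last name),
# then let uniq -c count each run of identical names
awk -F, '{print $1}' names | uniq -c
```

The output is the four last names, each preceded by its count: 9 Brown, 3 Browne, 7 Browning and 2 Brownley.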
Now for the trick. The line number in the original comma-separated list just before Brown changes to Browne is 9 (see above). The line number just before Browne changes to Browning is 12, which is 9+3, and the Browning to Brownley changeover happens just after line number 19, which is 9+3+7.
These changeover line numbers are subtotals from the uniq -c list we generated above. To calculate them automatically, we can first use AWK to grab just the counts of unique names, which is the first field in that list we created. We can pass the counts to AWK again to calculate subtotals. The full command now looks like this:
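Under the same assumptions about the names file (invented entries whose counts match the article's), the full pipeline might look like this:

```shell
# Recreate the illustrative 'names' file (9, 3, 7 and 2 of each surname)
{ printf 'Brown,n%s,555,SPR\n'    1 2 3 4 5 6 7 8 9
  printf 'Browne,n%s,555,SPR\n'   1 2 3
  printf 'Browning,n%s,555,SPR\n' 1 2 3 4 5 6 7
  printf 'Brownley,n%s,555,SPR\n' 1 2; } > names

# Counts of unique names -> just the count field -> running subtotals
awk -F, '{print $1}' names | uniq -c | awk '{print $1}' |
  awk '{total += $1; print total}'
# prints 9, 12, 19 and 21, one per line
```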
The result is a list of the line numbers just before the changeovers. We don't really need that last subtotal (21), which is the line where the Brownleys and the list finish, so we can delete it before the subtotalling happens, with sed:
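With the same illustrative names file, dropping the final count with sed's last-line delete ($d) before subtotalling might look like this:

```shell
# Recreate the illustrative 'names' file (9, 3, 7 and 2 of each surname)
{ printf 'Brown,n%s,555,SPR\n'    1 2 3 4 5 6 7 8 9
  printf 'Browne,n%s,555,SPR\n'   1 2 3
  printf 'Browning,n%s,555,SPR\n' 1 2 3 4 5 6 7
  printf 'Brownley,n%s,555,SPR\n' 1 2; } > names

# sed '$d' deletes the last count (the 2 Brownleys)
# before the subtotals are calculated
awk -F, '{print $1}' names | uniq -c | awk '{print $1}' |
  sed '$d' | awk '{total += $1; print total}'
# prints 9, 12 and 19, one per line
```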
Job done, we have the 3 line numbers we wanted. But to make these numbers really useful, we need to turn them into a 'shopping list' for other commands to use when they go through the names file. We can do that by first saving the subtotal list as a file, list. That's our 'shopping list'.
An excellent command-line tool for processing files like names is sed, but how do we get sed to use our list file as its 'shopping list', in other words to do something only on lines 9, 12 and 19?
The wizards at Stack Overflow considered this question recently, and contributor 'perreal' suggested a nice general solution. First, we use sed to build a list of steps for another sed command to carry out later on. These steps will each be preceded by the -e option, which simply adds the step to the list of things for sed to do.
For example, the command
sed -e 's/^/-e /' -e 's/$/p/' list
will take the list file and build this series of commands for later use:
Here the 'p' will just tell sed to print that line (9, 12 or 19). If we feed the output of this command to sed with its -n option (which suppresses everything except the lines the 'p' commands select) and tell it to do these steps with names, it prints out just those chosen lines:
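Putting the two sed commands together, again with the invented names file from earlier (only its counts match the article), the whole trick might run like this:

```shell
# Recreate the illustrative 'names' file (9, 3, 7 and 2 of each surname)
{ printf 'Brown,n%s,555,SPR\n'    1 2 3 4 5 6 7 8 9
  printf 'Browne,n%s,555,SPR\n'   1 2 3
  printf 'Browning,n%s,555,SPR\n' 1 2 3 4 5 6 7
  printf 'Brownley,n%s,555,SPR\n' 1 2; } > names

# Save the changeover line numbers as our 'shopping list'
printf '%s\n' 9 12 19 > list

# First sed: turn each line number N into a '-e Np' step
sed -e 's/^/-e /' -e 's/$/p/' list

# Second sed: run those steps against names; -n suppresses
# everything except the lines the p commands select
sed -n $(sed -e 's/^/-e /' -e 's/$/p/' list) names
```

The first sed prints '-e 9p', '-e 12p' and '-e 19p'; the second prints lines 9, 12 and 19 of names — the last Brown, Browne and Browning entries. Note that the inner command substitution is deliberately left unquoted so the shell splits it into separate arguments for sed.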
Now for the improvement on the file-splitter I mentioned earlier. This time we'll use AWK (with a fairly complicated command!) to search for blank lines, split the file at the blanks, and name each file 'part1', 'part2' etc:
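One way to write that splitter (the sections file below is an invented stand-in for a file that already has blank lines between its sections):

```shell
# A stand-in input file: three sections separated by blank lines
printf 'Brown,a\nBrown,b\n\nBrowne,c\n\nBrowning,d\nBrowning,e\n' > sections

# At each blank line, close the current output file and move on to
# the next; every non-blank line goes into the current partN file
awk 'BEGIN { n = 1 }
     /^$/  { close("part" n); n++; next }
           { print > ("part" n) }' sections
```

With that input, the command produces part1 (the two Brown lines), part2 (the Browne line) and part3 (the two Browning lines).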
On the command line, there are nearly always alternative ways to reach the same end. I like the procedure outlined above (identify changeovers with uniq -c and subtotals, generate a list of changeover lines, use the list in other commands) and I've already applied it in a long shell script, but readers who are experienced programmers might prefer a different approach.
Note also that in our phone directory example, our starting list was already sorted on the first field (last name), and we looked for changeovers in that first field. We could also have sorted the names file on the fourth field (town abbreviation), then the first field, then had AWK pull out the fourth field for the uniq -c trick. That would give us alphabetised 'Brownian' names by town:
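A sketch of that by-town variant, using a small invented names file with a town abbreviation in the fourth field:

```shell
# Illustrative entries: last name, first name, phone, town abbreviation
{ printf '%s\n' 'Brown,Al,555,SPR' 'Browne,Bo,555,EGT' 'Brown,Cy,555,EGT' \
    'Browning,Di,555,SPR' 'Brown,Ed,555,EGT'; } > names

# Sort on town (field 4), then last name (field 1),
# then pull out the town field and count entries per town
sort -t, -k4,4 -k1,1 names | awk -F, '{print $4}' | uniq -c
```

With this sample the output is a count of 3 for EGT and 2 for SPR, and within each town the surnames come out alphabetised.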