/usr/sbin/blog

Alex's Geekery Journal.

This blog has been deprecated. Go here for the new /usr/sbin/blog

How Old School C Programmers Process Arguments

Here’s a bit of code from Kernighan’s and Ritchie’s seminal text, The C Programming Language. I’m not sure I’d actually recommend coding this way (as written in the book, the entire program has only one comment), but it’s so darn clever and concise, I can’t help but post it. This snippet is part of a larger program called find that searches text for patterns, but the details aren’t important. I want to focus on the portion below that parses the arguments passed to the program. Don’t worry about understanding it just yet. I’ll step through it below:

char *s;
while (--argc > 0 && (*++argv)[0] == '-')
    for(s = argv[0]+1; *s != '\0'; s++)
        switch(*s) {
            case 'x':
                /* Set appropriate flags for 'x' option */
                break;
            case 'n':
                /* Set appropriate flags for 'n' option */
                break;
            default:
                printf("Illegal option %c\n", *s);
                argc = 0;
                break;
        }
if( argc != 1)
    printf("Usage: find -x -n pattern\n");
else
    /* Main program logic goes here */

What’s so cool about this is its flexibility. Those two nested loops can process all of the following (and more):

find -xn hello
find -nx hello
find -x -n hello
find -n -x hello
find hello
find -x hello
...

As you can see, the usage is find [-x] [-n] PATTERN. The nested loops above process the flags, -x and -n, in any order, and either separated or concatenated. Hopefully you’re as impressed as I was. Anyway, let’s dive into the code and begin by looking at the while statement.

The while Loop

while (--argc > 0 && (*++argv)[0] == '-')

What’s going on here? First, the while loop decrements argc and checks that it’s greater than 0 (as dictated by convention, argc holds the number of arguments passed to the program). Whenever a flag is processed, argc is decremented, thus making it a running count of how many flags there are left to process. If it’s 0, then there’s nothing left to do. Notice that the decrement always occurs before the ‘>’ is evaluated. This would be true even if it were postfix (i.e., argc-- > 0).

If argc checks out, then we move on to (*++argv)[0]=='-'. Yikes. That’s a doozy. First remember that argv is a pointer to the strings that contain the arguments. So, argv points to the string containing the program name and (argv+1) points to the string containing the first argument. That means that (*argv)[0] is the first character of the program name and (*argv)[1] is the first character of the first argument. Putting that together, (*++argv)[0] increments argv, dereferences it to a string, and then gets the first character of that string. In other words, if argv was originally pointing to the string containing the program’s name, it’s now pointing to the string containing the first argument, and then grabbing the first character of that string and comparing it to ‘-‘. Why’s it doing that? Because it wants to make sure it’s looking at a string containing a flag, and flags begin with the ‘-’ character.

Extra Credit: How is *++argv[0] == '-' different than (*++argv)[0] == '-'?

By the end of all of this, argc represents the number of arguments remaining and argv is pointing to the string containing the first argument. Wow.

That was a bit of a slog, but we’re not done yet. Let’s look at the for loop.

The for Loop

char *s;
for(s = argv[0]+1; *s != '\0'; s++)

Oh boy. More pointer arithmetic on arrays of arrays. Well, what’s going on here? Remember that now argv is pointing to the string containing the first argument. Therefore, argv[0] is the pointer to the first character of that string and argv[0]+1 is the pointer to the second character of that string. argv[0]+1 is then assigned to s. The result is that, if argv was pointing to “-xn”, then s is now pointing to the “x” in that string. Lucky for us, the last half is much simpler. *s != '\0' checks to make sure that it hasn’t reached the end of the string yet, and s++ increments the pointer (after the loop has finished its first run, of course). If the first run processed the “x” in “-xn”, then s is incremented and the “n” is processed. The third iterations sees the “\0” at the end of the string and exits. So, to bring this all together, the for loop traverses concatenated flags (i.e., it moves the pointer from the “x” to the “n” in “-xn”) and the while loop traverses separated flags (i.e., it moves the pointer from the “-x” to the “-n” in “-x -n”).

The switch Statement

switch(*s) {
      case 'x':
        /* Set appropriate flags for 'x' option */
        break;
    case 'n':
         /* Set appropriate flags for 'n' option */
        break;
    default:
        printf("Illegal option %c\n", *s);
        argc = 0;
        break;
}

The switch is simple enough. It tests if *s is equal to one of the flags, and then does the necessary processing for that flag. Most likely it just sets an internal flag, which will be taken into account when the main part of the program runs (this is how it’s set up in the full program).

The if Statement

if( argc != 1)
    printf("Usage: find -x -n pattern\n");
else
    /* Main program logic goes here */

The last and final part is the if statement. It prints an error/usage message if argc is not equal to 1. Why? Because there should always be one argument left after the flags are processed (the PATTERN argument). The error will also be printed if the program is given an unrecognized flag and the default case in the switch is executed. If there aren’t any problems, then it executes the else where the main part of the program is contained.

So there you have it. That’s how Kernigham and Ritchie did it in 1978. Nowadays they just haul in the software weenies with their fancy objects and methods. Lost is the subtle art of manipulating arrays of pointers to strings of characters. Alas. (Just kidding. After all, I’m probably one of those software weenies.)

Fibonacci Sequence, Part II

The matrix equation in the last post can actually be whittled down a bit further to produce another equation that, in some ways, is easier to work with. The result is as follows:

fib

Fk is, of course, the kth Fibonacci number. Now, before I go on, I want to point out that I stumbled upon these two equations and the proofs for these equations in Gilbert Strang’s excellent Linear Algebra and Its Applications. All credit goes to him for coming up with this, and I recommend his book for an even more in depth explanation.

Anyway, back to the equation, which I found interesting for a few reasons:

  1. This equation allows you to easily find the kth term using only a pocket calculator. The last equation requires something that can deal with matrices.
  2. This also makes it easier to use in programming applications. There’s no need to write functions or import libraries to deal with matrices. It’s also probably faster than writing some sort of recursive function to compute the kth Fibonacci number.
  3. As Strang points out, that equation, amazingly, produces an integer, despite all the fractions and square roots.
  4. We can simplify that equation further for a very good approximation (so good that it’s not really an approximation).

To see how we get that equation from the equation in the last post, let’s begin with the way Mr. Strang has it written in his book (it’s the same as mine, but flipped upside down).

title

Where F0=0, F1=1, F2=1, etc. Let’s now make some substitutions:

title

We now have:

title

This hasn’t changed the content of the equation at all. We’ve just substituted in new symbols to represent the different terms.

The next step is to diagonalize the matrix A. Remember that diagonalizing A produces A = SΛS-1 where S contains A’s eigenvectors and Λ contains A’s eigenvalues. Substituting A = SΛS-1 into the previous equation yields:

title

Notice how everything but the first and last S and S-1 cancel, giving the final form: SΛkS-1u0. This is important, because Λ is a diagonal matrix, making Λk very simple:

title

λ1 and λ2 are, of course, the eigenvalues of A. Let’s now make one last substitution and put all this diagonalization stuff together. First we define c as

title

Then we substitute it into the equation:

title

That might need some explaining. The first bit is simply the equation with c substituted in. Following that, the first matrix is S, but written to show that it contains x1 and x2, which are the eigenvectors of A placed vertically in the two columns. The middle matrix is Λk and the last is simply the matrix c. Finally, the matrices are multiplied out, yielding the final c1λk1x1+c2λk2x2

The final steps are simply to compute c, and A’s eigenvalues and eigenvectors. I’ll spare you the tedious algebra and simply tell you that:

title

title

This can be computed the standard way, by solving: det(A-λI)=0.

All the variables are then plugged into the equation for uk.

title

We want Fk, so we multiply, factor, and take the bottom row, giving the equation we want:

title

That’s the full equation, but now notice that the second term is always less than 1/2. This mean we can simply drop it, yielding:

In fact, since the second term is always less than 1/2 and the full equation always gives us an integer, we can take this a step further: Rounding the approximation to the nearest integer will always give you the exact value for Fk. Cool.

Using Matrices to Generate the Fibonacci Sequence

Here’s a fun little matrix:

Fib Matrix

That generates the an and an+1 terms of the Fibonacci sequence.1 To see why, let’s look at a recursive definition of the Fibonacci sequence.

Fib Def

Fib Def

That’s easy enough to understand. The first and second terms are both 1, and the next term is the sum of the last term plus the current term. To see how this relates to the matrix above, we must turn this into a matrix equation. The trick here is to have a good understanding of matrix multiplication. If you need a refresher, here’s a site with a tutorial. Let’s look at an example of a 2x2 matrix multiplied by a 2x1 (even if you have a good understanding of matrix multiplication, hang tight, this is going somewhere).

title

Now let’s make some substitutions and multiply it out:

title

As you can see, the effect of multiplying the b matrix by the Fibonacci matrix is that it moves the b1,2 position to the b1,1 position, and the b1,2 gets filled with the sum of b1,1 and b1,2. Let’s now substitute the b’s for an-1 and an:

title

Does the resultant matrix look familiar? If you look above, you’ll see that it’s the same as our recursive definition of the Fibonacci sequence! The top row is equal to the current number in the sequence (an) and the bottom row is equal to the next number in the sequence (an-1 + an = an+1). Continuing to multiply the resultant matrix by the Fibonacci matrix will cause consecutive entries to be produced. Because matrix multiplication is associative, we can move our multiplication to the exponent, and multiply that result by the first two terms in the sequence (0, 1), leading to our initial matrix:

Fib Matrix

References

1: Strang, Gilbert. ”Linear Algebra and Its Applications.”

Using ‘Trap’ to Catch ‘Ctrl + C’s and Control How Your Script Exits

So you’ve written a script that creates all sorts of temporary files, and launches dozens of processes in the background, and, as is, the script runs indefinitely. The only way to quit is Ctrl + C, but once you fire off this hotkey combo, you’re left with a mess of background processes and temp files. There are lots of reasons why your script might do this. For example, I’m currently writing a download manager that sends its wget processes to the background, and creates various temporary status files. It runs indefinitely because it polls a queue file, which contains the URLs of the files it’s instructed to download. So, given all that, how is it that I get my script to clean up after itself once I deal it the fatal Ctrl + C blow (or more technically, a “signal interrupt” or “SIGINT” for short)? The trick here is to use the trap command along with some basic process manipulation. Here’s the syntax:

trap 'command' signal_to_trap

The above bit of code will cause the command (or string of commands between the quotes) to execute if the script is issued a signal_to_trap. As is implied by the syntax, you can trap more than one type of signal. To catch the Ctrl + C combo, you usually want to trap a SIGINT, which is done as follows:

trap 'echo "Exiting"; exit 0' INT

That bit of code will cause your script to print the message “Exiting” and exit with status 0 once Ctrl + C is pressed (or gets issued a SIGINT some other way). One important caveat is always remember the exit command. For example, the following code will cause your program to print “Exiting” and then continue, effectively ignoring the SIGINT. What NOT to do if you want your program to actually quit:

trap 'echo "Exiting"' INT

Of course, that construct isn’t without its uses. Perhaps you want your script to immediately run some routine if Ctrl + C is entered, but you don’t want it to quit. The above bit of code would be the way to do it, but that strikes me as a confusing break from convention. Most people expect Ctrl + C to stop the currently running process. A better way to do this would be to use a USR signal. Just substitute INT for USR1 or USR2, and send the USR signal to the script using the kill command: kill -USR1 pid. In any case, back to the topic at hand: exiting a script cleanly.

‘Trapping’ and Cleaning Up After Yourself

The way I use ‘trap’ is something along these lines:

exit_routine () {
    # TODO: Clean up stuff.
}
trap 'exit_routine' INT # Intercept SIGINT and call exit_routine

If it’s short enough, you can, of course, cram your entire exit routine between the single quotes, but if it’s nontrivial it’s best to pull it out into its own function. Also remember that BASH executes a script’s commands in the order it sees them, so this must be placed somewhere near the beginning of the script to set the trap early on. (Perhaps that’s obvious, but I’d be the first to admit that that’s just the sort of pitfall I’d waste an hour puzzling over.) Also remember that the function must be declared before the trap statement (or at least before the script receives a SIGINT and tries to call exit_routine).

Cleaning Up Stuff (Processes)

Now that we’ve declared the exit routine and set the trap, we need to do some actual housekeeping. I do this by keeping track of all the background processes’ PIDs and temporary files’ filenames in an array. Consider the following code:

for i in $(seq 3); do
    wget "$URL[$i]" & 
    PIDS[$i]=$!
done

Three wget instances are launched and sent to the background with the & operator. Their PIDs are accessed with the $! variable and stored to $PIDS. We can now kill off those processes with the following bit of code:

exit_routine () {
    kill ${PIDS[*]}
    echo "Script exiting."
    exit 0
}
trap 'exit_routine' INT

Whenever Ctrl + C is pressed, the SIGINT is trapped and exit_routine is called. kill is then given all the PIDs, which are accessed using ${PIDS[*]}. This kills the wget processes, an exit message is printed, and the script is exited. The neat thing about the ${ARRAY[*]} way of accessing all an array’s elements is that empty elements won’t be returned. So if I do:

PIDS[5]="5"
PIDS[10]="10"
echo ${PIDS[*]}

Then only “5 10” is printed.

Dealing with Temp Files

Erasing temp files is a bit more tricky. The possibility of white space characters in a file’s name makes rm ${FILES[*]} troublesome. Enclosing the array in double quotes doesn’t help either. Instead, we need to step through the array as follows:

for i in $(seq 3); do
    rm "${FILES[$i]}"
done

If we want to ignore the empty elements of $FILES like ${PIDS[*]} automatically did for the PIDs, then we can test if an element is empty using an if statement. Alternatively, you can impress your friends with some BASH-fu and take advantage of BASH’s short circuit functionality:

for i in $(seq 3); do
    [ "${FILES[$i]}" != "" ] && rm "${FILES[$i]}"
done

If ${FILES[$i]} is empty, the statement short circuits, and does nothing. If it contains something, the rm command is executed.

So, that’s how it’s done. Now you have no excuse for leaving garbage behind (if only BASH had some way of automagically collecting this so called garbage!), and, as always, if you have any tips regarding this, please share your BASH-fu with us in the comments!

Sending Emails and Texts From the Command Line

Here’s a handy way to send emails from the command line:

echo 'message' | mail -s 'subject' 'email_address'

If you know the recipient’s carrier, you can also use it to send text messages from the command line. For example, if the recipient’s number is 111-111-1111 and she’s on Verizon, you can send her a text message using the following command:

echo 'Sent from my terminal!' | 
    mail -s 'Linux is fun' '1111111111@vtext.com'

Once again, the trick here is knowing the domain name of the carrier’s Email-to-SMS Gateway. In the case of Verizon, it’s @vtext.com. Wikipedia has a handy list of other gateways here. Another trick is to send yourself an email from the phone, and the phone’s SMS gateway will be revealed in the ‘from’ field.

You can use this to create all sorts of fun shell scripts. For example, here’s one I wrote recently that tells me whether or not the VPS provider BuyVM has any VPSs in stock:

#!/bin/bash -

echo 'Is BuyVM in stock?'
while :; do
        wget -O - -q http://doesbuyvmhavestock.com/ | grep -i yes
        if [ $? == 0 ]; then
                echo 'Well, what are you waiting for?' | 
                    mail -s "BuyVM is in stock" "you@example.com"
                echo 'Yes!'
                exit 0
        else
            echo 'No'
            sleep 15m
        fi
done

It’s pretty simple. Every 15 minutes it scrapes the webpage doesbuyvmhavestock.com and sends me an email if it contains the word ‘yes’. Obviously, the success of your own script depends on how well it can parse the webpage. As you can see, my parsing routine (the grep on line 5) is pretty primitive.

Backing Up a Directory’s Permissions

We’ve all done it before, and if it hasn’t happened to you yet, it’s only a matter of time. You’re attempting to set the permissions on a directory and and you’re doing something along the lines of sudo chmod -R 660 ./foobar. You execute the command only to realize that you’re in the wrong directory or you’ve mistyped the directory’s name, and now your permissions are totally clobbered. If you’ve kept good backups, this is annoying, but not devastating. Either you could completely restore the files, or perhaps you could write a script which would copy the old file permissions back over. An easier alternative would be to plan ahead and backup the file permissions before disaster strikes. Here’s a script that will do that for you:

#!/bin/bash

#Check that we've been passed a directory.
if [ ! -d "$1" ]; then
   echo "$1 is not a directory. Exiting."
   exit 1
fi

#Check if we've been passed a relative or absolute
#path. If it's relative, store $PWD in $DIR. We'll need
#this later to build the absolute path.
echo "$1" | grep -q '^/'
if [ $? == 0 ]; then
   DIR='' 
else
   DIR="$PWD/"
fi

#Generate the header comments for the script.
echo "#!/bin/bash"
echo "#"
echo "#This script will restore permissions under $DIR$1"
echo "#to how they were on `date`"
echo "#"

#Loop through the given directory's tree and 
#echo the commands to restore the permissions and owner.
find "$1" -print0 | while read -d $'\0' i
do
   echo chmod `stat -c %a "$i"` \"$DIR$i\"
   echo chown `stat -c "%U:%G" "$i"` \"$DIR$i\"
done

What this script does is generate another script, which, when run, will restore a directory’s permissions, owner, and group (and also the permissions, owner, and group of any files and directories under it). Suppose you drop this script into a file called backup.sh and you want to backup the permissions of a directory called /home/you/mystuff. Here’s how you’d do it:

./backup.sh /home/you/mystuff > mystuff-backup.sh

Now when you execute mystuff-backup.sh, all of /home/you/mystuff’s permissions will be restored. Easy.

There is one security caveat you should keep in mind. The resultant script will contain the names of all the files and subdirectories in the directory tree. The upshot is that anyone who has read access to the script will be able to see this information. If this is a problem, you should adjust the script’s permissions accordingly.

The script itself is pretty straight forward. On line 4, we tell a relative path from an absolute path by seeing if it’s prefixed with a forward slash. If it is, then it’s absolute. If it isn’t, then it’s relative. We’ll need this information later to build the absolute path to all the files. The next step is to loop through the files, which is a bit tricky. On line 28, we need to pipe the output of find to a while read because the more intuitive for loop construct doesn’t handle spaces and newlines in filenames properly (of course, anyone who puts newlines in a filename needs to be thrown into the sarlacc). In fact, you may just want to memorize that line, as it’s a common problem with a far from obvious solution. Finally, on line 30 and 31, we read the file’s permission and owners with stat, which is a handy command that, among other things, outputs file information in a user definable format.

So there you have it. Download the script here. Now go forth and make this a cron job.

The First Spellchecker (in 6 Lines)

I recently came across this gem in a book called Classic Shell Scripting. According to the authors, it’s a modern port of the first spellchecker. The original version was written by Steve Johnson, and was then translated into the form you see here by Kernighan and Plauger. I assume the translation was necessary because the syntax of the original isn’t compatible with modern shells. Without further ado, here it is:

cat filename | 
  tr A-Z a-z | 
    tr -c a-z '\n' | 
      sort | 
        uniq | 
          comm -13 dictionary_file -

It’s a spellchecker, which, at 115 characters, fits within a tweet! Imagine that! To prove to you that it actually works, and to truly appreciate the cleverness of this code, let’s step through it line by line. The first command, ‘cat’, simply reads the file and sends it down the pipe. This is then processed by ‘tr’ which converts all the uppercase characters to lower case. This is again piped to ‘tr’, but this time everything that’s not a lower case character is converted to a newline character. At this point, the original input has been processed down to a list of words separated by newline characters. This list is then sorted alphabetically with the ‘sort’ command, and all the duplicate words are removed with the ‘uniq’ command. Finally, this list of words is compared with a dictionary file. Any words that appear in the list, but not in the dictionary, are sent to standard output. The final result is a list of words that appear in your document, but not in the dictionary, and thus are probably misspelled. At this point, you may have noticed a rather serious problem with this script. It doesn’t do a very good job of handling apostrophes. The third line will remove all single quote characters, even if they’re functioning as apostrophes, and replace them with newline characters. One way to fix this is to replace line 3 with the following:

sed -r "s/('$|^'|\W'\b|\b'\W|\W'\W)/ /g" |
  tr -c a-z\' '\n' |

That rather ugly ‘sed’ command will remove all the single quote characters which aren’t in the middle of words, thereby preserving apostrophes, and the new ‘tr’ command will ignore all single quote characters since ‘sed’ has already dealt with them. The end result is that the apostrophe in, for example, foobar's will be preserved but the single quotes at the beginning and end of 'foobar's' will be removed. Another improvement would be to keep it from removing hyphens in compound words, but I’ll leave that as an exercise to the reader. Also, If you need a dictionary file try:

sudo apt-get install wamerican-small

And look under /usr/share/dict/words.

beala@superfly:/usr/share/dict$ wc -l words
50253 words

That’s a lot of words! Try ‘wamerican-insane’ if you want a, well, insane number of words.