Friday, December 28, 2007

A quick look at SNOBOL

This entry is just a brief glimpse at the SNOBOL programming language. From its Wikipedia entry:


SNOBOL (String Oriented Symbolic Language) is a computer programming language developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky.


SNOBOL is a language for string manipulation. Also from its Wikipedia entry:


... SNOBOL was widely used in the 1970s and 1980s as a text manipulation language ... its popularity has faded as newer languages such as Awk and Perl have made string manipulation by means of regular expressions popular ...



This language caught my attention while listening to the OOPSLA podcast episode on the excellent 50-in-50 talk by Guy Steele and Richard Gabriel.

Given that this is a programming language exploration blog, learning more about SNOBOL provides an excellent opportunity to get to know one of the first languages for text manipulation.

The best place to learn about the language is http://www.snobol4.org/, where a lot of SNOBOL resources can be found. One of the best is the link to THE SNOBOL4 PROGRAMMING LANGUAGE (the "Green Book").

All the examples presented in this post were created using the Macro SNOBOL4 in C implementation.

A "Hello world" program in SNOBOL looks like this:


* Statements are indented; only labels (such as END) start in column 1.
        OUTPUT = 'Hello World'
END


As shown here, assigning a value to the special OUTPUT variable writes it to standard output as a line.

The inverse is also true for the INPUT variable: reading its value reads a line from standard input. For example, the following program asks for the user's name.


        OUTPUT = "Your name? "
        NAME = INPUT
        OUTPUT = "Hello " NAME
END


Flow of control is expressed as jumps to labels, taken depending on whether a statement succeeds or fails. For example:


* Labels start in column 1; ordinary statements must be indented.
ASK
        OUTPUT = "Your name? "
        NAME = INPUT                    :F(DONE)
        OUTPUT = "Hello " NAME          :(ASK)
DONE
        OUTPUT = "Finished"
END


This example asks for a name until the input is closed, that is, at end of file (Ctrl+D on Linux). ASK, DONE and END are labels; all of them except END are user-chosen names. The :F(DONE) modifier means "jump to DONE if the statement fails", and the :(ASK) modifier means "always jump to ASK".

The most interesting thing about the language is its string pattern matching capabilities. Here's a small (and very incomplete) example that extracts the parts of a simplified URL string:
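Besides :F(label) and the unconditional :(label), a statement can also jump on success with :S(label), or combine both. The following is only a small sketch of that idea; the labels LOOP, SHOW and DONE and the variable N are names I made up for the illustration. LT(N,3) is a built-in predicate that succeeds while N is less than 3 and fails otherwise, so the failure of that statement is what ends the loop.

```snobol
* Count from 1 to 3 using success and failure jumps.
        N = 0
* LT returns the null string on success, so N becomes N + 1.
LOOP    N = LT(N,3) N + 1               :S(SHOW)F(DONE)
SHOW    OUTPUT = "N is " N              :(LOOP)
DONE    OUTPUT = "Finished"
END
```

This should print "N is 1", "N is 2", "N is 3" and then "Finished".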


        LETTER = "abcdefghijklmnopqrstuvwxyz"
        LETTERORDOT = "." LETTER
        LETTERORSLASH = "/" LETTER

        LINE = INPUT
        LINE SPAN(LETTER) . PROTO "://" SPAN(LETTERORDOT) . HOST "/" SPAN(LETTERORSLASH) . RES

        OUTPUT = PROTO
        OUTPUT = HOST
        OUTPUT = RES
END



In line 6, the content of the LINE variable is matched against a pattern. The pattern contains the following elements:


  1. The SPAN(LETTER) . PROTO "://" section matches a sequence of letters followed by "://" and assigns the letters to the variable called PROTO (the . operator performs the assignment only if the whole match succeeds)

  2. The SPAN(LETTERORDOT) . HOST "/" section takes a sequence of letters and dots followed by "/" and assigns them to the variable called HOST

  3. Finally, the last section takes the remaining letters and slash characters and assigns them to the RES variable
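The three steps above can also be tried without typing any input. This is just a sketch under a few assumptions: the URL is a made-up sample value, and URLPAT and BAD are names I picked for the illustration. It stores the pattern in a variable first and adds a failure jump, so that nothing is printed from PROTO, HOST or RES when the subject doesn't look like a URL.

```snobol
* Hypothetical sample value, used only for illustration.
        URL = "http://www.example.org/docs/index"
        LETTER = "abcdefghijklmnopqrstuvwxyz"
* A pattern is a value like any other, so it can be kept in a variable.
        URLPAT = SPAN(LETTER) . PROTO "://" SPAN(LETTER ".") . HOST "/" SPAN(LETTER "/") . RES
* If the match fails, PROTO, HOST and RES are left untouched.
        URL URLPAT                      :F(BAD)
        OUTPUT = "protocol: " PROTO
        OUTPUT = "host: " HOST
        OUTPUT = "resource: " RES       :(END)
BAD     OUTPUT = "not a simplified URL"
END
```

With the sample value above, PROTO should end up as http, HOST as www.example.org and RES as docs/index.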




To show a little program that uses all the elements presented here, I wanted to create a small example that takes the authentication log /var/log/auth.log as input and shows every use of sudo along with the program that was executed. The desired lines look like this:


Dec 28 08:21:42 glorfindel sudo: lfallas : TTY=pts/3 ; PWD=/home/lfallas ; USER=root ; COMMAND=/bin/bash


This file also contains entries other than sudo usages, so we have to ignore them.

Here's the program:


        &ANCHOR = 0
        UCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        LCASE = "abcdefghijklmnopqrstuvwxyz"
        DIGIT = "0123456789"
        USERNAMECHAR = DIGIT LCASE UCASE
        USERNAMEPAT = SPAN(USERNAMECHAR)

* Read lines until end of file; skip any line that is not a sudo command entry.
READLINE LINE = INPUT                                   :F(DONE)
        LINE " sudo: " USERNAMEPAT . USER               :F(READLINE)
        LINE "COMMAND=" ARB . COMMAND RPOS(0)           :F(READLINE)

        OUTPUT = USER ":" COMMAND                       :(READLINE)

DONE

END


Here the &ANCHOR = 0 assignment tells SNOBOL that a pattern may match starting at any position of the subject string. The ARB element matches any sequence of characters, as short as possible, up to the point where the rest of the pattern succeeds, and the RPOS(0) element matches only at the end of the line.
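To see what &ANCHOR changes, here is a small sketch that runs the same match twice on a shortened, made-up line. The variable CMD and the label ANCH are names I chose for the illustration. With &ANCHOR = 0 the match may start in the middle of the line; with &ANCHOR = 1 it has to start at the first character, so "COMMAND=" is never found.

```snobol
* Shortened, hypothetical log fragment used only for illustration.
        LINE = "USER=root ; COMMAND=/bin/bash"
* Unanchored: the match may start anywhere in LINE.
        &ANCHOR = 0
        LINE "COMMAND=" ARB . CMD RPOS(0)       :F(ANCH)
        OUTPUT = "unanchored: " CMD
* Anchored: the match must start at the first character of LINE.
ANCH    &ANCHOR = 1
        LINE "COMMAND=" ARB . CMD RPOS(0)       :S(END)
        OUTPUT = "anchored: no match"
END
```

This should print "unanchored: /bin/bash" followed by "anchored: no match". ARB starts by matching nothing and grows one character at a time until RPOS(0) is satisfied, which is why CMD ends up holding everything up to the end of the line.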



In future entries I'm going to show more interesting SNOBOL features.