Survey of Operating Systems:
§ 5: UNIX Control Features

Instructor: M.S. Schmalz


Reading Assignments and Exercises

UNIX shells provide a scripting language with several useful control features. These control structures can cause various UNIX commands or programs to execute, which gives the user a powerful high-level control capability.

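This section reviews basic control structures available in UNIX that support the building of scripts and high-level UNIX command structures. This section is organized as follows:

5.1. Regular Expressions
5.2. Iteration and Control Statements
5.3. Piping and I/O Redirection and Control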

Information in this section was compiled from a variety of text- and Web-based sources, and is not to be used for any commercial purpose.

5.1. Regular Expressions


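A regular expression is a concise way of describing a pattern of characters, and may itself be built from smaller regular sub-expressions. Regular expressions are constructed by combining ordinary characters with one or more metacharacters, which are characters that have special meaning to the program that interprets the expression (e.g., grep, egrep, or a text editor). Because several of these characters are also special to the shell (e.g., Bourne, Korn, or C-shell) that your UNIX implementation supports, regular expressions typed on the command line are usually quoted.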

Regular expressions can be used in conjunction with UNIX programs to:

UNIX commands differ in the types of regular expressions that they support. One way to find out what regular expressions are supported is to check the manual page for the given command. In this section, we will discuss regular expressions in greater detail, with examples of command usage.

5.1.1. Practical Aspects of Regular Expressions

Regular expressions (REs) often describe patterns within text. The use of REs extends to configuration files, mail filters, text editors, and numerous programming languages. A UNIX application that manipulates text can use regular expressions.

A regular expression evaluates input data and returns an answer of true or false, similar to a relational operator in a high-level programming language. For instance, a regular expression might be configured to recognize a string S, for which a new string S' might be substituted.

Since regular expressions can be used by many applications, the result of applying a regular expression depends in part on the application. For example, after recognizing a string S, an application might substitute new text in place of S, save S in a buffer for later use, or execute a UNIX program with S as one of the arguments of the program.

For example, the UNIX grep utility searches a file for a particular text string. The grep program can accept a search string specification as a plain string, a quoted string, or a regular expression.

Example. Suppose one searches for all the <title> tags in a directory of HTML files. The grep command would be grep '<title>' *.html.

Here, grep evaluates whether or not each line in each *.html file matches the description <title>. If the line is a match, then grep prints out the file name and the matching line, separated by a colon.

In this example, the regular expression is '<title>', which is a quoted character string.
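The search described in this example can be sketched as follows. The two HTML files and their contents are hypothetical, created in a scratch directory only so that the command has something to search.

```shell
# Scratch directory with two hypothetical HTML files (made up for this sketch).
workdir=$(mktemp -d)
printf '<html>\n<title>UNIX commands</title>\n<p>text</p>\n' > "$workdir/cmds.html"
printf '<html>\n<title>Shell basics</title>\n<p>more text</p>\n' > "$workdir/shell.html"

# Search every .html file for lines containing the literal string <title>.
# The single quotes keep the shell from treating < and > as redirection.
title_lines=$(cd "$workdir" && grep '<title>' *.html)
printf '%s\n' "$title_lines"
# prints:
#   cmds.html:<title>UNIX commands</title>
#   shell.html:<title>Shell basics</title>

rm -rf "$workdir"
```

Because more than one file is searched, grep prefixes each matching line with the name of the file in which it was found.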

Occasionally a more involved search of text data is required, using constraints that represent restrictions, qualifications, or abstractions. To do this, one needs to use regular expressions with descriptive "metacharacters", as follows.

5.1.2. Placeholders and Repetition

For purposes of illustration, assume that an HTML directory has 225 files and 400 <title> tags. In order to avoid searching manually through 400 tags to find one on a specific subject (e.g., "commands"), a regular expression can be employed.

Example. To find only titles that reference commands, we use grep with the regular expression '<title>.*commands'.

This example has two metacharacters. The period (.) says that any character can occur in that place. The asterisk (*) means that zero or more instances of the previous character can occur there. In plain English, this means "match any line that contains a <title> HTML tag followed by any number of characters, as long as the word commands appears before the end of the line".
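A minimal sketch of this search follows, assuming the regular expression '<title>.*commands' and two hypothetical one-line HTML files created just for the demonstration.

```shell
# Two hypothetical files; only one title mentions commands.
workdir=$(mktemp -d)
printf '<title>UNIX commands</title>\n' > "$workdir/cmds.html"
printf '<title>Shell basics</title>\n' > "$workdir/shell.html"

# <title>   literal text
# .*        any character, repeated zero or more times
# commands  literal text that must appear later on the same line
command_titles=$(cd "$workdir" && grep '<title>.*commands' *.html)
printf '%s\n' "$command_titles"
# prints: cmds.html:<title>UNIX commands</title>

rm -rf "$workdir"
```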

In the preceding example, the period plays a very important role, as shown below.

Example. If the period had been omitted from the preceding regular expression, giving '<title>*commands', then grep would search for lines in which the text <title is followed by zero or more > signs and then immediately by the word commands.

In other words, the * character would apply to the > character alone, so the pattern would usually not be found in ordinary titles.
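The difference the period makes can be checked with a small sketch. The three sample lines are hypothetical, and the second pattern is assumed to be the first one with the period removed.

```shell
# Three hypothetical lines: an ordinary title, a title followed by extra > signs,
# and a malformed line in which "commands" directly follows "<title".
sample=$(mktemp)
printf '%s\n' '<title>UNIX commands</title>' \
              '<title>>>commands</title>' \
              '<titlecommands' > "$sample"

# With the period: any characters may separate <title> and commands.
with_dot=$(grep '<title>.*commands' "$sample")

# Without the period: * repeats only the > character, so commands
# must come right after the run of > signs.
without_dot=$(grep '<title>*commands' "$sample")

printf 'with period:\n%s\n' "$with_dot"
printf 'without period:\n%s\n' "$without_dot"

rm -f "$sample"
```

The ordinary title is matched only by the first pattern, while the malformed third line is matched only by the second.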

5.1.3. Range Specifications

Occasionally, a regular expression should be generalized by including higher-level abstractions. One way of doing this is by using the range delimiters ([ ]), which can be used to specify a character set.

Example. To match the digits 0-9, use the range specification [0123456789], or equivalently the shorthand [0-9].

A peculiarity of the range brackets is their treatment of metacharacters.

Example. Within the range delimiters, the period (.) and the asterisk (*) lose their special meaning and are treated as punctuation characters, so [.*] specifies matching either (.) or (*). In POSIX tools such as grep, a backslash (\) inside the brackets is itself an ordinary character, so [\.\*] also matches a backslash.

Note: Outside the range delimiters, putting backslashes before dots and stars (\. and \*) in order to turn off their behavior as special characters is called escaping the characters.

One can negate the range function to match anything but the specified characters, by preceding the range match with the caret (^).

Example. In [^1234], the caret inside this range operator means match anything but the characters 1 through 4.
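The range specifications above can be tried out with grep -c, which counts matching lines. This is a minimal sketch on five hypothetical sample lines, not a listing from the original notes.

```shell
# Five hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'room 42' 'a.b' 'x*y' '1234' 'plain' > "$sample"

# Lines containing at least one digit: 'room 42' and '1234'.
digits=$(grep -c '[0123456789]' "$sample")

# Lines containing a literal period or asterisk: 'a.b' and 'x*y'.
# Inside the brackets these characters need no backslash.
dot_or_star=$(grep -c '[.*]' "$sample")

# Lines containing at least one character other than 1, 2, 3, or 4:
# every line except '1234'.
not_1_to_4=$(grep -c '[^1234]' "$sample")

echo "$digits $dot_or_star $not_1_to_4"
# prints: 2 2 4

rm -f "$sample"
```

Note that a negated range still has to match some character, so [^1234] rejects only lines that consist entirely of the listed digits.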

Example. Suppose we want to find all the href attributes that point to URLs that (mistakenly) have a space in the URL pathname. The UNIX egrep program (extended grep) can be used with negated range expressions.

The regular expression "[^"]* [^"]*" helps egrep find all the href lines that have a space between the begin quote and end quote. The range operator is used in a clever way, i.e., to signify any character other than a quote.
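A sketch of this search follows. The href= prefix in front of the quoted expression is an assumption about how the full pattern was written, the two sample links are hypothetical, and grep -E is used as the modern equivalent of egrep.

```shell
# Two hypothetical links: the first has a space inside the quoted URL.
sample=$(mktemp)
printf '%s\n' '<a href="my page.html">bad</a>' \
              '<a href="page.html">good</a>' > "$sample"

# href="   literal text
# [^"]*    any run of characters other than a double quote
# (space)  the offending space
# [^"]*"   the rest of the URL up to the closing quote
bad_links=$(grep -E 'href="[^"]* [^"]*"' "$sample")
printf '%s\n' "$bad_links"
# prints: <a href="my page.html">bad</a>

rm -f "$sample"
```

The space between a and href in the second line does not count, because it lies outside the quoted URL.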

5.1.4. Determining Position

In UNIX regular expressions, two characters restrict matching to a specific location within the string. The beginning of a line of input is specified with (^), and the end with ($). Note that the caret has this meaning only outside the range delimiters.

Example. To find the HTML tags that are not closed before a line break, we use egrep with the regular expression '<[^>]*$', which instructs egrep to (1) find the start of an HTML tag (all such tags start with <), then (2) match a run of characters, none of which is a (>) sign, that extends to the end of the line.
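As a minimal sketch of this search, the following uses three hypothetical lines in which one tag is split across a line break. Here grep -E stands in for egrep.

```shell
# Hypothetical input: the <a ...> tag is opened on line 2 and closed on line 3.
sample=$(mktemp)
printf '%s\n' '<p>closed tag</p>' \
              '<a href="x.html"' \
              'class="nav">link</a>' > "$sample"

# <       start of a tag
# [^>]*   any run of characters that are not >
# $       end of the line, reached before any > was seen
unclosed=$(grep -E '<[^>]*$' "$sample")
printf '%s\n' "$unclosed"
# prints: <a href="x.html"

rm -f "$sample"
```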

The preceding examples are designed to provide some idea of how regular expressions are used in UNIX. Regular expression syntax varies somewhat from one UNIX program to another. To find out more about UNIX commands and libraries that use regular expressions, type man -k regular at the UNIX prompt.

5.2. Iteration and Control Statements


Various shells handle control statements (e.g., if..then..else, for, and while loops) in different ways. A new version of the Korn shell handles control structures more elegantly than the existing C shell, as summarized below.

Example. Suppose one wants to determine the maximum string size within a list of strings, for example, to determine the initial number of columns in a multi-column display. This could also be used to determine the maximum width for a column of entries. A typical shell implementation loops over the strings and uses an if statement to record the largest length seen so far, where if..then..fi are the if-statement keywords.
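One way such a computation might be written is sketched below. The three entries and the variable names are hypothetical, and ${#entry} is the standard shell expansion that gives the length of a string.

```shell
# Find the length of the longest string in a (hypothetical) list of entries.
maxlen=0
for entry in ls grep directory; do
    len=${#entry}                      # number of characters in this entry
    if [ "$len" -gt "$maxlen" ]; then  # if..then..fi: keep the larger value
        maxlen=$len
    fi
done
echo "$maxlen"
# prints: 9
```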

The Korn shell also provides for function definitions using the format function name { compound-list }, where name is the name of the function and compound-list is its body.

With this technique, one can define a function that has as its body a segment of UNIX scripting language.
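A minimal sketch of a function definition in this style follows. The function name greet and its body are hypothetical. The function keyword is accepted by the Korn shell and also by bash.

```shell
# Define a function whose body is an ordinary piece of shell script.
# $1 is the first argument passed to the function.
function greet {
    echo "hello, $1"
}

# Call the function like any other command and capture what it prints.
greeting=$(greet world)
echo "$greeting"
# prints: hello, world
```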

Variables declared with typeset inside a Korn shell function (one defined with the function keyword) have local scope, which means that the variable definition holds only within the function in which the variable is defined.

Example. Let a local variable v be defined such that it has precedence over the global variable v.

At the conclusion of this code fragment, the variable v will have value equal to 6.
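The scoping behavior can be illustrated with the sketch below. The values 5 and 6, the function name set_local, and the helper variable inner are assumptions chosen for illustration, with the global v set to 6 so that the outcome agrees with the text.

```shell
v=6                     # global variable v

function set_local {
    typeset v=5         # local v: hides the global v inside this function only
    inner=$v            # record what the function sees (inner is global)
}

set_local
echo "inside: $inner, after: $v"
# prints: inside: 5, after: 6
```

The assignment inside the function does not disturb the global v, because typeset created a separate local variable of the same name.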

Korn shell statements of the format (( expression )) denote arithmetic commands, which return True when the value of the enclosed expression is non-zero, and False when the expression evaluates to zero. The construct $((expression)) can be used as a word or part of a word, and is replaced by the value of expression.
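Both constructs are shown in the sketch below. The variable names and values are hypothetical, and the code runs in the Korn shell as well as in bash.

```shell
count=0

# (( expression )) is a command: it succeeds when the value is non-zero.
if (( count )); then
    status=nonzero
else
    status=zero
fi

# $(( expression )) is replaced by the value of the expression,
# either as a whole word or as part of a word.
width=$(( 4 * 5 + 3 ))
label="width=$(( width + 1 ))"

echo "$status $width $label"
# prints: zero 23 width=24
```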

Example. Consider the code fragment

The Korn shell evaluates the expression which includes an assignment to the .sh.value variable. Note that the expression

invokes the strlenList built-in function and returns the maximum width of the strings (given as integer values) in the entries[ ] array. The preceding code fragment then adds 3 to the maximum width value (e.g., for formatting purposes).

A conditional command in a Korn shell evaluates a test-expression and returns either True or False. For example, conditional commands can be used as part of an or list, an and list, or an if-elif-else command. Conditional commands have the format [[ test-expression ]].

When a conditional command is used in conjunction with an and list (&&), the Korn shell evaluates the test-expression and executes the and component only if the test-expression evaluates to True. For example, if a return statement follows the conditional command in an and list, then the return statement will be executed only if the test-expression evaluates to True.
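A sketch of a conditional command guarding a return statement follows. The function check_name and its messages are hypothetical, and -z is the standard test for an empty string.

```shell
function check_name {
    # and list: return executes only if the test-expression is True,
    # i.e., only if the first argument is the empty string.
    [[ -z $1 ]] && return 1
    echo "name is $1"
}

# With an empty argument the function returns early with status 1.
empty_status=0
check_name "" || empty_status=$?

# With a non-empty argument the test is False and the echo is reached.
named=$(check_name korn)

echo "$empty_status / $named"
# prints: 1 / name is korn
```

The same conditional command could equally be placed after if, or before || to form an or list.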

Iteration control in the Korn shell has two formats, namely, traditional and arithmetic-for.

Example. The traditional format, for name in word-list; do compound-list; done, iterates over each word in a list, assigning the word to name on each pass.
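A minimal sketch of the traditional format, using a hypothetical three-word list:

```shell
# Visit each word in the list in turn; $word holds the current one.
joined=""
for word in alpha beta gamma; do
    joined="$joined[$word]"
done
echo "$joined"
# prints: [alpha][beta][gamma]
```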

An arithmetic-for command has been provided that is very similar to the C programming language for statement. The format is for (( initExpression; condition; loopExpression )); do compound-list; done.

The initExpression is evaluated by the Korn shell prior to executing the for command. The condition is then evaluated prior to each iteration of compound-list. If the condition is nonzero, then the Korn shell executes the compound-list. The loopExpression is evaluated at the end of each iteration.
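The evaluation order just described can be seen in the sketch below, which sums the integers 1 through 4. The variable names are hypothetical, and the syntax is accepted by the Korn shell (ksh93) and by bash.

```shell
total=0
# initExpression: i = 1     (evaluated once, before the loop starts)
# condition:      i <= 4    (checked before each iteration)
# loopExpression: i++       (evaluated at the end of each iteration)
for (( i = 1; i <= 4; i++ )); do
    (( total += i ))
done
echo "$total"
# prints: 10
```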

5.3. Piping and I/O Redirection and Control


Piping is a mechanism in UNIX for directing the data that a program produces to another program that consumes it. Input/Output redirection facilitates storing the output of a program in a file (called output redirection) or using the contents of a file as input to a process (input redirection). UNIX provides standard I/O streams called stdin (standard input) and stdout (standard output) that function like IOCS (I/O Control) buffers in selected operating systems. In practice, piping and redirection allow the user to specify a source from which stdin will be read, or a target to which stdout will be written.

5.3.1. Piping

Piping of information is accomplished with the ("|") symbol.

Example. To pipe output from program1 into the input of program2, the syntax program1 | program2 would be employed.

The UNIX shell interpreter (command line processor) scans the command line from left to right, sees the pipe symbol (|), and connects the output of program1, which is written to stdout, to the stdin of program2. The two programs then run concurrently, with program2 consuming data as program1 produces it.
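As a minimal sketch of piping, the standard utilities printf, sort, and head stand in for program1 and program2. The three input words are hypothetical.

```shell
# printf writes three lines to stdout; the pipe feeds them to the stdin of sort.
sorted=$(printf 'pear\napple\nfig\n' | sort)

# Pipes can be chained: the output of sort becomes the input of head.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)

printf '%s\n' "$sorted"
echo "first: $first"
# prints:
#   apple
#   fig
#   pear
#   first: apple
```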

Here follows a concrete example to help you understand the use of piping.

Example. Suppose you have a program, which we will call program1, that produces hundreds of lines of screen output. Further assume that program1 has command-line options, which we will denote as -options.

Recall from our previous discussion that the more command can be used to display a file or input stream in UNIX one screen at a time. By using the piping command program1 -options | more, the output of program1 (under constraint of whatever options are specified) is sent to the more program, which displays this information pagewise.

For example, suppose you have a directory with many files, and you want to display the directory contents in detail (i.e., using the ls -l command). Instead of having to use the scrollbar to get through the directory listing, it is often more efficient to type ls -l | more, which will display the current directory pagewise.

A similar feature is available in MS-DOS (not by coincidence).

5.3.2. I/O Redirection

Suppose one is running a program that writes to stdout, and it is desired that the output go to a file instead. UNIX provides a feature for implementing this, called output redirection. A symmetric case exists for implementing file input to a program that otherwise receives its input from stdin.

UNIX has two types of output redirection, namely, creation/replacement and appending. In the former case, program output is directed to a file that is opened as new (if it did not exist before) or overwrites an older version (if it exists on disk). Creation or replacement uses the operator (">"), while appending to output uses the operator (">>"). Input redirection uses the operator ("<"). Note that appending something to input is a meaningless concept in UNIX.

Example. To redirect output from program1 into a file out1.txt, the syntax program1 > out1.txt would be employed.
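The replacement behavior of the > operator can be seen in the sketch below, where echo stands in for program1 and a temporary file stands in for out1.txt. Both substitutions are made only for illustration.

```shell
out=$(mktemp)

# The first run creates (or replaces) the contents of the file.
echo "first run" > "$out"

# The second run with > replaces the earlier contents rather than adding to them.
echo "second run" > "$out"

contents=$(cat "$out")
echo "$contents"
# prints: second run

rm -f "$out"
```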

Two concrete examples follow, which will help illustrate this powerful capability of UNIX.

Example. Suppose you have a program (program1) that has options denoted by -options, and you want that program to take input from a file data.in and write output to data.out. The command program1 -options < data.in > data.out could be used.

The UNIX shell command processor scans the whole command line, connects data.in to stdin and data.out to stdout, and then runs program1 with data from data.in. As program1 produces output, this output will be written to data.out.
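A minimal sketch of combined input and output redirection follows. Here sort -n is a stand-in for program1 -options, and the contents of data.in are hypothetical.

```shell
workdir=$(mktemp -d)
printf '3\n1\n2\n' > "$workdir/data.in"

# stdin is read from data.in and stdout is written to data.out;
# sort itself never sees either file name.
sort -n < "$workdir/data.in" > "$workdir/data.out"

result=$(cat "$workdir/data.out")
printf '%s\n' "$result"
# prints:
#   1
#   2
#   3

rm -rf "$workdir"
```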

Example. Suppose you want to redirect the output of a detailed directory listing into a file directory.txt. The command line ls -l | cat > directory.txt could be employed.

Note that piping to the cat command simply copies the output of the ls command to stdout, where it is redirected to the file or displayed on the screen (if redirection is not used). The shorter command ls -l > directory.txt produces the same file.
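The sketch below redirects a detailed listing to a file both with and without the pipe through cat, and then compares the two results. The directory layout and the file names a.txt, b.txt, and direct.txt are hypothetical.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/files"
touch "$workdir/files/a.txt" "$workdir/files/b.txt"

# Listing piped through cat, then redirected to a file.
ls -l "$workdir/files" | cat > "$workdir/directory.txt"

# Listing redirected directly.
ls -l "$workdir/files" > "$workdir/direct.txt"

# Count the listed .txt entries and compare the two output files.
entries=$(grep -c '\.txt$' "$workdir/directory.txt")
if cmp -s "$workdir/directory.txt" "$workdir/direct.txt"; then
    same=yes
else
    same=no
fi
echo "$entries entries, identical: $same"

rm -rf "$workdir"
```

The output files are kept outside the directory being listed so that they do not appear in their own listing.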

Assume that a program (prog1) inputs data from stdin and writes output to stdout. Suppose we want to run prog1 repeatedly, to build up an output file that consists of many instances of running prog1. The UNIX appending operator (>>) is useful in this respect, as shown in the following example.

Example. Suppose you have a program (prog1) that computes the mean of a list of numbers from stdin, then writes the filename and the mean to stdout. Further assume that you want to produce a report that portrays the results of applying this program to many different lists of numbers. The command line prog1 < input.dat >> output.rpt could be employed, where input.dat denotes one instance of the input file, and output.rpt denotes the accumulated record of running prog1 on many different instances of the input file.
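The accumulation of a report with >> can be sketched as follows. The prog1 defined here is a hypothetical stand-in written with awk that prints only the mean, and the two input files are made up for the demonstration.

```shell
workdir=$(mktemp -d)
printf '1\n2\n3\n' > "$workdir/list1.dat"
printf '10\n20\n'  > "$workdir/list2.dat"

# Hypothetical stand-in for prog1: print the mean of the numbers read from stdin.
prog1() {
    awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }'
}

# Each run appends one line to the report instead of replacing it.
for input in "$workdir/list1.dat" "$workdir/list2.dat"; do
    prog1 < "$input" >> "$workdir/output.rpt"
done

report=$(cat "$workdir/output.rpt")
printf '%s\n' "$report"
# prints:
#   2
#   15

rm -rf "$workdir"
```

Had > been used in place of >>, only the result of the last run would remain in the report.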

The preceding commands show how the operators |, <, >, and >> combine programs and files on a single command line. There are other UNIX operators that also allow different functions to occur on the command line, which will be discussed in Section 6 of these notes.


This concludes our overview of basic UNIX control structures. We next discuss the software development process with a UNIX operating system.

