Data Structures: § 1: Introductory Material

Instructor: M.S. Schmalz


Computer program design can be made much easier by organizing information into abstract data structures (ADS) or abstract data types (ADTs). For example, one can model a table of numbers that has three columns and an indeterminate number of rows, in terms of an array with two dimensions: (1) a large number of rows, and (2) three columns.

A key feature of modern computer programs is the ability to manipulate ADS using procedures or methods that are predefined by the programmer or software designer. This requires that data structures be specified carefully, with forethought, and in detail.

This section is organized as follows:

In this class, we will concentrate only on data structures called arrays, lists, stacks, queues, heaps, graphs, and trees. In order to address the topics and algorithms covered in this class, we present each data structure in terms of a unifying formalism, namely, the ADT and its associated operations, called the ADT sequence.

We then discuss discrete math concepts and notation, and present an overview of the Java programming language, which is the language of choice for this class. This will help you establish background proficiency in theory and programming languages required for successful completion of the projects.

Complexity analysis is an important tool of the algorithm designer, which we discuss in sufficient detail to permit analysis of the algorithms presented in this class. We also discuss the use of recursion and the complexity analysis technique of recursion relations. Recursion is an important component of many modern software systems, programs, and modules. We will see that the design of recursive programs is somewhat different from the linear, sequential flow of control that you have become accustomed to in beginning-level programming. Recursion relations are thus required to analyze this different method of programming in a way that is decidedly different from analysis of algorithms with linear, sequential flow.

Arrays and lists are simple data structures that facilitate a wide variety of operations on ADTs, and thus form the basis for modern algorithm and software design and implementation. In order to find data stored in arrays or lists we perform an operation called searching. When this process is associated with order statistics (e.g., finding the median of a sequence), this is sometimes called selection. In order to facilitate efficient searching, we need to be able to sort the numbers or strings (sequences of characters) over which we want to search. In the case of real numbers, sorting means that the numbers are arranged in order of increasing value.

1.1. Introduction and Overview

In order to set the proper context for this class, which is concerned with low-level mechanisms that facilitate good software design, we begin with a discussion of software design and engineering principles.

1.1.1. Software Design & Engineering Goals

Software design can be divided into two levels. High-level software design and analysis seeks to produce correct, efficient software. At a low level, we are more concerned with robustness, adaptability, and reusability.

1.1.1.1. High-level Software Design Goals. In order to produce programs that work as specified for all instances of input, the following design goals must be achieved at a high level:

Correctness -- a data structure or algorithm is designed, at a high level, to work as specified for all possible inputs.

Example. A sorting algorithm must arrange its inputs in ascending or descending order only.

Efficiency implies (a) fast runtime, and (b) minimal use of computational resources.

Example. Algorithm design for real-time systems is not just a matter of using fast hardware. One must produce fast algorithms that can take advantage of commercial off-the-shelf (COTS) hardware, for decreased production cost.

1.1.1.2. Low-level Software Design Goals. Correctly designed and functioning programs must achieve the following goals within an implementational perspective:

Robustness -- a program produces correct outputs for all inputs on any hardware platform on which the program is implemented. Robustness includes the concept of scalability -- the capability of a data structure or algorithm to handle expanding input size without incurring prohibitive complexity or causing inaccuracy or failure.

Adaptability -- software must be able to be modified or evolved over time to meet environmental changes (e.g., increased CPU speed, client-server implementation, or new functionality to increase market demand).

Example. The Year 2000 Problem is an excellent example of poor software design, where date fields in old software (also called legacy code) were not designed to be large enough to accommodate date codes for an interval greater than 100 years.

Reusability -- a given partition or piece of code must be able to be used as a component in different systems or in different application domains without significant modification.

Example. Upgrades to existing software systems should be simple to design and implement, but often aren't, due to idiosyncratic and poorly engineered programming.

To meet the above-listed goals, object-oriented programming evolved from procedural and functional programming.

1.1.1.3. Object-Oriented Design Principles help a programmer or software designer to organize the structure and functionality of code or programs according to several guiding principles, which include:

  1. Abstraction -- One must be able to distill a complicated system into a simple, concise, clear description of system structure and functionality.

    Example. The "Cut and Paste" operations on a word processor's Edit Menu are simple, elegant, and intuitively obvious, but may be implementationally complex.

    Observation. Applying the principle of abstraction to data structures produces Abstract Data Types (ADTs).

    Definition. An ADT is a mathematical model of a data structure that specifies (a) type of data stored and (b) operations on the stored data, with specification of the types of parameters in the operations. In particular, an ADT specifies what each operation does, but not necessarily how a given operation works.

    Example. In Java, an ADT is modeled as an interface and implemented by a class. Classes specify how a given operation on the ADT's data is to be performed.

    Definition. A Java class implements an interface if and only if the class methods include those of the interface. Note that a given class can implement multiple interfaces.
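
    By way of illustration, here is a minimal Java sketch (the names IntStack and ArrayIntStack are hypothetical, not from any standard library): the interface specifies what each operation does, and the implementing class specifies how.

        interface IntStack {                     // the ADT: what each operation does...
            void push(int x);
            int pop();
            boolean isEmpty();
        }

        class ArrayIntStack implements IntStack {   // ...the class: how it works
            private int[] data = new int[100];      // fixed capacity, for simplicity
            private int top = 0;                    // index of the next free slot

            public void push(int x)  { data[top++] = x; }
            public int pop()         { return data[--top]; }
            public boolean isEmpty() { return top == 0; }
        }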

  2. Encapsulation -- different components of a software system should implement an abstraction without revealing internal details of the implementation. This is called information hiding.

    Example. An Edit menu on a word processor is completely useful without the user knowing how the Cut and Paste processes work.

    Advantages of encapsulation include:

Loose coupling between classes -- the programmer only needs to meet the interface specification, without worrying about the details of a given method's functionality.

    Adaptability -- code can be changed without adversely affecting program functionality, due to the presence of a uniform interface.

  3. Modularity -- the presence of an organizing structure that partitions software systems into functional units based on hierarchical relationships.

    Example. Two kinds of hierarchical relationships include the has-a and is-a relations. The has-a relation induces component hierarchies such as the one shown in Figure 1.1.1. Here, the door, window, wall, and floor are constituents of a building. The is-a relation induces membership hierarchies such as the one shown in Figure 1.1.2, where residential and commercial structures are both instances of a more general type of structure called a building.


    Figure 1.1.1. Component hierarchy based on has-a relation.


    Figure 1.1.2. Membership hierarchy based on is-a relation.

    Advantages of modularity and hierarchical design include:

    Grouping of common functionality at the most general level.

    Viewing specialized behavior and attributes as extensions of one or more general cases, also called inheritance.

1.1.1.4. Object-Oriented Design Techniques. In order to implement object-oriented design paradigms, the following techniques have been developed and are available in most object-oriented languages such as C++ and Java:

Classes and Objects -- the actors in the object-oriented programming (OOP) paradigm, which have methods that are the actions performed by the objects. Classes and objects facilitate abstraction and, hence, robustness.

Interfaces and Strong Typing facilitate efficient interpretation or compilation of the source language, without confusing ambiguities. This facilitates encapsulation and, hence, adaptability.

Inheritance and Polymorphism allow a given object or subclass to specialize the attributes or capabilities of the parent class. For example, in Figure 1.1.2, a skyscraper specializes the description of an office building. Polymorphism refers to the ability of a given operation or method to admit different types of operands that are specializations of a higher-level class description. Inheritance and polymorphism facilitate modularity and reusability when class descriptions are available to designers and programmers.

We examine each of these concepts in detail, as follows.

  1. Classes and Objects are the building blocks of object-oriented design and programming.

    Definition. Objects are actors in the OOP paradigm.

    Definition. A class is a specification of the data fields (also called instance variables) that a class contains, as well as those operations that a class can execute.

    Concept of Reference: When an object is created, the OOP interpreter or compiler (1) allocates memory for the given object, and (2) binds a variable to the object, where the variable is said to reference the object.

    Interaction. A class or method can either (a) execute one of an object's methods or (b) look up the fields of an object.

    Observation. In OOP, objects interact via message passing. For example, two objects o1 and o2 can interact by o1 requesting o2 to:

    1. Print a description of itself,

    2. Convert o2 to a string representation, or

    3. Return the value of one of o2's data fields.

    Alternatively, o1 can access o2's fields directly, but only if o2 has given o1 permission for such access. This is an example of information hiding.
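
    A minimal Java sketch of these interactions (the class Point and its methods are hypothetical illustrations): the fields of the object are hidden, and other objects interact with it only through the messages (methods) it exposes.

        class Point {
            private double x, y;                 // fields are hidden (information hiding)
            Point(double x, double y) { this.x = x; this.y = y; }
            double getX() { return x; }          // grants access to a field via a method
            public String toString() {           // returns a string description of itself
                return "(" + x + ", " + y + ")";
            }
        }

        class Demo {
            public static void main(String[] args) {
                Point p = new Point(3.0, 4.0);  // memory is allocated; p references the object
                System.out.println(p);          // message: "print a description of yourself"
                double x = p.getX();            // message: "return the value of a data field"
            }
        }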

  2. Interfaces and Strong Typing.

    Observation. Each class must specify the Application Programming Interface (API) that it presents to other objects. The ADT approach is to specify the interface as a class type definition and a collection of methods for that type, where all arguments for each method must have specific types.

    Remark. Tight type specifications (strong typing) are rigidly enforced by the interpreter or compiler, which requires rigorous type matching at runtime.

    Note: This encapsulation is achieved at the cost of work to the programmer.

  3. Inheritance and Polymorphism.

    Observation. Inheritance involves design of high-level classes that can be specialized to more particular class descriptions.

    Example. The Java class Number can be specialized to Float, Integer, Double, or Long.

    Definition. A base class or superclass is the most general class.

    Observation. Specialized classes or derived classes inherit base properties or methods from the general class, which are specialized for a given subclass. Figure 3 illustrates this concept for the general class of Animal, which can include subclasses of dog, horse, and human, among others.


    Figure 3. Derivation of subclasses with methods also derived from a superclass of Animal.

    Definition. Polymorphism refers to the ability of an object to take different forms.

    Example. A reference variable v can reference an object o, but the declaration of v must specify the class C to which v is allowed to refer.

    If Class D extends Class C (which it does in Figure 3), then v can refer to any object of Class D.

    When v refers to an object from Class D that defines its own version of a noise-making method, invoking that method executes the object's specialized version -- bark(), neigh(), talk(), or sing(), as appropriate. In this case, D is said to override the corresponding method from Class C.

    When v references an object in Class D that does not define its own version of the method, invoking the method executes the inherited make-noise(), since the object-oriented interpreter or compiler searches from the most specific to the most general class. Thus, v can be polymorphic.

    Observation. This kind of functional organization allows a specialized class D to:

    1. Extend a class (e.g., C in Figure 3),
    2. Inherit generic methods from C, and
    3. Redefine other methods from C to account for specific properties of objects in D, as sketched below.
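
    The following minimal Java sketch illustrates items 1-3 for the Animal hierarchy of Figure 3 (Java identifiers cannot contain hyphens, so make-noise appears here as makeNoise; the class and method details are illustrative):

        class Animal {
            void makeNoise() { System.out.println("generic animal noise"); }
        }

        class Dog extends Animal {                            // Dog extends Animal
            @Override
            void makeNoise() { System.out.println("bark"); }  // redefines (overrides) it
        }

        class Horse extends Animal { }    // inherits makeNoise() unchanged

        class Zoo {
            public static void main(String[] args) {
                Animal v;            // v may refer to any object of a subclass of Animal
                v = new Dog();
                v.makeNoise();       // dynamic dispatch finds Dog's version: prints "bark"
                v = new Horse();
                v.makeNoise();       // no override in Horse: the inherited version runs
            }
        }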

    Remark. The algorithm that processes a message invoking a given method, by finding the most specific applicable implementation of that method, is called the dynamic dispatch algorithm.

    Definition. A method's signature is the combination of the method-name together with the type and number of arguments passed to the method.

    Implementation. Java provides a watered-down kind of polymorphism called overloading, where even one class D can have multiple methods with the same name, provided that each method has a different signature.
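
    A minimal sketch of overloading in Java (the class Printer and its methods are hypothetical): three methods share one name but have different signatures, and the compiler selects among them by the number and types of the arguments.

        class Printer {
            void print(int i)           { System.out.println("int: " + i); }
            void print(double d)        { System.out.println("double: " + d); }
            void print(String s, int n) { System.out.println(s + " x" + n); }
        }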

    Observation. Inheritance, polymorphism, and method overloading support reusability.

1.1.2. Abstract Data Structures and ADTs

The purpose of an abstract data structure (ADS) is to narrow the semantic gap between the user's programming constructs or application software and the computing hardware (otherwise known as the bare machine).

When instances of a given ADS can be constructed from a higher-level description (e.g., a Java class definition), then we say that such instances have a given Abstract Data Type (e.g., array, list, stack, ...). In this class we will study several ADTs, but they will each have a common set of generic operations, which are discussed as follows.

1.1.3. The ADT Sequence

There is a more-or-less common set of operations that can be performed on each ADT, which we list as follows. Here a given ADT is assumed (e.g., the current class), a value in an ADT element is denoted by x, and the position of an ADT element in a multi-element ADT is denoted by "i".

IsFull() returns True if the ADT cannot contain any more data, and False otherwise.

IsEmpty() returns True if the ADT contains no data, and False otherwise.

Size() returns an integer greater than or equal to zero, which denotes the number of elements in the ADT.

PositionOf(x) returns an integer or coordinate that denotes the position of the first instance of x encountered while searching through the elements of the ADT in a prespecified scanning order.

ValueAt(i) returns the value of the i-th element in the ADT.

InsertValue(x,i) inserts the value x into the ADT via the following steps:

  1. Create a new element of type compatible with the ADT.

  2. Link the element to other elements in the ADT, making the new element the i-th element, and making all subsequent elements have indices (e.g., i+1, i+2, etc.) that follow the index of the new element, in the precedence order inherent in the ADT's definition.

  3. Place the value x in the new element.

ReplaceValueAt(i,x) puts the value x in the i-th element of the ADT without disturbing the order of elements in the ADT.

DeleteElement(i) removes the i-th element of the ADT, restoring the ADT's order property. This essentially reverses the order of steps used in the InsertValue operation.

OutputADT usually writes the ADT to disk, either as an object in the native programming language (e.g., Java object serialization), or as an ASCII string representation.

Note that all of the preceding methods might not be used with a given ADT. For example, an array does not facilitate easy insertion or deletion of its elements, but a list does.
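
As a sketch of how these generic operations might be collected in Java (method names mirror the list above; the use of int element values is an illustrative assumption), each ADT class in this course could implement an interface such as:

    interface ADTSequence {
        boolean isFull();                  // True if no more data can be stored
        boolean isEmpty();                 // True if no data is stored
        int size();                        // number of elements, >= 0
        int positionOf(int x);             // position of the first instance of x
        int valueAt(int i);                // value of the i-th element
        void insertValue(int x, int i);    // make x the new i-th element
        void replaceValueAt(int i, int x); // overwrite the i-th element with x
        void deleteElement(int i);         // remove the i-th element
        String outputADT();                // e.g., an ASCII string representation
    }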

1.1.4. Representing ADTs in a Programming Language

In Java, which is the language of choice for this class, we represent each ADT as a class. In earlier languages like FORTRAN, there was a fixed set of datatypes, which could not be altered. In PASCAL, one could predefine datatypes, but there were limitations on what kind of data the datatypes could hold. This definition process was made more flexible in C, and in object-oriented languages such as C++ and Java the encapsulation of class definitions and methods unified the earlier distinction between "data" and "procedure" in terms of objects.

It is worth noting that certain languages before PASCAL had very flexible but idiosyncratic datatype definitions. SNOBOL, an early string processing language, had a very flexible ADT definition system and a powerful pattern-matching language that could be customized (albeit with some effort) to find data stored in ADTs. PL/1 made some feeble attempts at more flexible datatype definitions, but was too large a language to be of practical use, and died an unmourned death after several revisions.

1.2. Review of Discrete Math Concepts and Notation

This review is given in Appendix A, and is thus not reproduced here.

1.3. Overview of Java Programming Language

A thorough review of Java, with tutorials, is organized in hierarchical form in Appendix B, and is thus not reproduced here. Also, Dr. Crummer's Java tutorial is a good place to get started programming in Java. Additional textbooks and references are also cited there.

1.4. Overview of Complexity Notation

In this section, we discuss some concepts of complexity, including Big-Oh notation.

Concept. It is useful to be able to specify the complexity of a function or algorithm, i.e., how its computational cost varies with input size.

Background and Application. Complexity analysis is key to the design of efficient algorithms. Note that the work or storage (memory) requirement of a given algorithm can vary with the architecture (processor) that the algorithm runs on, as well as the type of software that is being used (e.g., runtime modules generated from a FORTRAN versus C++ compiler). Thus, complexity analysis tends to be an approximate science, and will be so presented in this sub-section.

In particular, when we analyze the complexity of an algorithm, we are interested in establishing bounds on work, space, or time needed to execute the computation specified by the algorithm. Hence, we will begin our exposition of complexity analysis by analyzing the upper bound on the growth of a function.

Representation. "Big-Oh Notation" (e.g., O(x2) reads Big-Oh of x-squared) is used to specify the upper bound on the complexity of a function f. The bound is specified in terms of another function g and two constraints, C and k.

Definition. Let f,g : R -> R be functions. We say that f(x) = O(g(x)) if there exist constants C, k ∈ R such that

|f(x)| ≤ C · |g(x)| ,

whenever x > k.

Example. Let f(x) = x^3 + 3x^2 and g(x) = x^3. If C = 2, then

|f(x)| = |x^3 + 3x^2| ≤ 2 · |x^3| ,

whenever x > 3, since 3x^2 ≤ x^3 for x ≥ 3. The constants C = 2 and k = 3 fulfill the preceding definition, and f(x) = O(g(x)) = O(x^3).

In the next section, we will discuss composition of big-Oh estimates, to determine an upper bound on the complexity of composite functions and algorithms.

1.4.1. Algorithms and Algorithmic Complexity

We begin by noting that, in this section, we use algorithms to introduce salient concepts and to concentrate on analysis of algorithm performance, especially computational complexity.

Definition. An algorithm is a finite set of instructions for performing a computation or solving a problem.

Types of Algorithms Considered. In this course, we will concentrate on several different types of relatively simple algorithms, namely:

Selection -- Finding a value in a list, counting numbers;

Sorting -- Arranging numbers in order of increasing or decreasing value; and

Comparison -- Matching a test pattern with patterns in a database.

Properties of Algorithms. It is useful for an algorithm to have the following properties:

Input -- Values that are accepted by the algorithm are called input or arguments.

Output -- The result produced by the algorithm is the solution to the problem that the algorithm is designed to address.

Definiteness -- The steps in a given algorithm must be well defined.

Correctness -- The algorithm must perform the task it is designed to perform, for all input combinations.

Finiteness -- Output is produced by the algorithm after a finite number of computational steps.

Effectiveness -- Each step of the algorithm must be performed exactly, in finite time.

Generality -- The procedure inherent in a specific algorithm should be applicable to all algorithms of the same general form, with minor modifications permitted.

Concept. In order to facilitate the design of efficient algorithms, it is necessary to estimate the bounds on complexity of candidate algorithms.

Representation. Complexity is typically represented via the following measures, which are numbered in the order that they are typically computed by a system designer:

Work W(n) -- How many operations of each given type are required for an algorithm to produce specified output given n inputs?

Space S(n) -- How much storage (memory or disk) is required for an algorithm to produce specified output given n inputs?

Time T(n) -- How long does it take to compute the algorithm (with n inputs) on a given architecture?

Cost C(n) = T(n) · S(n) -- Sometimes called the space-time bandwidth product, this measure tells a system designer what expenditure of aggregate computational resources is required to compute a given algorithm with n inputs.

Procedure. Analysis of algorithmic complexity generally proceeds as follows:

Step 1. Decompose the algorithm into steps and operations.

Step 2. For each step or operation, determine the desired complexity measures, typically using Big-Oh notation, or other types of complexity bounds discussed below.

Step 3. Compose the component measures obtained in Step 2 via theorems presented below, to obtain the complexity estimate(s) for the entire algorithm.

Example. Consider the following procedure for finding the maximum of a sequence of numbers. An assessment of the work requirement is given to the right of each statement.

    Let {a_n} = (a_1,a_2,...,a_n)
     { max = a_1               # One I/O operation
       for i = 2 to n do:      # Loop iterates n-1 times
         { if a_i > max then   # One comparison per iteration
              max := a_i }     # Maybe one I/O operation
     }

Analysis. (1) In the preceding algorithm, we note that there are n-1 comparisons within the loop. (2) In a randomly ordered sequence, half the values will be less than the mean, and a_1 is assumed to have the mean value (for purposes of analysis). Hence, there will be an average of n/2 replacements of the value max with a_i, which together with the initial assignment yields n/2 + 1 I/O operations. (3) This means that the preceding algorithm is O(n) in comparisons and I/O operations. More precisely, we assert that, in the average case:

W(n) = n-1 comparisons + (n/2 + 1) I/O operations.

Exercise. In the worst case, there would be n I/O operations. Why is this so? Hint: This may be a quiz or exam question.
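
For concreteness, the preceding pseudocode might be rendered in Java as follows (a direct translation; the method name findMax is ours, and the method can live inside any class):

    static int findMax(int[] a) {              // assumes a has at least one element
        int max = a[0];                        // one I/O operation
        for (int i = 1; i < a.length; i++) {   // the loop iterates n-1 times
            if (a[i] > max) {                  // one comparison per iteration
                max = a[i];                    // maybe one I/O operation
            }
        }
        return max;
    }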

Useful Terms. When we discuss algorithms, we often use the following terms:

Tractable -- An algorithm belongs to class P, such that the algorithm is solvable in polynomial time (hence the use of P for polynomial). This means that complexity of the algorithm is O(n^d), where d may be large (which typically implies slow execution).

Intractable -- The algorithm in question requires greater than polynomial work, time, space, or cost. Approximate solutions to such algorithms are often more efficient than exact solutions, and are preferred in such cases.

Solvable -- An algorithm exists that generates a solution to the problem addressed by the algorithm, but the algorithm is not necessarily tractable.

Unsolvable -- No algorithm exists for solving the given problem. Example: Turing showed that it is impossible to decide whether or not a program will terminate on a given input.

Class NP -- NP denotes nondeterministic polynomial time. For a problem in Class NP, no algorithm with polynomial worst-case complexity is known; whether such an algorithm exists is an open question.

Class NP-Complete -- If any problem is in Class NP-Complete, then any polynomial-time algorithm that could be used to solve this problem could solve all NP-complete problems. Also, it is generally accepted, but not proven, that no NP-complete problem can be solved in polynomial time.

1.4.2. Big-Oh Notation and Associated Identities.

We have previously defined Big-Oh notation. In this section, we present identities for Big-Oh notation that allow the computation of complexity estimates for composite functions, with examples.

Theorem 1. (Polynomial Dominance) If f(x) = a_n·x^n + ... + a_1·x + a_0, then f(x) = O(x^n).

Example. Find the Big-Oh complexity estimate for n!

    Observe that n! = 1 · 2 · 3 · ... · n ≤ n · n · n · ... · n (n factors) = n^n.

    Hence, n! = O(n^n) .

Theorem 2. (Additive Combination) If f1(x) = O(g1(x)) and f2(x) = O(g2(x)), then

(f1 + f2)(x) = O(max(g1(x), g2(x))) .

Theorem 3. (Multiplicative Combination) If f1(x) = O(g1(x)) and f2(x) = O(g2(x)), then

(f1 · f2)(x) = O(g1(x) · g2(x)) .

Example. Estimate the complexity of f(n) = 3n · log(n!).

Step 1. From the previous example, n! = O(n^n); hence log(n!) = O(log(n^n)) = O(n · log(n)) .

Step 2. Let f1 = 3n and f2 = log(n!), with f = f1 · f2 .

Step 3. Note that 3n = O(n).

Step 4. From the law of multiplicative combination of complexity estimates (as shown in Theorem 3, above), it follows that

f(n) = f1(n) · f2(n) = O(n) · O(n · log(n)) = O(n^2 · log(n)) .

1.4.3. Big-Omega and Big-Theta Notation.

Big-Oh notation establishes an upper bound on an algorithm's complexity, since |f(x)| ≤ C · |g(x)| for x greater than some constant k, given constant C. (The less-than-or-equal sign denotes that Big-Oh describes an upper bound.)

Concept. To establish a lower bound on complexity, we use Ω (called Big-Omega) and Θ (called Big-Theta) notation.

Definition. If f,g : R -> R are functions and there exist constants C, k ∈ R+ such that

|f(x)| ≥ C · |g(x)| ,

whenever x > k, then f(x) is said to be Ω(g(x)).

Example. If f(x) = 8x^3 + 5x^2 + 7, then f(x) is Ω(x^3), since f(x) ≥ 8x^3 for all x ∈ R+ .

Definition. If f,g : R -> R are functions, where (a) f(x) is O(g(x)) and (b) f(x) is Ω(g(x)), then f(x) is said to be Θ(g(x)), or Big-Theta of g(x) .

Note: When Big-Theta notation is used, then g(x) is usually a simple function (e.g., n, n^m, log n).

1.5. Recursive Computation

It is often useful to write functions, procedures, or methods that call themselves. This technique, called recursion, is employed extensively in computer science, particularly in the solution of problems involving graphs or trees. In this section, we discuss the impact of recursive procedures on computational cost, with examples.

Concept. A recursive procedure calls itself. Thus, in a sequence of n procedure calls, the output of the n-th procedure becomes the input of the (n-1)-th procedure, which becomes the input of the (n-2)-th procedure, and so forth, until the top-level procedure is reached. This procedure processes the output of the 2nd procedure call in the recursive sequence, and returns a net result.

Example. Let a family of functions f_i compute the recurrence:

f_i(x) = f_{i-1}(x) + x , i = 1..n , with base case f_0(x) = 0,

which simply computes x + x + ... + x (n terms in all), i.e., n·x.

Observation. If a procedure that has computational work requirement W (exclusive of the function call overhead) calls itself n times, then that procedure incurs a cost of nW operations, plus the additional overhead of n-1 function calls. Note: The first function call would occur whether the procedure was iterative or was called recursively.

Remark. The overhead of additional function calls is one of the significant disadvantages of recursive programming. An additional disadvantage is the effort involved in writing a recursive version of an iterative procedure. The following example is illustrative.

Example. Observe the following two code fragments, each of which computes x factorial, where x is a positive integer:

    NONRECURSIVE             RECURSIVE
    
    factorial(x : int)       factorial(x : int):
    { fact = 1               { if x <= 1 then return(1)
      for i = 2 to x           else return(x * factorial(x-1))
        fact = fact * i      }
      endfor
      return(fact)
    }

Observe that the nonrecursive or iterative algorithm is more verbose than the recursive formulation. Both compute x!, and both require x-1 multiplications to do so. However, the nonrecursive algorithm has an overhead of x-1 loop index incrementations, whereas the recursive algorithm requires x-1 additional function calls, as discussed previously.
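
Both fragments can be rendered directly in Java, for example (a sketch; long is used so that larger factorials do not overflow int as quickly):

    static long factorialIterative(int x) {
        long fact = 1;
        for (int i = 2; i <= x; i++) {   // x-1 loop index incrementations
            fact = fact * i;
        }
        return fact;
    }

    static long factorialRecursive(int x) {
        if (x <= 1) return 1;                    // base case terminates the recursion
        return x * factorialRecursive(x - 1);    // x-1 additional function calls
    }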

Example. Observe the following two code fragments, each of which computes the sum of a sequence a of n elements.

    NONRECURSIVE                       RECURSIVE
    
    sum(a : array [1..n] of int)       sum(a : array [1..n] of int, i,n : int)
    { sum = 0                          { if i = n
      for i = 1 to n                       then return(a[n])
        sum = sum + a[i]                 else
      endfor                               return(a[i] + sum(a,i+1,n))
      return(sum)                      }
    }                                  # initial call: sum(a,1,n)

As before, note that the nonrecursive or iterative algorithm has more lines of code than the recursive formulation. Both compute sum(a), and both require n-1 additions to do so. However, the nonrecursive algorithm has an overhead of n-1 loop index incrementations, whereas the recursive algorithm requires n-1 incrementations of index i, together with n-1 additional function calls, as discussed previously. This means that a recursive algorithm would probably not be the best way of computing a vector sum.

Observation. In most computing systems, the overhead of function calls far exceeds the work requirement of incrementing a loop index. Thus, the recursive formulation is usually (but not always) more expensive. In practice, the nonrecursive procedure is more expensive only if computation of the index i, or some other index derived from i, requires so many arithmetic operations that their aggregate cost exceeds that of the additional function calls of the recursive computation. However, the recursive formulation has the advantage of being concise, and of portraying the kinds of function definitions that mathematicians often write, with much less effort than iterative formulations of the same function.

Hence, recursive functions are likely to remain useful for some time in the future. Applications of recursive function calls abound, for example, in compilers, parsing of formal and natural language text, and in algorithms that are based on data structures called graphs or trees.

1.6. Arrays and Lists.

We begin with several observations about the use of arrays in computer programs.

Observation. In the early days of computer programming, machines were dedicated to the task of computing tables of artillery trajectories (WWII) and tables of accounting and business inventory information (early 1950s). Thus, numerically intensive computing machines were designed to handle linear or two-dimensional arrays. When computer programming became better established and scientific applications came into vogue, the FORTRAN (FORmula TRANslation) language was developed which supported multiply-dimensioned arrays.

Definition. An array is a data structure whose domain is a finite subset of Euclidean n-space R^n.

Observation. Arrays are customarily assigned the datatype of the elements that they contain, which can be one and only one datatype. For example, arrays can be integer-, real-, string-, or character-valued, but elements of more than one such type are not usually contained in an array.

Example. The vector a = (2,3,-1,5,6,0,9,-7) is a one-dimensional integer-valued array. The first value, denoted by a(1), equals 2. The i-th value in the array is denoted by a(i), i = 1..8, because there are eight elements in the array.

Example. The two-dimensional array shown below has four columns and three rows. Each element of the array is referenced by its (row,column) coordinate. For example, the element whose value equals 9.2 is in row 3, column 4, which we write as a(3,4) = 9.2 .


Figure 1.6.1. An example of a two-dimensional array.

Remark. The use of row-column indices or coordinates makes referencing elements of arrays convenient. It is especially useful to note that arrays can be indexed in loops. For example, a loop that would set all the elements of the array a in Figure 1.6.1 to zero could be written in pseudocode as:

         :
       DECLARE a : array [3,4] of real;
         :
       FOR i = 1 to 3 DO:
         FOR j = 1 to 4 DO:
           a(i,j) := 0.0
         ENDFOR
       ENDFOR

Observation. In the above pseudocode fragment, each dimension of the array a has a lower and upper limit to the subscripts that are allowed. One finds the size of each dimension by subtracting the lower limit from the upper limit, then adding one.

Example. If an array is dimensioned as:

     b : array [-2..3,5..9,4..8] of integer; 

then the number of elements in the array is computed as follows:

     STEP 1: Dimension #1 is of size (3 - -2 + 1) = 6
     STEP 2: Dimension #2 is of size (9 - 5 + 1) = 5
     STEP 3: Dimension #3 is of size (8 - 4 + 1) = 5
     STEP 4: Multiply the dimension sizes:  
            
                      Nelements = 6 x 5 x 5 = 150
       

to find that there are 150 elements in the array.

In the early days of computing, it was very important to know how large arrays were, because computer memory was extremely limited. Today, with large memory models, one still must be careful not to specify array dimensions too large, but it is less of a problem than in the past.

Programming Hint: As we mentioned in class, one should always initialize program variables that are not assigned data from a file before they are used in computing a given expression. One can use the preceding loop structure to assign initial values to array elements in an efficient manner.
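
For example, the preceding loop structure might be written in Java as follows (a sketch; note that Java indexes every dimension from 0, and in fact initializes numeric arrays to zero automatically, so the explicit loop is shown only to illustrate the pattern):

    double[][] a = new double[3][4];        // 3 rows, 4 columns; indices start at 0
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++) {
            a[i][j] = 0.0;                  // explicit initialization of each element
        }
    }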

A key problem with arrays is that they have fixed size. Hence, they are called static data structures. We will soon examine techniques for programming data structures called lists, which can expand and contract with the data that one puts into or takes out of the list. Another problem of arrays which we mentioned previously is that they are statically typed, i.e., cannot be used to store any type of data, but only the type of data assigned to them in the declaration statement in which they were specified. It is interesting to note that certain languages, such as SNOBOL and ICON, have circumvented this difficulty by providing a TABLE data structure that can hold data of any type. You are not responsible for knowing about the TABLE data structure, however, since it is not available in Java, or in most modern high-level programming languages.

We next consider the case of singly- and doubly-linked linear lists.

Concept. We have seen that arrays use an index to point to each element of the array. For example, given an array a with elements a[i], i = 1..n, the third element of the array would be specified as a[3], provided n = 3 or greater. Another type of linear data structure, similar in concept but not in implementation to a one-dimensional array, is the list.

Definition. A list is an ordered collection of elements, where each element has at least one pointer and at least one value.

Definition. A singly-linked list (SLL) is a list where each element has the following fields:

Value: The value that is stored in the list element. This can be of any datatype, including a list. Lists can thus be multi-dimensional, i.e., a list element can point to another list through its value field.

Next-pointer: This is a pointer to the next element in the list, if it exists, and has datatype list-element.

Convention. Given an SLL element called this, we usually reference the element's value and next-pointer as this.value and this.next, respectively.

Observation. As shown in the schematic diagram of an SLL in Figure 1.6.2, there are different ways to construct list representations. Two are shown - the top example has sentinel head and tail nodes, which represent extra storage and maintenance overhead, but can provide a measure of data security if head and tail manipulation operations are employed. The simplified list representation, without sentinel head or tail nodes, is more common. In this case, the first node is the head, and the last node (the one that points to the null value) is the tail of the list.


Figure 1.6.2. Schematic diagram of Singly-Linked List, where L[i], i=1..n, denotes a value in the i-th list element.

Definition. A doubly-linked list (DLL) is a list where each element has the following fields:

Value: The value that is stored in the list element. This can be of any datatype, including a list. Lists can thus be multi-dimensional, i.e., a list element can point to another list through its value field.

Next-pointer: This is a pointer to the next element in the list, if it exists, and has datatype list-element.

Prev-pointer: This is a pointer to the previous element in the list, if it exists, and has datatype list-element.

Convention. Given a DLL element called this, we usually reference the element's value, next-pointer, and prev-pointer as this.value, this.next, and this.prev, respectively.

Observation. Finding the head and tail of a DLL is a little easier than with an SLL, since the prev-pointer (next-pointer) of the DLL's head (tail) points to the null value.


Figure 1.6.3. Schematic diagram of Doubly-Linked List, where L[i], i=1..n, denotes a value in the i-th list element.

Observation. Traversing a DLL is easier than traversing an SLL because the prev-pointer of the DLL allows one to backtrack (i.e., traverse the list backwards). This is important in certain types of sorting or searching operations conducted over lists. SLLs have the advantage of being slightly simpler than DLLs, and are easier to implement, since they have 2/3 the data fields of a DLL.
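
Minimal Java sketches of the two node types (the class names SLLNode and DLLNode are ours, and int values are used for simplicity):

    class SLLNode {
        int value;       // the value field
        SLLNode next;    // the next-pointer; null marks the end of the list
    }

    class DLLNode {
        int value;       // the value field
        DLLNode next;    // the next-pointer; null at the tail
        DLLNode prev;    // the prev-pointer; null at the head
    }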

ADT Operations. An SLL or DLL has the following ADT operations:

Size(): Returns the number of elements in the list.

IsEmpty(): Returns true if the list has no elements.

i-th(i): Returns the value of the i-th element in the list. This function is sometimes called ElementAt(i).

Head(): Returns the value of the first element in the list.

Tail(): Returns the value of the last element in the list.

Insert(x,i): Inserts a new element in the list, after the i-th element, where the new element has value x. The (i+1)-th and following elements each have their indices incremented. That is, the (i+1)-th element becomes the (i+2)-th element, and so forth.

Delete(i): Deletes the i-th element from the list. The (i+1)-th and following elements each have their indices decremented. That is, the (i+1)-th element becomes the i-th element, and so forth.

Observation. The i-th or ElementAt operation incurs O(n) complexity for an n-element SLL or DLL. Thus, if you put the ElementAt operation inside a loop having m iterations, this will incur a work requirement of O(m·n) operations, which is O(n^2) when m is proportional to n.
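
The following Java sketch, using the SLLNode class shown earlier, makes the O(n) cost visible: reaching the i-th element requires following i-1 next-pointers.

    static int elementAt(SLLNode head, int i) {   // i = 1..n, as in the text
        SLLNode cur = head;
        for (int k = 1; k < i; k++) {   // follows i-1 links: O(n) in the worst case
            cur = cur.next;
        }
        return cur.value;
    }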


Figure 1.6.4. Schematic diagram of inserting an element with value "C" in a Doubly-Linked List, between elements with value "A" and "B". Note that Steps 1-3 generalize insertion in an SLL, if and only if the list nodes are SLL nodes, as shown in Figure 1.6.2.


Figure 1.6.5. Result of insertion operation in DLL. Steps from Figure 1.6.4 are numbered in blue.

Complexity. The work of inserting an element in a DLL, as shown in Figure 1.6.4, is (a) one list element creation plus (b) four list pointer assignments. For an SLL, this work is reduced by two pointer assignments, since the prev-element pointer is not available in an SLL. The preceding work estimate does not include the work required to locate the insertion position in the list.
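
A Java sketch of DLL insertion, using the DLLNode class shown earlier, makes the element creation and the four pointer assignments explicit (the method name insertAfter is ours):

    // Insert a new node with value x immediately after node a of a DLL.
    static void insertAfter(DLLNode a, int x) {
        DLLNode c = new DLLNode();   // (a) one list element creation
        c.value = x;
        c.prev = a;                  // pointer assignment 1
        c.next = a.next;             // pointer assignment 2
        if (a.next != null) {
            a.next.prev = c;         // pointer assignment 3 (skipped at the tail)
        }
        a.next = c;                  // pointer assignment 4
    }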

1.7. Searching, Selection, and Sorting

In computer science, we are often called upon to perform practical tasks, such as looking up an entry in a table or dictionary, finding items in a database, etc. The process of searching for such items is also called selection, since one attempts to select a given value from an array, list, set, or other ADT. Selection is discussed in Section 1.7.1.

Another important task in computer science is ordering a list of names, numbers, or other data according to a prespecified ordering scheme. This process, called sorting, has been extensively investigated and was the subject of much research in the 1950s through 1970s. Thus, there are many sorting algorithms of interest. We will study several of these in Section 1.7.2.

1.7.1. Selection

There are two types of selection algorithms of interest in this introductory class. First, we have the simple technique of linear search, which merely scans each item of a data structure to see if its value corresponds to a prespecified or reference value. This requires O(n) work for searching through n data. Second, one can produce hierarchical arrangements of data, called trees, for which search algorithms exist that can find data in O(log n) work.

Concept. Linear search scans through a list, array, or other ADT, comparing the value of each element to a reference value. At the first match, the search algorithm returns the position of the value of interest.

Analysis. The best case for a linear search algorithm is to have the value x for which one is searching located at the first position of the ADT. This obviously requires a constant number of comparison operations, i.e., the work is O(1) comparison. In the worst case, x is located at the n-th position of the ADT, for which the associated work is O(n).

Algorithm. Given an n-element vector a, linear search for an element x can be implemented as follows:

    LinSearch(x,a,n):
    { for i = 1 to n
        if (a[i] = x) then return(i)
      endfor
      return(0)                 # signal that x was not found
    }
    

Analysis. The preceding algorithm incurs a minimum (maximum) of 1 (n) comparisons, for an average work requirement of (n+1)/2 comparisons.

Example. Given a five-element vector a = (2,8,32,7,4), let us search for the element a[i] = 7, i = 1..5. The following table suffices to exemplify linear search:

    STEP     i VALUE    a[i]    CHECK     RESULT    NEXT i
    ----     -------    ----  ---------  --------  --------
     1          1         2    a[1] = 7?   2 != 7      2
     2          2         8    a[2] = 7?   8 != 7      3
     3          3        32    a[3] = 7?  32 != 7      4
     4          4         7    a[4] = 7?   7  = 7     stop

Observation. Search trees are usually binary, that is, there are two children per internal node. This allows one to use a binary relation such as greater-than or less-than to construct a min-tree or max-tree that has, at the root of the tree, the minimum or maximum of the data stored in the tree. We state a more general case, as follows.

Definition. An n-ary search tree is a rooted tree that has at most n children per nonterminal node.

Observation. In an n-ary search tree of depth L levels, O(L) operations are required to find a prespecified value stored in one of the tree nodes. Given m data items stored in a complete n-ary search tree, the number of levels is computed as follows:

L = ceil(log_n(m)) ,

where ceil denotes the ceiling function.

Observation. The disadvantage of some types of search trees, such as binary search trees, is that the sequence stored in the tree must be presorted in order to have the order property required to implement O(log n) work in the selection process. However, other types of search trees, such as some implementations of binary search trees, AVL trees, and red-black trees discussed in Section 3, require only O(log n) work to insert and delete an element from the tree.

1.7.2. Sorting

There are several types of sorting algorithms of interest to this course. First, we have the simple technique of injection-sort, which positions each input value in its position on the number line, using the value as an index for the sorted array. A more sophisticated sorting technique, called histogram-sort or bin-sort scans the histogram of a sequence to output the values stored in the histogram in a predetermined order. Since the domain of the histogram is assumed to be an n-element subset of the integers or reals, the work involved is O(n) additions and I/O operations.

More sophisticated sorting algorithms were developed to handle large input sequences that have arbitrary values. For example, bubble-sort migrates the maximum or minimum of a sequence to the head of the sequence. The remainder of the sequence is then sorted in the same way. This means that there are n passes, where the i-th pass requires scanning through n-i+1 data. This involves O(n^2) comparison and I/O operations, and is thus prohibitive for very large sequences. The more intelligent (and efficient) technique of insertion-sort selectively positions each value of an input sequence in its proper position in the sequence, according to a prespecified order. This can be thought of as a type of intelligent bubble-sort algorithm, and requires far fewer operations than bubble-sort.

Still more efficient sorting algorithms include merge-sort and quick-sort, which are tree-structured. These algorithms, which belong to the general class of divide-and-conquer algorithms, first split (divide) an input sequence into subsequences, which are then sorted when a given subsequence has two elements. (Note that sequences that have one element do not need to be sorted.) The subsequences are then recombined to yield a sorted sequence. Merge-sort partitions the input sequence based on its length, which is assumed to be a power of two, since a binary tree is employed. Quick-sort partitions an input sequence based on a pivot value that is computed from the values in the sequence. Quick-sort is therefore said to be data-dependent.

1.7.2.1. Injection Sort.

Concept. Given an input sequence [a_n] that is a subset of the integers modulo n, injection-sort places each value a(i) in its proper place on the number line.


Figure 1.7.1. Schematic diagram of injection sort applied to input vector (2,4,3,1) to yield sorted vector (1,2,3,4).

Algorithm. Assume that a sequence a has n distinct integers taken from the set Z_n, where n is small (i.e., less than 1M). We can sort a as follows:

    InjectionSort(a : array [1..n] of integer modulo n):
    { for i = 1 to n do:
        b[a[i]] = a[i]
      endfor
      return(b)
    }
    

Analysis. The preceding algorithm consists of one loop, which has n iterations and performs work of two memory I/O operations during each iteration. Thus, the total work is 2n memory I/O operations, and the injection-sort algorithm sorts its input with O(n) work.

Injection sort has no best or worst case input, because the same process is applied to each input, regardless of how the input is ordered. Another way of saying this is to observe that there are no decision statements in the algorithm, and thus no way of exiting the loop before completion. We say that such an algorithm has performance or complexity that is not data-dependent.
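
A Java rendering of injection sort (a sketch; indices run from 0 in Java, so the n distinct values are assumed to be drawn from {0, ..., n-1}):

    static int[] injectionSort(int[] a) {
        int[] b = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[a[i]] = a[i];    // each value is its own position on the number line
        }
        return b;
    }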

1.7.2.2. Histogram Sort (Bin Sort).

Concept. Given an input sequence [a_n] whose values can be mapped to the integers modulo m, where m > n is possible, histogram-sort (also called bin-sort) places each value a(i) in its proper place on the number line by scanning the histogram of a.


Figure 1.7.2. Schematic diagram of histogram sort applied to input vector (2,1,4,2,3,1,4,1) to yield sorted vector (1,1,1,2,2,3,4,4).

Algorithm. Assume that a sequence a has n distinct elements that can be mapped to the set Z_m, where m is small. We can sort a using its histogram h(a), as follows:

    HistogramSort(a : array [1..n] of integer modulo m):
    { h = 0                       # Zero the histogram buffer
      for i = 1 to n do:          # Build the histogram
        h[a[i]]++
      endfor
      i = 0                       # Zero the output counter
      for j = 1 to m do:          # Histogram domain scan loop
        for k = 1 to h[j] do:     # Bin countup loop
           i++
           b[i] = j               # Put bin index in output
        endfor
      endfor
      return(b)
    }
    

Analysis. The preceding algorithm consists of two loops. The first loop has n iterations and performs work of one memory I/O operation and one incrementation during each iteration. The second loop is a compound loop (j and k are indices) whose inner body executes a total of n times, since there are n data counted in the histogram (recall that there were n input data); the outer scan over the histogram domain adds m iterations. For each execution of the inner body, there is one incrementation (i++) and one memory I/O operation. Thus, the total work is approximately 2n memory I/O operations and incrementations plus the m-iteration domain scan, and the histogram-sort algorithm sorts its input with O(n + m) = O(n) work when m is small relative to n.

For the same reasons given in Section 1.7.2.1, Analysis, Histogram-Sort has no best case or worst case, and is not data-dependent. However, like Injection-Sort, Histogram-Sort requires a highly restricted input, which is not necessarily a realistic constraint.

Remark. Because a histogram is said to consist of bins in which the count for a given value is stored (like putting n beans in bin x if the value x occurs n times), histogram-sort is often called bin-sort.
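
A Java rendering of histogram sort (a sketch; the values are assumed to be drawn from {0, ..., m-1}):

    static int[] histogramSort(int[] a, int m) {
        int[] h = new int[m];                  // histogram bins, initialized to zero
        for (int v : a) {                      // build the histogram
            h[v]++;
        }
        int[] b = new int[a.length];
        int i = 0;                             // output counter
        for (int j = 0; j < m; j++) {          // scan the histogram domain
            for (int k = 0; k < h[j]; k++) {   // emit bin index j, h[j] times
                b[i++] = j;
            }
        }
        return b;
    }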

We next move from the O(n) complexity sorting algorithms with restricted input sets to an early sorting algorithm with unrestricted real input values.

1.7.2.3. Bubble Sort.

Concept. Given an input sequence [a_n] whose values are assumed to be real numbers, bubble-sort places each value a(i) in its proper place on the number line by respectively "bubbling up" or migrating the minimum (maximum) of the sequence or remaining subsequence to achieve a sort in ascending (descending) order.


Figure 1.7.3. Schematic diagram of bubble sort applied to input vector (2,1,4,2,3) to yield sorted vector (1,2,2,3,4). Note that the bounded regions are the areas to which bubble-sort is applied. Each of these areas decreases by one element, and is equal in size to n-i, where i denotes the index of the processing step.

Algorithm. Assume that a sequence a has n real elements, which need not be distinct. We can sort a in ascending order using bubble-sort, as follows:

    BubbleSort(a,n):
    { read(a[i], i = 1..n)
      for i = 1 to n do:
        min = minimum of the sequence a[i..n]
        pos = position at which the minimum occurred
        b[i] = min
        swap(a[i],a[pos])
      endfor
      return(b)
    }
    

Analysis. The preceding algorithm consists of one loop, which has n iterations and performs (n-i+1) comparisons in the i-th iteration. Summing over i = 1..n, the complexity is O(n^2) comparisons, which dominate over the memory I/O operations.

In the best case, the input is presorted and bubble-sort requires a minimum number of I/O operations. (Hint: The exact number of I/O operations might be a good exam question.)

In the worst case, the input is sorted opposite to the desired order of the sorted output, and a maximum number of I/O operations are required. The number of comparisons is invariant to input value, since the maximum of each sequence or subsequence is not known a priori.
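
A Java rendering of the pseudocode above (a sketch; note that this pass structure, which migrates the minimum of the remaining subsequence into place on each pass, is what some texts call selection sort):

    static void bubbleSort(int[] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            int pos = i;                          // position of the minimum so far
            for (int j = i + 1; j < n; j++) {     // n-i-1 comparisons on this pass
                if (a[j] < a[pos]) pos = j;
            }
            int t = a[i]; a[i] = a[pos]; a[pos] = t;  // swap the minimum into place
        }
    }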

1.7.2.4. Insertion Sort.

Concept. Given an input sequence [a_n] whose values are assumed to be numbers, insertion-sort places each value a(i) in its proper place in the sorted sequence by "bubbling up" or migrating the minimum (maximum) of the sequence or remaining subsequence, only if that value is out of order.


Figure 1.7.4. Schematic diagram of insertion sort applied to input vector (2,1,4,2,3) to yield sorted vector (1,2,2,3,4). Note that a given exchange terminates with determination of InPlace ordering, regardless of how many steps are in the exchange. Also, when the marker is at the final position (in this case, the fifth position because n = 5), this signals the final exchange.

Algorithm. Assume that a sequence a has n elements, which need not be distinct. We can sort a in ascending order using insertion-sort, as follows:

    InsertionSort(a,n):
    { for i = 2 to n do:
        if a[i] is out of place in the subsequence a[1..i], then
          migrate a[i] upward until it is in its proper place
        endif
      endfor
      return(a)
    }
    

Analysis. The preceding algorithm consists of one loop, which has n-1 iterations and performs a maximum of O(n) comparisons per iteration. This occurs when the input sequence is sorted in opposite order from the desired output (worst case). In the best case, the input sequence is properly sorted, and a total of n-1 comparisons are required (because the previous subsequence is already sorted). Thus, the complexity of insertion-sort varies from O(n) to O(n^2) comparisons, which dominate over the memory I/O operations. For this reason, the InsertionSort algorithm has data-dependent complexity.
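
A Java rendering of insertion sort (a sketch; the inner while-loop performs the "migrate upward until in place" step):

    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {   // a[0..i-1] is already sorted
            int x = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > x) {       // migrate only while out of place
                a[j + 1] = a[j];               // shift larger elements one slot right
                j--;
            }
            a[j + 1] = x;                      // drop x into its proper place
        }
    }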

1.7.2.5. Merge Sort.

Concept. Given an input sequence [a_n] whose values are assumed to be numbers, and n is a power of 2 for convenience, merge-sort uses divide-and-conquer to sort the input recursively until there are subsequences of two elements only. These are sorted using one comparison and one swap operation (if required), then are merged into progressively larger subsequences, until the input length is attained. The result is a sorted sequence. As shown in Figure 1.7.5, the MergeSort algorithm consists of (1) splitting, (2) sorting, and (3) merging steps. For purposes of clarity, we assume that MergeSort input consists of distinct integers.


Figure 1.7.5. Schematic diagram of merge sort applied to input vector (2,8,5,3,7,1,6,4) to yield sorted vector (1,2,3,4,5,6,7,8).

Algorithm. Assume that a sequence a has n elements, which need not be distinct. We can sort a in ascending order using merge-sort, as follows:

    Merge(L1,L2):
    { L3 = empty list
      repeat until list L1 or list L2 is empty:
        if head(L1) < head(L2) then
          append head(L1) to the rear of L3
          delete(head(L1))
        else
          append head(L2) to the rear of L3
          delete(head(L2))
        endif
      end-repeat
      concatenate the remaining (non-empty) list
        onto the rear of L3
      return(L3)
    }
    
    MergeSort(a,n):
    { if size(a) = 1 then return(a)
      elseif size(a) = 2 then
      { if a[1] and a[2] are not in proper order, then
          swap(a[1],a[2])
        endif
        return(a) }
      else 
      { split a into two lists of equal size, L and R
        return(Merge(MergeSort(L,n/2),MergeSort(R,n/2))) }
      endif
    }
    

Merge Example. The merging algorithm's operation may not be intuitively obvious, so we present the following example, in which two 2-element sorted sequences from Figure 1.7.5 are merged. To merge L1 = (2,8) and L2 = (3,5): compare the heads 2 and 3 and output 2; compare 8 and 3 and output 3; compare 8 and 5 and output 5; L2 is now empty, so the remainder of L1 (namely, 8) is concatenated onto the rear, yielding L3 = (2,3,5,8).

Thus, one can see that merging in ascending (descending) order is merely a matter of outputting the smaller (larger) of the list heads, then deleting that element from its respective list. Since the lists L1 and L2 are already sorted, this is possible. If L1 and L2 were not in-order, Merge would not work.

Remark. Another way to think of merging is to imagine a process by which two decks of n cards, each of which is sorted, could be shuffled to yield one sorted deck.

Analysis. The preceding algorithm MergeSort calls itself recursively log(n)-1 times. This leads to the construction of a recursion tree of log(n) levels. The swap operation adds another level to the tree, and the resultant Merge operations unwind the recursion stack to yield another log(n) levels.

  • The splitting (divide) phase requires as few as (n/2 + 1) bit operations, if an array implementation is employed. Otherwise, n I/O operations are required at each level, for a total of O(n log(n)) operations.

  • The sorting phase requires n/2 comparisons and a maximum of n I/O operations, if each 2-element subsequence is unsorted.

  • The merging phase unwinds the recursion stack, and requires m-1 comparisons for each set of two m-element subsequences. Since there are O(log n) levels in the merge portion of the sorting architecture, and m is a function of n, a total of O(n log(n)) comparison operations are required.

  • This means that the merge-sort algorithm requires O(n log(n)) work. In terms of comparisons, there is no best or worst case. However, the I/O overhead is reduced when each 2-element subsequence encountered at the sorting level is itself sorted.

    Similarly, since the Merge algorithm checks the order of each element in the sequences to be merged, there is no best or worst case. Thus, MergeSort is not data dependent in terms of comparison operations, but is data-dependent in terms of I/O operations.
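
A compact Java rendering of MergeSort and Merge (a sketch using arrays rather than lists; java.util.Arrays.copyOfRange performs the splitting step, and the method names are ours):

    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;                      // nothing to sort
        int mid = a.length / 2;                           // split a into L and R
        int[] L = mergeSort(java.util.Arrays.copyOfRange(a, 0, mid));
        int[] R = mergeSort(java.util.Arrays.copyOfRange(a, mid, a.length));
        return merge(L, R);
    }

    static int[] merge(int[] L, int[] R) {
        int[] out = new int[L.length + R.length];
        int i = 0, j = 0, k = 0;
        while (i < L.length && j < R.length) {            // output the smaller head
            out[k++] = (L[i] < R[j]) ? L[i++] : R[j++];
        }
        while (i < L.length) out[k++] = L[i++];           // concatenate the remainder
        while (j < R.length) out[k++] = R[j++];
        return out;
    }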

1.7.2.6. Quick Sort.

Quick-sort is conceptually similar to merge-sort, but can improve slightly upon merge-sort's performance by eliminating the merging phase. This holds mainly when the input can be divided in the same way as merge-sort's input (i.e., halving of each sequence or subsequence at each level of the tree).

Concept. Given an input sequence [a_n] whose values are assumed to be numbers, and n is arbitrary, quick-sort uses divide-and-conquer to sort the input recursively until there are subsequences of one or two elements only. These are sorted using one comparison and one swap operation (if required), then are concatenated into progressively larger subsequences, until the input length is attained. The result is a sorted sequence. The difference between merge-sort and quick-sort is that quick-sort uses a pivot value instead of the sequence or subsequence size to decide how to partition the input.


Figure 1.7.6. Schematic diagram of quick sort applied to input vector (42,43,2,1,46,4,45) to yield sorted vector (1,2,4,42,43,45,46).

Algorithm. Assume that a sequence a has n elements, which need not be distinct. We can sort a in ascending order using quick-sort, as follows:

    QuickSort(a):
    { if size(a) = 1 then return(a)
      elseif size(a) = 2 then
      { if a[1] and a[2] are not in proper order, then
          swap(a[1],a[2])
        endif
        return(a) }
      else 
      { pick a pivot value p based on the values of a
        split a into two lists, L and R, where 
        { L contains the values of a that are less than p
          R contains the remaining values of a }
        return(Concatenate(QuickSort(L),QuickSort(R))) }
      endif
    }

Analysis. The preceding algorithm calls itself recursively a minimum of ceil(log(n))-1 times, and a maximum of n-1 times (worst case). This leads to the construction of a recursion tree having log(n) to n-1 levels. The swap operation adds another level to the tree, and the resultant concatenation operations unwind the recursion stack to yield another log(n) to n levels.

  • The splitting (divide) phase requires log(n) levels in the best case, if the median of each sequence or subsequence is chosen as the pivot value. Other pivot schemes run the risk of partitioning the input at other than its positional midpoint.

  • The sorting phase requires n/2 comparisons and a maximum of n I/O operations, if 2-element subsequences result (best case) and each 2-element subsequence is unsorted.

  • The concatenation phase unwinds the recursion stack, and requires no comparisons, unlike merge-sort. Since there are log(n) to n-1 levels in the concatenation portion of the sorting architecture, and each level has n I/O operations, a total of O(n log(n)) to O(n2) I/O operations are required.
  • This means that the quick-sort algorithm requires O(n log(n)) to O(n2) work, depending on how the pivot value is chosen. If the median is chosen as the pivot, then quick-sort can actually be slightly faster than merge-sort, since there are no comparisons involved in the conquer phase.

The best-case input is a sequence in which every subsequence produced by splitting has its mean equal to its median, since this produces the fewest levels. The worst-case input for splitting on the mean is a geometric series sorted in opposite order to the desired sorting order, for example, (10,100,1000,...). When the mean is taken, such a sequence will require n-1 levels of splitting and n-1 levels of concatenation, because it will be split into n-2 subsequences of one element each, and one subsequence of two elements.

Note: Students should verify this by constructing a splitting and concatenation diagram for the sequence S = (10^1, 10^2, ..., 10^8).
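
A compact Java rendering of quick-sort (a sketch that partitions in place around the middle element as pivot; the mean- or median-based pivot schemes discussed above could be substituted in the line that chooses p):

    // Sorts a[lo..hi] in place; initial call: quickSort(a, 0, a.length - 1).
    static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) return;                 // 0- or 1-element subsequence: done
        int p = a[(lo + hi) / 2];             // pivot: middle element (one simple scheme)
        int i = lo, j = hi;
        while (i <= j) {                      // partition around p
            while (a[i] < p) i++;             // a[i] belongs left of the pivot
            while (a[j] > p) j--;             // a[j] belongs right of the pivot
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quickSort(a, lo, j);                  // recurse on the left partition
        quickSort(a, i, hi);                  // recurse on the right partition
    }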

In Section 2, we examine data structures such as sets and strings that can be represented by lists, then look at a special data structure called a heap. It will be shown that a sorting algorithm based on the heap (unsurprisingly called HeapSort) can achieve O(n log(n)) complexity and is easier to implement than MergeSort.


    Copyright © 1998-2000 by Mark S. Schmalz
    All rights reserved, except for viewing/printing by UF faculty or students registered for this class.