notebooks/foundations/foundations7.mld

Dictionaries

lookup: find an item in the dictionary
update (insert): replace (store) an item in the dictionary
delete: remove an item from the dictionary
empty: the null dictionary
Missing: exception for errors in lookup and delete

Ideally, an abstract type should provide these operations and hide the internal data structures.

A dictionary attaches values to identifiers, called “keys”. Before choosing the internal representation for a data structure, you need to specify the full set of operations. In fact, here we only consider update (associating a value with an identifier) and lookup (retrieving such a value). Deletion is more difficult and would limit our choices. Some applications may need additional operations, such as merge (combining two dictionaries). We shall see that update can be done efficiently in a functional style, without excessive copying.

An abstract type provides specified operations while hiding low-level details, such as the data structure used to represent dictionaries. Abstract types can be declared in any modern programming language. Java’s objects serve this role, as do OCaml’s modules. This course does not cover modules, and we simply declare the dictionary operations individually.

An association list (a list of pairs) is the simplest dictionary representation. Lookup is by linear search, and therefore slow: $O(n)$ . Association lists are only usable if there are few keys in use. However, they are general in that the keys do not need a concept of ordering, only equality.

# exception Missing
exception Missing
# let rec lookup a = function
  | [] -> raise Missing
  | (x, y) :: pairs ->
      if a = x then y
      else lookup a pairs
val lookup : 'a -> ('a * 'b) list -> 'b = <fun>
# let update (l, b, y) = (b, y) :: l
val update : ('a * 'b) list * 'a * 'b -> ('a * 'b) list = <fun>

To enter a new (key, value) pair, simply “cons” it to the list with update. This takes constant time, which is the best we could hope for. But the space requirement is huge: linear in the number of updates, not in the number of distinct keys. Obsolete entries are never deleted: that would require first finding them, increasing the update time from $O(1)$ to $O(n)$ .

Binary Search Trees

A dictionary associates values (here, numbers) with keys.

Binary search trees are an important application of binary trees. They work for keys that have a total ordering, such as strings. Each branch of the tree carries a $(key, value)$ pair; its left subtree holds smaller keys; the right subtree holds greater keys. If the tree remains reasonably balanced, then update and lookup both take $O(\log n)$ for a tree of size $n$ . These times hold in the average case; given random data, the tree is likely to remain balanced.

At a given node, all keys in the left subtree are smaller (or equal) while all trees in the right subtree are greater.

An unbalanced tree has a linear access time in the worst case. Examples include building a tree by repeated insertions of elements in increasing or decreasing order; there is a close resemblance to quicksort. Building a binary search tree, then converting it to inorder, yields a sorting algorithm called treesort.

Self-balancing trees, such as Red-Black trees, attain $O(\log n)$ in the worst case. They are complicated to implement.

Lookup: Seeks Left or Right

# exception Missing of string
exception Missing of string
# let rec lookup b = function
  | Br ((a, x), t1, t2) ->
      if b < a then
        lookup b t1
      else if a < b then
        lookup b t2
      else
        x
  | Lf -> raise (Missing b)
val lookup : string -> (string * 'a) tree -> 'a = <fun>

This has guaranteed $O(\log n)$ access time if the tree is balanced!

Lookup in the binary search tree goes to the left subtree if the desired key is smaller than the current one and to the right if it is greater. It raises Missing if it encounters an empty tree.

Since an ordering is involved, we have to declare the functions for a specific type, here string. Now exception Missing mentions that type: if lookup fails, the exception returns the missing key. The exception could be eliminated using type option of our earlier Datatypes lecture, using the constructor None for failure.

Update

# let rec update k v = function
  | Lf -> Br ((k, v), Lf, Lf)
  | Br ((a, x), t1, t2) ->
      if k < a then
        Br ((a, x), update k v t1, t2)
      else if a < k then
        Br ((a, x), t1, update k v t2)
      else (* a = k *)
        Br ((a, v), t1, t2)
val update : 'a -> 'b -> ('a * 'b) tree -> ('a * 'b) tree = <fun>

This is also $O(\log n)$ as it copies the path only, and not whole subtrees!

If you are familiar with the usual update operation for this sort of tree, you may wonder whether it can be implemented in OCaml, where there is no direct way to replace part of a data structure by something else.

The update operation is a nice piece of functional programming. It searches in the same manner as lookup, but the recursive calls reconstruct a new tree around the result of the update. One subtree is updated and the other left unchanged. The internal representation of trees ensures that unchanged parts of the tree are not copied, but shared. Therefore, update copies only the path from the root to the new node. Its time and space requirements, for a reasonably balanced tree, are both $O(\log n)$ .

The comparison between $b$ and $a$ allows three cases:

smaller: update the left subtree; share the right
greater: update the right subtree; share the left
equal: update the label and share both subtrees

Note: in the function definition, (* a = b*) is a comment. Comments in OCaml are enclosed in the brackets (* and *).

Aside: Traversing Trees (3 Methods)

# let rec preorder = function
  | Lf -> []
  | Br (v, t1, t2) ->
      [v] @ preorder t1 @ preorder t2
val preorder : 'a tree -> 'a list = <fun>
# let rec inorder = function
  | Lf -> []
  | Br (v, t1, t2) ->
      inorder t1 @ [v] @ inorder t2
val inorder : 'a tree -> 'a list = <fun>
# let rec postorder = function
  | Lf -> []
  | Br (v, t1, t2) ->
      postorder t1 @ postorder t2 @ [v]
val postorder : 'a tree -> 'a list = <fun>

Tree traversal means examining each node of a tree in some order. D. E. Knuth has identified three forms of tree traversal: preorder, inorder and postorder. We can code these “visiting orders” as functions that convert trees into lists of labels. Algorithms based on these notions typically perform some action at each node; the functions above simply copy the nodes into lists. Consider the tree:

preorder visits the label first (“Polish notation”), yielding ABDECFG
inorder visits the label midway, yielding DBEAFCG
postorder visits the label last (“Reverse Polish”), yielding DEBFGCA. You might be familiar with this concept if you own an ancient RPN calculator!

What is the use of inorder? Consider applying it to a binary search tree: the result is a sorted list of pairs. We could use this, for example, to merge two binary search trees. It is not difficult to transform a sorted list of pairs into a binary search tree.

Efficiently Traversing Trees

Unfortunately, the functions shown on the previous slide are quadratic in the worst case: the appends in the recursive calls are inefficient. To correct that problem, we (as usual) add an accumulating argument. Observe how each function constructs its result list and compare with how appends were eliminated from quicksort in the Sorting lecture.

# let rec preord = function
  | Lf, vs -> vs
  | Br (v, t1, t2), vs ->
      v :: preord (t1, preord (t2, vs))
val preord : 'a tree * 'a list -> 'a list = <fun>
# let rec inord = function
  | Lf, vs -> vs
  | Br (v, t1, t2), vs ->
      inord (t1, v::inord (t2, vs))
val inord : 'a tree * 'a list -> 'a list = <fun>
# let rec postord = function
  | Lf, vs -> vs
  | Br (v, t1, t2), vs ->
      postord (t1, postord (t2, v::vs))
val postord : 'a tree * 'a list -> 'a list = <fun>

One can prove equations relating each of these functions to its counterpart on the previous section. For example:

\texttt{inord}(t, vs) = \texttt{inorder}(t) @ vs

These three types of tree traversal are related in that all are depth-first. They each traverse the left subtree in full before traversing the right subtree. Breadth-first search (from the Queues lecture) is another possibility. That involves going through the levels of a tree one at a time.

Arrays

A conventional array is an indexed storage area.
- It is updated in place by the command a.(k) <- x
- The concept is inherently imperative.
A functional array is a finite map from integers to data.
- Updating implies copying to return update(A, k, x)
- The new array equals A except that A.(k) = x.
Can we do updates efficiently?

The elements of a list can only be reached by counting from the front. Elements of a tree are reached by following a path from the root. An array hides such structural matters; its elements are uniformly designated by number. Immediate access to arbitrary parts of a data structure is called random access.

Arrays are the dominant data structure in conventional programming languages. The ingenious use of arrays is the key to many of the great classical algorithms, such as Hoare’s original quicksort (the partition step) and Warshall’s transitive-closure algorithm.

The drawback is that subscripting is a chief cause of programmer error. That is why arrays play little role in this introductory course.

Functional arrays are described below in order to illustrate another way of using trees to organise data. Here is a summary of basic dictionary data structures in order of decreasing generality and increasing efficiency:

Linear search: Most general, needing only equality on keys, but inefficient: linear time.
Binary search: Needs an ordering on keys. Logarithmic access time in the average case, but our binary search trees are linear in the worst case.
Array subscripting: Least general, requiring keys to be integers, but even worst-case time is logarithmic.

Functional Arrays as Binary Trees

The path to element $i$ follows the binary code for $i$ (its “subscript”).

This simple representation (credited to W. Braun) ensures that the tree is balanced. Complexity of access is always $O(\log n)$ , which is optimal. For actual running time, access to conventional arrays is much faster: it requires only a few hardware instructions. Array access is often taken to be $O(1)$ , which (as always) presumes that hardware limits are never exceeded.

The lower bound for array subscripts (or ``indices'') is one. The upper bound starts at zero (which signifies the empty array) and can grow without limit. Inspection of the diagram above should make it clear that these trees are always balanced: the left subtree can have at most one node more than the right subtree, recursively all the way down. (This assumes that the array is defined for subscripts $1\ldots n$ with no gaps; an array defined only for odd numbers, for example, would obviously be unbalanced.)

The numbers in the diagram above are not the labels of branch nodes, but indicate the positions of array elements. For example, the label corresponding to $A\[2\]$ is at the position shown. The nodes of a functional array are labelled with the data we want to store, not with these integers.

The Lookup Function

# exception Subscript
  let rec sub = function
  | Lf, _ -> raise Subscript  (* Not found *)
  | Br (v, t1, t2), k ->
      if k = 1 then v
      else if k mod 2 = 0 then
        sub (t1, k / 2)
      else
        sub (t2, k / 2)
exception Subscript
val sub : 'a tree * int -> 'a = <fun>
# let rec sub = function (* Alternative implementation *)
  | Lf, _ -> raise Subscript
  | Br (v, t1, t2), 1 -> v
  | Br (v, t1, t2), k when k mod 2 = 0 -> sub (t1, k / 2)
  | Br (v, t1, t2), k -> sub (t2, k / 2)
val sub : 'a tree * int -> 'a = <fun>

Notice that we have used a new keyword when above, which changes pattern clauses to be only matched if the expression evalutes to true. This can be equivalently expressed by moving the corresponding checks into an if clause on the right hand side of the pattern match, but is often more readable using when (as above).

The lookup function sub, divides the subscript by 2 until 1 is reached. If the remainder is 0 then the function follows the left subtree, otherwise the right. If it reaches a leaf, it signals error by raising exception Subscript.

Array access can also be understood in terms of the subscript’s binary code. Because the subscript must be a positive integer, in binary it has a leading one. Discard this one and reverse the remaining bits. Interpreting zero as left and one as right yields the path from the root to the subscript.

Popular literature often explains the importance of binary as being led by hardware: because a circuit is either on or off. The truth is almost the opposite. Designers of digital electronics go to a lot of trouble to suppress the continuous behaviour that would naturally arise. The real reason why binary is important is its role in algorithms: an if-then-else decision leads to binary branching.

Data structures, such as trees, and algorithms, such as mergesort, use binary branching in order to reduce a cost from $O(n)$ to $O(\log n)$ . Two is the smallest integer divisor that achieves this reduction. (Larger divisors are only occasionally helpful, as in the case of B-trees, where they reduce the constant factor.) The simplicity of binary arithmetic compared with decimal arithmetic is just another instance of the simplicity of algorithms based on binary choices.

The Update Function

# let rec update = function
  | Lf, k, w ->
      if k = 1 then
        Br (w, Lf, Lf)
      else
        raise Subscript  (* Gap in tree *)
  | Br (v, t1, t2), k, w ->
      if k = 1 then
        Br (w, t1, t2)
      else if k mod 2 = 0 then
        Br (v, update (t1, k / 2, w), t2)
      else
        Br (v, t1, update (t2, k / 2, w))
val update : 'a tree * int * 'a -> 'a tree = <fun>

The update function also divides the subscript repeatedly by two. When it reaches a value of one, it has identified the element position. Then it replaces the branch node by another branch with the new label.

A leaf may be replaced by a branch, extending the array, provided no intervening nodes have to be generated. This suffices for arrays without gaps in their subscripting. (The data structure can be modified to allow sparse arrays, where most subscript positions are undefined.) Exception Subscript indicates that the subscript position does not exist and cannot be created. This use of exceptions is not easily replaced by None and Some.

Note that there are two tests involving $k=1$ . If we have reached a leaf, it returns a branch, extending the array by one. If we are still at a branch node, then the effect is to update an existing array element.

A similar function can shrink an array by one.

Exercise 7.1

Draw the binary search tree that arises from successively inserting the following pairs into the empty tree: ("Alice", 6), ("Tobias", 2), ("Gerald", 8), ("Lucy", 9). Then repeat this task using the order ("Gerald", 8), ("Alice", 6), ("Lucy", 9), ("Tobias", 2). Why are results different?

Exercise 7.2

Code an insertion function for binary search trees. It should resemble the existing update function except that it should raise the exception Collision if the item to be inserted is already present.

Exercise 7.3

Continuing the previous exercise, it would be natural for exceptional Collision to return the value previously stored in the dictionary. Why is that goal difficult to achieve?

Exercise 7.4

Describe an algorithm for deleting an entry from a binary search tree. Comment on the suitability of your approach.

Exercise 7.5

Code the delete function outlined in the previous exercise.

Exercise 7.6

Show that the functions preorder, inorder and postorder all require $O(n^2)$ time in the worst case, where $n$ is the size of the tree.

Exercise 7.7

Show that the functions preord, inord and postord all take linear time in the size of the tree.

Exercise 7.8

Write a function to remove the first element from a functional array. All the other elements are to have their subscripts reduced by one. The cost of this operation should be linear in the size of the array.