CS 240

Algorithm Design

Problem: Given a problem instance, carry out a particular computational task.

Problem Instance: Input.

Problem Solution: Output.

Size of a problem instance: Size(I) is a positive integer that measures the size of instance I.

Algorithm: Step-by-step process for carrying out a series of computations, given an arbitrary problem instance I.

For a problem \Pi, we can have several algorithms. For an algorithm A solving \Pi, we can have several programs (implementations).

Analysis of Algorithms I

We may be interested in the amount of time or memory used. Experimental studies have shortcomings: results depend on the implementation and hardware, and we cannot test against all inputs.

Instead we analyze algorithms on an idealized Random Access Machine (RAM) model.

The efficiency is measured in terms of its growth rate (this is called the complexity of the algorithm).

# Multiply two n x n matrices a and b into c.
for i in range(n):
    for j in range(n):
        c[i][j] = 0
        for k in range(n):
            c[i][j] += a[i][k] * b[k][j]

About n^3 operations.

Asymptotic Notation

O-notation: f(n) \in O(g(n)) if there exist constants c > 0, n_0 > 0 such that |f(n)| \le c|g(n)| \; \forall n \ge n_0.

Example: f(n) = 75n + 500 and g(n) = 5n^2.

f(n), g(n) \ge 0 \forall n \ge 1
For n \ge 20
n^2 \ge 20n
5n^2 \ge 100n = 75n + 25n
25n \ge 25 \cdot 20 = 500
5n^2 \ge 75n + 500
g(n) \ge f(n)
Taking n_0 = 20, c = 1, this proves that f(n) \in O(g(n)).

We want a tight asymptotic bound.

\Omega-notation: f(n) \in \Omega(g(n)) if \exists c > 0, n_0 > 0 such that c|g(n)| \le |f(n)| \; \forall n \ge n_0.

\Theta-notation: f(n) \in \Theta(g(n)) if \exists c_1, c_2 > 0, n_0 > 0 such that c_1|g(n)| \le |f(n)| \le c_2|g(n)| \; \forall n \ge n_0.

How do we express that f(n) is asymptotically strictly smaller than g(n)?

o-notation: f(n) \in o(g(n)) if \forall c > 0, \exists n_0 > 0 such that |f(n)| < c|g(n)| \; \forall n \ge n_0.

\omega-notation: f(n) \in \omega(g(n)) if \forall c > 0, \exists n_0 > 0 such that 0 \le c|g(n)| < |f(n)| \; \forall n \ge n_0.

Example: f(n) = 2000n^2 + 5000n and g(n) = n^3. Prove f(n) \in o(g(n)).

Given c > 0: for n \ge 1, f(n) = 2000n^2 + 5000n \le 7000n^2.
7000n^2 < cn^3 \Leftrightarrow 7000 < cn \Leftrightarrow n > \frac{7000}{c}
So if we take n_0 = \frac{7000}{c} + 1, we have f(n) < c\,g(n) \; \forall n \ge n_0. This proves that f(n) \in o(g(n)).

Asymptotic Identities

f(n) \in \Theta(g(n)) \Leftrightarrow g(n) \in \Theta(f(n))
f(n) \in O(g(n)) \Leftrightarrow g(n) \in \Omega(f(n))
f(n) \in o(g(n)) \Rightarrow f(n) \in O(g(n)) and f(n) \notin \Omega(g(n))
f(n) \in \omega(g(n)) \Rightarrow f(n) \in \Omega(g(n)), \Rightarrow f(n) \notin O(g(n))
Identity: f(n) \in \Theta(f(n))
Maximum: f(n) > 0, g(n) > 0 \forall n \ge n_0 \Rightarrow O(f(n) + g(n)) = O(\max\{f(n), g(n)\}). Similar for \Omega.
Transitivity: f(n) \in O(g(n)), g(n) \in O(h(n)) \Rightarrow f(n) \in O(h(n)). Similar for \Omega.

Limit Test

Suppose L = \lim_{n \to \infty} \frac{f(n)}{g(n)}.

f(n) \in \begin{cases} o(g(n)), &L = 0 \\ \Theta(g(n)), &0 < L < \infty \\ \omega(g(n)), &L = \infty \end{cases}

Example: f(n) = \log(n) = \frac{\ln(n)}{\ln(2)}, g(n) = n.

By L'Hôpital's rule, \lim_{n \to \infty} \frac{\log(n)}{n} = \lim_{n \to \infty} \frac{1}{n\ln(2)} = 0, so f(n) \in o(g(n)) by the limit test.
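
A quick symbolic check of this limit (a sketch, assuming SymPy is available):

import sympy as sp

n = sp.symbols('n', positive=True)
L = sp.limit(sp.log(n, 2) / n, n, sp.oo)
print(L)  # 0, so log(n) is in o(n) by the limit test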

Analysis of Algorithms II

Example:

def Test1(n):
    sum = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum += (i - j) * (i - j)
    return sum

Let T_1(n) be the runtime of Test1(n).
T_1(n) \in \Theta(S_1(n)) where S_1(n) is the number of times we enter the body of the inner loop on line 4.

S_1(n) = \sum_{i = 1}^n \sum_{j = i}^n 1.

1st solution: Brute Force.

S_1(n) = \sum_{i = 1}^n (n - i + 1) = \sum_{i = 1}^n n - \sum_{i = 1}^n i + \sum_{i = 1}^n 1 = n^2 - \frac{n(n+1)}{2} + n = \frac{n^2}{2} + \frac{n}{2}.
Therefore we have S_1(n) \in \Theta(n^2).

2nd solution: Separate O and \Omega.

S_1(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n 1 = n^2 so S_1(n) \in O(n^2).
S_1(n) \geq \sum_{i = 1}^{n / 2} \sum_{j = i}^n 1 \geq \sum_{i = 1}^{n / 2} \sum_{j = n / 2}^n 1 \geq \frac{n^2}{4} so S_1(n) \in \Omega(n^2).
Therefore S_1(n) \in \Theta(n^2).
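
A brute-force check of the closed form (a small sketch that simply counts the inner-loop executions of Test1):

def S1(n):
    # Count how many times the body of Test1's inner loop runs.
    count = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            count += 1
    return count

# Matches n^2/2 + n/2 = n(n+1)/2 for small n.
assert all(S1(n) == n * (n + 1) // 2 for n in range(1, 50))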

Example:

def Test2(A, n):
    max = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum = 0
            for k in range(i, j + 1):
                sum += A[k]
            if sum > max:
                max = sum
    return max

Let T_2(n) be the runtime of Test2(A, n).
Then T_2(n) \in \Theta(S_2(n)), where S_2(n) is the number of times we enter the body of the inner loop on line 6.

S_2(n) = \sum_{i = 1}^n \sum_{j = i}^n \sum_{k = i}^j 1.

1st Solution: Brute Force.

Input the sum into Wolfram Alpha to see that S_2(n) = \frac{n^3}{6} + \text{lower-order terms}, so S_2(n) \in \Theta(n^3).

2nd Solution: Separate O and \Omega.

S_2(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n \sum_{k = 1}^n 1 = n^3 so S_2(n) \in O(n^3).
S_2(n) \geq \sum_{i = 1}^{n / 3} \sum_{j = i}^n \sum_{k = i}^j 1 \geq \sum_{i = 1}^{n / 3} \sum_{j = 2n / 3}^n \sum_{k = n / 3}^{2n / 3} 1 \geq \left(\frac{n}{3}\right)^3 \in \Omega(n^3)
Therefore S_2(n) \in \Theta(n^3).
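
The same style of check works for S_2 (the exact count is \binom{n+2}{3}, whose leading term is \frac{n^3}{6}):

from math import comb

def S2(n):
    # Count how many times the body of Test2's innermost loop runs.
    count = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            for k in range(i, j + 1):
                count += 1
    return count

assert all(S2(n) == comb(n + 2, 3) for n in range(1, 30))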

Example:

# Insertion Sort (sorts A[0 .. n-1] into non-decreasing order)
def Test3(A, n):
    for i in range(1, n):
        j = i
        while j > 0 and A[j] < A[j - 1]:
            A[j], A[j - 1] = A[j - 1], A[j]
            j -= 1

Let T_A(I) denote the running time of an algorithm A on instance I.
Worst-case: T_A(n) = \max \{T_A(I): Size(I) = n\}.
Average-case: T_A^{avg}(n) = \frac{1}{|\{I: Size(I) = n\}|} \sum_{\{I: Size(I) = n\}} T_A(I).

One should avoid comparing algorithms using O-notation alone, since O gives only an upper bound; use \Theta-bounds for meaningful comparisons.

MergeSort Example

Input: Array A of n integers.

  1. We split A into two subarrays. A_L consists of the first \lceil\frac{n}{2}\rceil elements and A_R consists of the last \lfloor\frac{n}{2}\rfloor elements.
  2. Recursively run MergeSort on A_L and A_R.
  3. After A_L and A_R are sorted, merge them together.
def MergeSort(A, l = 0, r = None):
    if r is None:
        r = len(A) - 1
    if r <= l:
        return
    else:
        m = (l + r) // 2
        MergeSort(A, l, m)
        MergeSort(A, m + 1, r)
        Merge(A, l, m, r)

def Merge(A, l, m, r):
    # Copy the array; strictly, only A[l .. r] needs to be copied.
    S = A.copy()
    i_L, i_R = l, m + 1
    for k in range(l, r + 1):
        if i_L > m:
            A[k] = S[i_R]
            i_R += 1
        elif i_R > r:
            A[k] = S[i_L]
            i_L += 1
        elif S[i_L] <= S[i_R]:
            A[k] = S[i_L]
            i_L += 1
        else:
            A[k] = S[i_R]
            i_R += 1

Merge takes \Theta(r - l + 1) = \Theta(n) time to merge n elements.
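
A quick usage check of the two routines above:

A = [5, 2, 9, 1, 7, 3]
MergeSort(A)
print(A)  # [1, 2, 3, 5, 7, 9]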

Analysis of MergeSort

Let T(n) denote the time to run MergeSort on an array of length n.

  1. Step 1 takes \Theta(n).
  2. Step 2 takes T(\lceil\frac{n}{2}\rceil) + T(\lfloor\frac{n}{2}\rfloor).
  3. Step 3 takes \Theta(n).

Sloppy Recurrence

T(n) = \begin{cases} 2T(\frac{n}{2}) + cn, &\text{if } n > 1 \\ c, &\text{if } n = 1 \end{cases}

Exact and sloppy recurrences are identical when n is a power of 2.
The recurrence can be easily solved by various methods when n = 2^k.

T(n) = 2T(\frac{n}{2}) + cn.
For n a power of 2, n = 2^k.
\begin{aligned} T(2^k) &= 2T(2^{k - 1}) + c2^k \\ &= 2(2T(2^{k - 2}) + c2^{k - 1}) + c2^k = 2^2T(2^{k - 2}) + 2c2^k \\ &= 2^3 T(2^{k - 3}) + 3c2^k \\ &= \dots \\ &= 2^kT(2^{k - k}) + kc2^k \\ &= c2^k + ck2^k \\ &= cn + cn\log(n) \end{aligned}
Therefore T(n) \in \Theta(n\log(n)), for n = 2^k.
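
The unrolled solution can be checked directly against the sloppy recurrence (a sketch with c = 1):

def T(n, c=1):
    # Sloppy recurrence; exact when n is a power of 2.
    return c if n == 1 else 2 * T(n // 2, c) + c * n

# T(2^k) = c*2^k + c*k*2^k = cn + cn*log(n).
assert all(T(2 ** k) == 2 ** k + k * 2 ** k for k in range(12))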

Priority Queue

An ADT consisting of a collection of items (each having a priority), supporting insert and deleteMax.

def PQ_Sort(A):
    n = len(A)
    pq = PriorityQueue()
    for k in range(n):
        pq.insert(A[k])
    for k in reversed(range(n)):
        A[k] = pq.deleteMax()

The runtime will be determined by how efficient our insert() and deleteMax() functions are.

O(n + n \cdot insert + n \cdot deleteMax)

Inserting Into Dynamic Array

Let T(n) be the total cost of insertion from length n to 2n - 1, n = 2^k. Then T(n) is the cost of doubling from n to 2n plus the cost of inserting n items. T(n) \in O(n).
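
A small simulation of the doubling strategy (a sketch that counts element copies only):

def doubling_copies(num_inserts):
    # Total elements copied during all doublings, starting at capacity 1.
    capacity, size, copies = 1, 0, 0
    for _ in range(num_inserts):
        if size == capacity:
            copies += size      # copy the full array into the new one
            capacity *= 2
        size += 1
    return copies

# The total copy cost stays linear: fewer than 2n copies for n inserts.
assert all(doubling_copies(n) < 2 * n for n in range(1, 1000))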

Binary Heaps

A (binary) heap is a certain type of tree.

Any binary tree with n nodes has height at least \log(n + 1) - 1 \in \Omega(\log(n)).

  1. Structural Property: All levels of the heap are completely filled, except for the last level potentially. The entries in the last level are left-justified.
  2. Heap-order Property: For any node i, the key of i's parent is greater than or equal to the key of i.

These are the properties for a max-heap. A min-heap is the same but with the opposite order property.

Lemma: The height of a heap with n nodes is \Theta(\log(n)).

Let h be the height of the heap. Then
2^h \le n < 2^{h + 1}
h \le \log(n) < h + 1
\log(n) - 1 < h \le \log(n)

Storing Heaps in Arrays

Heaps should not be stored as binary trees!

Let H be a heap of n items and let A be an array of size n. We can store the root in A[0] and continue with the elements level by level from top to bottom, and within each level from left to right.

We should hide implementation using helper functions.
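
One possible set of index helpers for this array layout (parent and last are the names used by heapify below):

def parent(i):
    return (i - 1) // 2

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

def last(n):
    # Index of the last node in a heap of size n.
    return n - 1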

Operations in Binary Heaps

Insertion in Heaps
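
A sketch of insertion into a max-heap stored in a Python list A, using the helpers above: append the new key at the end (the only spot that preserves the structural property) and bubble it up.

def fix_up(A, i):
    # Swap A[i] with its parent while it violates the heap-order property.
    while i > 0 and A[i] > A[parent(i)]:
        A[i], A[parent(i)] = A[parent(i)], A[i]
        i = parent(i)

def heap_insert(A, key):
    A.append(key)
    fix_up(A, len(A) - 1)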

Delete in Heaps
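
A matching sketch of deleteMax: swap the root with the last entry, remove it, and bubble the new root down (heapify below calls fix_down with this same signature). Both insert and deleteMax perform O(\log n) swaps since they follow a root-to-leaf path.

def fix_down(A, n, i):
    # Swap A[i] with its larger child while it violates the heap-order property.
    while left(i) < n:
        j = left(i)
        if right(i) < n and A[right(i)] > A[j]:
            j = right(i)
        if A[i] >= A[j]:
            break
        A[i], A[j] = A[j], A[i]
        i = j

def deleteMax(A):
    A[0], A[last(len(A))] = A[last(len(A))], A[0]
    item = A.pop()
    if A:
        fix_down(A, len(A), 0)
    return item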

HeapSort is not used as often as MergeSort or QuickSort because it has poor data locality.

Building Heaps by Bubble-Down

Given an array of items, build a heap out of them.

def heapify(A):
    n = len(A)
    for i in reversed(range(parent(last(n)) + 1)):
        fix_down(A, n, i)

Careful analysis yields a worst-case complexity of \Theta(n).

Whenever you reach node i, you know that the subtrees are both heaps because you have fixed them before.

Let T(n) be the worst-case runtime of heapify for an array of size n. It is related to the worst-case number of swaps S(n), so T(n) \in \Theta(S(n)).

S(1) = 0.
S(2) = 1 because we bubble-down the root.
S(3) = 1 because we bubble-down the root.
S(4) = 1 + 2 = 3.
S(5) = 3.

Suppose we add n new nodes to a heap of n nodes (going from n to 2n). The new nodes form a layer of leaves. In the worst case, every original node now performs one more swap, because its bubble-down path is one longer. The new nodes perform no swaps because they are leaves.

S(2n) = S(n) + n

For n a power of 2, n = 2^k.

\begin{aligned} S(n) = S(2^k) &= S(2^{k - 1}) + 2^{k - 1} \\ &= S(2^{k - 2}) + 2^{k - 2} + 2^{k - 1} \\ &= \sum_{i = 0}^{k - 1} 2^i \\ &= 2^k - 1 \\ &= n - 1 \end{aligned}

Therefore T(n) \in \Theta(n) for n = 2^k.

Selection

The kth-max problem asks to find the kth largest item in an array A of n numbers.

  1. Scan the array and maintain the k largest numbers seen so far in a min-heap (see the sketch after this list). \Theta(n\log(k)).
  2. Make a max-heap by calling heapify(A). Call deleteMax(A) k times. \Theta(n + k\log(n)).
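
A sketch of the first approach using Python's heapq (which provides a binary min-heap):

import heapq

def kth_max(A, k):
    # Maintain a min-heap of the k largest elements seen so far;
    # its root is then the kth largest element overall.
    heap = []
    for x in A:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)   # discard the smallest of the k + 1
    return heap[0]                # Theta(n log k) heap operations in total

assert kth_max([30, 10, 50, 20, 40], 2) == 40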

Sorting and Randomized Algorithms

QuickSelect

Given an array A of n numbers and 0 \le k < n, find the element that would be at position k of the sorted array.

The best heap-based algorithm had a running time of O(n + k\log n). For median finding, this is O(n\log n) which is the same cost as our best sorting algorithms.

The QuickSelect algorithm can do this in linear time.

QuickSelect and the related QuickSort rely on two subroutines.

  1. choose_pivot(A): Choose an index p in A.
  2. partition(A, p): Rearrange A and return pivot index i such that A[i] = v, A[0, ..., i - 1] \le v, v \le A[i + 1, ..., n - 1].

Partitioning in-place can be done by using two pointers i = 0, j = n - 1, and moving them to the middle, swapping while they are on the wrong side.

def partition(A, p):
    n = len(A)
    A[n - 1], A[p] = A[p], A[n - 1]    # move the pivot to the last position
    v = A[n - 1]
    i, j = 0, n - 2
    while True:
        while i < n and A[i] < v:
            i += 1
        while j >= 0 and A[j] > v:
            j -= 1
        if i >= j:
            break
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1

    A[i], A[n - 1] = A[n - 1], A[i]    # move the pivot into its final position
    return i

QuickSelect Algorithm

def quick_select_1(A, k):
    p = choose_pivot(A)
    i = partition(A, p)
    if i == k:
        return A[i]
    elif i > k:
        return quick_select_1(A[0 : i], k)
    else:
        return quick_select_1(A[i + 1 :], k - i - 1)
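
A quick usage check (choose_pivot is left unspecified above; a random pivot, matching the randomized variant discussed later, is assumed here):

import random

def choose_pivot(A):
    return random.randrange(len(A))

A = [30, 60, 10, 0, 50, 20, 40]
assert quick_select_1(A[:], 3) == 30   # sorted(A)[3] == 30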

Analysis of quick_select_1

Worst-Case Analysis: The recursive calls could always have size n - 1. The recurrence is T(n) = \begin{cases} T(n - 1) + cn, &n \ge 2 \\ c, &n = 1 \end{cases} for some constant c > 0. Solution: T(n) = cn + c(n - 1) + ... + 2c + c \in \Theta(n^2).

Best-Case Analysis: The first chosen pivot could be the kth element. Then there are no recursive calls, so the total cost is \Theta(n).

Sorting Permutations

We want to take the average runtime of quick_select_1 over all inputs.

How do we characterize all inputs of size n? We make the simplifying assumption that all input numbers are distinct. quick_select_1 does not depend on the actual values, only on their relative order. Therefore, we can characterize an input by its sorting permutation: the permutation that would put the input in order.

If we assume that all n! permutations are equally likely, the average cost is the sum of the total cost for all permutations, divided by n!.

Define T(n) to be the average-case cost for selecting from a size-n array. Fix one index 0 \le i \le n - 1. There are (n - 1)! permutations for which the pivot value v ends up at index i of the sorted order.

\begin{aligned} T(n, k) &= \frac{1}{n!}\sum_{\text{permutations A of } \{1, ..., n\}} T(A, k) \\ T(n) &= \max_{k} T(n, k) \\ &\le c\cdot n + \frac{1}{n}\sum_{i = 0}^{n - 1}\max\{T(i), T(n - i - 1)\} \\ \end{aligned}

Theorem: T(n) \in \Theta(n).

Sloppy Proof: We want T(n) \le \lambda n. Suppose T(i) \le \lambda i, i < n. \begin{aligned} T(n) &\le cn + \frac{1}{n}\sum_{i = 0}^{n - 1}\max \{\lambda i, \lambda(n - i - 1)\} \\ &\le cn + \frac{\lambda}{n}\sum_{i = 0}^{n - 1} \max \{i, n - i - 1\} \end{aligned} We can see visually (taking max of i and n - i - 1) that \sum \approx \frac{3}{4}n^2. Therefore we have T(n) \le cn + \frac{\lambda}{n}\frac{3}{4}n^2 = (c + \frac{3}{4}\lambda) n. We want T(n) \le \lambda n, let us take \lambda such that c + \frac{3}{4}\lambda = \lambda \Leftrightarrow \lambda = 4c.

Clean Proof: We can redo the previous argument as an induction proof that T(n) \le 4cn. We also have to prove that \sum_{i = 0}^{n - 1} \max\{i, n - i - 1\} \le \frac{3}{4}n^2, which can be done by a case analysis on whether n is even or odd.

Randomized Algorithms

No more bad instances, just unlucky numbers.

Expected Running Time

Define T(I, R) to be the running time of the randomized algorithm on instance I with sequence of random numbers R.

T^{(exp)}(I) = E[T(I, R)] = \sum_{R}T(I, R)P(R).

T^{(exp)}_{worst}(n) = \max_{\{I: size(I) = n\}}T^{(exp)}(I).

T^{(exp)}_{avg}(n) = \frac{1}{|\{I: size(I) = n\}|}\sum_{\{I : size(I) = n\}}T^{(exp)}(I).

Average-Case Analysis of QuickSort

\begin{cases} T(n) = cn + \frac{1}{n}\sum_{i = 0}^{n - 1}(T(i) + T(n - (i + 1))) \\ T(0) = T(1) = 0 \end{cases}

We prove that T(n) \le 2cn\log(n), n \ge 1.

Define S(n) = \sum_{i = 1}^{n - 1}i\log(i). We claim that S(n) \le \frac{1}{2}n^2\log(n) - \frac{1}{4}n^2. We skip the proof because it is too difficult.

Assume that for i = 1, ..., n - 1, T(i) \le 2ci\log(i).

\begin{aligned} T(n) &= cn + \frac{1}{n}\sum_{i = 0}^{n - 1}T(i) + \frac{1}{n}\sum_{i = 0}^{n - 1}T(n - (i + 1)) \\ &= cn + \frac{2}{n}\sum_{i = 0}^{n - 1}T(i) \\ &= cn + \frac{2}{n} \sum_{i = 1}^{n - 1}T(i) \\ &\le cn + \frac{2}{n} \sum_{i = 1}^{n - 1}2ci\log(i) \\ &\le cn + \frac{4c}{n} \sum_{i = 1}^{n - 1}i\log(i) \\ &\le cn + \frac{4c}{n} S(n) \\ &\le cn + \frac{4c}{n} (\frac{1}{2}n^2\log(n) - \frac{1}{4}n^2) \\ &\le cn + (2cn\log(n) - cn) \\ &\le 2cn\log(n) \end{aligned}

The auxiliary space is \Theta(n) in the worst case. We can reduce this to \Theta(\log(n)) by recursing into the smaller sub-array first and replacing the other recursive call with a while loop.

def quicksort_iterative(A, l, r):
    while l < r:
        i = partition(A, l, r)
        if i - l < r - i:
            quicksort_iterative(A, l, i - 1)
            l = i + 1
        else:
            quicksort_iterative(A, i + 1, r)
            r = i - 1

One should stop recursing when n \le 10. Then one run of InsertionSort at the end sorts everything in O(n) because all items are within 10 units of their required position.

Arrays with many duplicates can be dealt with more efficiently by creating a third partition for equals.

Comparison Model

In the comparison model, data can only be accessed in 2 ways.

  1. Comparing two elements.
  2. Moving elements around (e.g. copying, swapping).

We can represent comparison-based algorithms as decision trees.

Theorem: Any correct comparison-based sorting algorithm requires \Omega(n\log n) comparisons in the worst case.

In a decision tree that sorts n integers, there are at least n! leaves.

We claim that a binary tree of height h with L leaves has L \le 2^{h + 1}. Indeed, L is at most the total number of nodes, which is at most the number of nodes in a complete binary tree of height h, namely 2^{h + 1} - 1.

\begin{aligned} n! &\le 2^{h + 1} \\ \log(n!) &\le h + 1 \\ \end{aligned} We know that \log(n!) \in \Theta(n\log n) so h \in \Omega({n \log n}).

Non-Comparison-Based Sorting

Bucket Sort

def bucket_sort(A, d):
    Initialize an array B[0 .. R - 1] of empty lists.
    for i in range(len(A)):
        B[dth digit of A[i]].append(A[i])
    i = 0
    for j in range(R):
        while len(B[j]) > 0:
            A[i] = B[j].pop(0)  # take items in FIFO order to keep the sort stable
            i += 1

This is stable, equal items stay in order. The runtime is O(n + R), auxiliary space \Theta(n + R).

Count Sort

def key_indexed_countsort(A, d):
    count = [0] * R
    for i in range(len(A)):
        count[dth digit of A[i]] += 1
    
    idx = [0] * R
    for i in range(1, R):
        idx[i] = idx[i - 1] + count[i - 1]

    aux = [0] * len(A)
    for i in range(len(A)):
        aux[idx[dth digit of A[i]]] = A[i]
        idx[dth digit of A[i]] += 1

    A[:] = aux

def MSD_radixsort(A, l, r, d):
    if l < r:
        key_indexed_countsort(A[l .. r], d)
        if d < m:
            for i in range(R):
                l_i, r_i = boundaries of ith bin
                MSD_radixsort(A, l_i, r_i, d + 1)

A drawback is that we have many recursive calls.

def LSD_radixsort(A):
    for d in reversed(range(0, m)):
        key_indexed_countsort(A, d)

Time complexity is \Theta(m(n + R)). Auxiliary memory is O(n + R). This can be o(n\log(n)) for special cases.

Dictionaries and Balanced Search Trees

Dictionary ADT

  1. search(k)
  2. insert(k, v)
  3. delete(k)

Common assumptions are that the dictionary has n KVPs, each of which uses constant space. We also assume that keys can be compared in constant time.

Binary Search Trees

search(k): Start at the root, compare k to the current node. Stop if found or node is external, else recurse at child.

insert(k, v): Search for k, then insert (k, v) as new node.

delete(k): Search for node x that contains the key k.

  1. If x is a leaf, then delete it.
  2. If x has one child, then move that child up.
  3. If x has two children, swap the key at x with the key at its successor (or predecessor) node and then delete that node.

Remark: The successor of node x does not have to be a leaf but it cannot have a left child.

Height of a BST

All of the functions above have cost \Theta(h), where h is the height of the tree.

If n items are inserted, the worst case is \Theta(n). The best case is \Theta(\log(n)) because any binary tree with n nodes has height \ge \log(n + 1) - 1. It can be shown that the average case is \Theta(\log(n)).

AVL Trees

Introduced by Adel’son-Vel’skiĭ and Landis in 1962.

AVL Trees are BSTs with an additional height-balance property: at every node, the heights of the left subtree L and right subtree R differ by at most 1. The height of an empty tree is defined to be -1.

At each non-empty node, we require height(R) - height(L) \in \{-1, 0, 1\}.

We need to store the height of the subtree rooted at each node. It can be shown that it suffices to store only height(R) - height(L) at each node, which uses fewer bits but makes the code more complicated (especially for deletion).

Height of an AVL Tree

Theorem: an AVL Tree with n nodes has \Theta(\log(n)) height.

Proof: We prove the upper and lower bounds separately.

  1. The height h of an AVL with n keys is \Omega(\log(n)) because h \ge \log(n + 1) - 1.
  2. The height h of an AVL with n keys is O(\log(n)).

    Define N_h as the minimum number of keys in an AVL of height h. Our plan is to prove N_h \in \Omega(c^h) for some constant c. Then we would have \log(n) \in \Omega(h).

    We see that N_h is related to the Fibonacci numbers: N_h = F_{h + 3} - 1. To build a minimal tree of height h, take a minimal tree of height h - 1 as one child and a minimal tree of height h - 2 as the other, which gives the recurrence N_{h} = N_{h - 1} + N_{h - 2} + 1 (it could also be solved with generating series). We show by induction that N_h = F_{h + 3} - 1.

    \begin{aligned} N_{h + 2} &= N_{h + 1} + N_{h} + 1 \\ &= (F_{h + 4} - 1) + (F_{h + 3} - 1) + 1 \\ &= F_{h + 4} + F_{h + 3} - 1 \\ &= F_{h + 5} - 1\end{aligned}

    We claim F_h \in \Theta(\phi^h), where \phi = \frac{1 + \sqrt 5}{2}. Then N_h \in \Theta(\phi^h) and \log(N_h) \in \Theta(h). For any AVL tree with height h and n nodes, \log(n) \ge \log(N_h), so \log(n) \in \Omega(h). Therefore h \in O(\log(n)).
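
A small check of the recurrence and the Fibonacci identity (a sketch):

def N(h):
    # Minimum number of nodes in an AVL tree of height h.
    if h == 0:
        return 1
    if h == 1:
        return 2
    return N(h - 1) + N(h - 2) + 1

def F(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

assert all(N(h) == F(h + 3) - 1 for h in range(25))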

AVL Insertion

  1. Insert (k, v) into T with usual BST insertion.
  2. We assume that this returns the new leaf z where the key was stored.
  3. Move up the tree from z and update heights.
  4. If the height difference becomes \pm 2 at node z, then z is unbalanced. We must re-structure the tree to rebalance.
def AVL_insert(r, k, v):
    z = BST_insert(r, k, v)
    z.height = 0
    while z is not None:
        set_height_from_children(z)
        if abs(z.left.height - z.right.height) == 2:
            AVL_fix(z)
            break
        else:
            z = z.parent

def set_height_from_children(u):
    u.height = 1 + max(u.left.height, u.right.height)

Fixing a slightly unbalanced AVL tree: AVL_fix is applied at a node z whose balance is \pm 2 but whose two subtrees are themselves AVL trees.

def AVL_fix(z):
    # Find child and grand-child that go deepest.
    if z.right.height > z.left.height:
        y = z.right
        if y.left.height > y.right.height:
            x = y.left
        else:
            x = y.right
    else:
        y = z.left
        if y.right.height > y.left.height:
            x = y.right
        else:
            x = y.left
    # Apply appropriate rotation to restructure at x, y, z
    ...

How do we “fix” an unbalanced AVL? Notice that there are many different BSTs with the same keys. Our goal is to change the structure among the three nodes without changing the order such that the subtree becomes balanced.

Right Rotation

def rotate_right(z):
    y = z.left
    z.left = y.right      # y's right subtree becomes the new left subtree of z
    y.right = z           # z becomes the new right child of y
    set_height_from_children(z)
    set_height_from_children(y)
    return y              # y is the new root of the subtree

Recall that we need to update the parent links as well!

Double Right Rotation

  1. Left rotation at y.
  2. Right rotation at z.

Rule: The median of x, y, z becomes the new root.

AVL Deletion

Other Dictionary Implementations

Arrays are a simple and popular implementation. Can we do something to make the search more effective in practice?

Yes, if the items are not equally likely to be accessed and we know their access-probability distribution. The intuition is that frequently accessed items should be near the front.

Optimal Static Ordering

Is this better than our linear time solution?

Consider L=[x_1, ..., x_n] with P(x_1) = \frac{1}{2}, P(x_2) = \frac{1}{4}, …, P(x_{n-1}) = P(x_n) = \frac{1}{2^{n-1}}. Then E(L) = 1\cdot\frac{1}{2} + 2\cdot\frac{1}{4} + ... + (n-1)\cdot \frac{1}{2^{n-1}} + n\cdot\frac{1}{2^{n-1}}, and \lim_{n \to \infty}E(L) = 2.
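
A numeric check of this expected cost (a sketch):

def expected_cost(n):
    # P(x_i) = 1/2^i for i = 1..n-1, and P(x_n) = 1/2^(n-1).
    probs = [1 / 2 ** i for i in range(1, n)] + [1 / 2 ** (n - 1)]
    return sum(position * p for position, p in enumerate(probs, start=1))

print(expected_cost(30))   # approximately 2; the limit as n grows is exactly 2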

Claim: Over all possible static orderings, the one that sorts items by non-increasing access probability minimizes the expected access cost. This can be proved by contradiction.

Dynamic Ordering

Performance of Dynamic Ordering

Skip Lists

Randomized data structure for dictionary ADT.

Search in Skip Lists

def skip_search(L, k):
    p = topmost left position of L
    P = [p]
    while below(p) is not None:
        p = below(p)
        while key(after(p)) < k:
            p = after(p)
        P.append(p) 
    return P

Insert in Skip Lists

Delete in Skip Lists

Expected Performance

Dictionaries for Special Keys

Theorem: In the comparison model, \Omega(\log(n)) comparisons are required to search a size n dictionary.

Instead of binary search, if we also know the values of the keys (not just their relative order), we can make a better estimate of where to probe next (interpolation search).

If we are searching between indices l, r of array A for key k, we compare at index l + \lfloor \frac{k - A[l]}{A[r] - A[l]}(r-l) \rfloor. This approach works well if the keys are uniformly distributed.
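
A sketch of the probe-index computation on uniformly spaced keys:

def interpolation_probe(A, l, r, k):
    # Index to compare against when searching for k in A[l .. r].
    return l + (k - A[l]) * (r - l) // (A[r] - A[l])

A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(interpolation_probe(A, 0, 9, 70))   # 6, exactly where 70 is stored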

But, worst-case performance is \Theta(n).

Trie (Radix Tree)

Dictionary for binary strings based on bitwise comparisons.

Prefix of a string S[0 .. n - 1] is a substring S[0..i] of S for some 0 \le i < n.

Prefix-Free: there is no pair of binary strings in the dictionary where one is the prefix of the other.

We assume that our dictionaries are prefix-free. This is always satisfied if all strings have the same length or are terminated with a unique character ‘$’.

Insert: We search for x in the trie, and expand it by adding the missing nodes.

Delete: We search for x in the trie and then delete the node, and all dangling nodes created by the deletion.

Time complexity for all operations on string s over alphabet \Sigma are O(|s|).

Trie Variations

  1. No Leaf Labels.

    We do not have to store the keys in the leaves; each key is implicitly stored along the path to its leaf. This halves the amount of storage needed.

  2. Remove Chains to Labels.

    We can stop adding nodes to the trie as soon as the key is unique. This saves space if there are only a few strings that are long.

    Note: This cannot be combined with no leaf labels as we would not be able to search.

  3. Allow Proper Prefixes.

    We can allow internal nodes to be keys by adding a flag at each node indicating whether it represents a key in the dictionary.

Compressed Tries (Patricia)

Idea: Eliminate unflagged nodes to reduce space.

def patricia_search(v, x):
    if v.is_leaf():
        return v.key() == x
    
    d = v.index()
    c = v.child(x[d])
    if c is None:
        return "Not Found!"
    return patricia_search(c, x)

Delete

Insert

Multiway Tries

Hashing

Direct Addressing

Suppose every key k is an integer with 0 \le k < M. We can implement a dictionary easily using an array A of size M that stores (k,v) via A[k] = v; each operation then takes O(1) time, but the space is \Theta(M).

Direct addressing is not possible if keys are not integers, and the storage is very wasteful if n \ll M.

Hashing: Map the keys to a small range of integers and then use direct addressing.

Collisions

Generally, a hash function h is not injective, so many keys can map to the same integer. We get a collision when we want to insert (k,v) into the table but T[h(k)] is already occupied.

Assume we have n random integers in \{0, ..., M - 1\}. P(\text{no collisions}) = \frac{M^{(n)}}{M^n} = (1)\left(1 - \frac{1}{M}\right)\left(1 - \frac{2}{M}\right)\cdots\left(1-\frac{n-1}{M}\right). There are two basic strategies for resolving collisions:

  1. Allow multiple items at each table location (chaining).
  2. Allow each item to go into multiple locations (open addressing).

Load Factor and Re-Hashing

We evaluate strategies by the cost of search, insert, and delete functions. This evaluation is done in terms of the load factor \alpha = \frac{n}{M}.

We keep the load factor small by rehashing when needed.

Separate Chaining

Each table entry is a bucket containing 0 or more KVPs. This could be implemented by any dictionary (or even another hash-table). The simplest approach is to use an unsorted linked list in each bucket. This is called collision resolution by separate chaining.

Complexity of Chaining

Recall the load factor \alpha = \frac{n}{M}. We make the Uniform Hashing Assumption: each key is equally likely to hash to any of the M slots, independently of the other keys.

Proposition: Under uniform hashing, the average length of a bucket is \alpha = \frac{n}{M}.

We define X_{i,j} = \begin{cases}1, &\text{if } h(k_i) = j \\ 0, &\text{otherwise}\end{cases} as the indicator that element i lands in bucket j. The length of bucket j is then \sum_{i = 1}^n X_{i,j}.

\begin{aligned}E(\text{length } B_j) &= E(\sum_{i = 1}^n X_{i,j}) \\ &= \sum_{i = 1}^n E(X_{i,j}) \\ &= \sum_{i=1}^n \frac{1}{M} \\ &= \frac{n}{M} = \alpha\end{aligned}

If we maintain \alpha \in \Theta(1), then the average costs are \Theta(1) and space is \Theta(n).

Open Addressing

Each hash table entry holds only one item, but any key k can go in multiple locations.

search and insert follow a probe sequence of possible locations for key k until an empty spot is found.

delete becomes problematic. We cannot leave an empty spot behind, because the next search might not go far enough.

  1. We can move later items in the probe sequence forward.
  2. Lazy deletion: Mark the spot as deleted rather than empty, and continue searching past deleted spots.

The simplest method for open addressing is linear probing: h(k,i) = (h(k)+i) \mod M, for some hash function h.

def insert(T, (k, v)):
    for j in range(M):
        if T[h(k, j)] is "empty" or "deleted":
            T[h(k, j)] = (k, v)
            return "success"
    return "failure"

def search(T, k):
    for j in range(M):
        if T[h(k, j)] is "empty":
            return "not found"
        elif key(T[h(k, j)]) == k:
            return T[h(k, j)]
    return "not found"

Independent Hash Functions

Double Hashing

In double hashing the probe sequence is h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod M. Suppose M = 16, h_1(k) = 3, h_2(k) = 6. Then the probe sequence cycles after \frac{16}{\gcd(6, 16)} = 8 distinct slots. If M is prime (and h_2(k) \not\equiv 0 \pmod M), the cycle length is guaranteed to be M.
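
A quick check of the cycle length in this example (a sketch):

from math import gcd

M, h1, h2 = 16, 3, 6
probes = [(h1 + i * h2) % M for i in range(M)]
print(probes)                              # 3, 9, 15, 5, 11, 1, 7, 13, then it repeats
print(len(set(probes)), M // gcd(h2, M))   # 8 8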

Cuckoo Hashing

def cuckoo_insert(T, x):
    y = x
    i = h_1(x.key)
    for _ in range(n):
        swap(y, T[i])
        if y is "empty":
            return "success"
        if i == h_1(y.key):
            i = h_2(y.key)
        else:
            i = h_1(y.key)
    return "failure"

Complexity of Open Addressing Strategies

| Strategy | Search | Insert | Delete |
|---|---|---|---|
| Linear Probing | \frac{1}{1 - \alpha} | \frac{1}{(1 - \alpha)^2} | \frac{1}{1 - \alpha} |
| Double Hashing | \frac{1}{1 - \alpha} | \frac{1}{1 - \alpha} | \frac{1}{\alpha}\log\left(\frac{1}{1 - \alpha}\right) |
| Cuckoo Hashing | 1 | \frac{\alpha}{(1 - 2\alpha)^2} | 1 |

Summary: All operations have O(1) average-case run-time if the hash-function is uniform and \alpha is sufficiently small.

Multi-Dimensional Data

What if the keys are multi-dimensional, such as strings in \Sigma^*?

Choosing Good Hash Functions

  1. Modular Method: h(k) = k \mod M, M should be prime.
  2. Multiplicative Method: h(k) = \lfloor M (kA - \lfloor kA \rfloor)\rfloor, for some constant A with 0 < A < 1.

Universal Hashing

We again shift the average-case performance to expected performance via randomization.

h(k) = ((ak + b) \mod p) \mod M, where p is a prime larger than the maximum possible key, and a \in \{1, ..., p - 1\}, b \in \{0, ..., p - 1\} are chosen uniformly at random.

We can prove that for any fixed numbers x \neq y, the probability of a collision using this function h is at most \frac{1}{M}. So, the expected runtime is O(1) if we keep the load factor small enough.
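
A sketch of drawing such a function at random (the prime p = 10^9 + 7 is an arbitrary assumed choice, taken to be larger than every key):

import random

def random_universal_hash(M, p=10 ** 9 + 7):
    # h(k) = ((a*k + b) mod p) mod M with random a in [1, p-1], b in [0, p-1].
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % M

h = random_universal_hash(11)
print(h(42), h(1000000))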

Range Searching

Quadtrees

2D Range Trees

We are given n points in a sorted array with respect to x. We can build a balanced BST for these points in time O(n).

Recursively, we spend O(1) at every node, total O(n).

A range tree is a tree of trees. The primary structure is a BST T that uses the x-coordinates as keys. Each node v \in T stores an associated structure T(v): a BST over the points in v's subtree, keyed by y-coordinate.

Space Analysis

Let S(n) be the space complexity of a range tree over n points. S(n) satisfies S(n) = 2S\left(\frac{n}{2}\right) + O(n), so S(n) \in O(n\log(n)).

Range tree space usage is O(n\log(n)).

Dictionary Operations

We have a problem with balancing. If we use AVL trees, insertions / deletions are very slow (a rotation at v changes P(v) and hence requires rebuilding T(v)). Instead of rotations we can do something similar to kd-trees: allow a certain imbalance, and rebuild the entire subtree when it is violated.

def build(P):
    # O(|P| log |P|) for the two initial sorts
    x_list = sorted(P, key=lambda p: p.x)
    y_list = sorted(P, key=lambda p: p.y)
    return build_recursive(x_list, y_list)

def build_recursive(x_list, y_list):
    if not x_list:
        return None
    n = len(x_list)
    T = RangeTree()
    # O(1): the median point (by x) becomes the key of the root
    T.key = x_list[n // 2]
    # O(|y_list|): associated structure over the same points, keyed by y
    T.y_tree = BST(y_list)
    # O(|x_list|): split around the median, excluding the root's point
    x_left = x_list[0 : n // 2]
    x_right = x_list[n // 2 + 1 :]
    # O(|y_list|): the splits stay sorted by y
    y_left = [p for p in y_list if p.x < T.key.x]
    y_right = [p for p in y_list if p.x > T.key.x]
    # We have |x_left| = |y_left| and |x_right| = |y_right|
    T.left = build_recursive(x_left, y_left)
    T.right = build_recursive(x_right, y_right)
    return T

We see that the runtime of build_recursive is T(n) = 2T\left(\frac{n}{2}\right) + O(n), so T(n) \in O(n\log(n)).

Time for range-query in range tree is O(s + \log^2(n)). This can be further reduced to O(s + \log(n)).

Higher Dimensions

Comparison of Range Query Data Structures

String Matching

Brute Force

Example: T = abbbababbab, P = abba.
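
A sketch of the brute-force approach, checked on this example:

def brute_force_match(T, P):
    n, m = len(T), len(P)
    for k in range(n - m + 1):          # try every possible guess k
        if T[k : k + m] == P:
            return k
    return -1

assert brute_force_match("abbbababbab", "abba") == 6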

Rabin-Karp Algorithm

Boyer-Moore Algorithm

Brute-force search with two changes.

  1. Reverse-order searching: Compare P with a guess moving backwards.
  2. Bad character jumps: When a mismatch occurs, eliminate the guesses for which P does not agree with this character of T.
def BoyerMoore(T, P):
    n, m = len(T), len(P)
    # Last-occurrence map computed from P: L[c] is the largest index i with P[i] == c.
    L = {c: i for i, c in enumerate(P)}
    k = 0  # Current guess.

    while k <= n - m:
        j = m - 1
        while j >= 0:
            if T[k + j] != P[j]:
                break
            j -= 1

        if j == -1:
            return k
        else:
            k = k + max(1, j - L.get(T[k + j], -1))

    return -1  # FAIL

Suffix Tree

Claim: The suffix tree for a string of length n has \Theta(n) nodes.

  1. Suffix tree has n+1 leaves.
  2. Suffix tree has at most n internal nodes.

    Key Property: Every internal node in a suffix tree has at least 2 children.

    We prove by induction that a tree with m leaves satisfying this property has at most m - 1 internal nodes.

    Consider a tree with m + 1 leaves whose root has s \ge 2 children; each of the s subtrees satisfies the induction hypothesis. If subtree t has m_t leaves, it has at most m_t - 1 internal nodes, so summing over the subtrees and adding the root gives at most 1 + \sum_t (m_t - 1) = 1 + (m + 1) - s \le m internal nodes.

Compression

Judging Encoding Schemes

Types of Data Compression

Decoding

The code must be uniquely decodable.

Huffman Encoding

Huffman’s Algorithm

  1. Determine frequency of every c \in \Sigma_S.
  2. For each c \in \Sigma_S, create a height-0 trie holding c.
  3. Assign a “weight” to each trie which is the sum of all frequencies of all letters in the trie.
  4. Find two tries with the minimum weight.
  5. Merge these tries with a new interior node.
  6. Repeat steps 4, 5 until there is only 1 trie left. This is D.

We should store the tries in a min-ordered heap, keyed by weight, to make this efficient.
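
A sketch of the construction using Python's heapq as the heap of tries (here a trie is represented as either a single character or a pair of sub-tries, and the counter only serves as a tie-breaker so heap entries stay comparable):

import heapq
from collections import Counter

def huffman_trie(S):
    freq = Counter(S)
    # Heap entries: (weight, tie_breaker, trie).
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two tries of minimum weight
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    return heap[0][2]

print(huffman_trie("GREENENERGY"))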

Run-Length Encoding

Lempel-Ziv-Welch

Observation: Certain substrings are much more frequent than others.

We want to take advantage of such substrings without needing to know beforehand what they are.

Compression Summary

| Huffman | Run-Length Encoding | Lempel-Ziv-Welch |
|---|---|---|
| Variable-length | Variable-length | Fixed-length |
| Single-character | Multi-character | Multi-character |
| 2-pass | 1-pass | 1-pass |
| 60% compression on English text | Bad on text | 45% compression on English text |
| Optimal 01-prefix-code | Good on long runs (e.g. pictures) | Good on English text |
| Must send dictionary | Can be worse than ASCII | Can be worse than ASCII |
| Rarely used directly | Rarely used directly | Frequently used |
| JPEG, MP3 | Fax machines, old picture-formats | GIF, Unix compress |

bzip2

  1. Burrows-Wheeler Transform.

  2. MTF Transform.

  3. Modified RLE.

  4. Huffman Encoding.

MTF Transform

This takes advantage of locality in the data. If we see a character now, we might see it again soon.

Specifics: MTF is an adaptive text-transform algorithm.

If the source alphabet is \Sigma_S with size m, then the coded alphabet will be \Sigma_C = \{0, ..., m - 1\}.
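
A sketch of the MTF encoder over a fixed source alphabet:

def mtf_encode(S, alphabet):
    # Emit the current index of each character, then move it to the front.
    working = list(alphabet)
    out = []
    for c in S:
        i = working.index(c)
        out.append(i)
        working.pop(i)
        working.insert(0, c)
    return out

print(mtf_encode("banana", "abcdefghijklmnopqrstuvwxyz"))   # [1, 1, 13, 1, 1, 1]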

Burrows-Wheeler Transform

Crucial Ingredient: A cyclic shift of a string X of length n is the concatenation of X[i + 1..n - 1] and X[0..i], for 0 \le i < n.

  1. Write all cyclic shifts.
  2. Sort cyclic shifts.
  3. Extract last characters from sorted shifts.
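
A sketch of the encoder following these three steps (the input is assumed to end with the sentinel '$'):

def bwt_encode(S):
    n = len(S)
    shifts = [S[i:] + S[:i] for i in range(n)]     # 1. all cyclic shifts
    shifts.sort()                                  # 2. sort them
    return "".join(row[-1] for row in shifts)      # 3. last characters

print(bwt_encode("banana$"))   # annb$aa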

BWT Decoding

Given C, we can reconstruct the first and last column of the array of cyclic shifts by sorting.

  1. Last Column: C.
  2. First Column: C sorted.
  3. Disambiguate by row-index.
  4. Starting from \$, recover S.

Encoding Cost: O(n^2) using MSD radix sort, often better.

Decoding Cost: O(n).

Encoding and decoding both use O(n) space.