CS 240

Algorithm Design

Problem: Given a problem instance, carry out a particular computational task.

Problem Instance: Input.

Problem Solution: Output.

Size of a problem instance: Size(I) is a positive integer that measures the size of instance I.

Algorithm: Step-by-step process for carrying out a series of computations, given an arbitrary problem instance I.

For a problem \Pi, we can have several algorithms. For an algorithm A solving \Pi, we can have several programs (implementations).

Analysis of Algorithms I

We may be interested in the amount of time or memory used. Experimental studies have shortcomings: results depend on the implementation and hardware, and we cannot test against all inputs.

Instead we analyze algorithms on an idealized Random Access Machine (RAM) model.

The efficiency is measured in terms of its growth rate (this is called the complexity of the algorithm).

# Multiply two n x n matrices a and b into c.
for i in range(n):
    for j in range(n):
        c[i][j] = 0
        for k in range(n):
            c[i][j] += a[i][k] * b[k][j]

About n^3 operations.

Asymptotic Notation

O-notation: f(n) \in O(g(n)) if there exist constants c > 0, n_0 > 0 such that |f(n)| \le c|g(n)| \; \forall n \ge n_0.

Example: f(n) = 75n + 500 and g(n) = 5n^2.

f(n), g(n) \ge 0 \forall n \ge 1
For n \ge 20
n^2 \ge 20n
5n^2 \ge 100n = 75n + 25n
25n \ge 25 \cdot 20 = 500
5n^2 \ge 75n + 500
g(n) \ge f(n)
Taking n_0 = 20, c = 1, this proves that f(n) \in O(g(n)).

We want a tight asymptotic bound.

\Omega-notation: f(n) \in \Omega(g(n)) if \exists c > 0, n_0 > 0 such that c|g(n)| \le |f(n)| \; \forall n \ge n_0.

\Theta-notation: f(n) \in \Theta(g(n)) if \exists c_1, c_2 > 0, n_0 > 0 such that c_1|g(n)| \le |f(n)| \le c_2|g(n)| \; \forall n \ge n_0.

How do we express that f(n) is asymptotically strictly smaller than g(n)?

o-notation: f(n) \in o(g(n)) if \forall c > 0, \exists n_0 > 0 such that |f(n)| < c|g(n)| \; \forall n \ge n_0.

\omega-notation: f(n) \in \omega(g(n)) if \forall c > 0, \exists n_0 > 0 such that 0 \le c|g(n)| < |f(n)| \; \forall n \ge n_0.

Example: f(n) = 2000n^2 + 5000n and g(n) = n^3. Prove f(n) \in o(g(n)).

Given c > 0: for n \ge 1, f(n) = 2000n^2 + 5000n \le 7000n^2.
7000n^2 < cn^3 \Leftrightarrow 7000 < cn \Leftrightarrow n > \frac{7000}{c}
So if we take n_0 = \frac{7000}{c} + 1, we have f(n) < c\,g(n) \; \forall n \ge n_0. This proves that f(n) \in o(g(n)).

Asymptotic Identities

f(n) \in \Theta(g(n)) \Leftrightarrow g(n) \in \Theta(f(n))
f(n) \in O(g(n)) \Leftrightarrow g(n) \in \Omega(f(n))
f(n) \in o(g(n)) \Rightarrow f(n) \in O(g(n)) and f(n) \notin \Omega(g(n))
f(n) \in \omega(g(n)) \Rightarrow f(n) \in \Omega(g(n)), \Rightarrow f(n) \notin O(g(n))
Identity: f(n) \in \Theta(f(n))
Maximum: f(n) > 0, g(n) > 0 \forall n \ge n_0 \Rightarrow O(f(n) + g(n)) = O(\max\{f(n), g(n)\}). Similar for \Omega.
Transitivity: f(n) \in O(g(n)), g(n) \in O(h(n)) \Rightarrow f(n) \in O(h(n)). Similar for \Omega.

Limit Test

Suppose L = \lim_{n \to \infty} \frac{f(n)}{g(n)}.

f(n) \in \begin{cases} o(g(n)), &L = 0 \\ \Theta(g(n)), &0 < L < \infty \\ \omega(g(n)), &L = \infty \end{cases}

Example: f(n) = \log(n) = \frac{\ln(n)}{\ln(2)}, g(n) = n.

By L'Hôpital's rule, \lim_{n \to \infty} \frac{\log(n)}{n} = \lim_{n \to \infty} \frac{1}{n\ln(2)} = 0, so f(n) \in o(g(n)) by the limit test.
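
A quick symbolic check of this limit (a sketch, assuming SymPy is available):

import sympy as sp

n = sp.symbols('n', positive=True)
L = sp.limit(sp.log(n, 2) / n, n, sp.oo)
print(L)  # 0, so log(n) is in o(n) by the limit test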

Analysis of Algorithms II

Example:

def Test1(n):
    sum = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum += (i - j) * (i - j)
    return sum

Let T_1(n) be the runtime of Test1(n).
T_1(n) \in \Theta(S_1(n)) where S_1(n) is the number of times we enter the body of the inner loop on line 4.

S_1(n) = \sum_{i = 1}^n \sum_{j = i}^n 1.

1st solution: Brute Force.

S_1(n) = \sum_{i = 1}^n (n - i + 1) = \sum_{i = 1}^n n - \sum_{i = 1}^n i + \sum_{i = 1}^n 1 = n^2 - \frac{n(n+1)}{2} + n = \frac{n^2}{2} + \frac{n}{2}.
Therefore we have S_1(n) \in \Theta(n^2).

2nd solution: Separate O and \Omega.

S_1(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n 1 = n^2 so S_1(n) \in O(n^2).
S_1(n) \geq \sum_{i = 1}^{n / 2} \sum_{j = i}^n 1 \geq \sum_{i = 1}^{n / 2} \sum_{j = n / 2}^n 1 \geq \frac{n^2}{4} so S_1(n) \in \Omega(n^2).
Therefore S_1(n) \in \Theta(n^2).
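
A brute-force check of the closed form (a small sketch that simply counts the inner-loop executions of Test1):

def S1(n):
    # Count how many times the body of Test1's inner loop runs.
    count = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            count += 1
    return count

# Matches n^2/2 + n/2 = n(n+1)/2 for small n.
assert all(S1(n) == n * (n + 1) // 2 for n in range(1, 50))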

Example:

def Test2(A, n):
    max = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum = 0
            for k in range(i, j + 1):
                sum += A[k]
            if sum > max:
                max = sum
    return max

Let T_2(n) be the runtime of Test2(A, n).
Then T_2(n) \in \Theta(S_2(n)), where S_2(n) is the number of times we enter the body of the inner loop on line 6.

S_2(n) = \sum_{i = 1}^n \sum_{j = i}^n \sum_{k = i}^j 1.

1st Solution: Brute Force.

Input the sum into Wolfram Alpha to see that S_2(n) = \frac{n^3}{6} + \text{lower-order terms}, so S_2(n) \in \Theta(n^3).

2nd Solution: Separate O and \Omega.

S_2(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n \sum_{k = 1}^n 1 = n^3 so S_2(n) \in O(n^3).
S_2(n) \geq \sum_{i = 1}^{n / 3} \sum_{j = i}^n \sum_{k = i}^j 1 \geq \sum_{i = 1}^{n / 3} \sum_{j = 2n / 3}^n \sum_{k = n / 3}^{2n / 3} 1 \geq \left(\frac{n}{3}\right)^3 \in \Omega(n^3)
Therefore S_2(n) \in \Theta(n^3).
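
The same style of check works for S_2 (the exact count is \binom{n+2}{3}, whose leading term is \frac{n^3}{6}):

from math import comb

def S2(n):
    # Count how many times the body of Test2's innermost loop runs.
    count = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            for k in range(i, j + 1):
                count += 1
    return count

assert all(S2(n) == comb(n + 2, 3) for n in range(1, 30))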

Example:

# Insertion Sort (sorts A[0 .. n-1] into non-decreasing order)
def Test3(A, n):
    for i in range(1, n):
        j = i
        while j > 0 and A[j] < A[j - 1]:
            A[j], A[j - 1] = A[j - 1], A[j]
            j -= 1

Let T_A(I) denote the running time of an algorithm A on instance I.
Worst-case: T_A(n) = \max \{T_A(I): Size(I) = n\}.
Average-case: T_A^{avg}(n) = \frac{1}{|\{I: Size(I) = n\}|} \sum_{\{I: Size(I) = n\}} T_A(I).

One should avoid comparing algorithms using O-notation alone, since O gives only an upper bound; use \Theta-bounds for meaningful comparisons.

MergeSort Example

Input: Array A of n integers.

  1. We split A into two subarrays. A_L consists of the first \lceil\frac{n}{2}\rceil elements and A_R consists of the last \lfloor\frac{n}{2}\rfloor elements.
  2. Recursively run MergeSort on A_L and A_R.
  3. After A_L and A_R are sorted, merge them together.
def MergeSort(A, l = 0, r = None):
    if r is None:
        r = len(A) - 1
    if r <= l:
        return
    else:
        m = (l + r) // 2
        MergeSort(A, l, m)
        MergeSort(A, m + 1, r)
        Merge(A, l, m, r)

def Merge(A, l, m, r):
    # Copy the array; strictly, only A[l .. r] needs to be copied.
    S = A.copy()
    i_L, i_R = l, m + 1
    for k in range(l, r + 1):
        if i_L > m:
            A[k] = S[i_R]
            i_R += 1
        elif i_R > r:
            A[k] = S[i_L]
            i_L += 1
        elif S[i_L] <= S[i_R]:
            A[k] = S[i_L]
            i_L += 1
        else:
            A[k] = S[i_R]
            i_R += 1

Merge takes \Theta(r - l + 1) = \Theta(n) time to merge n elements.
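
A quick usage check of the two routines above:

A = [5, 2, 9, 1, 7, 3]
MergeSort(A)
print(A)  # [1, 2, 3, 5, 7, 9]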

Analysis of MergeSort

Let T(n) denote the time to run MergeSort on an array of length n.

  1. Step 1 takes \Theta(n).
  2. Step 2 takes T(\lceil\frac{n}{2}\rceil) + T(\lfloor\frac{n}{2}\rfloor).
  3. Step 3 takes \Theta(n).

Sloppy Recurrence

T(n) = \begin{cases} 2T(\frac{n}{2}) + cn, &\text{if } n > 1 \\ c, &\text{if } n = 1 \end{cases}

Exact and sloppy recurrences are identical when n is a power of 2.
The recurrence can be easily solved by various methods when n = 2^k.

T(n) = 2T(\frac{n}{2}) + cn.
For n a power of 2, n = 2^k.
\begin{aligned} T(2^k) &= 2T(2^{k - 1}) + c2^k \\ &= 2(2T(2^{k - 2}) + c2^{k - 1}) + c2^k = 2^2T(2^{k - 2}) + 2c2^k \\ &= 2^3 T(2^{k - 3}) + 3c2^k \\ &= \dots \\ &= 2^kT(2^{k - k}) + kc2^k \\ &= c2^k + ck2^k \\ &= cn + cn\log(n) \end{aligned}
Therefore T(n) \in \Theta(n\log(n)), for n = 2^k.
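
The unrolled solution can be checked directly against the sloppy recurrence (a sketch with c = 1):

def T(n, c=1):
    # Sloppy recurrence; exact when n is a power of 2.
    return c if n == 1 else 2 * T(n // 2, c) + c * n

# T(2^k) = c*2^k + c*k*2^k = cn + cn*log(n).
assert all(T(2 ** k) == 2 ** k + k * 2 ** k for k in range(12))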

Priority Queue

An ADT consisting of a collection of items (each having a priority), supporting insert and deleteMax.

def PQ_Sort(A):
    n = len(A)
    pq = PriorityQueue()
    for k in range(n):
        pq.insert(A[k])
    for k in reversed(range(n)):
        A[k] = pq.deleteMax()

The runtime will be determined by how efficient our insert() and deleteMax() functions are.

O(n + n \cdot insert + n \cdot deleteMax)

Inserting Into Dynamic Array

Let T(n) be the total cost of insertion from length n to 2n - 1, n = 2^k. Then T(n) is the cost of doubling from n to 2n plus the cost of inserting n items. T(n) \in O(n).
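
A small simulation of the doubling strategy (a sketch that counts element copies only):

def doubling_copies(num_inserts):
    # Total elements copied during all doublings, starting at capacity 1.
    capacity, size, copies = 1, 0, 0
    for _ in range(num_inserts):
        if size == capacity:
            copies += size      # copy the full array into the new one
            capacity *= 2
        size += 1
    return copies

# The total copy cost stays linear: fewer than 2n copies for n inserts.
assert all(doubling_copies(n) < 2 * n for n in range(1, 1000))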

Binary Heaps

A (binary) heap is a certain type of tree.

Any binary tree with n nodes has height at least \log(n + 1) - 1 \in \Omega(\log(n)).

  1. Structural Property: All levels of the heap are completely filled, except for the last level potentially. The entries in the last level are left-justified.
  2. Heap-order Property: For any node i, the key of i's parent is greater than or equal to the key of i.

These are the properties for a max-heap. A min-heap is the same but with the opposite order property.

Lemma: The height of a heap with n nodes is \Theta(\log(n)).

Let h be the height of the heap. Then
2^h \le n < 2^{h + 1}
h \le \log(n) < h + 1
\log(n) - 1 < h \le \log(n)

Storing Heaps in Arrays

Heaps should not be stored as binary trees!

Let H be a heap of n items and let A be an array of size n. We can store the root in A[0] and continue with the elements level by level from top to bottom, and within each level from left to right.

We should hide implementation using helper functions.
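
One possible set of index helpers for this array layout (parent and last are the names used by heapify below):

def parent(i):
    return (i - 1) // 2

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

def last(n):
    # Index of the last node in a heap of size n.
    return n - 1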

Operations in Binary Heaps

Insertion in Heaps
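
A sketch of insertion into a max-heap stored in a Python list A, using the helpers above: append the new key at the end (the only spot that preserves the structural property) and bubble it up.

def fix_up(A, i):
    # Swap A[i] with its parent while it violates the heap-order property.
    while i > 0 and A[i] > A[parent(i)]:
        A[i], A[parent(i)] = A[parent(i)], A[i]
        i = parent(i)

def heap_insert(A, key):
    A.append(key)
    fix_up(A, len(A) - 1)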

Delete in Heaps
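
A matching sketch of deleteMax: swap the root with the last entry, remove it, and bubble the new root down (heapify below calls fix_down with this same signature). Both insert and deleteMax perform O(\log n) swaps since they follow a root-to-leaf path.

def fix_down(A, n, i):
    # Swap A[i] with its larger child while it violates the heap-order property.
    while left(i) < n:
        j = left(i)
        if right(i) < n and A[right(i)] > A[j]:
            j = right(i)
        if A[i] >= A[j]:
            break
        A[i], A[j] = A[j], A[i]
        i = j

def deleteMax(A):
    A[0], A[last(len(A))] = A[last(len(A))], A[0]
    item = A.pop()
    if A:
        fix_down(A, len(A), 0)
    return item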

HeapSort is not used as often as MergeSort or QuickSort because it has poor data locality.

Building Heaps by Bubble-Down

Given an array of items, build a heap out of them.

def heapify(A):
    n = len(A)
    for i in reversed(range(parent(last(n)) + 1)):
        fix_down(A, n, i)

Careful analysis yields a worst-case complexity of \Theta(n).

Whenever you reach node i, you know that the subtrees are both heaps because you have fixed them before.

Let T(n) be the worst-case runtime of heapify for an array of size n. It is related to the worst-case number of swaps S(n), so T(n) \in \Theta(S(n)).

S(1) = 0.
S(2) = 1 because we bubble-down the root.
S(3) = 1 because we bubble-down the root.
S(4) = 1 + 2 = 3.
S(5) = 3.

Suppose we add n new nodes to a heap of n nodes (going from n to 2n). The new nodes form a layer of leaves. In the worst case, every original node now performs one more swap, because its bubble-down path is one longer. The new nodes perform no swaps because they are leaves.

S(2n) = S(n) + n

For n a power of 2, n = 2^k.

\begin{aligned} S(n) = S(2^k) &= S(2^{k - 1}) + 2^{k - 1} \\ &= S(2^{k - 2}) + 2^{k - 2} + 2^{k - 1} \\ &= \sum_{i = 0}^{k - 1} 2^i \\ &= 2^k - 1 \\ &= n - 1 \end{aligned}

Therefore T(n) \in \Theta(n) for n = 2^k.

Selection

The kth-max problem asks to find the kth largest item in an array A of n numbers.

  1. Scan the array and maintain the k largest numbers seen so far in a min-heap (see the sketch after this list). \Theta(n\log(k)).
  2. Make a max-heap by calling heapify(A). Call deleteMax(A) k times. \Theta(n + k\log(n)).
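
A sketch of the first approach using Python's heapq (which provides a binary min-heap):

import heapq

def kth_max(A, k):
    # Maintain a min-heap of the k largest elements seen so far;
    # its root is then the kth largest element overall.
    heap = []
    for x in A:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)   # discard the smallest of the k + 1
    return heap[0]                # Theta(n log k) heap operations in total

assert kth_max([30, 10, 50, 20, 40], 2) == 40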

Sorting and Randomized Algorithms

QuickSelect

Given an array A of n numbers and 0 \le k < n, find the element that would be at position k of the sorted array.

The best heap-based algorithm had a running time of O(n + k\log n). For median finding, this is O(n\log n) which is the same cost as our best sorting algorithms.

The QuickSelect algorithm can do this in linear time.

QuickSelect and the related QuickSort rely on two subroutines.

  1. choose_pivot(A): Choose an index p in A.
  2. partition(A, p): Rearrange A and return pivot index i such that A[i] = v, A[0, ..., i - 1] \le v, v \le A[i + 1, ..., n - 1].

Partitioning in-place can be done by using two pointers i = 0, j = n - 1, and moving them to the middle, swapping while they are on the wrong side.

def partition(A, p):
    n = len(A)
    A[n - 1], A[p] = A[p], A[n - 1]    # move the pivot to the last position
    v = A[n - 1]
    i, j = 0, n - 2
    while True:
        while i < n and A[i] < v:
            i += 1
        while j >= 0 and A[j] > v:
            j -= 1
        if i >= j:
            break
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1

    A[i], A[n - 1] = A[n - 1], A[i]    # move the pivot into its final position
    return i

QuickSelect Algorithm

def quick_select_1(A, k):
    p = choose_pivot(A)
    i = partition(A, p)
    if i == k:
        return A[i]
    elif i > k:
        return quick_select_1(A[0 : i], k)
    else:
        return quick_select_1(A[i + 1 :], k - i - 1)
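
A quick usage check (choose_pivot is left unspecified above; a random pivot, matching the randomized variant discussed later, is assumed here):

import random

def choose_pivot(A):
    return random.randrange(len(A))

A = [30, 60, 10, 0, 50, 20, 40]
assert quick_select_1(A[:], 3) == 30   # sorted(A)[3] == 30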

Analysis of quick_select_1

Worst-Case Analysis: The recursive calls could always have size n - 1. The recurrence is T(n) = \begin{cases} T(n - 1) + cn, &n \ge 2 \\ c, &n = 1 \end{cases} for some constant c > 0. Solution: T(n) = cn + c(n - 1) + ... + 2c + c \in \Theta(n^2).

Best-Case Analysis: The first chosen pivot could be the kth element. Then there are no recursive calls, so the total cost is \Theta(n).

Sorting Permutations

We want to take the average runtime of quick_select_1 over all inputs.

How do we characterize all inputs of size n? We make the simplifying assumption that all input numbers are distinct. quick_select_1 does not depend on the actual values, only on their relative order. Therefore, we can characterize an input by its sorting permutation: the permutation that would put the input in order.

If we assume that all n! permutations are equally likely, the average cost is the sum of the total cost for all permutations, divided by n!.

Define T(n) to be the average-case cost for selecting from a size-n array. Fix one index 0 \le i \le n - 1. There are (n - 1)! permutations for which the pivot value v ends up at index i of the sorted order.

\begin{aligned} T(n, k) &= \frac{1}{n!}\sum_{\text{permutations A of } \{1, ..., n\}} T(A, k) \\ T(n) &= \max_{k} T(n, k) \\ &\le c\cdot n + \frac{1}{n}\sum_{i = 0}^{n - 1}\max\{T(i), T(n - i - 1)\} \\ \end{aligned}

Theorem: T(n) \in \Theta(n).

Sloppy Proof: We want T(n) \le \lambda n. Suppose T(i) \le \lambda i, i < n. \begin{aligned} T(n) &\le cn + \frac{1}{n}\sum_{i = 0}^{n - 1}\max \{\lambda i, \lambda(n - i - 1)\} \\ &\le cn + \frac{\lambda}{n}\sum_{i = 0}^{n - 1} \max \{i, n - i - 1\} \end{aligned} We can see visually (taking max of i and n - i - 1) that \sum \approx \frac{3}{4}n^2. Therefore we have T(n) \le cn + \frac{\lambda}{n}\frac{3}{4}n^2 = (c + \frac{3}{4}\lambda) n. We want T(n) \le \lambda n, let us take \lambda such that c + \frac{3}{4}\lambda = \lambda \Leftrightarrow \lambda = 4c.

Clean Proof: We can redo the previous argument as an induction proof that T(n) \le 4cn. We also have to prove that \sum_{i = 0}^{n - 1} \max\{i, n - i - 1\} \le \frac{3}{4}n^2, which can be done by a case analysis on whether n is even or odd.

Randomized Algorithms

No more bad instances, just unlucky numbers.

Expected Running Time

Define T(I, R) to be the running time of the randomized algorithm on instance I with sequence of random numbers R.

T^{(exp)}(I) = E[T(I, R)] = \sum_{R}T(I, R)P(R).

T^{(exp)}_{worst}(n) = \max_{\{I: size(I) = n\}}T^{(exp)}(I).

T^{(exp)}_{avg}(n) = \frac{1}{|\{I: size(I) = n\}|}\sum_{\{I : size(I) = n\}}T^{(exp)}(I).

Average-Case Analysis of QuickSort

\begin{cases} T(n) = cn + \frac{1}{n}\sum_{i = 0}^{n - 1}(T(i) + T(n - (i + 1))) \\ T(0) = T(1) = 0 \end{cases}

We prove that T(n) \le 2cn\log(n), n \ge 1.

Define S(n) = \sum_{i = 1}^{n - 1}i\log(i). We claim that S(n) \le \frac{1}{2}n^2\log(n) - \frac{1}{4}n^2. We skip the proof because it is too difficult.

Assume that for i = 1, ..., n - 1, T(i) \le 2ci\log(i).

\begin{aligned} T(n) &= cn + \frac{1}{n}\sum_{i = 0}^{n - 1}T(i) + \frac{1}{n}\sum_{i = 0}^{n - 1}T(n - (i + 1)) \\ &= cn + \frac{2}{n}\sum_{i = 0}^{n - 1}T(i) \\ &= cn + \frac{2}{n} \sum_{i = 1}^{n - 1}T(i) \\ &\le cn + \frac{2}{n} \sum_{i = 1}^{n - 1}2ci\log(i) \\ &\le cn + \frac{4c}{n} \sum_{i = 1}^{n - 1}i\log(i) \\ &\le cn + \frac{4c}{n} S(n) \\ &\le cn + \frac{4c}{n} (\frac{1}{2}n^2\log(n) - \frac{1}{4}n^2) \\ &\le cn + (2cn\log(n) - cn) \\ &\le 2cn\log(n) \end{aligned}

The auxiliary space is \Theta(n) in the worst case. We can reduce this to \Theta(\log(n)) by recursing into the smaller sub-array first and replacing the other recursive call with a while loop.

def quicksort_iterative(A, l, r):
    while l < r:
        i = partition(A, l, r)
        if i - l < r - i:
            quicksort_iterative(A, l, i - 1)
            l = i + 1
        else:
            quicksort_iterative(A, i + 1, r)
            r = i - 1

One should stop recursing when n \le 10. Then one run of InsertionSort at the end sorts everything in O(n) because all items are within 10 units of their required position.

Arrays with many duplicates can be dealt with more efficiently by creating a third partition for equals.

Comparison Model

In the comparison model, data can only be accessed in 2 ways.

  1. Comparing two elements.
  2. Moving elements around (e.g. copying, swapping).

We can represent comparison-based algorithms as decision trees.

Theorem: Any correct comparison-based sorting algorithm requires \Omega(n\log n) comparisons in the worst case.

In a decision tree that sorts n integers, there are at least n! leaves.

We claim that a binary tree of height h with L leaves has L \le 2^{h + 1}. Indeed, L is at most the total number of nodes, which is at most the number of nodes in a complete binary tree of height h, namely 2^{h + 1} - 1.

\begin{aligned} n! &\le 2^{h + 1} \\ \log(n!) &\le h + 1 \\ \end{aligned} We know that \log(n!) \in \Theta(n\log n) so h \in \Omega({n \log n}).

Non-Comparison-Based Sorting

Bucket Sort

def bucket_sort(A, d):
    Initialize an array B[0 .. R - 1] of empty lists.
    for i in range(len(A)):
        B[dth digit of A[i]].append(A[i])
    i = 0
    for j in range(R):
        while len(B[j]) > 0:
            A[i] = B[j].pop(0)  # take items in FIFO order to keep the sort stable
            i += 1

This is stable, equal items stay in order. The runtime is O(n + R), auxiliary space \Theta(n + R).

Count Sort

def key_indexed_countsort(A, d):
    count = [0] * R
    for i in range(len(A)):
        count[dth digit of A[i]] += 1
    
    idx = [0] * R
    for i in range(1, R):
        idx[i] = idx[i - 1] + count[i - 1]

    aux = [0] * len(A)
    for i in range(len(A)):
        aux[idx[dth digit of A[i]]] = A[i]
        idx[dth digit of A[i]] += 1

    A[:] = aux

def MSD_radixsort(A, l, r, d):
    if l < r:
        key_indexed_countsort(A[l .. r], d)
        if d < m:
            for i in range(R):
                l_i, r_i = boundaries of ith bin
                MSD_radixsort(A, l_i, r_i, d + 1)

A drawback is that we have many recursive calls.

def LSD_radixsort(A):
    for d in reversed(range(0, m)):
        key_indexed_countsort(A, d)

Time complexity is \Theta(m(n + R)). Auxiliary memory is O(n + R). This can be o(n\log(n)) for special cases.

Dictionaries and Balanced Search Trees

Dictionary ADT

  1. search(k)
  2. insert(k, v)
  3. delete(k)

Common assumptions are that the dictionary has n KVPs, each of which uses constant space. We also assume that keys can be compared in constant time.

Binary Search Trees

search(k): Start at the root, compare k to the current node. Stop if found or node is external, else recurse at child.

insert(k, v): Search for k, then insert (k, v) as new node.

delete(k): Search for node x that contains the key k.

  1. If x is a leaf, then delete it.
  2. If x has one child, then move that child up.
  3. If x has two children, swap the key at x with the key at its successor (or predecessor) node and then delete that node.

Remark: The successor of node x does not have to be a leaf but it cannot have a left child.

Height of a BST

All of the functions above have cost \Theta(h), where h is the height of the tree.

If n items are inserted, the worst case is \Theta(n). The best case is \Theta(\log(n)) because any binary tree with n nodes has height \ge \log(n + 1) - 1. It can be shown that the average case is \Theta(\log(n)).

AVL Trees

Introduced by Adel’son-Vel’skiĭ and Landis in 1962.

AVL Trees are BSTs with an additional height-balance property: at every node, the heights of the left subtree L and right subtree R differ by at most 1. The height of an empty tree is defined to be -1.

At each non-empty node, we require height(R) - height(L) \in \{-1, 0, 1\}.

We need to store the height of the subtree rooted at each node. It can be shown that it suffices to store only height(R) - height(L) at each node, which uses fewer bits but makes the code more complicated (especially for deletion).

Height of an AVL Tree

Theorem: an AVL Tree with n nodes has \Theta(\log(n)) height.

Proof: We prove the upper and lower bounds separately.

  1. The height h of an AVL with n keys is \Omega(\log(n)) because h \ge \log(n + 1) - 1.
  2. The height h of an AVL with n keys is O(\log(n)).

    Define N_h as the minimum number of keys in an AVL of height h. Our plan is to prove N_h \in \Omega(c^h) for some constant c. Then we would have \log(n) \in \Omega(h).

    We see that N_h is related to the Fibonacci numbers: N_h = F_{h + 3} - 1. To build a minimal tree of height h, take a minimal tree of height h - 1 as one child and a minimal tree of height h - 2 as the other, which gives the recurrence N_{h} = N_{h - 1} + N_{h - 2} + 1 (it could also be solved with generating series). We show by induction that N_h = F_{h + 3} - 1.

    \begin{aligned} N_{h + 2} &= N_{h + 1} + N_{h} + 1 \\ &= (F_{h + 4} - 1) + (F_{h + 3} - 1) + 1 \\ &= F_{h + 4} + F_{h + 3} - 1 \\ &= F_{h + 5} - 1\end{aligned}

    We claim F_h \in \Theta(\phi^h), where \phi = \frac{1 + \sqrt 5}{2}. Then N_h \in \Theta(\phi^h) and \log(N_h) \in \Theta(h). For any AVL tree with height h and n nodes, \log(n) \ge \log(N_h), so \log(n) \in \Omega(h). Therefore h \in O(\log(n)).
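
A small check of the recurrence and the Fibonacci identity (a sketch):

def N(h):
    # Minimum number of nodes in an AVL tree of height h.
    if h == 0:
        return 1
    if h == 1:
        return 2
    return N(h - 1) + N(h - 2) + 1

def F(k):
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

assert all(N(h) == F(h + 3) - 1 for h in range(25))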

AVL Insertion

  1. Insert (k, v) into T with usual BST insertion.
  2. We assume that this returns the new leaf z where the key was stored.
  3. Move up the tree from z and update heights.
  4. If the height difference becomes \pm 2 at node z, then z is unbalanced. We must re-structure the tree to rebalance.
def AVL_insert(r, k, v):
    z = BST_insert(r, k, v)
    z.height = 0
    while z is not None:
        set_height_from_children(z)
        if abs(z.left.height - z.right.height) == 2:
            AVL_fix(z)
            break
        else:
            z = z.parent

def set_height_from_children(u):
    u.height = 1 + max(u.left.height, u.right.height)

Fixing a slightly unbalanced AVL tree: AVL_fix is applied at a node z whose balance is \pm 2 but whose two subtrees are themselves AVL trees.

def AVL_fix(z):
    # Find child and grand-child that go deepest.
    if z.right.height > z.left.height:
        y = z.right
        if y.left.height > y.right.height:
            x = y.left
        else:
            x = y.right
    else:
        y = z.left
        if y.right.height > y.left.height:
            x = y.right
        else:
            x = y.left
    # Apply appropriate rotation to restructure at x, y, z
    ...

How do we “fix” an unbalanced AVL? Notice that there are many different BSTs with the same keys. Our goal is to change the structure among the three nodes without changing the order such that the subtree becomes balanced.

Right Rotation

def rotate_right(z):
    y = z.left
    z.left = y.right      # y's right subtree becomes the new left subtree of z
    y.right = z           # z becomes the new right child of y
    set_height_from_children(z)
    set_height_from_children(y)
    return y              # y is the new root of the subtree

Recall that we need to update the parent links as well!

Double Right Rotation

  1. Left rotation at y.
  2. Right rotation at z.

Rule: The median of x, y, z becomes the new root.

AVL Deletion

Other Dictionary Implementations

Arrays are a simple and popular implementation. Can we do something to make the search more effective in practice?

Yes, if the items are not equally likely to be accessed and we know their access-probability distribution. The intuition is that frequently accessed items should be near the front.

Optimal Static Ordering

Is this better than our linear time solution?

Consider L=[x_1, ..., x_n] with P(x_1) = \frac{1}{2}, P(x_2) = \frac{1}{4}, …, P(x_{n-1}) = P(x_n) = \frac{1}{2^{n-1}}. Then E(L) = 1\cdot\frac{1}{2} + 2\cdot\frac{1}{4} + ... + (n-1)\cdot \frac{1}{2^{n-1}} + n\cdot\frac{1}{2^{n-1}}, and \lim_{n \to \infty}E(L) = 2.
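
A numeric check of this expected cost (a sketch):

def expected_cost(n):
    # P(x_i) = 1/2^i for i = 1..n-1, and P(x_n) = 1/2^(n-1).
    probs = [1 / 2 ** i for i in range(1, n)] + [1 / 2 ** (n - 1)]
    return sum(position * p for position, p in enumerate(probs, start=1))

print(expected_cost(30))   # approximately 2; the limit as n grows is exactly 2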

Claim: Over all possible static orderings, the one that sorts items by non-increasing access probability minimizes the expected access cost. This can be proved by contradiction.

Dynamic Ordering

Performance of Dynamic Ordering

Skip Lists

Randomized data structure for dictionary ADT.

Search in Skip Lists

def skip_search(L, k):
    p = topmost left position of L
    P = [p]
    while below(p) is not None:
        p = below(p)
        while key(after(p)) < k:
            p = after(p)
        P.append(p) 
    return P

Insert in Skip Lists

Delete in Skip Lists

Expected Performance

Dictionaries for Special Keys

Theorem: In the comparison model, \Omega(\log(n)) comparisons are required to search a size n dictionary.

Instead of binary search, if we also know the values of the keys (not just their relative order), we can make a better estimate of where to probe next (interpolation search).

If we are searching between indices l, r of array A for key k, we compare at index l + \lfloor \frac{k - A[l]}{A[r] - A[l]}(r-l) \rfloor. This approach works well if the keys are uniformly distributed.
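
A sketch of the probe-index computation on uniformly spaced keys:

def interpolation_probe(A, l, r, k):
    # Index to compare against when searching for k in A[l .. r].
    return l + (k - A[l]) * (r - l) // (A[r] - A[l])

A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(interpolation_probe(A, 0, 9, 70))   # 6, exactly where 70 is stored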

But, worst-case performance is \Theta(n).

Trie (Radix Tree)

Dictionary for binary strings based on bitwise comparisons.

Prefix of a string S[0 .. n - 1] is a substring S[0..i] of S for some 0 \le i < n.

Prefix-Free: there is no pair of binary strings in the dictionary where one is the prefix of the other.

We assume that our dictionaries are prefix-free. This is always satisfied if all strings have the same length or are terminated with a unique character ‘$’.

Insert: We search for x in the trie, and expand it by adding the missing nodes.

Delete: We search for x in the trie and then delete the node, and all dangling nodes created by the deletion.

Time complexity for all operations on string s over alphabet \Sigma are O(|s|).

Trie Variations

  1. No Leaf Labels.

    We do not have to store the keys in the leaves; each key is implicitly stored along the path to its leaf. This halves the amount of storage needed.

  2. Remove Chains to Labels.

    We can stop adding nodes to the trie as soon as the key is unique. This saves space if there are only a few strings that are long.

    Note: This cannot be combined with no leaf labels as we would not be able to search.

  3. Allow Proper Prefixes.

    We can allow internal nodes to be keys by adding a flag at each node indicating whether it represents a key in the dictionary.

Compressed Tries (Patricia)

Idea: Eliminate unflagged nodes to reduce space.

def patricia_search(v, x):
    if v.is_leaf():
        return v.key() == x
    
    d = v.index()
    c = v.child(x[d])
    if c is None:
        return "Not Found!"
    return patricia_search(c, x)

Delete

Insert

Multiway Tries

Hashing

Direct Addressing

Suppose every key k is an integer with 0 \le k < M. We can implement a dictionary easily using an array A of size M that stores (k,v) via A[k] = v; each operation then takes O(1) time, but the space is \Theta(M).

Direct addressing is not possible if keys are not integers, and the storage is very wasteful if n \ll M.

Hashing: Map the keys to a small range of integers and then use direct addressing.

Collisions

Generally, a hash function h is not injective, so many keys can map to the same integer. We get a collision when we want to insert (k,v) into the table but T[h(k)] is already occupied.

Assume we have n random integers in \{0, ..., M - 1\}. P(\text{no collisions}) = \frac{M^{(n)}}{M^n} = (1)\left(1 - \frac{1}{M}\right)\left(1 - \frac{2}{M}\right)\cdots\left(1-\frac{n-1}{M}\right). There are two basic strategies for resolving collisions:

  1. Allow multiple items at each table location (chaining).
  2. Allow each item to go into multiple locations (open addressing).

Load Factor and Re-Hashing

We evaluate strategies by the cost of search, insert, and delete functions. This evaluation is done in terms of the load factor \alpha = \frac{n}{M}.

We keep the load factor small by rehashing when needed.

Separate Chaining

Each table entry is a bucket containing 0 or more KVPs. This could be implemented by any dictionary (or even another hash-table). The simplest approach is to use an unsorted linked list in each bucket. This is called collision resolution by separate chaining.

Complexity of Chaining

Recall the load factor \alpha = \frac{n}{M}. We make the Uniform Hashing Assumption: each key is equally likely to hash to any of the M slots, independently of the other keys.

Proposition: Under uniform hashing, the average length of a bucket is \alpha = \frac{n}{M}.

We define X_{i,j} = \begin{cases}1, &\text{if } h(k_i) = j \\ 0, &\text{otherwise}\end{cases} as the indicator that element i lands in bucket j. The length of bucket j is then \sum_{i = 1}^n X_{i,j}.

\begin{aligned}E(\text{length } B_j) &= E(\sum_{i = 1}^n X_{i,j}) \\ &= \sum_{i = 1}^n E(X_{i,j}) \\ &= \sum_{i=1}^n \frac{1}{M} \\ &= \frac{n}{M} = \alpha\end{aligned}

If we maintain \alpha \in \Theta(1), then the average costs are \Theta(1) and space is \Theta(n).

Open Addressing

Each hash table entry holds only one item, but any key k can go in multiple locations.

search and insert follow a probe sequence of possible locations for key k until an empty spot is found.

delete becomes problematic. We cannot leave an empty spot behind, because the next search might not go far enough.

  1. We can move later items in the probe sequence forward.
  2. Lazy deletion: Mark the spot as deleted rather than empty, and continue searching past deleted spots.

The simplest method for open addressing is linear probing: h(k,i) = (h(k)+i) \mod M, for some hash function h.

def insert(T, (k, v)):
    for j in range(M):
        if T[h(k, j)] is "empty" or "deleted":
            T[h(k, j)] = (k, v)
            return "success"
    return "failure"

def search(T, k):
    for j in range(M):
        if T[h(k, j)] is "empty":
            return "not found"
        elif key(T[h(k, j)]) == k:
            return T[h(k, j)]
    return "not found"

Independent Hash Functions

Double Hashing

In double hashing the probe sequence is h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod M. Suppose M = 16, h_1(k) = 3, h_2(k) = 6. Then the probe sequence cycles after \frac{16}{\gcd(6, 16)} = 8 distinct slots. If M is prime (and h_2(k) \not\equiv 0 \pmod M), the cycle length is guaranteed to be M.
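
A quick check of the cycle length in this example (a sketch):

from math import gcd

M, h1, h2 = 16, 3, 6
probes = [(h1 + i * h2) % M for i in range(M)]
print(probes)                              # 3, 9, 15, 5, 11, 1, 7, 13, then it repeats
print(len(set(probes)), M // gcd(h2, M))   # 8 8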

Cuckoo Hashing

def cuckoo_insert(T, x):
    y = x
    i = h_1(x.key)
    for _ in range(n):
        swap(y, T[i])
        if y is "empty":
            return "success"
        if i == h_1(y.key):
            i = h_2(y.key)
        else:
            i = h_1(y.key)
    return "failure"

Complexity of Open Addressing Strategies

| Strategy | Search | Insert | Delete |
|---|---|---|---|
| Linear Probing | \frac{1}{1 - \alpha} | \frac{1}{(1 - \alpha)^2} | \frac{1}{1 - \alpha} |
| Double Hashing | \frac{1}{1 - \alpha} | \frac{1}{1 - \alpha} | \frac{1}{\alpha}\log\left(\frac{1}{1 - \alpha}\right) |
| Cuckoo Hashing | 1 | \frac{\alpha}{(1 - 2\alpha)^2} | 1 |

Summary: All operations have O(1) average-case run-time if the hash-function is uniform and \alpha is sufficiently small.

Multi-Dimensional Data

What if the keys are multi-dimensional, such as strings in \Sigma^*?

Choosing Good Hash Functions

  1. Modular Method: h(k) = k \mod M, M should be prime.
  2. Multiplicative Method: h(k) = \lfloor M (kA - \lfloor kA \rfloor)\rfloor, for some constant A with 0 < A < 1.

Universal Hashing

We again shift the average-case performance to expected performance via randomization.

h(k) = ((ak + b) \mod p) \mod M, where p is a prime larger than the maximum possible key, and a \in \{1, ..., p - 1\}, b \in \{0, ..., p - 1\} are chosen uniformly at random.

We can prove that for any fixed numbers x \neq y, the probability of a collision using this function h is at most \frac{1}{M}. So, the expected runtime is O(1) if we keep the load factor small enough.
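
A sketch of drawing such a function at random (the prime p = 10^9 + 7 is an arbitrary assumed choice, taken to be larger than every key):

import random

def random_universal_hash(M, p=10 ** 9 + 7):
    # h(k) = ((a*k + b) mod p) mod M with random a in [1, p-1], b in [0, p-1].
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % M

h = random_universal_hash(11)
print(h(42), h(1000000))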

Range Searching

Quadtrees

2D Range Trees

We are given n points in a sorted array with respect to x. We can build a balanced BST for these points in time O(n).

Recursively, we spend O(1) at every node, total O(n).

A range tree is a tree of trees. The primary structure is a BST T that uses the x-coordinates as keys. Each node v \in T stores an associated structure T(v): a BST over the points in v's subtree, keyed by y-coordinate.

Space Analysis

Let S(n) be the space complexity of a range tree over n points. S(n) satisfies S(n) = 2S\left(\frac{n}{2}\right) + O(n), so S(n) \in O(n\log(n)).

Range tree space usage is O(n\log(n)).

Dictionary Operations

We have a problem with balancing. If we use AVL trees, insertions / deletions are very slow (a rotation at v changes P(v) and hence requires rebuilding T(v)). Instead of rotations we can do something similar to kd-trees: allow a certain imbalance, and rebuild the entire subtree when it is violated.

def build(P):
    # O(|P| log |P|) for the two initial sorts
    x_list = sorted(P, key=lambda p: p.x)
    y_list = sorted(P, key=lambda p: p.y)
    return build_recursive(x_list, y_list)

def build_recursive(x_list, y_list):
    if not x_list:
        return None
    n = len(x_list)
    T = RangeTree()
    # O(1): the median point (by x) becomes the key of the root
    T.key = x_list[n // 2]
    # O(|y_list|): associated structure over the same points, keyed by y
    T.y_tree = BST(y_list)
    # O(|x_list|): split around the median, excluding the root's point
    x_left = x_list[0 : n // 2]
    x_right = x_list[n // 2 + 1 :]
    # O(|y_list|): the splits stay sorted by y
    y_left = [p for p in y_list if p.x < T.key.x]
    y_right = [p for p in y_list if p.x > T.key.x]
    # We have |x_left| = |y_left| and |x_right| = |y_right|
    T.left = build_recursive(x_left, y_left)
    T.right = build_recursive(x_right, y_right)
    return T

We see that the runtime of build_recursive is T(n) = 2T\left(\frac{n}{2}\right) + O(n), so T(n) \in O(n\log(n)).

Time for range-query in range tree is O(s + \log^2(n)). This can be further reduced to O(s + \log(n)).

Higher Dimensions

Comparison of Range Query Data Structures

String Matching

Brute Force

Example: T = abbbababbab, P = abba.
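
A sketch of the brute-force approach, checked on this example:

def brute_force_match(T, P):
    n, m = len(T), len(P)
    for k in range(n - m + 1):          # try every possible guess k
        if T[k : k + m] == P:
            return k
    return -1

assert brute_force_match("abbbababbab", "abba") == 6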

Rabin-Karp Algorithm

Boyer-Moore Algorithm

Brute-force search with two changes.

  1. Reverse-order searching: Compare P with a guess moving backwards.
  2. Bad character jumps: When a mismatch occurs, eliminate the guesses for which P does not agree with this character of T.
def BoyerMoore(T, P):
    n, m = len(T), len(P)
    # Last-occurrence map computed from P: L[c] is the largest index i with P[i] == c.
    L = {c: i for i, c in enumerate(P)}
    k = 0  # Current guess.

    while k <= n - m:
        j = m - 1
        while j >= 0:
            if T[k + j] != P[j]:
                break
            j -= 1

        if j == -1:
            return k
        else:
            k = k + max(1, j - L.get(T[k + j], -1))

    return -1  # FAIL

Suffix Tree

Claim: The suffix tree for a string of length n has \Theta(n) nodes.

  1. Suffix tree has n+1 leaves.
  2. Suffix tree has at most n internal nodes.

    Key Property: Every internal node in a suffix tree has at least 2 children.

    We prove by induction that a tree with m leaves satisfying this property has at most m - 1 internal nodes.

    Consider a tree with m + 1 leaves whose root has s \ge 2 children; each of the s subtrees satisfies the induction hypothesis. If subtree t has m_t leaves, it has at most m_t - 1 internal nodes, so summing over the subtrees and adding the root gives at most 1 + \sum_t (m_t - 1) = 1 + (m + 1) - s \le m internal nodes.

Compression

Judging Encoding Schemes

Types of Data Compression

Decoding

The code must be uniquely decodable.

Huffman Encoding

Huffman’s Algorithm

  1. Determine frequency of every c \in \Sigma_S.
  2. For each c \in \Sigma_S, create a height-0 trie holding c.
  3. Assign a “weight” to each trie which is the sum of all frequencies of all letters in the trie.
  4. Find two tries with the minimum weight.
  5. Merge these tries with a new interior node.
  6. Repeat steps 4, 5 until there is only 1 trie left. This is D.

We should store the tries in a min-ordered heap, keyed by weight, to make this efficient.
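
A sketch of the construction using Python's heapq as the heap of tries (here a trie is represented as either a single character or a pair of sub-tries, and the counter only serves as a tie-breaker so heap entries stay comparable):

import heapq
from collections import Counter

def huffman_trie(S):
    freq = Counter(S)
    # Heap entries: (weight, tie_breaker, trie).
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two tries of minimum weight
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    return heap[0][2]

print(huffman_trie("GREENENERGY"))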

Run-Length Encoding

Lempel-Ziv-Welch

Observation: Certain substrings are much more frequent than others.

We want to take advantage of such substrings without needing to know beforehand what they are.

Compression Summary

| Huffman | Run-Length Encoding | Lempel-Ziv-Welch |
|---|---|---|
| Variable-length | Variable-length | Fixed-length |
| Single-character | Multi-character | Multi-character |
| 2-pass | 1-pass | 1-pass |
| 60% compression on English text | Bad on text | 45% compression on English text |
| Optimal 01-prefix-code | Good on long runs (e.g. pictures) | Good on English text |
| Must send dictionary | Can be worse than ASCII | Can be worse than ASCII |
| Rarely used directly | Rarely used directly | Frequently used |
| JPEG, MP3 | Fax machines, old picture-formats | GIF, Unix compress |

bzip2

  1. Burrows-Wheeler Transform.

  2. MTF Transform.

  3. Modified RLE.

  4. Huffman Encoding.

MTF Transform

This takes advantage of locality in the data. If we see a character now, we might see it again soon.

Specifics: MTF is an adaptive text-transform algorithm.

If the source alphabet is \Sigma_S with size m, then the coded alphabet will be \Sigma_C = \{0, ..., m - 1\}.
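
A sketch of the MTF encoder over a fixed source alphabet:

def mtf_encode(S, alphabet):
    # Emit the current index of each character, then move it to the front.
    working = list(alphabet)
    out = []
    for c in S:
        i = working.index(c)
        out.append(i)
        working.pop(i)
        working.insert(0, c)
    return out

print(mtf_encode("banana", "abcdefghijklmnopqrstuvwxyz"))   # [1, 1, 13, 1, 1, 1]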

Burrows-Wheeler Transform

Crucial Ingredient: A cyclic shift of a string X of length n is the concatenation of X[i + 1..n - 1] and X[0..i], for 0 \le i < n.

  1. Write all cyclic shifts.
  2. Sort cyclic shifts.
  3. Extract last characters from sorted shifts.
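
A sketch of the encoder following these three steps (the input is assumed to end with the sentinel '$'):

def bwt_encode(S):
    n = len(S)
    shifts = [S[i:] + S[:i] for i in range(n)]     # 1. all cyclic shifts
    shifts.sort()                                  # 2. sort them
    return "".join(row[-1] for row in shifts)      # 3. last characters

print(bwt_encode("banana$"))   # annb$aa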

BWT Decoding

Given C, we can reconstruct the first and last column of the array of cyclic shifts by sorting.

  1. Last Column: C.
  2. First Column: C sorted.
  3. Disambiguate by row-index.
  4. Starting from \$, recover S.

Encoding Cost: O(n^2) using MSD radix sort, often better.

Decoding Cost: O(n).

Encoding and decoding both use O(n) space.