Problem: Given a problem instance, carry out a particular computational task.
Problem Instance: Input.
Problem Solution: Output.
Size of a problem instance: Size(I) is a positive integer that measures the size of instance I.
Algorithm: Step-by-step process for carrying out a series of computations, given an arbitrary problem instance I.
For a problem \Pi, we can have several algorithms. For an algorithm A solving \Pi, we can have several programs (implementations).
We may be interested in the amount of time or memory used. Experimental studies have shortcomings: results depend on the implementation and hardware, and we cannot test against all inputs.
Instead we analyze (pseudo)code on an idealized Random Access Machine (RAM) model.
The efficiency is measured in terms of its growth rate (this is called the complexity of the algorithm).
About n^3 operations.
O-Notation: f(n) \in O(g(n)) if there exist constants c > 0, n_0 > 0 such that |f(n)| \le c|g(n)| \; \forall n \ge n_0.
Example: f(n) = 75n + 500 and g(n) = 5n^2.
f(n), g(n) \ge 0 \; \forall n \ge 1. For n \ge 20:
\begin{aligned} n^2 &\ge 20n \\ 5n^2 &\ge 100n = 75n + 25n \\ 25n &\ge 25 \cdot 20 = 500 \\ 5n^2 &\ge 75n + 500 \\ g(n) &\ge f(n) \end{aligned}
Taking n_0 = 20, c = 1, this proves that f(n) \in O(g(n)).
We want a tight asymptotic bound.
\Omega-notation: f(n) \in \Omega(g(n)) if \exists c > 0, n_0 > 0 such that c|g(n)| \le |f(n)| \; \forall n \ge n_0.
\Theta-notation: f(n) \in \Theta(g(n)) if \exists c_1, c_2 > 0, n_0 > 0 such that c_1|g(n)| \le |f(n)| \le c_2|g(n)| \; \forall n \ge n_0.
How to express f(n) is asymptotically strictly smaller than g(n)?
o-notation: f(n) \in o(g(n)) if \forall c > 0, \exists n_0 > 0 such that |f(n)| < c|g(n)| \; \forall n \ge n_0.
\omega-notation: f(n) \in \omega(g(n)) if \forall c > 0, \exists n_0 > 0 such that 0 \le c|g(n)| < |f(n)| \; \forall n \ge n_0.
Example: f(n) = 2000n^2 + 5000n and g(n) = n^3. Prove f(n) \in o(g(n)).
Given c > 0: for n \ge 1, f(n) = 2000n^2 + 5000n \le 7000n^2.
7000n^2 < cn^3 \Leftrightarrow 7000 < cn \Leftrightarrow n > \frac{7000}{c}
So if we take n_0 = \frac{7000}{c} + 1, we have f(n) < c \cdot g(n) \; \forall n \ge n_0. This proves that f(n) \in o(g(n)).
f(n) \in \Theta(g(n)) \Leftrightarrow g(n) \in \Theta(f(n))
f(n) \in O(g(n)) \Leftrightarrow g(n) \in \Omega(f(n))
f(n) \in o(g(n)) \Rightarrow f(n) \in O(g(n)) \text{ and } f(n) \notin \Omega(g(n))
f(n) \in \omega(g(n)) \Rightarrow f(n) \in \Omega(g(n)) \text{ and } f(n) \notin O(g(n))
Identity: f(n) \in \Theta(f(n))
Maximum: f(n) > 0, g(n) > 0 \forall n \ge n_0 \Rightarrow O(f(n) + g(n)) = O(\max\{f(n), g(n)\}). Similar for \Omega.
Transitivity: f(n) \in O(g(n)), g(n) \in O(h(n)) \Rightarrow f(n) \in O(h(n)). Similar for \Omega.
Suppose L = \lim_{n \to \infty} \frac{f(n)}{g(n)}.
f(n) \in \begin{cases} o(g(n)), &L = 0 \\ \Theta(g(n)), &0 < L < \infty \\ \omega(g(n)), &L = \infty \end{cases}
Example: f(n) = \log(n) = \frac{\ln(n)}{\ln(2)}, g(n) = n.
\lim_{n \to \infty} \frac{\log(n)}{n} = \lim_{n \to \infty} \frac{1}{n\ln(2)} = 0 (by L'Hôpital's rule), so f(n) \in o(g(n)) by the limit test.
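As a quick numerical sanity check (not a proof), we can tabulate the ratio f(n)/g(n) for growing n; a minimal Python sketch for f(n) = \log(n), g(n) = n:
import math

# Ratio log2(n)/n for growing n; it tends to 0, consistent with log(n) in o(n).
for n in [10, 10**3, 10**6, 10**9]:
    print(n, math.log2(n) / n)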
Example:
def Test1(n):
    sum = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum += (i - j) * (i - j)
    return sum
Let T_1(n) be the runtime of Test1(n).
T_1(n) \in \Theta(S_1(n)), where S_1(n) is the number of times we execute the body of the inner loop (the line sum += (i - j) * (i - j)). S_1(n) = \sum_{i = 1}^n \sum_{j = i}^n 1.
1st solution: Brute Force.
S_1(n) = \sum_{i = 1}^n (n - i + 1) = \sum_{i = 1}^n n - \sum_{i = 1}^n i + \sum_{i = 1}^n 1 = n^2 - \frac{n(n+1)}{2} + n = \frac{n^2}{2} + \frac{n}{2}.
Therefore we have S_1(n) \in \Theta(n^2).
2nd solution: Separate O and \Omega.
S_1(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n 1 = n^2 so S_1(n) \in O(n^2).
S_1(n) \geq \sum_{i = 1}^{n / 2} \sum_{j = i}^n 1 \geq \sum_{i = 1}^{n / 2} \sum_{j = n / 2}^n 1 \geq \frac{n^2}{4}, so S_1(n) \in \Omega(n^2).
Therefore S_1(n) \in \Theta(n^2).
Example:
def Test2(A, n):
    # A is indexed 1 .. n (A[0] unused).
    max = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sum = 0
            for k in range(i, j + 1):
                sum += A[k]
            if sum > max:
                max = sum
    return max
Let T_2(n) be the runtime of Test2(A, n).
Then T_2(n) \in \Theta(S_2(n)), where S_2(n) is the number of times we execute the body of the innermost loop (the line sum += A[k]). S_2(n) = \sum_{i = 1}^n \sum_{j = i}^n \sum_{k = i}^j 1.
1st Solution: Brute Force.
Feeding the sum into Wolfram Alpha gives S_2(n) = \frac{n^3}{6} + (\text{lower-order terms}), so S_2(n) \in \Theta(n^3).
2nd Solution: Separate O and \Omega.
S_2(n) \leq \sum_{i = 1}^n \sum_{j = 1}^n \sum_{k = 1}^n 1 = n^3 so S_2(n) \in O(n^3).
S_2(n) \geq \sum_{i = 1}^{n / 3} \sum_{j = i}^n \sum_{k = i}^j 1 \geq \sum_{i = 1}^{n / 3} \sum_{j = 2n / 3}^n \sum_{k = n / 3}^{2n / 3} 1 \geq \left(\frac{n}{3}\right)^3 \in \Omega(n^3)
Therefore S_2(n) \in \Theta(n^3).
Example:
# Insertion Sort
def Test3(A, n):
    for i in range(1, n):
        j = i
        while j > 0 and A[j] < A[j - 1]:
            A[j], A[j - 1] = A[j - 1], A[j]
            j -= 1
Let T_A(I) denote the running time of an algorithm A on instance I.
Worst-case: T_A(n) = \max \{T_A(I): Size(I) = n\}.
Average-case: T_A^{avg}(n) = \frac{1}{|\{I: Size(I) = n\}|} \sum_{\{I: Size(I) = n\}} T_A(I).
It is important not to compare algorithms using O-notation alone: O only gives an upper bound, so it cannot by itself show that one algorithm is faster than another.
Input: Array A of n integers.
def MergeSort(A, l=0, r=None):
    if r is None:
        r = len(A) - 1
    if r <= l:
        return
    else:
        m = (l + r) // 2
        MergeSort(A, l, m)
        MergeSort(A, m + 1, r)
        Merge(A, l, m, r)
def Merge(A, l, m, r):
    # We should only copy A[l .. r] to S; copying all of A is simpler but wasteful.
    S = A.copy()
    i_L, i_R = l, m + 1
    for k in range(l, r + 1):
        if i_L > m:
            A[k] = S[i_R]
            i_R += 1
        elif i_R > r:
            A[k] = S[i_L]
            i_L += 1
        elif S[i_L] <= S[i_R]:
            A[k] = S[i_L]
            i_L += 1
        else:
            A[k] = S[i_R]
            i_R += 1
Merge takes time \Theta(r - l + 1) = \Theta(n) time for merging n elements.
Let T(n) denote the time to run MergeSort on an array of length n.
T(n) = \begin{cases} 2T(\frac{n}{2}) + cn, &\text{if } n > 1 \\ c, &\text{if } n = 1 \end{cases}
Exact and sloppy recurrences are identical when n is a power of 2.
The recurrence can be easily solved by various methods when n = 2^k.
T(n) = 2T(\frac{n}{2}) + cn.
For n a power of 2, n = 2^k.
\begin{aligned}
T(2^k) &= 2T(2^{k - 1}) + c2^k \\
&= 2(2T(2^{k - 2}) + c2^{k - 1}) + c2^k = 2^2T(2^{k - 2}) + 2c2^k \\
&= 2^3 T(2^{k - 3}) + 3c2^k = \cdots \\
&= 2^kT(2^{k - k}) + kc2^k \\
&= c2^k + ck2^k \\
&= cn + cn\log(n)
\end{aligned}
Therefore T(n) \in \Theta(n\log(n)), for n = 2^k.
Priority Queue: An ADT consisting of a collection of items (each having a priority).
def PQ_Sort(A):
    n = len(A)
    pq = PriorityQueue()
    for k in range(n):
        pq.insert(A[k])
    for k in reversed(range(n)):
        A[k] = pq.deleteMax()
The runtime will be determined by how efficient our insert() and deleteMax() functions are.
O(n + n \cdot insert + n \cdot deleteMax)
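As an illustration, here is a minimal sketch of PQ_Sort using Python's heapq module as the priority queue (my choice, not the course's implementation); heapq is a min-heap, so keys are negated to get insert/deleteMax behaviour.
import heapq

def pq_sort(A):
    pq = []
    for x in A:
        heapq.heappush(pq, -x)        # insert (negated key simulates a max-PQ)
    for k in reversed(range(len(A))):
        A[k] = -heapq.heappop(pq)     # deleteMax fills the array back to front
    return A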
Let T(n) be the total cost of insertion from length n to 2n - 1, n = 2^k. Then T(n) is the cost of doubling from n to 2n plus the cost of inserting n items. T(n) \in O(n).
A (binary) heap is a certain type of tree.
Any binary tree with n nodes has height at least \log(n + 1) - 1 \in \Omega(\log(n)).
These are the properties for a max-heap. A min-heap is the same but with opposite order property.
Lemma: The height of a heap with n nodes is \Theta(\log(n)).
Let h be the height of the heap. Then
2^h \le n \le 2^{h + 1} - 1,
so h \le \log(n) < h + 1,
i.e. \log(n) - 1 < h \le \log(n).
Heaps should not be stored as linked binary trees (with explicit node objects and pointers)!
Let H be a heap of n items and let A be an array of size n. We can store the root in A[0] and continue with elements level-by-level from top to bottom. In each level left-to-right.
We should hide implementation using helper functions.
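For example, with the 0-based, level-by-level layout above, the navigation helpers are just index arithmetic; a minimal sketch (helper names are mine):
def parent(i):
    return (i - 1) // 2

def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2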
HeapSort is not used as often as MergeSort or QuickSort because it has poor data locality.
Given an array, build a heap out of them.
Careful analysis yields a worst-case complexity of \Theta(n).
Whenever you reach node i, you know that the subtrees are both heaps because you have fixed them before.
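A minimal sketch of this bottom-up construction for a max-heap stored in an array, assuming a bubble_down helper as described above (names are mine):
def bubble_down(A, i, n):
    # Sift A[i] down until the max-heap order property holds in its subtree.
    while 2 * i + 1 < n:
        c = 2 * i + 1                       # left child
        if c + 1 < n and A[c + 1] > A[c]:   # pick the larger child
            c += 1
        if A[i] >= A[c]:
            break
        A[i], A[c] = A[c], A[i]
        i = c

def heapify(A):
    n = len(A)
    for i in range(n // 2 - 1, -1, -1):     # last internal node down to the root
        bubble_down(A, i, n)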
Let T(n) be the worst-case runtime of heapify for an array of size n. It is related to the worst-case number of swaps S(n): T(n) \in \Theta(S(n)).
S(1) = 0.
S(2) = 1 because we bubble-down the root.
S(3) = 1 because we bubble-down the root.
S(4) = 1 + 2 = 3.
S(5) = 3.
If we add n nodes to a heap of size n (n \to 2n), the new nodes form a layer of leaves. In the worst case, every original node now performs 1 more swap, because the length of its bubble-down path has increased by 1. The new nodes do not perform any swaps because they are all leaves.
S(2n) = S(n) + n
For n a power of 2, n = 2^k.
\begin{aligned} S(n) = S(2^k) &= S(2^{k - 1}) + 2^{k - 1} \\ &= S(2^{k - 2}) + 2^{k - 2} + 2^{k - 1} \\ &= \sum_{i = 0}^{k - 1} 2^i \\ &= 2^k - 1 \\ &= n - 1 \end{aligned}
Therefore T(n) \in \Theta(n) for n = 2^k.
The kth-max problem asks to find the kth largest item in an array A of n numbers.
Given an array A of n numbers and 0 \le k < n, find the element that would be at position k of the sorted array.
The best heap-based algorithm had a running time of O(n + k\log n). For median finding, this is O(n\log n) which is the same cost as our best sorting algorithms.
The QuickSelect algorithm can do this in linear time.
QuickSelect and the related QuickSort rely on two subroutines.
Partitioning in-place can be done using two pointers i = 0, j = n - 1, moving them toward the middle and swapping A[i] with A[j] whenever both are on the wrong side of the pivot.
def partition(A, p):
    n = len(A)
    A[n - 1], A[p] = A[p], A[n - 1]    # move the pivot to the end
    i, j, v = 0, n - 2, A[n - 1]
    while True:
        while i < n and A[i] < v:
            i += 1
        while j >= 0 and A[j] > v:
            j -= 1
        if i >= j:
            break
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1
    A[i], A[n - 1] = A[n - 1], A[i]    # place the pivot at its final position
    return i
def quick_select_1(A, k):
    n = len(A)
    p = choose_pivot(A)
    i = partition(A, p)
    if i == k:
        return A[i]
    elif i > k:
        return quick_select_1(A[0 : i], k)
    else:
        return quick_select_1(A[i + 1 : n], k - i - 1)
Worst-Case Analysis: The recursive calls could always have size n - 1. The recurrence is T(n) = \begin{cases} T(n - 1) + cn, &n \ge 2 \\ c, &n = 1 \end{cases} for some constant c > 0. Solution: T(n) = cn + c(n - 1) + ... + 2c + c \in \Theta(n^2).
Best-Case Analysis: The first chosen pivot could be the kth element. Then there are no recursive calls, so the total cost is \Theta(n).
We want to take the average runtime of quick_select_1 over all inputs.
How do we characterize all inputs of size n? We make the simplifying assumption that all input numbers are distinct. quick_select_1 does not depend on the actual values, only on their relative order. Therefore we can characterize an input by its sorting permutation: the permutation that would put the input in order.
If we assume that all n! permutations are equally likely, the average cost is the sum of the total cost for all permutations, divided by n!.
Define T(n) to be the average case for selecting from a size n array. Fix one 0 \le i \le n - 1. There are (n - 1)! permutations for which the pivot value v is the ith smallest item.
\begin{aligned} T(n, k) &= \frac{1}{n!}\sum_{\text{permutations } A \text{ of } \{1, ..., n\}} T(A, k) \\ T(n) &= \max_{k} T(n, k) \\ &\le c\cdot n + \frac{1}{n}\sum_{i = 0}^{n - 1}\max\{T(i), T(n - i - 1)\} \end{aligned}
Theorem: T(n) \in \Theta(n).
Sloppy Proof: We want T(n) \le \lambda n. Suppose T(i) \le \lambda i for all i < n. \begin{aligned} T(n) &\le cn + \frac{1}{n}\sum_{i = 0}^{n - 1}\max \{\lambda i, \lambda(n - i - 1)\} \\ &\le cn + \frac{\lambda}{n}\sum_{i = 0}^{n - 1} \max \{i, n - i - 1\} \end{aligned} We can see visually (by plotting \max\{i, n - i - 1\}) that the sum is approximately \frac{3}{4}n^2. Therefore T(n) \le cn + \frac{\lambda}{n}\cdot\frac{3}{4}n^2 = (c + \frac{3}{4}\lambda) n. Since we want T(n) \le \lambda n, take \lambda such that c + \frac{3}{4}\lambda = \lambda, i.e. \lambda = 4c.
Clean Proof: We can redo the argument as an induction proving T(n) \le 4cn. We also have to prove that \sum_{i = 0}^{n - 1} \max\{i, n - i - 1\} \le \frac{3}{4}n^2, which can be done by a case analysis on whether n is even or odd.
No more bad instances, just unlucky numbers.
Define T(I, R) to be the running time of the randomized algorithm on instance I with sequence of random numbers R.
T^{(exp)}(I) = E[T(I, R)] = \sum_{R}T(I, R)P(R).
T^{(exp)}_{worst}(n) = \max_{\{I: size(I) = n\}}T^{(exp)}(I).
T^{(exp)}_{avg}(n) = \frac{1}{|\{I: size(I) = n\}|}\sum_{\{I : size(I) = n\}}T^{(exp)}(I).
\begin{cases} T(n) = cn + \frac{1}{n}\sum_{i = 0}^{n - 1}(T(i) + T(n - (i + 1))) \\ T(0) = T(1) = 0 \end{cases}
We prove that T(n) \le 2cn\log(n), n \ge 1.
Define S(n) = \sum_{i = 1}^{n - 1}i\log(i). We claim that S(n) \le \frac{1}{2}n^2\log(n) - \frac{1}{4}n^2; we omit the proof (one way is to compare the sum with the integral \int_1^n x\log(x)\,dx).
Assume that for i = 1, ..., n - 1, T(i) \le 2ci\log(i).
\begin{aligned} T(n) &= cn + \frac{1}{n}\sum_{i = 0}^{n - 1}T(i) + \frac{1}{n}\sum_{i = 0}^{n - 1}T(n - (i + 1)) \\ &= cn + \frac{2}{n}\sum_{i = 0}^{n - 1}T(i) \\ &= cn + \frac{2}{n} \sum_{i = 1}^{n - 1}T(i) \\ &\le cn + \frac{2}{n} \sum_{i = 1}^{n - 1}2ci\log(i) \\ &\le cn + \frac{4c}{n} \sum_{i = 1}^{n - 1}i\log(i) \\ &\le cn + \frac{4c}{n} S(n) \\ &\le cn + \frac{4c}{n} (\frac{1}{2}n^2\log(n) - \frac{1}{4}n^2) \\ &\le cn + (2cn\log(n) - cn) \\ &\le 2cn\log(n) \end{aligned}
The auxiliary space is \Theta(n) in the worst case (due to the recursion depth). We can reduce this to \Theta(\log(n)) by recursing on the smaller sub-array first and replacing the other recursive call with a while loop.
def quicksort_iterative(A, l, r):
    while l < r:
        i = partition(A, l, r)   # partition A[l .. r] and return the pivot's final index
        if i - l < r - i:
            quicksort_iterative(A, l, i - 1)
            l = i + 1
        else:
            quicksort_iterative(A, i + 1, r)
            r = i - 1
One should stop recursing when n \le 10. Then one run of InsertionSort at the end sorts everything in O(n) time, because every item is within 10 positions of its final position.
Arrays with many duplicates can be dealt with more efficiently by creating a third partition for equals.
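A minimal sketch of such a three-way partition (Dutch-national-flag style; the exact scheme used in class may differ):
def partition3(A, lo, hi):
    # Rearrange A[lo .. hi] into  < v | == v | > v  around the pivot v = A[lo].
    v = A[lo]
    lt, i, gt = lo, lo + 1, hi
    while i <= gt:
        if A[i] < v:
            A[lt], A[i] = A[i], A[lt]
            lt += 1
            i += 1
        elif A[i] > v:
            A[i], A[gt] = A[gt], A[i]
            gt -= 1
        else:
            i += 1
    return lt, gt        # A[lt .. gt] holds the items equal to the pivot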
In the comparison model, data can only be accessed in 2 ways: comparing two keys, and moving items around (no arithmetic on key values).
We can represent comparison models as decision trees.
Theorem: Any correct comparison-based sorting algorithm requires at least \Omega(n\log n) comparisons.
In a decision tree that sorts n distinct items, there are at least n! leaves, since each of the n! possible orderings must lead to a different leaf.
We claim that in a binary tree of height h with L leaves, L \le 2^{h + 1}. Indeed, L is at most the total number of nodes, which is at most the number of nodes in a complete binary tree of height h, namely 2^{h + 1} - 1.
\begin{aligned} n! &\le 2^{h + 1} \\ \log(n!) &\le h + 1 \\ \end{aligned} We know that \log(n!) \in \Theta(n\log n) so h \in \Omega({n \log n}).
def bucket_sort(A, d):
    # R is the radix (number of possible digit values).
    B = [[] for _ in range(R)]      # array B[0 .. R - 1] of empty lists
    for i in range(len(A)):
        B[dth digit of A[i]].append(A[i])
    i = 0
    for j in range(R):
        while len(B[j]) > 0:
            A[i] = B[j].pop(0)      # remove from the front to keep the sort stable
            i += 1
This is stable, equal items stay in order. The runtime is O(n + R), auxiliary space \Theta(n + R).
def key_indexed_countsort(A, d):
    # Count how many keys have each possible dth digit.
    count = [0] * R
    for i in range(len(A)):
        count[dth digit of A[i]] += 1
    # idx[j]: position in the output where the next key with digit j goes.
    idx = [0] * R
    for j in range(1, R):
        idx[j] = idx[j - 1] + count[j - 1]
    aux = [0] * len(A)
    for i in range(len(A)):
        aux[idx[dth digit of A[i]]] = A[i]
        idx[dth digit of A[i]] += 1
    A[:] = aux
def MSD_radixsort(A, l, r, d):
    # Sort A[l .. r] by digits d, d+1, ..., m (most significant first).
    if l < r:
        key_indexed_countsort(A[l .. r], d)
        if d < m:
            for i in range(R):
                l_i, r_i = boundaries of the ith bin
                MSD_radixsort(A, l_i, r_i, d + 1)
Drawback is that we have many recursive calls.
Time complexity is \Theta(m(n + R)). Auxiliary memory is O(n + R). This can be o(n\log(n)) for special cases.
Common assumptions are that the dictionary has n KVPs, each of which uses constant space, and that keys can be compared in constant time.
search(k): Start at the root, compare k to the current node. Stop if found or node is external, else recurse at child.
insert(k, v): Search for k, then insert (k, v) as new node.
delete(k): Search for node x that contains the key k.
Remark: The successor of node x does not have to be a leaf but it cannot have a left child.
All of the functions above have cost \Theta(h), where h is the height of the tree.
If n items are inserted, the worst case is \Theta(n). The best case is \Theta(\log(n)) because any binary tree with n nodes has height \ge \log(n + 1) - 1. It can be shown that the average case is \Theta(\log(n)).
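For reference, a minimal sketch of the search routine described above, assuming nodes with key, value, left and right attributes (the node class is mine):
class BSTNode:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def bst_search(root, k):
    node = root
    while node is not None:                 # stop at an external (empty) position
        if k == node.key:
            return node.value
        node = node.left if k < node.key else node.right
    return None                             # not found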
Introduced by Adelson-Velskii and Landis in 1962.
AVL Trees are BST with an additional height-balance property that the height of the left subtree L and right subtree R differ by at most 1. The height of an empty tree is defined to be -1.
At each non-empty node, we require height(R) - height(L) \in \{-1, 0, 1\}.
- -1, left-heavy
- 0, balanced
- 1, right-heavy
We need to store the height of the subtree rooted at each node. It can be shown that it suffices to store only the balance height(R) - height(L) at each node, which uses fewer bits but leads to more complicated code (especially for deletion).
Theorem: an AVL Tree with n nodes has \Theta(\log(n)) height.
Proof: We prove the upper and lower bounds separately.
The height h of an AVL with n keys is O(\log(n)).
Define N_h as the minimum number of keys in an AVL of height h. Our plan is to prove N_h \in \Omega(c^h) for some constant c. Then we would have \log(n) \in \Omega(h).
We see that N_h is related to the Fibonacci numbers: N_h = F_{h + 3} - 1. This is because a minimum-size AVL tree of height h can be built by taking a minimum tree of height h - 1 as the left subtree and a minimum tree of height h - 2 as the right subtree, which gives N_h = N_{h - 1} + N_{h - 2} + 1. This recurrence can be solved with generating series; here we show the inductive proof that N_h = F_{h + 3} - 1.
\begin{aligned} N_{h + 2} &= N_{h + 1} + N_{h} + 1 \\ &= (F_{h + 4} - 1) + (F_{h + 3} - 1) + 1 \\ &= F_{h + 4} + F_{h + 3} - 1 \\ &= F_{h + 5} - 1\end{aligned}
We claim F_h \in \Theta(\phi^h), where \phi = \frac{1 + \sqrt 5}{2}. Then N_h \in \Theta(\phi^h) and \log(N_h) \in \Theta(h). For any AVL tree with height h and n nodes, \log(n) \ge \log(N_h), so \log(n) \in \Omega(h). Therefore h \in O(\log(n)).
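A small sketch checking N_h = F_{h+3} - 1 numerically for small h, using the recurrence with base cases N_0 = 1 and N_1 = 2:
def check(h_max=15):
    N = [1, 2]                 # minimum number of nodes of an AVL tree of height h
    F = [0, 1]                 # Fibonacci numbers F_0, F_1
    while len(F) < h_max + 4:
        F.append(F[-1] + F[-2])
    for h in range(2, h_max + 1):
        N.append(N[h - 1] + N[h - 2] + 1)
    return all(N[h] == F[h + 3] - 1 for h in range(h_max + 1))

print(check())                 # True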
def AVL_insert(r, k, v):
    z = BST_insert(r, k, v)
    z.height = 0
    while z is not None:
        set_height_from_children(z)
        if abs(z.left.height - z.right.height) == 2:
            AVL_fix(z)
            break
        else:
            z = z.parent
def set_height_from_children(u):
    u.height = 1 + max(u.left.height, u.right.height)
Fixing a slightly unbalanced AVL tree. This is applied at a node z that has balance \pm 2, but the subtrees at z are AVL.
def AVL_fix(z):
    # Find child and grand-child that go deepest.
    if z.right.height > z.left.height:
        y = z.right
        if y.left.height > y.right.height:
            x = y.left
        else:
            x = y.right
    else:
        y = z.left
        if y.right.height > y.left.height:
            x = y.right
        else:
            x = y.left
    # Apply appropriate rotation to restructure at x, y, z
    ...
How do we “fix” an unbalanced AVL? Notice that there are many different BSTs with the same keys. Our goal is to change the structure among the three nodes without changing the order such that the subtree becomes balanced.
def rotate_right(z):
    y = z.left
    make y.right the new left child of z
    make y the new root of the subtree
    make z the new right child of y
    set_height_from_children(z)
    set_height_from_children(y)
Recall that we need to update the parent links as well!
Rule: The median of x, y, z becomes the new root.
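A minimal sketch of how the four cases map to rotations, assuming rotate_left is the mirror image of rotate_right above (the case handling is mine):
def apply_restructure(z, y, x):
    # z is unbalanced, y is its deeper child, x is y's deeper child.
    if y is z.left and x is y.left:        # single right rotation
        rotate_right(z)
    elif y is z.left and x is y.right:     # double rotation: left then right
        rotate_left(y)
        rotate_right(z)
    elif y is z.right and x is y.right:    # single left rotation
        rotate_left(z)
    else:                                  # double rotation: right then left
        rotate_right(y)
        rotate_left(z)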
Arrays are a simple and popular implementation. Can we do something to make the search more effective in practice?
Yes, if the items are not equally likely to be accessed and we have a probability distribution of the items. - Intuition is that frequently accessed items should be in the front.
Is this better than our linear time solution?
Consider L=[x_1, ..., x_n] with P(x_1) = \frac{1}{2}, P(x_2) = \frac{1}{4}, …, P(x_{n-1}) = P(x_n) = \frac{1}{2^{n-1}}. Then E(L) = 1\cdot\frac{1}{2} + 2\cdot\frac{1}{4} + ... + (n-1)\cdot \frac{1}{2^{n-1}} + n\cdot\frac{1}{2^{n-1}} and \lim_{n \to \infty}E(L) = 2.
Claim: Over all possible static orderings, the one that sorts items by non-increasing access probability minimizes the expected access cost. This can be proved by contradiction (an exchange argument).
Randomized data structure for dictionary ADT.
def skip_search(L, k):
    p = topmost-left sentinel of L
    P = [p]                      # stack of predecessor positions, one per level
    while below(p) is not None:
        p = below(p)
        while key(after(p)) < k:
            p = after(p)
        P.append(p)
    return P
In an ideal skip list with n keys there are about 2n nodes in total and the height is \log(n). A search costs \Theta(\log(n)).
O(\log(n)) height. A skip list with n items has height at most 3\log(n) with probability at least 1 - \frac{1}{n^2}.
Let X_k be the height of tower k.
P(h \ge i) \le \sum_{k}P(X_k \ge i) \le \frac{n}{2^i} It follows that P(h < i) \ge 1 - \frac{n}{2^i}. Imagine i = c\log(n). We would like to know P(h < c\log(n)). P(h < c\log(n)) \ge 1 - \frac{n}{2^{c\log(n)}} = 1 - \frac{n}{n^c} We see that the height is bounded by O(\log(n)) with high probability.
\begin{aligned}E[h] &= \sum_{t \ge 1} t P(h = t) \\ &= \sum_{t = 1}^{2\log(n)}tP(h = t) + \sum_{t = 2\log(n) + 1}^n tP(h=t) + \sum_{t \ge n+1}tP(h = t) \\ &\le \sum_{t = 1}^{2\log(n)}2\log(n)P(h=t) + \sum_{t=2\log(n)+1}^n nP(h=t) + \left(\sum_{t \ge 1}tP(h=t)-\sum_{t=1}^ntP(h=t)\right)\\ &\le 2\log(n)P(h \le 2\log(n)) + nP(h \ge 2\log(n)) + n\left(\sum_{t \ge 1}\frac{t}{2^t} - \sum_{t = 1}^n\frac{t}{2^t}\right) \\ &\le 2\log(n) + n\cdot\frac{n}{2^{2\log(n)}} + \frac{n(n+2)}{2^n} \\ &\le 2\log(n) + 1 + \frac{n(n+2)}{2^n} \end{aligned} We see that the expected height is O(\log(n)).
O(\log(n)) search, insert, and delete.
Theorem: In the comparison model, \Omega(\log(n)) comparisons are required to search a size n dictionary.
Instead of binary searching, if we know relative distances, we can get a better estimation.
If we are searching between indices l, r of array A for key k, we compare at index l + \lfloor \frac{k - A[l]}{A[r] - A[l]}(r-l) \rfloor. This approach works well if the keys are uniformly distributed.
But, worst-case performance is \Theta(n).
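A minimal sketch of interpolation search as described above, on a sorted array of integers (illustrative; not the exact in-class pseudocode):
def interpolation_search(A, k):
    l, r = 0, len(A) - 1
    while l <= r and A[l] <= k <= A[r]:
        if A[l] == A[r]:                 # all remaining keys are equal
            return l if A[l] == k else -1
        # Estimate the position of k from the relative key distances.
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return -1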
Dictionary for binary strings based on bitwise comparisons.
Prefix of a string S[0 .. n - 1] is a substring S[0..i] of S for some 0 \le i < n.
Prefix-Free: there is no pair of binary strings in the dictionary where one is the prefix of the other.
We assume that our dictionaries are prefix-free. This is always satisfied if all strings have the same length or are terminated with a unique character ‘$’.
Insert: We search for x in the trie, and expand it by adding the missing nodes.
Delete: We search for x in the trie and then delete the node, and all dangling nodes created by the deletion.
Time complexity for all operations on string s over alphabet \Sigma are O(|s|).
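A minimal sketch of trie search for binary strings (with the '$' terminator included in s); the node layout (a children dict plus a key stored at each leaf) is mine:
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character ('0', '1' or '$') to a child node
        self.key = None        # at a leaf: the stored string

def trie_search(root, s):
    # Follow the characters of s down from the root; O(|s|) time.
    node = root
    for c in s:
        if c not in node.children:
            return False
        node = node.children[c]
    return node.key == s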
No Leaf Labels.
We do not have to store the keys at the leaves; each key is implicitly stored along the path to its leaf. This halves the amount of storage needed.
Remove Chains to Leaves.
We can stop adding nodes to the trie as soon as the key is unique. This saves space if there are only a few strings that are long.
Note: This cannot be combined with no leaf labels as we would not be able to search.
Allow Proper Prefixes.
We can allow internal nodes to be keys by adding a flag at each node indicating whether it represents a key in the dictionary.
Idea: Eliminate unflagged nodes to reduce space.
def patricia_search(v, x):
    if v.is_leaf():
        return v.key() == x
    d = v.index()               # index of the character to test at this node
    c = v.child(x[d])
    if c is None:
        return "Not Found!"
    return patricia_search(c, x)
Consider the case where every key k is an integer with 0 \le k < M. We can implement a dictionary easily using an array A of size M that stores (k, v) via A[k] = v. This uses \Theta(M) space.
Direct addressing is not possible if the keys are not integers, and the storage is very wasteful if n \ll M.
Hashing: Map the keys to a small range of integers and then use direct addressing.
Generally, hash function h is not injective, so many keys can map to the same integer. We get collisions when we want to insert (k,v) into the table, but T[h(k)] is already occupied.
Assume we have n random integers in \{0, ..., M - 1\}. P(\text{no collisions}) = \frac{M^{(n)}}{M^n} = (1)\left(1 - \frac{1}{M}\right)\left(1 - \frac{2}{M}\right)...\left(1-\frac{n-1}{M}\right)
We evaluate strategies by the cost of search, insert, and delete functions. This evaluation is done in terms of the load factor \alpha = \frac{n}{M}.
We keep the load factor small by rehashing when needed.
Each table entry is a bucket containing 0 or more KVPs. This could be implemented by any dictionary (or even another hash-table). The simplest approach is to use an unsorted linked list in each bucket. This is called collision resolution by separate chaining.
Recall the load factor \alpha = \frac{n}{M}. We have a Uniform Hashing Assumption, each value is equally likely.
Proposition: Under uniform hashing, the average length of a bucket is \alpha = \frac{n}{M}.
We define X_{i,j} = \begin{cases}1, &\text{if } h(k_i) = j \\ 0, &\text{otherwise}\end{cases} as the indicator that key k_i lands in bucket j. The length of bucket j is then \sum_{i = 1}^n X_{i,j}.
\begin{aligned}E(\text{length } B_j) &= E(\sum_{i = 1}^n X_{i,j}) \\ &= \sum_{i = 1}^n E(X_{i,j}) \\ &= \sum_{i=1}^n \frac{1}{M} \\ &= \frac{n}{M} = \alpha\end{aligned}
If we maintain \alpha \in \Theta(1), then the average costs are \Theta(1) and space is \Theta(n).
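A minimal sketch of separate chaining with unsorted lists; the hash function here just reduces Python's built-in hash mod M (an assumption for illustration):
class ChainingHashTable:
    def __init__(self, M=11):
        self.M = M
        self.T = [[] for _ in range(M)]        # each bucket: unsorted list of KVPs

    def _h(self, k):
        return hash(k) % self.M

    def insert(self, k, v):
        self.T[self._h(k)].append((k, v))

    def search(self, k):
        for key, val in self.T[self._h(k)]:
            if key == k:
                return val
        return None

    def delete(self, k):
        bucket = self.T[self._h(k)]
        for i, (key, _) in enumerate(bucket):
            if key == k:
                bucket.pop(i)
                return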
Each hash table entry holds only one item, but any key k can go in multiple locations.
search and insert follow a probe sequence of possible locations for key k until an empty spot is found.
delete becomes problematic. We cannot leave an empty spot behind, because the next search might not go far enough.
The simplest method for open addressing is linear probing: h(k, i) = (h(k) + i) \mod M, for some hash function h.
def insert(T, k, v):
    for j in range(M):
        if T[h(k, j)] is "empty" or T[h(k, j)] is "deleted":
            T[h(k, j)] = (k, v)
            return "success"
    return "failure"
def search(T, k):
    for j in range(M):
        if T[h(k, j)] is "empty":
            return "not found"            # an empty slot ends the probe sequence
        elif T[h(k, j)] is not "deleted" and key(T[h(k, j)]) == k:
            return T[h(k, j)]
        # otherwise keep probing (in particular, skip over "deleted" markers)
    return "not found"
Suppose M = 16, h_1(k) = 3, h_2(k) = 6. Then the probe sequence cycles with length \frac{16}{\gcd(6, 16)} = 8. If M is prime (and h_2(k) \not\equiv 0 \pmod M), the cycle length is guaranteed to be M.
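A small sketch that lists the double-hashing probe sequence h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod M for the numbers above, making the cycle of length 8 visible:
def probe_sequence(h1, h2, M, steps):
    # Double hashing: the ith probe is (h1 + i * h2) mod M.
    return [(h1 + i * h2) % M for i in range(steps)]

print(probe_sequence(3, 6, 16, 16))
# [3, 9, 15, 5, 11, 1, 7, 13, 3, 9, 15, 5, 11, 1, 7, 13] -- only 8 distinct slots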
def cuckoo_insert(T, x):
    y = x
    i = h_1(x.key)
    for _ in range(n):
        swap(y, T[i])
        if y is "empty":
            return "success"
        if i == h_1(y.key):
            i = h_2(y.key)
        else:
            i = h_1(y.key)
    return "failure"
Strategy | Search | Insert | Delete |
---|---|---|---|
Linear Probing | \frac{1}{1 - \alpha} | \frac{1}{(1 - \alpha)^2} | \frac{1}{1 - \alpha} |
Double Hashing | \frac{1}{1 - \alpha} | \frac{1}{1 - \alpha} | \frac{1}{\alpha}\log\left(\frac{1}{1 - \alpha}\right) |
Cuckoo Hashing | 1 | \frac{\alpha}{(1 - 2\alpha)^2} | 1 |
Summary: All operations have O(1) average-case run-time if the hash-function is uniform and \alpha is sufficiently small.
What if the keys are multi-dimensional, such as strings in \Sigma^*?
We again shift the average-case performance to expected performance via randomization.
h(k) = ((ak + b) \mod p) \mod M, where p is a prime larger than any possible key and a \in \{1, ..., p - 1\}, b \in \{0, ..., p - 1\} are chosen at random.
We can prove that for any fixed numbers x \neq y, the probability of a collision using this function h is at most \frac{1}{M}. So, the expected runtime is O(1) if we keep the load factor small enough.
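A minimal sketch of drawing one function from this family; here p = 2^{31} - 1 (a prime) is assumed to be larger than every key:
import random

def make_universal_hash(M, p=2**31 - 1):
    # p must be a prime larger than any key; a, b are chosen once at random.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % M

h = make_universal_hash(M=101)
print(h(42), h(43))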
We are given n points in a sorted array with respect to x. We can build a balanced BST for these points in time O(n).
Recursively, we spend O(1) at every node, total O(n).
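A minimal sketch of this construction: the median of the sorted range becomes the root and each half is built recursively, O(1) work per node (the node class is mine):
class Node:
    def __init__(self, point):
        self.point = point
        self.left = self.right = None

def build_balanced_bst(points, lo=0, hi=None):
    # points must already be sorted (here: by x-coordinate).
    if hi is None:
        hi = len(points) - 1
    if lo > hi:
        return None
    mid = (lo + hi) // 2
    node = Node(points[mid])
    node.left = build_balanced_bst(points, lo, mid - 1)
    node.right = build_balanced_bst(points, mid + 1, hi)
    return node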
A range tree is a tree of trees. The primary structure is a BST T that uses the x-coordinates as keys. Each node v \in T stores an auxiliary structure T(v).
Let S(n) be the space complexity of a range tree over n points. S(n) satisfies S(n) = 2S\left(\frac{n}{2}\right) + O(n), so S(n) \in O(n\log(n)).
Range tree space usage is O(n\log(n)).
We have a problem with balancing. If we use AVL trees, insertion / deletions are very slow (rotation at v changes P(v) and hence requires a rebuild of T(v)). Instead of rotations, we can do something similar as for kd-trees, allow certain imbalance, and rebuild the entire subtree if violated.
def build(P):
    # O(|P| log |P|)
    x_list = sorted(P, key=lambda p: p.x)
    y_list = sorted(P, key=lambda p: p.y)
    return build_recursive(x_list, y_list)

def build_recursive(x_list, y_list):
    if not x_list:
        return None
    n = len(x_list)
    mid = x_list[n // 2]
    T = RangeTree()
    # O(1)
    T.key = mid
    # O(|y_list|)
    T.y_tree = BST(y_list)
    # O(|x_list|): split around the median, which is stored at this node
    x_left = x_list[0 : n // 2]
    x_right = x_list[n // 2 + 1 :]
    # O(|y_list|); assumes distinct x-coordinates
    y_left = [p for p in y_list if p.x < mid.x]
    y_right = [p for p in y_list if p.x > mid.x]
    # Note |x_left| = |y_left| and |x_right| = |y_right|; the x-lists are sorted by x,
    # the y-lists are sorted by y.
    T.left = build_recursive(x_left, y_left)
    T.right = build_recursive(x_right, y_right)
    return T
We see that the runtime of build_recursive is T(n) = 2T\left(\frac{n}{2}\right) + O(n), so T(n) \in O(n\log(n)).
Time for range-query in range tree is O(s + \log^2(n)). This can be further reduced to O(s + \log(n)).
Example: T = abbbababbab, P = abba.
Boyer-Moore is brute-force search with two changes: the pattern is compared against the text right-to-left, and on a mismatch the last-occurrence array is used to shift the guess forward.
def BoyerMoore(T, P):
    n, m = len(T), len(P)
    L = last-occurrence array computed from P
    k = 0                        # current guess (shift)
    while k <= n - m:
        j = m - 1
        while j >= 0:
            if T[k + j] != P[j]:
                break
            j -= 1
        if j == -1:
            return k
        else:
            k = k + max(1, j - L[T[k + j]])
    return FAIL
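A minimal sketch of how the last-occurrence array L used above can be computed from P; here L is a dict mapping each character of P to its last index (characters not occurring in P can be treated as -1, e.g. via L.get(c, -1)):
def last_occurrence(P):
    # L[c] = largest index j with P[j] == c; later occurrences overwrite earlier ones.
    L = {}
    for j, c in enumerate(P):
        L[c] = j
    return L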
Claim: The suffix tree for a string of length n has \Theta(n) nodes.
Suffix tree has at most n internal nodes.
Key Property: Every internal node in a suffix tree has at least 2 children.
We prove by induction that a tree with m leaves in which every internal node has at least 2 children has at most m - 1 internal nodes.
For the inductive step, consider such a tree with m + 1 leaves. The root has s \ge 2 children, and each of the s subtrees satisfies the induction hypothesis. Summing over the subtrees gives at most (m + 1) - s \le m - 1 internal nodes below the root, hence at most m internal nodes in total.
The code must be uniquely decodable.
We should store the partial tries in a min-ordered heap (priority queue keyed by frequency) in order to make this efficient.
Observation: Certain substrings are much more frequent than others.
We want to take advantage of such substrings without needing to know beforehand what they are.
Huffman | Run-Length Encoding | Lempel-Ziv-Welch |
---|---|---|
Variable-length | Variable-length | Fixed-length |
Single-character | Multi-character | Multi-character |
2-pass | 1-pass | 1-pass |
60% compression on English text | Bad on text | 45% compression on English text |
Optimal 01-prefix-code | Good on long runs (e.g. pictures) | Good on English text |
Must send dictionary | Can be worse than ASCII | Can be worse than ASCII |
Rarely used directly | Rarely used directly | Frequently used |
JPEG, MP3 | Fax machines, old picture-formats | GIF, Unix compress |
- Burrows-Wheeler Transform
- Move-to-Front (MTF) Transform
- Modified RLE
- Huffman Encoding
This takes advantage of locality in the data. If we see a character now, we might see it again soon.
Specifics: MTF is an adaptive text-transform algorithm.
If the source alphabet is \Sigma_S with size m, then the coded alphabet will be \Sigma_C = \{0, ..., m - 1\}.
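A minimal sketch of MTF encoding: output the current index of each source character, then move that character to the front of the list.
def mtf_encode(S, alphabet):
    L = list(alphabet)          # working order of the source alphabet
    out = []
    for c in S:
        i = L.index(c)
        out.append(i)           # emit the current position of c
        L.pop(i)
        L.insert(0, c)          # move-to-front
    return out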
Crucial Ingredient: A cyclic shift of a string X of length n is the concatenation of X[i + 1..n - 1] and X[0..i], for 0 \le i < n.
Given C, we can reconstruct the first and last column of the array of cyclic shifts by sorting.
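A minimal sketch of the forward transform: sort all cyclic shifts and output the last character of each; S is assumed to end with a unique sentinel such as '$'.
def bwt_encode(S):
    n = len(S)
    shifts = sorted(S[i:] + S[:i] for i in range(n))    # all cyclic shifts, sorted
    return "".join(shift[-1] for shift in shifts)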
Encoding Cost: O(n^2) using MSD radix sort, often better.
Decoding Cost: O(n).
Encoding and decoding both use O(n) space.