Available here in PDF.
Should be the same throughout the algorithm.
Used to prove correctness of algorithm and to understand how an algorithm works.
There are stronger and weaker invariants, stronger invariants are more detailed.
Some examples:
"At the end of iteration j: the first j items in the array are in sorted order." (insertion sort) -- also true for selection sort (but there is a stronger one for it)
"After iteration j: the first j items in the array are in sorted order and are the j smallest items in the array." (selection sort)
$T(n) = O(f(n))$ if T grows no faster than f.
There exist constants $c > 0$ and $n_0 > 0$ such that for all $n > n_0$: $T(n) \le c \cdot f(n)$.
$T(n) = \Omega(f(n))$ if T grows no slower than f.
There exist constants $c > 0$ and $n_0 > 0$ such that for all $n > n_0$: $T(n) \ge c \cdot f(n)$.
$T(n) = \Theta(f(n))$ if T grows at the same rate as f.
$T(n) = \Theta(f(n))$ if and only if $T(n) = O(f(n))$ and $T(n) = \Omega(f(n))$
Function | Name |
---|---|
c | Constant |
$\log\log n$ | double log |
$\log n$ | logarithmic |
$\log^2 n$ | poly-logarithmic |
$\sqrt{n}$ | square root
$n$ | linear |
$n\log n$ | log-linear |
$n^3$ | polynomial |
$n^3\log n$ | |
$n^4$ | polynomial |
$2^n$ | exponential |
$2^{2n}$ | different (bigger) than $2^n$ |
$n!$ | factorial |
Note: $\log(n!)$ is $O(n\log n)$
Stirling's Approximation: $n! \approx \sqrt{2\pi n}(\frac{n}{e})^n$
"Average" cost but NOT average.
Definition: an operation has amortised cost $T(n)$ if, for every integer $k$, the cost of $k$ operations is $\le k \cdot T(n)$.
Must be true for ALL operations.
Different sorting algorithms have their place! Sometimes you want an algorithm with specific properties.
Absolute speed is usually not a good indicator (by itself)
Asymptotic run time: best/worst/average case run time.
In-place algorithm = $O(1)$ space usage
Whether the algorithm keeps the same order for identical keys.
Inputs like sorted, unsorted, reverse sorted arrays.
An array where Bubble Sort is slow but Selection Sort is fast: a sorted array with the smallest element moved to the end, e.g. [2, 3, 4, 5, 6, 7, 8, 9, 1].
Randomly generate a permutation and check if sorted. $O(n*n!)$
Not in-place. Not stable.
repeat (up to) n times:
    for j <- 1 to n-1:
        if A[j] > A[j+1] then swap(A[j], A[j+1])
    break if no swap was done (array is already sorted)
Compare and swap each element.
Average: $O(n^2)$. Best: $O(n)$ -- sorted (no swaps done). Worst: $O(n^2)$ -- reverse sorted, n passes over the array.
Bubble sort iterates through the entire array (maximum of) n times and terminates early if no swap is done.
Loop Invariant: At the end of iteration j, the biggest j items are correctly sorted in the final j positions of the array.
Correctness: after n iterations, the array is sorted
In-place. Stable algorithm.
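The steps above can be sketched in Java (the class and method names are illustrative, not from the notes):

```java
class BubbleSortDemo {
    // bubble sort with early termination: best case O(n) on sorted input
    static int[] sort(int[] a) {
        for (int pass = 0; pass < a.length; pass++) {
            boolean swapped = false;
            for (int j = 0; j < a.length - 1; j++) {
                // compare each adjacent pair, swap if out of order
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                    swapped = true;
                }
            }
            if (!swapped) break; // no swap done means the array is sorted
        }
        return a;
    }
}
```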
Finds the minimum of the unsorted portion n times and swaps it into place. Always $\Theta(n^2)$.
Loop Invariant: At the end of the iteration j, the smallest j items are correctly sorted in the first j positions of the array.
In-place. Not stable: long-distance swaps can change the relative order of equal elements (but can be made stable with care). At most $n$ swaps.
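A minimal Java sketch of the idea (hypothetical class name):

```java
class SelectionSortDemo {
    static int[] sort(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int min = i;
            // scan the unsorted suffix for the index of the smallest item
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) min = j;
            }
            // one swap per iteration: at most n swaps in total
            int t = a[i]; a[i] = a[min]; a[min] = t;
        }
        return a;
    }
}
```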
Average: $O(n^2)$ Best: $O(n)$ -- sorted (no swaps done) Worst: $O(n^2)$ -- reverse sorted
$O(nk)$ where k is the maximum distance of element from its sorted index
Loop Invariant: At the end of the iteration j, the first j items in the array are in sorted order.
Fast if the array is (almost) sorted.
In-place.
Can be stable: shift while A[i] > key (strictly greater, not >=).
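A sketch in Java (illustrative class name) showing the strict comparison that preserves stability:

```java
class InsertionSortDemo {
    static int[] sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            // strict > (not >=) keeps equal keys in their original order
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j]; // shift bigger items right
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }
}
```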
Divide and conquer algorithm that splits the array into half recursively.
Average: $O(n\log n)$ Worst: $T(n)=2T(n/2)+cn$ -> $O(n\log n)$
In practice, Insertion Sort is faster than Merge Sort for small arrays (here, under ~1000 elements), partly because Merge Sort needs to copy arrays. So it is possible to combine Merge Sort and Insertion Sort into a hybrid that is faster than either alone.
Not in-place (in-place variants are possible but very tricky or slower, e.g. using array shifting). Memory use can be reduced by copying the array only in the merge step instead of at every recursive call.
Stable algorithm.
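A minimal merge sort sketch in Java (hypothetical class name) that copies only the merged portion, as suggested above:

```java
class MergeSortDemo {
    // sorts a[lo..hi] inclusive
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int mid = (lo + hi) / 2;
        sort(a, lo, mid);
        sort(a, mid + 1, hi);
        merge(a, lo, mid, hi);
    }
    static void merge(int[] a, int lo, int mid, int hi) {
        int[] buf = new int[hi - lo + 1]; // buffer only for the merged portion
        int i = lo, j = mid + 1, k = 0;
        while (i <= mid && j <= hi)
            buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++]; // <= keeps stability
        while (i <= mid) buf[k++] = a[i++];
        while (j <= hi) buf[k++] = a[j++];
        System.arraycopy(buf, 0, a, lo, buf.length);
    }
}
```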
Divide and conquer algorithm.
Uses a pivot to partition the array into two parts: elements <= pivot and elements > pivot.
There is a double-pivot variant of quicksort that is faster than the single-pivot variant.
Average case: $O(n\log n)$ Worst case: $O(n^2)$ -- sorted or reverse sorted array (only for head or tail pivots)
Loop invariant: for every i < low: B[i] < pivot; for every i >= high: B[i] > pivot.
This version handles duplicates well. 3-way partitioning of the array (easy in theory but a lot of corner cases!)
< pivot | = pivot | in-progress | > pivot
The pivot can be the head, the tail, a random index, or any arbitrary index. In the worst case, most simple (deterministic) pivot choices are bad. Partitioning around the median element is always $O(n\log n)$ (but computing the median of an unsorted array takes extra work).
Even a split as unbalanced as 1:9 still gives $O(n\log n)$! Picking the pivot randomly produces at least this split most of the time (analyse with probability theory): 8/10 probability of a good pivot (only pivots among the lowest 10% or highest 10% of values are bad).
With a random pivot, the running time is $O(n\log n)$ almost every single time (with high probability).
Switching to insertion sort for small arrays is better than recursing all the way (these smaller arrays are already mostly sorted)! OR stop the recursion early (when each chunk is around < 20), then insertion sort on the entire array.
In-place. Not stable.
All in-place quick-sorts => not stable.
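A minimal randomised quicksort sketch (hypothetical class name) using the simple 2-way partition described above, with a random pivot swapped to the front:

```java
import java.util.Random;

class QuickSortDemo {
    static final Random RNG = new Random();
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }
    // 2-way partition; returns the pivot's final index
    static int partition(int[] a, int lo, int hi) {
        swap(a, lo, lo + RNG.nextInt(hi - lo + 1)); // random pivot to front
        int pivot = a[lo], ptr = lo + 1;
        for (int i = lo + 1; i <= hi; i++)
            if (a[i] < pivot) swap(a, i, ptr++);
        swap(a, lo, ptr - 1); // move pivot into its sorted position
        return ptr - 1;
    }
    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```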
Counting Sort, Radix Sort, Bucket Sort
Good for small ranges of values with high amount of repetition.
Maintain another array as a counter. Count in $O(n)$. Then, re-construct the array using the counter in $O(n)$.
Can be stable if re-construct using a mix of original array and counter.
Stable but not in-place.
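A stable counting sort sketch in Java (illustrative class name), using prefix sums over the counter to reconstruct in original order:

```java
class CountingSortDemo {
    // sorts values in [0, k); stable reconstruction via prefix sums
    static int[] sort(int[] a, int k) {
        int[] count = new int[k + 1];
        for (int x : a) count[x + 1]++;                       // count occurrences, O(n)
        for (int i = 0; i < k; i++) count[i + 1] += count[i]; // prefix sums = start positions
        int[] out = new int[a.length];
        for (int x : a) out[count[x]++] = x;                  // stable: original order kept
        return out;
    }
}
```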
Good for sorting integers or strings. Sort by the first index of each container (e.g. digit or character), then the second, etc. Or from least significant to most significant digit.
$O(nm)$ where n is the size of the array and m is the size of the largest container.
Only faster than quicksort if small container sizes. $n>2^m$ to be faster.
Makes use of counting sort as a subroutine.
So, stable (if sort in correct order of least to most significant bits) but not in-place.
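A least-significant-digit radix sort sketch for non-negative ints (hypothetical class name), running a stable counting-sort pass per base-10 digit:

```java
class RadixSortDemo {
    // LSD radix sort, base-10 digits, stable counting sort per digit
    static int[] sort(int[] a) {
        int max = 0;
        for (int x : a) max = Math.max(max, x);
        for (int exp = 1; max / exp > 0; exp *= 10) {
            int[] count = new int[11];
            for (int x : a) count[(x / exp) % 10 + 1]++;
            for (int i = 0; i < 10; i++) count[i + 1] += count[i]; // prefix sums
            int[] out = new int[a.length];
            for (int x : a) out[count[(x / exp) % 10]++] = x;      // stable pass
            a = out;
        }
        return a;
    }
}
```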
(For CS2040S) Total space ever allocated. Ignore memory allocators, garbage collectors, stack frames etc.
Finding the $k^{th}$ smallest element in an unsorted array.
Only do the minimum amount of sorting necessary.
Even a worst-case "good" split of 9:1 each time gives $T(n) = T(9n/10) + O(n) = O(n)$.
// https://leetcode.com/problems/kth-largest-element-in-an-array/
class Solution {
Random rng = new Random();
private int randomInt(int lo, int hi) {
// hi is inclusive
return lo + rng.nextInt(hi - lo + 1);
}
private void swap(int[] nums, int a, int b) {
int temp = nums[a];
nums[a] = nums[b];
nums[b] = temp;
}
// this is a simple 2-way partitioning
// there are better (and more complex) partitioning algorithms, notably:
// 3-way partitioning same complexities but has better average real time (when duplicate values are involved)
// Median of medians: worst case O(n) https://en.wikipedia.org/wiki/Median_of_medians
private int partition(int[] nums, int lo, int hi) {
// returns the rank of the pivot
int pivot = randomInt(lo, hi);
int value = nums[pivot];
swap(nums, lo, pivot);
int pointer = lo + 1;
for (int i = lo + 1; i <= hi; i++) {
int x = nums[i];
if (x < value) {
swap(nums, i, pointer++);
}
}
swap(nums, lo, pointer - 1);
return pointer - 1;
}
private int findKthSmallest(int[] nums, int lo, int hi, int k) {
int index = partition(nums, lo, hi);
if (index > k) {
return findKthSmallest(nums, lo, index - 1, k);
} else if (index < k) {
return findKthSmallest(nums, index + 1, hi, k);
} else {
return nums[k];
}
}
public int findKthLargest(int[] nums, int k) {
// quick select => O(n)
int N = nums.length;
return findKthSmallest(nums, 0, N - 1, N - k);
}
}
Finding the largest $k$ elements in an unsorted array.
Finding the smallest $k$ elements is similar.
Trees can have very different shapes! Performance is dependent on shape of tree.
Number of ways to order insertions: $n!$. Number of shapes of a binary tree: $\sim 4^n$ (Catalan numbers). Since $n! > 4^n$ (for large n), by the pigeonhole principle, different insertion orders can produce the same shape.
The height of a rooted tree is the maximum level of any vertex of the tree. The leaves have height 0. Empty node has height -1.
Easy with recursion. Or use a stack to simulate recursion. $O(n)$ to traverse the entire tree. Each node is visited at most once.
BFS starts at the root and visits its adjacent vertices, and then moves to the next level. Use a queue to traverse.
Property: All in left subtree < key < all in right subtree Possible to reconstruct from just pre-order traversal.
Recurse left or right depending on less or more respectively to find a key or insertion position.
search(x) then place there
If searching for a key not in the tree. search(x) will find either predecessor or successor.
But if x is the maximum value, then this algorithm will simply return the root (which is wrong).
If leaf (no child), just remove.
If only has 1 child, simply replace with child.
If there are 2 children swap it with the successor and then remove the node.
The successor of the node being deleted (the minimum of its right subtree) has no left child, so it has at most 1 child and removing it falls into the easy cases.
This will make sure that BST property is upheld.
$O(height)$
BST is balanced if $h = O(\log n)$ (informally). A node v is height-balanced if $|v.left.height - v.right.height| \le 1$. A BST is height-balanced if every node in the tree is height-balanced.
Every height-balanced tree is balanced. NOT all balanced trees are height-balanced.
$2\log n$ is the maximum height of a height-balanced tree.
Storing the height in the node to avoid having to recompute the height multiple times.
Note: must update height on insert/delete
this.height = max(left.height, right.height) + 1
A height-balanced tree with n nodes has height $h<2\log n$. More accurately, $h\approx \log n/\log \phi \approx 1.44\log n$
Invariant of AVL Trees: Trees are height-balanced (don't need to be perfect).
Minimum number of nodes: $n > 2^{h/2}$. Proof: $n_h \ge 1 + n_{h-1} + n_{h-2} \ge 2n_{h-2}$, which unrolls to $n_h \ge 2^{h/2}$.
Right rotation requires a left child. Left rotation requires a right child.
Rotations always have inverses!
Right rotation:
(not inclusive of checking if valid)
w = v.left
w.parent = v.parent
v.parent = w
v.left = w.right
w.right = v
After an insertion, use tree rotations to re-balance the tree. At most 1 node is out of balance, so balance the first (lowest) unbalanced node.
Left-heavy if left sub-tree has larger height than right sub-tree. Right-heavy if right sub-tree has larger height than left sub-tree.
Rotations are $O(1)$ because only changing pointers in constant time. Rotations maintain ordering of keys, aka maintains BST property.
If v is out of balance and left-heavy:
If v is out of balance and right-heavy:
Check if height-balanced when unrolling recursion (and updating heights).
Rebalance trees using tree rotation.
Steps for inserting into AVL Tree.
Minimum number of rotations after an insertion is 0. Maximum number of rotations after an insertion is 2. This is because inserting increases the height and rotations reduce the height, so they effectively cancel each other out.
After a deletion, the maximum number of rotations is $O(\log n)$. Deleting reduces the height and rotations also reduce the height, so it is not sufficient to fix only the lowest out-of-balance node: keep rebalancing up to the root.
Regular BSTs are not great for storing words because comparing two strings takes $O(\min(|A|, |B|))$. So each tree operation takes $O(hL)$, where L is the maximum string length. Tries are trees that are optimised for storing words.
Tries can also be used to store binary bits or IP addresses.
Search: $O(L)$. Worst case space usage (no overlap): $O(nL)$ times overhead (need an array or other structure to store pointers to children; asymptotically the same but practically more). A lot of space is wasted on *potential* children, especially for large character sets (e.g. Unicode). Binary tries don't have this problem :D
Partial String Operations
Compressed tries (only unchain the nodes if needed). Using hash table instead of array.
// can also use a null-terminating character as a node to indicate the end (instead of a boolean)
public class Trie {
static class TrieNode {
char value;
boolean isEnd;
// can replace this array with hash table
TrieNode[] nextNodes;
TrieNode(char value, boolean isEnd) {
this.value = value;
this.isEnd = isEnd;
this.nextNodes = new TrieNode[26];
}
}
private final TrieNode root;
Trie() {
this.root = new TrieNode('0', false);
}
void insert(String word) {
char[] chars = word.toCharArray();
int N = chars.length;
TrieNode node = this.root;
for (int i = 0; i < N; i++) {
char c = chars[i];
if (node.nextNodes[c - 'a'] == null) {
node.nextNodes[c - 'a'] = new TrieNode(c, false);
}
node = node.nextNodes[c - 'a'];
if (i == N - 1) {
node.isEnd = true;
}
}
}
boolean search(String word) {
TrieNode node = this.root;
for (char c : word.toCharArray()) {
if (node.nextNodes[c - 'a'] == null) {
return false;
}
node = node.nextNodes[c - 'a'];
}
return node.isEnd;
}
boolean startsWith(String word) {
TrieNode node = this.root;
for (char c : word.toCharArray()) {
if (node.nextNodes[c - 'a'] == null) {
return false;
}
node = node.nextNodes[c - 'a'];
}
return true;
}
}
Each node has up to $m - 1$ keys and up to $m$ children nodes.
Binary Search Tree is simply a M-Way Search tree with degree of 2.
Child Pointer 0 | $k_1$ | Child Pointer 1 | $k_2$ | ... | $k_{m-1}$ | Child Pointer $m$ with increasing values
Problem with m-way search trees: there is no algorithm for proper (balanced) insertion. So, B-Trees are m-way search trees with extra rules.
Child policy: Root: 1 to b-1 keys, 2 to b children. Other internal nodes: a-1 to b-1 keys, a to b children. (Leaf nodes have keys but no children.)
All non-leaf nodes must have 1 more child than its number of keys.
All leaf nodes must be at the same depth (from root).
https://www.youtube.com/watch?v=aZjYr87r1b8
B does NOT stand for binary!
B-Tree are M-Way Search Trees with extra rules! B-Tree is (b, 2b)-tree.
Rules of B-Tree:
Creating tree:
Theoretically $O(\log n)$; practically close to $O(1)$ because of memory access times (few node accesses when b is large).
Similar to B-Trees with few differences.
Most of the time, you need to add/store some extra data in an existing data structure to solve a new problem.
AVL Trees are an augmented BST.
Need to modify data structure to maintain additional information when the structure changes (subject to insert/delete or whatever else the data structure was supposed to handle).
Use properties that only depend on the subtree (so that it is easy to maintain)
Also do need to take note that storing properties also take up space.
Augmenting AVL Tree.
Store weight of the subtree in every node. Still takes $O(1)$ (same as rotation itself) to update weights during rotation.
select(k)
def select(v, k):
    rank = v.left.weight + 1
    if k == rank:
        return v
    elif k < rank:
        return select(v.left, k)
    else:
        return select(v.right, k - rank)
rank(node)
def rank(node):
    rank = node.left.weight + 1
    while node != null:
        if node is a right child:
            rank += node.parent.left.weight + 1  # parent and its left subtree
        node = node.parent
    return rank
aka Interval Search Trees
Interval trees are used to find if a value is captured in a set of intervals.
Intervals are sorted by the left endpoints.
Augment BST with maximum endpoint in subtree self.max = max(self.max, left.max, right.max)
Insert similarly to regular BST, but need to update max (only in root to leaf path) Need to maintain max during rotations (if using AVL tree)!
Key is important even though not directly used in algorithm (the relation between the nodes are important)
def interval_search(x):
c = root
while (c != null and x is not in c.interval):
if c.left == null:
c = c.right
else if x > c.left.max:
c = c.right
else:
// always search left if left.max > x
c = c.left
return c.interval
$O(\log n)$ if balanced
If search goes left, if there is no interval in left subtree, then there is no interval in right subtree either. If search goes right, there is no overlap in left subtree.
Interval Search (all intervals that work)
$O(k\log n)$ where k is the number of intervals that work but a faster solution (albeit more complicated) works in $O(k+\log n)$
Split Node is the highest node where lo < node.val <= hi
## find the highest node whose key lies within the query range
def findSplit(lo, hi):
    v = root
    while true:
        if hi <= v.key:
            v = v.left
        elif lo > v.key:
            v = v.right
        else:
            break
    return v
def query(lo, hi):
v = findSplit(lo, hi)
leftTraversal(v, lo, hi)
rightTraversal(v, lo, hi)
def collect_all_leaf(v):
collects if its in the range
def leftTraversal(v, lo, hi):
if lo <= v.key:
collect_all_leaf(v.right)
leftTraversal(v.left, lo, hi)
else:
leftTraversal(v.right, lo, hi)
def rightTraversal(v, lo, hi):
if v.key <= hi:
collect_all_leaf(v.left)
rightTraversal(v.right, lo, hi)
else:
rightTraversal(v.left, lo, hi)
$O(k + \log n)$ where k is the number of points found $O(n\log n)$ time to build tree $O(n)$ space complexity
Can augment the tree with the count (if only want the count and not the exact values)
For rotations, not much changes need to be done.
Can build a x-tree using only x-coordinates For every node in the x-tree, build a y-tree out of nodes in subtree using only y-coordinates.
Query: $O(\log^2 n + k)$
Space: $O(n\log n)$
Building: $O(n\log n)$ but not easy
No longer a dynamic data structure because rotations are no longer efficient $O(n)$ thus cannot maintain balance.
Possible to extend to higher dimensions. BUT have to store a d-1 dimensional range tree in each node of a 1D range tree.
Good for storing coordinates.
Much better idea than n-dimensional range queries because this is a dynamic data structure. Alternating x and y (and z and ...) levels
$O(h)$
Example here for 2D.
Alternate x and y. WLOG, treat as x findMedian using quick select (and remove the coord after selecting it)
$T(n) = 2T(n/2) + O(n)$, so $T(n) = O(n\log n)$
For minimum x-coordinate. Finding maximum (and the equivalents for y-coordinate) is similar.
For a perfectly balanced kd-tree: even levels $T(n) = T(n/2) + O(1)$, odd levels $T(n) = 2T(n/2) + O(1)$. Combining both: $T(n) = 2T(n/4) + O(1)$, so $T(n) = O(\sqrt n)$
Symbol Table Operations
Dictionary are ordered symbol tables.
HashTables are an implementation of a symbol table (need not be ordered). By implementing the symbol table with AVL Trees, we get $O(\log n)$ operations AND maintain order.
Implementing sorting with (hash-based) symbol tables takes $O(n^2)$ because the successor of a key cannot be found quickly.
Symbol tables are not comparison-based because they can work in $O(1)$.
Can use hashing to prove that you have the password without sending the password! (cryptography)
Treating everything as a number: the longest English word is 28 letters, at 5 bits per letter => at most 140 bits. So an array of size $2^{140}$ could store every English word -- an astronomically large number, so obviously too much space!
m = number of buckets Usually, m = 2n is chosen to keep probability of collision low
Huge universe U of possible keys. Smaller number n of actual keys.
Store key k in bucket h(k) Time: time to compute h + time to access bucket In CS2040S, can assume hash function takes O(1) unless otherwise mentioned
Two distinct keys $k_1$ and $k_2$ collide if $h(k_1) = h(k_2)$ Always have collisions eventually by PHP because Universe >>> number of buckets.
A few methods to avoid problems of collisions (hash collision resolution)
Assuming every key is equally likely to map to every bucket. (for simpler calculation)
Let $X(i, j) = 1$ if item i is put in bucket j, and 0 otherwise. Then $P(X(i, j) = 1) = 1/m$ and $E[X(i, j)] = 1/m$.
By linearity of expectation, for n items, expected number of items per bucket: $n/m$
If m == n, can still insert but will be less efficient.
n > m is better for enumerating all the keys in a hash table (not normally prioritised) As long as m > n, equally efficient. (unlike open addressing)
Put all the items that collide in the same bucket using a linkedlist of items in that bucket.
Total Space: O(m+n) Table size: O(m) Linkedlist size: O(n)
Linkedlist wander all over the memory (bad for cache access) Needs to keep allocating list nodes into memory
Expected cost: $O(1+cost(h))$ because just insert into the front of linkedlist but if want to avoid duplicates, have to search for duplicates OR end up having many duplicates in linkedlist
Expected maximum cost of inserting n items: $O(\log n/\log\log n)$ (proof beyond CS2040S)
If all keys hash to the same bucket, then effectively become a linkedlist! Just assume that the hash function satisfies the simple uniform hashing assumption.
Expected: h + n/m = O(h) Worst case: O(h + n)
Assume hashing with chaining Assume simple uniform hashing Expected search time: O(1 + n/m) Optimal size: $m=\Theta(n)$
if m < 2n: too many collisions if m > 10n: too much wasted space note that: 2 and 10 above are arbitrary
PROBLEM: don't know what is n in advance
IDEA: grow and shrink the table size as necessary (similar to dynamic arrays)
Time Complexity Cost: $O(m_1 + m_2 + n)$ This is because allocating memory takes a lot of time!
Do you care that some insertions take a lot longer than others?
When you need it to work, then suddenly need to rebuild, gg.
If growing the table by 1 each time: cost of each resize is O(n), so inserting n items costs $O(n^2)$. Even growing by a larger constant each time: still $O(n^2)$ overall!
Cost of resize: O(n) Cost of inserting n items + resizing: O(n)
Most insertions: O(1) Some insertions: O(n) (to rebuild) Average cost: O(1)
IDEAL!!!
(squaring the table size) Cost of resize: $O(n^2)$. Cost of inserting n items: $O(n + n^2 + n) = O(n^2)$. High cost of resize AND a lot of space wasted: potentially only $\sqrt{n}$ of the table is used.
When too many items are deleted, table is too big.
if n == m, m = 2m if n < m/2, m = m/2
Cost of resize: O(n)
Imagine that at the border, delete 1 item, shrink, add 1 item, grow. VERY BAD!!!!!
if n == m: m = 2m. if n < m/4: m = m/2. Shrink only when the table is 3/4 empty instead of 1/2 -- this leaves empty space in both directions and avoids the thrashing problem above!
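The grow/shrink thresholds can be sketched as a toy policy tracker (hypothetical class; it only tracks n and m, with no actual buckets):

```java
class ResizePolicyDemo {
    int n = 0; // number of items
    int m = 1; // table size
    void insert() { n++; if (n == m) m *= 2; }    // grow when full
    void delete() { n--; if (n < m / 4) m /= 2; } // shrink when 3/4 empty
}
```

Right after a growth, a single delete does not trigger a shrink, so alternating insert/delete at the boundary no longer causes repeated O(n) rebuilds.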
Claim:
This will then be of amortised expected costs of O(1) for insertion and deletion.
If m == n, can no longer insert!
Idea: Find another bucket.
Pros
Cons
Hash Function has to return a SEQUENCE of values for some key. h(key, i):
Two properties of a good hash function
For n items, in a table of size m, assuming uniform hashing, the expected cost of an operation is $<=\frac{1}{1-\alpha}$ where $\alpha=n/m <1$
because you will never probe the same bucket twice (because probe sequence does not repeat)
$1+\frac{n}{m}(1+\frac{n-1}{m-1}(1+\frac{n-2}{m-2}(...)))$ $\frac{n-i}{m-i}<=\frac{n}{m}<=\alpha$ so $<=1+\alpha(1+\alpha(...))$ $<=1+\alpha +\alpha ^2+\alpha ^3+...$ $<=\frac{1}{1-\alpha}$
Very fast because of caching despite the clusters (because the clusters are mostly in the same cache, so negligible difference)
for i in 0..m-1:
    if buckets[(hash(x) + i) % m] is empty:
        insert x into this bucket
        break
Keep trying adjacent buckets until an empty bucket.
h(k, 0), h(k, 1), h(k, 2), ... until an empty bucket is found, then insert there. In general: $h(k, i) = (h(k, 0) + i) \bmod m$ (i is the number of tries; mod m because the table is circular).
If DELETED cell, overwrite the deleted cell
Keep probing by incrementing i until the correct key is found OR null (empty)
If DELETED cell, continue probing
Cannot just set to null, because will affect searching other elements (e.g. expected to be 3rd try but if null, will terminate before 3rd try)
Tombstone value: Set bucket to be DELETED
Similar to linear probing but increment by squares instead
increment by 1, 4, 9, 16, ..., $i^2$
for i in 0..m:
if buckets[(hash(x) + i * i) % m] is empty:
insert x into this bucket
break
Quadratic probing can fail to insert even when table is not full When table size is of $n^2-1$ and constantly insert into slot 0.
Start with two original hash functions f(k) and g(k) let $h(k,i)=f(k) + i*g(k) \mod m$
Since f(k) is good, the first probe is almost random. Since g(k) is good, the probe sequence is almost random.
If g(k) is relatively prime to m, then h(k, i) hits all buckets Can prove by contradiction.
Examples where g(k) is relatively prime to m: $m = 2^r$ with g(k) always odd; m = 330 with $g(k) = 7^k$.
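A toy probe-sequence sketch (f, g, and the table size here are made-up examples, not from the notes): with m = 11 prime and g(k) never 0, g(k) is relatively prime to m, so the sequence hits every bucket:

```java
class DoubleHashDemo {
    // hypothetical example hash functions
    static int f(int k) { return k % 11; }
    static int g(int k) { return 1 + (k % 10); } // in 1..10: never 0, coprime with 11
    // probe sequence h(k, i) = (f(k) + i * g(k)) mod m
    static int probe(int k, int i, int m) { return (f(k) + i * g(k)) % m; }
}
```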
Use a table lookup with XOR operations to generate a hash. This lookup table is a table of size [x, N] where x is the numerical base of the key and N is the length (in bits) of the key.
Assuming the key is a N-bit binary value, so x = 2, N = N.
hash = 0
for j in range(N):
hash = hash XOR T[key[j]][j]
Standard is to use HashMap<Key,Value>
interface java.util.Map<Key,Value>
void clear()
boolean containsKey(Object k)
boolean containsValue(Object v) // may not be efficient!
Value get(Object k)
Value put(Key k, Value v)
Value remove(Object k)
int size()
Set<Map.Entry<Key,Value>> entrySet()
Set<Key> keySet()
Collection<Value> values() Above 3 are not sorted and not necessarily efficient
equals() method must be consistent with hashCode(): equal objects must have equal hash codes.
More Java HashMap Behaviour
It is implemented with chaining. It will hash the hashCode of your object again to a value that can fit in the internal table! Because of this, table_hash == table_hash does not imply that hash == hash.
TreeMap (dictionary in Java)
Lots of functionality Easy to use No guarantee of efficiency (some operations are not optimised) Java mostly has wide interfaces
Limited functionality Enforces proper usage Restricts usage
Every Object in Java has a hashCode() method By default Java hashCode returns the memory location of the object. Every Object hashes to a different location. Should override this!
Rules:
For Integer:
public int hashCode() {
return value;
}
For Long:
public int hashCode() {
return (int) (value ^ (value >>> 32));
}
For String:
s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1] * 31^0
// 31 is just a prime number that happens to work well
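The same polynomial can be computed iteratively with Horner's rule; this matches what `String.hashCode()` actually computes (the class name here is illustrative):

```java
class StringHashDemo {
    // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], folded as h = 31*h + s[i]
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i); // int overflow just wraps, fine for hashing
        return h;
    }
}
```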
A set of edges and a set of vertices.
These should not be conditional and should work regardless of direction, distance, etc...
Multi Graphs: Has more than 1 edge between same pair of nodes
Hyper graphs: Has edges that connect > 2 nodes
Multi graphs and hyper graphs not covered in CS2040S. (both not considered regular graphs for CS2040S).
(Simple) Path: Set of edges connecting 2 nodes, path intersects each node at most once (no repeated nodes).
Connected: Every pair of nodes is connected by a path.
Disconnected: Some pair of nodes not connected by a path.
Connected Components: Set of nodes that are reachable (exists some path) from each other.
Cycle: "Path" where first and last node are the same.
(Unrooted) Tree: Connected graph with no cycles. Tree with N vertices has N - 1 edges.
Forest: Graph with no cycles (>= 1 trees)
Degree of Node: Number of adjacent edges. * can draw example picture
Degree of Graph: Maximum degree of all the nodes.
Diameter: The maximum distance between two nodes, following the shortest path. Only interested in reachable nodes (not defined for unreachable).
Finding diameter:
Sparse: Average degree of nodes VS total number of nodes. E = O(V) is sparse
Star: one central node; all edges connect the centre to the other nodes. (also a tree) Degree = n - 1 Diameter = 2
Clique (Complete Graph): All pairs are connected by edges. Degree = n - 1 Diameter = 1
Line (or path): Line graph Degree = 2 Diameter = n - 1 Line is also a bipartite graph.
Cycle: Circle Degree = 2 Diameter = n / 2 or (n - 1) / 2 based on odd/even Even length cycle is a bipartite graph. Odd length cycle cannot be a bipartite graph.
Bipartite Graph: Nodes divided into two sets with no edges between nodes in the same set. Maximum Diameter = n - 1 (essentially a Line Graph) To prove bipartite graph: re-arrange/colour into a bipartite graph
A graph does NOT contain an odd length cycle <=> it is a bipartite graph. https://math.stackexchange.com/questions/311665/proof-a-graph-is-bipartite-if-and-only-if-it-contains-no-odd-cycles
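The odd-cycle characterisation gives a simple test: 2-colour the graph with BFS and fail if an edge joins two same-coloured nodes. A sketch (hypothetical class; adjacency-list representation assumed):

```java
import java.util.*;

class BipartiteDemo {
    // bipartite iff BFS 2-colouring never puts both endpoints
    // of an edge on the same side (iff no odd cycle)
    static boolean isBipartite(List<List<Integer>> adj) {
        int n = adj.size();
        int[] colour = new int[n]; // 0 = uncoloured, 1 / 2 = the two sides
        for (int s = 0; s < n; s++) {
            if (colour[s] != 0) continue; // restart per connected component
            colour[s] = 1;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                for (int w : adj.get(v)) {
                    if (colour[w] == 0) {
                        colour[w] = 3 - colour[v]; // opposite side
                        queue.add(w);
                    } else if (colour[w] == colour[v]) {
                        return false; // odd cycle found
                    }
                }
            }
        }
        return true;
    }
}
```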
4-Colouring Problem Nodes: tiles Edges: pairs of adjacent tiles A map in a plane can be coloured using 4 colours such that adjacent tiles sharing a common boundary do not share the same colour.
Sliding Puzzle Every state is a node. Moves = edges
For n = 9: maximum degree 4 (maximum possible number of moves); nodes = 9!; edges at most 4 * 9! (bounded by the maximum degree).
Diameter = scrambled -> unscrambled
Rubik's Cube Every state is a node. Moves = edges
Number of nodes (for 2x2x2 cube) = 7! * 3^8 (7! and not 8! because symmetry rule, fix one of the pieces) ^ upper bound that includes illegal states
For 3x3x3 cube: 43 quintillion vertices (approximately)
Diameter of n x n x n cube: $\Theta(n^2 /\log n)$
Chosen based on if graph is dense or sparse and what type of queries to be optimised for.
If a graph is dense, then use an adjacency matrix, else use an adjacency list. Dense: |E| = $\Theta(V^2)$
Nodes: stored in an array Edges: linked list per node
// or LinkedList<Integer>
class NeighbourList extends LinkedList<Node> {}
class Node {
int key;
NeighbourList neighbours;
}
class Graph {
Node[] nodes;
}
// Don't do this:
List<List<Integer>> nodes;
// ^ harder to read, debug, extend
Memory usage for G = (V, E) array of size: |V| linkedlist of size: |E|
Total O(V + E)
For a cycle: O(V) For a clique: O(V^2)
Fast query: find a neighbour of a node Fast query: enumerate all neighbours Slower query: whether 2 nodes are neighbours
If 2 nodes are connected: O(min(|V|, |E|))
n * n matrix where matrix[i][j] = 1 iff there is an edge between nodes i and j.
matrix^n: entry (i, j) counts walks of length n between i and j. Taking the limit of repeated multiplication approximates Google PageRank (simulating random walks).
class Graph {
// either Node[][] (to store per-edge data)
// or boolean[][] (if you only care about adjacency / path lengths don't matter)
boolean[][] matrix;
}
Memory usage for G = (V, E) array of size: V * V Total: O(V^2)
For a cycle: O(V^2) For a clique: O(V^2)
Fast query: whether 2 nodes are neighbours Slow query: find a neighbour of a node Slow query: enumerate all neighbours
If 2 nodes are connected: O(1)
Only visit every node once!
BFS/DFS visits every node and edge in the graph, but NOT every path in the graph.
Visiting every possible path is exponential (an undirected graph), so we clearly don't want to do this unless there is no other better choice (which there IS).
DON'T modify BFS/DFS to backtrack/re-visit nodes; it will probably end up exponential or loop forever.
BFS/DFS forms a tree/forest (depends on number of connected components).
For a connected graph, this is because n - 1 new nodes will be discovered from the source node. With each discovery, the parent pointer is an edge. The final structure will thus consist of n nodes and n - 1 edges and will form one connected component. Thus, it is a tree by definition of a tree.
A shortest path NEVER contains a cycle: if a path contained a cycle, removing the cycle would give a shorter path (proof by contradiction). Most algorithms therefore construct shortest-path trees.
Use BFS.
Use a queue. Takes edge from vertex that was discovered least recently.
Explore level by level. Keep track of visited nodes with a boolean[]. And only visit if not visited before.
Cross edges = edges that lead to a node that has already been visited before. Parent edges = edges used to discover a new node in BFS (they form the BFS tree).
// need to restart for nodes not visited (if not connected)
BFS(Node[] nodeList, int startId) {
// keep track of visited nodes
boolean[] visited = new boolean[nodeList.length];
// Java boolean[] entries already default to false
visited[startId] = true;
int[] parent = new int[nodeList.length];
Arrays.fill(parent, -1);
List<Integer> frontier = new ArrayList<>();
frontier.add(startId);
while (!frontier.isEmpty()) {
// can reuse same queue if using queue (just keep track of number of nodes in a level)
List<Integer> nextFrontier = new ArrayList<>();
for (Integer v : frontier) {
// do things to v here
for (Integer w : nodeList[v].neighbours) {
if (!visited[w]) {
// only visit each node once
visited[w] = true;
parent[w] = v;
nextFrontier.add(w);
}
}
}
frontier = nextFrontier;
}
}
For Adjacency List: Each vertex only added to frontier ONCE (aka only visited once) -> O(V) Each v.neighbours only enumerated ONCE -> O(E), each edge is examined twice, once from each end Time Complexity of BFS: O(V + E)
For Adjacency Matrix Time Complexity of BFS: O(V^2)
Use a stack or recursion stack. Takes edge from vertex that was discovered most recently.
DFS does not necessarily yield the shortest path
Use recursion as the stack :D
DFS parent graph is a tree (because never visit same node twice) parent edges logic similar to BFS
DFS_visit(Node[] nodeList, boolean[] visited, int startId) {
for (Integer v : nodeList[startId].neighbours) {
if (!visited[v]) {
visited[v] = true;
DFS_visit(nodeList, visited, v);
}
}
}
DFS(Node[] nodeList) {
boolean[] visited = new boolean[nodeList.length];
Arrays.fill(visited, false);
for (int start = 0; start < nodeList.length; start++) {
if (!visited[start]) {
visited[start] = true;
DFS_visit(nodeList, visited, start);
}
}
}
For Adjacency List Time complexity: O(V + E) (same reason as BFS)
For Adjacency Matrix Time Complexity: O(V^2)
Undirected graphs are NOT considered directed graphs (in CS2040S).
Each edge is unique (to not be a multi graph).
$(v, w) \neq (w, v)$
In-degree: number of incoming edges Out-degree: number of outgoing edges
Exactly the same but only store outgoing edges in linked list
Exactly the same but no longer symmetrical matrix
Follow outgoing edges Ignore incoming edges
Follow outgoing edges Backtrack (through incoming edges) when unrolling recursion
Every directed acyclic graph (DAG) has a topological ordering. Reachability in a DAG forms a partial order: reflexive, antisymmetric, transitive.
Topological orderings are not unique.
Must have:
Pre-order traversal of trees is topological ordering of the tree because it respects the direction of edges where parents are ordered before children.
Can use post-order DFS to find a topological ordering: process each node when it is last visited and put it at the BACK of the ordering. Also need to restart the DFS multiple times because we don't know where the actual start(s) are.
for (int start = 0; start < nodeList.length; start++) {
if (!visited[start]) {
visited[start] = true;
DFS_visit(nodeList, visited, start);
// post-order
add to end of ordering
}
}
O(V + E) time complexity because it's just a DFS with an extra O(1) of work per node
Reverse of post-order DFS.
Kahn's Algorithm is a form of BFS. A bit more difficult to implement than DFS.
Repeat:
O(V + E) time complexity
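A runnable Java sketch of Kahn's Algorithm (the array-of-adjacency-lists representation and the class/method names are my own assumptions, not from the notes):

```java
import java.util.ArrayDeque;
import java.util.Queue;

class KahnTopoSort {
    // adj[v] lists the outgoing neighbours of v
    static int[] sort(int[][] adj) {
        int n = adj.length;
        int[] inDegree = new int[n];
        for (int[] neighbours : adj)
            for (int w : neighbours)
                inDegree[w]++;
        // start with all sources (in-degree 0)
        Queue<Integer> queue = new ArrayDeque<>();
        for (int v = 0; v < n; v++)
            if (inDegree[v] == 0) queue.add(v);
        int[] order = new int[n];
        int idx = 0;
        while (!queue.isEmpty()) {
            int v = queue.poll();
            order[idx++] = v;
            // removing v may turn some of its neighbours into sources
            for (int w : adj[v])
                if (--inDegree[w] == 0) queue.add(w);
        }
        if (idx < n) throw new IllegalStateException("graph has a cycle");
        return order;
    }
}
```

Each vertex enters the queue exactly once and each edge decrements an in-degree exactly once, which is where the O(V + E) bound comes from.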
For every vertex v and w in the component:
A single node is trivially a strongly connected component.
Cycles are strongly connected components.
Contracting each strongly connected component into a single node yields an acyclic graph (the strongly connected components can only be connected to each other in a non-cyclic way).
A DAG has no cycles, hence no two distinct nodes u, v are in the same strongly connected component. Thus, a DAG with n nodes has n strongly connected components.
DAG: Directed Acyclic Graph
CANNOT use BFS to find shortest path in weighted graphs because it does not attempt all possible paths
$\delta(u,v)$ = shortest distance from u to v
Triangle Inequality $\delta(S,C)\leq \delta(S,A)+\delta(A,C)$
EVERY shortest path algorithm makes use of this triangle inequality.
Need to maintain a distance estimate for each node.
int[] dist = new int[V.length];
Arrays.fill(dist, INFINITY);
dist[start] = 0;
At every iteration:
If a constant C > 0 is added to every edge weight: the shortest path may change (because not every path has the same number of edges). If every edge weight is multiplied by a constant C > 0: the shortest path does not change.
// application of triangle inequality
void relax(int u, int v) {
// if going through u gives a shorter path to v, update the estimate
if (dist[v] > dist[u] + weight(u, v)) {
dist[v] = dist[u] + weight(u, v);
}
}
Need to relax edges in the correct order to only relax each edge once. Look at Dijkstra's Algorithm below.
n = V.length;
for (i = 0; i < n - 1; i++) {
// if during an iteration all the relaxes do nothing, terminate early
for (Edge e : graph) {
relax(e);
}
}
If relax every edge in the graph (n - 1) times, will have correct final answer. Can terminate early if an entire sequence of relax has no effect. n - 1 times because the longest path is n - 1 edges (if no negative weight cycle)
If P is the shortest path from S to D, and P goes through X, then P is also the shortest path from S to X (and from X to D). After i iterations of Bellman-Ford, the estimates along the first i hops of any shortest path are correct. After 1 iteration, NOT ALL nodes 1 hop away from the source are correct (only the ones that are 1 hop along a shortest path are guaranteed)!
Thus, by induction, after n iterations, shortest path reaches the destination.
Time complexity: O(E * V) in the worst case but can terminate early!
Should work because triangle inequality still holds.
Notion of shortest path is not well-defined with negative weight cycles (negative infinity essentially).
Bellman-Ford does not produce correct shortest paths when negative weight cycles exist. If after |V| iterations there is still a change => a negative weight cycle exists. Thus, Bellman-Ford can be used to detect negative weight cycles.
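A runnable Java sketch of Bellman-Ford with the negative weight cycle check (the edge-list representation, names, and the INF sentinel are assumptions, not from the notes):

```java
import java.util.Arrays;

class BellmanFord {
    static final int INF = Integer.MAX_VALUE / 2; // half of MAX_VALUE avoids overflow when adding

    // edges[i] = {from, to, weight}; returns null if a negative weight
    // cycle is reachable from start
    static int[] shortestPaths(int n, int[][] edges, int start) {
        int[] dist = new int[n];
        Arrays.fill(dist, INF);
        dist[start] = 0;
        // n - 1 rounds suffice: the longest simple path has n - 1 edges
        for (int i = 0; i < n - 1; i++) {
            boolean changed = false;
            for (int[] e : edges) {
                if (dist[e[0]] == INF) continue; // never relax from unreached nodes
                if (dist[e[0]] + e[2] < dist[e[1]]) {
                    dist[e[1]] = dist[e[0]] + e[2];
                    changed = true;
                }
            }
            if (!changed) break; // early termination: a full round of no-op relaxes
        }
        // any further possible improvement implies a negative weight cycle
        for (int[] e : edges)
            if (dist[e[0]] != INF && dist[e[0]] + e[2] < dist[e[1]]) return null;
        return dist;
    }
}
```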
Bellman-Ford will still work.
Effectively unweighted so just use BFS easier.
This queue-based alternative makes it easy to detect paths that include negative weight cycles without interfering with the results for nodes whose paths do not include a negative weight cycle.
State needed: boolean[] onQ, Queue<Node> queue, int[] distances
One way is to loop while queue is not empty AND number of loops do not exceed V * V.
Another way is to check for existence of negative weight cycles every V iterations (parent pointers are needed).
While looping, relax all the edges of the polled node and add the other endpoint of each relaxed edge to the queue if onQ[dest] is false.
After said loop, every node left in the queue (and every node which is linked to those nodes and so forth) has a path that includes a negative weight cycle. Thus, can set the distances of these nodes to NEGINF.
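A sketch of the queue-based variant described above (the nested-array adjacency representation, the V * V dequeue cap, and the recursive NEGINF marking are my own choices):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class QueueBellmanFord {
    static final int INF = Integer.MAX_VALUE / 2;
    static final int NEGINF = Integer.MIN_VALUE / 2;

    // adj[v] = array of {to, weight} pairs
    static int[] shortestPaths(int[][][] adj, int start) {
        int n = adj.length;
        int[] dist = new int[n];
        Arrays.fill(dist, INF);
        dist[start] = 0;
        boolean[] onQ = new boolean[n];
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        onQ[start] = true;
        long polls = 0;
        // cap the work: more than V * V dequeues implies a negative cycle
        while (!queue.isEmpty() && polls < (long) n * n) {
            int v = queue.poll();
            onQ[v] = false;
            polls++;
            for (int[] e : adj[v]) {
                if (dist[v] + e[1] < dist[e[0]]) {
                    dist[e[0]] = dist[v] + e[1];
                    if (!onQ[e[0]]) { queue.add(e[0]); onQ[e[0]] = true; }
                }
            }
        }
        // anything still queued (and everything reachable from it) has a
        // path through a negative weight cycle
        while (!queue.isEmpty()) markNegInf(adj, dist, queue.poll());
        return dist;
    }

    private static void markNegInf(int[][][] adj, int[] dist, int v) {
        if (dist[v] == NEGINF) return; // already marked, stop the recursion
        dist[v] = NEGINF;
        for (int[] e : adj[v]) markNegInf(adj, dist, e[0]);
    }
}
```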
Proof that a right order of relaxing edges exists
If relax edges in the correct order, each edge is only relaxed once.
Thus, an algorithm with O(E) relaxations is possible. BUT the proof can't be used as the algorithm itself: you need the answer to determine the right order.
Relax the edges in the right order such that each edge is only relaxed once! Use priority queue. Take edge from vertex that is closest to source.
Necessary assumptions:
Idea Only relax an edge once its estimate is correct (and will never change again)!
Connecting all the edges which last caused each node to change value (while relaxing) will form a shortest path tree
Use a priority queue
Can use AVL Tree (with a hash table) or heap to implement a priority queue
Indexed by priority but still need to do fast contains() One solution is to use a hash table to point to the nodes in the AVL Tree
decreaseKey(key, priority): delete, then decrease, then re-insert back -> still O(log n)
public class Dijkstra {
private Graph G;
private IPriorityQueue pq = new PriorityQueue();
private double[] distTo;
private int[] parent;
void searchPath(int start) {
pq.insert(start, 0.0);
distTo = new double[G.size()];
Arrays.fill(distTo, INFINITY);
distTo[start] = 0;
while (!pq.isEmpty()) {
int w = pq.deleteMin();
for (Edge e : G[w].neighbours) {
// the outer while runs <= V times; across the whole run, each edge is relaxed at most once
relax(e);
}
}
}
// relax is O(log V)
void relax(Edge e) {
int v = e.from();
int w = e.to();
double weight = e.weight();
if (distTo[w] > distTo[v] + weight) {
distTo[w] = distTo[v] + weight;
parent[w] = v;
if (pq.contains(w)) {
pq.decreaseKey(w, distTo[w]);
} else {
pq.insert(w, distTo[w]);
}
}
}
}
Time complexity if implemented with an AVL tree priority queue:
main loop: O(V + E)
insert: O(log n) -> at most V times
extractMin: O(log n) -> at most V times
decreaseKey: O(log n) -> at most E times
isEmpty(): O(1) -> at most V times
Overall: O((V + E) log V) = O(E log V)
V = number of nodes in the connected component of source
E >= V - 1 (because every node in the connected component is connected), thus V can be dropped from the final asymptotic running time
Every edge crossing the boundary has been relaxed.
Fringe vertices: neighbour of a finished vertex (in priority queue)
Claim: When taking a node out of the priority queue, the estimate will be correct and it can join the finished set of vertices.
Proof by induction: Base:
Inductive step:
To prove the claim that "It has a correct estimate": Prove by contradiction, if fringe vertex has another shorter path then by triangle inequality, will find some contradiction.
Can terminate Dijkstra Algorithm immediately once the destination is dequeued from the priority queue. This is because once a node is dequeued, the estimate is correct.
PQ Implementation | insert | deleteMin | decreaseKey | Dijkstra |
---|---|---|---|---|
Array | 1 | V | 1 | $O(V^2)$ |
AVL Tree | $\log V$ | $\log V$ | $\log V$ | $O(E\log V)$ |
d-way Heap | $d\log_d V$ | $d\log_d V$ | $\log_d V$ | $O(E\log_{E/V}V)$ |
Fibonacci Heap | 1 | $\log V$ | 1 | $O(E+V\log V)$ |
Fibonacci Heap has good amortised performance but real-world performance is slower great YouTube video on fibonacci heap
Invariant of estimate being correct when removed from priority queue no longer holds true. Thus Dijkstra Algorithm cannot be performed on a graph with negative weight edges.
Re-weigh weights by increasing each edge's weight by some constant to make it non-negative will NOT WORK.
This will favour paths with less edges, which may not be the actual shortest path.
There is a way to re-weight weights properly for this to work, but it increases the time complexity so much that running Bellman-Ford is faster.
Heap trees are an implementation of priority queues.
For minimum heap tree: For an array of indices [1, 2, ..., n]
In array representation: the array has $2^k - 1$ items for some k (every level completely full), and every slot in this array is filled (no gaps). Use a resizeable array if the heap is resizable too.
Every full binary tree is also a complete binary tree.
In array representation: heap has array of $x$ items, every item in this array is filled (no gaps in the array).
Operations have to maintain the complete binary tree property.
Claim: For heap with unique values, second largest element must be a child of the root node. Proof: Suppose it is not, then there is a contradiction because that means there exists a node > second largest yet < largest.
Height is $\lfloor \log n\rfloor$ because complete binary tree
Minimum Heap Tree is the same except it is "lesser than or equal to" instead.
bubbleUp and bubbleDown
bubbleUp(Node u):
// root = no parent = no violation
while u is not the root:
if (u.priority > u.parent.priority) {
swap(u, u.parent)
u = u.parent
} else {
break // no violation above, done
}
bubbleDown(Node u):
// leaf = no children = no violation
while u is not a leaf:
// in a complete tree, u may have only a left child
maxChild = child of u with the larger priority
if (maxChild.priority > u.priority) {
// bubble down to the larger side
swap(u, maxChild)
u = maxChild
} else {
break // heap property restored
}
Each iteration of bubbleUp/bubbleDown raises/lowers the level of the invariant violation. Once it reaches the root/leaf level, there is no more invariant violation by definition!
They have different complexities based on where you start it (nearer to leaf or root)!
Maximum number of swaps is O(log n) Time complexity: O(log n)
To find insertion point:
Note: to get O(1) search within the tree, use a hash table mapping values to their nodes
Can only remove the root (because it's the maximum)
O(n log n)
In-place Not stable
Inserting n items into a heap in O(n) time.
Invariant: after i iterations, the last i items each root a valid heap, because after the bubbleDown on x, the tree rooted at x is a heap.
Intuition: Bubbling down results in an overall O(n) because there are much more nodes below than on top.
https://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity
Cost of bubbleDown on item j is $O(\log(\frac{n}{j+1})+1)$ because height of item $j$ is $O(\log(\frac{n}{j+1}))$ and cost = height + 1
$\text{Heapify cost}=\sum_{j=1}^{n}\left(\log\frac{n}{j+1}+1\right)$ $\leq n+\log\frac{n}{1}+\log\frac{n}{2}+\log\frac{n}{3}+\cdots+\log\frac{n}{n}$ $=n+\log\frac{n^n}{n!}$ $\approx n+\log\frac{n^n}{(\frac{n}{e})^n}\text{ by Stirling's Approximation}$ $=n+\log e^n=n+n\log e=n(1+\log e)=O(n)$
Alternate heapify cost analysis using GP of GP.
NOTE: Can skip all the leaf nodes because they have no children so definitely nothing to bubbleDown. (Doesn't change complexity but is an optimisation by about half)
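A runnable sketch of O(n) heapify for a max-heap stored in a 0-indexed array (the notes use 1-indexing; here the children of index i are 2i + 1 and 2i + 2, and the class/method names are mine):

```java
class MaxHeap {
    // heapify in O(n): bubbleDown every non-leaf node, bottom-up
    static void heapify(int[] a) {
        // indices >= a.length / 2 are leaves, so skip them entirely
        for (int i = a.length / 2 - 1; i >= 0; i--)
            bubbleDown(a, i);
    }

    static void bubbleDown(int[] a, int i) {
        while (2 * i + 1 < a.length) { // while i has at least one child
            int child = 2 * i + 1;
            // pick the larger child (the right child may not exist)
            if (child + 1 < a.length && a[child + 1] > a[child]) child++;
            if (a[child] <= a[i]) break; // no violation below i
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;
        }
    }
}
```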
Single-Source Shortest Paths BFS: Unweighted graphs Bellman-Ford: simple, general Dijkstra: Faster but only for non-negative weights
DAGs: topological sort (using either reverse DFS post-order OR Kahn's Algorithm) + relax DAG Longest Path: negation of weights + shortest path on DAG
Topological ordering because when reach any node, all nodes pointing into it definitely has the correct distance (because topologically before this node).
Time complexity (assuming adjacency list) Finding topological ordering: O(V + E) Relaxing E edges: O(E) Overall: O(V + E)
Negate all the edges to reduce this problem to be a Shortest Path on DAG problem
For a cyclic graph, CANNOT do this negation trick because it leads to negative weight cycles and everything breaks.
In general, finding longest path on a general graph is a NP-hard problem (no polynomial solution).
In general, modifying graph/input is better than modifying the algorithm because modifying algorithms could easily introduce bugs or change time complexities.
$\log(a_1*a_2* \cdots *a_n)=\log a_1+\log a_2+\cdots +\log a_n$ $\log(\frac{1}{a_1*a_2* \cdots *a_n})=-\log(a_1*a_2* \cdots *a_n)=-\log a_1+-\log a_2+\cdots +-\log a_n$
Note that 0 < x < 1, then $-\log x$ > 0. So, can use Dijkstra.
Start, Destination and single checkpoint in the middle. Run SSSP on start and SSSP on destination. Choose the best checkpoint and paths.
If need 2 different type of checkpoints, modify the graph such that each vertex will store whether a type A checkpoint and type B checkpoint has been visited respectively in the path.
If n checkpoints, its NP-hard!
Union Find
Naive Version (without wall breaking and static set) isConnected: return true if A and B are in the same connected component Preprocess: Identify connected components. Label each location with its component number
Given a set of objects:
union(): connect two objects
find(): is there a path connecting 2 objects?
Optimised for quick finding
If objects are not integers, convert them to integers.
Data Structure int[] componentId Two objects are connected if they have the same component identifier.
Essentially very flat trees (tree with height of max 2) Same component => in the same subtree Can model the above int array as a flat tree.
find(int p, int q):
// O(1)
return componentId[p] == componentId[q]
// O(n) because need to search through ALL objects
union(int p, int q):
// choose either to be the new id
updateComponent = componentId[q]
for (int i = 0; i < componentId.length; i++) {
if (componentId[i] == updateComponent) {
componentId[i] = componentId[p];
}
}
int[] parent (parent pointers) Two objects are connected if they are part of the same tree
These non-binary trees will not be balanced.
find(int p, int q) {
// O(n) if a very tall tree (all objects in same component)
while (parent[p] != p) p = parent[p];
while (parent[q] != q) q = parent[q];
return root of p == root of q;
}
union(int p, int q) {
// O(n) if a very tall tree (all objects in same component)
while (parent[p] != p) p = parent[p];
while (parent[q] != q) q = parent[q];
// pick either node to be the parent
parent[p] = parent[q];
}
This version is very bad because trees are tall (unbalanced).
Need to compress the tree to achieve fast union!
IDEA: correctly choose which root becomes the new root: make the smaller tree a child of the larger tree's root (union-by-size; using height instead gives union-by-height).
union(int p, int q) {
while (parent[p] != p) p = parent[p];
while (parent[q] != q) q = parent[q];
// pick the larger tree's root to be the root
if (size[p] > size[q]) {
// p is larger
parent[q] = parent[p];
size[p] = size[p] + size[q];
} else {
// q is larger
parent[p] = parent[q];
size[q] = size[p] + size[q];
}
}
Tree will end up being much shorter!
Maximum depth of tree: O(log n) IDEA: Height only increases when total size doubles. Claim: A tree of height k has size at least $2^k$. Thus, conversely, height of tree of size n is at most log n.
Base case: Tree of height 0 contains 1 item Induction: Let T1 be the bigger tree (in terms of size) and T2 be the smaller tree.
Time complexity Find: O(log n) Union: O(log n)
AKA Union-By-Rank (rank = log(size)) AKA Union-By-Height
Important Properties:
After finding the root: set the parent of each traversed node to the root
When path compressing, height changes but weight doesn't :D
findRoot(int p) {
root = p
while (parent[root] != root) root = parent[root]
while (parent[p] != p) {
temp = parent[p]
parent[p] = root
p = temp
}
return root
}
// alternative: point each node to its grandparent instead (path halving)
findRoot(int p) {
while (parent[p] != parent[parent[p]]) {
// halving the path achieves a similar effect to pointing directly to the root
parent[p] = parent[parent[p]]
p = parent[p]
}
return parent[p]
}
Do path compression whenever you are traversing up the tree. (It doesn't worsen the complexity.)
Time complexity Find: O(log n) Union: O(log n)
The first find operation can be O(n) because no path compression has been done yet.
Theorem: Starting from empty, any sequence of m union/find operations on n objects takes: $O(n + m\alpha (m,n))$
ALPHA IS NOT CONSTANT but is always < 5 in our universe
IMPOSSIBLE to achieve linear time
Time complexity Find: $O(\alpha (m,n))$ Union: $O(\alpha (m,n))$
Alpha is the Inverse Ackermann function
Weighted Union is faster:
Weighted Union + Path Compression very fast:
Implementation | find | union |
---|---|---|
quick-find | O(1) | O(n) |
quick-union | O(n) | O(n) |
weighted-union | O(log n) | O(log n) |
path-compression | O(log n) | O(log n) |
weighted-union with path compression | $\alpha(m,n)$ | $\alpha(m,n)$ |
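Putting the last row into code: a sketch of weighted union with full path compression (the class and method names are mine; find(p, q) follows the notes' boolean convention):

```java
class UnionFind {
    private final int[] parent;
    private final int[] size;

    UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    // find the root, then point every traversed node directly at it
    private int root(int p) {
        int r = p;
        while (parent[r] != r) r = parent[r];
        while (parent[p] != p) { int next = parent[p]; parent[p] = r; p = next; }
        return r;
    }

    boolean find(int p, int q) { return root(p) == root(q); }

    // weighted union: the smaller tree becomes a child of the larger tree's root
    void union(int p, int q) {
        int rp = root(p), rq = root(q);
        if (rp == rq) return;
        if (size[rp] < size[rq]) { int t = rp; rp = rq; rq = t; }
        parent[rq] = rp;
        size[rp] += size[rq];
    }
}
```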
Applications
Union-Split-Find Insert and delete edges. Very difficult problem.
B4 = root + B0 + B1 + B2 + B3 = B3 + B3. size(Bk) = $\Theta(2^k)$, height(Bk) = k.
For weighted, undirected graphs. (For directed graphs, it gets much more complicated and is not well-defined)
A spanning tree is an acyclic subset of the edges that connects all nodes. Note: if a connected edge subset contains a cycle, removing one cycle edge keeps it connected and reduces the weight.
A minimum spanning tree is a spanning tree with minimum weight.
MST cannot be used for finding shortest path (entirely unrelated!) Because the shortest path might not be included in the MST.
To avoid issue of duplicate edge weights, use a unique identifier like a node ID as a tie-breaker.
Minimum cost network to stream a movie to all the users.
A cut of a graph G = (V, E) is a partition of the vertices V into two disjoint subsets. An edge crosses a cut if it has one vertex in each of the two sets.
For every node, the minimum outgoing edge is always part of the MST (by the cut property: cut that one node away from the rest of the graph). For every node, the maximum outgoing edge can be part of the MST (e.g. when it is the node's only outgoing edge).
Red rule (based on cycle property): If C is a cycle with no red arcs, then colour the max-weight edge in C red. Blue rule (based on cut property): If D is a cut with no blue arcs, then colour the minimum-weight edge in D blue.
Greedily apply the red rule and blue rule to arbitrary edges until no more edges can be coloured.
On termination: The blue edges are an MST.
This generic MST algorithm is the basis of ALL MST algorithms.
Note: If all edges have different costs, there is a unique MST. https://math.stackexchange.com/questions/352163/show-that-theres-a-unique-minimum-spanning-tree-if-all-edges-have-different-cos
A divide-and-conquer implementation of this does not work because converse of cut property is not true!
BUT divide-and-conquer along the median does work.
Basic Idea: Identify cuts and keep growing the set S. Fairly similar to Dijkstra's Algorithm's idea with some small changes that completely change the outcome!
S: set of nodes connected by blue edges, initially, some random starting node Next, add the minimum weight edge on cut from S to the rest of the nodes (cut property) and relax the edges. Repeat this until all nodes are added into S.
Use a priority queue to find the minimum weight edge on cut. In this case, priority queue is used to store nodes and not edges.
// pq stores nodes
pq = new PQ();
for (Node v : G.V()) {
pq.insert(v, infinity);
}
pq.decreaseKey(start, 0);
HashSet<Node> S = new HashSet<Node>();
S.add(start);
HashMap<Node, Node> parent = new HashMap<Node, Node>();
parent.put(start, null);
while (!pq.isEmpty()) {
Node src = pq.deleteMin();
S.add(src);
for (Edge e : src.edgeList()) {
Node dest = e.otherNode(src);
if (!S.contains(dest)) {
double weight = e.getWeight();
// only update if this edge is cheaper than dest's current best connecting edge
if (weight < priority of dest in pq) {
pq.decreaseKey(dest, weight);
parent.put(dest, src);
}
}
}
}
Proof:
Time complexity using an AVL tree as the priority queue: O(E log V)
each vertex added/removed once from the priority queue -> O(V log V)
each edge causes <= 1 decreaseKey -> O(E log V)
E >= V - 1 because connected -> O(E log V)
With a Fibonacci heap:
O(V) -> insert all nodes at the start (Fibonacci heap is of size O(V))
O(E) -> decreaseKey called at most once per edge
O(V log V) -> extractMin for each node
Overall: O(E + V log V)
Basic Idea:
Can terminate early once V - 1 edges are obtained
Use Union-Find data structure. Connect two nodes if they are in the same blue-tree.
// sort edges
Edge[] sortedEdges = sort(G.E());
ArrayList<Edge> mstEdges = new ArrayList<Edge>();
UnionFind uf = new UnionFind(G.V());
for (int i = 0; i < sortedEdges.length; i++) {
Edge e = sortedEdges[i];
Node src = e.first();
Node dest = e.second();
if (!uf.find(src, dest)) {
mstEdges.add(e);
uf.union(src, dest);
}
}
Proof:
Time complexity on a connected graph: O(E log E) = O(E log V)
sorting edges -> O(E log E) = O(E log V), since E <= V^2 means log E = O(log V)
for E edges: Find and (sometimes) Union -> O(E $\alpha$(E))
Basic Idea: Add all "obvious" blue edges Parallelisable MST algorithm.
For every node: add minimum adjacent edge (at least n/2 edges added, at most n/2 connected components)
Repeat: for every connected component, add minimum outgoing edge
Initially, n connected components, one for each node
Each iteration:
At most O(log V) iterations, because the number of connected components at least halves each iteration. Each iteration costs O(E + V) = O(E). Thus, overall O(E log V)
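A runnable sketch of Boruvka's Algorithm using a bare parent-pointer union-find (the edge-array representation and names are assumptions; distinct edge weights are assumed so that no cycles can be formed):

```java
class Boruvka {
    // edges[i] = {u, v, weight}; assumes a connected graph with distinct
    // edge weights (distinctness guarantees the added edges form no cycle)
    static int mstWeight(int n, int[][] edges) {
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int components = n, total = 0;
        while (components > 1) {
            // index of the cheapest outgoing edge per component root, -1 = none
            int[] cheapest = new int[n];
            java.util.Arrays.fill(cheapest, -1);
            for (int i = 0; i < edges.length; i++) {
                int ru = root(parent, edges[i][0]);
                int rv = root(parent, edges[i][1]);
                if (ru == rv) continue; // both endpoints in the same blue tree
                if (cheapest[ru] == -1 || edges[i][2] < edges[cheapest[ru]][2]) cheapest[ru] = i;
                if (cheapest[rv] == -1 || edges[i][2] < edges[cheapest[rv]][2]) cheapest[rv] = i;
            }
            // add every component's cheapest outgoing edge (skip duplicates)
            for (int c = 0; c < n; c++) {
                if (cheapest[c] == -1) continue;
                int[] e = edges[cheapest[c]];
                int ru = root(parent, e[0]);
                int rv = root(parent, e[1]);
                if (ru == rv) continue; // this round already merged the two
                parent[ru] = rv;
                total += e[2];
                components--;
            }
        }
        return total;
    }

    private static int root(int[] parent, int p) {
        while (parent[p] != p) p = parent[p];
        return p;
    }
}
```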
Run BFS or DFS because any spanning tree is a MST. O(E) in connected graph
Cost would be x*(V-1) where x is the cost of an edge.
For Kruskal's Algorithm (when edge weights are small integers, say 1 to 10): use an array of size 10 to sort. Each A[i] holds a linked list of edges of weight i. Then run the second portion of Kruskal's Algorithm as usual.
Then, sorting is O(E). Overall: O(E $\alpha$(E))
For Prim's Algorithm Inserting/removing nodes from PQ: O(V) decreaseKey: O(E) Total: O(V + E) = O(E)
CANNOT apply this same technique to Dijkstra's Algorithm
For directed graphs, a spanning tree without a root doesn't make sense. A rooted spanning tree: every node is reachable from the root.
Prim's, Kruskal's, Boruvka's do not work for minimum rooted spanning tree because cut property and cycle property do not work.
VERY hard problem (not covered in CS2040S)
Special case that makes it very easy :D
A directed acyclic graph with one root (no incoming edges)
For every node except the root: add minimum weight incoming edge
No cycles because acyclic graph
Each edge is chosen only once (thus overall V - 1 edges) Thus, must be a tree.
Since every node except the root must have at least one incoming edge, and we picked the minimum weight incoming edge for each, this is the minimum spanning tree!
By adding/subtracting some constant k to all edges, the MST does not change. Only the relative order of edge weights matters, not the absolute edge weights.
MSTs with negative weights are no problem! So, for a maximum spanning tree, just multiply each edge weight by -1 and run a Minimum Spanning Tree algorithm (the most negative total corresponds to the maximum).
OR run Kruskal's Algorithm in reverse (start from largest edge weight to smallest)
https://cs.stackexchange.com/questions/4942/finding-paths-with-smallest-maximum-edge-weight
Optimal solution can be constructed from optimal solutions to smaller sub-problems.
It is a property of many problems: both greedy algorithms (Dijkstra's, MST) and divide-and-conquer algorithms (Merge Sort, Fast Fourier Transform) rely on it.
Overlapping subproblems The same smaller problem is used to solve multiple different bigger problems. Divide and conquer problems do not have overlapping subproblems.
Most DP problems can be converted into graph problems and then topologically sort and relax.
Table that captures all subproblems.
Find the length of the longest increasing subsequence.
Node => element Edge u -> v => u to v is increasing
Now, use negated SSSP from every node and take the best + 1.
$O(n^3)$
This causes a lot of repeated computation.
This is defined as a suffix.
The sub problem is S[i] = LIS(A[i..n]) starting at A[i]
S[n] = 0 S[i] = max(all adjacent) + 1
// assuming V is already topo sorted (for LIS, the array order is a topological order)
LIS(V):
// memo array
int[] S = new int[V.length];
for (int i = 0; i < V.length; i++)
S[i] = 0;
S[V.length - 1] = 1;
for (int v = V.length - 2; v >= 0; v--) {
// find maximum S over all outgoing edges
int max = 0;
for (int w : V[v].neighbours) {
if (S[w] > max) {
max = S[w];
}
}
S[v] = max + 1;
}
This is defined as a prefix.
S[i] = LIS(A[1..i]) ending at A[i]
S[1] = 1 S[i] = max(all previous) + 1
static int lengthOfLIS (int[] nums) {
int[] dp = new int[nums.length];
dp[0] = 1;
for (int i = 1; i < nums.length; i++) {
dp[i] = 1;
for (int j = 0; j < i; j ++) {
if (nums[j] < nums[i])
dp[i] = Math.max(dp[i], dp[j] + 1);
}
}
int res = 0;
for (int x : dp) res = Math.max(res, x);
return res;
}
Both DP Solution 1 and 2 are $O(n^2)$
Can solve in O(n log n) by binary searching subproblems
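A runnable sketch of this O(n log n) approach (the tails array stores the smallest possible tail of an increasing subsequence of each length; the class/method names are mine):

```java
class LongestIncreasingSubsequence {
    // tails[len] = smallest possible tail of an increasing
    // subsequence of length len + 1
    static int lengthOfLIS(int[] nums) {
        int[] tails = new int[nums.length];
        int size = 0; // filled portion of tails
        for (int x : nums) {
            // binary search for the first tail >= x
            int lo = 0, hi = size;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (tails[mid] < x) lo = mid + 1;
                else hi = mid;
            }
            tails[lo] = x;          // extend or improve a subsequence
            if (lo == size) size++; // x extended the longest subsequence
        }
        return size;
    }
}
```

Because the filled prefix of tails is sorted, each element costs one O(log n) binary search, giving O(n log n) overall.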
tails array: stores the smallest tail of each increasing subsequence, for lengths 0 to n - 1 (0-indexed by length).
The tails array is non-decreasing, or increasing if values are unique, and thus is monotonic. So we can do binary search!
For each element, binary search for its position in the tails array.
Keep track of the size of the tails array (not the actual size, but the size that has values). The size will represent the existence of a subsequence at that length.

Input:
Find maximum prize possible.
Find maximum prize possible in at most k edges.
O(kVE): the transformed graph has kV nodes and kE edges (after transforming the graph, recompute the node and edge counts). Build the transformed graph in O(kV + kE) = O(kE) because connected graph. Topo-sort/longest path -> O(kV + kE). Once per source: repeat V times -> O(kVE)
Optimisation Realise that you don't need to run DAG_SSSP V times! Create super-source that is connected to every other node in the original graph (with 0 weight) Run DAG_SSSP ONCE from this super-source node.
O(kV + kE)
P[v, k] = maximum prize that you can collect starting at v and taking exactly k steps P[v, 0] = 0
P[v, k] = max(P[w1, k-1] + w(v, w1), P[w2, k-1] + w(v, w2), ...) where w1, w2, ... are the neighbours of v
int LazyPrizeCollecting(V, E, kMax) {
int[][] P = new int[V.length][kMax+1];
for (int i = 0; i < V.length; i++)
for (int j = 0; j < kMax + 1; j++)
P[i][j] = 0;
for (int k = 1; k < kMax + 1; k++) {
for (int v = 0; v < V.length; v++) {
int max = -INFINITY;
// find max prize in next step
for (int w : V[v].neighbours) {
if (P[w][k-1] + E[v][w] > max) {
max = P[w][k-1] + E[v][w];
}
}
P[v, k] = max;
}
}
return maxEntry(P);
}
Total Cost:
let X(j) = X[0..j]
LCS(A(n), B(n)) = A[n] == B[n] ? LCS(A(n-1), B(n-1)) + 1 : max(LCS(A(n), B(n-1)), LCS(A(n-1), B(n)))
$O(mn)$ where m = A.length, n = B.length
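The recurrence translates directly into a bottom-up O(mn) table; a sketch (the string inputs and names are my own):

```java
class LongestCommonSubsequence {
    // L[i][j] = LCS length of the prefixes a[0..i) and b[0..j)
    static int lcs(String a, String b) {
        int m = a.length(), n = b.length();
        int[][] L = new int[m + 1][n + 1]; // row/column 0 = empty prefix
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1))
                    L[i][j] = L[i - 1][j - 1] + 1; // match: extend the LCS
                else
                    L[i][j] = Math.max(L[i - 1][j], L[i][j - 1]); // drop one char
            }
        }
        return L[m][n];
    }
}
```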
https://en.wikipedia.org/wiki/Vertex_cover Find set of nodes C (C = vertex covers) where every edge is adjacent to at least one node in C.
NP-Complete problem (no known polynomial algorithm), but there is an easy 2-approximation algorithm (no easy solution does better (read: lower) than a 2-approximation; also provably no polynomial approximation better than $\sqrt 2-\epsilon$ for $\epsilon > 0$)
Simpler Problem Minimum vertex cover on a tree (maybe not binary tree)
Need 2 subproblems: consider 2 cases: v is covered and not covered S[v, 0] = size of vertex cover in subtree rooted at node v, if v is not covered S[v, 1] = size of vertex cover in subtree rooted at node v, if v is covered
Base case is leaf S[leaf, 0] = 0 S[leaf, 1] = 1
If v is not covered, all of v's children has to be covered. S[v, 0] = S[w1, 1] + S[w2, 1] + ... S[v, 1] = 1 + min(S[w1, 0], S[w1, 1]) + ... where w1, w2, ... are children of v
Total 2V subproblems
int treeVertexCover(V) {
int[][] S = new int[V.length][2];
for (int v = V.length - 1; v >= 0; v--) {
// start from leaves
if (V[v].children.size() == 0) {
// v is leaf
S[v][0] = 0;
S[v][1] = 1;
} else {
S[v][0] = 0;
S[v][1] = 1;
for (int w : V[v].children) {
S[v][0] += S[w][1];
S[v][1] += Math.min(S[w][0], S[w][1]);
}
}
}
// return min at root
return Math.min(S[0][0], S[0][1]);
}
O(V)
Input of weighted, directed graph. Finding shortest path between any 2 nodes.
Trivial version: run SSSP on each vertex. Dynamic programming is the efficient way to compute APSP.
For positive weights only: run Dijkstra's Algorithm on every query and memoize. Preprocessing: 0. Responding to q queries: O(VE log V) in the worst case (at most V distinct sources after memoization); sparse graph: $O(V^2\log V)$
For identical weights: run BFS from every node, O(V(E + V)). In a dense graph: $O(V^3)$; in a sparse graph: $O(V^2)$
Preprocessing: all-pairs-shortest-path Responding to q queries: O(q)
There are also some versions that sacrifice some response time for faster preprocessing.
let S[v, w, P] be the shortest path from v to w that only uses intermediate nodes in the set P.
Limit ourselves to V+1 different sets P (out of $2^V$)
P0 = {} P1 = {1} P2 = {1, 2} ... $P_n$ = {1, 2, ..., n}
Either use or don't use a node S[v, w, $P_8$] = min(S[v, w, $P_7$], S[v, 8, $P_7$] + S[8, w, $P_7$]) S[v, w, $P_{i+1}$] = min(S[v, w, $P_i$], S[v, i+1, $P_i$] + S[i+1, w, $P_i$])
In the end, you find the path using with $P_n$ which allows you to use ALL intermediate nodes.
int[][] APSP(E) {
// 3D matrix is unnecessary (only the previous set is used), thus only need a 2D matrix
int[][] S = new int[V.length][V.length];
for (int v = 0; v < V.length; v++) {
for (int w = 0; w < V.length; w++) {
// initialise with original weights
S[v][w] = E[v][w];
}
}
// for sets P0, P1, ...
for (int k = 0; k < V.length; k++) {
// for every pair of nodes
for (int v = 0; v < V.length; v++) {
for (int w = 0; w < V.length; w++) {
// either use or don't use the node
S[v][w] = Math.min(S[v][w], S[v][k] + S[k][w]);
}
}
}
return S;
}
$O(V^3)$ because triply nested loop (better than naive Dijkstra -> $O(V^3\log V)$ when dense)
$V^3$ subproblems
Return actual path and not just the distance. Don't store all the shortest paths: this takes $V^3$ space (V choose 2 pairs * V hops)
Only store the first hop for each destination, similarly to a routing table. In Floyd-Warshall, store "intermediate node" whenever modifying/updating the matrix entry for a pair. Then you can easily reconstruct the actual path in O(V).
$O(V^2)$ space
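A sketch of Floyd-Warshall that also records the first hop of every pair, allowing path reconstruction in time proportional to the path length (all names and the INF sentinel are my own assumptions):

```java
import java.util.ArrayList;
import java.util.List;

class FloydWarshallPaths {
    static final int INF = Integer.MAX_VALUE / 2;

    // dist is modified in place; next[v][w] = node after v on a shortest v->w path
    static int[][] shortestPaths(int[][] dist, int[][] next) {
        int n = dist.length;
        for (int v = 0; v < n; v++)
            for (int w = 0; w < n; w++)
                next[v][w] = (dist[v][w] < INF) ? w : -1; // direct edge or unreachable
        for (int k = 0; k < n; k++)
            for (int v = 0; v < n; v++)
                for (int w = 0; w < n; w++)
                    if (dist[v][k] != INF && dist[k][w] != INF
                            && dist[v][k] + dist[k][w] < dist[v][w]) {
                        dist[v][w] = dist[v][k] + dist[k][w];
                        next[v][w] = next[v][k]; // first hop now follows v's path to k
                    }
        return dist;
    }

    // reconstruct the path by repeatedly following first hops
    static List<Integer> path(int[][] next, int v, int w) {
        List<Integer> p = new ArrayList<>();
        if (next[v][w] == -1) return p; // unreachable
        p.add(v);
        while (v != w) { v = next[v][w]; p.add(v); }
        return p;
    }
}
```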
Return a matrix M where M[v, w] = 1 if there exists a path from v to w M[v, w] = 0 otherwise
Similarly to Floyd-Warshall but without the actual distance.
M[v, w, k + 1] = OR(M[v, w, k], AND(M[v, k, k], M[k, w, k]))
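The recurrence is the same triple loop as Floyd-Warshall with OR/AND in place of min/+; a sketch (class/method names are mine):

```java
class TransitiveClosure {
    // m[v][w] becomes true iff there is a path from v to w
    static boolean[][] closure(boolean[][] m) {
        int n = m.length;
        for (int k = 0; k < n; k++)       // allow k as an intermediate node
            for (int v = 0; v < n; v++)
                for (int w = 0; w < n; w++)
                    m[v][w] = m[v][w] || (m[v][k] && m[k][w]);
        return m;
    }
}
```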
For (v, w), the bottleneck is the heaviest edge on a path between v and w
Return a matrix B where B[v, w] = weight of the minimum bottleneck.