CS 486
Rational Agent Paradigm
An entity that perceives and acts.
- Function from percepts to actions.
- Performance measures.
- Goal achievement, resource consumption.
- Caveat: Computational limitations and environmental constraints mean we do not have perfect rationality.
Task Environments
To design a rational agent the task environment must be defined.
- Performance measures.
- Environment.
- Actuators.
- Sensors.
Properties of Task Environment
- Fully Observable vs. Partially Observable.
- Deterministic vs. Stochastic.
- Is the next state completely determined by the current state and action executed?
- Episodic vs. Sequential.
- Discrete vs. Continuous.
- Static vs. Dynamic.
- Single Agent vs. Multiagent.
Search
A search problem consists of a state space, a successor function, a start state, and a goal test.
- Solution is a sequence of actions (plan) from the start state to some goal state.
Example: Sliding Tiles Problem.
- State: Board configuration.
- Start: Any state.
- Actions: Slide the blank tile into an adjacent space.
- Goal: Does it match picture?
Example: N Queens Problem.
- State: 0 to N queens.
- Start: 0 queens.
- Actions: Add a queen to an empty space.
- Goal: N queens none attacking.
Alternate representation which is more complicated but has a smaller search space.
- State: 0 to N queens, first n columns not attacking each other.
- Start: 0 queens.
- Actions: Add a queen to the first empty column none attacking.
- Goal: N queens. And babu is cutie
State Space
- The world state includes every last detail of the environment.
- A search space keeps only the details needed for planning (abstraction).
Representing State
- State space graph.
- Vertices correspond to states, with one vertex for each state.
- Edges correspond to successors.
- Goal test is a set of goal nodes.
- We search for a solution by building a search tree and traversing it to find a goal state.
Search Tree
- Start state is the root of the tree.
- Children are the successors.
- Plan is a path in the tree. A solution is a path from the root to a goal node.
For most problems we do not actually generate the entire tree.
- We expand a node by applying all legal actions on it and adding the new states to the tree.
Generic Search Algorithm
- Initialize with initial state of the problem.
- Repeat.
- If no candidate nodes, failure.
- Choose leaf node for expansion according to search strategy.
- If node contains goal state, return solution.
- Otherwise, expand the node. Add resulting nodes to the tree.
- Nodes can be classified as start node, explored nodes, frontier, unexplored nodes.
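A minimal Python sketch of this generic loop, assuming problem-specific `successors` and `is_goal` callbacks; the frontier discipline (FIFO queue vs. stack here) is what determines the search strategy.

```python
from collections import deque

def generic_search(start, successors, is_goal, use_stack=False):
    """Generic tree search: the frontier discipline picks the strategy.
    use_stack=False -> FIFO frontier (BFS-like); use_stack=True -> LIFO (DFS-like)."""
    frontier = deque([(start, [start])])        # (state, path from the root)
    while frontier:
        state, path = frontier.pop() if use_stack else frontier.popleft()
        if is_goal(state):
            return path                         # solution: path from start to goal
        for nxt in successors(state):           # expand the chosen leaf node
            frontier.append((nxt, path + [nxt]))
    return None                                 # no candidate nodes left: failure
```

With `use_stack=True` the same loop explores depth-first; with the default FIFO frontier it explores breadth-first.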
Key Properties
- Completeness: Is the algorithm guaranteed to find a solution if one exists?
- Optimality: Does the algorithm find the optimal solution?
- Time complexity.
- Space complexity: Size of the fringe.
- b: Branching factor.
- m: Maximum depth.
- d: Depth of the nearest goal node.
Example: DFS.
- Complete: No, it can get stuck in an infinite loop. If m is finite, then it is complete.
- Optimal: No. Finds the first goal, not necessarily the optimal.
- Time complexity: Whole tree, O(b^m).
- Space complexity: Fringe and related path information. O(m \cdot b).
Example: BFS.
- Complete: Yes.
- Optimal: Depends on whether the shallowest goal node is the one with the least cost.
- Time complexity: Whole tree, O(b^{d + 1}).
- Space complexity: O(b^d).
Iterative Deepening Search
Combines search methods to take advantage of DFS's space complexity and BFS's completeness and shallow-solution advantage.
- Complete: Yes.
- Optimal: Depends on whether the shallowest goal node is the one with the least cost.
- Time complexity: Whole tree, O(b^d).
- Space complexity: O(m \cdot b).
Cost-Sensitive Search
- Strategy: Expand cheapest node first.
- Implementation: Priority queue.
- Complete: Yes.
- Optimal: Yes, if all step costs are at least some \epsilon > 0.
- Time Complexity: O(b^{1 + \frac{C^*}{\epsilon}}), where C^* is the optimal cost.
- Space Complexity: Same as BFS.
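A possible priority-queue implementation of the cheapest-first strategy, assuming `successors(s)` yields (next state, step cost) pairs.

```python
import heapq, itertools

def uniform_cost_search(start, successors, is_goal):
    """Expand the cheapest frontier node first using a priority queue."""
    tie = itertools.count()                        # avoids comparing states on cost ties
    frontier = [(0.0, next(tie), start, [start])]  # (path cost g, tiebreak, state, path)
    best_g = {start: 0.0}
    while frontier:
        g, _, state, path = heapq.heappop(frontier)
        if is_goal(state):                         # goal test when the node is expanded
            return g, path
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2, next(tie), nxt, path + [nxt]))
    return None                                    # failure: frontier exhausted
```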
Uninformed search expands nodes on the distance from the start node. Why not try to expand on the distance to the goal?
Heuristics
A function that estimates the cost of reaching a goal from a given state.
- If h(n_1) < h(n_2) we guess that it is cheaper to reach the goal from n_1 than n_2.
- We require h(goal) = 0.
Example: Best First Search.
Search strategy: Expand the most promising node according to the heuristic.
- Not complete (can expand forever down an unbounded path). Time and space complexity are O(b^m).
- A^\star Search: Expand according to the cost of the path plus the heuristic, f(n) = g(n) + h(n).
- Note: The goal test must be done when a node is expanded, not when it is first discovered.
- Time complexity is O(b^{\delta d}), where \delta is the relative error of the heuristic. We are required to store all expanded nodes in memory.
- Admissible Heuristic: 0 \le h(n) \le h^\star(n), where h^\star(n) is the true cost.
- Heuristic is consistent if h(a) \le cost(a, b) + h(b). Required if we have a graph (multiple paths to a goal).
Let G be the optimal goal and G^\prime a sub-optimal goal, so f(G) < f(G^\prime). Assume by way of contradiction that G^\prime is selected for expansion before some node n on the optimal path. Then f(G^\prime) = g(G^\prime) > g(G) = cost(s, n) + cost(n, G) \ge g(n) + h(n) = f(n), which is a contradiction since n would have been selected first.
- Dominated Heuristic: h_1(n) dominates h_2(n) if for all n, h_1(n) \ge h_2(n) and there exists one strict inequality.
- Theorem: If h_1(n) dominates h_2(n), then A^\star with h_1 will never expand more nodes than A^\star with h_2.
Let c^\star be the cost of the optimal solution. A^\star expands all nodes such that f(n) < c^\star, i.e. h(n) < c^\star - g(n). Since g(n) is the same under either heuristic and h_2(n) \le h_1(n), any node expanded using h_1 must also be expanded using h_2.
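A sketch of A^\star under the same assumed problem interface as the earlier searches, expanding by f(n) = g(n) + h(n) and applying the goal test at expansion time.

```python
import heapq, itertools

def a_star(start, successors, is_goal, h):
    """A* search: expand by f(n) = g(n) + h(n). With an admissible, consistent
    heuristic, the first goal expanded is optimal."""
    tie = itertools.count()
    frontier = [(h(start), next(tie), 0.0, start, [start])]  # (f, tiebreak, g, state, path)
    best_g = {start: 0.0}
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):                  # goal test on expansion, not on generation
            return g, path
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None
```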
Constraint Satisfaction
Special subset of search problems.
- States are defined by variables X_i with values from domains D_i.
- Goal test is a set of constraints specifying allowable combinations of values for subsets of variables.
Types of CSPs
- Discrete variables.
- Finite domains: If domain has size d, there are O(d^n) complete assignments.
- Infinite domains: Linear constraints are solvable but non-linear are undecidable.
- Continuous variables: Linear programming polynomial time.
- Unary constraints.
- Binary constraints: Representable with a constraint graph.
- Higher-order constraints.
- Soft constraints: Constrained optimization problem.
Commutativity
Key insight is that CSPs are commutative.
- The order of actions does not affect the outcome.
- Algorithm takes advantage of this.
Backtracking
Basic search algorithm for CSPs.
Select unassigned variable X
For every value {x_1, ..., x_n} in domain of X
If value satisfies constraints, assign X = x_i and exit loop
If an assignment is found
Move to next variable
If no assignment is found
Back up to preceding variable and try a different assignment
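A compact Python version of the backtracking procedure above; the constraint check `consistent(var, value, assignment)` is an assumed problem-specific callback.

```python
def backtracking_search(variables, domains, consistent, assignment=None):
    """Basic backtracking for a CSP."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment                          # every variable assigned: solution
    var = next(v for v in variables if v not in assignment)  # select an unassigned variable
    for value in domains[var]:
        if consistent(var, value, assignment):     # value satisfies the constraints so far
            assignment[var] = value
            result = backtracking_search(variables, domains, consistent, assignment)
            if result is not None:
                return result
            del assignment[var]                    # back up and try a different value
    return None                                    # no value works: backtrack
```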
Backtracking Efficiency
Ordering
Which variables should be tried first? In what order should the variable’s values be tried?
- Most constrained variable: Try the variable with the fewest remaining legal moves. Also known as minimum remaining values.
- Most constraining variable: Try the variable with the most constraints on the remaining variables.
- Least constraining value: Try the value which rules out the fewest values in the remaining variables.
Filtering
How do we detect failure early?
- Forward checking: Keep track of remaining legal values for unassigned variables. Terminate search when any variable has no legal values.
- Does not detect all future failures early.
- Arc consistency: Given domains D_1, D_2, an arc is consistent if for all x \in D_1, there exists y \in D_2 such that x, y are consistent.
- K-consistency: For all sets of K - 1 variables and consistent assignment of values, a consistent value is always assignable to any Kth variable.
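A sketch of arc-consistency filtering in the style of AC-3, assuming a `neighbors` map over the constraint graph and a binary-constraint test `consistent(X, x, Y, y)`.

```python
from collections import deque

def ac3(domains, neighbors, consistent):
    """Make every arc consistent, pruning domains in place.
    Returns False as soon as some domain becomes empty (early failure)."""
    queue = deque((X, Y) for X in domains for Y in neighbors[X])
    while queue:
        X, Y = queue.popleft()
        # values of X with no consistent partner in Y's domain
        removed = [x for x in domains[X]
                   if not any(consistent(X, x, Y, y) for y in domains[Y])]
        if removed:
            domains[X] = [x for x in domains[X] if x not in removed]
            if not domains[X]:
                return False                       # domain wiped out: failure detected early
            for Z in neighbors[X]:
                if Z != Y:
                    queue.append((Z, X))           # re-check arcs into X
    return True
```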
Structure
Is it possible to exploit the problem structure? Idea: Break down the graph into connected components and solve each component separately.
- Independent Subproblems: Solve each connected component separately.
- Tree Structures: If the constraint graph is a tree, pick a root, order the variables topologically, and solve in O(nd^2).
- Cutsets: For a general graph, define a subset of variables S such that when S is removed the remaining graph is a tree. Try all assignments to the subset, O(d^{|S|}), and for each check whether the remaining tree has a solution, O((n - |S|)d^2); the total runtime is O(d^{|S| + 2}(n-|S|)).
- Tree Decomposition: Group nodes into mega-nodes forming a tree. All variables occur in at least one mega-node. Variables connected by constraints must appear together. If a variable is in multiple mega-nodes, it must appear in every mega-node on the path between them. If w is the width of the tree (1 less than the size of the largest sub-problem), the runtime is O(nd^{w + 1}). Finding a min-width decomposition is NP-hard, but we can use heuristics.
Constraints and Local Search
For many problems, the path is unimportant.
Iterative Improvement Methods
- Start at some potential solution. Generate all possible points to move to. If stuck then reset, otherwise move to one of the points.
Hill Climbing (Gradient Descent):
- Take a step in the direction which improves the current solution value the most.
- Not necessarily complete (flat optima), not optimal; gets stuck at local optima and plateaus. Random restarts fix the local-optima issue.
Simulated Annealing
Escape local optima by allowing “downhill” movements.
- Take selected move if it improves the solution, otherwise take it with probability p.
Boltzmann Distribution
- e^{\frac{\Delta V}{T}}, T > 0 is the temperature parameter.
- When T is high, bad moves have a chance, exploration phase.
- When T is low, bad moves have low probability of being selected, exploitation phase.
- If T decreases slowly enough, then we are theoretically guaranteed to reach the optimum (see the sketch below).
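A simulated-annealing sketch using the acceptance rule above; `neighbors`, `value`, and the cooling `schedule` are assumed to be supplied by the problem.

```python
import math, random

def simulated_annealing(start, neighbors, value, schedule, steps=10_000):
    """Maximize value(s). neighbors(s) is assumed to return a list of candidate moves,
    schedule(t) gives the temperature T > 0 at step t."""
    current = start
    for t in range(1, steps + 1):
        T = schedule(t)
        if T <= 1e-12:
            break
        nxt = random.choice(neighbors(current))
        delta = value(nxt) - value(current)
        # always take uphill moves; take downhill moves with probability e^{delta / T}
        if delta > 0 or random.random() < math.exp(delta / T):
            current = nxt
    return current

# e.g. geometric cooling: schedule = lambda t: 100 * (0.99 ** t)
```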
Genetic Algorithms
- Encoded candidate solution is an individual.
- Individuals have fitness which corresponds to the quality of the solution.
- Populations change over generations by applying operators.
- Selection: Fitness-proportional selection could lead to overcrowding. Ranking by percentile of fitness gets around this. Softmax (Boltzmann) selection, similar to simulated annealing, also works.
- Crossover: Select random crossover point. Implemented with bitmasks.
- Mutate: With small probability, modify a feature.
- In a new generation, for every child, select parents, crossover, then mutate with low probability.
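A toy genetic algorithm over bit strings showing fitness-proportional selection, single-point crossover, and mutation; the population size, mutation rate, and `fitness` callback are illustrative assumptions.

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=50, generations=200, p_mutate=0.01):
    """Evolve bit-string individuals toward higher fitness."""
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(ind) + 1e-9 for ind in population]   # fitness-proportional selection
        new_population = []
        for _ in range(pop_size):
            mom, dad = random.choices(population, weights=weights, k=2)
            point = random.randrange(1, length)                 # single crossover point
            child = mom[:point] + dad[point:]
            child = [1 - b if random.random() < p_mutate else b for b in child]  # mutation
            new_population.append(child)
        population = new_population
    return max(population, key=fitness)

# e.g. genetic_algorithm(fitness=sum) evolves toward the all-ones string
```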
Adversarial Search
- Focused on zero-sum games.
- MAX player maximizes utility, MIN player minimizes utility.
- Optimal strategy leads to outcomes at least as good as any other strategy, given that MIN is playing optimally.
- Nash Equilibrium: s^\star \in S is a Nash Equilibrium if for all i, u_i(s^\star) \ge u_i(s_i, s_{-i}^\star), for all s_i \in S_i.
- Theorem (Kuhn): Every finite extensive form game has a subgame perfect equilibrium.
Minimax
M(n) = \begin{cases}u(n),\ &\text{$n$ is terminal} \\ \max_{c \in succ(n)} M(c),\ &\text{$n$ is MAX node} \\ \min_{c \in succ(n)} M(c),\ &\text{$n$ is MIN node} \end{cases}.
- Complete for finite games. Time complexity is O(b^m), space complexity O(bm), where m is the number of moves in the game. Optimal.
Alpha-Beta Pruning
Compute minimax without searching the entire tree.
- Alpha: Value of the best choice so far on path for MAX.
- Beta: Value of the best (lowest) choice so far on path for MIN.
- We update alpha and beta as we compute M. Prune as soon as the current node is known to be worse than our current best result.
- Results in the same outcome as full minimax. In the worst-case it offers no improvement.
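A recursive alpha-beta sketch that returns the same value as full minimax; `is_terminal`, `utility`, and `successors` are assumed game-specific callbacks.

```python
import math

def alpha_beta(state, is_terminal, utility, successors, maximizing,
               alpha=-math.inf, beta=math.inf):
    """Minimax value with alpha-beta pruning."""
    if is_terminal(state):
        return utility(state)
    if maximizing:                      # MAX node: raise alpha, prune when alpha >= beta
        value = -math.inf
        for child in successors(state):
            value = max(value, alpha_beta(child, is_terminal, utility, successors,
                                          False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                   # MIN above will never allow this branch
        return value
    else:                               # MIN node: lower beta symmetrically
        value = math.inf
        for child in successors(state):
            value = min(value, alpha_beta(child, is_terminal, utility, successors,
                                          True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value
```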
Optimizations
We might not have enough resources to compute full results.
- Evaluation Functions: Return an estimate of the expected utility. Need to be fast to compute. Often a weighted function of the state features.
- Cutting Off Search: Instead of searching to terminal states, we search low, and then use evaluation functions.
- Typically, we search deeper when the node is clearly good. This helps avoid the horizon effect, where we are surprised by a bad state just beyond the search cutoff.
Monte-Carlo Tree Search (MCTS)
- Build a search tree according to outcomes of simulated plays.
- Selection: Traverse the tree following a policy using Upper Confidence Bounds until you run out of information required to progress.
- v_i + c\sqrt{\frac{\ln N}{n_i}}, where n_i is the number of times we have visited the node, N is the total number of runs, and v_i is the expected value of the node given the information we have.
- Expansion: Eventually we run out of information to progress, so expand a random child.
- Simulate: Quickly simulate a game from the node.
- Backpropagation: Based on the simulation results, update the expected values along the path.
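A small sketch of the UCB selection step, assuming each child node tracks `visits` and `total_value` attributes.

```python
import math

def ucb_select(children, c=1.4):
    """Pick the child maximizing v_i + c * sqrt(ln N / n_i)."""
    N = sum(child.visits for child in children)     # total runs through these children
    def ucb(child):
        if child.visits == 0:
            return math.inf                          # always try unvisited children first
        v = child.total_value / child.visits         # empirical value estimate v_i
        return v + c * math.sqrt(math.log(N) / child.visits)
    return max(children, key=ucb)
```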
Stochastic Games
Element of randomness.
- Modelled by adding chance nodes between MIN and MAX layers, with weights equal to the probability of every option.
- Compute expected values for minimax.
Machine Learning
- Learning: Improving behaviour based on experience. Range of behaviour, accuracy on tasks, or speed of execution are considered improvements.
- Supervised Classification: Given a set of pre-classified examples, classify a new instance.
- Feedback: Supervised learning is explicitly told what must be learned by each example.
- Unsupervised Learning: Find natural classes for examples.
- Feedback: No feedback is given, learner has to find patterns themselves.
- Reinforcement Learning: Determine what to do based on rewards and punishments.
- Feedback: Feedback is only given after a sequence of actions.
- Transfer Learning: Learning from an expert.
- Active Learning: Actively seeking to learn.
- Representation: Richer representations are more useful for subsequent problem solving but are harder to learn.
- Measured against how well the agent performs with new examples.
- Tendency to prefer one hypothesis over another is a bias.
Given representations, data, and bias, we have a search problem. We search through the space of possible representations to best fit the data. The search space is typically too large for systematic search, so we use iterative improvement. So a learning problem is made up of a search space, an evaluation function, and a search method.
Supervised Learning
Given a set of input features X_1, ..., X_n, a set of target features f(x), and a set of training examples, predict the target features for a set of test examples. This is done by returning a function that approximates f.
- Classification: f is discrete.
- Regression: f is continuous.
- Inductive Learning Hypothesis: We hope that the approximation of f which performs well over a sufficiently large set of training examples will also perform well over any unobserved examples.
- Evaluating Performance: Suppose y is a feature and e is an example. y(e) is the true value of y for e. y^\star(e) is the predicted value. The error is a measure of how close y^\star(e) is to y(e).
- Receiver Operating Characteristic (ROC) metrics: Recall = Sensitivity = \frac{TP}{TP + FN}. Specificity = \frac{TN}{TN + FP}. Precision = \frac{TP}{TP + FP}. F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.
Decision Trees
Classify instances by sorting them down the tree from root to leaf.
- Leaves are the classifications.
- Any boolean function is representable using decision trees.
- Data needs to be discrete.
Inducing Decision Tree
Recursively choose the most significant attribute as the root of the subtree.
- Entropy: Measure of unpredictability, uncertainty of random variables.
- I(P(v_1), ..., P(v_n)) = \sum_{i=1}^n -P(v_i) \log_2(P(v_i)) from Information Theory. Assume 0\log(0) = 0.
- High entropy implies that we gain information by observing that value.
- Information Gain: The gain of attribute A is the expected reduction in entropy. Given that attribute A divides the training set E into subsets E_1, ..., E_v, remainder(A) is the weighted sum of their entropies: remainder(A) = \sum_{i=1}^v \frac{p_i + n_i}{p + n} I(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}).
- So IG(A) = I(\frac{p}{p + n}, \frac{n}{p + n}) - remainder(A). We choose the attribute A which maximizes the information gain, i.e. the minimum remainder (a sketch of these computations appears after this list).
- Divide the large set of examples into disjoint sets. Apply the learning algorithm to the training set and measure performance against the test set.
- Overfitting: Hypothesis h \in H overfits training data if there exists h^\prime, h \neq h^\prime such that h has a smaller error on the training examples but h^\prime has a smaller error on the entire distribution of instances.
- Reduces performance of decision trees by 10%-25%.
- Errors are caused by bias, variance in the data, and noise in the data.
- Bias-Variance Trade-off: Complicated models will not have enough data (low bias, high variance). Simple models will have lots of data (high bias, low variance).
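A sketch of the entropy and information-gain computations referenced above, taking (positive, negative) counts per subset as an assumed input format.

```python
import math

def entropy(probs):
    """I(P(v_1), ..., P(v_n)) = sum of -P(v_i) log2 P(v_i), with 0 log 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information_gain(p, n, splits):
    """IG of an attribute splitting (p positive, n negative) examples into
    `splits`, a list of (p_i, n_i) pairs."""
    total = entropy([p / (p + n), n / (p + n)])
    remainder = sum((pi + ni) / (p + n) * entropy([pi / (pi + ni), ni / (pi + ni)])
                    for pi, ni in splits if pi + ni > 0)
    return total - remainder

# e.g. information_gain(6, 6, [(4, 0), (2, 2), (0, 4)]) = 2/3 bits
```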
Avoiding Overfitting
- Regularization: Prefer small decision trees over large ones. Add complexity penalty to stopping criteria.
- Pseudocounts: Add data based on previous knowledge.
- Cross Validation: Split the training set into training and validation sets. Use validation as a pretend test set. Optimize the hypothesis to perform well against the validation set.
- K-fold Validation: Divide training set into K subsets, pull one out as validation and train on the rest.
Linear Classifiers
Data is of the form (x, f(x)), x \in \mathbb{R}^n, f(x) \in \{0, 1\}.
- Want w with h_w(x) = \begin{cases}1,\ &w \cdot x \ge 0\\0,\ &w \cdot x < 0\end{cases}. Find w which minimizes the loss function L(h_w) = \sum_i (y_i - h_w(x_i))^2.
- Gradient Descent: \frac{\partial}{\partial w_i} L(h_w) = 2(y-h_w(x))(-x_i). Iteratively until convergence, we update w_i = w_i - \alpha \frac{\partial}{\partial w_i} L(h_w), where \alpha is the step size or learning rate.
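A perceptron-style sketch of the per-example update above for a thresholded linear classifier; the data format (a list of (features, label) pairs) and learning rate are assumptions.

```python
import random

def train_linear_classifier(data, alpha=0.1, epochs=100):
    """Train a thresholded linear classifier with the per-example update
    w_i <- w_i + alpha * (y - h_w(x)) * x_i; labels y are assumed to be in {0, 1}."""
    data = list(data)
    n = len(data[0][0])
    w = [0.0] * (n + 1)                                  # last weight acts as the bias
    for _ in range(epochs):
        random.shuffle(data)                             # stochastic (shuffled) updates
        for x, y in data:
            xb = list(x) + [1.0]
            h = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            w = [wi + alpha * (y - h) * xi for wi, xi in zip(w, xb)]
    return w
```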
Ensembles
Use many hypotheses and combine their predictions.
Bagging
Majority vote.
- Assume each hypothesis makes error with probability p.
- Probability that the majority is wrong is \sum_{k = \lceil \frac{n}{2} \rceil}^n {n \choose k}p^k (1-p)^{n-k}.
- Example: Random Forests.
Randomly sample subsets of training data and features, learn decision trees off the subsets. Classify using majority vote of the forest.
Boosting
- In reality, hypotheses are not equally correct and independent.
- Want to increase the weight of good hypotheses and decrease the weight of bad hypotheses, then use a weighted majority.
- Similarly, we want to increase the weight of misclassified examples.
- Example: AdaBoost.
Compute the accuracy p of every hypothesis. Multiply the weights of correctly predicted examples by \frac{1 - p}{p} (decreasing them when p > 0.5). Weight the hypothesis itself by \log(\frac{p}{1-p}).
Neural Networks
Relationships between input and output may be complicated.
- Want methods for learning arbitrary relationships.
- Neurons: Dendrites are inputs, soma is activity, axon is output, synapse is links. Learning changes how efficiently signals transfer.
- Artificial Neuron: in(i) = \sum_{j}w_{i, j}a(j), a(i) = g(in(i)). Activation functions should be non-linear and mimic real neurons, so output close to 0 or 1.
- Common Activation Functions.
- Rectified Linear Unit: g(x) = \max\{0, x\}.
- Sigmoid Function: g(x) = \frac{1}{1 + e^{-x}}.
- Hyperbolic Tangent: g(x) = \tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}.
- Threshold: g(x) = \begin{cases}1,\ &x \ge b\\0,\ &x < b\end{cases}.
Network Structure
- Feed-forward ANN: DAG, no internal state, maps input to outputs.
- Recurrent ANN: Directed cyclic graph, dynamic system with internal state.
- Perceptrons: Single layered feed-forward network. Can only learn linear separators. Learning means adjusting the weights.
- E = \sum_{k} \frac{1}{2}(y_k - (h_w(x))_k)^2. Gradient descent.
- Stochastic Gradient Descent: Shuffle data at random.
- Multilayer Networks: Any continuous function is learnable with just one hidden layer.
- The last layer we can train with gradient descent. For the other layers, we do not have target values y, so how do we compute E?
- Backpropagation: Hidden layer nodes contributed some of the error at the output; the amount attributed is proportional to the connecting weight. Apply the chain rule (see the sketch below).
- Deep Learning: Neural networks with more than one hidden layer.
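A minimal one-hidden-layer network trained with backpropagation on squared error, written with NumPy; layer sizes, the learning rate, and the XOR example are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_two_layer_net(X, y, hidden=4, alpha=0.5, epochs=5000, seed=0):
    """One hidden layer, sigmoid activations, squared-error loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))    # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=(hidden, 1))              # hidden -> output weights
    for _ in range(epochs):
        # forward pass
        a1 = sigmoid(X @ W1)
        a2 = sigmoid(a1 @ W2)
        # backward pass: chain rule distributes the output error to each layer
        delta2 = (a2 - y) * a2 * (1 - a2)                      # error at the output units
        delta1 = (delta2 @ W2.T) * a1 * (1 - a1)               # error attributed to hidden units
        W2 -= alpha * a1.T @ delta2
        W1 -= alpha * X.T @ delta1
    return W1, W2

# e.g. XOR: X = np.array([[0,0],[0,1],[1,0],[1,1]]), y = np.array([[0],[1],[1],[0]])
```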
Bayesian Networks
- P(A, B) = \sum_i P(A | B_i) P(B_i).
- P(A | B) P(B) = P(B | A) P(A).
- Probabilistic Inference: Query to get a probability.
- Representing a full joint distribution over a set of random variables X_1, ..., X_n is very expensive.
- Answering queries is very slow.
Variable Independence
Two variables X, Y are conditionally independent given variable Z if given that you know the value of Z, nothing you learn about Y will influence your beliefs about X.
- P(x | y, z) = P(x | z), for all x \in X, y \in Y, z \in Z.
- Represent link using directed acyclic graph. Calculate values with marginalization.
- Bayesian Network: Graphical representation of direct dependencies over a set of variables and a set of conditional probability tables P(X | Parents(X)) quantifying the strength of influences.
- So every variable is conditionally independent of all non-descendants given its parents.
- Assume every random variable is directly influenced by at most k others, so every table is of size at most O(2^k), so the entire network is specified in O(n2^k).
Testing Independence (D-Separation)
- D-Separation: Set of variables E (evidence) separates X, Y if it blocks every undirected path between X and Y. So X, Y are conditionally independent given E.
Let P be an undirected path between X, Y. E blocks path P if there exists a node Z \in P such that.
- An arc on P goes into Z and an arc leaves Z, and Z \in E.
- Both arcs on P leave Z and Z \in E.
- Both arcs on P enter Z and neither Z nor any of its descendants are in E.
- X ... \to Z \cup Descendants(Z) \gets ... Y.
- Markov Blanket: Given its parents, children, and children's other parents, a node is conditionally independent of all other nodes in the network.
Example:
\begin{aligned} P(J) &= \sum P(J, TS, Fl, Fv, Th, M, ET) \\ &= \sum_{M} P(J, M) \\ &= \sum_{M} P(J \mid M) P(M) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M, ET) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M \mid ET) P(ET) \end{aligned}
Example:
\begin{aligned} P(J \mid ET) &= \frac{P(J, ET)}{P(ET)} \\ &= \frac{1}{P(ET)} \sum_{M} P(J, M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M, ET) P(M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M, ET) P(M \mid ET) P(ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M) P(M \mid ET) P(ET) \\ \end{aligned}
Example: Backwards Inference.
\begin{aligned} P(ET \mid J) &= \frac{P(J \mid ET)P(ET)}{P(J)} \\ &= \frac{1}{P(J)} P(J, ET) \\ \end{aligned}
Algorithm
Given query variable Q, evidence set E, remaining variables Z, set of factors F corresponding to \{Q\} \cup Z.
- Replace factors f \in F that mention a variable in E with f_{E = e}.
- Choose elimination ordering of Z.
- For Z_j \in Z, eliminate as follows.
- Compute factor g_j = \sum_{Z_j} f_1 \times ... \times f_k, where f_i are factors that include Z_j.
- Remove the factors f_i from F and add new factor g_j to F.
- Remaining factors refer to Q only. Take their product and normalize for P(Q).
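The elimination step reduces to two factor operations, sketched here with factors represented as (variable list, table keyed by value tuples), an assumed representation; computing g_j is a chain of `multiply` calls followed by a `sum_out`.

```python
from itertools import product

def multiply(f1, f2, domains):
    """Pointwise product of two factors. A factor is (vars, table): `vars` is a list
    of variable names, `table` maps value tuples (in vars order) to numbers."""
    v1, t1 = f1
    v2, t2 = f2
    out_vars = list(v1) + [v for v in v2 if v not in v1]
    table = {}
    for values in product(*(domains[v] for v in out_vars)):
        assign = dict(zip(out_vars, values))
        table[values] = t1[tuple(assign[v] for v in v1)] * t2[tuple(assign[v] for v in v2)]
    return (out_vars, table)

def sum_out(factor, var):
    """Eliminate `var` from a factor by summing it out of the table."""
    fvars, table = factor
    i = fvars.index(var)
    out_vars = fvars[:i] + fvars[i + 1:]
    out = {}
    for key, p in table.items():
        k = key[:i] + key[i + 1:]
        out[k] = out.get(k, 0.0) + p
    return (out_vars, out)
```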
Reasoning Under Uncertainty
Dynamic Inference
- Allow the world to evolve.
- Set of states representing all possible worlds.
- Time-splices representing snapshots of the world.
- Different probability distributions over states at different time-slices.
- Dynamic encoding of how distributions change over time.
Stochastic Process
- Set of states S, P(s_t | s_{t-1}, ..., s_0).
- Bayes Net with one random variable per time slice.
- Problem: there are infinitely many variables with infinitely large CPTs.
- Stationary Process: Dynamics do not change over time.
- Markov Assumption: Current state depends on a finite history of past states.
k-Order Markov Process
- Assume the last k states are sufficient.
- Problem is that uncertainty increases over time.
Hidden Markov Model
First Order Hidden Markov Model
- Set of states S, set of observations O, transition model P(s_t | s_{t-1}), observation model P(o_t | s_t).
- Monitoring: P(s_t | o_t, ..., o_1).
- Prediction: P(s_{t+k} | o_t, ..., o_1).
- Hindsight: P(s_k | o_t, ..., o_1).
- Most Likely Explanation: argmax_{s_t, ..., s_1} P(s_t, ..., s_1 | o_t, ..., o_1).
- The Viterbi algorithm is a DP algorithm that computes the most likely sequence of states (see the sketch below).
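A sketch of the Viterbi recursion, assuming dictionary lookups `init[s]`, `trans[s][s2]`, and `obs_model[s][o]` for the HMM parameters.

```python
def viterbi(states, init, trans, obs_model, observations):
    """Most likely state sequence given a sequence of observations."""
    # delta[s] = probability of the best partial path ending in s
    delta = {s: init[s] * obs_model[s][observations[0]] for s in states}
    back = []                                         # back pointers, one dict per step
    for o in observations[1:]:
        prev, delta, pointers = delta, {}, {}
        for s in states:
            best_prev = max(states, key=lambda sp: prev[sp] * trans[sp][s])
            delta[s] = prev[best_prev] * trans[best_prev][s] * obs_model[s][o]
            pointers[s] = best_prev
        back.append(pointers)
    # follow the back pointers from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```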
Dynamic Bayes Net
What if the number of states or observations are exponential?
- Encode states and observations with several random variables.
- Exploit conditional independence to save time and space.
HMMs are DBNs with one state variable and one observation variable.
- Set of states S, with observations for every state.
- Non-Stationary Processes: When process is not stationary, we must add new state components until dynamics are stationary.
- Non-Markovian Processes: Add new state components until dynamics are Markovian.
Statistical Learning
Where do the probabilities for Bayes Nets come from?
- Experts are scarce and expensive. Data is cheap and plentiful.
- Build models of the world directly from data.
Example: Candy is sold in two flavours, lime and cherry, with the same wrapper for both flavours. Bags have different ratios (e.g. 100% cherry; 50% cherry, 50% lime; 100% lime). After eating k candies, what is the flavour ratio of the bag?
- Hypothesis H is a probabilistic theory about the world.
- Data D is evidence.
\begin{aligned}P(x | d) &= \sum_i P(x | d, h_i) P(h_i | d) \\ &= \sum_i P(x | h_i) P(h_i | d) \end{aligned}
Predictions are weighted averages of the predictions of the individual hypotheses.
- Optimal: Given prior, no other prediction is correct more often than Bayesian one.
- No Overfitting: Use the prior to penalize complex hypotheses.
- Intractable: When hypothesis space is large.
- Approximations: Maximum a posteriori (MAP).
Maximum a Posteriori (MAP)
- Make predictions using the most probable hypothesis h_{MAP}.
- h_{MAP} = argmax_{h_i} P(h_i | d), P(x | d) = P(x | h_{MAP}).
- Less accurate than Bayesian prediction since it only relies on one hypothesis, but converges as data increases.
\begin{aligned}h_{MAP} &= arg max_h P(h | d) \\
&= arg max_h P(h) P(d | h) \\
&= arg max_h P(h) \Pi_i P(d_i | h)\end{aligned}
- This is a nonlinear optimization; take the log to turn the product into a sum. This does not change the optimal hypothesis.
h_{MAP} = arg max_h \left[\log P(h) + \sum_i \log P(d_i | h)\right]
Maximum Likelihood (ML)
- Simplify MAP by assuming uniform prior, P(h_i) = P(h_j).
\begin{aligned}h_{MAP} &= arg max_h P(h) P(d | h) \\
h_{ML} &= arg max_h P(d | h)
\end{aligned}
- Less accurate than Bayesian and MAP, still converges as data increases.
- No prior, so subject to overfitting.
Example:
Hypothesis h_\theta, P(cherry) = \theta, P(lime) = 1 - \theta.
- Data d: N candies with c cherry and N - c lime.
\begin{aligned}P(d | h_\theta) &= \theta^c (1 - \theta)^{N - c} \\
L(d | h_\theta) &= \log P(d | h_\theta) \\
&= c \log \theta + (N - c) \log(1 - \theta)
\end{aligned}
Find the \theta that maximizes log likelihood.
\begin{aligned}
\frac{\partial L(d | h_\theta)}{\partial \theta} &= \frac{c}{\theta} - \frac{N - c}{1 - \theta} = 0 \\
\theta &= \frac{c}{N}
\end{aligned}
- With multiple hypothesis, take partial derivatives.
- Approach is extendable to any Bayes network with complete data.
Laplace Smoothing
- What happens if we have not observed any cherry candies?
- \theta = 0 is not a good prediction.
Given observations x = (x_1, ..., x_d) from N trials, we estimate parameters \theta.
\begin{aligned}\theta &= (\theta_1, ..., \theta_d) \\
\theta_i &= \frac{x_i + \alpha}{N + \alpha d},\ \alpha > 0\end{aligned}
Naive Bayes
- Want to predict a class C based on attributes A_i.
- Parameters.
- \theta = P(C = True).
- \theta_{j, 1} = P(A_j = True | C = True).
- …
- Assumption is that A_i are independent given C.
With observed attribute values x_1, ..., x_n.
P(C | x_1, ..., x_n) = \alpha P(C) \Pi_i P(x_i | C)
From ML we know that the parameters are just observed frequencies with possible Laplace smoothing, we just need to choose the most likely class C.
Text Classification
Example application of Naive Bayes.
- Given: Collection of documents, classified as interesting or not interesting.
- Goal: Learn a classifier to look at a text of new documents and provide a label.
Data Representation
- Consider possible significant words that can occur in documents.
- Stem words, map words to their root.
- For every root, introduce common binary feature for whether the word is present in the document.
- Naive Bayes assumption that words are independent of each other, so P(y | document) \propto \Pi_{i=1}^{|Vocab|} P(w_i | y).
- Use ML parameter estimation P(w_i | y) as the percentage of documents of class y which contain word w_i.
- When \theta cannot be found analytically, use gradient search to find a good value of \theta: \theta \gets \theta + \alpha \frac{\partial L(\theta | d)}{\partial \theta}.
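A sketch of the Naive Bayes text classifier with Laplace smoothing; representing documents as word sets and passing an explicit `vocab` list are assumed conveniences.

```python
import math

def train_naive_bayes(docs, labels, vocab, alpha=1):
    """docs: list of word sets; labels: parallel list of classes."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        for w in vocab:
            count = sum(1 for d in class_docs if w in d)
            # P(word present | class), Laplace-smoothed over d = 2 values (present/absent)
            cond[(w, c)] = (count + alpha) / (len(class_docs) + 2 * alpha)
    return prior, cond

def classify(doc, prior, cond, vocab):
    """Pick the class maximizing log P(c) + sum over words of log P(w_i | c)."""
    def score(c):
        s = math.log(prior[c])
        for w in vocab:
            p = cond[(w, c)]
            s += math.log(p if w in doc else 1 - p)
        return s
    return max(prior, key=score)
```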
Expectation Maximization (EM)
Previous problems always have values of all attributes known and learning is relatively easy. Many real world problems have hidden variables with incomplete data and missing attribute values.
Maximum Likelihood Learning
- \theta_{V = True, Parent(V) = x} is the proportion of instances with Parent(V) = x that have V = True.
- What if some values are missing?
- Naive Solutions.
- Ignore examples with missing attribute values. What if all examples have missing attribute values?
- Ignore hidden variables, model might become more complex.
Direct ML
Maximize likelihood directly where E are the evidence variables and Z are the hidden variables.
\begin{aligned}
h_{ML} &= arg max_h P(E | h) \\
&= arg max_h \sum_Z P(E, Z | h) \\
&= arg max_h \sum_Z \Pi_i CPT(V_i) \\
&= arg max_h \log \sum_Z \Pi_i CPT(V_i)
\end{aligned}
Expectation-Maximization (EM)
- If we knew the missing values then we can compute h_{ML}.
- Guess h_{ML}.
- Iterate.
- Expectation: Based on h_{ML} compute expectation of (missing) values.
- Maximization: Based on expected (missing) values, compute new h_{ML}.
\begin{aligned}
h_{i+1} &= arg max_h \sum_Z P(Z | h_i, e) \log P(e, Z | h) \\
&= arg max_h \sum_Z P(Z | h_i, e) \log \Pi_j CPT_j \\
&= arg max_h \sum_Z P(Z | h_i, e) \sum_j \log CPT_j
\end{aligned}
Monotonic improvement of likelihood, P(e | h_{i+1}) \ge P(e | h_i).
Example: Two coins A, B. Probability of heads with A is \theta_A, probability of heads with B is \theta_B. Want to find \theta_A, \theta_B by performing a number of trials.
- Set of trials where we know which coin was used gives us \theta_A = 0.8, \theta_B = 0.45.
- Now, given a set of trials where we do not know which coin was used (hidden variable).
- HTTTHHTHTH
- HHHHTHHHHH
- HTHHHHHTHH
- HTHTTTHHTT
- THHHTHHHTH
Initialization: Let’s say \theta_A^0 = 0.6, \theta_B^0 = 0.5. Assume that the probability of A, B being used is the same, so P(A) = P(B) = 0.5.
- Expectation: Compute expected counts of heads and tails. P(A | T_i) = \frac{P(T_i | A) P(A)}{\sum_{j \in \{A, B\}} P(T_i | j) P(j)}.
- P(T_1 | A) = {10 \choose 5} 0.6^5 0.4^5 \approx 0.2. P(T_1 | B) = {10 \choose 5} 0.5^{10} \approx 0.246. So P(A | T_1) = \frac{0.2}{0.2 + 0.246} \approx 0.45, P(B | T_1) = 1 - P(A | T_1) \approx 0.55. Give both coins a proportional share of this trial's heads and tails.
- P(T_2 | A) = {10 \choose 9} 0.6^9 0.4^1 \approx 0.0403. P(T_2 | B) = {10 \choose 9} 0.5^{10} \approx 0.01. So P(A | T_2) \approx 0.8, P(B | T_2) \approx 0.2.
| Trial | Coin A (H, T) | Coin B (H, T) |
|---|---|---|
| 1 | 2.2, 2.2 | 2.8, 2.8 |
| 2 | 7.2, 0.8 | 1.8, 0.2 |
| 3 | 5.9, 1.5 | 2.1, 0.5 |
| 4 | 1.4, 2.1 | 2.6, 3.9 |
| 5 | 4.5, 1.9 | 2.5, 1.1 |
| Total | 21.3, 8.6 | 11.7, 8.4 |
- Maximization: Compute parameters based on the expected counts and repeat. \theta^1_A = \frac{21.3}{21.3 + 8.6} \approx 0.71, \theta^1_B = \frac{11.7}{11.7 + 8.4} \approx 0.58.
…
Eventually we will see similar probabilities to 0.8, 0.45.
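A sketch of EM for the two-coin example, with trials given as (heads, flips) pairs and both coins assumed equally likely a priori.

```python
from math import comb

def em_coins(trials, theta_a=0.6, theta_b=0.5, iterations=20):
    """EM for two coins with a hidden coin choice per trial."""
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, n in trials:
            # E-step: responsibility of each coin for this trial
            like_a = comb(n, h) * theta_a**h * (1 - theta_a)**(n - h)
            like_b = comb(n, h) * theta_b**h * (1 - theta_b)**(n - h)
            p_a = like_a / (like_a + like_b)
            heads_a += p_a * h;        tails_a += p_a * (n - h)
            heads_b += (1 - p_a) * h;  tails_b += (1 - p_a) * (n - h)
        # M-step: re-estimate parameters from the expected counts
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

# trials from the example above: [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
```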
K-Means Algorithm
- Input.
- Set of examples E.
- Input features X_1, ..., X_n.
- val(e, X_j), the value of feature X_j for example e.
- k classes.
- Output.
- Function class E \to \{1, ... k\} which classifies examples.
- Function pval, where pval(i, X_j) is the predicted value of feature X_j for every example in class i.
- Use sum-of-squares error for class i and pval.
\sum_{e \in E} \sum_{j=1}^n (pval(class(e), X_j) - val(e, X_j))^2
- Given class, the pval that minimizes the error is the mean value for that class. Prove by taking derivative.
- Given pval, every example assigned to the class that minimizes the error for that example.
- Randomly assign examples to classes.
- Repeat following EM steps until assignment does not change.
- M: For every class i and feature X_j, pval(i, X_j) = \dfrac{\sum_{e: class(e) = i} val(e, X_j)}{|\{e: class(e) = i\}|}
- E: For every example e, assign e to class that minimizes \sum_{j=1}^n (pval(class(e), X_j) - val(e, X_j))^2.
- Will eventually converge to a stable local minimum.
- Increasing k can always decrease error but is not always meaningful.
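A sketch of the k-means loop above with examples as plain feature lists; re-seeding empty classes at random is an added assumption to keep the sketch total.

```python
import random

def k_means(examples, k, iterations=100):
    """Alternate the M and E steps until the class assignment is stable."""
    n = len(examples[0])
    assignment = [random.randrange(k) for _ in examples]     # random initial classes
    for _ in range(iterations):
        # M-step: pval(i, X_j) is the mean of feature j over class i
        means = []
        for i in range(k):
            members = [e for e, c in zip(examples, assignment) if c == i]
            if not members:
                members = [random.choice(examples)]           # re-seed an empty class
            means.append([sum(e[j] for e in members) / len(members) for j in range(n)])
        # E-step: assign each example to the class minimizing its squared error
        new_assignment = [
            min(range(k), key=lambda i: sum((means[i][j] - e[j]) ** 2 for j in range(n)))
            for e in examples
        ]
        if new_assignment == assignment:
            break                                             # stable assignment: converged
        assignment = new_assignment
    return assignment, means
```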
Markov Decision Processes
Markov Chains
- Agent in a state s \in S.
- Initial state s_0 = 0.
- The state the agent transitions to from s_t at time t is determined only by a probability distribution conditioned on the current state.
- Represent transition as probability matrix P.
Discounted Rewards
- Reward in future is not worth as much as reward now.
- Discounted sum of future rewards: Using discount factor \gamma, \sum_{i \ge 0} \gamma^i R_i, where R_i is the reward after i time steps.
Solving a Markov Process
- Input.
- Set of states S = \{s_1, ..., s_n\} with corresponding rewards \{r_1, ..., r_n\}.
- Discount factor \gamma.
- Transition probability matrix P.
- Let U^*(s_i) be the expected discounted sum of future rewards starting at state s_i.
- U^*(s_i) = r_i + \gamma \sum_{j=1}^n P_{ij} U^*(s_j).
- Closed form solution U = (I - \gamma P)^{-1} R.
Value Iteration
Iteratively compute U^k(s_i), the expected discounted sum of rewards over the next k time steps.
\begin{aligned}
U^1(s_i) &= r_i \\
U^k(s_i) &= r_i + \gamma \sum_{j=1}^n P_{ij} U^{k-1}(s_j)
\end{aligned}
- Repeat until convergence to desired \epsilon.
Policies
- Mapping from states to actions.
- For every MDP there exists an optimal policy. For every possible start state, there is no better option than to follow the policy.
- Run through every possible policy and select the best.
Let P^k represent the transition matrix for policy k.
Bellman’s Equation: V^{t+1}(s_i) = \max_k\left[r_i + \gamma \sum_{j=1}^n P^k_{ij}V^t(s_j)\right]
- Find the optimal policy by taking another step and seeing which policy gives the best result (see the sketch below).
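A sketch of value iteration for Bellman's equation above, assuming per-state rewards and a per-action transition matrix P[a][i][j].

```python
def value_iteration(rewards, P, gamma=0.9, eps=1e-6):
    """Iterate V^{t+1}(s_i) = max_a [r_i + gamma * sum_j P[a][i][j] V^t(s_j)]
    until the values change by less than eps; return values and a greedy policy."""
    n = len(rewards)
    V = [0.0] * n
    while True:
        V_new = [max(rewards[i] + gamma * sum(P[a][i][j] * V[j] for j in range(n))
                     for a in range(len(P)))
                 for i in range(n)]
        if max(abs(V_new[i] - V[i]) for i in range(n)) < eps:
            # greedy policy with respect to the converged values
            policy = [max(range(len(P)),
                          key=lambda a: rewards[i] + gamma * sum(P[a][i][j] * V_new[j]
                                                                 for j in range(n)))
                      for i in range(n)]
            return V_new, policy
        V = V_new
```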
Policy Iteration
- Policy evaluation: Given \pi, compute V_i = V^\pi.
- Policy improvement: Calculate new \pi_{i+1} using 1-step lookahead.
Partially Observable MDPs (POMDPs)
- Agent maintains belief state b, probability distribution over all states.
- Optimal action depends only on agent’s current belief state.
- Representable as an MDP by summing over the possible states s^\prime that an agent might reach.
- Given current b, execute action a = \pi^*(b).
- Receive observation o.
P(b^\prime | a, b) = \sum_o P(b^\prime | o, a, b) \sum_{s^\prime} O(o | s^\prime) \sum_s P(s^\prime | a, s) b(s)
Reinforcement Learning
- Learning what to do to maximize a numerical reward signal.
- Not told what actions to take.
- Discover value of actions by trying them out and seeing what reward is.
Characteristics.
- Set of states S.
- Set of actions A.
- Set of reinforcement signals (rewards). Delayed reward possible, credit assignment problem.
- Exploration and exploitation.
- States may be partially observable only.
- Continuous learning.
Multi-Armed Bandit
Simplest reinforcement learning problem.
At every time step t, choose action A_t from k possible actions, receive reward R_t.
- Reward only depends on the action taken.
- True values are unknown, distribution is unknown.
- Goal: Maximize total reward.
Exploration vs. Exploitation
- Assume estimates are Q_t(a) \approx q_*(a).
- Greedy action is A^*_t = arg \max_a Q_t(a).
- When A_t = A_t^*, then exploiting.
- When A_t \neq A_t^*, then exploring.
Action-Value Methods
- Estimate action values as sample averages.
Q_t(a) \approx \frac{\sum_{i=1}^{t-1} R_i \cdot 1_{A_i = a}}{\sum_{i=1}^{t-1} 1_{A_i = a}}
\epsilon-Greedy Action Selection
- Greedy action selection you always exploit.
- \epsilon-greedy, usually greedy but with probability \epsilon pick an action at random.
- Simple way to balance exploration and exploitation.
Averaging and Learning Rules
- We have Q_n as the average \frac{\sum_{i=1}^{n-1}R_i}{n-1}, want to avoid incrementally storing all rewards.
- Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n].
- When the problem is non-stationary, sample average is not appropriate. Use exponential weighted average Q_{n+1} = Q_n + \alpha[R_n - Q_n].
- To ensure convergence, we need step sizes \alpha_n(a) with \sum_{n=1}^\infty \alpha_n(a) = \infty and \sum_{n=1}^\infty \alpha^2_n(a) < \infty.
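A sketch combining \epsilon-greedy selection with the incremental average update; `pull(a)` is an assumed function that samples a reward for action a.

```python
import random

def epsilon_greedy_bandit(pull, k, steps=1000, epsilon=0.1):
    """k-armed bandit agent using Q_{n+1} = Q_n + (1/n)(R_n - Q_n)."""
    Q = [0.0] * k                                   # value estimates
    N = [0] * k                                     # visit counts
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                 # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit the greedy action
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental average, no reward storage
        total += r
    return Q, total
```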
Optimism
- Initialize actions values to some Q_1(a) \neq 0.
Upper Confidence Bounds
- Reduce exploration over time using optimism.
- Estimate an upper bound on the true action values, select action with largest estimated upper bound.
A_t \approx arg \max_a \left[Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\right]
- Want to estimate an upper confidence U_t(a) for every action such that q^*(a) \le Q_t(a) + U_t(a).
- Depends on number of times we have sampled a.
Determining Bound
- Hoeffding’s Inequality: Let X_1, ..., X_t be iid random variables in [0, 1] and \overline{X}_t = \frac{1}{t} \sum_i X_i be the sample mean. Then P[E[X] > \overline{X}_t + u] \le e^{-2tu^2}.
- Applying the bound to bandit rewards, P[q^*(a) > Q_t(a) + U_t(a)] \le e^{-2N_t(a) U_t(a)^2}.
- Assume we have some probability p,
\begin{aligned}
e^{-2N_t(a)U_t(a)^2} &= p \\
U_t(a) &= \sqrt{\frac{-\log p}{2N_t(a)}}
\end{aligned}
- Reduce p as we observe more rewards, for example p = t^{-4}.
MDPs and RL
- Given the MDP, we could compute the optimal policy.
- Prediction problem: Learn V^* or V^\pi.
- Control problem: Learn \pi^*.
All RL Methods Driven By Values
- G_t = \sum_{i \ge 0} \gamma^i R_{t + 1 + i}.
- Value of a state given policy \pi is V^\pi(s) = E\{G_t | S_t = s, A_{t:\infty} \sim \pi\}.
- Value of a state-action pair, given policy \pi is Q^\pi(s, a) = E\{G_t | S_t = s, A_t = a, A_{t+1:\infty} \sim \pi\}.
- Optimal value of state is V^*(s) = \max_\pi V^\pi(s).
- Optimal value of state-action pair is Q^*(s, a) = \max_\pi Q^\pi (s, a).
- Optimal policy \pi^*(s, a) > 0 only where Q^*(s, a) = \max_b Q^*(s, b), for all s \in S.
Bellman Optimality Equations
\begin{aligned}
V^*(s) &= \max_a Q^{\pi^*}(s, a) \\
&= \max_a E\left[R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s, A_t = a\right] \\
&= \max_a \sum_{s^\prime, r} p(s^\prime, r \mid s, a)\left[r + \gamma V^*(s^\prime)\right]
\end{aligned}
\begin{aligned}
Q^*(s, a) &= E[R_{t+1} + \gamma \max_{a^\prime} Q^*(S_{t+1}, a^\prime) \mid S_t = s, A_t = a] \\
&= \sum_{s^\prime, r} p(s^\prime, r | s, a)[r + \gamma \max_{a^\prime} Q^*(s^\prime, a^\prime)]
\end{aligned}
- Model-Based: Learn model of the environment.
- Model-Free: Never explicitly learn the model.
Prediction
- Temporal Difference: Bootstrapping and sampling method.
- Bootstrapping: Update estimation of value function.
- Sampling: Update does not involve an expected value.
- Simplest temporal-difference method TD(0).
V(S_t) \gets V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
R_{t+1} + \gamma V(S_{t+1}) is an estimate of the return.
- Initialize V(s) arbitrarily.
- Repeat for every episode.
- Initialize S.
- Repeat for every step of the episode.
- A is action given by \pi for S.
- Take action A, observe R, S^\prime.
- V(S) \gets V(S) + \alpha[R + \gamma V(S^\prime) - V(S)].
- S \gets S^\prime.
- Until S is terminal.
- TD(\lambda), update whole training sequence not just a single transition.
From Learning Action-Values to Control
- Take action-value prediction problem and change to control problem by updating the policy to be greedy with respect to the current estimates.
- Behavioural Policy: Generate actions and gather data.
- Learning Policy: Target policy to learn.
- On-Policy Learning: Behaviour = Learning.
- Off-Policy Learning: Behaviour \neq Learning.
Sarsa (On-Policy TD Control)
Repeat for every episode.
- Initialize S.
- Choose A from S using policy derived from Q such as \epsilon-greedy.
- Repeat for every step of the episode.
- Take action A, observe R, S^\prime.
- Choose A^\prime from S^\prime using policy derived from Q.
- Q(S, A) \gets Q(S, A) + \alpha[R + \gamma Q(S^\prime, A^\prime) - Q(S, A)].
- S \gets S^\prime, A \gets A^\prime.
- Until S is terminal.
Q-Learning (Off-Policy TD Control)
Repeat for every episode.
- Initialize S.
- Repeat for every step of the episode.
- Choose A from S using policy derived from Q.
- Take action A, observe R, S^\prime.
- Q(S, A) \gets Q(S, A) + \alpha[R + \gamma \max_a Q(S^\prime, a) - Q(S, A)].
- S \gets S^\prime.
- Until S is terminal.
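A tabular Q-learning sketch; the environment interface (`reset`, `actions`, `step`) is an assumed convenience, not something defined in these notes.

```python
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, update toward the max action value.
    env is assumed to expose reset() -> state, actions(state) -> list, and
    step(state, action) -> (reward, next_state, done)."""
    Q = {}                                                    # Q[(s, a)], default 0
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:                     # epsilon-greedy behaviour policy
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: q(s, x))
            r, s2, done = env.step(s, a)
            best_next = 0.0 if done else max(q(s2, x) for x in env.actions(s2))
            # off-policy update toward the greedy (max) successor value
            Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))
            s = s2
    return Q
```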
Game Theory
Multiagent Decision Making.
- Focus on self-interested multi-agent systems.
- Agents have their own description of the states of the world and take actions based on these descriptions.
Game Theory: Formal way to analyze interactions among a group of rational agents that behave strategically.
- Best response: a_i = arg \max_{a_i} \sum_{a_{-i}} u_i(a_i, a_{-i})p_i(a_{-i}).
- a_i^\prime strictly dominates a_i if u_i(a^\prime_i, a_{-i}) > u_i(a_i, a_{-i}) for all a_{-i}.
Nash Equilibrium: a^* is a NE if u_i(a^*) \ge u_i(a_i, a^*_{-i}) for all a_i, for all players i.
Mixed Nash Equilibria
- Mixed Strategy: Probability distribution over A.
- Expected utility.
Repeated Games
- Strategy space is significantly larger.
- S: H \to A where H is the history of play so far.
- Reward and punish past behaviour, worry about reputation, trust.
Example: Prisoner’s Dilemma
- Grim Strategy: Cooperate first, defect forever if the opponent defects.
- Tit-for-Tat: Cooperate first, copy whatever opponent did in previous stage.
Mechanism Design
Implementation
Mechanism M implements social choice function f(\theta) if there is an equilibrium s^* = (s^*_1, ..., s^*_n) such that g(s^*_1(\theta_1), ..., s^*_n(\theta_n)) = f(\theta_1, ..., \theta_n) for all (\theta_1, ..., \theta_n).
- Direct mechanism is a mechanism where S_i = \Theta_i for all i, and g(\theta) = f(\theta) for all \theta \in \Theta_1 \times ... \times \Theta_n.
- A direct mechanism is incentive compatible if it has an equilibrium s^* where s_i^*(\theta_i) = \theta_i for all \theta_i \in \Theta_i, for all i.
- Strategy-proof if the equilibrium above is a dominant strategy equilibrium.
Revelation Principle
- Theorem: Suppose there exists a mechanism M that implements social choice function f in dominant strategies. Then there is a direct strategy-proof mechanism M^\prime which also implements f.
Gibbard-Satterthwaite Theorem
- Assume that O is finite and |O| > 2, every o \in O can be achieved by SCF f for some \theta.
- \Theta includes all possible strict orderings over O.
- Then f is implementable in dominant strategies if and only if f is dictatorial.
So we must either use a weaker equilibrium concept, design mechanisms where computing a beneficial manipulation is computationally hard, or restrict the structure of agents’ preferences.
Single-Peaked Preferences
- Median-Voter rule is strategy proof for single-peaked preferences.
Quasilinear Preferences
- Outcome o = (x, t_1, ..., t_n). Where x is a project choice, t_i are transfers (money).
- Utility functions u_i(o, \theta_i) = v_i(x, \theta_i) - t_i.
- Quasilinear mechanism M = (S_1, ..., S_n, g(\cdot)), where g(\cdot) = (x(\cdot), t_1, ..., t_n).
Groves Mechanisms
- Choice rule x^*(\theta) = arg \max_x \sum_i v_i(x, \theta_i).
- Transfer rules t_i(\theta) = h_i(\theta_{-i}) - \sum_{j \neq i} v_j(x^*(\theta), \theta_j).
- Theorem: Groves mechanisms are strategy-proof and efficient, and they are the unique such mechanisms.
Vickrey-Clarke-Groves Mechanism (VCG)
- Outcome x^* = arg \max_x \sum_i v_i(x, \theta_i).
- Transfers t_i(\theta) = \sum_{j \neq i} v_j(x^{-i}, \theta_j) - \sum_{j \neq i} v_j(x^*(\theta), \theta_j).
- Equilibrium utility is marginal contribution to welfare of the system.
Example: Allocation.
- Social choice function is to maximize social welfare: give the item to the agent who values it the most.
- Utility function u_i = v_i(o) - t_i. Mechanism (Vickrey Auction), outcome function g is second-price auction.
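A sketch of VCG in the single-item allocation case, where the transfer rule reduces to the second-price (Vickrey) payment.

```python
def vcg_single_item(valuations):
    """Allocate one item by VCG: the highest bidder wins and pays the externality
    it imposes on the others, which here is the second-highest reported value."""
    winner = max(range(len(valuations)), key=lambda i: valuations[i])
    # payment = (others' best welfare without i) - (others' welfare with i, which is 0)
    others = [v for j, v in enumerate(valuations) if j != winner]
    payment = max(others) if others else 0.0
    return winner, payment

# e.g. vcg_single_item([3, 7, 5]) -> (1, 5): bidder 1 wins and pays the second-highest bid
```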