CS 486

Rational Agent Paradigm

An entity that perceives and acts.

Task Environments

To design a rational agent the task environment must be defined.

Properties of Task Environment

Search

A search problem consists of a state space, a successor function, a start state, and a goal test.

Example: Sliding Tiles Problem.

Example: N Queens Problem.

An alternate representation is more complicated but has a smaller search space.

State Space

Representing State

Search Tree

For most problems we do not actually generate the entire tree.

Generic Search Algorithm
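
A minimal sketch of a generic search procedure, assuming an illustrative interface (`start`, `is_goal(state)`, `successors(state)` are not from the course notes): keep a frontier of paths and let the frontier's discipline determine the strategy.

```python
from collections import deque

def generic_search(start, is_goal, successors, use_stack=False):
    """Generic search: the frontier discipline decides the strategy.
    A FIFO frontier gives BFS; a LIFO frontier gives DFS."""
    frontier = deque([[start]])               # the frontier holds paths, not just states
    while frontier:
        path = frontier.pop() if use_stack else frontier.popleft()
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in successors(state):
            frontier.append(path + [nxt])
    return None                               # no solution found
```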

Key Properties

Example: DFS.

Example: BFS.

Can we combine search methods to take advantage of DFS's space complexity and BFS's completeness and shallow-solution advantage?
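
The standard combination is iterative-deepening search: run depth-limited DFS with an increasing depth bound, keeping DFS's linear space while regaining completeness and a preference for shallow solutions. A minimal sketch, assuming the same illustrative `is_goal`/`successors` interface as above:

```python
def depth_limited_dfs(state, is_goal, successors, limit, path=None):
    """DFS that refuses to go deeper than `limit` edges."""
    path = path or [state]
    if is_goal(state):
        return path
    if limit == 0:
        return None
    for nxt in successors(state):
        result = depth_limited_dfs(nxt, is_goal, successors, limit - 1, path + [nxt])
        if result is not None:
            return result
    return None

def iterative_deepening(start, is_goal, successors, max_depth=50):
    """Repeated depth-limited DFS with a growing bound."""
    for limit in range(max_depth + 1):
        result = depth_limited_dfs(start, is_goal, successors, limit)
        if result is not None:
            return result
    return None
```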

Informed Search

Uninformed search expands nodes based on their distance from the start node. Why not expand based on the estimated distance to the goal?

Heuristics

A function that estimates the cost of reaching a goal from a given state.

Example: Best First Search.

Search strategy: expand the most promising node according to the heuristic. Not complete (it can follow an infinite path). Time and space complexity are O(b^m).

A^\star expands nodes in order of f(n) = g(n) + h(n) and is optimal when h is admissible. Let G be the optimal goal and G^\prime a sub-optimal goal, so f(G) < f(G^\prime). Assume, for contradiction, that G^\prime is selected before some frontier node n on an optimal path to G. Then f(G^\prime) = g(G^\prime) > g(G) = cost(s, n) + cost(n, G) \ge g(n) + h(n) = f(n). This is a contradiction, since n would have been selected first.

Let c^\star be the cost of the optimal solution. A^\star expands all nodes with f(n) < c^\star, i.e. with h(n) < c^\star - g(n). Since g(n) does not depend on the heuristic, if h_2(n) \le h_1(n) then any node expanded using h_1 must also be expanded using h_2.
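
A minimal A^\star sketch, assuming an illustrative interface where `successors(state)` yields `(next_state, step_cost)` pairs and `h` is an admissible heuristic:

```python
import heapq
from itertools import count

def a_star(start, is_goal, successors, h):
    """A* search: expand the frontier node with the smallest f = g + h."""
    tie = count()                                   # tiebreaker so heapq never compares states
    frontier = [(h(start), 0.0, next(tie), start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, g, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):  # only keep improved paths
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, next(tie), nxt, path + [nxt]))
    return None, float("inf")
```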

Constraint Satisfaction

Special subset of search problems.

Types of CSPs

Commutativity

The key insight is that CSPs are commutative: the order in which variables are assigned does not matter, so at each node we only need to consider values for a single variable.

Backtracking

Basic search algorithm for CSPs.

Select an unassigned variable X
For each value x_i in the domain {x_1, ..., x_n} of X
    If the assignment X = x_i satisfies the constraints, assign X = x_i and exit the loop
If an assignment was found
    Move on to the next variable
If no assignment was found
    Back up to the preceding variable and try a different value for it
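
A minimal backtracking sketch, assuming variables with finite domains and an illustrative `consistent(assignment)` predicate that checks the constraints touched so far:

```python
def backtracking_search(variables, domains, consistent, assignment=None):
    """Assign variables one at a time; back up when no value works."""
    assignment = assignment or {}
    if len(assignment) == len(variables):
        return assignment                          # every variable assigned
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):                 # constraints still satisfied so far
            result = backtracking_search(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]                        # undo and try the next value
    return None                                    # triggers backtracking in the caller
```

For N-queens, for example, `consistent` would check that no two assigned queens attack each other.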

Backtracking Efficiency

Ordering

Which variables should be tried first? In what order should the variable’s values be tried?

Filtering

How do we detect failure early?

Structure

Is it possible to exploit the problem structure? Idea: Break down the graph into connected components and solve each component separately.

Constraints and Local Search

For many problems, the path is unimportant.

Iterative Improvement Methods

Hill Climbing (Gradient Descent):

Simulated Annealing

Escape local optima by allowing “downhill” movements.

Boltzmann Distribution
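
A minimal simulated-annealing sketch: a worse ("downhill") neighbour is accepted with the Boltzmann probability e^{-\Delta E / T}, and the temperature T is gradually lowered. The neighbour function, energy function, and geometric cooling schedule here are illustrative assumptions, not the course's specific choices.

```python
import math
import random

def simulated_annealing(start, energy, random_neighbour, t0=1.0, cooling=0.995, steps=10_000):
    """Minimize `energy` while occasionally accepting worse moves to escape local optima."""
    current, temperature = start, t0
    for _ in range(steps):
        candidate = random_neighbour(current)
        delta = energy(candidate) - energy(current)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current = candidate
        temperature *= cooling                    # geometric cooling schedule
    return current
```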

Genetic Algorithms

  1. Selection: Fitness-proportional selection can lead to overcrowding. Ranking by fitness percentile gets around this. Softmax (Boltzmann) selection, similar to simulated annealing, also works.
  2. Crossover: Select random crossover point. Implemented with bitmasks.
  3. Mutate: With small probability, modify a feature.
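
A minimal sketch of one generation combining these three steps for bit-string individuals; rank-based selection, single-point crossover, and per-bit mutation are illustrative choices rather than the course's prescribed ones.

```python
import random

def next_generation(population, fitness, mutation_rate=0.01):
    """population: list of equal-length bit lists. Returns a new population."""
    ranked = sorted(population, key=fitness)            # rank-based selection:
    weights = [i + 1 for i in range(len(ranked))]       # higher rank -> higher weight
    new_population = []
    while len(new_population) < len(population):
        p1, p2 = random.choices(ranked, weights=weights, k=2)
        point = random.randrange(1, len(p1))            # single-point crossover
        child = p1[:point] + p2[point:]
        child = [bit ^ 1 if random.random() < mutation_rate else bit   # mutation
                 for bit in child]
        new_population.append(child)
    return new_population
```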

Adversarial Search

Minimax

M(n) = \begin{cases}u(n),\ &\text{$n$ is terminal} \\ \max_{c \in succ(n)} M(c),\ &\text{$n$ is MAX node} \\ \min_{c \in succ(n)} M(c),\ &\text{$n$ is MIN node} \end{cases}.

Alpha-Beta Pruning

Compute minimax without searching the entire tree.
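
A minimal minimax-with-alpha-beta sketch implementing M(n) above, assuming illustrative `successors(state)` and `utility(state)` functions (the latter defined at terminal states):

```python
def alphabeta(state, is_terminal, successors, utility, maximizing,
              alpha=float("-inf"), beta=float("inf")):
    """Minimax value of `state`, pruning branches that cannot affect the result."""
    if is_terminal(state):
        return utility(state)
    if maximizing:
        value = float("-inf")
        for child in successors(state):
            value = max(value, alphabeta(child, is_terminal, successors, utility, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                      # beta cut-off: MIN will never allow this branch
        return value
    else:
        value = float("inf")
        for child in successors(state):
            value = min(value, alphabeta(child, is_terminal, successors, utility, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break                      # alpha cut-off: MAX will never allow this branch
        return value
```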

Optimizations

We might not have enough resources to compute full results.

Monte-Carlo Tree Search (MCTS)

  1. Selection: Traverse the tree following a policy using Upper Confidence Bounds until you run out of information required to progress.
  2. Expansion: Eventually we run out of information to progress, so expand a random child.
  3. Simulate: Quickly simulate a game from the node.
  4. Back-propagation: Based on the simulations, update expected values along the traversed path.
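
A minimal sketch of the selection step, assuming each node keeps a visit count and a total backed-up value (an illustrative node representation): descend to the child maximizing the upper confidence bound until a node with unvisited children is reached.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []       # populated during the expansion step
        self.visits = 0
        self.total_value = 0.0   # sum of simulation results backed up through this node

def ucb_score(child, parent_visits, c=1.4):
    """Upper confidence bound: exploitation term plus exploration bonus."""
    if child.visits == 0:
        return float("inf")                       # always try unvisited children first
    return (child.total_value / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def select(node):
    """Selection: descend by UCB until a node with unexpanded children is found."""
    while node.children and all(c.visits > 0 for c in node.children):
        node = max(node.children, key=lambda c: ucb_score(c, node.visits))
    return node
```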

Stochastic Games

Element of randomness.

Machine Learning

Given representations, data, and bias, we have a search problem. We search through the space of possible representations to best fit the data. The search space is typically too large for systematic search, so we use iterative improvement. So a learning problem is made up of a search space, an evaluation function, and a search method.

Supervised Learning

Given a set of input features X_1, ..., X_n, a set of target features f(x), and a set of training examples, predict the target features for a set of test examples. This is done by returning a function that approximates f.

Decision Trees

Classify instances by sorting them down the tree from root to leaf.

Inducing Decision Tree

Recursively choose the most significant attribute as the root of the subtree.
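
"Most significant" is typically measured by information gain (reduction in entropy). A minimal sketch, assuming an illustrative representation where each example is a dict of attribute values plus a "label" key:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the label distribution, in bits."""
    counts = Counter(e["label"] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Entropy reduction from splitting on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(examples) - remainder

def best_attribute(examples, attributes):
    """The attribute to place at the root of the (sub)tree."""
    return max(attributes, key=lambda a: information_gain(examples, a))
```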

Assessing Performance

Avoiding Overfitting

  1. Regularization: Prefer small decision trees over large ones. Add complexity penalty to stopping criteria.
  2. Pseudocounts: Add data based on previous knowledge.
  3. Cross Validation: Split the training set into training and validation sets. Use the validation set as a pretend test set and optimize the hypothesis to perform well on it.

Linear Classifiers

Data is of the form (x, f(x)), x \in \mathbb{R}^n, f(x) \in \{0, 1\}.

Ensembles

Use many hypotheses and combine their predictions.

Bagging

Majority vote.

Randomly sample subsets of the training data and features, and learn a decision tree from each subset. Classify using the majority vote of the resulting forest.

Boosting

Compute the weighted error p of every hypothesis. Multiply the weights of correctly predicted examples by \frac{p}{1 - p} and renormalize, so misclassified examples gain relative weight. Weight the hypothesis itself by \log(\frac{1-p}{p}) in the final vote.
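
A minimal AdaBoost-style sketch of this update, assuming an illustrative `learner(examples, weights)` that returns a hypothesis (a callable) trained on the weighted data:

```python
import math

def boost(examples, labels, learner, rounds=10):
    """Returns (hypotheses, hypothesis_weights) for a weighted-majority vote."""
    n = len(examples)
    w = [1.0 / n] * n                               # example weights
    hypotheses, h_weights = [], []
    for _ in range(rounds):
        h = learner(examples, w)
        correct = [h(x) == y for x, y in zip(examples, labels)]
        error = sum(wi for wi, ok in zip(w, correct) if not ok)
        error = min(max(error, 1e-9), 1 - 1e-9)     # avoid division by zero / log(0)
        # Down-weight correctly classified examples by error / (1 - error).
        w = [wi * (error / (1 - error)) if ok else wi for wi, ok in zip(w, correct)]
        total = sum(w)
        w = [wi / total for wi in w]                # renormalize
        hypotheses.append(h)
        h_weights.append(math.log((1 - error) / error))
    return hypotheses, h_weights

def boosted_predict(x, hypotheses, h_weights):
    """Weighted vote of the ensemble."""
    votes = {}
    for h, z in zip(hypotheses, h_weights):
        votes[h(x)] = votes.get(h(x), 0.0) + z
    return max(votes, key=votes.get)
```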

Neural Networks

Relationships between input and output may be complicated.

Network Structure

Bayesian Networks

Variable Independence

Two variables X, Y are conditionally independent given variable Z if given that you know the value of Z, nothing you learn about Y will influence your beliefs about X.

Testing Independence (D-Separation)

Let P be an undirected path between X and Y. The evidence set E blocks path P if there exists a node Z on P such that one of the following holds:

  1. One arc on P enters Z and the other leaves Z (a chain), and Z \in E.
  2. Both arcs on P leave Z (a common cause), and Z \in E.
  3. Both arcs on P enter Z (a common effect), and neither Z nor any of its descendants is in E.
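
A minimal sketch of this path-blocking test, assuming an illustrative representation where the DAG is a `children` dict and the path is an explicit list of nodes:

```python
def descendants(node, children):
    """All nodes reachable from `node` by following arcs forward."""
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def path_blocked(path, children, evidence):
    """True if the evidence set blocks the undirected path (rules 1-3 above).
    children[v] is the set of v's children, which fixes every arc direction;
    evidence is the set of observed nodes E."""
    for i in range(1, len(path) - 1):
        prev, z, nxt = path[i - 1], path[i], path[i + 1]
        prev_into_z = z in children.get(prev, set())
        next_into_z = z in children.get(nxt, set())
        if prev_into_z and next_into_z:
            # Rule 3: both arcs enter Z (a collider); blocks unless Z or a
            # descendant of Z is observed.
            if z not in evidence and not (descendants(z, children) & evidence):
                return True
        else:
            # Rules 1-2: a chain or a common cause; blocks if Z is observed.
            if z in evidence:
                return True
    return False
```

X and Y are then d-separated by E exactly when every undirected path between them is blocked.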

Example:

\begin{aligned} P(J) &= \sum_{TS, Fl, Fv, Th, M, ET} P(J, TS, Fl, Fv, Th, M, ET) \\ &= \sum_{M} P(J, M) \\ &= \sum_{M} P(J \mid M) P(M) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M, ET) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M \mid ET) P(ET) \end{aligned}

Example:

\begin{aligned} P(J \mid ET) &= \frac{P(J, ET)}{P(ET)} \\ &= \frac{1}{P(ET)} \sum_{M} P(J, M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J \mid M, ET) P(M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J \mid M, ET) P(M \mid ET) P(ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J \mid M) P(M \mid ET) P(ET) \\ \end{aligned}

Example: Backwards Inference.

\begin{aligned} P(ET \mid J) &= \frac{P(J \mid ET)P(ET)}{P(J)} \\ &= \frac{1}{P(J)} P(J, ET) \\ \end{aligned}

Algorithm

Given query variable Q, evidence set E, remaining variables Z, set of factors F corresponding to \{Q\} \cup Z.

  1. Replace factors f \in F that mention a variable in E with f_{E = e}.
  2. Choose elimination ordering of Z.
  3. For each Z_j \in Z, following the ordering, eliminate Z_j: multiply all factors in F that mention Z_j, sum Z_j out of the product, and add the resulting factor back to F.
  4. Remaining factors refer to Q only. Take their product and normalize for P(Q).
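
A minimal sketch of the factor operations these steps rely on, with a factor represented (illustratively) as a tuple of variable names plus a table from value tuples to numbers:

```python
def restrict(factor, var, value):
    """Step 1: incorporate evidence var = value and drop var from the factor."""
    vars_, table = factor
    i = vars_.index(var)
    return (vars_[:i] + vars_[i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value})

def multiply(f1, f2):
    """Pointwise product over the union of the two factors' variables."""
    v1, t1 = f1
    v2, t2 = f2
    extra = tuple(v for v in v2 if v not in v1)
    new_table = {}
    for k1, p1 in t1.items():
        for k2, p2 in t2.items():
            if all(k1[v1.index(v)] == k2[v2.index(v)] for v in v1 if v in v2):
                new_table[k1 + tuple(k2[v2.index(v)] for v in extra)] = p1 * p2
    return (v1 + extra, new_table)

def sum_out(factor, var):
    """Step 3: eliminate var by summing it out of the factor."""
    vars_, table = factor
    i = vars_.index(var)
    new_table = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return (vars_[:i] + vars_[i + 1:], new_table)

def normalize(factor):
    """Step 4: rescale the final factor so its entries sum to 1."""
    vars_, table = factor
    z = sum(table.values())
    return (vars_, {k: p / z for k, p in table.items()})
```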

Reasoning Under Uncertainty

Dynamic Inference

Stochastic Process

k-Order Markov Process

Hidden Markov Model

First Order Hidden Markov Model

  1. Monitoring: P(s_t | o_t, ..., o_1).
  2. Prediction: P(s_{t+k} | o_t, ..., o_1).
  3. Hindsight: P(s_k | o_t, ..., o_1).
  4. Most Likely Explanation: argmax_{s_t, ..., s_1} P(s_t, ..., s_1 | o_t, ..., o_1).
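
A minimal sketch of monitoring (filtering) for a first-order HMM, assuming illustrative names: transition probabilities T[s][s'], observation probabilities O[s][o], and a prior over the initial state.

```python
def monitor(prior, observations, T, O):
    """Returns P(s_t | o_1, ..., o_t) after processing all observations.
    prior: dict state -> P(s_0); T[s][s2] = P(s2 | s); O[s][o] = P(o | s)."""
    belief = dict(prior)
    for o in observations:
        # Predict: push the belief through the transition model.
        predicted = {s2: sum(T[s][s2] * belief[s] for s in belief) for s2 in belief}
        # Update: weight by the likelihood of the observation, then normalize.
        unnorm = {s2: O[s2][o] * predicted[s2] for s2 in predicted}
        z = sum(unnorm.values())
        belief = {s2: p / z for s2, p in unnorm.items()}
    return belief
```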

Dynamic Bayes Net

What if the number of states or observations is exponential?

HMMs are DBNs with one state variable and one observation variable.

Statistical Learning

Where do the probabilities for Bayes Nets come from?

Example: Candy is sold in two flavours, lime and cherry, with the same wrapper for both flavours. Bags have different ratios (e.g. 100% cherry, 50% cherry 50% lime, 100% lime). After eating k candies, what is the flavour ratio of the bag?

\begin{aligned}P(x | d) &= \sum_i P(x | d, h_i) P(h_i | d) \\ &= \sum_i P(x | h_i) P(h_i | d) \end{aligned}

Predictions are weighted averages of the predictions of the individual hypotheses.

Maximum a Posteriori (MAP)

\begin{aligned}h_{MAP} &= arg max_h P(h | d) \\ &= arg max_h P(h) P(d | h) \\ &= arg max_h P(h) \Pi_i P(d_i | h)\end{aligned}

h_{MAP} = arg max_h \left[\log P(h) + \sum_i \log P(d_i | h)\right]

Maximum Likelihood (ML)

\begin{aligned}h_{MAP} &= arg max_h P(h) P(d | h) \\ h_{ML} &= arg max_h P(d | h) \end{aligned}

Example:

Hypothesis h_\theta: P(cherry) = \theta, P(lime) = 1 - \theta. Data d: N candies with c cherry and N - c lime.

\begin{aligned}P(d | h_\theta) &= \theta^c (1 - \theta)^{N - c} \\ L(d | h_\theta) &= \log P(d | h_\theta) \\ &= c \log \theta + (N - c) \log(1 - \theta) \end{aligned}

Find the \theta that maximizes log likelihood.

\begin{aligned} \frac{\partial L(d | h_\theta)}{\partial \theta} &= \frac{c}{\theta} - \frac{N - c}{1 - \theta} = 0 \\ \theta &= \frac{c}{N} \end{aligned}

Laplace Smoothing

Given observations x = (x_1, ..., x_d) from N trials, we estimate the parameters \theta as

\begin{aligned}\theta &= (\theta_1, ..., \theta_d) \\ \theta_i &= \frac{x_i + \alpha}{N + \alpha d},\ \alpha > 0\end{aligned}

Naive Bayes

With observed attribute values x_1, ..., x_n.

P(C | x_1, ..., x_n) = \alpha P(C) \Pi_i P(x_i | C)

From ML we know that the parameters are just the observed frequencies (with optional Laplace smoothing), so we only need to choose the most likely class C.
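
A minimal Naive Bayes sketch using observed frequencies with Laplace smoothing (\alpha = 1) and log probabilities; the dict-based data layout and function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels, alpha=1.0):
    """examples: list of dicts {attribute: value}; labels: list of classes."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for x, c in zip(examples, labels):
        for attr, val in x.items():
            cond_counts[(c, attr)][val] += 1
    prior = {c: n / len(labels) for c, n in class_counts.items()}
    return prior, cond_counts, class_counts, alpha

def predict(x, model, value_counts):
    """Choose the class maximizing log P(C) + sum_i log P(x_i | C).
    value_counts[attr] is the number of distinct values of that attribute."""
    prior, cond_counts, class_counts, alpha = model
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for attr, val in x.items():
            num = cond_counts[(c, attr)][val] + alpha          # Laplace smoothing
            den = class_counts[c] + alpha * value_counts[attr]
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```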

Text Classification

Example application of Naive Bayes.

Data Representation

Expectation Maximization (EM)

In the previous problems the values of all attributes are known, so learning is relatively easy. Many real-world problems have hidden variables, incomplete data, and missing attribute values.

Maximum Likelihood Learning

Direct ML

Maximize likelihood directly where E are the evidence variables and Z are the hidden variables.

\begin{aligned} h_{ML} &= arg max_h P(E | h) \\ &= arg max_h \sum_Z P(E, Z | h) \\ &= arg max_h \sum_Z \Pi_i CPT(V_i) \\ &= arg max_h \log \sum_Z \Pi_i CPT(V_i) \end{aligned}

Expectation-Maximization (EM)

  1. Guess h_{ML}.
  2. Iterate.

\begin{aligned} h_{i+1} &= arg max_h \sum_Z P(Z | h_i, e) \log P(e, Z | h) \\ &= arg max_h \sum_Z P(Z | h_i, e) \log \Pi_j CPT_j \\ &= arg max_h \sum_Z P(Z | h_i, e) \sum_j \log CPT_j \end{aligned}

Monotonic improvement of likelihood, P(e | h_{i+1}) \ge P(e | h_i).

Example: Two coins A, B. Probability of heads with A is \theta_A, probability of heads with B is \theta_B. Want to find \theta_A, \theta_B by performing a number of trials.

  1. HTTTHHTHTH
  2. HHHHTHHHHH
  3. HTHHHHHTHH
  4. HTHTTTHHTT
  5. THHHTHHHTH

Initialization: Let’s say \theta_A^0 = 0.6, \theta_B^0 = 0.5. Assume that the probability of A, B being used is the same, so P(A) = P(B) = 0.5.

  1. P(T_1 | A) = {10 \choose 5} 0.6^5 0.4^5 \approx 0.2 and P(T_1 | B) = {10 \choose 5} 0.5^{10} \approx 0.246, so P(A | T_1) = \frac{0.2}{0.2 + 0.246} \approx 0.45 and P(B | T_1) = 1 - P(A | T_1) \approx 0.55. Give both coins a proportional share of the trial's heads and tails:

| Trial | Coin A (H, T) | Coin B (H, T) |
| --- | --- | --- |
| 1 | 2.2, 2.2 | 2.8, 2.8 |
  2. P(T_2 | A) = {10 \choose 9} 0.6^9 0.4^1 \approx 0.0403 and P(T_2 | B) = {10 \choose 9} 0.5^{10} \approx 0.01, so P(A | T_2) \approx 0.8 and P(B | T_2) \approx 0.2.

| Trial | Coin A (H, T) | Coin B (H, T) |
| --- | --- | --- |
| 1 | 2.2, 2.2 | 2.8, 2.8 |
| 2 | 7.2, 0.8 | 1.8, 0.2 |

Continuing for all five trials and totalling the expected counts:

| Trial | Coin A (H, T) | Coin B (H, T) |
| --- | --- | --- |
| 1 | 2.2, 2.2 | 2.8, 2.8 |
| 2 | 7.2, 0.8 | 1.8, 0.2 |
| 3 | 5.9, 1.5 | 2.1, 0.5 |
| 4 | 1.4, 2.1 | 2.6, 3.9 |
| 5 | 4.5, 1.9 | 2.5, 1.1 |
| Total | 21.3, 8.6 | 11.7, 8.4 |

Re-estimating \theta_A, \theta_B from these expected counts and repeating, we eventually see probabilities similar to 0.8 and 0.45.
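
A minimal sketch of this two-coin EM using the five trials above; each iteration performs the E step (responsibilities for each trial) and M step (re-estimation from expected counts) described in the example, with an equal prior on which coin was used.

```python
from math import comb

def em_two_coins(trials, theta_a=0.6, theta_b=0.5, iterations=10):
    """trials: list of (heads, tails) counts per set of 10 flips."""
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in trials:
            # E step: posterior responsibility of each coin for this trial.
            like_a = comb(h + t, h) * theta_a**h * (1 - theta_a)**t
            like_b = comb(h + t, h) * theta_b**h * (1 - theta_b)**t
            p_a = like_a / (like_a + like_b)
            p_b = 1.0 - p_a
            # Credit each coin with its expected share of heads and tails.
            heads_a += p_a * h
            tails_a += p_a * t
            heads_b += p_b * h
            tails_b += p_b * t
        # M step: maximum-likelihood re-estimate from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

# The five trials from the example, as (heads, tails) counts.
trials = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
print(em_two_coins(trials))
```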

K-Means Algorithm

\sum_{e \in E} \sum_{j=1}^n (pval(class(e), X_j) - val(e, X_j))^2

  1. Randomly assign the examples to classes.
  2. Repeat the following E and M steps until the assignment does not change: (M) set each class's prediction pval to the mean of the examples assigned to it; (E) reassign each example to the class whose prediction minimizes the sum-of-squares error above.
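
A minimal k-means sketch for numeric feature vectors matching the two steps above; the random initialization and squared Euclidean distance are illustrative choices.

```python
import random

def kmeans(points, k, iterations=100):
    """points: list of equal-length tuples of numbers. Returns (means, assignment)."""
    means = random.sample(points, k)                      # initial class predictions
    assignment = [None] * len(points)
    for _ in range(iterations):
        # E step: assign each example to the class with the nearest prediction.
        new_assignment = [
            min(range(k), key=lambda j: sum((m - x) ** 2 for m, x in zip(means[j], p)))
            for p in points
        ]
        if new_assignment == assignment:                  # assignment did not change
            break
        assignment = new_assignment
        # M step: each class predicts the mean of the examples assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                means[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return means, assignment
```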

Markov Decision Processes

Markov Chains

Discounted Rewards

Solving a Markov Process

Value Iteration

Iteratively compute U^k(s_i), the expected discounted sum of rewards over the next k time steps.

\begin{aligned} U^1(s_i) &= r_i \\ U^k(s_i) &= r_i + \gamma \sum_{j=1}^n P_{ij} U^{k-1}(s_j) \end{aligned}

Policies

Let P^k represent the transition matrix for policy k.

Bellman’s Equation: V^{t+1}(s_i) = \max_k\left[r_i + \gamma \sum_{j=1}^n P^k_{ij}V^t(s_j)\right]
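
A minimal value-iteration sketch of Bellman's equation, assuming illustrative names matching the notation above: rewards r[i], discount gamma, and per-action transition matrices P[k][i][j].

```python
def value_iteration(r, P, gamma=0.9, eps=1e-6):
    """r[i]: reward in state i; P[k][i][j] = P(j | i, action k).
    Returns (values, policy) once values change by less than eps."""
    n = len(r)
    V = [0.0] * n
    while True:
        V_new = [
            max(r[i] + gamma * sum(P[k][i][j] * V[j] for j in range(n))
                for k in range(len(P)))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < eps:
            break
        V = V_new
    policy = [
        max(range(len(P)),
            key=lambda k: r[i] + gamma * sum(P[k][i][j] * V[j] for j in range(n)))
        for i in range(n)
    ]
    return V, policy
```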

Policy Iteration

Partially Observable MDPs (POMDPs)

P(b^\prime | a, b) = \sum_o P(b^\prime | o, a, b) \sum_{s^\prime} O(o | s^\prime) \sum_s P(s^\prime | a, s) b(s)

Reinforcement Learning

Characteristics.

Multi-Armed Bandit

Simplest reinforcement learning problem.

At every time step t, choose action A_t from k possible actions, receive reward R_t.

Exploration vs. Exploitation

Action-Value Methods

Q_t(a) \approx \frac{\sum_{i=1}^{t-1} R_i \cdot 1_{A_i = a}}{\sum_{i=1}^{t-1} 1_{A_i = a}}

\epsilon-Greedy Action Selection
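
A minimal \epsilon-greedy bandit sketch using incremental sample-average estimates of Q_t(a); representing the bandit as a list of reward-sampling callables is an illustrative assumption.

```python
import random

def epsilon_greedy_bandit(arms, steps=1000, epsilon=0.1):
    """arms: list of zero-argument callables returning a sampled reward.
    Returns the action-value estimates Q after `steps` pulls."""
    k = len(arms)
    Q = [0.0] * k          # sample-average estimates of each action's value
    N = [0] * k            # number of times each action has been chosen
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: Q[i])         # exploit (greedy)
        reward = arms[a]()
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]                    # incremental average update
    return Q

# Example: three arms with different mean rewards.
# Q = epsilon_greedy_bandit([lambda m=m: random.gauss(m, 1) for m in (0.0, 0.5, 1.0)])
```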

Averaging and Learning Rules

Optimism

Upper Confidence Bounds

A_t \approx arg \max_a \left[Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\right]

Determining Bound

\begin{aligned} e^{-2N_t(a)U_t(a)^2} &= p \\ U_t(a) &= \sqrt{\frac{-\log p}{2N_t(a)}} \end{aligned}

MDPs and RL

All RL Methods Driven By Values

Bellman Optimality Equations

\begin{aligned} V^*(s) &= \max_a Q^{\pi^*}(s, a) \\ &= \max_a E\left[R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s, A_t = a\right] \\ &= \max_a \sum_{s^\prime, r} p(s^\prime, r | s, a)\left[r + \gamma V^*(s^\prime)\right] \end{aligned}

\begin{aligned} Q^*(s, a) &= E\left[R_{t+1} + \gamma \max_{a^\prime} Q^*(S_{t+1}, a^\prime) \mid S_t = s, A_t = a\right] \\ &= \sum_{s^\prime, r} p(s^\prime, r | s, a)\left[r + \gamma \max_{a^\prime} Q^*(s^\prime, a^\prime)\right] \end{aligned}

Prediction

V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]

R_{t+1} + \gamma V(S_{t+1}) is an estimate of the return.

  1. Initialize V(s) arbitrarily.
  2. Repeat for every episode: at each step, act according to the policy being evaluated, observe R_{t+1} and S_{t+1}, and apply the update above to V(S_t).

From Learning Action-Values to Control

  1. Behavioural Policy: Generate actions and gather data.
  2. Learning Policy: Target policy to learn.

Sarsa (On-Policy TD Control)

Repeat for every episode:

  1. Initialize S.
  2. Choose A from S using a policy derived from Q, such as \epsilon-greedy.
  3. While S is not terminal: take action A and observe R, S^\prime; choose A^\prime from S^\prime using the policy derived from Q; update Q(S, A) \leftarrow Q(S, A) + \alpha\left[R + \gamma Q(S^\prime, A^\prime) - Q(S, A)\right]; then set S \leftarrow S^\prime, A \leftarrow A^\prime.

Q-Learning (Off-Policy TD Control)

Repeat for every episode:

  1. Initialize S.
  2. While S is not terminal: choose A from S using a policy derived from Q, such as \epsilon-greedy; take action A and observe R, S^\prime; update Q(S, A) \leftarrow Q(S, A) + \alpha\left[R + \gamma \max_a Q(S^\prime, a) - Q(S, A)\right]; then set S \leftarrow S^\prime.
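
A minimal Q-learning sketch matching the episode loop above, assuming an illustrative environment interface with `reset()` returning a start state and `step(state, action)` returning `(reward, next_state, done)`:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, learn about the greedy policy."""
    Q = defaultdict(float)                                # Q[(state, action)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                                   # while S is not terminal
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s2, done = env.step(s, a)
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```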

Game Theory

Multiagent Decision Making.

Mixed Nash Equilibria

Repeated Games

Example: Prisoner’s Dilemma

Mechanism Design

Implementation

Mechanism M with outcome function g implements social choice function f(\theta) if there is an equilibrium s^* = (s^*_1(\theta_1), ..., s^*_n(\theta_n)) such that g(s^*_1(\theta_1), ..., s^*_n(\theta_n)) = f(\theta_1, ..., \theta_n) for all (\theta_1, ..., \theta_n).

Revelation Principle

Gibbard-Satterthwaite Theorem

Ways around this: use a weaker equilibrium concept; design mechanisms where computing a beneficial manipulation is computationally hard; or restrict the structure of agents' preferences.

Single-Peaked Preferences

Quasilinear Preferences

Groves Mechanisms

Vickrey-Clarke-Groves Mechanism (VCG)

Example: Allocation.
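
A minimal sketch of VCG for allocating a single item: the item goes to the highest reported value, and the winner pays the externality it imposes on the others, which in this special case reduces to the second-highest bid (the Vickrey auction). Bidder names and values are illustrative.

```python
def vcg_single_item(bids):
    """bids: dict bidder -> reported value for the single item.
    Returns (winner, payment). The winner pays the welfare the others lose
    by its presence: here, the second-highest bid."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    winner = ranked[0]
    # Others' welfare without the winner: second-highest bid.
    # Others' welfare with this allocation: 0 (they receive nothing).
    payment = bids[ranked[1]] if len(ranked) > 1 else 0.0
    return winner, payment

# Example: truthful bidding is a dominant strategy under this payment rule.
print(vcg_single_item({"a": 10, "b": 7, "c": 3}))   # ('a', 7)
```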