# Iterative Machine Teaching

This is a random discussion surrounding the ICML 2017 paper Iterative Machine Teaching.

Let us assume we are performing SGD to learn a function $f_w : \mathcal{X} \subset \mathbf{R}^d \to \mathcal{Y}$ by minimizing a convex loss $\ell(f_w(x), y)$ over sample $(x, y)$:

Now comes the weird part: we actually know where we want the algorithm to converge ($w^*$). And we are feeding examples $(x, y)$ to it. After one example:

- The first term is where we were at the previous step, compared to the optimal.
- The second term is the
**difficulty**of example $(x, y)$: how much do we move in space? - The third term is the
**usefulness**of example $(x, y)$: how much do we move in the direction that we’re interested in? (= going towards $w^*$)

## Active learning

In active learning, we don’t know where we are converging, and we don’t know the $y$. So we want to move the most:

Ask sample $x$ for which $Var_y\left(\grad\right)$ will be biggest.

By the way, this is Fisher information, right? Variance of the score!

## MAP inference

If we assume: $w_{t + 1} \sim \mathcal{N}(w_t, (1/\lambda) I_d)$:

A mistake is hidden in the formula above. Can you find it?