Iterative Machine Teaching

This is a random discussion surrounding the ICML 2017 paper Iterative Machine Teaching.

Let us assume we are performing SGD to learn a function $f_w : \mathcal{X} \subset \mathbf{R}^d \to \mathcal{Y}$ by minimizing a convex loss $\ell(f_w(x), y)$ over sample $(x, y)$:

Now comes the weird part: we actually know where we want the algorithm to converge ($w^*$). And we are feeding examples $(x, y)$ to it. After one example:

Active learning

In active learning, we don’t know where we are converging, and we don’t know the $y$. So we want to move the most:

Ask sample $x$ for which $Var_y\left(\grad\right)$ will be biggest.

By the way, this is Fisher information, right? Variance of the score!

MAP inference

If we assume: $w_{t + 1} \sim \mathcal{N}(w_t, (1/\lambda) I_d)$:

A mistake is hidden in the formula above. Can you find it?