-
In this video, I'll describe the first way
we discovered for getting Sigmoid Belief
-
Nets to learn efficiently.
It's called the wake-sleep algorithm and
-
it should not be confused with Boltzmann
machines.
-
They have two phases, a positive and a
negative phase, that could plausibly be
-
related to wake and sleep.
But the wake-sleep algorithm is a very
-
different kind of learning, mainly because
it's for directed graphical models like
-
Sigmoid Belief Nets, rather than for
undirected graphical models
-
like Boltzmann machines.
-
The ideas behind the wake-sleep algorithm
-
led to a whole new area of machine
learning called variational learning,
-
which didn't take off until the late
1990s,
-
despite early examples like the wake-sleep
algorithm, and is now one of the main ways
-
of learning complicated graphical models
in machine learning.
-
The basic idea behind these variational
methods sounds crazy.
-
The idea is that since it's hard to
compute the correct posterior distribution,
-
we'll compute some cheap
approximation to it.
-
And then, we'll do maximum likelihood
learning anyway.
-
That is, we'll apply the learning rule
that would be correct,
-
if we'd got a sample from the true posterior,
-
and hope that it works, even though we haven't.
-
Now, you could reasonably expect this to
be a disaster,
-
but actually the learning comes to your rescue.
-
So, if you look at what's driving the weights during the learning,
-
when you use an approximate posterior,
-
there are actually two terms driving the
weights.
-
One term is driving them to get a better
model of the data.
-
That is, to make the Sigmoid Belief Net
more likely to generate the observed data
-
in the training center.
But there's another term that's added to that,
-
that's actually driving the weights
-
towards sets of weights for which the
approximate posterior it's using is a good fit
-
to the real posterior.
It does this by manipulating the real
-
posterior to try to make it fit the
approximate posterior.
-
It's because of this effect, the
variational learning of these models works
-
quite nicely.
Back in the mid 90s,' when we first came
-
up with it, we thought this was an
interesting new theory of how the brain
-
might learn.
That idea has been taken up since by Karl
-
Friston, who strongly believes this is
what's going on in real neural learning.
-
So, we're now going to look in more detail
at how we can use an approximation to the
-
posterior distribution for learning.
To summarize, it's hard to learn
-
complicated models like Sigmoid Belief
Nets because it's hard to get samples from
-
the true posterior distribution over
hidden configurations, when given a data vector.
-
And it's hard even to get a sample from that posterior.
-
That is, it's hard to get an unbiased sample.
-
So, the crazy idea is that we're going to
-
use samples from some other distribution
and hope that the learning will still work.
-
And as we'll see, that turns out to be true for Sigmoid Belief Nets.
-
So, the distribution that we're going to
-
use is a distribution that ignores
explaining away.
-
We're going to assume (wrongly) that the
posterior over hidden configurations
-
factorizes into a product of
distributions for each separate hidden unit.
-
In other words, we're going to assume that
-
given the data, the units in each hidden
layer are independent of one another,
-
as they are in a Restricted Boltzmann
machine.
-
But in a Restricted Boltzmann machine,
this is correct.
-
Whereas, in a Sigmoid Belief Net, it's
wrong.
-
So, let's quickly look at what a factorial
distribution is.
-
In a factorial distribution, the
probability of a whole vector is just the
-
product of the probabilities of its
individual terms.
-
So, suppose we have three hidden units in
the layer and they have probabilities of
-
being wrong of 0.3, 0.6, and 0.8. If we
want to compute the probability of the
-
hidden layer having the state (1, 0, 1),
We compute that by multiplying 0.3
-
by (1 - 0.6), by 0.8.
So, the probability of a configuration of
-
the hidden layer is just the product of
the individual probabilities.
-
That's why it's called factorial.
In general, the distribution of binary
-
vectors of length n will have two to the n
degrees of freedom.
-
Actually, it's only two to the n minus one
because the probabilities must add to one.
-
A factorial distribution, by contrast,
only has n degrees of freedom.
-
It's a much simpler beast.
So now, I'm going to describe the
-
wake-sleep algorithm that makes use of the
idea of using the wrong distribution.
-
And in this algorithm, we have a neural
net that has two different sets of weights.
-
It's really a generative model and so, the
-
weights shown in green for generative are
the weights of the model.
-
Those are the weights that define the
probability distribution over data vectors.
-
We've got some extra weights.
The weights shown in red, for recognition weights,
-
and these are weights that are used for
-
approximately getting the posterior
distribution.
-
That is, we're going to use these weights
to get a factorial distribution at each
-
hidden layer that approximates the
posterior, but not very well.
-
So, in this algorithm, there's a wake
phase.
-
In the wake phase, you put data in at the
visible layer, the bottom, and you do a
-
forward pass through the network using the
recognition weights.
-
And in each hidden layer, you make a
stochastic binary decision for each hidden
-
unit independently, about whether it
should be on or off.
-
So, the forward pass gets us stochastic
binary states for all of the hidden units.
-
Then, once we have those stochastic binary
states, we treat that as though it was a
-
sample from the true posterior
distribution given the data and we do
-
maximum likelihood learning.
But what we're doing maximum likelihood
-
learning for is not the recognition
weights that we just used to get the
-
approximate sample.
It's the generative weights that define our models.
-
So, you drive the system in the forward
-
pass with the recognition weights, but you
learn the generative weights.
-
In the sleep phase, you do the opposite.
You drive the system with the generative weights.
-
That is, you start with a random vector of
-
the top hidden layer.
You generate the binary states of those
-
hidden units from their prior in which
they're independent.
-
And then, you go down through the system,
generating state for each layer at a time.
-
And here you're using the generative model
correctly.
-
That's how the generative model says you
want to generate data.
-
And so, you can generate an unbiased
sample from the model.
-
Having used the generative weights to
generate an unbiased sample, you then say,
-
let's see if we can recover the hidden
states from the data.
-
Well, let's see if we can recover the
hidden states that layer h2 from the
-
hidden states at layer h1.
So, you train recognition weights, to try
-
and recover the hidden states that
actually generated the states in the layer below.
-
So, it's just the opposite of the weight phase.
-
We're now using the generative weights to
-
drive the system and we're learning the
recognition weights.
-
It turns out that if you start with random
weights and you alternate between wake
-
phases and sleep phases it learns a pretty
good model.
-
There are flaws in this algorithm.
The first flaw is a rather minor one
-
which is, the recognition weights are
learning to invert the generative model.
-
But at the beginning of learning, they're
learning to invert the generative model in
-
parts of the space where there isn't any
data.
-
Because when you generate from the model,
you're generating stuff that looks very
-
different from the real data,
because the weights aren't any good.
-
That's a waste, but it's not a big
problem.
-
The serious problem with this algorithm is
that the recognition weights not only
-
don't follow the gradient of the log
probability of the data,
-
They don't even follow the gradient of the
variational bound on this probability.
-
And because they're not following the
right gradient, we get incorrect mode averaging,
-
which I'll explain in the next slide.
-
A final problem is that we know that the
-
true posterior over the top hidden layer
is bound to be far from independent
-
because of explaining away effects.
And yet, we're forced to approximate it
-
with a distribution that assumes
independence.
-
This independence approximation might not
be so bad for intermediate hidden layers,
-
because if we're lucky, the explaining
away effects that come from below will be
-
partially canceled out by prior effects
that come from above.
-
You'll see that in much more detail later.
Despite all these problems, Karl Friston
-
thinks this is how the brain works.
When we initially came up with the
-
algorithm, we thought it was an
interesting new theory of the brain.
-
I currently believe that it's got too many
problems to be how the brain works and
-
that we'll find better algorithms.
-
So now let me explain mode averaging, using the little model
-
with the earthquake and the truck that we saw before.
-
Suppose we run the sleep phase, and we generate data from this model.
-
Most of the time, those top two units would be off
-
because they are very unlikely to turn on under their prior,
-
and, because they are off, the visible unit will be firmly off, because its bias is -20.
-
Just occassionally, one time in about e to the -10, one of the two top units will turn on
-
and it will be equally often the left one or the right one.
-
When that unit turns on, there's a probability of a half
-
that the visible unit will turn on.
-
So, if you think about the occassions on which the visible unit turns on,
-
half those occassions have the left-hand hidden unit on,
-
the other half of those occassions have the right-hand hidden unit on
-
and almost none of those occassions have neither or both units on.
-
So now think what the learning would do for the recognition weights.
-
Half the time we'll have a 1 on the visible layer,
-
the leftmost unit will be on at the top,
-
so we'll actually learn to predict that that's on with a probability of 0.5, and the same for the right unit.
-
So the recognition units will learn to produce a factorial distribution
-
over the hidden layer, of (0.5, 0.5)
-
and that factorial distribution puts a quarter of its mass on the configuration (1,1)
-
and another quarter of its mass on the configuration (0,0)
-
and both of those are extremely unlikely configurations given that the visible unit was on.
-
It would have been better just to pick one mode,
-
that is, it would have been better for the visible unit just to go for truck, or just to go for earthquake.
-
That's the best recognition model you can have,
-
that's the best recognition model you can have if you're forced to have a factorial model.
-
So even though the hidden configurations we're dealing with
-
are best represented as the corners of a square
-
actually show it as if it's a one-dimensional continuous value,
-
and the true posterior is bimodal. It's focused on (1,0) or (0,1), that's shown in black.
-
The approximation window, if you use the sleep phase of the wake-sleep algorithm,
-
is the red curve, which gives all four states of the hidden units equal probability
-
and the best solution would be to pick one of these states, and give it all the probability mass.
-
That's the best solution, because in variational learning we're manipulating the true posterior
-
to make it fit the approximation we're using.
-
Normally, in learning we'll manipulate an approximation to fit the true thing, but here it's backwards.